The multiple imputation method: a case study involving secondary data analysis.
Walani, Salimah R; Cleland, Charles M
2015-05-01
To illustrate, with the example of a secondary data analysis study, the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which must be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. Data were drawn from the 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostic procedures and conducted regression analysis of the imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiply imputed datasets were created. Imputation diagnostics using time series and density plots showed that the imputation was successful. The authors also present an example of using the multiply imputed datasets in a regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to apply this technique to large datasets. The authors recommend that nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
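A minimal sketch of the chained-equation workflow described above, using scikit-learn's IterativeImputer followed by hand-coded Rubin's-rules pooling; the data frame df and the column names log_wage and intl_educated are hypothetical stand-ins for the survey's actual variables, not the authors' implementation.

```python
# Sketch: chained-equation multiple imputation, regression on each completed
# dataset, then Rubin's-rules pooling. Assumes an all-numeric pandas DataFrame
# `df` with outcome `log_wage` and indicator `intl_educated` (hypothetical).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

M = 5  # number of imputed datasets, as in the study
estimates, variances = [], []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    X = sm.add_constant(completed[["intl_educated"]])
    fit = sm.OLS(completed["log_wage"], X).fit()
    estimates.append(fit.params["intl_educated"])
    variances.append(fit.bse["intl_educated"] ** 2)

# Rubin's rules: combine within- and between-imputation variance.
qbar = np.mean(estimates)              # pooled point estimate
ubar = np.mean(variances)              # within-imputation variance
b = np.var(estimates, ddof=1)          # between-imputation variance
total_var = ubar + (1 + 1 / M) * b
print(qbar, np.sqrt(total_var))
```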
Siddique, Juned; Harel, Ofer; Crespi, Catherine M.; Hedeker, Donald
2014-01-01
The true missing data mechanism is never known in practice. We present a method for generating multiple imputations for binary variables that formally incorporates missing data mechanism uncertainty. Imputations are generated from a distribution of imputation models rather than a single model, with the distribution reflecting subjective notions of missing data mechanism uncertainty. Parameter estimates and standard errors are obtained using rules for nested multiple imputation. Using simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal smoking cessation trial where nonignorably missing data were a concern. Our method provides a simple approach for formalizing subjective notions regarding nonresponse and can be implemented using existing imputation software. PMID:24634315
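A simplified, single-level sketch of the core idea above: each imputation draws its imputation model from a distribution, here by perturbing the logit-scale intercept with a random offset delta (delta = 0 recovers missing at random). The paper's full method also draws the model parameters and pools with nested-MI combining rules, both omitted here; all names are illustrative.

```python
# Sketch: imputing a binary variable under missing-data-mechanism uncertainty.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def impute_binary_mnar(y, X, n_imputations=10, delta_sd=0.5):
    """y: binary array with np.nan for missing; X: complete covariate matrix."""
    obs = ~np.isnan(y)
    fit = sm.Logit(y[obs], sm.add_constant(X[obs])).fit(disp=0)
    imputations = []
    for _ in range(n_imputations):
        delta = rng.normal(0.0, delta_sd)   # subjective mechanism uncertainty
        eta = sm.add_constant(X) @ fit.params + delta
        p = 1.0 / (1.0 + np.exp(-eta))
        y_imp = y.copy()
        y_imp[~obs] = rng.binomial(1, p[~obs])  # stochastic imputation
        imputations.append(y_imp)
    return imputations
```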
A comparison of multiple imputation methods for incomplete longitudinal binary data.
Yamaguchi, Yusuke; Misumi, Toshihiro; Maruo, Kazushi
2018-01-01
Longitudinal binary data are commonly encountered in clinical trials. Multiple imputation is an approach to obtaining valid estimates of treatment effects under an assumption of a missing at random mechanism. Although there is a variety of multiple imputation methods for longitudinal binary data, only a limited number of studies have reported on the relative performance of the methods. Moreover, for the treatment effect throughout a period, an endpoint often used in clinical evaluations of specific disease areas, no definitive comparisons of the methods have been available. We conducted an extensive simulation study to examine the comparative performance of six multiple imputation methods available in the SAS MI procedure for longitudinal binary data, assessing two endpoints: the responder rate at a specified time point and the responder rate throughout a period. The simulation study suggested that results from the naive approaches of single imputation with non-responders and complete case analysis could be very sensitive to missing data. The multiple imputation methods using a monotone method and a fully conditional specification with a logistic regression imputation model were recommended for obtaining unbiased and robust estimates of the treatment effect. The methods are illustrated with data from a mental health study.
Alternative Multiple Imputation Inference for Mean and Covariance Structure Modeling
ERIC Educational Resources Information Center
Lee, Taehun; Cai, Li
2012-01-01
Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure…
Meta‐analysis of test accuracy studies using imputation for partial reporting of multiple thresholds
Deeks, J.J.; Martin, E.C.; Riley, R.D.
2017-01-01
Introduction: For tests reporting continuous results, primary studies usually provide test performance at multiple but often different thresholds. This creates missing data when performing a meta‐analysis at each threshold. A standard meta‐analysis (no imputation [NI]) ignores such missing data. A single imputation (SI) approach was recently proposed to recover missing threshold results. Here, we propose a new method that performs multiple imputation of the missing threshold results using discrete combinations (MIDC). Methods: The new MIDC method imputes missing threshold results by randomly selecting from the set of all possible discrete combinations which lie between the results for 2 known bounding thresholds. Imputed and observed results are then synthesised at each threshold. This is repeated multiple times, and the multiple pooled results at each threshold are combined using Rubin's rules to give final estimates. We compared the NI, SI, and MIDC approaches via simulation. Results: Both imputation methods outperform the NI method in simulations. There was generally little difference between the SI and MIDC methods, but the latter was noticeably better at estimating the between‐study variances and generally gave better coverage, due to slightly larger standard errors of pooled estimates. Given selective reporting of thresholds, the imputation methods also reduced bias in the summary receiver operating characteristic curve. Simulations demonstrate that the imputation methods rely on an equal threshold spacing assumption. A real example is presented. Conclusions: The SI and, in particular, MIDC methods can be used to examine the impact of missing threshold results in meta‐analysis of test accuracy studies. PMID:29052347
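A toy sketch of the MIDC imputation step for a single study, assuming the number of test positives is non-increasing across ordered thresholds; all counts and threshold positions below are invented for illustration.

```python
# Sketch of the MIDC idea for one study: counts missing between two reported
# thresholds are imputed by randomly drawing a monotone (non-increasing)
# sequence of integers lying between the two known bounding counts.
import numpy as np

rng = np.random.default_rng(1)

def impute_between(lower_count, upper_count, n_missing, rng):
    """Draw a non-increasing integer sequence between two bounding counts."""
    draws = rng.integers(lower_count, upper_count + 1, size=n_missing)
    return np.sort(draws)[::-1]  # enforce monotone non-increasing order

# Positives out of n=100 at thresholds t1..t5; t2 and t3 are unreported.
counts = np.array([90.0, np.nan, np.nan, 40.0, 25.0])
imputed = counts.copy()
imputed[1:3] = impute_between(40, 90, 2, rng)
# Repeating this M times and pooling the threshold-specific estimates with
# Rubin's rules gives the MIDC analysis described in the abstract.
```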
Multiple imputation for handling missing outcome data when estimating the relative risk.
Sullivan, Thomas R; Lee, Katherine J; Ryan, Philip; Salter, Amy B
2017-09-06
Multiple imputation is a popular approach to handling missing data in medical research, yet little is known about its applicability for estimating the relative risk. Standard methods for imputing incomplete binary outcomes involve logistic regression or an assumption of multivariate normality, whereas relative risks are typically estimated using log binomial models. It is unclear whether misspecification of the imputation model in this setting could lead to biased parameter estimates. Using simulated data, we evaluated the performance of multiple imputation for handling missing data prior to estimating adjusted relative risks from a correctly specified multivariable log binomial model. We considered an arbitrary pattern of missing data in both outcome and exposure variables, with missing data induced under missing at random mechanisms. Focusing on standard model-based methods of multiple imputation, missing data were imputed using multivariate normal imputation or fully conditional specification with a logistic imputation model for the outcome. Multivariate normal imputation performed poorly in the simulation study, consistently producing estimates of the relative risk that were biased towards the null. Despite outperforming multivariate normal imputation, fully conditional specification also produced somewhat biased estimates, with greater bias observed for higher outcome prevalences and larger relative risks. Deleting imputed outcomes from analysis datasets did not improve the performance of fully conditional specification. Both multivariate normal imputation and fully conditional specification produced biased estimates of the relative risk, presumably because both use a misspecified imputation model. Based on simulation results, we recommend that researchers use fully conditional specification rather than multivariate normal imputation and retain imputed outcomes in the analysis when estimating relative risks. However, fully conditional specification is not without shortcomings, and further research is needed to identify optimal approaches for relative risk estimation within the multiple imputation framework.
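A hedged sketch of the recommended analysis path: fully conditional specification (here via statsmodels' MICEData, which imputes by predictive mean matching unless a different imputer is set with set_imputer), followed by a log binomial analysis model and Rubin's-rules pooling on the log relative risk scale. The data frame df and its columns y, exposure, and confounder are hypothetical.

```python
# Sketch: FCS imputation cycles, a log binomial model per completed dataset,
# and pooling on the log-RR scale. Note: older statsmodels uses links.Log();
# newer releases prefer links.log().
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.imputation.mice import MICEData

M = 20
mice_data = MICEData(df)            # df holds y, exposure, confounder (all numeric)
log_rrs, variances = [], []
for m in range(M):
    mice_data.update_all(10)        # FCS cycles between retained datasets
    completed = mice_data.data
    fit = smf.glm("y ~ exposure + confounder", data=completed,
                  family=sm.families.Binomial(link=sm.families.links.Log())).fit()
    log_rrs.append(fit.params["exposure"])
    variances.append(fit.bse["exposure"] ** 2)

qbar = np.mean(log_rrs)
total_var = np.mean(variances) + (1 + 1 / M) * np.var(log_rrs, ddof=1)
print(np.exp(qbar))                 # pooled relative risk
```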
Bernhardt, Paul W; Wang, Huixia Judy; Zhang, Daowen
2014-01-01
Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.
Peterson, Josh F.; Eden, Svetlana K.; Moons, Karel G.; Ikizler, T. Alp; Matheny, Michael E.
2013-01-01
Background and objectives: Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply an assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m² (eGFR 75). Design, setting, participants, & measurements: From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with the highest likelihood of missing BCr among the 13,003 patients with known BCr to simulate a "missing" data scenario while preserving the actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI was compared with that of eGFR 75. Results: All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; P<0.001). Improvements in misclassification were greater in patients with impaired kidney function (full multiple imputation + serum creatinine) (15.3%) versus eGFR 75 (40.5%; P<0.001). Multiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreased sensitivity relative to eGFR 75. Conclusions: Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods. PMID:23037980
Should multiple imputation be the method of choice for handling missing data in randomized trials?
Sullivan, Thomas R; White, Ian R; Salter, Amy B; Ryan, Philip; Lee, Katherine J
2016-01-01
The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see multiple imputation used to handle missing data. However in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation is carried out separately by randomized group. PMID:28034175
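A minimal sketch of imputation performed separately by randomized group, as the authors recommend; the column name treatment and the use of scikit-learn's IterativeImputer are illustrative assumptions, not the paper's implementation.

```python
# Sketch: run the imputation model within each randomized arm so that any
# treatment-covariate interactions are preserved in the imputed data.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_by_arm(df, arm_col="treatment", m=5, seed=0):
    """Return m completed datasets, imputing each randomized group separately."""
    completed = []
    for i in range(m):
        parts = []
        for _, group in df.groupby(arm_col):
            imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
            cols = group.drop(columns=[arm_col])      # arm is always observed
            filled = pd.DataFrame(imp.fit_transform(cols),
                                  columns=cols.columns, index=group.index)
            filled[arm_col] = group[arm_col]
            parts.append(filled)
        completed.append(pd.concat(parts).sort_index())
    return completed
```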
Multiple Imputation of Multilevel Missing Data-Rigor versus Simplicity
ERIC Educational Resources Information Center
Drechsler, Jörg
2015-01-01
Multiple imputation is widely accepted as the method of choice to address item-nonresponse in surveys. However, research on imputation strategies for the hierarchical structures that are typically found in the data in educational contexts is still limited. While a multilevel imputation model should be preferred from a theoretical point of view if…
ERIC Educational Resources Information Center
van Ginkel, Joost R.; van der Ark, L. Andries; Sijtsma, Klaas
2007-01-01
The performance of five simple multiple imputation methods for dealing with missing data was compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmarks, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at…
Sajobi, Tolulope T; Lix, Lisa M; Singh, Gurbakhshash; Lowerison, Mark; Engbers, Jordan; Mayo, Nancy E
2015-03-01
Response shift (RS) is an important phenomenon that influences the assessment of longitudinal changes in health-related quality of life (HRQOL) studies. Given that RS effects are often small, missing data due to attrition or item non-response can contribute to failure to detect RS effects. Since missing data are often encountered in longitudinal HRQOL data, effective strategies to deal with missing data are important to consider. This study aims to compare the effects of different imputation methods on the detection of reprioritization RS in the HRQOL of caregivers of stroke survivors. Data were from a Canadian multi-center longitudinal study of caregivers of stroke survivors over a one-year period. The Stroke Impact Scale physical function score at baseline, with a cutoff of 75, was used to measure patient stroke severity for the reprioritization RS analysis. Mean imputation, likelihood-based expectation-maximization (EM) imputation, and multiple imputation methods were compared in test procedures based on changes in relative importance weights to detect RS in SF-36 domains over a 6-month period. Monte Carlo simulation methods were used to compare the statistical power of relative importance test procedures for detecting RS in incomplete longitudinal data under different missing data mechanisms and imputation methods. Of the 409 caregivers, 15.9% and 31.3% had missing data at baseline and 6 months, respectively. There were no statistically significant changes in relative importance weights on any of the domains when complete-case analysis was adopted, but statistically significant changes were detected in the physical functioning and/or vitality domains when mean imputation or EM imputation was adopted. There were also statistically significant changes in relative importance weights for the physical functioning, mental health, and vitality domains when the multiple imputation method was adopted. Our simulations revealed that relative importance test procedures were least powerful under the complete-case analysis method and most powerful when a mean imputation or multiple imputation method was adopted for missing data, regardless of the missing data mechanism and the proportion of missing data. Test procedures based on relative importance measures are sensitive to the type and amount of missing data and to the imputation method. Relative importance test procedures based on mean imputation and multiple imputation are recommended for detecting RS in incomplete data.
Howie, Bryan N.; Donnelly, Peter; Marchini, Jonathan
2009-01-01
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions. PMID:19543373
Multiple imputation methods for bivariate outcomes in cluster randomised trials.
DiazOrdaz, K; Kenward, M G; Gomes, M; Grieve, R
2016-09-10
Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single-level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost-effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing-at-random clustered data scenarios were simulated following a full-factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed-effects multiple imputation and too low following single-level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout.
Taylor, Sandra L; Ruhaak, L Renee; Kelly, Karen; Weiss, Robert H; Kim, Kyoungmi
2017-03-01
With expanded access to, and decreased costs of, mass spectrometry, investigators are collecting and analyzing multiple biological matrices from the same subject such as serum, plasma, tissue and urine to enhance biomarker discoveries, understanding of disease processes and identification of therapeutic targets. Commonly, each biological matrix is analyzed separately, but multivariate methods such as MANOVAs that combine information from multiple biological matrices are potentially more powerful. However, mass spectrometric data typically contain large amounts of missing values, and imputation is often used to create complete data sets for analysis. The effects of imputation on multiple biological matrix analyses have not been studied. We investigated the effects of seven imputation methods (half minimum substitution, mean substitution, k-nearest neighbors, local least squares regression, Bayesian principal components analysis, singular value decomposition and random forest) on the within-subject correlation of compounds between biological matrices and its consequences on MANOVA results. Through analysis of three real omics data sets and simulation studies, we found the amount of missing data and imputation method to substantially change the between-matrix correlation structure. The magnitude of the correlations was generally reduced in imputed data sets, and this effect increased with the amount of missing data. Significant results from MANOVA testing also were substantially affected. In particular, the number of false positives increased with the level of missing data for all imputation methods. No one imputation method was universally the best, but the simple substitution methods (half minimum and mean) consistently performed poorly.
A Method for Imputing Response Options for Missing Data on Multiple-Choice Assessments
ERIC Educational Resources Information Center
Wolkowitz, Amanda A.; Skorupski, William P.
2013-01-01
When missing values are present in item response data, there are a number of ways one might impute a correct or incorrect response to a multiple-choice item. There are significantly fewer methods for imputing the actual response option an examinee may have provided if he or she had not omitted the item either purposely or accidentally. This…
A bias-corrected estimator in multiple imputation for missing data.
Tomita, Hiroaki; Fujisawa, Hironori; Henmi, Masayuki
2018-05-29
Multiple imputation (MI) is one of the most popular methods to deal with missing data, and its use has been increasing rapidly in medical studies. Although MI is appealing in practice, because ordinary statistical methods can be applied to a complete data set once the missing values are fully imputed, the method of imputation remains problematic. If the missing values are imputed from some parametric model, the validity of the imputation is not necessarily ensured, and the final estimate for a parameter of interest can be biased unless the parametric model is correctly specified. Nonparametric methods have also been proposed for MI, but it is not straightforward to produce imputation values from nonparametrically estimated distributions. In this paper, we propose a new method for MI that obtains a consistent (or asymptotically unbiased) final estimate even if the imputation model is misspecified. The key idea is to use an imputation model from which the imputation values are easily produced and to make a proper correction in the likelihood function after the imputation, using as a weight the density ratio between the imputation model and the true conditional density function for the missing variable. Although the conditional density must be estimated nonparametrically, it is not used for the imputation. The performance of our method is evaluated by both theory and simulation studies. A real data analysis is also conducted to illustrate our method using the Duke Cardiac Catheterization Coronary Artery Disease Diagnostic Dataset.
Leyrat, Clémence; Seaman, Shaun R; White, Ian R; Douglas, Ian; Smeeth, Liam; Kim, Joseph; Resche-Rigon, Matthieu; Carpenter, James R; Williamson, Elizabeth J
2017-01-01
Inverse probability of treatment weighting is a popular propensity score-based approach to estimate marginal treatment effects in observational studies at risk of confounding bias. A major issue when estimating the propensity score is the presence of partially observed covariates. Multiple imputation is a natural approach to handle missing data on covariates: covariates are imputed and a propensity score analysis is performed in each imputed dataset to estimate the treatment effect. The treatment effect estimates from each imputed dataset are then combined to obtain an overall estimate. We call this method MIte. However, an alternative approach has been proposed, in which the propensity scores are combined across the imputed datasets (MIps). Therefore, there are remaining uncertainties about how to implement multiple imputation for propensity score analysis: (a) should we apply Rubin's rules to the inverse probability of treatment weighting treatment effect estimates or to the propensity score estimates themselves? (b) does the outcome have to be included in the imputation model? (c) how should we estimate the variance of the inverse probability of treatment weighting estimator after multiple imputation? We studied the consistency and balancing properties of the MIte and MIps estimators and performed a simulation study to empirically assess their performance for the analysis of a binary outcome. We also compared the performance of these methods to complete case analysis and the missingness pattern approach, which uses a different propensity score model for each pattern of missingness, and a third multiple imputation approach in which the propensity score parameters are combined rather than the propensity scores themselves (MIpar). Under a missing at random mechanism, complete case and missingness pattern analyses were biased in most cases for estimating the marginal treatment effect, whereas multiple imputation approaches were approximately unbiased as long as the outcome was included in the imputation model. Only MIte was unbiased in all the studied scenarios and Rubin's rules provided good variance estimates for MIte. The propensity score estimated in the MIte approach showed good balancing properties. In conclusion, when using multiple imputation in the inverse probability of treatment weighting context, MIte with the outcome included in the imputation model is the preferred approach.
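A compact sketch of the MIte ordering, with the outcome kept in the imputation model as the authors advise: impute, estimate the propensity score and an IPTW effect within each completed dataset, then combine the effect estimates. Rubin's-rules variance estimation is omitted for brevity, and all column names are hypothetical.

```python
# Sketch: MIte = multiple imputation, then IPTW treatment-effect estimation
# within each completed dataset, then pooling of the effect estimates.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def mite(df, covariates, m=10):
    """df: binary `treated`, outcome `y`, incomplete covariates (all numeric)."""
    effects = []
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=i)
        # `y` is a column of df, so the outcome enters the imputation model
        comp = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
        ps = sm.Logit(comp["treated"],
                      sm.add_constant(comp[covariates])).fit(disp=0).predict()
        treated = comp["treated"].to_numpy() == 1
        y = comp["y"].to_numpy()
        w = np.where(treated, 1 / ps, 1 / (1 - ps))   # IPT weights
        effects.append(np.average(y[treated], weights=w[treated])
                       - np.average(y[~treated], weights=w[~treated]))
    return float(np.mean(effects))   # pooled point estimate (variance omitted)
```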
Statistical Methods for Generalized Linear Models with Covariates Subject to Detection Limits.
Bernhardt, Paul W; Wang, Huixia J; Zhang, Daowen
2015-05-01
Censored observations are a common occurrence in biomedical data sets. Although a large amount of research has been devoted to estimation and inference for data with censored responses, very little research has focused on proper statistical procedures when predictors are censored. In this paper, we consider statistical methods for dealing with multiple predictors subject to detection limits within the context of generalized linear models. We investigate and adapt several conventional methods and develop a new multiple imputation approach for analyzing data sets with predictors censored due to detection limits. We establish the consistency and asymptotic normality of the proposed multiple imputation estimator and suggest a computationally simple and consistent variance estimator. We also demonstrate that the conditional mean imputation method often leads to inconsistent estimates in generalized linear models, while several other methods are either computationally intensive or lead to parameter estimates that are biased or more variable compared to the proposed multiple imputation estimator. In an extensive simulation study, we assess the bias and variability of different approaches within the context of a logistic regression model and compare variance estimation methods for the proposed multiple imputation estimator. Lastly, we apply several methods to analyze the data set from a recently-conducted GenIMS study.
Jiao, S; Tiezzi, F; Huang, Y; Gray, K A; Maltecca, C
2016-02-01
Obtaining accurate individual feed intake records is the key first step in achieving genetic progress toward more efficient nutrient utilization in pigs. Feed intake records collected by electronic feeding systems contain errors (erroneous and abnormal values exceeding certain cutoff criteria), which are due to feeder malfunction or animal-feeder interaction. In this study, we examined the use of a novel data-editing strategy involving multiple imputation to minimize the impact of errors and missing values on the quality of feed intake data collected by an electronic feeding system. Accuracy of feed intake data adjustment obtained from the conventional linear mixed model (LMM) approach was compared with 2 alternative implementations of multiple imputation by chained equation, denoted as MI (multiple imputation) and MICE (multiple imputation by chained equation). The 3 methods were compared under 3 scenarios, where 5, 10, and 20% feed intake error rates were simulated. Each of the scenarios was replicated 5 times. Accuracy of the alternative error adjustment was measured as the correlation between the true daily feed intake (DFI; daily feed intake in the testing period) or true ADFI (the mean DFI across testing period) and the adjusted DFI or adjusted ADFI. In the editing process, error cutoff criteria are used to define if a feed intake visit contains errors. To investigate the possibility that the error cutoff criteria may affect any of the 3 methods, the simulation was repeated with 2 alternative error cutoff values. Multiple imputation methods outperformed the LMM approach in all scenarios with mean accuracies of 96.7, 93.5, and 90.2% obtained with MI and 96.8, 94.4, and 90.1% obtained with MICE compared with 91.0, 82.6, and 68.7% using LMM for DFI. Similar results were obtained for ADFI. Furthermore, multiple imputation methods consistently performed better than LMM regardless of the cutoff criteria applied to define errors. In conclusion, multiple imputation is proposed as a more accurate and flexible method for error adjustments in feed intake data collected by electronic feeders.
ERIC Educational Resources Information Center
Mistler, Stephen A.; Enders, Craig K.
2017-01-01
Multiple imputation methods can generally be divided into two broad frameworks: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution, whereas FCS imputes variables one at a time from a series of univariate conditional…
Ambler, Gareth; Omar, Rumana Z; Royston, Patrick
2007-06-01
Risk models that aim to predict the future course and outcome of disease processes are increasingly used in health research, and it is important that they are accurate and reliable. Most of these risk models are fitted using routinely collected data in hospitals or general practices. Clinical outcomes such as short-term mortality will be near-complete, but many of the predictors may have missing values. A common approach to dealing with this is to perform a complete-case analysis. However, this may lead to overfitted models and biased estimates if entire patient subgroups are excluded. The aim of this paper is to investigate a number of methods for imputing missing data to evaluate their effect on risk model estimation and the reliability of the predictions. Multiple imputation methods, including hotdecking and multiple imputation by chained equations (MICE), were investigated along with several single imputation methods. A large national cardiac surgery database was used to create simulated yet realistic datasets. The results suggest that complete case analysis may produce unreliable risk predictions and should be avoided. Conditional mean imputation performed well in our scenario, but may not be appropriate if using variable selection methods. MICE was amongst the best performing multiple imputation methods with regards to the quality of the predictions. Additionally, it produced the least biased estimates, with good coverage, and hence is recommended for use in practice.
Filling the gap in functional trait databases: use of ecological hypotheses to replace missing data
Taugourdeau, Simon; Villerd, Jean; Plantureux, Sylvain; Huguenin-Elie, Olivier; Amiaud, Bernard
2014-01-01
Functional trait databases are powerful tools in ecology, though most of them contain large amounts of missing values. The goal of this study was to test the effect of imputation methods on the evaluation of trait values at species level and on the subsequent calculation of functional diversity indices at community level using functional trait databases. Two simple imputation methods (average and median), two methods based on ecological hypotheses, and one multiple imputation method were tested using a large plant trait database, together with the influence of the percentage of missing data and differences between functional traits. At community level, the complete-case approach and three functional diversity indices calculated from grassland plant communities were included. At the species level, one of the methods based on ecological hypothesis was for all traits more accurate than imputation with average or median values, but the multiple imputation method was superior for most of the traits. The method based on functional proximity between species was the best method for traits with an unbalanced distribution, while the method based on the existence of relationships between traits was the best for traits with a balanced distribution. The ranking of the grassland communities for their functional diversity indices was not robust with the complete-case approach, even for low percentages of missing data. With the imputation methods based on ecological hypotheses, functional diversity indices could be computed with a maximum of 30% of missing data, without affecting the ranking between grassland communities. The multiple imputation method performed well, but not better than single imputation based on ecological hypothesis and adapted to the distribution of the trait values for the functional identity and range of the communities. Ecological studies using functional trait databases have to deal with missing data using imputation methods corresponding to their specific needs and making the most out of the information available in the databases. Within this framework, this study indicates the possibilities and limits of single imputation methods based on ecological hypothesis and concludes that they could be useful when studying the ranking of communities for their functional diversity indices. PMID:24772273
Lee, Minjung; Dignam, James J.; Han, Junhee
2014-01-01
We propose a nonparametric approach for cumulative incidence estimation when causes of failure are unknown or missing for some subjects. Under the missing at random assumption, we estimate the cumulative incidence function using multiple imputation methods. We develop asymptotic theory for the cumulative incidence estimators obtained from multiple imputation methods. We also discuss how to construct confidence intervals for the cumulative incidence function and perform a test for comparing the cumulative incidence functions in two samples with missing cause of failure. Through simulation studies, we show that the proposed methods perform well. The methods are illustrated with data from a randomized clinical trial in early stage breast cancer. PMID:25043107
ERIC Educational Resources Information Center
Si, Yajuan; Reiter, Jerome P.
2013-01-01
In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian,…
Ipsative imputation for a 15-item Geriatric Depression Scale in community-dwelling elderly people.
Imai, Hissei; Furukawa, Toshiaki A; Kasahara, Yoriko; Ishimoto, Yasuko; Kimura, Yumi; Fukutomi, Eriko; Chen, Wen-Ling; Tanaka, Mire; Sakamoto, Ryota; Wada, Taizo; Fujisawa, Michiko; Okumiya, Kiyohito; Matsubayashi, Kozo
2014-09-01
Missing data are inevitable in almost all medical studies. Imputation methods using probabilistic models are common, but they cannot impute individual data and require special software. In contrast, the ipsative imputation method, which substitutes missing items with the mean of the remaining items within the individual, is easy, does not need any special software, and can provide individual scores. The aim of the present study was to evaluate the validity of the ipsative imputation method using data involving the 15-item Geriatric Depression Scale. Participants were community-dwelling elderly individuals (n = 1178). A structural equation model was constructed. The model fit indexes were calculated to assess the validity of the imputation method when it is used for individuals missing 20% of data or less and 40% of data or less, depending on whether we assumed that their correlation coefficients were the same as in the dataset with no missing items. Finally, we compared path coefficients of the dataset imputed by ipsative imputation with those obtained by multiple imputation. All of the model fit indexes were better under the assumption that the dataset without missing data is the same as the dataset missing 20% of data or less than under the assumption that the datasets differed. However, by the same assumption, the model fit indexes were worse in the dataset missing 40% of data or less. The path coefficients of the dataset imputed by ipsative imputation and by multiple imputation were compatible with each other if the proportion of missing items was 20% or less. Ipsative imputation appears to be a valid imputation method and can be used to impute data in studies using the 15-item Geriatric Depression Scale, if the percentage of missing items is 20% or less.
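A minimal sketch of ipsative (person-mean) imputation under the article's 20% rule, i.e. at most 3 of the 15 GDS items missing per respondent; the data layout below is an assumption.

```python
# Sketch: each missing item is replaced by the mean of the individual's own
# remaining items, applied only to respondents missing 3 items (20%) or fewer.
import pandas as pd

def ipsative_impute(items: pd.DataFrame, max_missing=3):
    """items: one row per respondent, one column per GDS item (NaN = missing)."""
    out = items.copy()
    ok = out.isna().sum(axis=1) <= max_missing      # eligible respondents
    row_means = out[ok].mean(axis=1, skipna=True)   # each person's own mean
    out.loc[ok] = out.loc[ok].apply(
        lambda r: r.fillna(row_means[r.name]), axis=1)
    return out  # rows above the missingness cutoff are left unimputed
```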
Ma, Yan; Zhang, Wei; Lyman, Stephen; Huang, Yihe
2018-06-01
To identify the most appropriate imputation method for missing data in the HCUP State Inpatient Databases (SID) and assess the impact of different missing data methods on racial disparities research. Data were drawn from the HCUP SID. A novel simulation study compared four imputation methods (random draw, hot deck, joint multiple imputation [MI], conditional MI) for missing values for multiple variables, including race, gender, admission source, median household income, and total charges. The simulation was built on real data from the SID to retain their hierarchical data structures and missing data patterns. Additional predictive information from the U.S. Census and American Hospital Association (AHA) database was incorporated into the imputation. Conditional MI prediction was equivalent or superior to the best-performing alternatives for all missing data structures and substantially outperformed each of the alternatives in various scenarios. Conditional MI substantially improved statistical inferences for racial health disparities research with the SID.
Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C
2008-01-10
Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Our findings provide insight into the problem of which imputation method is optimal for a given data set. The three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better on data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
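A short sketch of one standard way to define such an SVD-based entropy measure of matrix complexity (normalized-eigenvalue entropy); the published EBS scheme may differ in detail, so treat this as illustrative.

```python
# Sketch: low entropy = variance concentrated in few singular vectors (an
# "easy", low-complexity matrix); a flat spectrum gives entropy near 1.
import numpy as np

def svd_entropy(expr):
    """expr: genes x samples matrix with no missing values."""
    s = np.linalg.svd(expr - expr.mean(axis=0), compute_uv=False)
    rho = s**2 / np.sum(s**2)             # normalized eigenvalues
    rho = rho[rho > 0]
    return float(-np.sum(rho * np.log(rho)) / np.log(len(rho)))

rng = np.random.default_rng(0)
low = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 20))   # low complexity
high = rng.normal(size=(500, 20))                            # high complexity
print(svd_entropy(low), svd_entropy(high))                   # low < high
```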
ERIC Educational Resources Information Center
Fish, Laurel J.; Halcoussis, Dennis; Phillips, G. Michael
2017-01-01
The Monte Carlo method and related multiple imputation methods are traditionally used in math, physics and science to estimate and analyze data and are now becoming standard tools in analyzing business and financial problems. However, few sources explain the application of the Monte Carlo method for individuals and business professionals who are…
Reuse of imputed data in microarray analysis increases imputation efficiency
Kim, Ki-Yeol; Kim, Byoung-Jin; Yi, Gwan-Su
2004-01-01
Background: The imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analyses require a complete data set. A few imputation methods for DNA microarray data have been introduced, but their efficiency was low and the validity of the imputed values had not been fully checked. Results: We developed a new cluster-based imputation method called the sequential K-nearest neighbor (SKNN) method. This imputes the missing values sequentially, starting from the gene having the fewest missing values, and uses the imputed values for the later imputation. Although it reuses imputed values, the new method is greatly improved in accuracy and computational complexity over the conventional KNN-based method and other methods based on maximum likelihood estimation. The performance of SKNN was particularly strong relative to other imputation methods for data with high missing rates and a large number of experiments. Application of Expectation Maximization (EM) to the SKNN method improved the accuracy, but increased computational time proportionally to the number of iterations. The multiple imputation (MI) method, which is well known but had not previously been applied to microarray data, showed similarly high accuracy to the SKNN method, with a slightly higher dependency on the type of data set. Conclusions: Sequential reuse of imputed data in KNN-based imputation greatly increases the efficiency of imputation. The SKNN method should be practically useful for saving the data of microarray experiments that have high amounts of missing entries. The SKNN method generates reliable imputed values which can be used for further cluster-based analysis of microarray data. PMID:15504240
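A sketch of the SKNN idea under simplifying assumptions: genes are processed in order of increasing missingness, each missing entry is filled with the unweighted mean of its k nearest complete or already-imputed genes (the published method weights neighbors by distance), and at least k fully observed genes are assumed to exist.

```python
# Sketch: sequential KNN imputation in which genes completed earlier join the
# reference pool, so their imputed values are reused for later genes.
import numpy as np

def sknn_impute(X, k=10):
    """X: genes x samples array with np.nan for missing entries."""
    X = X.copy()
    order = np.argsort(np.isnan(X).sum(axis=1))      # fewest missing first
    complete = [g for g in order if not np.isnan(X[g]).any()]
    for g in order:
        miss = np.isnan(X[g])
        if not miss.any():
            continue
        ref = X[complete]                             # no NaNs by construction
        # distance computed on the columns where gene g is observed
        d = np.sqrt(np.mean((ref[:, ~miss] - X[g, ~miss]) ** 2, axis=1))
        nearest = np.argsort(d)[:k]
        X[g, miss] = ref[nearest][:, miss].mean(axis=0)
        complete.append(g)                            # reuse the imputed gene
    return X
```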
Cox regression analysis with missing covariates via nonparametric multiple imputation.
Hsu, Chiu-Hsieh; Yu, Mandi
2018-01-01
We consider the situation of estimating Cox regression in which some covariates are subject to missingness, and there exists additional information (including observed event time, censoring indicator and fully observed covariates) which may be predictive of the missing covariates. We propose to use two working regression models: one for predicting the missing covariates and the other for predicting the missing probabilities. For each missing covariate observation, these two working models are used to define a nearest neighbor imputing set. This set is then used to non-parametrically impute covariate values for the missing observation. Upon the completion of imputation, Cox regression is performed on the multiply imputed datasets to estimate the regression coefficients. In a simulation study, we compare the nonparametric multiple imputation approach with the augmented inverse probability weighted (AIPW) method, which directly incorporates the two working models into the estimation of Cox regression, and with the predictive mean matching (PMM) imputation method. We show that all approaches can reduce bias due to a non-ignorable missing mechanism. The proposed nonparametric imputation method is robust to mis-specification of either one of the two working models and to mis-specification of the link function of the two working models. In contrast, the PMM method is sensitive to misspecification of the covariates included in imputation, and the AIPW method is sensitive to the specification of the selection probability. We apply the approaches to a breast cancer dataset from the Surveillance, Epidemiology and End Results (SEER) Program.
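A hedged sketch of the two-working-model construction for a single incomplete covariate: one working score predicts the covariate, the other predicts the missingness probability, and donors near a missing case on both scores form its imputing set; repeated calls give multiple imputations. The names and the Euclidean distance on standardized scores are illustrative choices, not the paper's exact recipe.

```python
# Sketch: nearest-neighbor imputing sets defined by two working models.
import numpy as np
import statsmodels.api as sm

def nn_impute(x, Z, size=10, seed=0):
    """x: covariate with NaNs; Z: fully observed predictors (incl. event time)."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(x)
    Zc = sm.add_constant(Z)
    score1 = Zc @ sm.OLS(x[obs], Zc[obs]).fit().params             # predict x
    score2 = sm.Logit((~obs).astype(int), Zc).fit(disp=0).predict(Zc)  # P(miss)
    s1 = (score1 - score1.mean()) / score1.std()
    s2 = (score2 - score2.mean()) / score2.std()
    x_imp = x.copy()
    for i in np.where(~obs)[0]:
        d = np.sqrt((s1[obs] - s1[i]) ** 2 + (s2[obs] - s2[i]) ** 2)
        donors = np.argsort(d)[:size]          # nearest-neighbor imputing set
        x_imp[i] = rng.choice(x[obs][donors])  # draw one donor value
    return x_imp   # call repeatedly to obtain multiple imputations
```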
Pappas, D J; Lizee, A; Paunic, V; Beutner, K R; Motyer, A; Vukcevic, D; Leslie, S; Biesiada, J; Meller, J; Taylor, K D; Zheng, X; Zhao, L P; Gourraud, P-A; Hollenbach, J A; Mack, S J; Maiers, M
2018-05-22
Four single nucleotide polymorphism (SNP)-based human leukocyte antigen (HLA) imputation methods (e-HLA, HIBAG, HLA*IMP:02 and MAGPrediction) were trained using 1000 Genomes SNP and HLA genotypes and assessed for their ability to accurately impute molecular HLA-A, -B, -C and -DRB1 genotypes in the Human Genome Diversity Project cell panel. Imputation concordance was high (>89%) across all methods for both HLA-A and HLA-C, but HLA-B and HLA-DRB1 proved generally difficult to impute. Overall, <27.8% of subjects were correctly imputed for all HLA loci by any method. Concordance across all loci was not enhanced via the application of confidence thresholds; reliance on confidence scores across methods only led to noticeable improvement (+3.2%) for HLA-DRB1. As the HLA complex is highly relevant to the study of human health and disease, a standardized assessment of SNP-based HLA imputation methods is crucial for advancing genomic research. Considerable room remains for the improvement of HLA-B and especially HLA-DRB1 imputation methods, and no imputation method is as accurate as molecular genotyping. The application of large, ancestrally diverse HLA and SNP reference data sets and multiple imputation methods has the potential to make SNP-based HLA imputation methods a tractable option for determining HLA genotypes.
Multiple imputation of missing fMRI data in whole brain analysis
Vaden, Kenneth I.; Gebregziabher, Mulugeta; Kuchinsky, Stefanie E.; Eckert, Mark A.
2012-01-01
Whole brain fMRI analyses rarely include the entire brain because of missing data that result from data acquisition limits and susceptibility artifact, in particular. This missing data problem is typically addressed by omitting voxels from analysis, which may exclude brain regions that are of theoretical interest and increase the potential for Type II error at cortical boundaries or Type I error when spatial thresholds are used to establish significance. Imputation could significantly expand statistical map coverage, increase power, and enhance interpretations of fMRI results. We examined multiple imputation for group level analyses of missing fMRI data using methods that leverage the spatial information in fMRI datasets, for both real and simulated data. Available case analysis, neighbor replacement, and regression-based imputation approaches were compared in a general linear model framework to determine the extent to which these methods quantitatively (effect size) and qualitatively (spatial coverage) increased the sensitivity of group analyses. In both real and simulated data analysis, multiple imputation provided 1) variance that was most similar to estimates for voxels with no missing data, 2) fewer false positive errors in comparison to mean replacement, and 3) fewer false negative errors in comparison to available case analysis. Compared to the standard analysis approach of omitting voxels with missing data, imputation methods increased brain coverage in this study by 35% (from 33,323 to 45,071 voxels). In addition, multiple imputation increased the size of significant clusters by 58% and the number of significant clusters across statistical thresholds, compared to the standard voxel omission approach. While neighbor replacement produced similar results, we recommend multiple imputation because it uses an informed sampling distribution to deal with missing data across subjects that can include neighbor values and other predictors. Multiple imputation is anticipated to be particularly useful for 1) large fMRI data sets with inconsistent missing voxels across subjects and 2) addressing the problem of increased artifact at ultra-high field, which significantly limits the extent of whole brain coverage and the interpretation of results. PMID:22500925
Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M
2018-06-01
A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.
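A simplified sketch of the 3D-MICE ingredients for one patient and one analyte: a Gaussian process over time supplies the longitudinal prediction, and a precision-weighted average stands in, as an assumption, for the published algorithm's rule for integrating it with the cross-sectional MICE prediction.

```python
# Sketch: longitudinal GP prediction plus an assumed precision-weighted
# combination with a cross-sectional (MICE-style) prediction.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gp_longitudinal(times, values, t_missing):
    """Fit a GP to one patient's time series of a single analyte."""
    obs = ~np.isnan(values)
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
    gp.fit(times[obs].reshape(-1, 1), values[obs])
    mean, sd = gp.predict(np.array([[t_missing]]), return_std=True)
    return mean[0], sd[0]

def combine(mice_mean, mice_sd, gp_mean, gp_sd):
    """Precision-weighted average of the two imputation sources (assumed rule)."""
    w1, w2 = 1 / mice_sd**2, 1 / gp_sd**2
    return (w1 * mice_mean + w2 * gp_mean) / (w1 + w2)
```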
Combining multiple imputation and meta-analysis with individual participant data
Burgess, Stephen; White, Ian R; Resche-Rigon, Matthieu; Wood, Angela M
2013-01-01
Multiple imputation is a strategy for the analysis of incomplete data such that the impact of the missingness on the power and bias of estimates is mitigated. When data from multiple studies are collated, we can propose both within-study and multilevel imputation models to impute missing data on covariates. It is not clear how to choose between imputation models or how to combine imputation and inverse-variance weighted meta-analysis methods. This is especially important as often different studies measure data on different variables, meaning that we may need to impute data on a variable which is systematically missing in a particular study. In this paper, we consider a simulation analysis of sporadically missing data in a single covariate with a linear analysis model and discuss how the results would be applicable to the case of systematically missing data. We find in this context that ensuring the congeniality of the imputation and analysis models is important to give correct standard errors and confidence intervals. For example, if the analysis model allows between-study heterogeneity of a parameter, then we should incorporate this heterogeneity into the imputation model to maintain the congeniality of the two models. In an inverse-variance weighted meta-analysis, we should impute missing data and apply Rubin's rules at the study level prior to meta-analysis, rather than meta-analyzing each of the multiple imputations and then combining the meta-analysis estimates using Rubin's rules. We illustrate the results using data from the Emerging Risk Factors Collaboration. PMID:23703895
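The ordering recommended above, pooling the imputations within each study by Rubin's rules and then inverse-variance meta-analyzing the pooled study estimates, can be sketched as follows. This is a minimal fixed-effect illustration, not the paper's implementation; the study labels and numbers are invented.

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool M imputation-specific estimates and variances (Rubin's rules)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()                 # pooled point estimate
    ubar = variances.mean()                 # within-imputation variance
    b = estimates.var(ddof=1)               # between-imputation variance
    return qbar, ubar + (1 + 1 / m) * b     # estimate and total variance

def inverse_variance_meta(study_estimates, study_variances):
    """Fixed-effect inverse-variance meta-analysis."""
    w = 1.0 / np.asarray(study_variances, dtype=float)
    est = np.sum(w * np.asarray(study_estimates)) / np.sum(w)
    return est, 1.0 / np.sum(w)

# Imputation-specific results (M = 3) for two hypothetical studies.
per_study = {
    "study_A": ([0.42, 0.45, 0.40], [0.040, 0.050, 0.040]),
    "study_B": ([0.30, 0.28, 0.33], [0.060, 0.060, 0.070]),
}
# Pool within each study first, then meta-analyze the pooled results.
pooled = [rubins_rules(est, var) for est, var in per_study.values()]
theta, v = inverse_variance_meta([p[0] for p in pooled], [p[1] for p in pooled])
print(f"meta-analytic estimate {theta:.3f} (SE {np.sqrt(v):.3f})")
```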
Generating Multiple Imputations for Matrix Sampling Data Analyzed with Item Response Models.
ERIC Educational Resources Information Center
Thomas, Neal; Gan, Nianci
1997-01-01
Describes and assesses missing data methods currently used to analyze data from matrix sampling designs implemented by the National Assessment of Educational Progress. Several improved methods are developed, and these methods are evaluated using an EM algorithm to obtain maximum likelihood estimates followed by multiple imputation of complete data…
SPSS Syntax for Missing Value Imputation in Test and Questionnaire Data
ERIC Educational Resources Information Center
van Ginkel, Joost R.; van der Ark, L. Andries
2005-01-01
A well-known problem in the analysis of test and questionnaire data is that some item scores may be missing. Advanced methods for the imputation of missing data are available, such as multiple imputation under the multivariate normal model and imputation under the saturated logistic model (Schafer, 1997). Accompanying software was made available…
Resche-Rigon, Matthieu; White, Ian R
2018-06-01
In multilevel settings such as individual participant data meta-analysis, a variable is 'systematically missing' if it is wholly missing in some clusters and 'sporadically missing' if it is partly missing in some clusters. Previously proposed methods to impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and show that even the simplest joint multilevel model implies conditional models which involve cluster means and heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in a multilevel MICE procedure, even when cluster means are not included in the imputation models.
Janet L. Ohmann; Matthew J. Gregory; Emilie B. Henderson; Heather M. Roberts
2011-01-01
Question: How can nearest-neighbour (NN) imputation be used to develop maps of multiple species and plant communities? Location: Western and central Oregon, USA, but methods are applicable anywhere. Methods: We demonstrate NN imputation by mapping woody plant communities for >100 000 km2 of diverse forests and woodlands. Species abundances on...
Multiple Imputation of a Randomly Censored Covariate Improves Logistic Regression Analysis.
Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A
2016-01-01
Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete-case analysis and single imputation or substitution, suffer from inefficiency and bias; they make strong parametric assumptions or consider only limit-of-detection censoring. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.
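One way to picture the nonparametric imputation step: draw a replacement for a censored covariate from a Kaplan-Meier estimate of its distribution, restricted to values beyond the censoring point. The sketch below omits the paper's semiparametric Cox adjustment, linearly interpolates the survival step function, and uses invented toy values.

```python
import numpy as np

def km_curve(times, events):
    """Kaplan-Meier estimate: distinct event times and the survival curve S(t)."""
    uniq = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in uniq:
        at_risk = np.sum(times >= t)
        d = np.sum((times == t) & (events == 1))
        s *= 1.0 - d / at_risk
        surv.append(s)
    return uniq, np.array(surv)

def impute_censored(c, times, events, rng):
    """Draw an imputed covariate value > c from the KM estimate of its distribution."""
    t, s = km_curve(times, events)
    candidates = t[t > c]
    if candidates.size == 0:
        return c                                   # no KM mass beyond c; fall back
    s_c = np.interp(c, t, s, left=1.0)             # S(c), step curve interpolated
    pmf = -np.diff(np.concatenate(([s_c], s[t > c])))   # mass at each event time > c
    return rng.choice(candidates, p=pmf / pmf.sum())

rng = np.random.default_rng(0)
cov_times = np.array([2.0, 3.5, 4.0, 5.5, 7.0, 8.0])      # covariate values
cov_events = np.array([1, 0, 1, 1, 0, 1])                 # 1 = observed, 0 = censored
x_imp = impute_censored(3.0, cov_times, cov_events, rng)  # one of M repeated draws
```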
Coquet, Julia Becaria; Tumas, Natalia; Osella, Alberto Ruben; Tanzi, Matteo; Franco, Isabella; Diaz, Maria Del Pilar
2016-01-01
A number of studies have evidenced the effect of modifiable lifestyle factors such as diet, breastfeeding and nutritional status on breast cancer risk. However, none have addressed the missing data problem in nutritional epidemiologic research in South America. Missing data are a frequent problem in breast cancer studies and epidemiological settings in general. Estimates of effect obtained from these studies may be biased if no appropriate method for handling missing data is applied. We performed multiple imputation for missing values on covariates in a breast cancer case-control study of Córdoba (Argentina) to optimize risk estimates. Data were obtained from a breast cancer case-control study conducted from 2008 to 2015 (318 cases, 526 controls). Complete case analysis and multiple imputation using chained equations were the methods applied to estimate the effects of a Traditional dietary pattern and other recognized factors associated with breast cancer. Physical activity and socioeconomic status were imputed. Logistic regression models were performed. When complete case analysis was performed, only 31% of the women were included. Although a positive association of the Traditional dietary pattern and breast cancer was observed with both approaches (complete case analysis OR=1.3, 95%CI=1.0-1.7; multiple imputation OR=1.4, 95%CI=1.2-1.7), effects of other covariates, like BMI and breastfeeding, were only identified when multiple imputation was considered. A Traditional dietary pattern, BMI and breastfeeding are associated with the occurrence of breast cancer in this Argentinean population when multiple imputation is appropriately performed. Multiple imputation is suggested in Latin America’s epidemiologic studies to optimize effect estimates in the future. PMID:27892664
Palmer, Cameron; Pe’er, Itsik
2016-01-01
Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in the place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data. PMID:27310603
Covariate Selection for Multilevel Models with Missing Data
Marino, Miguel; Buxton, Orfeu M.; Li, Yi
2017-01-01
Missing covariate data hamper variable selection in multilevel regression settings. Current variable selection techniques for multiply-imputed data commonly address missingness in the predictors through list-wise deletion and stepwise-selection methods, which are problematic. Moreover, most variable selection methods are developed for independent linear regression models and do not accommodate multilevel mixed effects regression models with incomplete covariate data. We develop a novel methodology that is able to perform covariate selection across multiply-imputed data for multilevel random effects models when missing data are present. Specifically, we propose to stack the multiply-imputed data sets from a multiple imputation procedure and to apply a group variable selection procedure through group lasso regularization to assess the overall impact of each predictor on the outcome across the imputed data sets. Simulations confirm the advantageous performance of the proposed method compared with the competing methods. We applied the method to reanalyze the Healthy Directions-Small Business cancer prevention study, which evaluated a behavioral intervention program targeting multiple risk-related behaviors in a working-class, multi-ethnic population. PMID:28239457
Seaman, Shaun R; Hughes, Rachael A
2018-06-01
Estimating the parameters of a regression model of interest is complicated by missing data on the variables in that model. Multiple imputation is commonly used to handle these missing data. Joint model multiple imputation and full-conditional specification multiple imputation are known to yield imputed data with the same asymptotic distribution when the conditional models of full-conditional specification are compatible with that joint model. We show that this asymptotic equivalence of imputation distributions does not imply that joint model multiple imputation and full-conditional specification multiple imputation will also yield asymptotically equally efficient inference about the parameters of the model of interest, nor that they will be equally robust to misspecification of the joint model. When the conditional models used by full-conditional specification multiple imputation are linear, logistic and multinomial regressions, these are compatible with a restricted general location joint model. We show that multiple imputation using the restricted general location joint model can be substantially more asymptotically efficient than full-conditional specification multiple imputation, but this typically requires very strong associations between variables. When associations are weaker, the efficiency gain is small. Moreover, full-conditional specification multiple imputation is shown to be potentially much more robust than joint model multiple imputation using the restricted general location model to misspecification of that model when there is substantial missingness in the outcome variable.
Multiple imputation for cure rate quantile regression with censored data.
Wu, Yuanshan; Yin, Guosheng
2017-03-01
The main challenge in the context of cure rate analysis is that one never knows whether censored subjects are cured or uncured, that is, whether they are susceptible or insusceptible to the event of interest. Considering the susceptible indicator as missing data, we propose a multiple imputation approach to cure rate quantile regression for censored data with a survival fraction. We develop an iterative algorithm to estimate the conditionally uncured probability for each subject. By utilizing this estimated probability and Bernoulli sample imputation, we can classify each subject as cured or uncured, and then employ the locally weighted method to estimate the quantile regression coefficients with only the uncured subjects. Repeating the imputation procedure multiple times and taking an average over the resultant estimators, we obtain consistent estimators for the quantile regression coefficients. Our approach relaxes the usual global linearity assumption, so that we can apply quantile regression to any particular quantile of interest. We establish asymptotic properties for the proposed estimators, including both consistency and asymptotic normality. We conduct simulation studies to assess the finite-sample performance of the proposed multiple imputation method and apply it to a lung cancer study as an illustration. © 2016, The International Biometric Society.
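The Bernoulli sample imputation step can be made concrete with a toy sketch: given estimated uncured probabilities, repeatedly classify each censored subject as cured or uncured, analyze only the imputed uncured subjects, and average the resulting estimates. The probabilities and the plain sample quantile below are placeholders for the paper's iterative algorithm and locally weighted quantile regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 300, 20
# p_uncured[i]: estimated probability that censored subject i is uncured, as would
# come from the paper's iterative algorithm (here: invented values for illustration).
p_uncured = rng.uniform(0.2, 0.9, size=n)
survival_time = rng.exponential(2.0, size=n)

estimates = []
for _ in range(M):
    uncured = rng.random(n) < p_uncured     # Bernoulli sample imputation of cure status
    # Placeholder analysis on the imputed uncured subjects; the paper instead fits a
    # locally weighted quantile regression here.
    estimates.append(np.quantile(survival_time[uncured], 0.5))

theta_hat = np.mean(estimates)              # average over the M imputations
```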
Multiple Imputation of Cognitive Performance as a Repeatedly Measured Outcome
Rawlings, Andreea M.; Sang, Yingying; Sharrett, A. Richey; Coresh, Josef; Griswold, Michael; Kucharska-Newton, Anna M.; Palta, Priya; Wruck, Lisa M.; Gross, Alden L.; Deal, Jennifer A.; Power, Melinda C.; Bandeen-Roche, Karen
2016-01-01
Longitudinal studies of cognitive performance are sensitive to dropout, as participants experiencing cognitive deficits are less likely to attend study visits, which may bias estimated associations between exposures of interest and cognitive decline. Multiple imputation is a powerful tool for handling missing data; however, its use for missing cognitive outcome measures in longitudinal analyses remains limited. We use multiple imputation by chained equations (MICE) to impute cognitive performance scores of participants who did not attend the 2011-2013 exam of the Atherosclerosis Risk in Communities Study. We examined the validity of imputed scores using observed and simulated data under varying assumptions. We examined differences in the estimated association between diabetes at baseline and 20-year cognitive decline with and without imputed values. Lastly, we discuss how different analytic methods (mixed models and models fit using generalized estimating equations) and the choice of whom to impute for result in different estimands. Validation using observed data showed MICE produced unbiased imputations. Simulations showed a substantial reduction in the bias of the 20-year association between diabetes and cognitive decline comparing MICE (3-4% bias) to analyses of available data only (16-23% bias) in a construct where missingness was strongly informative but realistic. Associations between diabetes and 20-year cognitive decline were substantially stronger with MICE than in available-case analyses. Our study suggests that when informative data are available for non-examined participants, MICE can be an effective tool for imputing cognitive performance and improving assessment of cognitive decline, though careful thought should be given to the target imputation population and the analytic model chosen, as they may yield different estimands. PMID:27619926
Missing Data and Multiple Imputation in the Context of Multivariate Analysis of Variance
ERIC Educational Resources Information Center
Finch, W. Holmes
2016-01-01
Multivariate analysis of variance (MANOVA) is widely used in educational research to compare means on multiple dependent variables across groups. Researchers faced with the problem of missing data often use multiple imputation of values in place of the missing observations. This study compares the performance of 2 methods for combining p values in…
Multiple imputation of missing data in nested case-control and case-cohort studies.
Keogh, Ruth H; Seaman, Shaun R; Bartlett, Jonathan W; Wood, Angela M
2018-06-05
The nested case-control and case-cohort designs are two main approaches for carrying out a substudy within a prospective cohort. This article adapts multiple imputation (MI) methods for handling missing covariates in full-cohort studies for nested case-control and case-cohort studies. We consider data missing by design and data missing by chance. MI analyses that make use of full-cohort data and MI analyses based on substudy data only are described, alongside an intermediate approach in which the imputation uses full-cohort data but the analysis uses only the substudy. We describe adaptations to two imputation methods: the approximate method (MI-approx) of White and Royston and the "substantive model compatible" (MI-SMC) method of Bartlett et al. We also apply the "MI matched set" approach of Seaman and Keogh to nested case-control studies, which does not require any full-cohort information. The methods are investigated using simulation studies and all perform well when their assumptions hold. Substantial gains in efficiency can be made by imputing data missing by design using the full-cohort approach or by imputing data missing by chance in analyses using the substudy only. The intermediate approach brings greater gains in efficiency relative to the substudy approach and is more robust to imputation model misspecification than the full-cohort approach. The methods are illustrated using the ARIC Study cohort. Supplementary Materials provide R and Stata code. © 2018, The International Biometric Society.
Meta-analysis with missing study-level sample variance data.
Chowdhry, Amit K; Dworkin, Robert H; McDermott, Michael P
2016-07-30
We consider a study-level meta-analysis with a normally distributed outcome variable and possibly unequal study-level variances, where the object of inference is the difference in means between a treatment and control group. A common complication in such an analysis is missing sample variances for some studies. A frequently used approach is to impute the weighted (by sample size) mean of the observed variances (mean imputation). Another approach is to include only those studies with variances reported (complete case analysis). Both mean imputation and complete case analysis are only valid under the missing-completely-at-random assumption, and even then the inverse variance weights produced are not necessarily optimal. We propose a multiple imputation method employing gamma meta-regression to impute the missing sample variances. Our method takes advantage of study-level covariates that may be used to provide information about the missing data. Through simulation studies, we show that multiple imputation, when the imputation model is correctly specified, is superior to competing methods in terms of confidence interval coverage probability and type I error probability when testing a specified group difference. Finally, we describe a similar approach to handling missing variances in cross-over studies. Copyright © 2016 John Wiley & Sons, Ltd.
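A sketch of how the gamma meta-regression step might be set up with statsmodels; the covariate (mean_age), the log link, and the toy numbers are assumptions for illustration, and a full analysis would repeat the draw M times while also propagating parameter uncertainty.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy study-level data: observed sample variances plus a study covariate.
df = pd.DataFrame({
    "variance": [4.1, 3.8, np.nan, 5.0, np.nan, 4.4],
    "mean_age": [52, 48, 61, 55, 58, 50],
})

# Gamma meta-regression of the observed variances on the study covariate.
obs = df.dropna(subset=["variance"])
X_obs = sm.add_constant(obs[["mean_age"]])
fit = sm.GLM(obs["variance"], X_obs,
             family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# One stochastic imputation: draw each missing variance from the fitted gamma.
miss = df[df["variance"].isna()]
X_miss = sm.add_constant(miss[["mean_age"]], has_constant="add")
mu = fit.predict(X_miss)                 # fitted mean variance for missing studies
shape = 1.0 / fit.scale                  # gamma shape from the estimated dispersion
rng = np.random.default_rng(0)
imputed_variances = rng.gamma(shape, mu / shape)   # one of M draws
```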
Jones, Rachael M; Stayner, Leslie T; Demirtas, Hakan
2014-10-01
Drinking water may contain pollutants that harm human health. Pollutant monitoring may be quarterly, annual, or less frequent, depending upon the pollutant, the pollutant concentration, and the community water system. However, birth and other health outcomes are associated with narrow time-windows of exposure. Infrequent monitoring impedes linkage between water quality and health outcomes for epidemiological analyses. Our objective was to evaluate the performance of multiple imputation to fill in water quality values between measurements in community water systems (CWSs). The multiple imputation method was implemented in a simulated setting using data from the Atrazine Monitoring Program (AMP, 2006-2009 in five Midwestern states). Values were deleted from the AMP data to leave one measurement per month. Four patterns reflecting drinking water monitoring regulations were used to delete months of data in each CWS: three patterns were missing at random and one pattern was missing not at random. Synthetic health outcome data were created using a linear and a Poisson exposure-response relationship with five levels of hypothesized association, respectively. The multiple imputation method was evaluated by comparing the exposure-response relationships estimated from the multiply imputed data with the hypothesized association. The four patterns deleted 65-92% of the months of atrazine observations in the AMP data. Even with these high rates of missing information, our procedure was able to recover most of the missing information when the synthetic health outcome was included for missing at random patterns and for missing not at random patterns with low-to-moderate exposure-response relationships. Multiple imputation appears to be an effective method for filling in water quality values between measurements. Copyright © 2014 Elsevier Inc. All rights reserved.
Liu, Benmei; Yu, Mandi; Graubard, Barry I; Troiano, Richard P; Schenker, Nathaniel
2016-01-01
The Physical Activity Monitor (PAM) component was introduced into the 2003-2004 National Health and Nutrition Examination Survey (NHANES) to collect objective information on physical activity, including both movement intensity counts and ambulatory steps. Due to an error in the accelerometer device initialization process, the steps data were missing for all participants in several primary sampling units (PSUs), typically a single county or group of contiguous counties, who had intensity count data from their accelerometers. To avoid potential bias and loss of efficiency in estimation and inference involving the steps data, we considered methods to accurately impute the missing values for steps collected in the 2003-2004 NHANES. The objective was to develop an efficient imputation method that minimized model-based assumptions. We adopted a multiple imputation approach based on Additive Regression, Bootstrapping and Predictive mean matching (ARBP) methods. This method fits alternative conditional expectation (ace) models, which use an automated procedure to estimate optimal transformations for both the predictor and response variables. This paper describes the approaches used in this imputation and evaluates the methods by comparing the distributions of the original and the imputed data. A simulation study using the observed data is also conducted as part of the model diagnostics. Finally, some real-data analyses are performed to compare the before and after imputation results. PMID:27488606
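The predictive mean matching component of ARBP, in minimal form: regress the incomplete variable on predictors, match each missing case to the observed cases with the closest predicted means, and draw an observed donor value. This sketch skips the additive-regression transformations and the bootstrap step; all names and numbers are illustrative.

```python
import numpy as np

def pmm_impute(y, X, k=5, rng=None):
    """Predictive mean matching: regress y on X, match each missing case to the
    k observed cases with the closest predicted means, draw one donor's value."""
    rng = rng if rng is not None else np.random.default_rng()
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])       # add intercept
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    yhat = Xd @ beta
    out = y.copy()
    for i in np.where(~obs)[0]:
        donors = np.argsort(np.abs(yhat[obs] - yhat[i]))[:k]
        out[i] = rng.choice(y[obs][donors])          # observed value from a donor
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=200)
steps = 7000 + 900 * x + rng.normal(scale=300, size=200)   # toy daily step counts
steps[rng.random(200) < 0.3] = np.nan                      # 30% missing
completed = pmm_impute(steps, x.reshape(-1, 1), k=5, rng=rng)
```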
Hayati Rezvan, Panteha; Lee, Katherine J; Simpson, Julie A
2015-04-07
Missing data are common in medical research, which can lead to a loss in statistical power and potentially biased results if not handled appropriately. Multiple imputation (MI) is a statistical method, widely adopted in practice, for dealing with missing data. Many academic journals now emphasise the importance of reporting information regarding missing data and proposed guidelines for documenting the application of MI have been published. This review evaluated the reporting of missing data, the application of MI including the details provided regarding the imputation model, and the frequency of sensitivity analyses within the MI framework in medical research articles. A systematic review of articles published in the Lancet and New England Journal of Medicine between January 2008 and December 2013 in which MI was implemented was carried out. We identified 103 papers that used MI, with the number of papers increasing from 11 in 2008 to 26 in 2013. Nearly half of the papers specified the proportion of complete cases or the proportion with missing data by each variable. In the majority of the articles (86%) the imputed variables were specified. Of the 38 papers (37%) that stated the method of imputation, 20 used chained equations, 8 used multivariate normal imputation, and 10 used alternative methods. Very few articles (9%) detailed how they handled non-normally distributed variables during imputation. Thirty-nine papers (38%) stated the variables included in the imputation model. Less than half of the papers (46%) reported the number of imputations, and only two papers compared the distribution of imputed and observed data. Sixty-six papers presented the results from MI as a secondary analysis. Only three articles carried out a sensitivity analysis following MI to assess departures from the missing at random assumption, with details of the sensitivity analyses only provided by one article. This review outlined deficiencies in the documenting of missing data and the details provided about imputation. Furthermore, only a few articles performed sensitivity analyses following MI even though this is strongly recommended in guidelines. Authors are encouraged to follow the available guidelines and provide information on missing data and the imputation process.
Missing data and multiple imputation in clinical epidemiological research.
Pedersen, Alma B; Mikkelsen, Ellen M; Cronin-Fenton, Deirdre; Kristensen, Nickolaj R; Pham, Tra My; Pedersen, Lars; Petersen, Irene
2017-01-01
Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can constitute considerable challenges in the analyses and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analyses, missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method to deal with missing data, which accounts for the uncertainty associated with missing data. Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data. PMID:28352203
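The three mechanisms can be made concrete with a small simulation in which blood pressure values are deleted (i) completely at random, (ii) depending on an observed covariate, and (iii) depending on the unobserved value itself. Under MCAR the complete-case mean is roughly unbiased; under MAR and MNAR it is distorted, but only the MAR distortion can be corrected from the observed covariate, which is what multiple imputation exploits. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(60, 10, n)                       # fully observed covariate
bp = rng.normal(120 + 0.5 * (age - 60), 10, n)    # blood pressure, to be made missing

mcar = rng.random(n) < 0.2                                # MCAR: unrelated to anything
mar = rng.random(n) < 1 / (1 + np.exp(-(age - 70) / 5))   # MAR: depends on observed age
mnar = rng.random(n) < 1 / (1 + np.exp(-(bp - 130) / 5))  # MNAR: depends on bp itself

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: complete-case mean BP = {bp[~mask].mean():.1f}"
          f" (true {bp.mean():.1f})")
```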
Andridge, Rebecca R.
2011-01-01
In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller ICCs lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared. PMID:21259309
Zhou, Hanzhi; Elliott, Michael R; Raghunathan, Trivellore E
2016-06-01
Multistage sampling is often employed in survey samples for cost and convenience. However, accounting for clustering features when generating datasets for multiple imputation is a nontrivial task, particularly when, as is often the case, cluster sampling is accompanied by unequal probabilities of selection, necessitating case weights. Thus, multiple imputation often ignores complex sample designs and assumes simple random sampling when generating imputations, even though failing to account for complex sample design features is known to yield biased estimates and confidence intervals that have incorrect nominal coverage. In this article, we extend a recently developed, weighted, finite-population Bayesian bootstrap procedure to generate synthetic populations conditional on complex sample design data that can be treated as simple random samples at the imputation stage, obviating the need to directly model design features for imputation. We develop two forms of this method: one where the probabilities of selection are known at the first and second stages of the design, and the other, more common in public use files, where only the final weight based on the product of the two probabilities is known. We show that this method has advantages in terms of bias, mean square error, and coverage properties over methods where sample designs are ignored, with little loss in efficiency, even when compared with correct fully parametric models. An application is made using the National Automotive Sampling System Crashworthiness Data System, a multistage, unequal probability sample of U.S. passenger vehicle crashes, which suffers from a substantial amount of missing data in "Delta-V," a key crash severity measure. PMID:29226161
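A highly simplified sketch of the weighted Bayesian bootstrap idea: perturb the case weights with a Dirichlet draw and resample a synthetic population with the resulting probabilities, after which the population can be imputed as if it were a simple random sample. This is not the authors' two-stage procedure; the toy Delta-V values and weights are invented.

```python
import numpy as np

def weighted_bayesian_bootstrap(values, weights, pop_size, rng):
    """Draw one synthetic population: perturb the case weights with a Dirichlet
    draw, then resample cases with the resulting probabilities."""
    probs = weights * rng.dirichlet(np.ones(len(values)))
    probs /= probs.sum()
    return values[rng.choice(len(values), size=pop_size, replace=True, p=probs)]

rng = np.random.default_rng(0)
delta_v = np.array([12.0, np.nan, 25.0, 8.0, np.nan, 30.0, 15.0])  # toy crash records
case_w = np.array([10.0, 40.0, 25.0, 10.0, 5.0, 30.0, 20.0])       # sampling weights

# The synthetic population can then be imputed as if it were a simple random
# sample (e.g. with any standard MICE implementation), and the whole process
# repeated to create multiple synthetic populations and imputations.
population = weighted_bayesian_bootstrap(delta_v, case_w, pop_size=100, rng=rng)
```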
Liu, Siwei; Molenaar, Peter C M
2014-12-01
This article introduces iVAR, an R program for imputing missing data in multivariate time series on the basis of vector autoregressive (VAR) models. We conducted a simulation study to compare iVAR with three methods for handling missing data: listwise deletion, imputation with sample means and variances, and multiple imputation ignoring time dependency. The results showed that iVAR produces better estimates for the cross-lagged coefficients than do the other three methods. We demonstrate the use of iVAR with an empirical example of time series electrodermal activity data and discuss the advantages and limitations of the program.
Multiple Imputation for Incomplete Data in Epidemiologic Studies
Harel, Ofer; Mitchell, Emily M; Perkins, Neil J; Cole, Stephen R; Tchetgen Tchetgen, Eric J; Sun, BaoLuo; Schisterman, Enrique F
2018-01-01
Epidemiologic studies are frequently susceptible to missing information. Omitting observations with missing variables remains a common strategy in epidemiologic studies, yet this simple approach can often severely bias parameter estimates of interest if the values are not missing completely at random. Even when missingness is completely random, complete-case analysis can reduce the efficiency of estimated parameters, because large amounts of available data are simply tossed out with the incomplete observations. Alternative methods for mitigating the influence of missing information, such as multiple imputation, are becoming an increasingly popular strategy in order to retain all available information, reduce potential bias, and improve efficiency in parameter estimation. In this paper, we describe the theoretical underpinnings of multiple imputation, and we illustrate application of this method as part of a collaborative challenge to assess the performance of various techniques for dealing with missing data (Am J Epidemiol. 2018;187(3):568–575). We detail the steps necessary to perform multiple imputation on a subset of data from the Collaborative Perinatal Project (1959–1974), where the goal is to estimate the odds of spontaneous abortion associated with smoking during pregnancy. PMID:29165547
A nonparametric multiple imputation approach for missing categorical data.
Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu-Hsieh
2017-06-06
Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.
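A sketch of the two-working-model construction: fit a multinomial logistic outcome model and a logistic missingness model, combine their predicted scores into a distance with a balancing weight w, and impute each missing case by a random draw from its nearest observed donors. The data and the choices w = 0.5 and k = 10 are invented, and a real application would repeat the draws across multiple imputations with refitted or perturbed working models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, k, w = 1000, 10, 0.5
X = rng.normal(size=(n, 2))
y = rng.integers(0, 3, size=n)             # 3-category outcome (toy labels)
missing = rng.random(n) < 0.25             # which outcomes are unobserved

# Working model 1: multinomial logistic regression for the outcome (complete cases).
outcome_probs = (LogisticRegression(max_iter=1000)
                 .fit(X[~missing], y[~missing]).predict_proba(X))
# Working model 2: logistic regression for the missingness probability.
miss_prob = (LogisticRegression(max_iter=1000)
             .fit(X, missing).predict_proba(X)[:, 1])

# Distance mixes the two predictive scores; w balances their contributions.
imputed = y.copy()
obs_idx = np.where(~missing)[0]
for i in np.where(missing)[0]:
    d = (w * np.linalg.norm(outcome_probs[obs_idx] - outcome_probs[i], axis=1)
         + (1 - w) * np.abs(miss_prob[obs_idx] - miss_prob[i]))
    donors = obs_idx[np.argsort(d)[:k]]    # donor set: k nearest observed cases
    imputed[i] = rng.choice(y[donors])     # one draw; repeat for multiple imputations
```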
Wallace, Meredith L; Anderson, Stewart J; Mazumdar, Sati
2010-12-20
Missing covariate data present a challenge to tree-structured methodology due to the fact that a single tree model, as opposed to an estimated parameter value, may be desired for use in a clinical setting. To address this problem, we suggest a multiple imputation algorithm that adds draws of stochastic error to a tree-based single imputation method presented by Conversano and Siciliano (Technical Report, University of Naples, 2003). Unlike previously proposed techniques for accommodating missing covariate data in tree-structured analyses, our methodology allows the modeling of complex and nonlinear covariate structures while still resulting in a single tree model. We perform a simulation study to evaluate our stochastic multiple imputation algorithm when covariate data are missing at random and compare it to other currently used methods. Our algorithm is advantageous for identifying the true underlying covariate structure when complex data and larger percentages of missing covariate observations are present. It is competitive with other current methods with respect to prediction accuracy. To illustrate our algorithm, we create a tree-structured survival model for predicting time to treatment response in older, depressed adults. Copyright © 2010 John Wiley & Sons, Ltd.
Statistical methods for incomplete data: Some results on model misspecification.
McIsaac, Michael; Cook, R J
2017-02-01
Inverse probability weighted estimating equations and multiple imputation are two of the most studied frameworks for dealing with incomplete data in clinical and epidemiological research. We examine the limiting behaviour of estimators arising from inverse probability weighted estimating equations, augmented inverse probability weighted estimating equations and multiple imputation when the requisite auxiliary models are misspecified. We compute limiting values for settings involving binary responses and covariates and illustrate the effects of model misspecification using simulations based on data from a breast cancer clinical trial. We demonstrate that, even when both auxiliary models are misspecified, the asymptotic biases of double-robust augmented inverse probability weighted estimators are often smaller than the asymptotic biases of estimators arising from complete-case analyses, inverse probability weighting or multiple imputation. We further demonstrate that use of inverse probability weighting or multiple imputation with slightly misspecified auxiliary models can actually result in greater asymptotic bias than the use of naïve, complete case analyses. These asymptotic results are shown to be consistent with empirical results from simulation studies.
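The inverse probability weighting side of the comparison, in minimal form: fit an auxiliary response model, then weight complete cases by the inverse of their estimated response probabilities. Deliberately misspecifying the response model, for example by omitting a covariate, reproduces the kind of asymptotic bias studied above; the data-generating numbers are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                                    # always-observed covariate
y = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)    # binary response
observed = rng.random(n) < 1 / (1 + np.exp(-(0.5 + x)))   # MAR response indicator

# Auxiliary model: probability of being a complete case given x.
resp_model = LogisticRegression().fit(x.reshape(-1, 1), observed)
pi = resp_model.predict_proba(x.reshape(-1, 1))[:, 1]

# IPW estimate of E[y]: complete cases weighted by 1/pi (Hajek form).
ipw_mean = np.sum(observed * y / pi) / np.sum(observed / pi)
naive_mean = y[observed].mean()            # unweighted complete-case mean, biased here
```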
A Hot-Deck Multiple Imputation Procedure for Gaps in Longitudinal Recurrent Event Histories
Wang, Chia-Ning; Little, Roderick; Nan, Bin; Harlow, Siobán D.
2012-01-01
We propose a regression-based hot deck multiple imputation method for gaps of missing data in longitudinal studies, where subjects experience a recurrent event process and a terminal event. Examples are repeated asthma episodes and death, or menstrual periods and the menopause, as in our motivating application. Research interest concerns the onset time of a marker event, defined by the recurrent-event process, or the duration from this marker event to the final event. Gaps in the recorded event history make it difficult to determine the onset time of the marker event, and hence, the duration from onset to the final event. Simple approaches such as jumping gap times or dropping cases with gaps have obvious limitations. We propose a procedure for imputing information in the gaps by substituting information in the gap from a matched individual with a completely recorded history in the corresponding interval. Predictive Mean Matching is used to incorporate information on longitudinal characteristics of the repeated process and the final event time. Multiple imputation is used to propagate imputation uncertainty. The procedure is applied to an important data set for assessing the timing and duration of the menopausal transition. The performance of the proposed method is assessed by a simulation study. PMID:21361886
Hamel, J F; Sebille, V; Le Neel, T; Kubis, G; Boyer, F C; Hardouin, J B
2017-12-01
Subjective health measurements using Patient Reported Outcomes (PRO) are increasingly used in randomized trials, particularly for comparisons of patient groups. Two main types of analytical strategies can be used for such data: Classical Test Theory (CTT) and Item Response Theory (IRT) models. These two strategies display very similar characteristics when data are complete, but in the common case when data are missing, whether IRT or CTT would be the most appropriate remains unknown and was investigated using simulations. We simulated PRO data such as quality of life data. Missing responses to items were simulated as being completely random, depending on an observable covariate, or depending on an unobserved latent trait. The CTT-based methods considered compared scores using complete-case analysis, personal mean imputation, or multiple imputation based on a two-way procedure. The IRT-based method was the Wald test on a Rasch model including a group covariate. The IRT-based method and the multiple-imputation-based CTT method displayed the highest observed power and were the only unbiased methods whatever the kind of missing data. Online software and Stata® modules compatible with the native mi impute suite are provided for performing such analyses. Traditional procedures (listwise deletion and personal mean imputation) should be avoided, due to inevitable problems of bias and lack of power.
Caron, Alexandre; Clement, Guillaume; Heyman, Christophe; Aernout, Eva; Chazard, Emmanuel; Le Tertre, Alain
2015-01-01
Incompleteness of epidemiological databases is a major drawback when it comes to analyzing data. We conceived an epidemiological study to assess the association between newborn thyroid function and exposure to perchlorates found in the tap water of the mother's home. The perchlorate exposure was known for only 9% of newborns. The aim of our study was to design, test and evaluate an original method for imputing the perchlorate exposure of newborns based on their maternity of birth. A first database held an exhaustive collection of newborn thyroid function measurements from systematic neonatal screening. In this database, the municipality of residence of the newborn's mother was available only for 2012. Between 2004 and 2011, the closest data available was the municipality of the maternity of birth. Exposure was assessed using a second database, which contained the perchlorate levels for each municipality. We computed the catchment area of every maternity ward based on the French nationwide exhaustive database of inpatient stays. Municipality, and consequently perchlorate exposure, was imputed by a weighted draw in the catchment area. Missing values for the remaining covariates were imputed by chained equations. A linear mixed model was computed on each imputed dataset. We compared odds ratios (ORs) and 95% confidence intervals (95% CIs) estimated on real versus imputed 2012 data. The same model was then carried out for the whole imputed database. The ORs estimated on 36,695 observations by our multiple imputation method are comparable to those from the real 2012 data. On the 394,979 observations of the whole database, the ORs remain stable but the 95% CIs tighten considerably. The model estimates computed on imputed data are similar to those calculated on real data. The main advantage of multiple imputation is to provide unbiased estimates of the ORs while maintaining their variances. Thus, our method will be used to increase the statistical power of future studies by including all 394,979 newborns.
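The weighted-draw step is straightforward to sketch: for a newborn whose maternity ward is known, sample a municipality with probability proportional to the ward's catchment shares and assign the corresponding perchlorate level, repeating the draw across the multiple imputed datasets. The municipality labels, shares, and levels below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Catchment area of one maternity ward: candidate municipalities and the share
# of the ward's births coming from each (as derived from inpatient-stay data).
municipalities = np.array(["A", "B", "C", "D"])
birth_shares = np.array([0.55, 0.25, 0.15, 0.05])
perchlorate = {"A": 4.0, "B": 15.0, "C": 2.0, "D": 9.0}   # ug/L by municipality

# One imputation: draw a municipality weighted by the catchment shares, then
# assign its perchlorate level; repeat once per imputed dataset.
drawn = rng.choice(municipalities, p=birth_shares)
exposure = perchlorate[drawn]
```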
Bounthavong, Mark; Watanabe, Jonathan H; Sullivan, Kevin M
2015-04-01
The complete capture of all values for each variable of interest in pharmacy research studies remains aspirational. The absence of these possibly influential values is a common problem for pharmacist investigators. Failure to account for missing data may translate to biased study findings and conclusions. Our goal in this analysis was to apply validated statistical methods for missing data to a previously analyzed data set and compare results when missing data methods were implemented versus standard analytics that ignore missing data effects. Using data from a retrospective cohort study, the statistical method of multiple imputation was used to provide regression-based estimates of the missing values to improve the data available for measuring study outcomes. These findings were then contrasted with a complete-case analysis that restricted estimation to subjects in the cohort who had no missing values. Odds ratios were compared to assess differences in the findings of the analyses. A nonadjusted regression analysis ("crude analysis") was also performed as a reference for potential bias. The setting was a Veterans Integrated Systems Network that includes VA facilities in the Southern California and Nevada regions; participants were new statin users between November 30, 2006, and December 2, 2007, with a diagnosis of dyslipidemia. We compared the odds ratios (ORs) and 95% confidence intervals (CIs) from the crude, complete-case, and multiple imputation analyses for the end points of a 25% or greater reduction in atherogenic lipids. Data were missing for 21.5% of identified patients (1665 subjects of 7739). Regression model results were similar for the crude, complete-case, and multiple imputation analyses, with overlap of 95% confidence limits at each end point. The crude, complete-case, and multiple imputation ORs for a 25% or greater reduction in low-density lipoprotein cholesterol were 3.5 (95% CI 3.1-3.9), 4.3 (95% CI 3.8-4.9), and 4.1 (95% CI 3.7-4.6), respectively. The crude, complete-case, and multiple imputation ORs for a 25% or greater reduction in non-high-density lipoprotein cholesterol were 3.5 (95% CI 3.1-3.9), 4.5 (95% CI 4.0-5.2), and 4.4 (95% CI 3.9-4.9), respectively. The crude, complete-case, and multiple imputation ORs for a 25% or greater reduction in triglycerides were 3.1 (95% CI 2.8-3.6), 4.0 (95% CI 3.5-4.6), and 4.1 (95% CI 3.6-4.6), respectively. The use of the multiple imputation method to account for missing data did not alter conclusions based on the complete-case analysis. Given the frequency of missing data in research using electronic health records and pharmacy claims data, multiple imputation may play an important role in the validation of study findings. © 2015 Pharmacotherapy Publications, Inc.
Incomplete Data in Smart Grid: Treatment of Values in Electric Vehicle Charging Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Majipour, Mostafa; Chu, Peter; Gadh, Rajit
2014-11-03
In this paper, five imputation methods, namely Constant (zero), Mean, Median, Maximum Likelihood, and Multiple Imputation, have been applied to compensate for missing values in Electric Vehicle (EV) charging data. The outcome of each of these methods has been used as the input to a prediction algorithm to forecast the EV load in the next 24 hours at each individual outlet. The data are real-world data at the outlet level from the UCLA campus parking lots. Given the sparsity of the data, both Median and Constant (zero) imputations improved the prediction results. Since in most missing-value cases in our database all values of that instance are missing, the multivariate imputation methods did not improve the results significantly compared to univariate approaches.
Time-dependent summary receiver operating characteristics for meta-analysis of prognostic studies.
Hattori, Satoshi; Zhou, Xiao-Hua
2016-11-20
Prognostic studies are widely conducted to examine whether biomarkers are associated with patients' prognoses and play important roles in medical decisions. Because findings from one prognostic study may be very limited, meta-analyses may be useful to obtain sound evidence. However, prognostic studies are often analyzed by relying on a study-specific cut-off value, which can lead to difficulty in applying the standard meta-analysis techniques. In this paper, we propose two methods to estimate a time-dependent version of the summary receiver operating characteristics curve for meta-analyses of prognostic studies with a right-censored time-to-event outcome. We introduce a bivariate normal model for the pair of time-dependent sensitivity and specificity and propose a method to form inferences based on summary statistics reported in published papers. This method provides a valid inference asymptotically. In addition, we consider a bivariate binomial model. To draw inferences from this bivariate binomial model, we introduce a multiple imputation method. The multiple imputation is found to be approximately proper multiple imputation, and thus the standard Rubin's variance formula is justified from a Bayesian viewpoint. Our simulation study and application to a real dataset revealed that both methods work well with a moderate or large number of studies and that the bivariate binomial model coupled with multiple imputation outperforms the bivariate normal model with a small number of studies. Copyright © 2016 John Wiley & Sons, Ltd.
Should "Multiple Imputations" Be Treated as "Multiple Indicators"?
ERIC Educational Resources Information Center
Mislevy, Robert J.
1993-01-01
Multiple imputations for latent variables are constructed so that analyses treating them as true variables have the correct expectations for population characteristics. Analyzing multiple imputations in accordance with their construction yields correct estimates of population characteristics, whereas analyzing them as multiple indicators generally…
[Imputation methods for missing data in educational diagnostic evaluation].
Fernández-Alonso, Rubén; Suárez-Álvarez, Javier; Muñiz, José
2012-02-01
In the diagnostic evaluation of educational systems, self-reports are commonly used to collect data, both cognitive and orectic. For various reasons, some of the students' data are frequently missing from these self-reports. The main goal of this research is to compare the performance of different imputation methods for missing data in the context of the evaluation of educational systems. On an empirical database of 5,000 subjects, 72 conditions were simulated: three levels of missing data, three types of loss mechanisms, and eight methods of imputation. The levels of missing data were 5%, 10%, and 20%. The loss mechanisms were missing completely at random, moderately conditioned, and strongly conditioned. The eight imputation methods were: listwise deletion, replacement by the mean of the scale, by the item mean, by the subject mean, by the corrected subject mean, multiple regression, and the Expectation-Maximization (EM) algorithm with and without auxiliary variables. The results indicate that the recovery of the data is more accurate when using an appropriate combination of different methods of recovering lost data. When a case is incomplete, the mean of the subject works very well, whereas for completely lost data, multiple imputation with the EM algorithm is recommended. The use of this combination is especially recommended when data loss is greater and its loss mechanism is more conditioned. Lastly, the results are discussed and some future lines of research are analyzed.
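The "mean of the subject" rule for partially answered questionnaires amounts to a few lines of code, as sketched below; per the findings above, respondents with no answered items are better handled with EM-based multiple imputation. The toy score matrix is invented.

```python
import numpy as np

def person_mean_impute(scores):
    """Replace each missing item score with the respondent's mean over the
    items they did answer; all-missing rows stay NaN (handle those with EM)."""
    out = scores.astype(float).copy()
    row_mean = np.nanmean(out, axis=1, keepdims=True)
    return np.where(np.isnan(out), row_mean, out)

scores = np.array([[3, 4, np.nan, 5],
                   [2, np.nan, np.nan, 1],
                   [4, 4, 5, 3]])
completed = person_mean_impute(scores)   # row 1 gets 4.0, row 2 gets 1.5
```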
Seaman, Shaun R; White, Ian R; Carpenter, James R
2015-01-01
Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation. Imputation of partially observed covariates is complicated if the substantive model is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms, and standard software implementations of multiple imputation may impute covariates from models that are incompatible with such substantive models. We show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model. We investigate through simulation the performance of this proposal, and compare it with existing approaches. Simulation results suggest our proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible. Stata software implementing the approach is freely available. PMID:24525487
DOT National Transportation Integrated Search
2002-01-01
The National Center for Statistics and Analysis (NCSA) of the National Highway Traffic Safety Administration (NHTSA) has undertaken several approaches to remedy the problem of missing blood alcohol test results in the Fatality Analysis Reporting ...
Lazar, Cosmin; Gatto, Laurent; Ferro, Myriam; Bruley, Christophe; Burger, Thomas
2016-04-01
Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the different statistical methods to conduct imputation, have compared them on real or simulated data sets, and have recommended a list of missing value imputation methods for proteomics applications. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views: For instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate a few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.
NASA Astrophysics Data System (ADS)
Yozgatligil, Ceylan; Aslan, Sipan; Iyigun, Cem; Batmaz, Inci
2013-04-01
This study aims to compare several imputation methods for completing the missing values of spatio-temporal meteorological time series. To this end, six imputation methods are assessed with respect to various criteria including accuracy, robustness, precision, and efficiency for artificially created missing data in monthly total precipitation and mean temperature series obtained from the Turkish State Meteorological Service. Of these methods, simple arithmetic average, normal ratio (NR), and NR weighted with correlations comprise the simple ones, whereas the multilayer perceptron type neural network and the multiple imputation strategy adopting Markov chain Monte Carlo based on expectation-maximization (EM-MCMC) are the computationally intensive ones. In addition, we propose a modification of the EM-MCMC method. Besides using a conventional accuracy measure based on squared errors, we also suggest the correlation dimension (CD) technique of nonlinear dynamic time series analysis, which takes spatio-temporal dependencies into account, for evaluating imputation performance. Based on detailed graphical and quantitative analyses, the computational methods, particularly the EM-MCMC method, although computationally demanding, seem favorable for imputation of meteorological time series with respect to different missingness periods, considering both measures and both series studied. To conclude, using the EM-MCMC algorithm to impute missing values before conducting any statistical analyses of meteorological data will decrease the amount of uncertainty and give more robust results. Moreover, the CD measure can be suggested for the performance evaluation of missing data imputation, particularly with computational methods, since it gives more precise results in meteorological time series.
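Of the simple methods compared, the normal ratio (NR) method has a compact closed form: the missing value at the target station is the average of the neighbour observations, each scaled by the ratio of the target's long-term normal to that neighbour's normal. A minimal sketch with invented station values:

```python
import numpy as np

def normal_ratio(neighbor_values, neighbor_normals, target_normal):
    """NR estimate of a missing monthly value at a target station."""
    neighbor_values = np.asarray(neighbor_values, dtype=float)
    neighbor_normals = np.asarray(neighbor_normals, dtype=float)
    return np.mean(target_normal / neighbor_normals * neighbor_values)

# Three neighbour stations reported 80, 95 and 70 mm this month; their
# long-term normals are 100, 120 and 90 mm; the target's normal is 110 mm.
print(normal_ratio([80, 95, 70], [100, 120, 90], 110))
```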
Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data.
Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha
2015-12-01
Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead, the measurement of a variable such as blood glucose may depend on its prior values as well as those of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships along with multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time-lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and on the Fourier transform. This enables imputation of missing values even when all data at a time point are missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring), the proposed method has the highest imputation accuracy. This held when up to half the data were missing and when consecutive missing values made up a significant fraction of the overall time series length. Copyright © 2015 Elsevier Inc. All rights reserved.
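FLk-NN itself is not reproduced here; the sketch below conveys only the Fourier half of the idea, iteratively refilling gaps from a reconstruction built on the largest-magnitude frequency components. The initial mean fill, the number of retained frequencies, and the iteration count are our choices.

```python
import numpy as np

def fourier_impute(x, n_freq=5, n_iter=20):
    """Iteratively refill NaN gaps in a 1-D series from a reconstruction
    that keeps only the n_freq largest-magnitude frequency components."""
    x = np.asarray(x, dtype=float)
    miss = np.isnan(x)
    filled = np.where(miss, np.nanmean(x), x)      # crude initial fill
    for _ in range(n_iter):
        spec = np.fft.rfft(filled)
        drop = np.argsort(np.abs(spec))[:-n_freq]  # zero all but the top n_freq
        spec[drop] = 0
        recon = np.fft.irfft(spec, n=len(filled))
        filled[miss] = recon[miss]                 # update only the gaps
    return filled
```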
Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials.
Hori, Tomoaki; Montcho, David; Agbangla, Clement; Ebana, Kaworu; Futakuchi, Koichi; Iwata, Hiroyoshi
2016-11-01
A method based on a multi-task Gaussian process using self-measuring similarity gave increased accuracy for imputing missing phenotypic data in multi-trait and multi-environment trials. Multi-environmental trial (MET) data often encounter the problem of missing data. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to understand. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally performed for multiple traits that are correlated to each other. Correlation among traits can be useful information for imputation, but single-trait-based methods cannot utilize information shared by traits that are correlated. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels reflecting relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data and marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for the generation of missing data at various missing rates were considered. The MTGP methods achieved better imputation accuracy than regularized PCA, especially at high missing rates. Under the 'uniform' scenario, in which missing data arise randomly, inclusion of marker genotype data in the imputation increased the imputation accuracy at high missing rates. Under the 'fiber' scenario, in which missing data arise in all traits for some combinations of genotypes and environments, the inclusion of marker genotype data decreased the imputation accuracy for most traits while remarkably increasing the accuracy for a few traits. The proposed methods will be useful for solving the missing data problem in MET data.
Imputation method for lifetime exposure assessment in air pollution epidemiologic studies
2013-01-01
Background Environmental epidemiology, when focused on the life course of exposure to a specific pollutant, requires historical exposure estimates that are difficult to obtain for the full time period due to gaps in the historical record, especially in earlier years. We show that these gaps can be filled by applying multiple imputation methods to a formal risk equation that incorporates lifetime exposure. We also address challenges that arise, including choice of imputation method, potential bias in regression coefficients, and uncertainty in age-at-exposure sensitivities. Methods During time periods when parameters needed in the risk equation are missing for an individual, the parameters are filled by an imputation model using group-level information or interpolation. A random component is added to match the variance found in the estimates for study subjects not needing imputation. The process is repeated to obtain multiple data sets, whose regressions against health data can be combined statistically to develop confidence limits using Rubin’s rules to account for the uncertainty introduced by the imputations. To test for possible recall bias between cases and controls, which can occur when historical residence location is obtained by interview and which can lead to misclassification of imputed exposure by disease status, we introduce an “incompleteness index,” equal to the percentage of dose imputed (PDI) for a subject. “Effective doses” can be computed using different functional dependencies of relative risk on age of exposure, allowing intercomparison of different risk models. To illustrate our approach, we quantify lifetime exposure (dose) from traffic air pollution in an established case–control study on Long Island, New York, where considerable in-migration occurred over a period of many decades. Results The major result is the described approach to imputation. The illustrative example revealed potential recall bias, suggesting that regressions against health data should be done as a function of PDI to check for consistency of results. The 1% of study subjects who lived for long durations near heavily trafficked intersections had very high cumulative exposures. Thus, imputation methods must be designed to reproduce non-standard distributions. Conclusions Our approach meets a number of methodological challenges to extending historical exposure reconstruction over a lifetime and shows promise for environmental epidemiology. Application to assessment of breast cancer risks will be reported in a subsequent manuscript. PMID:23919666
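The pooling step cited here, Rubin's rules, fits in a few lines: the pooled estimate is the mean of the m completed-data estimates, and the total variance adds the average within-imputation variance to (1 + 1/m) times the between-imputation variance. A minimal sketch with invented numbers:

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool m completed-data estimates and their squared standard errors."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()                      # pooled point estimate
    within = u.mean()                    # average within-imputation variance
    between = q.var(ddof=1)              # between-imputation variance
    total = within + (1 + 1 / m) * between
    return qbar, np.sqrt(total)          # estimate and pooled standard error

print(rubins_rules([0.52, 0.48, 0.55, 0.50, 0.51], [0.01] * 5))
```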
Variable selection under multiple imputation using the bootstrap in a prognostic study
Heymans, Martijn W; van Buuren, Stef; Knol, Dirk L; van Mechelen, Willem; de Vet, Henrica CW
2007-01-01
Background Missing data is a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty and thereby allows for adequate statistical testing. We developed and tested a methodology combining MI with bootstrapping techniques for studying prognostic variable selection. Method In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Missing data among the outcome and prognostic variables ranged from 0% to 48.1%. We used four methods to investigate the influence of sampling and imputation variation, respectively: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of prognostic models developed by the four methods were assessed at different inclusion levels. Results We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined, over the range of 0% (full model) to 90% variable selection, bootstrap-corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86 were found. Conclusion We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive to apply to data sets with missing values. PMID:17629912
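A rough analogue of the combined procedure, with a chained-equations imputer and a lasso standing in for the paper's imputation model and selection method (both substitutions, and all names, are ours): resample rows, impute each resample, fit, and count which predictors survive.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LassoCV

def inclusion_frequencies(X, y, n_boot=200, seed=0):
    """Bootstrap + impute + select; returns each predictor's inclusion
    frequency across n_boot resamples of (X, y), where X may contain NaNs."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap rows
        Xb = IterativeImputer(random_state=0).fit_transform(X[idx])
        coefs = LassoCV(cv=5).fit(Xb, y[idx]).coef_
        counts += coefs != 0                         # survived selection?
    return counts / n_boot
```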
An Introduction to Modern Missing Data Analyses
ERIC Educational Resources Information Center
Baraldi, Amanda N.; Enders, Craig K.
2010-01-01
A great deal of recent methodological research has focused on two modern missing data analysis methods: maximum likelihood and multiple imputation. These approaches are advantageous relative to traditional techniques (e.g., deletion and mean imputation) because they require less stringent assumptions and mitigate the pitfalls of traditional…
Hopke, P K; Liu, C; Rubin, D B
2001-03-01
Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.
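The paper develops full multiple-imputation models; as a single-imputation sketch of the censoring idea alone, values known only to lie below a detection limit can be drawn from a lognormal fitted to the observed data and truncated above at that limit. The moment-fitting shortcut and the function name are ours.

```python
import numpy as np
from scipy import stats

def impute_below_lod(x, lod, seed=0):
    """Replace NaNs (values recorded only as '< LOD') in x by draws from
    a lognormal fitted to the observed data, truncated above at the LOD."""
    rng = np.random.default_rng(seed)
    logs = np.log(x[~np.isnan(x)])
    mu, sigma = logs.mean(), logs.std(ddof=1)
    # Inverse-CDF sampling restricted to the region below log(LOD).
    p_max = stats.norm.cdf((np.log(lod) - mu) / sigma)
    u = rng.uniform(0, p_max, size=int(np.isnan(x).sum()))
    out = x.copy()
    out[np.isnan(x)] = np.exp(mu + sigma * stats.norm.ppf(u))
    return out
```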
Estimating Interaction Effects With Incomplete Predictor Variables
Enders, Craig K.; Baraldi, Amanda N.; Cham, Heining
2014-01-01
The existing missing data literature does not provide a clear prescription for estimating interaction effects with missing data, particularly when the interaction involves a pair of continuous variables. In this article, we describe maximum likelihood and multiple imputation procedures for this common analysis problem. We outline 3 latent variable model specifications for interaction analyses with missing data. These models apply procedures from the latent variable interaction literature to analyses with a single indicator per construct (e.g., a regression analysis with scale scores). We also discuss multiple imputation for interaction effects, emphasizing an approach that applies standard imputation procedures to the product of 2 raw score predictors. We thoroughly describe the process of probing interaction effects with maximum likelihood and multiple imputation. For both missing data handling techniques, we outline centering and transformation strategies that researchers can implement in popular software packages, and we use a series of real data analyses to illustrate these methods. Finally, we use computer simulations to evaluate the performance of the proposed techniques. PMID:24707955
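The imputation approach emphasized here, treating the raw-score product as just another variable, can be sketched with a generic chained-equations imputer: append x1*x2 to the data matrix before imputing so the imputed values respect the interaction. The simulated data and the use of scikit-learn's IterativeImputer (not the authors' software) are our assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)
X = np.column_stack([x1, x2, x1 * x2])        # product as "just another variable"
X[rng.uniform(size=X.shape) < 0.2] = np.nan   # punch 20% MCAR holes
X_imp = IterativeImputer(random_state=0).fit_transform(X)
```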
Bozio, Catherine H; Flanders, W Dana; Finelli, Lyn; Bramley, Anna M; Reed, Carrie; Gandhi, Neel R; Vidal, Jorge E; Erdman, Dean; Levine, Min Z; Lindstrom, Stephen; Ampofo, Krow; Arnold, Sandra R; Self, Wesley H; Williams, Derek J; Grijalva, Carlos G; Anderson, Evan J; McCullers, Jonathan A; Edwards, Kathryn M; Pavia, Andrew T; Wunderink, Richard G; Jain, Seema
2018-01-01
Abstract Background Real-time polymerase chain reaction (PCR) on respiratory specimens and serology on paired blood specimens are used to determine the etiology of respiratory illnesses for research studies. However, convalescent serology is often not collected. We used multiple imputation to assign values for missing serology results to estimate virus-specific prevalence among pediatric and adult community-acquired pneumonia hospitalizations, using data from an active population-based surveillance study. Methods Presence of adenoviruses, human metapneumovirus, influenza viruses, parainfluenza virus types 1–3, and respiratory syncytial virus was defined by positive PCR on nasopharyngeal/oropharyngeal specimens or a 4-fold rise in paired serology. We performed multiple imputation by developing a multivariable regression model for each virus using data from patients with available serology results. We calculated absolute and relative differences in the proportion of each virus detected, comparing the imputed to observed (nonimputed) results. Results Among 2222 children and 2259 adults, 98.8% and 99.5% had nasopharyngeal/oropharyngeal specimens and 43.2% and 37.5% had paired serum specimens, respectively. Imputed results increased viral etiology assignments by an absolute difference of 1.6%–4.4% and 0.8%–2.8% in children and adults, respectively; relative differences were 1.1–3.0 times higher. Conclusions Multiple imputation can be used when serology results are missing to refine virus-specific prevalence estimates, which will likely increase as a result.
Chaurasia, Ashok; Harel, Ofer
2015-02-10
Tests for regression coefficients such as global, local, and partial F-tests are common in applied research. In the framework of multiple imputation, there are several papers addressing tests for regression coefficients. However, for simultaneous hypothesis testing, the existing methods are computationally intensive because they involve calculation with vectors and (inversion of) matrices. In this paper, we propose a simple method based on the scalar entity, coefficient of determination, to perform (global, local, and partial) F-tests with multiply imputed data. The proposed method is evaluated using simulated data and applied to suicide prevention data. Copyright © 2014 John Wiley & Sons, Ltd.
Shara, Nawar; Yassin, Sayf A.; Valaitis, Eduardas; Wang, Hong; Howard, Barbara V.; Wang, Wenyu; Lee, Elisa T.; Umans, Jason G.
2015-01-01
Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989–1991), 2 (1993–1995), and 3 (1998–1999) was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results. PMID:26414328
Falcaro, Milena; Carpenter, James R
2017-06-01
Population-based net survival by tumour stage at diagnosis is a key measure in cancer surveillance. Unfortunately, data on tumour stage are often missing for a non-negligible proportion of patients, and the mechanism giving rise to the missingness is usually anything but completely at random. In this setting, restricting analysis to the subset of complete records typically gives biased results. Multiple imputation is a promising practical approach to the issues raised by the missing data, but its use in conjunction with the Pohar-Perme method for estimating net survival has not been formally evaluated. We performed a resampling study using colorectal cancer population-based registry data to evaluate the ability of multiple imputation, used along with the Pohar-Perme method, to deliver unbiased estimates of stage-specific net survival and recover missing stage information. We created 1000 independent data sets, each containing 5000 patients. Stage data were then made missing at random under two scenarios (30% and 50% missingness). Complete records analysis showed substantial bias and poor confidence interval coverage. Across both scenarios our multiple imputation strategy virtually eliminated the bias and greatly improved confidence interval coverage. In the presence of missing stage data, complete records analysis often gives severely biased results. We showed that combining multiple imputation with the Pohar-Perme estimator provides a valid practical approach for the estimation of stage-specific colorectal cancer net survival. As usual, when the percentage of missing data is high the results should be interpreted cautiously and sensitivity analyses are recommended. Copyright © 2017 Elsevier Ltd. All rights reserved.
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Su, Edwin P; Grauer, Jonathan N
2018-03-01
Despite the advantages of large, national datasets, one continuing concern is missing data values. Complete case analysis, where only cases with complete data are analyzed, is commonly used rather than more statistically rigorous approaches such as multiple imputation. This study characterizes the potential selection bias introduced by complete case analysis and compares the results of common regressions using both techniques following unicompartmental knee arthroplasty. Patients undergoing unicompartmental knee arthroplasty were extracted from the 2005 to 2015 National Surgical Quality Improvement Program. As examples, the demographics of patients with and without missing preoperative albumin and hematocrit values were compared. Missing data were then treated with both complete case analysis and multiple imputation (an approach that reproduces the variation and associations that would have been present in a full dataset), and the conclusions of common regressions for adverse outcomes were compared. A total of 6117 patients were included, of whom 56.7% were missing at least one value. Younger, female, and healthier patients were more likely to have missing preoperative albumin and hematocrit values. The use of complete case analysis removed 3467 patients from the study, in comparison with multiple imputation, which included all 6117 patients. The 2 methods of handling missing values led to differing associations of low preoperative laboratory values with commonly studied adverse outcomes. The use of complete case analysis can introduce selection bias and may lead to different conclusions in comparison with the statistically rigorous multiple imputation approach. Joint surgeons should consider the methods of handling missing values when interpreting arthroplasty research. Copyright © 2017 Elsevier Inc. All rights reserved.
McClure, Matthew C.; Sonstegard, Tad S.; Wiggans, George R.; Van Eenennaam, Alison L.; Weber, Kristina L.; Penedo, Cecilia T.; Berry, Donagh P.; Flynn, John; Garcia, Jose F.; Carmo, Adriana S.; Regitano, Luciana C. A.; Albuquerque, Milla; Silva, Marcos V. G. B.; Machado, Marco A.; Coffey, Mike; Moore, Kirsty; Boscher, Marie-Yvonne; Genestout, Lucie; Mazza, Raffaele; Taylor, Jeremy F.; Schnabel, Robert D.; Simpson, Barry; Marques, Elisa; McEwan, John C.; Cromie, Andrew; Coutinho, Luiz L.; Kuehn, Larry A.; Keele, John W.; Piper, Emily K.; Cook, Jim; Williams, Robert; Van Tassell, Curtis P.
2013-01-01
To assist cattle producers transition from microsatellite (MS) to single nucleotide polymorphism (SNP) genotyping for parental verification, we previously devised an effective and inexpensive method to impute MS alleles from SNP haplotypes. While the reported method was verified with only a limited data set (N = 479) from Brown Swiss, Guernsey, Holstein, and Jersey cattle, some of the MS-SNP haplotype associations were concordant across these phylogenetically diverse breeds. This implied that some haplotypes predate modern breed formation and remain in strong linkage disequilibrium. To expand the utility of MS allele imputation across breeds, MS and SNP data from more than 8000 animals representing 39 breeds (Bos taurus and B. indicus) were used to predict 9410 SNP haplotypes, incorporating an average of 73 SNPs per haplotype, for which alleles from 12 MS markers could be accurately imputed. Approximately 25% of the MS-SNP haplotypes were present in multiple breeds (N = 2 to 36 breeds). These shared haplotypes allowed for MS imputation in breeds that were not represented in the reference population with only a small increase in Mendelian inheritance inconsistencies. Our reference haplotypes can be used for any cattle breed, and the reported methods can be applied to any species to aid the transition from MS to SNP genetic markers. While ~91% of the animals with imputed alleles for 12 MS markers had ≤1 Mendelian inheritance conflicts with their parents' reported MS genotypes, this figure was 96% for our reference animals, indicating potential errors in the reported MS genotypes. The workflow we suggest autocorrects for genotyping errors and rare haplotypes by MS-genotyping animals whose imputed MS alleles fail parentage verification and then incorporating those animals into the reference dataset. PMID:24065982
Stanley J. Zarnoch; H. Ken Cordell; Carter J. Betz; John C. Bergstrom
2010-01-01
Multiple imputation is used to create values for missing family income data in the National Survey on Recreation and the Environment. We present an overview of the survey and a description of the missingness pattern for family income and other key variables. We create a logistic model for the multiple imputation process and use it to impute data sets for family income. We...
Imputation for multisource data with comparison and assessment techniques
Casleton, Emily Michele; Osthus, David Allen; Van Buren, Kendra Lu
2017-12-27
Missing data are a prevalent issue in analyses involving data collection. The problem of missing data is exacerbated for multisource analysis, where data from multiple sensors are combined to arrive at a single conclusion. In this scenario, missing data are more likely to occur and can lead to discarding a large amount of the data collected; however, the information from observed sensors can be leveraged to estimate the values not observed. We propose two methods for imputation of multisource data, both of which take advantage of potential correlation between data from different sensors, through ridge regression and a state-space model. These methods, as well as the common median imputation, are applied to data collected from a variety of sensors monitoring an experimental facility. Performance of the imputation methods is compared with the mean absolute deviation; however, rather than using this metric solely to rank the methods, we also propose an approach to identify significant differences. Imputation techniques are also assessed by their ability to produce appropriate confidence intervals, through coverage and length, around the imputed values. Finally, performance of imputed datasets is compared with a marginalized dataset through a weighted k-means clustering. In general, we found that imputation through a dynamic linear model tended to be the most accurate and to produce the most precise confidence intervals, and that imputing the missing values and down-weighting them with respect to observed values in the analysis led to the most accurate performance.
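Of the two proposed approaches, the ridge-regression one is easy to sketch: fill gaps in one sensor by regressing it on the other sensors over rows where everything needed is observed. The state-space/dynamic linear model variant, which the study found most accurate, is not shown; the function name and parameters are ours.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_impute_column(X, j, alpha=1.0):
    """Fill NaNs in sensor (column) j of X using ridge regression on the
    remaining sensors, trained on fully observed rows."""
    others = np.delete(np.arange(X.shape[1]), j)
    others_ok = ~np.isnan(X[:, others]).any(axis=1)
    train = ~np.isnan(X[:, j]) & others_ok   # target and predictors observed
    fill = np.isnan(X[:, j]) & others_ok     # target missing, predictors observed
    model = Ridge(alpha=alpha).fit(X[train][:, others], X[train, j])
    out = X.copy()
    out[fill, j] = model.predict(X[fill][:, others])
    return out
```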
Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis.
Beaulieu-Jones, Brett K; Lavage, Daniel R; Snyder, John W; Moore, Jason H; Pendergrass, Sarah A; Bauer, Christopher R
2018-02-23
Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results. The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling). Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available. ©Brett K Beaulieu-Jones, Daniel R Lavage, John W Snyder, Jason H Moore, Sarah A Pendergrass, Christopher R Bauer. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 23.02.2018.
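The evaluation design used here, building a complete-case matrix, simulating missingness, and scoring each imputer on the masked cells, generalizes readily. A minimal MCAR-only sketch (the study also simulates MAR, MNAR, and modelled mechanisms, which this does not attempt):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

def masked_rmse(imputer, X_complete, frac=0.1, seed=0):
    """Mask a random fraction of a complete matrix, impute, and report
    RMSE on the masked cells only."""
    rng = np.random.default_rng(seed)
    mask = rng.uniform(size=X_complete.shape) < frac
    X_imp = imputer.fit_transform(np.where(mask, np.nan, X_complete))
    return np.sqrt(np.mean((X_imp[mask] - X_complete[mask]) ** 2))

X = np.random.default_rng(0).normal(size=(500, 8))
for imp in (SimpleImputer(strategy="mean"), KNNImputer(n_neighbors=5)):
    print(type(imp).__name__, masked_rmse(imp, X))
```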
A Comparison of Item-Level and Scale-Level Multiple Imputation for Questionnaire Batteries
ERIC Educational Resources Information Center
Gottschall, Amanda C.; West, Stephen G.; Enders, Craig K.
2012-01-01
Behavioral science researchers routinely use scale scores that sum or average a set of questionnaire items to address their substantive questions. A researcher applying multiple imputation to incomplete questionnaire data can either impute the incomplete items prior to computing scale scores or impute the scale scores directly from other scale…
SparRec: An effective matrix completion framework of missing data imputation for GWAS
NASA Astrophysics Data System (ADS)
Jiang, Bo; Ma, Shiqian; Causey, Jason; Qiao, Linbo; Hardin, Matthew Price; Bitts, Ian; Johnson, Daniel; Zhang, Shuzhong; Huang, Xiuzhen
2016-10-01
Genome-wide association studies present computational challenges for missing data imputation, while advances in genotyping technologies are generating datasets of large sample size with sample sets genotyped on multiple SNP chips. We present a new framework, SparRec (Sparse Recovery), for imputation, with the following properties: (1) The optimization models of SparRec, based on low rank and a low number of co-clusters of matrices, are different from current statistical methods. While our low-rank matrix completion (LRMC) model is similar to Mendel-Impute, our matrix co-clustering factorization (MCCF) model is completely new. (2) SparRec, like other matrix completion methods, can be flexibly applied to missing data imputation for large meta-analyses with different cohorts genotyped on different sets of SNPs, even when there is no reference panel. This kind of meta-analysis is very challenging for current statistics-based methods. (3) SparRec has consistent performance and achieves high recovery accuracy even when the missing data rate is as high as 90%. Compared with Mendel-Impute, our low-rank-based method achieves similar accuracy and efficiency, while the co-clustering-based method has advantages in running time. The testing results show that SparRec has significant advantages and competitive performance over other state-of-the-art statistical methods, including Beagle and fastPhase.
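SparRec itself is not reproduced here; the flavour of its low-rank matrix completion (LRMC) model can be conveyed by the classic iterative truncated-SVD heuristic, alternating between a rank-r reconstruction and refilling the holes. The rank and iteration count are our choices.

```python
import numpy as np

def iterative_svd_complete(X, rank=5, n_iter=100):
    """Naive low-rank completion: repeatedly truncate the SVD of the
    current fill and copy the reconstruction back into the missing cells."""
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X), X)   # start from the global mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled[miss] = low_rank[miss]
    return filled
```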
Simons, Claire L; Rivero-Arias, Oliver; Yu, Ly-Mee; Simon, Judit
2015-04-01
Missing data are a well-known and widely documented problem in cost-effectiveness analyses alongside clinical trials using individual patient-level data. Current methodological research recommends multiple imputation (MI) to deal with missing health outcome data, but there is little guidance on whether MI for multi-attribute questionnaires, such as the EQ-5D-3L, should be carried out at the domain or the summary score level. In this paper, we evaluated the impact of imputing individual domains versus imputing index values to deal with missing EQ-5D-3L data using a simulation study, and developed recommendations for future practice. We simulated missing data in a patient-level dataset with complete EQ-5D-3L data at one point in time from a large multinational clinical trial (n = 1,814). Different proportions of missing data were generated using a missing at random (MAR) mechanism, and three different scenarios were studied. The performance of each method was evaluated using the root mean squared error and mean absolute error of the actual versus predicted EQ-5D-3L indices. With large sample sizes (n > 500) and a missing data pattern dominated by unit non-response, imputing domains or the index produced similar results. However, domain imputation became more accurate than index imputation when the pattern of missingness followed item non-response. For smaller sample sizes (n < 100), index imputation was more accurate. When MI models were misspecified, both domain and index imputations were inaccurate for any proportion of missing data. The decision between imputing the domains or the EQ-5D-3L index scores depends on the observed missing data pattern and the sample size available for analysis. Analysts conducting this type of exercise should also evaluate the sensitivity of the analysis to the MAR assumption and whether the imputation model is correctly specified.
Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data
Andrew T. Hudak; Nicholas L. Crookston; Jeffrey S. Evans; David E. Hall; Michael J. Falkowski
2008-01-01
Meaningful relationships between forest structure attributes measured in representative field plots on the ground and remotely sensed data measured comprehensively across the same forested landscape facilitate the production of maps of forest attributes such as basal area (BA) and tree density (TD). Because imputation methods can efficiently predict multiple response...
Penalized regression procedures for variable selection in the potential outcomes framework
Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L.
2015-01-01
A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple ‘impute, then select’ class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems, and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data and imputation are drawn. A difference LASSO algorithm is defined, along with its multiple imputation analogues. The procedures are illustrated using a well-known right heart catheterization dataset. PMID:25628185
Newgard, Craig; Malveau, Susan; Staudenmayer, Kristan; Wang, N. Ewen; Hsia, Renee Y.; Mann, N. Clay; Holmes, James F.; Kuppermann, Nathan; Haukoos, Jason S.; Bulger, Eileen M.; Dai, Mengtao; Cook, Lawrence J.
2012-01-01
Objectives The objective was to evaluate the process of using existing data sources, probabilistic linkage, and multiple imputation to create large population-based injury databases matched to outcomes. Methods This was a retrospective cohort study of injured children and adults transported by 94 emergency medical systems (EMS) agencies to 122 hospitals in seven regions of the western United States over a 36-month period (2006 to 2008). All injured patients evaluated by EMS personnel within specific geographic catchment areas were included, regardless of field disposition or outcome. The authors performed probabilistic linkage of EMS records to four hospital and postdischarge data sources (emergency department [ED] data, patient discharge data, trauma registries, and vital statistics files) and then handled missing values using multiple imputation. The authors compare and evaluate matched records, match rates (proportion of matches among eligible patients), and injury outcomes within and across sites. Results There were 381,719 injured patients evaluated by EMS personnel in the seven regions. Among transported patients, match rates ranged from 14.9% to 87.5% and were directly affected by the availability of hospital data sources and proportion of missing values for key linkage variables. For vital statistics records (1-year mortality), estimated match rates ranged from 88.0% to 98.7%. Use of multiple imputation (compared to complete case analysis) reduced bias for injury outcomes, although sample size, percentage missing, type of variable, and combined-site versus single-site imputation models all affected the resulting estimates and variance. Conclusions This project demonstrates the feasibility and describes the process of constructing population-based injury databases across multiple phases of care using existing data sources and commonly available analytic methods. Attention to key linkage variables and decisions for handling missing values can be used to increase match rates between data sources, minimize bias, and preserve sampling design. PMID:22506952
NASA Astrophysics Data System (ADS)
Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
2016-04-01
Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships, and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have recently been tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31,900 km2). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation by Chained Equations (MICE), at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix and in selected bivariate trait relationships. MICE yielded imputations which better preserved the variability and covariance structure of the data and provided an estimate of between-imputation uncertainty. We found that adding species identity as a predictor in MICE and kNN improved imputation for all traits, but adding climate did not lead to any appreciable improvement. However, forest structure and spatial structure did reduce imputation errors in maximum height and in leaf biomass to sapwood area ratios, respectively. Although species mean imputations showed the lowest error for 3 out of the 5 studied traits, dataset-averaged errors were lowest for MICE imputations with all additional predictors when missing data levels were 50% or lower. Species mean imputations always resulted in larger errors in the correlation matrix and appreciably altered the studied bivariate trait relationships. In conclusion, MICE imputations using species identity, climate, forest structure and spatial structure as predictors emerged as the most suitable method of those tested here, but it was also evident that imputation performance deteriorates at high levels of missing data (80%).
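One practical finding here, that adding species identity as a predictor improves kNN imputation, is easy to emulate with a generic imputer by one-hot encoding the species column. The toy trait table and all values below are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical plot-level trait table; NaN marks unmeasured traits.
df = pd.DataFrame({
    "species": ["P. sylvestris", "Q. ilex", "P. sylvestris", "Q. ilex"],
    "sla": [4.1, np.nan, 3.9, 9.8],            # specific leaf area
    "wood_density": [0.51, 0.92, np.nan, 0.88],
    "max_height": [22.0, 8.5, 24.0, np.nan],
})
X = pd.get_dummies(df, columns=["species"])    # species identity as predictors
imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```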
Zhang, Guosheng; Huang, Kuan-Chieh; Xu, Zheng; Tzeng, Jung-Ying; Conneely, Karen N; Guan, Weihua; Kang, Jian; Li, Yun
2016-05-01
DNA methylation is a key epigenetic mark involved in both normal development and disease progression. Recent advances in high-throughput technologies have enabled genome-wide profiling of DNA methylation. However, DNA methylation profiling often employs different designs and platforms with varying resolution, which hinders joint analysis of methylation data from multiple platforms. In this study, we propose a penalized functional regression model to impute missing methylation data. By incorporating functional predictors, our model utilizes information from nonlocal probes to improve imputation quality. Here, we compared the performance of our functional model to linear regression and the best single probe surrogate in real data and via simulations. Specifically, we applied different imputation approaches to an acute myeloid leukemia dataset consisting of 194 samples and our method showed higher imputation accuracy, manifested, for example, by a 94% relative increase in information content and up to 86% more CpG sites passing post-imputation filtering. Our simulated association study further demonstrated that our method substantially improves the statistical power to identify trait-associated methylation loci. These findings indicate that the penalized functional regression model is a convenient and valuable imputation tool for methylation data, and it can boost statistical power in downstream epigenome-wide association study (EWAS). © 2016 WILEY PERIODICALS, INC.
Multivariate missing data in hydrology - Review and applications
NASA Astrophysics Data System (ADS)
Ben Aissia, Mohamed-Aymen; Chebana, Fateh; Ouarda, Taha B. M. J.
2017-12-01
Water resources planning and management require complete data sets of a number of hydrological variables, such as flood peaks and volumes. However, hydrologists are often faced with the problem of missing data (MD) in hydrological databases. Several methods are used to deal with the imputation of MD. During the last decade, multivariate approaches have gained popularity in the field of hydrology, especially in hydrological frequency analysis (HFA). However, treating MD remains neglected in the multivariate HFA literature, where the focus has been mainly on the modeling component. For a complete analysis and in order to optimize the use of data, MD should also be treated in the multivariate setting prior to modeling and inference. Imputation of MD in the multivariate hydrological framework can have direct implications on the quality of the estimation. Indeed, the dependence between the series represents important additional information that can be included in the imputation process. The objective of the present paper is to highlight the importance of treating MD in multivariate hydrological frequency analysis by reviewing and applying multivariate imputation methods and by comparing univariate and multivariate imputation methods. An application is carried out for multiple flood attributes at three sites in order to evaluate the performance of the different methods based on the leave-one-out procedure. The results indicate that the performance of imputation methods can be improved by adopting the multivariate setting, compared to mean substitution and interpolation methods, especially when using the copula-based approach.
Probability genotype imputation method and integrated weighted lasso for QTL identification.
Demetrashvili, Nino; Van den Heuvel, Edwin R; Wit, Ernst C
2013-12-30
Many QTL studies share two common features: (1) marker information is often missing, and (2) among the many markers involved in the biological process only a few are causal. In statistics, the second issue falls under the headings "sparsity" and "causal inference". The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. The outcomes of the proposed imputation method are probabilities, which serve as weights in the second step, the weighted lasso. Sparse phenotype inference is employed to select a set of predictive markers for the trait of interest. Simulation studies validate the proposed methodology under a wide range of realistic settings. Furthermore, the methodology outperforms alternative imputation and variable selection methods in such studies. The methodology was applied to an Arabidopsis experiment containing 69 markers for 165 recombinant inbred lines of an F8 generation. The results confirm previously identified regions, but several new markers are also found. On the basis of the inferred ROC behavior these markers show good potential for being real, especially for the germination trait Gmax. Our imputation method shows higher accuracy in terms of sensitivity and specificity compared with an alternative imputation method. Also, the proposed weighted lasso outperforms commonly practiced multiple regression as well as the traditional lasso and the adaptive lasso with three weighting schemes. This means that under realistic missing data settings this methodology can be used for QTL identification.
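A weighted lasso with per-coefficient penalty weights reduces to an ordinary lasso through the standard reparameterisation: scale column j by 1/w_j, fit, and rescale the coefficients back. In the paper the weights come from the imputation probabilities; in this sketch they are arbitrary inputs.

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, weights, alpha=0.1):
    """Penalize sum_j w_j * |b_j| (up to sklearn's 1/(2n) loss scaling) by
    column scaling: a lasso on X / w has coefficients b_tilde_j = w_j * b_j."""
    w = np.asarray(weights, dtype=float)
    model = Lasso(alpha=alpha).fit(X / w, y)
    return model.coef_ / w   # rescale back to the original parameterisation
```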
2014-01-01
BACKGROUND Elevated blood pressure (BP), a heritable risk factor for many age-related disorders, is commonly investigated in population and genetic studies, but antihypertensive use can confound study results. Routine methods to adjust for antihypertensives may not sufficiently account for newer treatment protocols (i.e., combination or multiple drug therapy) found in contemporary cohorts. METHODS We refined an existing method to impute unmedicated BP in individuals on antihypertensives by incorporating new treatment trends. We assessed BP and antihypertensive use in male twins (n = 1,237) from the Vietnam Era Twin Study of Aging: 36% reported antihypertensive use; 52% of those treated were on multiple drugs. RESULTS Estimated heritability was 0.43 (95% confidence interval (CI) = 0.20–0.50) and 0.44 (95% CI = 0.22–0.61) for measured systolic BP (SBP) and diastolic BP (DBP), respectively. We imputed unmedicated BP for treated individuals by 3 approaches: (i) addition of a fixed value of 10/5 mm Hg to measured SBP/DBP; (ii) incremented addition of mm Hg to BP based on the number of medications; and (iii) a refined approach adding mm Hg based on antihypertensive drug class and ethnicity. The imputations did not significantly affect the estimated heritability of BP. However, use of our most refined imputation method, as well as the other methods, resulted in significantly increased phenotypic correlations between BP and body mass index, a trait known to be correlated with BP. CONCLUSIONS This study highlights the potential usefulness of applying a representative adjustment for medication use, such as by considering drug class, ethnicity, and the combination of drugs, when assessing the relationship between BP and risk factors. PMID:24532572
Tang, Yongqiang
2017-12-01
Control-based pattern mixture models (PMM) and delta-adjusted PMMs are commonly used as sensitivity analyses in clinical trials with non-ignorable dropout. These PMMs assume that the statistical behavior of outcomes varies by pattern in the experimental arm in the imputation procedure, but the imputed data are typically analyzed by a standard method such as the primary analysis model. In the multiple imputation (MI) inference, Rubin's variance estimator is generally biased when the imputation and analysis models are uncongenial. One objective of the article is to quantify the bias of Rubin's variance estimator in the control-based and delta-adjusted PMMs for longitudinal continuous outcomes. These PMMs assume the same observed data distribution as the mixed effects model for repeated measures (MMRM). We derive analytic expressions for the MI treatment effect estimator and the associated Rubin's variance in these PMMs and MMRM as functions of the maximum likelihood estimator from the MMRM analysis and the observed proportion of subjects in each dropout pattern when the number of imputations is infinite. The asymptotic bias is generally small or negligible in the delta-adjusted PMM, but can be sizable in the control-based PMM. This indicates that the inference based on Rubin's rule is approximately valid in the delta-adjusted PMM. A simple variance estimator is proposed to ensure asymptotically valid MI inferences in these PMMs, and compared with the bootstrap variance. The proposed method is illustrated by the analysis of an antidepressant trial, and its performance is further evaluated via a simulation study. © 2017, The International Biometric Society.
Prediction of regulatory gene pairs using dynamic time warping and gene ontology.
Yang, Andy C; Hsu, Hui-Huang; Lu, Ming-Da; Tseng, Vincent S; Shih, Timothy K
2014-01-01
Selecting informative genes is the most important task in analyzing microarray gene expression data. In this work, we aim at identifying regulatory gene pairs from microarray gene expression data. However, microarray data often contain multiple missing expression values. Missing value imputation is thus needed before further processing for regulatory gene pairs becomes possible. We develop a novel approach that first imputes missing values in microarray time series data by combining k-Nearest Neighbour (KNN), Dynamic Time Warping (DTW), and Gene Ontology (GO). After missing values are imputed, we then perform gene regulation prediction based on our proposed DTW-GO distance measurement of gene pairs. Experimental results show that our approach is more accurate than existing missing value imputation methods on real microarray data sets. Furthermore, our approach can also discover more of the regulatory gene pairs known in the literature than other methods can.
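The DTW component of the proposed distance is the classic dynamic-programming recurrence; a minimal self-contained sketch (the Gene Ontology term of the authors' DTW-GO measure is not shown):

```python
import numpy as np

def dtw_distance(a, b):
    """O(len(a) * len(b)) dynamic-programming DTW distance between
    two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]))  # 0.0: same shape, warped
```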
Zhang, Zhaoyang; Fang, Hua; Wang, Honggang
2016-06-01
Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high dimensional with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Built upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we propose a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, and more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods, as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
Graffelman, Jan; Sánchez, Milagros; Cook, Samantha; Moreno, Victor
2013-01-01
In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missing values are typically lower than inbreeding coefficients estimated by discarding the missing values. Accounting for missing values by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions.
De Silva, Anurika Priyanjali; Moreno-Betancur, Margarita; De Livera, Alysha Madhu; Lee, Katherine Jane; Simpson, Julie Anne
2017-07-25
Missing data are a common problem in epidemiological studies and are particularly prominent in longitudinal data, which involve multiple waves of data collection. Traditional multiple imputation (MI) methods (fully conditional specification (FCS) and multivariate normal imputation (MVNI)) treat repeated measurements of the same time-dependent variable as just another 'distinct' variable for imputation and therefore do not make the most of the longitudinal structure of the data. Only a few studies have explored extensions to the standard approaches that account for the temporal structure of longitudinal data. One suggestion is the two-fold fully conditional specification (two-fold FCS) algorithm, which restricts the imputation of a time-dependent variable to time blocks, where the imputation model includes measurements taken at the specified and adjacent times. To date, no study has investigated the performance of two-fold FCS and standard MI methods for handling missing data in a time-varying covariate with a non-linear trajectory over time, a commonly encountered scenario in epidemiological studies. We simulated 1000 datasets of 5000 individuals based on the Longitudinal Study of Australian Children (LSAC). Three missing data mechanisms were used to impose missingness on body mass index (BMI) for age z-scores, a continuous time-varying exposure variable with a non-linear trajectory over time: missing completely at random (MCAR), and weak and strong missing at random (MAR) scenarios. We evaluated the performance of FCS, MVNI, and two-fold FCS for handling up to 50% missing data when assessing the association between childhood obesity and sleep problems. The standard two-fold FCS produced slightly more biased and less precise estimates than FCS and MVNI. We observed slight improvements in bias and precision when using a time window width of two for the two-fold FCS algorithm compared to the standard width of one. We recommend the use of FCS or MVNI in similar longitudinal settings; when convergence issues arise due to a large number of time points or variables with missing values, the two-fold FCS algorithm with exploration of a suitable time window width is an alternative.
Comparing multiple imputation methods for systematically missing subject-level data.
Kline, David; Andridge, Rebecca; Kaizar, Eloise
2017-06-01
When conducting research synthesis, the studies to be combined often do not measure the same set of variables, which creates missing data. When the studies to combine are longitudinal, missing data can occur at the observation level (time-varying) or the subject level (non-time-varying). Traditionally, the focus of missing data methods for longitudinal data has been on missing observation-level variables. In this paper, we focus on missing subject-level variables and compare two multiple imputation approaches: a joint modeling approach and a sequential conditional modeling approach. We find the joint modeling approach to be preferable to the sequential conditional approach, except when the covariance structure of the repeated outcome for each individual has homogeneous variance and exchangeable correlation. Specifically, the regression coefficient estimates from an analysis incorporating imputed values based on the sequential conditional method are attenuated and less efficient than those from the joint method. Remarkably, the estimates from the sequential conditional method are often less efficient than a complete case analysis, which, in the context of research synthesis, implies that we lose efficiency by combining studies. Copyright © 2015 John Wiley & Sons, Ltd.
Missing Data and Multiple Imputation: An Unbiased Approach
NASA Technical Reports Server (NTRS)
Foy, M.; VanBaalen, M.; Wear, M.; Mendez, C.; Mason, S.; Meyers, V.; Alexander, D.; Law, J.
2014-01-01
The default method of dealing with missing data in statistical analyses is to use only the complete observations (complete case analysis), which can lead to unexpected bias when data do not meet the assumption of missing completely at random (MCAR). For the assumption of MCAR to be met, missingness cannot be related to either the observed or unobserved variables. A less stringent assumption, missing at random (MAR), requires that missingness not be associated with the value of the missing variable itself, but it can be associated with the other observed variables. When data are truly MAR as opposed to MCAR, the default complete case analysis method can lead to biased results. Statistical options are available to adjust for data that are MAR, including multiple imputation (MI), which is consistent and efficient at estimating effects. Multiple imputation uses informing variables to determine a statistical distribution for each piece of missing data. Multiple datasets are then created by randomly drawing from the distribution for each piece of missing data. Because MI is efficient, only a limited number of imputed datasets, usually fewer than 20, are required to obtain stable estimates. Each imputed dataset is analyzed using standard statistical techniques, and the results are then combined to obtain overall estimates of effect. A simulation study demonstrates the results of using the default complete case analysis and MI in a linear regression of MCAR and MAR simulated data. Further, MI was successfully applied to the association study of CO2 levels and headaches when initial analysis showed there may be an underlying association between missing CO2 levels and reported headaches. Through MI, we were able to show that there is a strong association between average CO2 levels and the risk of headaches: each unit increase in CO2 (mmHg) resulted in a doubling in the odds of reported headaches.
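The impute-analyze-pool cycle described here is exactly what off-the-shelf chained-equations implementations automate. A minimal sketch in Python with statsmodels' MICE on simulated MAR data; the variable names, sample size, and analysis model are illustrative, not drawn from the study above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

# Simulated data: x2 is missing more often when x1 is large (a MAR pattern).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)
x2[rng.random(n) < 1.0 / (1.0 + np.exp(-x1))] = np.nan
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

imp = mice.MICEData(df)                        # chained-equations imputer
model = mice.MICE("y ~ x1 + x2", sm.OLS, imp)  # analysis model fit per dataset
results = model.fit(n_burnin=10, n_imputations=20)
print(results.summary())                       # Rubin's-rules pooled estimates
```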
Using full-cohort data in nested case-control and case-cohort studies by multiple imputation.
Keogh, Ruth H; White, Ian R
2013-10-15
In many large prospective cohorts, expensive exposure measurements cannot be obtained for all individuals. Exposure-disease association studies are therefore often based on nested case-control or case-cohort studies in which complete information is obtained only for sampled individuals. However, in the full cohort, there may be a large amount of information on cheaply available covariates and possibly a surrogate of the main exposure(s), which typically goes unused. We view the nested case-control or case-cohort study plus the remainder of the cohort as a full-cohort study with missing data. Hence, we propose using multiple imputation (MI) to utilise information in the full cohort when data from the sub-studies are analysed. We use the fully observed data to fit the imputation models. We consider using approximate imputation models and also using rejection sampling to draw imputed values from the true distribution of the missing values given the observed data. Simulation studies show that using MI to utilise full-cohort information in the analysis of nested case-control and case-cohort studies can result in important gains in efficiency, particularly when a surrogate of the main exposure is available in the full cohort. In simulations, this method outperforms counter-matching in nested case-control studies and a weighted analysis for case-cohort studies, both of which use some full-cohort information. Approximate imputation models perform well except when there are interactions or non-linear terms in the outcome model, where imputation using rejection sampling works well. Copyright © 2013 John Wiley & Sons, Ltd.
Missing continuous outcomes under covariate dependent missingness in cluster randomised trials.
Hossain, Anower; Diaz-Ordaz, Karla; Bartlett, Jonathan W
2017-06-01
Attrition is a common occurrence in cluster randomised trials, which leads to missing outcome data. Two approaches for analysing such trials are cluster-level analysis and individual-level analysis. This paper compares the performance of unadjusted cluster-level analysis, baseline covariate adjusted cluster-level analysis and linear mixed model analysis, under baseline covariate dependent missingness in continuous outcomes, in terms of bias, average estimated standard error and coverage probability. The methods of complete records analysis and multiple imputation are used to handle the missing outcome data. We considered four scenarios, with the missingness mechanism and baseline covariate effect on outcome either the same or different between intervention groups. We show that both unadjusted cluster-level analysis and baseline covariate adjusted cluster-level analysis give unbiased estimates of the intervention effect only if both intervention groups have the same missingness mechanism and there is no interaction between baseline covariate and intervention group. Linear mixed model and multiple imputation give unbiased estimates under all four considered scenarios, provided that an interaction of intervention and baseline covariate is included in the model when appropriate. Cluster mean imputation has been proposed as a valid approach for handling missing outcomes in cluster randomised trials. We show that cluster mean imputation only gives unbiased estimates when the missingness mechanism is the same between the intervention groups and there is no interaction between baseline covariate and intervention group. Multiple imputation shows overcoverage for a small number of clusters in each intervention group.
Wang, Guoshen; Pan, Yi; Seth, Puja; Song, Ruiguang; Belcher, Lisa
2017-01-01
Missing data create challenges for determining progress made in linking HIV-positive persons to HIV medical care. Statistical methods are not used to address missing program data on linkage. In 2014, 61 health department jurisdictions were funded by Centers for Disease Control and Prevention (CDC) and submitted data on HIV testing, newly diagnosed HIV-positive persons, and linkage to HIV medical care. Missing or unusable data existed in our data set. A new approach using multiple imputation to address missing linkage data was proposed, and results were compared to the current approach that uses data with complete information. There were 12,472 newly diagnosed HIV-positive persons from CDC-funded HIV testing events in 2014. Using multiple imputation, 94.1% (95% confidence interval (CI): [93.7%, 94.6%]) of newly diagnosed persons were referred to HIV medical care, 88.6% (95% CI: [88.0%, 89.1%]) were linked to care within any time frame, and 83.6% (95% CI: [83.0%, 84.3%]) were linked to care within 90 days. Multiple imputation is recommended for addressing missing linkage data in future analyses when the missing percentage is high. The use of multiple imputation for missing values can result in a better understanding of how programs are performing on key HIV testing and HIV service delivery indicators.
Quartagno, M; Carpenter, J R
2016-07-30
Recently, multiple imputation has been proposed as a tool for individual patient data meta-analysis with sporadically missing observations, and it has been suggested that within-study imputation is usually preferable. However, such within-study imputation cannot handle variables that are completely missing within studies. Further, if some of the contributing studies are relatively small, it may be appropriate to share information across studies when imputing. In this paper, we develop and evaluate a joint modelling approach to multiple imputation of individual patient data in meta-analysis, with an across-study probability distribution for the study specific covariance matrices. This retains the flexibility to allow for between-study heterogeneity when imputing while allowing (i) sharing information on the covariance matrix across studies when this is appropriate, and (ii) imputing variables that are wholly missing from studies. Simulation results show both equivalent performance to the within-study imputation approach where this is valid, and good results in more general, practically relevant, scenarios with studies of very different sizes, non-negligible between-study heterogeneity and wholly missing variables. We illustrate our approach using data from an individual patient data meta-analysis of hypertension trials. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Peyre, Hugo; Leplège, Alain; Coste, Joël
2011-03-01
Missing items are common in quality of life (QoL) questionnaires and present a challenge for research in this field. It remains unclear which of the various methods proposed to deal with missing data performs best in this context. We compared personal mean score, full information maximum likelihood, multiple imputation, and hot deck techniques using various realistic simulation scenarios of item missingness in QoL questionnaires constructed within the framework of classical test theory. Samples of 300 and 1,000 subjects were randomly drawn from the 2003 INSEE Decennial Health Survey (of 23,018 subjects representative of the French population and having completed the SF-36) and various patterns of missing data were generated according to three different item non-response rates (3, 6, and 9%) and three types of missing data (Little and Rubin's "missing completely at random," "missing at random," and "missing not at random"). The missing data methods were evaluated in terms of accuracy and precision for the analysis of one descriptive and one association parameter for three different scales of the SF-36. For all item non-response rates and types of missing data, multiple imputation and full information maximum likelihood appeared superior to the personal mean score and especially to hot deck in terms of accuracy and precision; however, the use of personal mean score was associated with insignificant bias (relative bias <2%) in all studied situations. Whereas multiple imputation and full information maximum likelihood are confirmed as reference methods, the personal mean score appears nonetheless appropriate for dealing with items missing from completed SF-36 questionnaires in most situations of routine use. These results can reasonably be extended to other questionnaires constructed according to classical test theory.
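For reference, the personal mean score (person-mean substitution) evaluated above is simple to implement: each respondent's missing items on a scale are replaced by the mean of the items that respondent did answer. A sketch under stated assumptions; the half-scale minimum-response rule used here is a common convention, not something specified by the study.

```python
import numpy as np
import pandas as pd

def personal_mean_score(items: pd.DataFrame, min_answered: int = None) -> pd.Series:
    """Scale score with person-mean imputation of missing items.

    items: one column per scale item, one row per respondent.
    Respondents answering fewer than `min_answered` items are left
    missing (half-scale rule by default, an assumed convention).
    """
    k = items.shape[1]
    if min_answered is None:
        min_answered = int(np.ceil(k / 2))
    person_mean = items.mean(axis=1)                      # skips NaN by default
    filled = items.apply(lambda col: col.fillna(person_mean))
    score = filled.sum(axis=1)
    score[items.notna().sum(axis=1) < min_answered] = np.nan
    return score
```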
Multiple imputation of missing covariates for the Cox proportional hazards cure model
Beesley, Lauren J; Bartlett, Jonathan W; Wolf, Gregory T; Taylor, Jeremy M G
2016-01-01
We explore several approaches for imputing partially observed covariates when the outcome of interest is a censored event time and when there is an underlying subset of the population that will never experience the event of interest. We call these subjects “cured,” and we consider the case where the data are modeled using a Cox proportional hazards (CPH) mixture cure model. We study covariate imputation approaches using fully conditional specification (FCS). We derive the exact conditional distribution and suggest a sampling scheme for imputing partially observed covariates in the CPH cure model setting. We also propose several approximations to the exact distribution that are simpler and more convenient to use for imputation. A simulation study demonstrates that the proposed imputation approaches outperform existing imputation approaches for survival data without a cure fraction in terms of bias in estimating CPH cure model parameters. We apply our multiple imputation techniques to a study of patients with head and neck cancer. PMID:27439726
Wahl, Simone; Boulesteix, Anne-Laure; Zierer, Astrid; Thorand, Barbara; van de Wiel, Mark A
2016-10-26
Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation. In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI, internal validation followed by MI on the training and test parts separately; MI-Val, MI on the full data set followed by internal validation; and MI(-y)-Val, MI on the full data set omitting the outcome, followed by internal validation. Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adapt a strategy for confidence interval construction to incomplete data. Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, number of covariates and decreasing sample size. In Val-MI, accuracy of the estimate is more strongly improved by increasing the number of bootstrap draws rather than the number of imputations. With a simple integrated approach, valid confidence intervals for performance estimates can be obtained. When prognostic models are developed on incomplete data, Val-MI represents a valid strategy to obtain estimates of predictive performance measures.
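A sketch of the Val-MI ordering (validation split first, imputation within each part separately) using scikit-learn. For brevity it uses a single holdout split and averages AUC over imputations, whereas the study uses bootstrap resampling with the 0.632+ correction; all names are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def val_mi_auc(X, y, n_imputations=5, seed=0):
    """Val-MI sketch: split first, then impute train and test parts separately."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    aucs = []
    for m in range(n_imputations):
        # Separate imputations per part, so no information crosses the split.
        X_tr_m = IterativeImputer(sample_posterior=True,
                                  random_state=m).fit_transform(X_tr)
        X_te_m = IterativeImputer(sample_posterior=True,
                                  random_state=1000 + m).fit_transform(X_te)
        clf = LogisticRegression(max_iter=1000).fit(X_tr_m, y_tr)
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te_m)[:, 1]))
    return float(np.mean(aucs))
```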
Multiple imputation to account for measurement error in marginal structural models
Edwards, Jessie K.; Cole, Stephen R.; Westreich, Daniel; Crane, Heidi; Eron, Joseph J.; Mathews, W. Christopher; Moore, Richard; Boswell, Stephen L.; Lesko, Catherine R.; Mugavero, Michael J.
2015-01-01
Background: Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and non-differential measurement error in a marginal structural model. Methods: We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. Results: In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality (hazard ratio (HR): 1.2; 95% CI: 0.6, 2.3). The HR for current smoking and therapy (0.4; 95% CI: 0.2, 0.7) was similar to the HR for no smoking and therapy (0.4; 95% CI: 0.2, 0.6). Conclusions: Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies. PMID:26214338
Rendall, Michael S.; Ghosh-Dastidar, Bonnie; Weden, Margaret M.; Baker, Elizabeth H.; Nazarov, Zafar
2013-01-01
Within-survey multiple imputation (MI) methods are adapted to pooled-survey regression estimation where one survey has more regressors, but typically fewer observations, than the other. This adaptation is achieved through: (1) larger numbers of imputations to compensate for the higher fraction of missing values; (2) model-fit statistics to check the assumption that the two surveys sample from a common universe; and (3) specifying the analysis model completely from variables present in the survey with the larger set of regressors, thereby excluding variables never jointly observed. In contrast to the typical within-survey MI context, cross-survey missingness is monotonic and easily satisfies the Missing At Random (MAR) assumption needed for unbiased MI. Large efficiency gains and substantial reduction in omitted variable bias are demonstrated in an application to sociodemographic differences in the risk of child obesity estimated from two nationally representative cohort surveys. PMID:24223447
Bouwman, Aniek C; Veerkamp, Roel F
2014-10-03
The aim of this study was to determine the consequences of splitting sequencing effort over multiple breeds for imputation accuracy from a high-density SNP chip towards whole-genome sequence. Such information would assist, for instance, numerically smaller cattle breeds, but also pig and chicken breeders, who have to choose wisely how to spend their sequencing effort over all the breeds or lines they evaluate. Sequence data from cattle breeds were used, because relatively many individuals from several breeds have been sequenced within the 1,000 Bull Genomes project. The advantage of whole-genome sequence data is that it carries the causal mutations, but the question is whether it is possible to impute the causal variants accurately. This study therefore focussed on imputation accuracy of variants with low minor allele frequency and breed-specific variants. Imputation accuracy was assessed for chromosomes 1 and 29 as the correlation between observed and imputed genotypes. For chromosome 1, the average imputation accuracy was 0.70 with a reference population of 20 Holstein animals, and increased to 0.83 when the reference population was enlarged by including 3 other dairy breeds with 20 animals each. When the same number of additional Holstein animals was added instead, the accuracy improved to 0.88, while adding the 3 other breeds to the reference population of 80 Holstein improved the average imputation accuracy marginally to 0.89. For chromosome 29, the average imputation accuracy was lower. Some variants benefitted from the inclusion of other breeds in the reference population, initially determined by the MAF of the variant in each breed, but even Holstein-specific variants gained imputation accuracy from the multi-breed reference population. This study shows that splitting sequencing effort over multiple breeds and combining the reference populations is a good strategy for imputation from high-density SNP panels towards whole-genome sequence when reference populations are small and sequencing effort is limiting. When sequencing effort is limiting and interest lies in multiple breeds or lines, this provides imputation for each breed.
Examining solutions to missing data in longitudinal nursing research.
Roberts, Mary B; Sullivan, Mary C; Winchester, Suzy B
2017-04-01
Longitudinal studies are highly valuable in pediatrics because they provide useful data about developmental patterns of child health and behavior over time. When data are missing, the value of the research is impacted. The study's purpose was to (1) introduce a three-step approach to assess and address missing data and (2) illustrate this approach using categorical and continuous-level variables from a longitudinal study of premature infants. A three-step approach with simulations was followed to assess the amount and pattern of missing data and to determine the most appropriate imputation method for the missing data. Patterns of missingness were Missing Completely at Random, Missing at Random, and Not Missing at Random. Missing continuous-level data were imputed using mean replacement, stochastic regression, multiple imputation, and fully conditional specification (FCS). Missing categorical-level data were imputed using last value carried forward, hot-decking, stochastic regression, and FCS. Simulations were used to evaluate these imputation methods under different patterns of missingness at different levels of missing data. The rate of missingness was 16-23% for continuous variables and 1-28% for categorical variables. FCS imputation provided the least difference in mean and standard deviation estimates for continuous measures. FCS imputation was acceptable for categorical measures. Results obtained through simulation reinforced and confirmed these findings. Significant investments are made in the collection of longitudinal data. The prudent handling of missing data can protect these investments and potentially improve the scientific information contained in pediatric longitudinal studies. © 2017 Wiley Periodicals, Inc.
Karakaya, Jale; Karabulut, Erdem; Yucel, Recai M.
2015-01-01
Modern statistical methods using incomplete data have been increasingly applied in a wide variety of substantive problems. Similarly, receiver operating characteristic (ROC) analysis, a method used in evaluating diagnostic tests or biomarkers in medical research, has become increasingly popular in both its development and application. While missing-data methods have been applied in ROC analysis, the impact of model mis-specification and/or assumptions (e.g. missing at random) underlying the missing data has not been thoroughly studied. In this work, we study the performance of multiple imputation (MI) inference in ROC analysis. Particularly, we investigate parametric and non-parametric techniques for MI inference under common missingness mechanisms. Depending on the coherency of the imputation model with the underlying data generation mechanism, our results show that MI generally leads to well-calibrated inferences under ignorable missingness mechanisms. PMID:26379316
Missing data imputation of solar radiation data under different atmospheric conditions.
Turrado, Concepción Crespo; López, María Del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; Juez, Francisco Javier de Cos
2014-10-29
Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field, such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions was 13.37% for the MICE algorithm, compared with 28.19% for MLR and 31.68% for IDW.
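A rough Python analogue of the network-based imputation described above can be sketched with scikit-learn's chained-equations style IterativeImputer, letting the remaining stations reconstruct a failing sensor. All station names, distributions, and numbers below are synthetic, for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical frame: rows are ten-minute records, columns are stations
# that share a common irradiance signal plus local noise.
rng = np.random.default_rng(1)
signal = rng.gamma(2.0, 150.0, size=(1000, 1))
truth = pd.DataFrame(np.clip(signal + rng.normal(0, 30, size=(1000, 9)), 0, None),
                     columns=[f"station_{i}" for i in range(9)])

# Knock out 10% of one station's readings to mimic a failing sensor,
# then let the rest of the network impute it.
data = truth.copy()
mask = rng.random(len(data)) < 0.10
data.loc[mask, "station_0"] = np.nan

imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(data)
rmse = np.sqrt(np.mean((imputed[mask, 0]
                        - truth.loc[mask, "station_0"].to_numpy()) ** 2))
print(f"RMSE on the masked readings: {rmse:.2f}")
```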
Zhang, Haixia; Zhao, Junkang; Gu, Caijiao; Cui, Yan; Rong, Huiying; Meng, Fanlong; Wang, Tong
2015-05-01
A study of medical expenditure and its influencing factors among students enrolled in the Urban Resident Basic Medical Insurance (URBMI) scheme in Taiyuan indicated that non-response bias and selection bias coexist in the dependent variable of the survey data. Unlike previous studies that focused on only one missing-data mechanism, this study proposes a two-stage method that deals with the two mechanisms simultaneously, combining multiple imputation with a sample selection model. A total of 1190 questionnaires were returned by the students (or their parents) selected in child care settings, schools and universities in Taiyuan by stratified cluster random sampling in 2012. In the returned questionnaires, 2.52% of the dependent-variable values were not missing at random (NMAR) and 7.14% were missing at random (MAR). First, multiple imputation was conducted for the MAR values using the completed data; then a sample selection model was used to correct for NMAR in the multiple imputation, and a multi-factor analysis model was established. Based on 1000 resamples, the best scheme for filling the randomly missing values at this missing proportion was the predictive mean matching (PMM) method. With this optimal scheme, a two-stage survey was conducted. The influencing factors on annual medical expenditure among the students enrolled in URBMI in Taiyuan included population group, annual household gross income, affordability of medical insurance expenditure, chronic disease, seeking medical care in hospital, seeking medical care in a community health center or private clinic, hospitalization, hospitalization canceled for some reason, self-medication, and acceptable proportion of self-paid medical expenditure. The two-stage method combining multiple imputation with a sample selection model can effectively address non-response bias and selection bias in the dependent variable of survey data.
Data imputation analysis for Cosmic Rays time series
NASA Astrophysics Data System (ADS)
Fernandes, R. C.; Lucio, P. S.; Fernandez, J. H.
2017-05-01
The occurrence of missing data in Galactic Cosmic Rays (GCR) time series is inevitable, since data loss is due to mechanical and human failure, technical problems and the different periods of operation of GCR stations. The aim of this study was to perform multiple dataset imputation in order to reconstruct the observational dataset. The study used the monthly time series of the GCR stations Climax (CLMX) and Roma (ROME) from 1960 to 2004 to simulate scenarios of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% missing data relative to the observed ROME series, with 50 replicates; the CLMX station was used as a proxy for allocating these scenarios. Three different methods for monthly dataset imputation were selected: Amelia II, which runs a bootstrap-based expectation-maximization algorithm; MICE, which runs an algorithm based on Multivariate Imputation by Chained Equations; and MTSDI, an expectation-maximization-based method for imputation of missing values in multivariate normal time series. The synthetic time series were compared with the observed ROME series using several skill measures, such as RMSE, NRMSE, Agreement Index, R, R2, F-test and t-test. The results showed that for CLMX and ROME, the R2 and R statistics were equal to 0.98 and 0.96, respectively. Increasing the number of gaps degraded the quality of the imputed time series. Data imputation was most efficient with the MTSDI method, with negligible errors and the best skill coefficients. The results suggest a practical limit of about 60% missing data for imputation of monthly averages. It is noteworthy that the CLMX, ROME and KIEL stations have no missing data in the target period. This methodology allowed 43 time series to be reconstructed.
Aßmann, C
2016-06-01
Besides large field-work efforts, the provision of valid databases requires statistical and informational infrastructure to enable long-term access to longitudinal data sets on height, weight and related issues. To foster use of longitudinal data sets within the scientific community, the provision of valid databases has to address data-protection regulations. It is, therefore, of major importance to hinder identifiability of individuals from publicly available databases. One possible strategy to reach this goal is to provide the public with a synthetic database that allows pretesting of data-analysis strategies. Such synthetic databases can be established using multiple imputation tools. Given approval of the strategy, verification is based on the original data. Multiple imputation by chained equations is illustrated as a way to facilitate the provision of synthetic databases, as it captures a wide range of statistical interdependencies. Missing values, which typically occur in longitudinal databases because of item non-response, can also be addressed via multiple imputation when providing databases. The provision of synthetic databases using multiple imputation techniques is one possible strategy to ensure data protection, increase the visibility of longitudinal databases and enhance their analytical potential.
Doidge, James C
2018-02-01
Population-based cohort studies are invaluable to health research because of the breadth of data collection over time, and the representativeness of their samples. However, they are especially prone to missing data, which can compromise the validity of analyses when data are not missing at random. Having many waves of data collection presents opportunity for participants' responsiveness to be observed over time, which may be informative about missing data mechanisms and thus useful as an auxiliary variable. Modern approaches to handling missing data such as multiple imputation and maximum likelihood can be difficult to implement with the large numbers of auxiliary variables and large amounts of non-monotone missing data that occur in cohort studies. Inverse probability-weighting can be easier to implement but conventional wisdom has stated that it cannot be applied to non-monotone missing data. This paper describes two methods of applying inverse probability-weighting to non-monotone missing data, and explores the potential value of including measures of responsiveness in either inverse probability-weighting or multiple imputation. Simulation studies are used to compare methods and demonstrate that responsiveness in longitudinal studies can be used to mitigate bias induced by missing data, even when data are not missing at random.
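A minimal sketch of the inverse probability-weighting idea underlying the paper: model the probability that the outcome is observed from fully observed covariates (which, per the abstract, could include a responsiveness measure), then weight the complete cases by the inverse of that probability. Names are illustrative; the paper's non-monotone extensions are not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

def ipw_mean(df, outcome, predictors):
    """IPW estimate of the mean of `outcome`, assuming MAR given `predictors`."""
    obs = df[outcome].notna()
    X = sm.add_constant(df[predictors])
    fit = sm.Logit(obs.astype(int), X).fit(disp=0)  # response-probability model
    p = np.asarray(fit.predict(X))
    w = 1.0 / p[obs.to_numpy()]                     # weights for complete cases
    y = df.loc[obs, outcome].to_numpy()
    return float(np.average(y, weights=w))
```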
A Review of Methods for Missing Data.
ERIC Educational Resources Information Center
Pigott, Therese D.
2001-01-01
Reviews methods for handling missing data in a research study. Model-based methods, such as maximum likelihood using the EM algorithm and multiple imputation, hold more promise than ad hoc methods. Although model-based methods require more specialized computer programs and assumptions about the nature of missing data, these methods are appropriate…
Estimates of alcohol involvement in fatal crashes : new alcohol methodology
DOT National Transportation Integrated Search
2002-01-01
The National Highway Traffic Safety Administration (NHTSA) has adopted a new method to estimate missing blood alcohol concentration (BAC) test result data. This new method, multiple imputation, will be used by NHTSA's National Center for Statis...
Advancing US GHG Inventory by Incorporating Survey Data using Machine-Learning Techniques
NASA Astrophysics Data System (ADS)
Alsaker, C.; Ogle, S. M.; Breidt, J.
2017-12-01
Crop management data are used in the National Greenhouse Gas Inventory that is compiled annually and reported to the United Nations Framework Convention on Climate Change. Soil carbon stock changes and N2O emissions for US agricultural soils are estimated using the USDA National Resources Inventory (NRI). The NRI provides basic information on land use and cropping histories, but it does not provide much detail on other management practices. In contrast, the Conservation Effects Assessment Project (CEAP) survey collects detailed crop management data that could be used in the GHG Inventory. The CEAP data are collected every 10 years from a subset of the NRI survey locations. Therefore, imputation of the CEAP data is needed to represent the management practices across all NRI survey locations, both spatially and temporally. Predictive mean matching and artificial neural network methods have been applied to develop imputation models under a multiple imputation framework. Temporal imputation involves adjusting the imputation model using state-level USDA Agricultural Resource Management Survey data. Distributional and predictive accuracy is assessed for the imputed data, providing not only the management data needed for the inventory but also rigorous estimates of uncertainty.
Lapidus, Nathanael; Chevret, Sylvie; Resche-Rigon, Matthieu
2014-12-30
Agreement between two assays is usually based on the concordance correlation coefficient (CCC), estimated from the means, standard deviations, and correlation coefficient of these assays. However, such data will often suffer from left-censoring because of lower limits of detection of these assays. To handle such data, we propose to extend a multiple imputation approach by chained equations (MICE) developed in a close setting of one left-censored assay. The performance of this two-step approach is compared with that of a previously published maximum likelihood estimation through a simulation study. Results show close estimates of the CCC by both methods, although the coverage is improved by our MICE proposal. An application to cytomegalovirus quantification data is provided. Copyright © 2014 John Wiley & Sons, Ltd.
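Lin's CCC has the closed form 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2). In the two-step approach described above, it would be computed on each MICE-completed dataset and the M estimates then pooled, for example via Rubin's rules on a variance-stabilizing transform. A sketch of the coefficient itself:

```python
import numpy as np

def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two assays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    s = np.cov(x, y, ddof=1)   # 2x2 sample covariance matrix
    return 2.0 * s[0, 1] / (s[0, 0] + s[1, 1] + (x.mean() - y.mean()) ** 2)
```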
Considerations of multiple imputation approaches for handling missing data in clinical trials.
Quan, Hui; Qi, Li; Luo, Xiaodong; Darchy, Loic
2018-07-01
Missing data exist in all clinical trials and pose a serious threat to the interpretability of trial results. There is no universally applicable solution for all missing data problems. Methods for handling missing data depend on the circumstances, particularly the assumptions about the missing data mechanisms. In recent years, when the missing at random mechanism cannot be assumed, conservative approaches such as control-based and return-to-baseline multiple imputation have been applied to deal with missing data. In this paper, we focus on the variability used in the analysis of data under these approaches. As demonstrated by examples, the choice of variability can affect the conclusion of the analysis. Besides methods for continuous endpoints, we also discuss methods for binary and time-to-event endpoints, as well as considerations for non-inferiority assessment. Copyright © 2018. Published by Elsevier Inc.
Bozio, Catherine H; Flanders, W Dana; Finelli, Lyn; Bramley, Anna M; Reed, Carrie; Gandhi, Neel R; Vidal, Jorge E; Erdman, Dean; Levine, Min Z; Lindstrom, Stephen; Ampofo, Krow; Arnold, Sandra R; Self, Wesley H; Williams, Derek J; Grijalva, Carlos G; Anderson, Evan J; McCullers, Jonathan A; Edwards, Kathryn M; Pavia, Andrew T; Wunderink, Richard G; Jain, Seema
2018-04-01
Real-time polymerase chain reaction (PCR) on respiratory specimens and serology on paired blood specimens are used to determine the etiology of respiratory illnesses for research studies. However, convalescent serology is often not collected. We used multiple imputation to assign values for missing serology results to estimate virus-specific prevalence among pediatric and adult community-acquired pneumonia hospitalizations using data from an active population-based surveillance study. Presence of adenoviruses, human metapneumovirus, influenza viruses, parainfluenza virus types 1-3, and respiratory syncytial virus was defined by positive PCR on nasopharyngeal/oropharyngeal specimens or a 4-fold rise in paired serology. We performed multiple imputation by developing a multivariable regression model for each virus using data from patients with available serology results. We calculated absolute and relative differences in the proportion of each virus detected comparing the imputed to observed (nonimputed) results. Among 2222 children and 2259 adults, 98.8% and 99.5% had nasopharyngeal/oropharyngeal specimens and 43.2% and 37.5% had paired serum specimens, respectively. Imputed results increased viral etiology assignments by an absolute difference of 1.6%-4.4% and 0.8%-2.8% in children and adults, respectively; relative differences were 1.1-3.0 times higher. Multiple imputation can be used when serology results are missing, to refine virus-specific prevalence estimates, and these will likely increase estimates.
Missing data in FFQs: making assumptions about item non-response.
Lamb, Karen E; Olstad, Dana Lee; Nguyen, Cattram; Milte, Catherine; McNaughton, Sarah A
2017-04-01
FFQs are a popular method of capturing dietary information in epidemiological studies and may be used to derive dietary exposures such as nutrient intake or overall dietary patterns and diet quality. As FFQs can involve large numbers of questions, participants may fail to respond to all questions, leaving researchers to decide how to deal with missing data when deriving intake measures. The aim of the present commentary is to discuss the current practice for dealing with item non-response in FFQs and to propose a research agenda for reporting and handling missing data in FFQs. Single imputation techniques, such as zero imputation (assuming no consumption of the item) or mean imputation, are commonly used to deal with item non-response in FFQs. However, single imputation methods make strong assumptions about the missing data mechanism and do not reflect the uncertainty created by the missing data. This can lead to incorrect inference about associations between diet and health outcomes. Although the use of multiple imputation methods in epidemiology has increased, these have seldom been used in the field of nutritional epidemiology to address missing data in FFQs. We discuss methods for dealing with item non-response in FFQs, highlighting the assumptions made under each approach. Researchers analysing FFQs should ensure that missing data are handled appropriately and clearly report how missing data were treated in analyses. Simulation studies are required to enable systematic evaluation of the utility of various methods for handling item non-response in FFQs under different assumptions about the missing data mechanism.
Multiple imputation of rainfall missing data in the Iberian Mediterranean context
NASA Astrophysics Data System (ADS)
Miró, Juan Javier; Caselles, Vicente; Estrela, María José
2017-11-01
Given the increasing need for complete rainfall data networks, diverse methods have been proposed in recent years for filling gaps in observed precipitation series, progressively more advanced than traditional approaches. The present study validated 10 methods (6 linear, 2 non-linear and 2 hybrid) that allow multiple imputation, i.e., filling missing data of multiple incomplete series in a dense network of neighboring stations at the same time. These were applied to daily and monthly rainfall in two sectors of the Júcar River Basin Authority (eastern Iberian Peninsula), an area characterized by high spatial irregularity and difficult rainfall estimation. A classification of precipitation according to genetic origin was applied as a pre-processing step, and quantile-mapping adjustment as a post-processing technique. The results showed generally better performance for the non-linear and hybrid methods; notably, the non-linear PCA (NLPCA) method considerably outperformed the Self Organizing Maps (SOM) method among the non-linear approaches. Among the linear methods, the Regularized Expectation Maximization method (RegEM) was the best, but far behind NLPCA. Applying EOF filtering as post-processing of NLPCA (the hybrid approach) yielded the best results.
Multiple imputation of missing passenger boarding data in the national census of ferry operators
DOT National Transportation Integrated Search
2008-08-01
This report presents findings from the 2006 National Census of Ferry Operators (NCFO) augmented : with imputed values for passengers and passenger miles. Due to the imputation procedures used to calculate missing data, totals in Table 1 may not corre...
Xue, Xiaonan; Shore, Roy E; Ye, Xiangyang; Kim, Mimi Y
2004-10-01
Occupational exposures are often recorded as zero when the exposure is below the minimum detection level (BMDL). This can lead to an underestimation of the doses received by individuals and can lead to biased estimates of risk in occupational epidemiologic studies. The extent of the exposure underestimation is increased with the magnitude of the minimum detection level (MDL) and the frequency of monitoring. This paper uses multiple imputation methods to impute values for the missing doses due to BMDL. A Gibbs sampling algorithm is developed to implement the method, which is applied to two distinct scenarios: when dose information is available for each measurement (but BMDL is recorded as zero or some other arbitrary value), or when the dose information available represents the summation of a series of measurements (e.g., only yearly cumulative exposure is available but based on, say, weekly measurements). Then the average of the multiple imputed exposure realizations for each individual is used to obtain an unbiased estimate of the relative risk associated with exposure. Simulation studies are used to evaluate the performance of the estimators. As an illustration, the method is applied to a sample of historical occupational radiation exposure data from the Oak Ridge National Laboratory.
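A deliberately simplified sketch of the imputation step for the first scenario: fit a lognormal to the doses observed above the minimum detection level, then draw each below-MDL dose from that distribution truncated to (0, MDL) by inverse-CDF sampling. The paper's Gibbs sampler additionally propagates parameter uncertainty across imputations; that refinement is omitted here, and the lognormal choice is an assumption.

```python
import numpy as np
from scipy import stats

def impute_below_mdl(doses, mdl, n_imputations=5, seed=0):
    """Return M copies of `doses` with zero-recorded (below-MDL) values imputed."""
    rng = np.random.default_rng(seed)
    doses = np.asarray(doses, float)
    below = doses <= 0.0                                  # recorded as zero
    shape, _, scale = stats.lognorm.fit(doses[~below], floc=0)
    dist = stats.lognorm(shape, loc=0, scale=scale)
    p_mdl = dist.cdf(mdl)                                 # mass below the MDL
    datasets = []
    for _ in range(n_imputations):
        u = rng.uniform(0.0, p_mdl, size=int(below.sum()))
        imputed = doses.copy()
        imputed[below] = dist.ppf(u)                      # truncated draws in (0, mdl)
        datasets.append(imputed)
    return datasets
```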
Ascertainment bias from imputation methods evaluation in wheat.
Brandariz, Sofía P; González Reymúndez, Agustín; Lado, Bettina; Malosetti, Marcos; Garcia, Antonio Augusto Franco; Quincke, Martín; von Zitzewitz, Jarislav; Castro, Marina; Matus, Iván; Del Pozo, Alejandro; Castro, Ariel J; Gutiérrez, Lucía
2016-10-04
Whole-genome genotyping techniques like Genotyping-by-sequencing (GBS) are being used for genetic studies such as Genome-Wide Association (GWAS) and Genome-Wide Selection (GS), and different strategies for imputation have been developed for them. Nevertheless, imputation error may lead to poor performance (i.e. smaller power or a higher false positive rate) even in analyses such as GWAS, where complete data are not required because each marker is tested one at a time. The aim of this study was to compare the performance of GWAS analysis for Quantitative Trait Loci (QTL) of major and minor effect using different imputation methods when no reference panel is available, in a wheat GBS panel. In this study, we compared the power and false positive rate of dissecting quantitative traits for imputed and not-imputed marker score matrices in: (1) a complete molecular marker barley panel array, and (2) a GBS wheat panel with missing data. We found that there is an ascertainment bias in imputation method comparisons. Simulating over a complete matrix and creating missing data at random showed that imputation methods have poorer performance. Furthermore, we found that when QTL were simulated with imputed data, the imputation methods performed better than the not-imputed ones. On the other hand, when QTL were simulated with not-imputed data, the not-imputed method and one of the imputation methods performed better for dissecting quantitative traits. Moreover, larger differences between imputation methods were detected for QTL of major effect than for QTL of minor effect. We also compared the different marker score matrices for GWAS analysis in a real wheat phenotype dataset, and we found minimal differences, indicating that imputation did not improve GWAS performance when a reference panel was not available. Overall, poorer performance was found in GWAS analysis when an imputed marker score matrix was used and no reference panel was available in a wheat GBS panel.
Chan, Kelvin K W; Xie, Feng; Willan, Andrew R; Pullenayegum, Eleanor M
2017-04-01
Parameter uncertainty in value sets of multiattribute utility-based instruments (MAUIs) has received little attention previously. This false precision leads to underestimation of the uncertainty of the results of cost-effectiveness analyses. The aim of this study is to examine the use of multiple imputation as a method to account for this uncertainty in MAUI scoring algorithms. We fitted a Bayesian model with random effects for respondents and health states to the data from the original US EQ-5D-3L valuation study, thereby estimating the uncertainty in the EQ-5D-3L scoring algorithm. We applied these results to EQ-5D-3L data from the Commonwealth Fund (CWF) Survey for Sick Adults (n = 3958), comparing the standard error of the estimated mean utility in the CWF population using the predictive distribution from the Bayesian mixed-effect model (i.e., incorporating parameter uncertainty in the value set) with the standard error of the estimated mean utilities based on multiple imputation and with the standard error using the conventional approach of applying the MAUI (i.e., ignoring uncertainty in the value set). The mean utility in the CWF population based on the predictive distribution of the Bayesian model was 0.827 with a standard error (SE) of 0.011. When utilities were derived using the conventional approach, the estimated mean utility was 0.827 with an SE of 0.003, which is only 25% of the SE based on the full predictive distribution of the mixed-effect model. Using multiple imputation with 20 imputed sets, the mean utility was 0.828 with an SE of 0.011, which is similar to the SE based on the full predictive distribution. Ignoring uncertainty in the predicted health utilities derived from MAUIs could lead to substantial underestimation of the variance of mean utilities. Multiple imputation corrects for this underestimation so that the results of cost-effectiveness analyses using MAUIs can report the correct degree of uncertainty.
Habbous, Steven; Chu, Karen P.; Lau, Harold; Schorr, Melissa; Belayneh, Mathieos; Ha, Michael N.; Murray, Scott; O’Sullivan, Brian; Huang, Shao Hui; Snow, Stephanie; Parliament, Matthew; Hao, Desiree; Cheung, Winson Y.; Xu, Wei; Liu, Geoffrey
2017-01-01
BACKGROUND: The incidence of oropharyngeal cancer has risen over the past 2 decades. This rise has been attributed to human papillomavirus (HPV), but information on temporal trends in incidence of HPV-associated cancers across Canada is limited. METHODS: We collected social, clinical and demographic characteristics and p16 protein status (p16-positive or p16-negative, using this immunohistochemistry variable as a surrogate marker of HPV status) for 3643 patients with oropharyngeal cancer diagnosed between 2000 and 2012 at comprehensive cancer centres in British Columbia (6 centres), Edmonton, Calgary, Toronto and Halifax. We used receiver operating characteristic curves and multiple imputation to estimate the p16 status for missing values. We chose a best-imputation probability cut point on the basis of accuracy in samples with known p16 status and through an independent relation between p16 status and overall survival. We used logistic and Cox proportional hazard regression. RESULTS: We found no temporal changes in p16-positive status initially, but there was significant selection bias, with p16 testing significantly more likely to be performed in males, lifetime never-smokers, patients with tonsillar or base-of-tongue tumours and those with nodal involvement (p < 0.05 for each variable). We used the following variables associated with p16-positive status for multiple imputation: male sex, tonsillar or base-of-tongue tumours, smaller tumours, nodal involvement, less smoking and lower alcohol consumption (p < 0.05 for each variable). Using sensitivity analyses, we showed that different imputation probability cut points for p16-positive status each identified a rise from 2000 to 2012, with the best-probability cut point identifying an increase from 47.3% in 2000 to 73.7% in 2012 (p < 0.001). INTERPRETATION: Across multiple centres in Canada, there was a steady rise in the proportion of oropharyngeal cancers attributable to HPV from 2000 to 2012. PMID:28808115
Kudo, Daisuke; Hayakawa, Mineji; Ono, Kota; Yamakawa, Kazuma
2018-03-01
Anticoagulant therapy for patients with sepsis is not recommended in the latest Surviving Sepsis Campaign guidelines, and non-anticoagulant therapy is currently the global standard treatment approach. We aimed to elucidate the effect of non-anticoagulant therapy on patients with sepsis-induced disseminated intravascular coagulation (DIC), as evidence on this topic has remained inconclusive. Data from 3195 consecutive adult patients admitted to 42 intensive care units for the treatment of severe sepsis were retrospectively analyzed via propensity score analyses with and without multiple imputation. The primary outcome was in-hospital all-cause mortality. Among 1784 patients with sepsis-induced DIC, 745 (41.8%) were not treated with anticoagulants. The inverse probability of treatment-weighted analyses (with and without multiple imputation) and the quintile-stratified propensity score analysis (without multiple imputation) indicated a significant association between non-anticoagulant therapy and higher in-hospital all-cause mortality (odds ratio [95% confidence interval]: 1.59 [1.19-2.12], 1.32 [1.02-1.81], and 1.32 [1.03-1.69], respectively). However, the quintile-stratified propensity score analysis with multiple imputation and the propensity score matching analyses with and without multiple imputation did not show this association. Survival duration was not significantly different between patients in the propensity score-matched non-anticoagulant therapy group and those in the anticoagulant therapy group (Cox regression analysis with and without multiple imputation: hazard ratio [95% confidence interval]: 1.26 [1.00-1.60] and 1.22 [0.93-1.59], respectively). It remains controversial whether non-anticoagulant therapy is harmful, equivalent, or beneficial compared with anticoagulant therapy in the treatment of patients with sepsis-induced DIC. Copyright © 2018 Elsevier Ltd. All rights reserved.
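With multiple imputation, the usual recipe is to repeat the whole propensity score analysis within each completed dataset and pool the results afterwards. A hedged sketch of inverse probability of treatment weighting in R, where imputed_list, anticoag, death, x1 and x2 are all hypothetical names for completed data frames and their columns:

    # One IPTW estimate per imputed dataset; pool with Rubin's rules afterwards.
    iptw_logor <- function(d) {
      ps <- fitted(glm(anticoag ~ x1 + x2, family = binomial, data = d))
      w  <- ifelse(d$anticoag == 1, 1 / ps, 1 / (1 - ps))  # unstabilized weights
      # quasibinomial avoids warnings about non-integer weighted outcomes
      fit <- glm(death ~ anticoag, family = quasibinomial, data = d, weights = w)
      coef(summary(fit))["anticoag", 1:2]                  # log-OR and its SE
    }
    results <- t(sapply(imputed_list, iptw_logor))
    # Combine results[, 1] (estimates) and results[, 2]^2 (variances) via Rubin's rules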
Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective.
ERIC Educational Resources Information Center
Schafer, Joseph L.; Olsen, Maren K.
1998-01-01
The key ideas of multiple imputation for multivariate missing data problems are reviewed. Software programs available for this analysis are described, and their use is illustrated with data from the Adolescent Alcohol Prevention Trial (W. Hansen and J. Graham, 1991). (SLD)
Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi
2016-06-21
Genotype imputation is an important tool for the prediction of unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available and can either employ universal machine learning methods or deploy algorithms dedicated to inferring missing genotypes. In this research, eight machine learning methods were compared in terms of imputation accuracy, computation time, and the factors affecting imputation accuracy: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Extreme Learning Machine (ELM), Radial Basis Function (RBF), Random Forest, AdaBoost, LogitBoost, and TotalBoost. The methods were applied to real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The tests showed that imputation of parent-offspring trios can be accurate. Random Forest and SVM were more accurate than the other machine learning methods, while TotalBoost performed slightly worse than the others. Running times differed between methods: ELM was consistently the fastest algorithm, whereas RBF required long imputation times as the sample size increased. The methods tested in this research can be an alternative for imputation of un-typed SNPs when the rate of missing data is low; it is nevertheless recommended that further machine learning methods be evaluated for imputation. Copyright © 2016 Elsevier Ltd. All rights reserved.
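Of the compared learners, K-Nearest Neighbors is the simplest to write down. A minimal hand-rolled sketch in base R (an illustration of the general idea, not any of the implementations benchmarked here), assuming genotypes coded as 0/1/2 dosages with NA marking un-typed SNPs:

    # Naive KNN imputation for a genotype matrix G (rows = individuals, cols = SNPs)
    knn_impute <- function(G, k = 5) {
      for (i in which(rowSums(is.na(G)) > 0)) {
        shared <- !is.na(G[i, ])
        # Distance to every other individual over the SNPs observed for individual i
        d <- apply(G[-i, shared, drop = FALSE], 1,
                   function(r) sqrt(mean((r - G[i, shared])^2, na.rm = TRUE)))
        nn <- order(d)[seq_len(k)]                       # k nearest neighbours
        for (j in which(is.na(G[i, ]))) {
          # Average the neighbours' dosages and round back to 0/1/2
          G[i, j] <- round(mean(G[-i, ][nn, j], na.rm = TRUE))
        }
      }
      G
    }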
Limitations in Using Multiple Imputation to Harmonize Individual Participant Data for Meta-Analysis.
Siddique, Juned; de Chavez, Peter J; Howe, George; Cruden, Gracelyn; Brown, C Hendricks
2018-02-01
Individual participant data (IPD) meta-analysis is a meta-analysis in which the individual-level data for each study are obtained and used for synthesis. A common challenge in IPD meta-analysis is when variables of interest are measured differently in different studies. The term harmonization has been coined to describe the procedure of placing variables on the same scale in order to permit pooling of data from a large number of studies. Using data from an IPD meta-analysis of 19 adolescent depression trials, we describe a multiple imputation approach for harmonizing 10 depression measures across the 19 trials by treating those depression measures that were not used in a study as missing data. We then apply diagnostics to address the fit of our imputation model. Even after reducing the scale of our application, we were still unable to produce accurate imputations of the missing values. We describe those features of the data that made it difficult to harmonize the depression measures and provide some guidelines for using multiple imputation for harmonization in IPD meta-analysis.
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Cui, Jonathan J; Basques, Bryce A; Albert, Todd J; Grauer, Jonathan N
2018-04-09
The presence of missing data is a limitation of large datasets, including the National Surgical Quality Improvement Program (NSQIP). In addressing this issue, most studies use complete case analysis, which excludes cases with missing data, thus potentially introducing selection bias. Multiple imputation, a statistically rigorous approach that approximates missing data and preserves sample size, may be an improvement over complete case analysis. The present study aims to evaluate the impact of using multiple imputation, in comparison with complete case analysis, for assessing the associations between preoperative laboratory values and adverse outcomes following anterior cervical discectomy and fusion (ACDF) procedures. This is a retrospective review of prospectively collected data. Patients undergoing one-level ACDF were identified in NSQIP 2012-2015. Perioperative adverse outcome variables assessed included the occurrence of any adverse event, severe adverse events, and hospital readmission. Missing preoperative albumin and hematocrit values were handled using complete case analysis and multiple imputation. These preoperative laboratory levels were then tested for associations with 30-day postoperative outcomes using logistic regression. A total of 11,999 patients were included. Of this cohort, 63.5% of patients had missing preoperative albumin and 9.9% had missing preoperative hematocrit. When using complete case analysis, only 4,311 patients were studied; the removed patients were significantly younger, healthier, more often male, and differed in body mass index. Logistic regression analysis failed to identify either preoperative hypoalbuminemia or preoperative anemia as significantly associated with adverse outcomes. When employing multiple imputation, all 11,999 patients were included. Preoperative hypoalbuminemia was significantly associated with the occurrence of any adverse event and severe adverse events. Preoperative anemia was significantly associated with the occurrence of any adverse event, severe adverse events, and hospital readmission. Multiple imputation is a rigorous statistical procedure that is being increasingly used to address missing values in large datasets. Using this technique for ACDF avoided the loss of cases that may have affected the representativeness and power of the study and led to different results than complete case analysis. Multiple imputation should be considered for future spine studies. Copyright © 2018 Elsevier Inc. All rights reserved.
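Both analytic routes are easy to express with the R mice package; acdf and its column names below are hypothetical stand-ins for the NSQIP extract:

    library(mice)
    # Complete case analysis: glm() silently drops rows with missing albumin/hematocrit
    cc_fit <- glm(adverse_event ~ albumin + hematocrit + age + sex + bmi,
                  family = binomial, data = acdf)

    # Multiple imputation: impute, fit the model in each dataset, pool with Rubin's rules
    imp <- mice(acdf, m = 20, method = "pmm", seed = 2015)
    mi_fit <- with(imp, glm(adverse_event ~ albumin + hematocrit + age + sex + bmi,
                            family = binomial))
    summary(pool(mi_fit))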
Gottfredson, Nisha C; Sterba, Sonya K; Jackson, Kristina M
2017-01-01
Random coefficient-dependent (RCD) missingness is a non-ignorable mechanism through which missing data can arise in longitudinal designs. RCD, for which we cannot test, is a problematic form of missingness that occurs if subject-specific random effects correlate with propensity for missingness or dropout. Particularly when covariate missingness is a problem, investigators typically handle missing longitudinal data by using single-level multiple imputation procedures implemented with long-format data, which ignores within-person dependency entirely, or implemented with wide-format (i.e., multivariate) data, which ignores some aspects of within-person dependency. When either of these standard approaches to handling missing longitudinal data is used, RCD missingness leads to parameter bias and incorrect inference. We explain why multilevel multiple imputation (MMI) should alleviate bias induced by an RCD missing data mechanism under conditions that contribute to stronger determinacy of random coefficients. We evaluate our hypothesis with a simulation study. Three design factors are considered: intraclass correlation (ICC; ranging from .25 to .75), number of waves (ranging from 4 to 8), and percent of missing data (ranging from 20 to 50%). We find that MMI greatly outperforms the single-level wide-format (multivariate) method for imputation under an RCD mechanism. For the MMI analyses, bias was most alleviated when the ICC was high, when there were more waves of data, and when there was less missing data. Practical recommendations for handling longitudinal missing data are suggested.
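Multilevel multiple imputation can be requested in the R mice package through its predictor-matrix codes, which distinguish the cluster identifier from fixed and random effects. A hedged sketch with hypothetical long-format columns id, wave, y and x; method names and coding conventions differ somewhat across mice versions:

    library(mice)
    ini  <- mice(long, maxit = 0)        # dry run to harvest defaults
    meth <- ini$method
    meth["y"] <- "2l.norm"               # multilevel normal imputation for y
    pred <- ini$predictorMatrix
    pred["y", ] <- 0
    pred["y", "id"]   <- -2              # -2 marks the cluster (class) variable
    pred["y", "wave"] <- 2               #  2 = fixed and random effect
    pred["y", "x"]    <- 1               #  1 = fixed effect only
    imp <- mice(long, method = meth, predictorMatrix = pred, m = 20, seed = 1)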
Obtaining Predictions from Models Fit to Multiply Imputed Data
ERIC Educational Resources Information Center
Miles, Andrew
2016-01-01
Obtaining predictions from regression models fit to multiply imputed data can be challenging because treatments of multiple imputation seldom give clear guidance on how predictions can be calculated, and because available software often does not have built-in routines for performing the necessary calculations. This research note reviews how…
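Two workable routes are to pool the coefficients and then predict, or to predict within each imputed dataset and then average. A sketch using the mice package's bundled nhanes demo data rather than anything from this note:

    library(mice)
    imp  <- mice(nhanes, m = 5, printFlag = FALSE, seed = 7)
    fits <- with(imp, lm(chl ~ age + bmi))

    # Route 1: pool the coefficients, then predict for a new observation
    beta <- summary(pool(fits))$estimate          # intercept, age, bmi
    sum(beta * c(1, 2, 27))                       # age group 2, BMI 27

    # Route 2: predict in each imputed dataset, then average the predictions
    preds <- sapply(fits$analyses,
                    function(f) predict(f, newdata = data.frame(age = 2, bmi = 27)))
    mean(preds)

For a linear model the two routes agree; for non-linear models they generally do not, which is part of what makes this problem worth a research note.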
Genotype Imputation for Latinos Using the HapMap and 1000 Genomes Project Reference Panels.
Gao, Xiaoyi; Haritunians, Talin; Marjoram, Paul; McKean-Cowdin, Roberta; Torres, Mina; Taylor, Kent D; Rotter, Jerome I; Gauderman, William J; Varma, Rohit
2012-01-01
Genotype imputation is a vital tool in genome-wide association studies (GWAS) and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous, and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR + CEU + YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation based analysis in Latinos.
A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets.
Carrig, Madeline M; Manrique-Vallier, Daniel; Ranby, Krista W; Reiter, Jerome P; Hoyle, Rick H
2015-01-01
Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches.
Best practices for missing data management in counseling psychology.
Schlomer, Gabriel L; Bauman, Sheri; Card, Noel A
2010-01-01
This article urges counseling psychology researchers to recognize and report how missing data are handled, because consumers of research cannot accurately interpret findings without knowing the amount and pattern of missing data or the strategies that were used to handle those data. Patterns of missing data are reviewed, and some of the common strategies for dealing with them are described. The authors provide an illustration in which data were simulated and evaluate 3 methods of handling missing data: mean substitution, multiple imputation, and full information maximum likelihood. Results suggest that mean substitution is a poor method for handling missing data, whereas both multiple imputation and full information maximum likelihood are recommended alternatives to this approach. The authors suggest that researchers fully consider and report the amount and pattern of missing data and the strategy for handling those data in counseling psychology research and that editors advise researchers of this expectation.
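A compact simulation in the same spirit can be run with mice for multiple imputation and lavaan for full information maximum likelihood; the data-generating model and MAR rule below are illustrative assumptions, not the authors' simulation:

    library(mice)
    library(lavaan)
    set.seed(42)
    x <- rnorm(500)
    y <- 0.5 * x + rnorm(500)
    y[x > 0.5] <- NA                     # MAR: missingness in y driven by observed x
    d <- data.frame(x, y)

    # Mean substitution: flattens y over half the x range, attenuating the slope
    d_mean <- transform(d, y = ifelse(is.na(y), mean(y, na.rm = TRUE), y))
    coef(lm(y ~ x, data = d_mean))["x"]

    # Multiple imputation: approximately unbiased under MAR
    summary(pool(with(mice(d, m = 20, printFlag = FALSE), lm(y ~ x))))

    # FIML: also approximately unbiased under MAR
    coef(sem("y ~ x", data = d, missing = "ml"))["y~x"]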
A Study of Imputation Algorithms. Working Paper Series.
ERIC Educational Resources Information Center
Hu, Ming-xiu; Salvucci, Sameena
Many imputation techniques and imputation software packages have been developed over the years to deal with missing data. Different methods may work well under different circumstances, and it is advisable to conduct a sensitivity analysis when choosing an imputation method for a particular survey. This study reviewed about 30 imputation methods…
A Statistical Model for Misreported Binary Outcomes in Clustered RCTs of Education Interventions
ERIC Educational Resources Information Center
Schochet, Peter Z.
2013-01-01
In randomized control trials (RCTs) of educational interventions, there is a growing literature on impact estimation methods to adjust for missing student outcome data using such methods as multiple imputation, the construction of nonresponse weights, casewise deletion, and maximum likelihood methods (see, for example, Allison, 2002; Graham, 2009;…
Sehgal, Muhammad Shoaib B; Gondal, Iqbal; Dooley, Laurence S
2005-05-15
Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods. The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performs well not only for randomly introduced missing values but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation of missing values compared with the other methods for both types of data (time series and non-time series), at the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm. The CMVE software is available upon request from the authors.
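The evaluation protocol used here (mask entries whose values are known, impute them, then score the error) is easy to reproduce. A minimal sketch with a trivial column-mean baseline standing in for CMVE; note that NRMS conventions vary, with some dividing by the standard deviation of the true values and others by their range:

    # Score an imputation by normalized root mean square error on masked entries
    nrms <- function(truth, imputed) sqrt(mean((truth - imputed)^2)) / sd(truth)

    set.seed(3)
    X <- matrix(rnorm(1000), nrow = 100)            # stand-in expression matrix
    mask <- matrix(runif(1000) < 0.05, nrow = 100)  # hide 5% of the entries
    X_miss <- X
    X_miss[mask] <- NA

    # Column-mean imputation as a deliberately weak baseline method
    X_imp <- apply(X_miss, 2, function(g) { g[is.na(g)] <- mean(g, na.rm = TRUE); g })
    nrms(X[mask], X_imp[mask])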
Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz
2017-12-01
Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% of genes were randomly selected to contain missing values under two types of missingness mechanisms (ignorable and non-ignorable) with various missing rates. Next, 10 well-known imputation methods were applied to the incomplete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlap between significant genes detected from the original data and those detected from the imputed datasets. Additionally, the significant genes were tested for their enrichment in important pathways, using ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of the various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.
Missing data imputation: focusing on single imputation.
Zhang, Zhongheng
2016-01-01
Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias, and some useful information will be omitted from the analysis. Therefore, many imputation methods have been developed to fill this gap. The present article focuses on single imputation. Imputation with the mean, median or mode is simple but, like complete case analysis, can bias estimates of the mean and standard deviation. Furthermore, these approaches ignore relationships with other variables. Regression imputation can preserve the relationship between the imputed variable and other variables. Many more sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations.
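In the same spirit as the article's R focus, a short sketch of mean versus regression single imputation on R's built-in airquality data, showing why regression imputation preserves relationships that mean imputation flattens:

    d <- airquality                       # Ozone has missing values; Temp, Wind do not

    # Mean imputation: every missing Ozone gets the same value
    d$Ozone_mean <- ifelse(is.na(d$Ozone), mean(d$Ozone, na.rm = TRUE), d$Ozone)

    # Regression imputation: predict missing Ozone from Temp and Wind
    fit  <- lm(Ozone ~ Temp + Wind, data = d)       # fitted on the complete rows
    miss <- is.na(d$Ozone)
    d$Ozone_reg <- d$Ozone
    d$Ozone_reg[miss] <- predict(fit, newdata = d[miss, ])

    # Mean imputation shrinks the variance; regression imputation retains more of it
    c(mean = sd(d$Ozone_mean), regression = sd(d$Ozone_reg),
      observed = sd(d$Ozone, na.rm = TRUE))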
Attrition Bias Related to Missing Outcome Data: A Longitudinal Simulation Study.
Lewin, Antoine; Brondeel, Ruben; Benmarhnia, Tarik; Thomas, Frédérique; Chaix, Basile
2018-01-01
Most longitudinal studies do not address potential selection biases due to selective attrition. Using empirical data and simulating additional attrition, we investigated the effectiveness of common approaches for handling missing outcome data from attrition in the association between individual education level and change in body mass index (BMI). Using data from the two waves of the French RECORD Cohort Study (N = 7,172), we first examined how inverse probability weighting (IPW) and multiple imputation handled missing outcome data from attrition in the observed data (stage 1). Second, simulating additional missing data in BMI at follow-up under various missing-at-random scenarios, we quantified the impact of attrition and assessed how multiple imputation performed compared with complete case analysis and with a perfectly specified IPW model as a gold standard (stage 2). With the observed data in stage 1, we found an inverse association between individual education and change in BMI with complete case analysis, as well as with IPW and multiple imputation. When we simulated additional attrition under a missing-at-random pattern (stage 2), the bias increased with the magnitude of selective attrition, and multiple imputation was unable to address it. Our simulations revealed that selective attrition in the outcome heavily biased the association of interest. The present article contributes to raising awareness that, for missing outcome data, multiple imputation does no better than complete case analysis. More effort is thus needed during the design phase to understand attrition mechanisms by collecting information on the reasons for dropout.
Multiple imputation in the presence of non-normal data.
Lee, Katherine J; Carlin, John B
2017-02-20
Multiple imputation (MI) is becoming increasingly popular for handling missing data. Standard approaches for MI assume normality for continuous variables (conditionally on the other variables in the imputation model). However, it is unclear how to impute non-normally distributed continuous variables. Using simulation and a case study, we compared various transformations applied prior to imputation, including a novel non-parametric transformation, to imputation on the raw scale and using predictive mean matching (PMM) when imputing non-normal data. We generated data from a range of non-normal distributions, and set 50% to missing completely at random or missing at random. We then imputed missing values on the raw scale, following a zero-skewness log, Box-Cox or non-parametric transformation and using PMM with both type 1 and 2 matching. We compared inferences regarding the marginal mean of the incomplete variable and the association with a fully observed outcome. We also compared results from these approaches in the analysis of depression and anxiety symptoms in parents of very preterm compared with term-born infants. The results provide novel empirical evidence that the decision regarding how to impute a non-normal variable should be based on the nature of the relationship between the variables of interest. If the relationship is linear in the untransformed scale, transformation can introduce bias irrespective of the transformation used. However, if the relationship is non-linear, it may be important to transform the variable to accurately capture this relationship. A useful alternative is to impute the variable using PMM with type 1 matching. Copyright © 2016 John Wiley & Sons, Ltd.
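The contrast between raw-scale normal imputation and predictive mean matching is easy to demonstrate with mice; the skewed data-generating model below is an illustrative assumption, not the paper's simulation design:

    library(mice)
    set.seed(4)
    x <- rnorm(500)
    y <- exp(1 + 0.5 * x + rnorm(500))     # strictly positive, right-skewed
    y[x > 0 & runif(500) < 0.5] <- NA      # MAR: missingness depends on observed x
    d <- data.frame(x, y)

    imp_norm <- mice(d, method = "norm", m = 10, printFlag = FALSE)  # raw-scale normal
    imp_pmm  <- mice(d, method = "pmm",  m = 10, printFlag = FALSE)  # predictive mean matching

    # PMM borrows observed donor values, so imputations stay on the skewed support;
    # raw-scale normal imputation can produce impossible negative values of y
    range(complete(imp_norm)$y)
    range(complete(imp_pmm)$y)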
Imputation approaches for animal movement modeling
Scharf, Henry; Hooten, Mevin B.; Johnson, Devin S.
2017-01-01
The analysis of telemetry data is common in animal ecological studies. While the collection of telemetry data for individual animals has improved dramatically, the methods to properly account for inherent uncertainties (e.g., measurement error, dependence, barriers to movement) have lagged behind. Still, many new statistical approaches have been developed to infer unknown quantities affecting animal movement or predict movement based on telemetry data. Hierarchical statistical models are useful to account for some of the aforementioned uncertainties, as well as provide population-level inference, but they often come with an increased computational burden. For certain types of statistical models, it is straightforward to provide inference if the latent true animal trajectory is known, but challenging otherwise. In these cases, approaches related to multiple imputation have been employed to account for the uncertainty associated with our knowledge of the latent trajectory. Despite the increasing use of imputation approaches for modeling animal movement, the general sensitivity and accuracy of these methods have not been explored in detail. We provide an introduction to animal movement modeling and describe how imputation approaches may be helpful for certain types of models. We also assess the performance of imputation approaches in two simulation studies. Our simulation studies suggest that inference for model parameters directly related to the location of an individual may be more accurate than inference for parameters associated with higher-order processes such as velocity or acceleration. Finally, we apply these methods to analyze a telemetry data set involving northern fur seals (Callorhinus ursinus) in the Bering Sea. Supplementary materials accompanying this paper appear online.
yaImpute: An R package for kNN imputation
Nicholas L. Crookston; Andrew O. Finley
2008-01-01
This article introduces yaImpute, an R package for nearest neighbor search and imputation. Although nearest neighbor imputation is used in a host of disciplines, the methods implemented in the yaImpute package are tailored to imputation-based forest attribute estimation and mapping. The impetus for writing yaImpute was a growing interest in nearest neighbor...
Imputing data that are missing at high rates using a boosting algorithm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cauthen, Katherine Regina; Lambert, Gregory; Ray, Jaideep
Traditional multiple imputation approaches may perform poorly for datasets with high rates of missingness unless a large number of imputations, m, is used. This paper implements an alternative machine learning-based approach to imputing data that are missing at high rates. Here, we use boosting to create a strong learner from a weak learner fitted to a dataset missing many observations. This approach may be applied to a variety of types of learners (models). The approach is demonstrated by application to a spatiotemporal dataset for predicting dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal CAR model is boosted to produce imputations, and the overall RMSE from a k-fold cross-validation is used to assess imputation accuracy.
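A hedged sketch of the general recipe (fit a boosted learner on the observed rows, predict the missing ones, and gauge accuracy by cross-validation), using the xgboost R package as a generic stand-in for the boosted Bayesian spatiotemporal CAR model; the data frame d and its columns are hypothetical:

    library(xgboost)
    # d: outcome y missing at a high rate; covariates x1..x3 fully observed
    obs <- !is.na(d$y)
    X   <- as.matrix(d[, c("x1", "x2", "x3")])

    bst <- xgboost(data = X[obs, ], label = d$y[obs],
                   nrounds = 100, objective = "reg:squarederror", verbose = 0)
    d$y[!obs] <- predict(bst, X[!obs, ])   # boosted (single) imputation

    # Imputation accuracy: k-fold cross-validated RMSE on the observed rows
    cv <- xgb.cv(data = X[obs, ], label = d$y[obs], nrounds = 100, nfold = 5,
                 objective = "reg:squarederror", verbose = 0)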
NASA Astrophysics Data System (ADS)
Hasan, Haliza; Ahmad, Sanizah; Osman, Balkish Mohd; Sapri, Shamsiah; Othman, Nadirah
2017-08-01
In regression analysis, missing covariate data is a common problem. Many researchers use ad hoc methods to overcome this problem because of their ease of implementation. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation-maximization (EM) algorithm and Multiple Imputation (MI) are more promising when dealing with the difficulties caused by missing data. However, inappropriate methods of missing value imputation can lead to serious bias that severely affects the parameter estimates. The main objective of this study is to provide a better understanding of missing data concepts to assist researchers in selecting appropriate imputation methods. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated from an underlying multivariate normal distribution, and the dependent variable was generated as a combination of the explanatory variables. Missing values in a covariate were simulated under a missing at random (MAR) mechanism, with four levels of missingness (10%, 20%, 30% and 40%) imposed. The ML and MI techniques available within SAS software were investigated. A linear regression model was fitted, and the model performance measures, MSE and R-squared, were obtained. Results of the analysis showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE, when the percentage of missingness is less than 30%. Neither method was able to handle levels of missingness greater than 30%.
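The MAR mechanism such a simulation needs can be generated by letting the probability that a covariate is missing depend on another, fully observed covariate. A minimal sketch for the four missingness levels considered, where the intercept on the logit scale anchors the approximate target rate:

    set.seed(5)
    n  <- 1000
    x1 <- rnorm(n)                      # always observed
    x2 <- 0.6 * x1 + rnorm(n)           # will receive missing values
    y  <- 1 + x1 + x2 + rnorm(n)

    for (rate in c(0.10, 0.20, 0.30, 0.40)) {
      p <- plogis(qlogis(rate) + x1)    # higher x1, higher chance of missingness
      x2_mar <- ifelse(runif(n) < p, NA, x2)
      cat(sprintf("target %.0f%%: achieved %.1f%%\n", 100 * rate,
                  100 * mean(is.na(x2_mar))))
    }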
MacNeil Vroomen, Janet; Eekhout, Iris; Dijkgraaf, Marcel G; van Hout, Hein; de Rooij, Sophia E; Heymans, Martijn W; Bosmans, Judith E
2016-11-01
Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing cost-effectiveness data in a randomized controlled trial. Three incomplete data sets were generated from a complete reference data set with 17, 35 and 50 % missing data in effects and costs. The strategies evaluated included complete case analysis (CCA), multiple imputation with predictive mean matching (MI-PMM), MI-PMM on log-transformed costs (log MI-PMM), and a two-step MI. Mean cost and effect estimates, standard errors and incremental net benefits were compared with the results of the analyses on the complete reference data set. The CCA, MI-PMM, and the two-step MI strategy diverged from the results for the reference data set when the amount of missing data increased. In contrast, the estimates of the Log MI-PMM strategy remained stable irrespective of the amount of missing data. MI provided better estimates than CCA in all scenarios. With low amounts of missing data the MI strategies appeared equivalent but we recommend using the log MI-PMM with missing data greater than 35 %.
Mukaka, Mavuto; White, Sarah A; Terlouw, Dianne J; Mwapasa, Victor; Kalilani-Phiri, Linda; Faragher, E Brian
2016-07-22
Missing outcomes can seriously impair the ability to make correct inferences from randomized controlled trials (RCTs). Complete case (CC) analysis is commonly used, but it reduces sample size and is perceived to lead to reduced statistical efficiency of estimates while increasing the potential for bias. As multiple imputation (MI) methods preserve sample size, they are generally viewed as the preferred analytical approach. We examined this assumption, comparing the performance of CC and MI methods in determining risk difference (RD) estimates in the presence of missing binary outcomes. We conducted simulation studies of 5000 simulated data sets with 50 imputations each, for RCTs with one primary follow-up endpoint, across different underlying levels of RD (3-25%) and of missing outcomes (5-30%). For missing at random (MAR) or missing completely at random (MCAR) outcomes, CC method estimates generally remained unbiased and achieved precision similar to or better than MI methods, and high statistical coverage. Missing not at random (MNAR) scenarios yielded invalid inferences with both methods. Bias in effect size estimates was reduced in MI methods by always including group membership in the imputation model, even if group was unrelated to missingness. Surprisingly, under MAR and MCAR conditions in the assessed scenarios, MI offered no statistical advantage over CC methods. While MI must inherently accompany CC methods for intention-to-treat analyses, these findings endorse CC methods for per protocol risk difference analyses in these conditions. These findings provide an argument for the use of the CC approach to complement MI analyses, with the usual caveat that the validity of the mechanism for missingness be thoroughly discussed. More importantly, researchers should strive to collect as much data as possible.
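The recommendation to always include group membership in the imputation model is simple to honor in practice. A sketch of both analyses for the risk difference, with trial a hypothetical data frame holding a fully observed arm indicator and a partly missing binary outcome:

    library(mice)
    # Complete case risk difference
    cc <- subset(trial, !is.na(outcome))
    rd_cc <- mean(cc$outcome[cc$arm == 1]) - mean(cc$outcome[cc$arm == 0])

    # MI risk difference: logistic imputation with arm among the predictors
    trial$outcome <- factor(trial$outcome)
    imp <- mice(trial, method = "logreg", m = 50, seed = 9, printFlag = FALSE)
    rd_mi <- sapply(seq_len(imp$m), function(k) {
      d <- complete(imp, k)
      o <- as.numeric(as.character(d$outcome))
      mean(o[d$arm == 1]) - mean(o[d$arm == 0])
    })
    mean(rd_mi)    # pool the per-imputation RDs (variance via Rubin's rules)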
Handling Missing Data: Analysis of a Challenging Data Set Using Multiple Imputation
ERIC Educational Resources Information Center
Pampaka, Maria; Hutcheson, Graeme; Williams, Julian
2016-01-01
Missing data is endemic in much educational research. However, practices such as step-wise regression common in the educational research literature have been shown to be dangerous when significant data are missing, and multiple imputation (MI) is generally recommended by statisticians. In this paper, we provide a review of these advances and their…
Reporting the Use of Multiple Imputation for Missing Data in Higher Education Research
ERIC Educational Resources Information Center
Manly, Catherine A.; Wells, Ryan S.
2015-01-01
Higher education researchers using survey data often face decisions about handling missing data. Multiple imputation (MI) is considered by many statisticians to be the most appropriate technique for addressing missing data in many circumstances. In particular, it has been shown to be preferable to listwise deletion, which has historically been a…
ERIC Educational Resources Information Center
Aßmann, Christian; Würbach, Ariane; Goßmann, Solange; Geissler, Ferdinand; Bela, Anika
2017-01-01
Large-scale surveys typically exhibit data structures characterized by rich mutual dependencies between surveyed variables and individual-specific skip patterns. Despite high efforts in fieldwork and questionnaire design, missing values inevitably occur. One approach for handling missing values is to provide multiply imputed data sets, thus…
Two-pass imputation algorithm for missing value estimation in gene expression time series.
Tsiporkova, Elena; Boeva, Veselka
2007-10-01
Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides for a choice between three different initial rough imputation methods.
2012-01-01
Background Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation. Methods An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing and haplotype library imputation and segregation analysis. Results Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored. Conclusions The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations. PMID:22462519
Voillet, Valentin; Besse, Philippe; Liaubet, Laurence; San Cristobal, Magali; González, Ignacio
2016-10-03
In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose a multiple imputation (MI) approach in a multivariate framework. In this study, we focus on multiple factor analysis (MFA) as a tool to compare and integrate multiple layers of information. MI involves filling the missing rows with plausible values, resulting in M completed datasets. MFA is then applied to each completed dataset to produce M different configurations (the matrices of coordinates of individuals). Finally, the M configurations are combined to yield a single consensus solution. We assessed the performance of our method, named MI-MFA, on two real omics datasets. Incomplete artificial datasets with different patterns of missingness were created from these data. The MI-MFA results were compared with two other approaches, i.e., regularized iterative MFA (RI-MFA) and mean variable imputation (MVI-MFA). For each configuration resulting from these three strategies, the suitability of the solution was determined against the true MFA configuration obtained from the original data, and a comprehensive graphical comparison showing how the MI-, RI- or MVI-MFA configurations diverge from the true configuration was produced. Two approaches to visualizing and assessing the uncertainty due to missing values, confidence ellipses and convex hulls, were also described. We showed how the areas of the ellipses and convex hulls increased with the number of missing individuals. Free and easy-to-use code implementing the MI-MFA method in the R statistical environment is provided. We believe that MI-MFA provides a useful and attractive method for estimating the coordinates of individuals on the first MFA components despite missing rows. MI-MFA configurations were close to the true configuration even when many individuals were missing in several data tables. The method takes into account the uncertainty of MI-MFA configurations induced by the missing rows, thereby allowing the reliability of the results to be evaluated.
Baker, Jannah; White, Nicole; Mengersen, Kerrie
2014-11-20
Spatial analysis is increasingly important for identifying modifiable geographic risk factors for disease. However, spatial health data from surveys are often incomplete, ranging from missing data for only a few variables, to missing data for many variables. For spatial analyses of health outcomes, selection of an appropriate imputation method is critical in order to produce the most accurate inferences. We present a cross-validation approach to select between three imputation methods for health survey data with correlated lifestyle covariates, using as a case study, type II diabetes mellitus (DM II) risk across 71 Queensland Local Government Areas (LGAs). We compare the accuracy of mean imputation to imputation using multivariate normal and conditional autoregressive prior distributions. Choice of imputation method depends upon the application and is not necessarily the most complex method. Mean imputation was selected as the most accurate method in this application. Selecting an appropriate imputation method for health survey data, after accounting for spatial correlation and correlation between covariates, allows more complete analysis of geographic risk factors for disease with more confidence in the results to inform public policy decision-making.
Jia, Erik; Chen, Tianlu
2018-01-01
Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered as missing not at random (MNAR). Improper data processing procedures for missing values will cause adverse impacts on subsequent statistical analyses. However, few imputation methods have been developed and applied to the situation of MNAR in the field of metabolomics. Thus, a practical left-censored missing value imputation method is urgently needed. We developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with other three imputation methods on two real-world targeted metabolomics datasets and one simulation dataset using our imputation evaluation pipeline. The results show that GSimp outperforms other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for dealing with large scale metabolomics datasets. The R code for GSimp, evaluation pipeline, tutorial, real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp. PMID:29385130
Variable Selection in the Presence of Missing Data: Imputation-based Methods.
Zhao, Yize; Long, Qi
2017-01-01
Variable selection plays an essential role in regression analysis, as it identifies important variables that are associated with outcomes and is known to improve the predictive accuracy of the resulting models. Variable selection methods have been widely investigated for fully observed data. However, in the presence of missing data, methods for variable selection need to be carefully designed to account for missing data mechanisms and the statistical techniques used for handling missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, valid under the assumptions of missing at random (MAR) and missing completely at random (MCAR), largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines the variable selection results across all imputed datasets (see the sketch below). The second strategy applies existing variable selection methods to stacked imputed datasets. The third strategy combines resampling techniques such as the bootstrap with imputation. Despite recent advances, this area remains under-developed and offers fertile ground for further research.
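An illustrative version of the first strategy, using mice and the lasso from glmnet with a majority-vote combination rule; dat is a hypothetical all-numeric data frame with outcome y:

    library(mice)
    library(glmnet)
    imp <- mice(dat, m = 20, printFlag = FALSE, seed = 11)

    sel <- sapply(seq_len(imp$m), function(k) {
      d  <- complete(imp, k)
      X  <- as.matrix(d[, setdiff(names(d), "y")])
      cv <- cv.glmnet(X, d$y)                        # lasso with cross-validated lambda
      b  <- as.matrix(coef(cv, s = "lambda.1se"))[-1, 1]
      as.numeric(b != 0)                             # 1 = selected (intercept dropped)
    })
    rownames(sel) <- setdiff(names(dat), "y")
    rowMeans(sel) >= 0.5                             # keep variables selected in >= 50%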
Aloisio, Kathryn M.; Swanson, Sonja A.; Micali, Nadia; Field, Alison; Horton, Nicholas J.
2015-01-01
Clustered data arise in many settings, particularly within the social and biomedical sciences. As an example, multiple-source reports are commonly collected in child and adolescent psychiatric epidemiologic studies, where researchers use various informants (e.g. parent and adolescent) to provide a holistic view of a subject's symptomatology. Fitzmaurice et al. (1995) have described estimation of multiple-source models using a standard generalized estimating equation (GEE) framework. However, these studies often have missing data due to the additional stages of consent and assent required. The usual GEE is unbiased when missingness is Missing Completely at Random (MCAR) in the sense of Little and Rubin (2002). This is a strong assumption that may not be tenable. Other options, such as weighted generalized estimating equations (WGEEs), are computationally challenging when missingness is non-monotone. Multiple imputation is an attractive method to fit incomplete data models while only requiring the less restrictive Missing at Random (MAR) assumption. Previously, estimation of partially observed clustered data was computationally challenging; however, recent developments in Stata have facilitated its use in practice. We demonstrate how to utilize multiple imputation in conjunction with a GEE to investigate the prevalence of disordered eating symptoms in adolescents as reported by parents and adolescents, as well as factors associated with concordance and prevalence. The methods are motivated by the Avon Longitudinal Study of Parents and their Children (ALSPAC), a cohort study that enrolled more than 14,000 pregnant mothers in 1991-92 and has followed the health and development of their children at regular intervals. While point estimates were fairly similar to those from the GEE under MCAR, the MAR model had smaller standard errors, while requiring less stringent assumptions regarding missingness. PMID:25642154
Guo, Wei-Li; Huang, De-Shuang
2017-08-22
Transcription factors (TFs) are DNA-binding proteins that have a central role in regulating gene expression. Identification of DNA-binding sites of TFs is a key task in understanding transcriptional regulation, cellular processes and disease. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) enables genome-wide identification of in vivo TF binding sites. However, it is still difficult to map every TF in every cell line owing to cost and biological material availability, which poses an enormous obstacle for integrated analysis of gene regulation. To address this problem, we propose a novel computational approach, TFBSImpute, for predicting additional TF binding profiles by leveraging information from available ChIP-seq TF binding data. TFBSImpute fuses the dataset to a 3-mode tensor and imputes missing TF binding signals via simultaneous completion of multiple TF binding matrices with positional consistency. We show that signals predicted by our method achieve overall similarity with experimental data and that TFBSImpute significantly outperforms baseline approaches, by assessing the performance of imputation methods against observed ChIP-seq TF binding profiles. Besides, motif analysis shows that TFBSImpute performs better in capturing binding motifs enriched in observed data compared with baselines, indicating that the higher performance of TFBSImpute is not simply due to averaging related samples. We anticipate that our approach will constitute a useful complement to experimental mapping of TF binding, which is beneficial for further study of regulation mechanisms and disease.
Study Protocol, Sample Characteristics, and Loss to Follow-Up: The OPPERA Prospective Cohort Study
Bair, Eric; Brownstein, Naomi C.; Ohrbach, Richard; Greenspan, Joel D.; Dubner, Ron; Fillingim, Roger B.; Maixner, William; Smith, Shad; Diatchenko, Luda; Gonzalez, Yoly; Gordon, Sharon; Lim, Pei-Feng; Ribeiro-Dasilva, Margarete; Dampier, Dawn; Knott, Charles; Slade, Gary D.
2013-01-01
When studying incidence of pain conditions such as temporomandibular disorders (TMDs), repeated monitoring is needed in prospective cohort studies. However, monitoring methods usually have limitations and, over a period of years, some loss to follow-up is inevitable. The OPPERA prospective cohort study of first-onset TMD screened for symptoms using quarterly questionnaires and examined symptomatic participants to definitively ascertain TMD incidence. During the median 2.8-year observation period, 16% of the 3,263 enrollees completed no follow-up questionnaires, others provided incomplete follow-up, and examinations were not conducted for one third of symptomatic episodes. Although screening methods and examinations were found to have excellent reliability and validity, they were not perfect. Loss to follow-up varied according to some putative TMD risk factors, although multiple imputation to correct the problem suggested that bias was minimal. A second method of multiple imputation that evaluated bias associated with omitted and dubious examinations revealed a slight underestimate of incidence and some small biases in hazard ratios used to quantify effects of risk factors. Although “bottom line” statistical conclusions were not affected, multiply-imputed estimates should be considered when evaluating the large number of risk factors under investigation in the OPPERA study. Perspective These findings support the validity of the OPPERA prospective cohort study for the purpose of investigating the etiology of first-onset TMD, providing the foundation for other papers investigating risk factors hypothesized in the OPPERA project. PMID:24275220
Mallinckrodt, C H; Lin, Q; Molenberghs, M
2013-01-01
The objective of this research was to demonstrate a framework for drawing inference from sensitivity analyses of incomplete longitudinal clinical trial data via a re-analysis of data from a confirmatory clinical trial in depression. A likelihood-based approach that assumed missing at random (MAR) was the primary analysis. Robustness to departure from MAR was assessed by comparing the primary result to those from a series of analyses that employed varying missing not at random (MNAR) assumptions (selection models, pattern mixture models and shared parameter models) and to MAR methods that used inclusive models. The key sensitivity analysis used multiple imputation assuming that, after dropout, the trajectory of drug-treated patients was that of placebo-treated patients with a similar outcome history (placebo multiple imputation). This result was used as the worst reasonable case to define the lower limit of plausible values for the treatment contrast. The endpoint contrast from the primary analysis was -2.79 (p = .013). In placebo multiple imputation, the result was -2.17. Results from the other sensitivity analyses ranged from -2.21 to -3.87 and were symmetrically distributed around the primary result. Hence, no clear evidence of bias from missing not at random data was found. In the worst reasonable case scenario, the treatment effect was 80% of the magnitude of the primary result. Therefore, it was concluded that a treatment effect existed. The structured sensitivity framework, in which a worst reasonable case result based on a controlled imputation approach with transparent and debatable assumptions is supplemented by a series of plausible alternative models under varying assumptions, was useful in this specific situation and holds promise as a generally useful framework. Copyright © 2012 John Wiley & Sons, Ltd.
Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.
Chan, Ariel W; Hamblin, Martha T; Jannink, Jean-Luc
2016-01-01
Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.
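The masking-based accuracy metric described here can be reproduced in a few lines; a naive allele-frequency (column-mean) imputation stands in below for Beagle or glmnet:

    set.seed(6)
    G <- matrix(rbinom(5000, size = 2, prob = 0.3), nrow = 100)  # 0/1/2 dosages
    mask <- matrix(runif(5000) < 0.20, nrow = 100)               # hide 20% of genotypes
    G_miss <- G
    G_miss[mask] <- NA

    # Naive baseline: replace missing dosages with the per-SNP mean (2 * allele freq)
    G_imp <- apply(G_miss, 2, function(g) { g[is.na(g)] <- mean(g, na.rm = TRUE); g })

    # Accuracy: Pearson correlation between true and imputed dosages at masked entries
    cor(G[mask], G_imp[mask])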
ERIC Educational Resources Information Center
Acock, Alan C.
2005-01-01
Less than optimum strategies for missing values can produce biased estimates, distorted statistical power, and invalid conclusions. After reviewing traditional approaches (listwise, pairwise, and mean substitution), selected alternatives are covered including single imputation, multiple imputation, and full information maximum likelihood…
Risk-Stratified Imputation in Survival Analysis
Kennedy, Richard E.; Adragni, Kofi P.; Tiwari, Hemant K.; Voeks, Jenifer H.; Brott, Thomas G.; Howard, George
2013-01-01
Background: Censoring that is dependent on covariates associated with survival can arise in randomized trials due to changes in recruitment and eligibility criteria to minimize withdrawals, potentially leading to biased treatment effect estimates. Imputation approaches have been proposed to address censoring in survival analysis, and while these approaches may provide unbiased estimates of treatment effects, imputation of a large number of outcomes may over- or underestimate the associated variance based on the imputation pool selected. Purpose: We propose an improved method, risk-stratified imputation, as an alternative to address withdrawal related to the risk of events in the context of time-to-event analyses. Methods: Our algorithm performs imputation from a pool of replacement subjects with similar values of both treatment and covariate(s) of interest, that is, from a risk-stratified sample. This stratification prior to imputation addresses the requirement of time-to-event analysis that censored observations be representative of all other observations in the risk group with similar exposure variables. We compared our risk-stratified imputation to case deletion and bootstrap imputation in a simulated dataset in which the covariate of interest (study withdrawal) was related to treatment. A motivating example from a recent clinical trial is also presented to demonstrate the utility of our method. Results: In our simulations, risk-stratified imputation gives estimates of treatment effect comparable to bootstrap and auxiliary variable imputation while avoiding the inaccuracies of the latter two in estimating the associated variance. Similar results were obtained in analysis of clinical trial data. Limitations: Risk-stratified imputation has little advantage over other imputation methods when covariates of interest are not related to treatment, although its performance is superior when covariates are related to treatment. Risk-stratified imputation is intended for categorical covariates, and may be sensitive to the width of the matching window if continuous covariates are used. Conclusions: The use of risk-stratified imputation should facilitate the analysis of many clinical trials in which one group has a higher withdrawal rate that is related to treatment. PMID:23818434
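The core of the risk-stratified idea is simple to sketch. The following illustrative Python fragment (not the authors' algorithm, which adds repeated draws and variance estimation on top of this step) replaces each censored observation with one drawn from subjects in the same treatment arm and covariate stratum who remained under observation beyond the censoring time; `arm` and `stratum` are hypothetical integer-coded arrays.

```python
import numpy as np

def risk_stratified_impute(time, event, arm, stratum, seed=0):
    """Replace each censored observation (event == 0) with one drawn
    from the pool of subjects in the SAME arm and stratum whose observed
    time exceeds the censoring time -- the risk-stratified pool."""
    rng = np.random.default_rng(seed)
    time, event = time.astype(float).copy(), event.copy()
    orig_time, orig_event = time.copy(), event.copy()
    for i in np.where(orig_event == 0)[0]:
        pool = np.where((arm == arm[i]) & (stratum == stratum[i])
                        & (orig_time > orig_time[i]))[0]
        if pool.size:                       # no donor: leave censored
            j = rng.choice(pool)
            time[i], event[i] = orig_time[j], orig_event[j]
    return time, event
```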
Batistatou, Evridiki; McNamee, Roseanne
2012-12-10
It is known that measurement error leads to bias in assessing exposure effects, which can, however, be corrected if independent replicates are available. For expensive replicates, two-stage (2S) studies that produce data 'missing by design' may be preferred over a single-stage (1S) study, because in the second stage, measurement of replicates is restricted to a sample of first-stage subjects. Motivated by an occupational study on the acute effect of carbon black exposure on respiratory morbidity, we compare the performance of several bias-correction methods for both designs in a simulation study: an instrumental variable method (EVROS IV) based on grouping strategies, which has been recommended especially when measurement error is large, and the regression calibration and simulation extrapolation methods. For the 2S design, the problem of 'missing' data was either ignored or the 'missing' data were imputed using multiple imputation. In both 1S and 2S designs, in the case of small or moderate measurement error, regression calibration was shown to be the preferred approach in terms of root mean square error. For 2S designs, regression calibration as implemented in Stata is not recommended, in contrast to our implementation of this method, although the 'problematic' Stata implementation improved substantially with the use of multiple imputation. The EVROS IV method, under a good/fairly good grouping, outperforms the regression calibration approach in both design scenarios when exposure mismeasurement is severe. In both 1S and 2S designs with moderate or large measurement error, simulation extrapolation severely failed to correct for bias. Copyright © 2012 John Wiley & Sons, Ltd.
Impact of missing data imputation methods on gene expression clustering and classification.
de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G
2015-02-26
Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
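The "simple methods" found adequate above amount to a per-gene one-liner. A minimal sketch of row-wise mean (or median) imputation for an expression matrix with genes in rows and samples in columns:

```python
import numpy as np

def impute_per_gene(expr, stat=np.nanmean):
    """Replace missing entries with the per-gene (row) mean or median.
    Rows that are entirely missing remain NaN."""
    fill = stat(expr, axis=1)                 # one summary value per gene
    out = expr.copy()
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = fill[rows]
    return out

# median variant: impute_per_gene(expr, stat=np.nanmedian)
```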
Corso, Phaedra S.; Ingels, Justin B.; Kogan, Steven M.; Foster, E. Michael; Chen, Yi-Fu; Brody, Gene H.
2013-01-01
Programmatic cost analyses of preventive interventions commonly have a number of methodological difficulties. To determine the mean total costs and properly characterize variability, one often has to deal with small sample sizes, skewed distributions, and especially missing data. Standard approaches for dealing with missing data such as multiple imputation may suffer from a small sample size, a lack of appropriate covariates, or too few details around the method used to handle the missing data. In this study, we estimate total programmatic costs for a prevention trial evaluating the Strong African American Families-Teen program. This intervention focuses on the prevention of substance abuse and risky sexual behavior. To account for missing data in the assessment of programmatic costs we compare multiple imputation to probabilistic sensitivity analysis. The latter approach uses collected cost data to create a distribution around each input parameter. We found that with the multiple imputation approach, the mean (95% confidence interval) incremental difference was $2149 ($397, $3901). With the probabilistic sensitivity analysis approach, the incremental difference was $2583 ($778, $4346). Although the true cost of the program is unknown, probabilistic sensitivity analysis may be a more viable alternative for capturing variability in estimates of programmatic costs when dealing with missing data, particularly with small sample sizes and the lack of strong predictor variables. Further, the larger standard errors produced by the probabilistic sensitivity analysis method may signal its ability to capture more of the variability in the data, thus better informing policymakers on the potentially true cost of the intervention. PMID:23299559
Missing CD4+ cell response in randomized clinical trials of maraviroc and dolutegravir.
Cuffe, Robert; Barnett, Carly; Granier, Catherine; Machida, Mitsuaki; Wang, Cunshan; Roger, James
2015-10-01
Missing data can compromise inferences from clinical trials, yet the topic has received little attention in the clinical trial community. Shortcomings in methods commonly used to analyze studies with missing data (complete case analysis, last- or baseline-observation carried forward) have been highlighted in a recent Food and Drug Administration-sponsored report. This report recommends how to mitigate the issues associated with missing data. We present an example of the proposed concepts using data from recent clinical trials. CD4+ cell count data from the previously reported SINGLE and MOTIVATE studies of dolutegravir and maraviroc were analyzed using a variety of statistical methods to explore the impact of missing data. Four methodologies were used: complete case analysis, simple imputation, mixed models for repeated measures, and multiple imputation. We compared the sensitivity of conclusions to the volume of missing data and to the assumptions underpinning each method. Rates of missing data were greater in the MOTIVATE studies (35%-68% premature withdrawal) than in SINGLE (12%-20%). The sensitivity of results to assumptions about missing data was related to the volume of missing data. Estimates of treatment differences by various analysis methods ranged across a 61 cells/mm3 window in MOTIVATE and a 22 cells/mm3 window in SINGLE. Where missing data are anticipated, analyses require robust statistical and clinical debate of the necessary but unverifiable underlying statistical assumptions. Multiple imputation makes these assumptions transparent, can accommodate a broad range of scenarios, and is a natural analysis for clinical trials in HIV with missing data.
A Review On Missing Value Estimation Using Imputation Algorithm
NASA Astrophysics Data System (ADS)
Armina, Roslan; Zain, Azlan Mohd; Azizah Ali, Nor; Sallehuddin, Roselina
2017-09-01
The presence of missing values in a dataset has always been a major problem for precise prediction. Methods for imputing missing values need to minimize the effect of incomplete data on the prediction model. Many algorithms have been proposed to counter the missing value problem. In this review, we provide a comprehensive analysis of existing imputation algorithms, focusing on the techniques used and on the use of global or local information in the dataset for missing value estimation. In addition, validation methods for imputation results and ways to measure the performance of imputation algorithms are described. The objective of this review is to highlight possible improvements to existing methods, and it is hoped that it gives the reader a better understanding of trends in imputation methods.
ERIC Educational Resources Information Center
Wolgast, Anett; Schwinger, Malte; Hahnel, Carolin; Stiensmeier-Pelster, Joachim
2017-01-01
Introduction: Multiple imputation (MI) is one of the most highly recommended methods for replacing missing values in research data. The scope of this paper is to demonstrate missing data handling in SEM by analyzing two modified data examples from educational psychology, and to give practical recommendations for applied researchers. Method: We…
Strategies for Dealing with Missing Accelerometer Data.
Stephens, Samantha; Beyene, Joseph; Tremblay, Mark S; Faulkner, Guy; Pullnayegum, Eleanor; Feldman, Brian M
2018-05-01
Missing data is a universal research problem that can affect studies examining the relationship between physical activity measured with accelerometers and health outcomes. Statistical techniques are available to deal with missing data; however, available techniques have not been synthesized. A scoping review was conducted to summarize the advantages and disadvantages of identified methods of dealing with missing data from accelerometers. Missing data poses a threat to the validity and interpretation of trials using physical activity data from accelerometry. Imputation using multiple imputation techniques is recommended to deal with missing data and improve the validity and interpretation of studies using accelerometry. Copyright © 2018 Elsevier Inc. All rights reserved.
Gaussian mixture clustering and imputation of microarray data.
Ouyang, Ming; Welsh, William J; Georgopoulos, Panos
2004-04-12
In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
Browning, Brian L.; Browning, Sharon R.
2009-01-01
We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R2, and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package. PMID:19200528
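Allelic R2 can be estimated from posterior genotype probabilities alone. The sketch below uses the common dosage-based estimator (the ratio of the empirical variance of expected allele dosages to the binomial variance implied by the estimated allele frequency); the paper's exact estimator differs in detail, so treat this as an illustration of the idea rather than the BEAGLE formula.

```python
import numpy as np

def allelic_r2(post):
    """Estimate imputation quality at one marker from posterior genotype
    probabilities.  post is (n_samples, 3): P(genotype = 0, 1, 2 copies
    of the alternate allele).  Returns the ratio of the variance of the
    expected dosages to the binomial variance implied by the estimated
    allele frequency -- a dosage-based analogue of allelic R2."""
    dosage = post @ np.array([0.0, 1.0, 2.0])  # expected alt-allele count
    p = dosage.mean() / 2.0                    # estimated allele frequency
    denom = 2.0 * p * (1.0 - p)
    return np.nan if denom == 0 else dosage.var() / denom
```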
[Imputing missing data in public health: general concepts and application to dichotomous variables].
Hernández, Gilma; Moriña, David; Navarro, Albert
The presence of missing data in collected variables is common in health surveys, but the subsequent imputation thereof at the time of analysis is not. Working with imputed data may have certain benefits regarding the precision of the estimators and the unbiased identification of associations between variables. The imputation process is probably still little understood by many non-statisticians, who view this process as highly complex and with an uncertain goal. To clarify these questions, this note aims to provide a straightforward, non-exhaustive overview of the imputation process to enable public health researchers to ascertain its strengths. All this is done in the context of dichotomous variables, which are commonplace in public health. To illustrate these concepts, an example in which missing data are handled by means of simple and multiple imputation is introduced. Copyright © 2017 SESPAS. Published by Elsevier España, S.L.U. All rights reserved.
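To make the simple-versus-multiple contrast concrete for a dichotomous variable, here is an illustrative sketch (not taken from the paper) using a logistic imputation model; a proper multiple imputation would also propagate parameter uncertainty (e.g., by refitting on bootstrap samples), which is omitted to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def impute_binary(y, X, m=5, seed=0):
    """Simple vs multiple imputation of a binary outcome y with missing
    values (np.nan), given fully observed covariates X.

    Returns one deterministic completion and m stochastic completions
    drawn from a logistic imputation model."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    model = LogisticRegression().fit(X[obs], y[obs].astype(int))
    p_miss = model.predict_proba(X[~obs])[:, 1]

    simple = y.copy()
    simple[~obs] = (p_miss > 0.5).astype(float)   # single, deterministic fill

    multiples = []
    for _ in range(m):
        yi = y.copy()
        yi[~obs] = rng.binomial(1, p_miss).astype(float)  # stochastic draw
        multiples.append(yi)
    return simple, multiples
```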
Tian, Ting; McLachlan, Geoffrey J.; Dieters, Mark J.; Basford, Kaye E.
2015-01-01
It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways, multiple agglomerative hierarchical clustering, normal distribution model, normal regression model, and predictive mean match. The later three models used both Bayesian analysis and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis had slightly higher accuracy of estimation performance than those using non-Bayesian analysis but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the overall best performances. PMID:26689369
Integrative missing value estimation for microarray data.
Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine
2006-10-12
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference datasets into consideration. To determine whether the given reference datasets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Squares (LLS) imputation algorithm by up to 15% in our benchmark tests. We demonstrated that order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
Genotype Imputation with Millions of Reference Samples
Browning, Brian L.; Browning, Sharon R.
2016-01-01
We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle’s throughput was more than 100× greater than Impute2’s throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs. PMID:26748515
DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts.
Lee, Donghyung; Bigdeli, T Bernard; Williamson, Vernell S; Vladimirov, Vladimir I; Riley, Brien P; Fanous, Ayman H; Bacanu, Silviu-Alin
2015-10-01
To increase the signal resolution for large-scale meta-analyses of genome-wide association studies, genotypes at unmeasured single nucleotide polymorphisms (SNPs) are commonly imputed using large multi-ethnic reference panels. However, the ever-increasing size and ethnic diversity of both reference panels and cohorts makes genotype imputation computationally challenging for moderately sized computer clusters. Moreover, genotype imputation requires subject-level genetic data, which, unlike the summary statistics provided by virtually all studies, are not publicly available. While there are far less demanding methods that avoid the genotype imputation step by directly imputing SNP statistics, e.g. Directly Imputing summary STatistics (DIST) proposed by our group, their implicit assumptions make them applicable only to ethnically homogeneous cohorts. To decrease the computational and access requirements for the analysis of cosmopolitan cohorts, we propose DISTMIX, which extends DIST capabilities to the analysis of mixed ethnicity cohorts. The method uses a relevant reference panel to directly impute unmeasured SNP statistics based only on statistics at measured SNPs and estimated/user-specified ethnic proportions. Simulations show that the proposed method adequately controls the Type I error rates. The 1000 Genomes panel imputation of summary statistics from the ethnically diverse Psychiatric Genetic Consortium Schizophrenia Phase 2 suggests that, when compared to genotype imputation methods, DISTMIX offers comparable imputation accuracy for only a fraction of the computational resources. DISTMIX software, its reference population data, and usage examples are publicly available at http://code.google.com/p/distmix. dlee4@vcu.edu Supplementary Data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.
Liu, Yuzhe; Gopalakrishnan, Vanathi
2017-03-01
Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
Methods for Mediation Analysis with Missing Data
ERIC Educational Resources Information Center
Zhang, Zhiyong; Wang, Lijuan
2013-01-01
Despite wide applications of both mediation models and missing data techniques, formal discussion of mediation analysis with missing data is still rare. We introduce and compare four approaches to dealing with missing data in mediation analysis including list wise deletion, pairwise deletion, multiple imputation (MI), and a two-stage maximum…
ERIC Educational Resources Information Center
Finch, Holmes
2011-01-01
Methods of uniform differential item functioning (DIF) detection have been extensively studied in the complete data case. However, less work has been done examining the performance of these methods when missing item responses are present. Research that has been done in this regard appears to indicate that treating missing item responses as…
NASA Astrophysics Data System (ADS)
Kong, Jing
This thesis includes four pieces of work. In Chapter 1, we present a method for examining mortality as it is seen to run in families, together with lifestyle factors that are also seen to run in families, in a subpopulation of the Beaver Dam Eye Study that had died by 2011. We find significant distance correlations between death ages, lifestyle factors, and family relationships. Considering only sib pairs compared to unrelated persons, the distance correlation between siblings and mortality is, not surprisingly, stronger than that between more distantly related family members and mortality. Chapter 2 introduces a feature screening procedure based on distance correlation and covariance. We demonstrate a property of distance covariance, which is incorporated in a novel feature screening procedure that uses distance correlation as a stopping criterion. The approach is further applied to two real examples, namely the well-known small round blue cell tumors data and the Cancer Genome Atlas ovarian cancer data. Chapter 3 turns to right-censored human longevity data and the estimation of life expectancy. We propose a general framework of backward multiple imputation for estimating the conditional life expectancy function and the variance of the estimator in the right-censoring setting, and prove the properties of the estimator. In addition, we apply the method to the Beaver Dam Eye Study data to study human longevity, where expected human lifetime is modeled with smoothing spline ANOVA based on covariates including baseline age, gender, lifestyle factors, and disease variables. Chapter 4 compares two imputation methods for right-censored data, namely the well-known Buckley-James estimator and the backward imputation method proposed in Chapter 3, and shows that the backward imputation method is less biased and more robust under heterogeneity.
References for Haplotype Imputation in the Big Data Era
Li, Wenzhi; Xu, Wei; Li, Qiling; Ma, Li; Song, Qing
2016-01-01
Imputation is a powerful in silico approach for filling in missing values in big datasets. This process requires a reference panel, which is a collection of big data from which the missing information can be extracted and imputed. Haplotype imputation requires ethnicity-matched references; a mismatched reference panel will significantly reduce the quality of imputation. However, currently existing big datasets cover only a small number of ethnicities, and there is a lack of ethnicity-matched references for many ethnic populations in the world, which has hampered the imputation of haplotypes and its downstream applications. To solve this issue, several approaches have been proposed and explored, including the mixed reference panel, the internal reference panel and the genotype-converted reference panel. This review article provides information on and a comparison between these approaches. Increasing evidence has shown that gene activity and function are dictated not by just one or two genetic elements but by the cis-interactions of multiple elements. Cis-interactions require the interacting elements to be on the same chromosome molecule; therefore, haplotype analysis is essential for the investigation of cis-interactions among multiple genetic variants at different loci, and appears to be especially important for studying common diseases. It will be valuable in a wide spectrum of applications, from academic research to clinical diagnosis, prevention, treatment, and the pharmaceutical industry. PMID:27274952
LinkImpute: Fast and Accurate Genotype Imputation for Nonmodel Organisms
Money, Daniel; Gardner, Kyle; Migicovsky, Zoë; Schwaninger, Heidi; Zhong, Gan-Yuan; Myles, Sean
2015-01-01
Obtaining genome-wide genotype data from a set of individuals is the first step in many genomic studies, including genome-wide association and genomic selection. All genotyping methods suffer from some level of missing data, and genotype imputation can be used to fill in the missing data and improve the power of downstream analyses. Model organisms like human and cattle benefit from high-quality reference genomes and panels of reference genotypes that aid in imputation accuracy. In nonmodel organisms, however, genetic and physical maps often are either of poor quality or are completely absent, and there are no panels of reference genotypes available. There is therefore a need for imputation methods designed specifically for nonmodel organisms in which genomic resources are poorly developed and marker order is unreliable or unknown. Here we introduce LinkImpute, a software package based on a k-nearest neighbor genotype imputation method, LD-kNNi, which is designed for unordered markers. No physical or genetic maps are required, and it is designed to work on unphased genotype data from heterozygous species. It exploits the fact that markers useful for imputation often are not physically close to the missing genotype but rather distributed throughout the genome. Using genotyping-by-sequencing data from diverse and heterozygous accessions of apples, grapes, and maize, we compare LD-kNNi with several genotype imputation methods and show that LD-kNNi is fast, comparable in accuracy to the best-existing methods, and exhibits the least bias in allele frequency estimates. PMID:26377960
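The heart of LD-kNNi is easy to sketch: for each missing genotype, select the markers in highest LD with the target marker, compute distances between individuals on just those markers, and combine the k nearest neighbours' genotypes with inverse-distance weights. The fragment below is a simplified illustration of that idea, not the LinkImpute implementation, and the parameter values are arbitrary.

```python
import numpy as np

def ld_knn_impute_one(G, i, j, k=5, l=30):
    """Impute the genotype of individual i at marker j from a 0/1/2
    matrix G (np.nan = missing); a simplified sketch of LD-kNNi."""
    n, m = G.shape
    # 1. Score every other marker by squared correlation (LD) with j.
    r2 = np.zeros(m)
    for q in range(m):
        if q == j:
            continue
        ok = ~np.isnan(G[:, j]) & ~np.isnan(G[:, q])
        if ok.sum() > 2 and G[ok, j].std() > 0 and G[ok, q].std() > 0:
            r2[q] = np.corrcoef(G[ok, j], G[ok, q])[0, 1] ** 2
    ld_markers = np.argsort(r2)[-l:]           # top-l LD markers
    # 2. Distance between individuals on those markers only.
    dist = np.nanmean(np.abs(G[:, ld_markers] - G[i, ld_markers]), axis=1)
    dist[i] = np.inf                           # exclude self
    dist[np.isnan(G[:, j])] = np.inf           # donors must be typed at j
    # 3. Inverse-distance-weighted vote of the k nearest neighbours.
    nbrs = np.argsort(dist)[:k]
    w = 1.0 / (dist[nbrs] + 1e-9)
    if w.sum() == 0:                           # no usable donors
        return np.nan
    return float(np.round(np.sum(w * G[nbrs, j]) / np.sum(w)))
```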
A Comparison of Imputation Methods for Bayesian Factor Analysis Models
ERIC Educational Resources Information Center
Merkle, Edgar C.
2011-01-01
Imputation methods are popular for the handling of missing data in psychology. The methods generally consist of predicting missing data based on observed data, yielding a complete data set that is amiable to standard statistical analyses. In the context of Bayesian factor analysis, this article compares imputation under an unrestricted…
A comprehensive SNP and indel imputability database.
Duan, Qing; Liu, Eric Yi; Croteau-Chonka, Damien C; Mohlke, Karen L; Li, Yun
2013-02-15
Genotype imputation has become an indispensable step in genome-wide association studies (GWAS). Imputation accuracy, which directly influences downstream analysis, has been shown to improve when re-sequencing-based reference panels are used; however, this comes at the cost of a high computational burden due to the huge number of potentially imputable markers (tens of millions) discovered through sequencing a large number of individuals. Therefore, there is an increasing need for access to imputation quality information without actually conducting imputation. To facilitate this process, we have established a publicly available SNP and indel imputability database, aiming to provide direct access to imputation accuracy information for markers identified by the 1000 Genomes Project across four major populations and covering multiple GWAS genotyping platforms. SNP and indel imputability information can be retrieved through a user-friendly interface by providing the ID(s) of the desired variant(s) or by specifying the desired genomic region. The query results can be refined by selecting relevant GWAS genotyping platform(s). This is the first database providing variant imputability information specific to each continental group and to each genotyping platform. In Filipino individuals from the Cebu Longitudinal Health and Nutrition Survey, our database achieves an area under the receiver-operating characteristic curve of 0.97, 0.91, 0.88 and 0.79 for markers with minor allele frequency >5%, 3-5%, 1-3% and 0.5-1%, respectively. Specifically, by filtering out 48.6% of markers (corresponding to a reduction of up to 48.6% in computational costs for actual imputation) based on the imputability information in our database, we can remove 77%, 58%, 51% and 42% of the poorly imputed markers at the cost of only 0.3%, 0.8%, 1.5% and 4.6% of the well-imputed markers with minor allele frequency >5%, 3-5%, 1-3% and 0.5-1%, respectively. http://www.unc.edu/∼yunmli/imputability.html
Belger, Mark; Haro, Josep Maria; Reed, Catherine; Happich, Michael; Kahle-Wrobleski, Kristin; Argimon, Josep Maria; Bruno, Giuseppe; Dodel, Richard; Jones, Roy W; Vellas, Bruno; Wimo, Anders
2016-07-18
Missing data are a common problem in prospective studies with a long follow-up, and the volume, pattern and reasons for missing data may be relevant when estimating the cost of illness. We aimed to evaluate the effects of different methods for dealing with missing longitudinal cost data, and of different methods for costing caregiver time, on total societal costs in Alzheimer's disease (AD). GERAS is an 18-month observational study of costs associated with AD. Total societal costs included patient health and social care costs, and caregiver health and informal care costs. Missing data were classified as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). Simulation datasets were generated from baseline data with 10-40% missing total cost data for each missing data mechanism. Datasets were also simulated to reflect the missing cost data pattern at 18 months using MAR and MNAR assumptions. Naïve and multiple imputation (MI) methods were applied to each dataset and results compared with complete GERAS 18-month cost data. Opportunity and replacement cost approaches were used for caregiver time, which was costed with and without supervision included and with time for working caregivers only being costed. Total costs were available for 99.4% of 1497 patients at baseline. For MCAR datasets, naïve methods performed as well as MI methods. For MAR, MI methods performed better than naïve methods. All imputation approaches were poor for MNAR data. For all approaches, percentage bias increased with missing data volume. For datasets reflecting 18-month patterns, a combination of imputation methods provided more accurate cost estimates (e.g. bias: -1% vs -6% for the single MI method), although different approaches to costing caregiver time had a greater impact on estimated costs (29-43% increase over the base case estimate). The methods used to impute missing cost data in AD will affect the accuracy of cost estimates, although varying approaches to costing informal caregiver time have the greatest impact on total costs. Tailoring imputation methods to the reason for missing data will further our understanding of the best analytical approach for studies involving cost outcomes.
Capers, Patrice L.; Brown, Andrew W.; Dawson, John A.; Allison, David B.
2015-01-01
Background: Meta-research can involve manual retrieval and evaluation of research, which is resource intensive. Creation of high-throughput methods (e.g., search heuristics, crowdsourcing) has improved the feasibility of large meta-research questions, but possibly at the cost of accuracy. Objective: To evaluate the use of double sampling combined with multiple imputation (DS + MI) to address meta-research questions, using as an example adherence of PubMed entries to two simple Consolidated Standards of Reporting Trials guidelines for titles and abstracts. Methods: For the DS large sample, we retrieved all PubMed entries satisfying the filters: RCT, human, abstract available, and English language (n = 322,107). For the DS subsample, we randomly sampled 500 entries from the large sample. The large sample was evaluated with a lower rigor, higher throughput (RLOTHI) method using search heuristics, while the subsample was evaluated using a higher rigor, lower throughput (RHITLO) human rating method. Multiple imputation of the missing-completely-at-random RHITLO data for the large sample was informed by: RHITLO data from the subsample; RLOTHI data from the large sample; whether a study was an RCT; and country and year of publication. Results: The RHITLO and RLOTHI methods in the subsample largely agreed (phi coefficients: title = 1.00, abstract = 0.92). Compliance with abstract and title criteria has increased over time, with non-US countries improving more rapidly. DS + MI logistic regression estimates were more precise than subsample estimates (e.g., 95% CI for change in title and abstract compliance by year: subsample RHITLO 1.050–1.174 vs. DS + MI 1.082–1.151). As evidence of improved accuracy, DS + MI coefficient estimates were closer to RHITLO than the large sample RLOTHI. Conclusion: Our results support our hypothesis that DS + MI would result in improved precision and accuracy. This method is flexible and may provide a practical way to examine large corpora of literature. PMID:25988135
Multiple imputation for estimating the risk of developing dementia and its impact on survival.
Yu, Binbing; Saczynski, Jane S; Launer, Lenore
2010-10-01
Dementia, Alzheimer's disease in particular, is one of the major causes of disability and decreased quality of life among the elderly and a leading obstacle to successful aging. Given the profound impact on public health, much research has focused on the age-specific risk of developing dementia and the impact on survival. Early work has discussed various methods of estimating age-specific incidence of dementia, among which the illness-death model is popular for modeling disease progression. In this article we use multiple imputation to fit multi-state models for survival data with interval censoring and left truncation. This approach allows semi-Markov models in which survival after dementia depends on onset age. Such models can be used to estimate the cumulative risk of developing dementia in the presence of the competing risk of dementia-free death. Simulations are carried out to examine the performance of the proposed method. Data from the Honolulu Asia Aging Study are analyzed to estimate the age-specific and cumulative risks of dementia and to examine the effect of major risk factors on dementia onset and death.
Missing data imputation and haplotype phase inference for genome-wide association studies
Browning, Sharon R.
2009-01-01
Imputation of missing data and the use of haplotype-based association tests can improve the power of genome-wide association studies (GWAS). In this article, I review methods for haplotype inference and missing data imputation, and discuss their application to GWAS. I discuss common features of the best algorithms for haplotype phase inference and missing data imputation in large-scale data sets, as well as some important differences between classes of methods, and highlight the methods that provide the highest accuracy and fastest computational performance. PMID:18850115
Loong, Bronwyn; Zaslavsky, Alan M.; He, Yulei; Harrington, David P.
2013-01-01
Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents’ identities and sensitive attributes, by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by CanCORS, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the United States. We review inferential methods for partially synthetic data, and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data, and discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model, to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are suited best for preliminary data analyses purposes. These methods address the requirement to share data in clinical research without compromising confidentiality. PMID:23670983
Nixon, Richard M; Duffy, Stephen W; Fender, Guy R K
2003-09-24
The Anglia Menorrhagia Education Study (AMES) is a randomized controlled trial testing the effectiveness of an education package applied to general practices. Binary data are available from two sources: general practitioner-reported referrals to hospital, and referrals to hospital determined by independent audit of the general practices. The former may be regarded as a surrogate for the latter, which is regarded as the true endpoint. Data on the true endpoint are only available for a subset of the practices, but there are surrogate data for almost all of the audited practices and for most of the remaining practices. The aim of this paper was to estimate the treatment effect using data from every practice in the study. Where the true endpoint was not available, it was estimated by three approaches: a regression method, multiple imputation, and a full likelihood model. Including the surrogate data in the analysis yielded an estimate of the treatment effect that was more precise than an estimate gained from using the true endpoint data alone. The full likelihood method provides a new imputation tool at the disposal of trials with surrogate data.
Genotype imputation efficiency in Nelore Cattle
USDA-ARS's Scientific Manuscript database
Genotype imputation efficiency in Nelore cattle was evaluated in different scenarios of lower density (LD) chips, imputation methods and sets of animals to have their genotypes imputed. Twelve commercial and virtual custom LD chips with densities varying from 7K to 75K SNPs were tested. Customized L...
Smuk, M; Carpenter, J R; Morris, T P
2017-02-06
Within epidemiological and clinical research, missing data are a common issue and are often overlooked in publications. When the issue of missing observations is addressed, it is usually assumed that the missing data are 'missing at random' (MAR). This assumption should be checked for plausibility; however, it is untestable, so inferences should be assessed for robustness to departures from missing at random. We highlight the method of pattern mixture sensitivity analysis after multiple imputation, using colorectal cancer data as an example. We focus on the Dukes' stage variable, which has the highest proportion of missing observations. First, we find the probability of being in each Dukes' stage given the MAR-imputed dataset. We use these probabilities in a questionnaire to elicit prior beliefs from experts on what they believe the probabilities would be in the missing data. The questionnaire responses are then used in a Dirichlet draw to create a Bayesian 'missing not at random' (MNAR) prior with which to impute the missing observations. The model of interest is applied and inferences are compared to those from the MAR-imputed data. The inferences were largely insensitive to departure from MAR. Inferences under MNAR suggested a smaller association between Dukes' stage and death, though the association remained positive, with similarly low p-values. We conclude by discussing the positives and negatives of our method and highlight the importance of making people aware of the need to test the MAR assumption.
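The elicitation-plus-Dirichlet step can be written compactly: turn the elicited probabilities for the missing stages into Dirichlet parameters, draw one stage distribution per imputed dataset, and fill in the missing stages from that draw. The sketch below is an illustrative reconstruction under that reading, not the authors' code; `weight` is a hypothetical prior sample size controlling how strongly draws concentrate around the elicited beliefs.

```python
import numpy as np

def mnar_impute_stage(n_missing, elicited_probs, weight, m, seed=0):
    """Draw m MNAR completions of a categorical variable (e.g. Dukes'
    stage).  elicited_probs: expert beliefs about the distribution of
    the variable among the MISSING cases (must be positive and sum to
    1); weight: effective prior sample size.  Returns an
    (m, n_missing) array of imputed category codes."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(elicited_probs) * weight   # Dirichlet parameters
    draws = []
    for _ in range(m):
        probs = rng.dirichlet(alpha)              # one MNAR distribution
        draws.append(rng.choice(len(alpha), size=n_missing, p=probs))
    return np.array(draws)
```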
Use of partial least squares regression to impute SNP genotypes in Italian cattle breeds.
Dimauro, Corrado; Cellesi, Massimo; Gaspa, Giustino; Ajmone-Marsan, Paolo; Steri, Roberto; Marras, Gabriele; Macciotta, Nicolò P P
2013-06-05
The objective of the present study was to test the ability of the partial least squares regression technique to impute genotypes from low-density single nucleotide polymorphism (SNP) panels, i.e. 3K or 7K, to a high-density panel with 50K SNPs. No pedigree information was used. Data consisted of 2093 Holstein, 749 Brown Swiss and 479 Simmental bulls genotyped with the Illumina 50K Beadchip. First, a single-breed approach was applied using only data from Holstein animals. Then, to enlarge the training population, data from the three breeds were combined and a multi-breed analysis was performed. Accuracies of genotypes imputed using the partial least squares regression method were compared with those obtained using the Beagle software. The impact of genotype imputation on breeding value prediction was evaluated for milk yield, fat content and protein content. In the single-breed approach, the accuracy of imputation using partial least squares regression was around 90% and 94% for the 3K and 7K platforms, respectively; corresponding accuracies obtained with Beagle were around 85% and 90%. Moreover, the computing time required by the partial least squares regression method was on average around 10 times lower than that required by Beagle. Using the partial least squares regression method in the multi-breed analysis resulted in lower imputation accuracies than using single-breed data. The impact of SNP-genotype imputation on the accuracy of direct genomic breeding values was small. The correlation between estimates of genetic merit obtained by using imputed versus actual genotypes was around 0.96 for the 7K chip. Results of the present work suggest that the partial least squares regression imputation method could be useful for imputing SNP genotypes when pedigree information is not available.
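In outline, the approach treats the low-density SNPs as predictors and the SNPs to be imputed as a multivariate response in a PLS regression fitted on the reference animals. A minimal sketch with scikit-learn's PLSRegression (the authors used their own implementation; the number of latent components here is purely illustrative):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_impute(ref_low, ref_high, target_low, n_components=50):
    """Impute high-density genotypes for animals genotyped only on a
    low-density panel.  ref_low / ref_high: 0/1/2 dosage matrices for
    reference animals on both panels; target_low: target animals on the
    low-density panel.  n_components must not exceed the number of
    low-density markers."""
    pls = PLSRegression(n_components=n_components)
    pls.fit(ref_low, ref_high)            # multivariate response
    pred = pls.predict(target_low)        # continuous dosage predictions
    return np.clip(np.round(pred), 0, 2)  # back to 0/1/2 genotype calls
```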
Using Bayesian Imputation to Assess Racial and Ethnic Disparities in Pediatric Performance Measures.
Brown, David P; Knapp, Caprice; Baker, Kimberly; Kaufmann, Meggen
2016-06-01
To analyze health care disparities in pediatric quality of care measures and determine the impact of data imputation. Five HEDIS measures are calculated based on 2012 administrative data for 145,652 children in two public insurance programs in Florida. The Bayesian Improved Surname and Geocoding (BISG) imputation method is used to impute missing race and ethnicity data for 42 percent of the sample (61,954 children). Models are estimated with and without the imputed race and ethnicity data. Dropping individuals with missing race and ethnicity data biases quality of care measures for minorities downward relative to nonminority children for several measures. These results provide further support for the importance of appropriately accounting for missing race and ethnicity data through imputation methods. © Health Research and Educational Trust.
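BISG itself rests on a single application of Bayes' rule: combine the race/ethnicity distribution conditional on surname with the geographic distribution of each group. A schematic version (the probability inputs are placeholders for the census surname and geography tables used in practice):

```python
import numpy as np

def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """One step of Bayesian Improved Surname and Geocoding (schematic):
    p_race_given_surname : (k,) prior over k groups from surname tables
    p_geo_given_race     : (k,) probability of the person's block/tract
                           for each group, from census geography tables
    Returns the posterior P(race | surname, location)."""
    unnorm = p_race_given_surname * p_geo_given_race
    return unnorm / unnorm.sum()

# e.g. bisg_posterior(np.array([0.7, 0.2, 0.1]),
#                     np.array([0.1, 0.5, 0.4]))
```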
Can We Spin Straw Into Gold? An Evaluation of Immigrant Legal Status Imputation Approaches
Van Hook, Jennifer; Bachmeier, James D.; Coffman, Donna; Harel, Ofer
2014-01-01
Researchers have developed logical, demographic, and statistical strategies for imputing immigrants’ legal status, but these methods have never been empirically assessed. We used Monte Carlo simulations to test whether, and under what conditions, legal status imputation approaches yield unbiased estimates of the association of unauthorized status with health insurance coverage. We tested five methods under a range of missing data scenarios. Logical and demographic imputation methods yielded biased estimates across all missing data scenarios. Statistical imputation approaches yielded unbiased estimates only when unauthorized status was jointly observed with insurance coverage; when this condition was not met, these methods overestimated insurance coverage for unauthorized relative to legal immigrants. We next showed how bias can be reduced by incorporating prior information about unauthorized immigrants. Finally, we demonstrated the utility of the best-performing statistical method for increasing power. We used it to produce state/regional estimates of insurance coverage among unauthorized immigrants in the Current Population Survey, a data source that contains no direct measures of immigrants’ legal status. We conclude that commonly employed legal status imputation approaches are likely to produce biased estimates, but data and statistical methods exist that could substantially reduce these biases. PMID:25511332
Blue, Elizabeth Marchani; Sun, Lei; Tintle, Nathan L.; Wijsman, Ellen M.
2014-01-01
When analyzing family data, we dream of perfectly informative data, even whole genome sequences (WGS) for all family members. Reality intervenes, and we find next-generation sequence (NGS) data have error, and are often too expensive or impossible to collect on everyone. Genetic Analysis Workshop 18 groups “Quality Control” and “Dropping WGS through families using GWAS framework” focused on finding, correcting, and using errors within the available sequence and family data, developing methods to infer and analyze missing sequence data among relatives, and testing for linkage and association with simulated blood pressure. We found that single nucleotide polymorphisms, NGS, and imputed data are generally concordant, but that errors are particularly likely at rare variants, homozygous genotypes, within regions with repeated sequences or structural variants, and within sequence data imputed from unrelateds. Admixture complicated identification of cryptic relatedness, but information from Mendelian transmission improved error detection and provided an estimate of the de novo mutation rate. Both genotype and pedigree errors had an adverse effect on subsequent analyses. Computationally fast rules-based imputation was accurate, but could not cover as many loci or subjects as more computationally demanding probability-based methods. Incorporating population-level data into pedigree-based imputation methods improved results. Observed data outperformed imputed data in association testing, but imputed data were also useful. We discuss the strengths and weaknesses of existing methods, and suggest possible future directions. Topics include improving communication between those performing data collection and analysis, establishing thresholds for and improving imputation quality, and incorporating error into imputation and analytical models. PMID:25112184
Missing value imputation for gene expression data by tailored nearest neighbors.
Faisal, Shahla; Tutz, Gerhard
2017-04-25
High-dimensional data like gene expression and RNA-sequence data often contain missing values. The subsequent analysis and results based on these incomplete data can suffer strongly from the presence of these missing values. Several approaches to imputation of missing values in gene expression data have been developed, but the task is difficult due to the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of using nearest neighbors defined by a distance that includes all genes, the distance is computed only over genes that are apt to contribute to the accuracy of imputed values. The method aims at avoiding the curse of dimensionality, which typically occurs when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques such as mean imputation, KNNimpute and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values in high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
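A minimal sketch of one simple reading of the idea above, for a genes x samples matrix with NaN marking missing entries: neighbour genes are ranked by a distance computed on co-observed samples, and each missing entry is filled with a distance-weighted average of the nearest genes that observe that sample. This is an illustration, not the authors' algorithm; names and parameters are hypothetical.

```python
import numpy as np

def wnn_impute(expr, k=10):
    """expr: genes x samples matrix with NaN for missing entries."""
    out = expr.copy()
    for g in range(expr.shape[0]):
        miss = np.isnan(expr[g])
        if not miss.any():
            continue
        obs = ~miss
        # distance to every other gene, computed on co-observed samples only
        dists = []
        for j in range(expr.shape[0]):
            if j == g:
                continue
            both = obs & ~np.isnan(expr[j])
            if both.sum() < 3:
                continue
            d = np.sqrt(np.mean((expr[g, both] - expr[j, both]) ** 2))
            dists.append((d, j))
        dists.sort()
        for s in np.where(miss)[0]:
            vals, wts = [], []
            for d, j in dists:                 # nearest genes first
                if not np.isnan(expr[j, s]):   # neighbour must observe s
                    vals.append(expr[j, s])
                    wts.append(1.0 / (d + 1e-8))
                if len(vals) == k:
                    break
            if vals:
                out[g, s] = np.average(vals, weights=wts)
    return out
```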
Dealing with gene expression missing data.
Brás, L P; Menezes, J C
2006-05-01
A comparative evaluation of different methods is presented for estimating missing values in microarray data: weighted K-nearest neighbours imputation (KNNimpute), regression-based methods such as local least squares imputation (LLSimpute) and partial least squares imputation (PLSimpute), and Bayesian principal component analysis (BPCA). The influence on prediction accuracy of several factors, such as the methods' parameters, the type of data relationships used in the estimation process (i.e. row-wise, column-wise or both), missing rate and pattern, and type of experiment [time series (TS), non-time series (NTS) or mixed (MIX) experiments], is elucidated. Improvements based on the iterative use of data (iterative LLS and PLS imputation: ILLSimpute and IPLSimpute), the need to perform initial imputations (modified PLS and Helland PLS imputation: MPLSimpute and HPLSimpute) and the type of relationships employed (KNNarray, LLSarray, HPLSarray and alternating PLS, APLSimpute) are proposed. Overall, it is shown that data set properties (type of experiment, missing rate and pattern) affect the data similarity structure, therefore influencing the methods' performance. LLSimpute and ILLSimpute are preferable in the presence of data with a stronger similarity structure (TS and MIX experiments), whereas PLS-based methods (MPLSimpute, IPLSimpute and APLSimpute) are preferable when estimating NTS missing data.
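The regression-based family above can be illustrated with a small local least squares sketch with iterative refinement, in the spirit of LLSimpute/ILLSimpute; the published methods differ in details such as neighbour selection and stopping rules, and the matrix and parameters here are hypothetical.

```python
import numpy as np

def ills_impute(expr, k=15, n_iter=5):
    """expr: genes x samples matrix with NaN for missing entries."""
    miss = np.isnan(expr)
    # initialize missing entries with row means, then refine iteratively
    out = np.where(miss, np.nanmean(expr, axis=1, keepdims=True), expr)
    for _ in range(n_iter):
        cors = np.abs(np.corrcoef(out))        # gene-gene similarity
        for g in np.where(miss.any(axis=1))[0]:
            m = miss[g]
            sim = cors[g].copy()
            sim[g] = -np.inf
            nbrs = np.argsort(sim)[::-1][:k]   # k most similar genes
            # least squares regression of the target gene on its
            # neighbours, fitted on the target's observed samples
            A = out[np.ix_(nbrs, np.where(~m)[0])].T
            coef, *_ = np.linalg.lstsq(A, out[g, ~m], rcond=None)
            out[g, m] = out[np.ix_(nbrs, np.where(m)[0])].T @ coef
    return out
```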
A review of the handling of missing longitudinal outcome data in clinical trials
2014-01-01
The aim of this review was to establish the frequency with which trials take missingness into account, and to discover what methods trialists use for adjustment in randomised controlled trials with longitudinal measurements. Failing to address the problems that can arise from missing outcome data can result in misleading conclusions. Missing data should be addressed, at a minimum, through a sensitivity analysis of the complete case analysis results. One hundred publications of randomised controlled trials with longitudinal measurements were selected randomly from trial publications from the years 2005 to 2012. Information was extracted from these trials, including whether reasons for dropout were reported, what methods were used for handling the missing data, whether there was any explanation of the methods for missing data handling, and whether a statistician was involved in the analysis. The main focus of the review was on missing data post dropout rather than missing interim data. Of all the papers in the study, 9 (9%) had no missing data. More than half of the papers included in the study failed to make any attempt to explain the reasons for their choice of missing data handling method. Of the papers with clear missing data handling methods, 44 papers (50%) used adequate methods of missing data handling, whereas 30 (34%) of the papers used missing data methods which may not have been appropriate. In the remaining 17 papers (19%), it was difficult to assess the validity of the methods used. An imputation method was used in 18 papers (20%). Multiple imputation methods were introduced in 1987 and are an efficient way of accounting for missing data in general, and yet only 4 papers used these methods. Of the 18 papers which used imputation, only 7 displayed the results as a sensitivity analysis of the complete case analysis results, and 61% of the papers that used an imputation method explained the reasons for their chosen method. Just under a third of the papers made no reference to reasons for missing outcome data. There was little consistency in the reporting of missing data within longitudinal trials. PMID:24947664
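The reporting practice the review calls for, presenting an imputation analysis alongside the complete-case analysis as a sensitivity analysis, can be sketched with statsmodels' chained-equations implementation; the data frame and variable names below are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "outcome": rng.normal(size=200),
    "treatment": rng.binomial(1, 0.5, 200).astype(float),
    "baseline": rng.normal(size=200),
})
df.loc[rng.random(200) < 0.3, "outcome"] = np.nan    # induce missingness

# 1) complete-case analysis
cc = sm.OLS.from_formula("outcome ~ treatment + baseline",
                         data=df.dropna()).fit()

# 2) chained-equations multiple imputation, pooled with Rubin's rules
imp = mice.MICEData(df)
mi = mice.MICE("outcome ~ treatment + baseline", sm.OLS, imp).fit(10, 10)

print(cc.summary())   # report the imputation analysis alongside this one
print(mi.summary())
```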
Gaussian-based routines to impute categorical variables in health surveys.
Yucel, Recai M; He, Yulei; Zaslavsky, Alan M
2011-12-20
The multivariate normal (MVN) distribution is arguably the most popular parametric model used in imputation and is available in most software packages (e.g., SAS PROC MI, R package norm). When it is applied to categorical variables as an approximation, practitioners often either apply simple rounding techniques for ordinal variables or create a distinct 'missing' category and/or exclude the nominal variable from the imputation phase. All of these practices can potentially lead to biased and/or uninterpretable inferences. In this work, we develop a new rounding methodology, calibrated to preserve observed distributions, to multiply impute missing categorical covariates. The major attractiveness of this method is its flexibility to use any 'working' imputation software, particularly those based on MVN, allowing practitioners to obtain usable imputations with small biases. A simulation study demonstrates the clear advantage of the proposed method in rounding ordinal variables and, in some scenarios, its plausibility in imputing nominal variables. We illustrate our methods on the widely used National Survey of Children with Special Health Care Needs, where incomplete values on race posed a genuine threat to inferences pertaining to disparities. Copyright © 2011 John Wiley & Sons, Ltd.
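A minimal sketch of calibrated (distribution-preserving) rounding for a normally imputed binary variable, in the spirit of the method above but not the authors' exact calibration: the cutoff is chosen so that the imputed values reproduce the observed proportion of ones, rather than rounding naively at 0.5.

```python
import numpy as np

def calibrated_round(y_obs, y_imp):
    """y_obs: observed 0/1 values; y_imp: continuous MVN-based imputations."""
    p = np.mean(y_obs)                        # observed proportion of ones
    # choose the cutoff so the rounded imputations match that proportion
    cutoff = np.quantile(y_imp, 1.0 - p)
    return (y_imp >= cutoff).astype(int)

y_obs = np.random.binomial(1, 0.2, 500)
y_imp = np.random.normal(0.2, 0.4, 120)       # stand-in for MVN imputations
print(calibrated_round(y_obs, y_imp).mean())  # ~0.2 by construction
```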
Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry
2014-01-01
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914
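A random-forest-based chained-equations imputation in the spirit of the comparison above can be approximated with scikit-learn's IterativeImputer; this is not the authors' CALIBER implementation, and a proper multiple-imputation analysis would repeat the completion with different seeds and pool the results.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.2] = np.nan       # values missing at random

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_completed = imputer.fit_transform(X)
# for multiple imputation, repeat with different random_state values and
# pool the analysis results across the completed datasets
```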
Partitioning error components for accuracy-assessment of near-neighbor methods of imputation
Albert R. Stage; Nicholas L. Crookston
2007-01-01
Imputation is applied for two quite different purposes: to supply missing data to complete a data set for subsequent modeling analyses or to estimate subpopulation totals. Error properties of the imputed values have different effects in these two contexts. We partition errors of imputation derived from similar observation units as arising from three sources:...
Dore, David D.; Swaminathan, Shailender; Gutman, Roee; Trivedi, Amal N.; Mor, Vincent
2013-01-01
Objective: To compare the assumptions and estimands across three approaches to estimating the effect of erythropoietin-stimulating agents (ESAs) on mortality. Study Design and Setting: Using data from the Renal Management Information System, we conducted two analyses utilizing a change to bundled payment that we hypothesized mimicked random assignment to ESA (pre-post, difference-in-difference, and instrumental variable analyses). A third analysis was based on multiply imputing potential outcomes using propensity scores. Results: There were 311,087 recipients of ESAs and 13,095 non-recipients. In the pre-post comparison, we identified no clear relationship between bundled payment (measured by calendar time) and the incidence of death within six months (risk difference -1.5%; 95% CI -7.0% to 4.0%). In the instrumental variable analysis, the risk of mortality was similar among ESA recipients (risk difference -0.9%; 95% CI -2.1% to 0.3%). In the multiple imputation analysis, we observed a 4.2% (95% CI 3.4% to 4.9%) absolute reduction in mortality risk with use of ESAs, with estimates closer to the null for patients with baseline hematocrit >36%. Conclusion: Methods emanating from different disciplines often rely on different assumptions but can be informative about a similar causal contrast. The implications of these distinct approaches are discussed. PMID:23849152
Jakobsen, Janus Christian; Gluud, Christian; Wetterslev, Jørn; Winkel, Per
2017-12-06
Missing data may seriously compromise inferences from randomised clinical trials, especially if missing data are not handled appropriately. The potential bias due to missing data depends on the mechanism causing the data to be missing, and the analytical methods applied to amend the missingness. Therefore, the analysis of trial data with missing values requires careful planning and attention. The authors had several meetings and discussions considering optimal ways of handling missing data to minimise the bias potential. We also searched PubMed (key words: missing data; randomi*; statistical analysis) and reference lists of known studies for papers (theoretical papers; empirical studies; simulation studies; etc.) on how to deal with missing data when analysing randomised clinical trials. Handling missing data is an important, yet difficult and complex task when analysing results of randomised clinical trials. We consider how to optimise the handling of missing data during the planning stage of a randomised clinical trial and recommend analytical approaches which may prevent bias caused by unavoidable missing data. We consider the strengths and limitations of using best-worst and worst-best sensitivity analyses, multiple imputation, and full information maximum likelihood. We also present practical flowcharts on how to deal with missing data and an overview of the steps that always need to be considered during the analysis stage of a trial. We present a practical guide and flowcharts describing when and how multiple imputation should be used to handle missing data in randomised clinical trials.
Missing value imputation in DNA microarrays based on conjugate gradient method.
Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh
2012-02-01
Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm based on the conjugate gradient (CG) method is proposed to estimate missing values. The k-nearest neighbors of the missing entry are first selected based on the absolute values of their Pearson correlation coefficients. Then a subset of genes among the k-nearest neighbors is labeled as the best similar ones. The CG algorithm, with this subset as its input, is then used to estimate the missing values. Our proposed CG-based algorithm (CGimpute) is evaluated on different data sets. The results are compared with sequential local least squares (SLLSimpute), Bayesian principal component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute) and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The average normalized root mean square error (NRMSE) and relative NRMSE in different data sets with various missing rates shows that CGimpute outperforms the other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
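A minimal sketch of the core estimation step described above: neighbour genes are selected beforehand (e.g. by absolute Pearson correlation), and the least-squares system for the missing entry is solved with scipy's conjugate gradient routine. The published CGimpute algorithm has further refinements; names here are hypothetical.

```python
import numpy as np
from scipy.sparse.linalg import cg

def cg_estimate(target_obs, neigh_obs, neigh_at_miss):
    """target_obs: (n,) observed values of the gene with a missing entry;
    neigh_obs: (n, k) the same samples for the k selected neighbour genes;
    neigh_at_miss: (k,) neighbour values at the sample to be imputed."""
    # normal equations A w = b for the least-squares fit, solved by CG;
    # a tiny ridge keeps A positive definite if neighbours are collinear
    A = neigh_obs.T @ neigh_obs + 1e-8 * np.eye(neigh_obs.shape[1])
    b = neigh_obs.T @ target_obs
    w, info = cg(A, b, atol=1e-10)
    assert info == 0, "conjugate gradient did not converge"
    return float(neigh_at_miss @ w)
```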
Next-generation genotype imputation service and methods.
Das, Sayantan; Forer, Lukas; Schönherr, Sebastian; Sidore, Carlo; Locke, Adam E; Kwong, Alan; Vrieze, Scott I; Chew, Emily Y; Levy, Shawn; McGue, Matt; Schlessinger, David; Stambolian, Dwight; Loh, Po-Ru; Iacono, William G; Swaroop, Anand; Scott, Laura J; Cucca, Francesco; Kronenberg, Florian; Boehnke, Michael; Abecasis, Gonçalo R; Fuchsberger, Christian
2016-10-01
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
USDA-ARS's Scientific Manuscript database
Genotyping-by-sequencing allows for large-scale genetic analyses in plant species with no reference genome, creating the challenge of sound inference in the presence of uncertain genotypes. Here we report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundina...
USDA-ARS's Scientific Manuscript database
Genotyping by sequencing allows for large-scale genetic analyses in plant species with no reference genome, but sets the challenge of sound inference in presence of uncertain genotypes. We report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundinacea L., P...
VIGAN: Missing View Imputation with Generative Adversarial Networks.
Shang, Chao; Palmer, Aaron; Sun, Jiangwen; Chen, Ko-Shin; Lu, Jin; Bi, Jinbo
2017-01-01
In an era when big data are becoming the norm, there is less concern with the quantity of data and more with their quality and completeness. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. In particular, when certain samples miss an entire view of data, it creates the missing view problem. Classic multiple imputation or matrix completion methods are hardly effective here, because the specific view contains no information on which imputation for such samples could be based. The commonly used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. By optimizing the GAN and DAE jointly, our model enables the knowledge integration of domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparison against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further demonstrates the effectiveness and usability of this approach in the life sciences.
Combining item response theory with multiple imputation to equate health assessment questionnaires.
Gu, Chenyang; Gutman, Roee
2017-09-01
The assessment of patients' functional status across the continuum of care requires a common patient assessment tool. However, assessment tools that are used in various health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected if they choose home health care provided by home health agencies. All three instruments or questionnaires include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary between the different settings. We consider equating different health assessment questionnaires as a missing data problem, and propose a variant of predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared our proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered, and in most of the experimental conditions that were examined, the proposed approach provides valid inferences, and generally has better coverages, relatively smaller biases, and shorter interval estimates. The proposed method is further illustrated using a real data set. © 2016, The International Biometric Society.
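A minimal sketch of predictive mean matching, the building block of the proposed equating approach; in the paper the predicted means come from an IRT model fitted to the item responses, whereas this sketch uses a plain linear regression on hypothetical covariates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute(X_donor, y_donor, X_recipient, n_candidates=5, seed=0):
    """Impute the unmeasured instrument score for each recipient by
    borrowing an observed value from a donor with a similar predicted mean."""
    rng = np.random.default_rng(seed)
    model = LinearRegression().fit(X_donor, y_donor)
    mu_donor = model.predict(X_donor)
    mu_recip = model.predict(X_recipient)
    imputed = np.empty(len(mu_recip))
    for i, m in enumerate(mu_recip):
        # match on predicted mean, then draw an *observed* donor value,
        # keeping imputations on the actual response scale of the instrument
        nearest = np.argsort(np.abs(mu_donor - m))[:n_candidates]
        imputed[i] = y_donor[rng.choice(nearest)]
    return imputed
```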
A regressive methodology for estimating missing data in rainfall daily time series
NASA Astrophysics Data System (ADS)
Barca, E.; Passarella, G.
2009-04-01
The presence of gaps in environmental data time series represents a very common but extremely critical problem, since it can produce biased results (Rubin, 1976). Missing data plague almost all surveys. The problem is how to deal with missing data once it has been deemed impossible to recover the actual missing values. Apart from the amount of missing data, another issue which plays an important role in the choice of any recovery approach is the nature of the "missingness" mechanism. When missingness is conditioned on some other variable observed in the data set (Schafer, 1997), the mechanism is called MAR (Missing At Random). Otherwise, when the missingness mechanism depends on the actual value of the missing data, it is called NMAR (Not Missing At Random). The latter is the most difficult condition to model. In the last decade, interest arose in the estimation of missing data by regression (single imputation). More recently, multiple imputation has also become available, which returns a distribution of estimated values (Scheffer, 2002). In this paper an automatic methodology for estimating missing data is presented. In practice, given a gauging station affected by missing data (the target station), the methodology checks the randomness of the missing data and classifies the "similarity" between the target station and the other gauging stations spread over the study area. Among the different methods available for defining the degree of similarity, whose effectiveness strongly depends on the data distribution, the Spearman correlation coefficient was chosen. Once the similarity matrix is defined, a suitable nonparametric, univariate, regressive method is applied to estimate missing data at the target station: the Theil method (Theil, 1950). Even though the methodology proved rather reliable, the missing data estimation can be improved by generalization. A first possible improvement consists in extending the univariate technique to a multivariate approach. Another follows the paradigm of multiple imputation (Rubin, 1987; Rubin, 1988), which consists in using a set of "similar stations" instead of only the most similar one. This way, a sort of estimation range can be determined, allowing the introduction of uncertainty. Finally, time series can be grouped on the basis of monthly rainfall rates, defining classes of wetness (i.e. dry, moderately rainy and rainy), in order to carry out the estimation using homogeneous data subsets. We expect that integrating the methodology with these enhancements will further improve its reliability. The methodology was applied to the daily rainfall time series registered in the Candelaro River Basin (Apulia, South Italy) from 1970 to 2001. REFERENCES: Rubin, D.B., 1976. Inference and missing data. Biometrika 63, 581-592. Rubin, D.B., 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons. Rubin, D.B., 1988. An overview of multiple imputation. In: Proceedings of the Survey Research Methods Section, pp. 79-84. American Statistical Association. Schafer, J.L., 1997. Analysis of Incomplete Multivariate Data. Chapman & Hall. Scheffer, J., 2002. Dealing with missing data. Res. Lett. Inf. Math. Sci. 3, 153-160. Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Theil, H., 1950. A rank-invariant method of linear and polynomial regression analysis. Indagationes Mathematicae 12, 85-91.
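A minimal sketch of the two ingredients described above, with hypothetical arrays: candidate stations are ranked by Spearman correlation with the target station, and the gaps are then estimated from the most similar station with Theil's rank-invariant regression (scipy's Theil-Sen implementation).

```python
import numpy as np
from scipy.stats import spearmanr, theilslopes

def fill_gaps(target, candidates):
    """target: (t,) series with NaN gaps; candidates: (m, t) complete series
    from nearby gauging stations."""
    obs = ~np.isnan(target)
    rhos = [spearmanr(target[obs], c[obs]).correlation for c in candidates]
    best = candidates[int(np.argmax(rhos))]        # most similar station
    slope, intercept, *_ = theilslopes(target[obs], best[obs])
    filled = target.copy()
    filled[~obs] = intercept + slope * best[~obs]  # regressive estimate
    return filled
```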
NASA Astrophysics Data System (ADS)
Nishina, Kazuya; Ito, Akihiko; Hanasaki, Naota; Hayashi, Seiji
2017-02-01
Currently, the historical global N fertilizer maps available as input data to global biogeochemical models are still limited, and existing maps do not consider NH4+ and NO3- fractions in the fertilizer application rates. This paper provides a method for constructing a new historical global nitrogen fertilizer application map (0.5° × 0.5° resolution) for the period 1961-2010 based on country-specific information from Food and Agriculture Organization statistics (FAOSTAT) and various global datasets. This new map incorporates the fractions of NH4+ (and NO3-) in N fertilizer inputs by utilizing fertilizer species information in FAOSTAT, in which species can be categorized as NH4+- and/or NO3--forming N fertilizers. During data processing, we applied a statistical imputation method for the missing data (19 % of national N fertilizer consumption) in FAOSTAT. The multiple imputation method enabled us to fill gaps in the time-series data with plausible values using covariate information (year, population, GDP, and crop area). After the imputation, we downscaled the national consumption data to a gridded cropland map. We also applied the multiple imputation method to the available chemical fertilizer species consumption, allowing for the estimation of the NH4+ / NO3- ratio in national fertilizer consumption. In this study, the synthetic N fertilizer inputs in 2000 showed general consistency with the existing N fertilizer map (Potter et al., 2010) with respect to the ranges of N fertilizer inputs. Globally, the estimated N fertilizer inputs based on the sum of filled data increased from 15 to 110 Tg-N during 1961-2010. On the other hand, the global NO3- input started to decline after the late 1980s, and the fraction of NO3- in global N fertilizer decreased consistently from 35 to 13 % over the 50-year period. NH4+-forming fertilizers are dominant in most countries; however, the NH4+ / NO3- ratio in N fertilizer inputs shows clear differences temporally and geographically. This new map can be utilized as input data for global model studies and brings new insights for the assessment of historical terrestrial N cycling changes. Datasets are available at doi:10.1594/PANGAEA.861203.
Fast and accurate imputation of summary statistics enhances evidence of functional enrichment
Pasaniuc, Bogdan; Zaitlen, Noah; Shi, Huwenbo; Bhatia, Gaurav; Gusev, Alexander; Pickrell, Joseph; Hirschhorn, Joel; Strachan, David P.; Patterson, Nick; Price, Alkes L.
2014-01-01
Motivation: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. Results: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1–5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case–control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ2 association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses. Availability and implementation: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/. Contact: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu Supplementary information: Supplementary materials are available at Bioinformatics online. PMID:24990607
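The core of Gaussian imputation from summary statistics can be sketched in a few lines: unobserved z-scores are imputed by their conditional expectation given the observed z-scores under a multivariate normal with covariance equal to the reference-panel LD matrix. The ridge term below stands in for the adjustment for finite reference-panel size; the published method's exact regularization may differ.

```python
import numpy as np

def impute_zscores(z_obs, ld, obs_idx, miss_idx, ridge=0.1):
    """ld: SNP correlation matrix estimated from a reference panel;
    the ridge term regularizes for the panel's finite sample size."""
    S_oo = ld[np.ix_(obs_idx, obs_idx)] + ridge * np.eye(len(obs_idx))
    S_mo = ld[np.ix_(miss_idx, obs_idx)]
    solved = np.linalg.solve(S_oo, np.column_stack([z_obs, S_mo.T]))
    z_imp = S_mo @ solved[:, 0]                      # E[z_miss | z_obs]
    r2 = np.einsum('ij,ji->i', S_mo, solved[:, 1:])  # per-SNP quality metric
    return z_imp, r2
```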
Kontopantelis, Evangelos; Parisi, Rosa; Springate, David A; Reeves, David
2017-01-13
In modern health care systems, the computerization of all aspects of clinical care has led to the development of large data repositories. For example, in the UK, large primary care databases hold millions of electronic medical records, with detailed information on diagnoses, treatments, outcomes and consultations. Careful analyses of these observational datasets of routinely collected data can complement evidence from clinical trials or even answer research questions that cannot be addressed in an experimental setting. However, 'missingness' is a common problem for routinely collected data, especially for biological parameters over time. Absence of complete data for the whole of an individual's study period is a potential bias risk, and standard complete-case approaches may lead to biased estimates. However, the structure of the data values makes standard cross-sectional multiple-imputation approaches unsuitable. In this paper we propose and evaluate mibmi, a new command for cleaning and imputing longitudinal body mass index data. The regression-based data cleaning aspects of the algorithm can be useful when researchers analyze messy longitudinal data. Although the multiple imputation algorithm is computationally expensive, it performed similarly to, or even better than, existing alternatives when interpolating observations. The mibmi algorithm can be a useful tool for analyzing longitudinal body mass index data, or other longitudinal data with very low individual-level variability.
NASA Astrophysics Data System (ADS)
Poyatos, Rafael; Sus, Oliver; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
2018-05-01
The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2). We simulated gaps at different missingness levels (10-80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables allowed us to fill the gaps in the IEFC incomplete dataset (5495 plots) and quantify imputation uncertainty. Resulting spatial patterns of the studied traits in Catalan forests were broadly similar when using species means, regression kriging or the best-performing MICE application, but some important discrepancies were observed at the local level. Our results highlight the need to assess imputation quality beyond just imputation accuracy and show that including environmental information in statistical imputation approaches yields more plausible imputations in spatially explicit plant trait datasets.
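The evaluation design described above, masking known values at increasing missingness levels and scoring each imputation method, can be sketched as a small harness; `impute` is any function mapping an incomplete matrix to a completed one, and all names are hypothetical.

```python
import numpy as np

def nrmse(truth, imputed, mask):
    err = truth[mask] - imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(truth[mask])

def evaluate(traits, impute, levels=(0.1, 0.3, 0.5, 0.8), seed=0):
    rng = np.random.default_rng(seed)
    scores = {}
    for p in levels:
        mask = rng.random(traits.shape) < p     # simulate gaps at level p
        incomplete = traits.copy()
        incomplete[mask] = np.nan
        scores[p] = nrmse(traits, impute(incomplete), mask)
    return scores

# e.g. a trait-mean baseline:
# evaluate(traits, lambda M: np.where(np.isnan(M), np.nanmean(M, axis=0), M))
```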
Jolani, Shahab
2018-03-01
In health and medical sciences, multiple imputation (MI) is now becoming popular to obtain valid inferences in the presence of missing data. However, MI of clustered data such as multicenter studies and individual participant data meta-analysis requires advanced imputation routines that preserve the hierarchical structure of data. In clustered data, a specific challenge is the presence of systematically missing data, when a variable is completely missing in some clusters, and sporadically missing data, when it is partly missing in some clusters. Unfortunately, little is known about how to perform MI when both types of missing data occur simultaneously. We develop a new class of hierarchical imputation approach based on chained equations methodology that simultaneously imputes systematically and sporadically missing data while allowing for arbitrary patterns of missingness among them. Here, we use a random effect imputation model and adopt a simplification over fully Bayesian techniques such as Gibbs sampler to directly obtain draws of parameters within each step of the chained equations. We justify through theoretical arguments and extensive simulation studies that the proposed imputation methodology has good statistical properties in terms of bias and coverage rates of parameter estimates. An illustration is given in a case study with eight individual participant datasets. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Loong, Bronwyn; Zaslavsky, Alan M; He, Yulei; Harrington, David P
2013-10-30
Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents' identities and sensitive attributes by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by the Cancer Care Outcomes Research and Surveillance (CanCORS) project, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the USA. We review inferential methods for partially synthetic data and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data and discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model, to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are suited best for preliminary data analyses purposes. These methods address the requirement to share data in clinical research without compromising confidentiality. Copyright © 2013 John Wiley & Sons, Ltd.
ERIC Educational Resources Information Center
Robitzsch, Alexander; Rupp, Andre A.
2009-01-01
This article describes the results of a simulation study to investigate the impact of missing data on the detection of differential item functioning (DIF). Specifically, it investigates how four methods for dealing with missing data (listwise deletion, zero imputation, two-way imputation, response function imputation) interact with two methods of…
Time Series Imputation via L1 Norm-Based Singular Spectrum Analysis
NASA Astrophysics Data System (ADS)
Kalantari, Mahdi; Yarmohammadi, Masoud; Hassani, Hossein; Silva, Emmanuel Sirimal
Missing values in time series data are a well-known and important problem which many researchers have studied extensively in various fields. In this paper, a new nonparametric approach for missing value imputation in time series is proposed. The main novelty of this research is the application of the L1 norm-based version of Singular Spectrum Analysis (SSA), namely L1-SSA, which is robust against outliers. The performance of the new imputation method has been compared with many other established methods. The comparison is done by applying them to various real and simulated time series. The obtained results confirm that the SSA-based methods, especially L1-SSA, can provide better imputation in comparison to other methods.
MaCH-Admix: Genotype Imputation for Admixed Populations
Liu, Eric Yi; Li, Mingyao; Wang, Wei; Li, Yun
2012-01-01
Imputation in admixed populations is an important problem but challenging due to the complex linkage disequilibrium (LD) pattern. The emergence of large reference panels such as that from the 1,000 Genomes Project enables more accurate imputation in general, and in particular for admixed populations and for uncommon variants. To efficiently benefit from these large reference panels, one key issue to consider in modern genotype imputation framework is the selection of effective reference panels. In this work, we consider a number of methods for effective reference panel construction inside a hidden Markov model and specific to each target individual. These methods fall into two categories: identity-by-state (IBS) based and ancestry-weighted approach. We evaluated the performance on individuals from recently admixed populations. Our target samples include 8,421 African Americans and 3,587 Hispanic Americans from the Women’s Health Initiative, which allow assessment of imputation quality for uncommon variants. Our experiments include both large and small reference panels; large, medium, and small target samples; and in genome regions of varying levels of LD. We also include BEAGLE and IMPUTE2 for comparison. Experiment results with large reference panel suggest that our novel piecewise IBS method yields consistently higher imputation quality than other methods/software. The advantage is particularly noteworthy among uncommon variants where we observe up to 5.1% information gain with the difference being highly significant (Wilcoxon signed rank test P-value < 0.0001). Our work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison. PMID:23074066
Guo, Ying; Little, Roderick J; McConnell, Daniel S
2012-01-01
Covariate measurement error is common in epidemiologic studies. Current methods for correcting measurement error with information from external calibration samples are insufficient to provide valid adjusted inferences. We consider the problem of estimating the regression of an outcome Y on covariates X and Z, where Y and Z are observed, X is unobserved, but a variable W that measures X with error is observed. Information about measurement error is provided in an external calibration sample where data on X and W (but not Y and Z) are recorded. We describe a method that uses summary statistics from the calibration sample to create multiple imputations of the missing values of X in the regression sample, so that the regression coefficients of Y on X and Z and associated standard errors can be estimated using simple multiple imputation combining rules, yielding valid statistical inferences under the assumption of a multivariate normal distribution. The proposed method is shown by simulation to provide better inferences than existing methods, namely the naive method, classical calibration, and regression calibration, particularly for correction for bias and achieving nominal confidence levels. We also illustrate our method with an example using linear regression to examine the relation between serum reproductive hormone concentrations and bone mineral density loss in midlife women in the Michigan Bone Health and Metabolism Study. Existing methods fail to adjust appropriately for bias due to measurement error in the regression setting, particularly when measurement error is substantial. The proposed method corrects this deficiency.
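A deliberately simplified sketch of the mechanics above: the calibration sample's X-on-W regression is used to create multiple imputations of the unobserved X in the main sample, and the analysis estimates are pooled with Rubin's rules. The published method is more careful, propagating imputation-model parameter uncertainty and exploiting the multivariate normal assumption; this sketch only draws residual noise, and all names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def mi_external_calibration(W_cal, X_cal, W_main, Z_main, Y_main, m=20):
    rng = np.random.default_rng(0)
    cal = sm.OLS(X_cal, sm.add_constant(W_cal)).fit()   # calibration model
    sigma = np.sqrt(cal.scale)
    mu = cal.predict(sm.add_constant(W_main))
    est, var = [], []
    for _ in range(m):
        X_imp = mu + rng.normal(0.0, sigma, len(W_main))
        design = sm.add_constant(np.column_stack([X_imp, Z_main]))
        fit = sm.OLS(Y_main, design).fit()
        est.append(fit.params[1])
        var.append(fit.bse[1] ** 2)
    qbar = np.mean(est)                                  # Rubin's rules
    total = np.mean(var) + (1 + 1 / m) * np.var(est, ddof=1)
    return qbar, np.sqrt(total)
```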
Fast and accurate genotype imputation in genome-wide association studies through pre-phasing
Howie, Bryan; Fuchsberger, Christian; Stephens, Matthew; Marchini, Jonathan; Abecasis, Gonçalo R.
2013-01-01
Sequencing efforts, including the 1000 Genomes Project and disease-specific efforts, are producing large collections of haplotypes that can be used for genotype imputation in genome-wide association studies (GWAS). Imputing from these reference panels can help identify new risk alleles, but the use of large panels with existing methods imposes a high computational burden. To keep imputation broadly accessible, we introduce a strategy called “pre-phasing” that maintains the accuracy of leading methods while cutting computational costs by orders of magnitude. In brief, we first statistically estimate the haplotypes for each GWAS individual (“pre-phasing”) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because: (i) the GWAS samples must be phased only once, whereas standard methods would implicitly re-phase with each reference panel update; (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match unphased GWAS genotypes to a pair of reference haplotypes. This strategy will be particularly valuable for repeated imputation as reference panels evolve. PMID:22820512
ERIC Educational Resources Information Center
Bokossa, Maxime C.; Huang, Gary G.
This report describes the imputation procedures used to deal with missing data in the National Education Longitudinal Study of 1988 (NELS:88), the only current National Center for Education Statistics (NCES) dataset that contains scores from cognitive tests given the same set of students at multiple time points. As is inevitable, cognitive test…
Dipnall, Joanna F.
2016-01-01
Background Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. Methods The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. Results After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and between total bilirubin and current smokers (p<0.001). Conclusion The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin. PMID:26848571
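A minimal sketch of the three-step pipeline described above: impute, screen biomarkers with a boosted learner, then fit a conventional logistic model on the selected variables. It ignores the survey weights and the multiple-imputation pooling used in the study; data and names are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 67))              # 67 hypothetical biomarkers
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.binomial(1, 0.2, 300)               # depression indicator

X_imp = IterativeImputer(random_state=0).fit_transform(X)       # 1: impute
gbm = GradientBoostingClassifier(random_state=0).fit(X_imp, y)  # 2: screen
top = np.argsort(gbm.feature_importances_)[::-1][:3]
final = LogisticRegression().fit(X_imp[:, top], y)              # 3: model
print(top, np.exp(final.coef_))             # selected columns and odds ratios
```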
Principled Approaches to Missing Data in Epidemiologic Studies
Perkins, Neil J; Cole, Stephen R; Harel, Ofer; Tchetgen Tchetgen, Eric J; Sun, BaoLuo; Mitchell, Emily M; Schisterman, Enrique F
2018-01-01
Principled methods with which to appropriately analyze missing data have long existed; however, broad implementation of these methods remains challenging. In this and 2 companion papers (Am J Epidemiol. 2018;187(3):576–584 and Am J Epidemiol. 2018;187(3):585–591), we discuss issues pertaining to missing data in the epidemiologic literature. We provide details regarding missing-data mechanisms and nomenclature and encourage the conduct of principled analyses through a detailed comparison of multiple imputation and inverse probability weighting. Data from the Collaborative Perinatal Project, a multisite US study conducted from 1959 to 1974, are used to create a masked data-analytical challenge with missing data induced by known mechanisms. We illustrate the deleterious effects of missing data with naive methods and show how principled methods can sometimes mitigate such effects. For example, when data were missing at random, naive methods showed a spurious protective effect of smoking on the risk of spontaneous abortion (odds ratio (OR) = 0.43, 95% confidence interval (CI): 0.19, 0.93), while implementation of principled methods multiple imputation (OR = 1.30, 95% CI: 0.95, 1.77) or augmented inverse probability weighting (OR = 1.40, 95% CI: 1.00, 1.97) provided estimates closer to the “true” full-data effect (OR = 1.31, 95% CI: 1.05, 1.64). We call for greater acknowledgement of and attention to missing data and for the broad use of principled missing-data methods in epidemiologic research. PMID:29165572
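The inverse-probability-weighting side of the comparison above can be sketched as follows: a model for the probability of being a complete case is fitted on fully observed variables, and complete cases are reweighted by the inverse of that probability. Data and names are hypothetical, and the naive standard errors printed by the weighted fit would need a sandwich correction in practice.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "abortion": rng.binomial(1, 0.3, 1000),       # outcome, fully observed
    "smoking": rng.binomial(1, 0.4, 1000).astype(float),
    "age": rng.normal(28, 5, 1000),
})
df.loc[rng.random(1000) < 0.25, "smoking"] = np.nan  # exposure partly missing

complete = df["smoking"].notna()
# model the probability of being a complete case given observed variables
obs_design = sm.add_constant(df[["abortion", "age"]])
ps = sm.Logit(complete.astype(int), obs_design).fit(disp=False)
w = 1.0 / ps.predict(obs_design)

cc = df[complete]
fit = sm.GLM(cc["abortion"],
             sm.add_constant(cc[["smoking", "age"]]),
             family=sm.families.Binomial(),
             freq_weights=w[complete].to_numpy()).fit()
print(np.exp(fit.params["smoking"]))   # IPW-corrected odds ratio
```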
Larsen, Lawrence C; Shah, Mena
2016-01-01
Although networks of environmental monitors are constantly improving through advances in technology and management, instances of missing data still occur. Many methods of imputing values for missing data are available, but they are often difficult to use or produce unsatisfactory results. I-Bot (short for "Imputation Robot") is a context-intensive approach to the imputation of missing data in data sets from networks of environmental monitors. I-Bot is easy to use and routinely produces imputed values that are highly reliable. I-Bot is described and demonstrated using more than 10 years of California data for daily maximum 8-hr ozone, 24-hr PM2.5 (particulate matter with an aerodynamic diameter <2.5 μm), mid-day average surface temperature, and mid-day average wind speed. I-Bot performance is evaluated by imputing values for observed data as if they were missing, and then comparing the imputed values with the observed values. In many cases, I-Bot is able to impute values for long periods with missing data, such as a week, a month, a year, or even longer. Qualitative visual methods and standard quantitative metrics demonstrate the effectiveness of the I-Bot methodology. Many resources are expended every year to analyze and interpret data sets from networks of environmental monitors. A large fraction of those resources is used to cope with difficulties due to the presence of missing data. The I-Bot method of imputing values for such missing data may help convert incomplete data sets into virtually complete data sets that facilitate the analysis and reliable interpretation of vital environmental data.
A Kriging based spatiotemporal approach for traffic volume data imputation
Han, Lee D.; Liu, Xiaohan; Pu, Li; Chin, Shih-miao; Hwang, Ho-ling
2018-01-01
Along with the rapid development of Intelligent Transportation Systems, traffic data collection technologies have progressed fast. The emergence of innovative data collection technologies such as remote traffic microwave sensors, Bluetooth sensors, GPS-based floating car methods, and automated license plate recognition has significantly increased the variety and volume of traffic data. Despite the development of these technologies, the missing data issue is still a problem that poses great challenges for data-based applications such as traffic forecasting, real-time incident detection, dynamic route guidance, and massive evacuation optimization. A thorough literature review suggests that most current imputation models either focus on the temporal nature of the traffic data and fail to consider the spatial information of neighboring locations, or assume the data follow a certain distribution. These two issues reduce imputation accuracy and limit the use of the corresponding imputation methods, respectively. As a result, this paper presents a Kriging-based data imputation approach that is able to fully utilize the spatiotemporal correlation in the traffic data and that does not assume the data follow any particular distribution. A set of scenarios with different missing rates is used to evaluate the performance of the proposed method. The performance of the proposed method was compared with that of two other widely used methods, historical average and K-nearest neighborhood. Comparison results indicate that the proposed method has the highest imputation accuracy and is more flexible compared to the other methods. PMID:29664928
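A minimal sketch of a single ordinary-kriging step for filling one missing observation from neighbouring detectors, using an exponential covariance over a combined space-time coordinate. This is generic textbook kriging, not the paper's full spatiotemporal model; all parameters are hypothetical.

```python
import numpy as np

def ordinary_krige(coords, values, target, sill=1.0, corr_range=10.0):
    """coords: (n, d) space-time coordinates of observed points;
    values: (n,) observed volumes; target: (d,) point to impute."""
    def cov(h):
        return sill * np.exp(-h / corr_range)   # exponential covariance
    d_obs = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    d_tgt = np.linalg.norm(coords - target, axis=-1)
    n = len(values)
    # ordinary kriging system with a Lagrange multiplier for unbiasedness
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = cov(d_obs)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = cov(d_tgt)
    lam = np.linalg.solve(A, b)[:n]             # kriging weights
    return lam @ values
```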
Method variation in the impact of missing data on response shift detection.
Schwartz, Carolyn E; Sajobi, Tolulope T; Verdam, Mathilde G E; Sebille, Veronique; Lix, Lisa M; Guilleux, Alice; Sprangers, Mirjam A G
2015-03-01
Missing data due to attrition or item non-response can result in biased estimates and loss of power in longitudinal quality-of-life (QOL) research. The impact of missing data on response shift (RS) detection is relatively unknown. This overview article synthesizes the findings of three methods tested in this special section regarding the impact of missing data patterns on RS detection in incomplete longitudinal data. The RS detection methods investigated include: (1) relative importance analysis to detect reprioritization RS in stroke caregivers; (2) Oort's structural equation modeling (SEM) to detect recalibration, reprioritization, and reconceptualization RS in cancer patients; and (3) Rasch-based item response theory (IRT) models as compared to SEM models to detect recalibration and reprioritization RS in hospitalized chronic disease patients. Each method dealt with missing data differently, either with imputation (1), attrition-based multi-group analysis (2), or probabilistic analysis that is robust to missingness due to the specific objectivity property (3). Relative importance analyses were sensitive to the type and amount of missing data and the imputation method, with multiple imputation showing the largest RS effects. The attrition-based multi-group SEM revealed differential effects of both the changes in health-related QOL and the occurrence of response shift by attrition stratum, and enabled a more complete interpretation of findings. The IRT RS algorithm found evidence of small recalibration and reprioritization effects in General Health, whereas SEM mostly evidenced small recalibration effects. These differences may be due to differences between the two methods in the handling of missing data. Missing data imputation techniques result in different conclusions about the presence of reprioritization RS using the relative importance method, while the attrition-based SEM approach highlighted different recalibration and reprioritization RS effects by attrition group. The IRT analyses detected more recalibration and reprioritization RS effects than SEM, presumably due to IRT's robustness to missing data. Future research should apply simulation techniques in order to make conclusive statements about the impacts of missing data according to the type and amount of RS.
B. Tyler Wilson; Andrew J. Lister; Rachel I. Riemann
2012-01-01
The paper describes an efficient approach for mapping multiple individual tree species over large spatial domains. The method integrates vegetation phenology derived from MODIS imagery and raster data describing relevant environmental parameters with extensive field plot data of tree species basal area to create maps of tree species abundance and distribution at a 250-...
Epidemiologic Evaluation of Measurement Data in the Presence of Detection Limits
Lubin, Jay H.; Colt, Joanne S.; Camann, David; Davis, Scott; Cerhan, James R.; Severson, Richard K.; Bernstein, Leslie; Hartge, Patricia
2004-01-01
Quantitative measurements of environmental factors greatly improve the quality of epidemiologic studies but can pose challenges because of the presence of upper or lower detection limits or interfering compounds, which do not allow for precise measured values. We consider the regression of an environmental measurement (dependent variable) on several covariates (independent variables). Various strategies are commonly employed to impute values for interval-measured data, including assignment of one-half the detection limit to nondetected values or of “fill-in” values randomly selected from an appropriate distribution. On the basis of a limited simulation study, we found that the former approach can be biased unless the percentage of measurements below detection limits is small (5–10%). The fill-in approach generally produces unbiased parameter estimates but may produce biased variance estimates and thereby distort inference when 30% or more of the data are below detection limits. Truncated data methods (e.g., Tobit regression) and multiple imputation offer two unbiased approaches for analyzing measurement data with detection limits. If interest resides solely on regression parameters, then Tobit regression can be used. If individualized values for measurements below detection limits are needed for additional analysis, such as relative risk regression or graphical display, then multiple imputation produces unbiased estimates and nominal confidence intervals unless the proportion of missing data is extreme. We illustrate various approaches using measurements of pesticide residues in carpet dust in control subjects from a case–control study of non-Hodgkin lymphoma. PMID:15579415
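To make the "fill-in" multiple-imputation approach concrete, here is a minimal sketch, assuming log-normally distributed concentrations and a single lower detection limit. The parameters below are estimated crudely from the detected values only; a faithful implementation would estimate them by censored-data maximum likelihood and propagate parameter uncertainty across imputations.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(42)

def impute_below_dl(x, dl, n_imp=5):
    """Multiply impute values below a detection limit (coded as NaN),
    assuming log-normal concentrations. Returns n_imp completed arrays.

    x  : measurements, NaN where below the detection limit
    dl : detection limit (same units as x)
    """
    obs = x[~np.isnan(x)]
    mu, sd = np.log(obs).mean(), np.log(obs).std(ddof=1)  # crude estimates
    upper = (np.log(dl) - mu) / sd      # truncation point on the log scale
    n_miss = int(np.isnan(x).sum())
    completed = []
    for _ in range(n_imp):
        draws = truncnorm.rvs(-np.inf, upper, loc=mu, scale=sd,
                              size=n_miss, random_state=rng)
        xi = x.copy()
        xi[np.isnan(xi)] = np.exp(draws)  # back-transform to concentrations
        completed.append(xi)
    return completed
```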
Suyundikov, Anvar; Stevens, John R.; Corcoran, Christopher; Herrick, Jennifer; Wolff, Roger K.; Slattery, Martha L.
2015-01-01
Missing data can arise in bioinformatics applications for a variety of reasons, and imputation methods are frequently applied to such data. We are motivated by a colorectal cancer study where miRNA expression was measured in paired tumor-normal samples of hundreds of patients, but data for many normal samples were missing due to lack of tissue availability. We compare the precision and power performance of several imputation methods, and draw attention to the statistical dependence induced by K-Nearest Neighbors (KNN) imputation. This imputation-induced dependence has not previously been addressed in the literature. We demonstrate how to account for this dependence, and show through simulation how the choice to ignore or account for this dependence affects both power and type I error rate control. PMID:25849489
Multiple imputation to evaluate the impact of an assay change in national surveys
Sternberg, Maya
2017-01-01
National health surveys, such as the National Health and Nutrition Examination Survey, are used to monitor trends of nutritional biomarkers. These surveys try to maintain the same biomarker assay over time, but there are a variety of reasons why the assay may change. In these cases, it is important to evaluate the potential impact of a change so that any observed fluctuations in concentrations over time are not confounded by changes in the assay. To this end, a subset of stored specimens previously analyzed with the old assay is retested using the new assay. These paired data are used to estimate an adjustment equation, which is then used to ‘adjust’ all the old assay results and convert them into ‘equivalent’ units of the new assay. In this paper, we present a new way of approaching this problem using modern statistical methods designed for missing data. Using simulations, we compare the proposed multiple imputation approach with the adjustment equation approach currently in use. We also compare these approaches using real National Health and Nutrition Examination Survey data for 25-hydroxyvitamin D. PMID:28419523
Parameter estimation in Cox models with missing failure indicators and the OPPERA study.
Brownstein, Naomi C; Cai, Jianwen; Slade, Gary D; Bair, Eric
2015-12-30
In a prospective cohort study, examining all participants for incidence of the condition of interest may be prohibitively expensive. For example, the "gold standard" for diagnosing temporomandibular disorder (TMD) is a physical examination by a trained clinician. In large studies, examining all participants in this manner is infeasible. Instead, it is common to use questionnaires to screen for incidence of TMD and perform the "gold standard" examination only on participants who screen positively. Unfortunately, some participants may leave the study before receiving the "gold standard" examination. Within the framework of survival analysis, this results in missing failure indicators. Motivated by the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, a large cohort study of TMD, we propose a method for parameter estimation in survival models with missing failure indicators. We estimate the probability of being an incident case for those lacking a "gold standard" examination using logistic regression. These estimated probabilities are used to generate multiple imputations of case status for each missing examination that are combined with observed data in appropriate regression models. The variance introduced by the procedure is estimated using multiple imputation. The method can be used to estimate both regression coefficients in Cox proportional hazard models as well as incidence rates using Poisson regression. We simulate data with missing failure indicators and show that our method performs as well as or better than competing methods. Finally, we apply the proposed method to data from the OPPERA study. Copyright © 2015 John Wiley & Sons, Ltd.
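A minimal sketch of the strategy the abstract describes, under assumed inputs: predict the probability of being an incident case from covariates via logistic regression, draw the missing failure indicators several times, and pool per-dataset estimates with Rubin's rules. A complete implementation would also propagate uncertainty in the logistic coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def impute_case_status(X_obs, y_obs, X_miss, n_imp=10):
    """Multiply impute missing failure indicators.
    X_obs, y_obs : covariates and case status where the exam was done
    X_miss       : covariates for participants lacking the exam
    Returns a list of n_imp imputed 0/1 vectors."""
    model = LogisticRegression(max_iter=1000).fit(X_obs, y_obs)
    p = model.predict_proba(X_miss)[:, 1]
    return [rng.binomial(1, p) for _ in range(n_imp)]

def pool_rubin(estimates, variances):
    """Rubin's rules: pool an estimate computed in each imputed dataset."""
    m = len(estimates)
    qbar = np.mean(estimates)
    w = np.mean(variances)        # within-imputation variance
    b = np.var(estimates, ddof=1) # between-imputation variance
    return qbar, np.sqrt(w + (1 + 1 / m) * b)
```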
A meta-data based method for DNA microarray imputation.
Jörnsten, Rebecka; Ouyang, Ming; Wang, Hui-Yu
2007-03-29
DNA microarray experiments are conducted in logical sets, such as time course profiling after a treatment is applied to the samples, or comparisons of the samples under two or more conditions. Due to cost and design constraints of spotted cDNA microarray experiments, each logical set commonly includes only a small number of replicates per condition. Despite the vast improvement of the microarray technology in recent years, missing values are prevalent. Intuitively, imputation of missing values is best done using many replicates within the same logical set. In practice, there are few replicates and thus reliable imputation within logical sets is difficult. However, it is in the case of few replicates that the presence of missing values, and how they are imputed, can have the most profound impact on the outcome of downstream analyses (e.g. significance analysis and clustering). This study explores the feasibility of imputation across logical sets, using the vast amount of publicly available microarray data to improve imputation reliability in the small sample size setting. We download all cDNA microarray data of Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans from the Stanford Microarray Database. Through cross-validation and simulation, we find that, for all three species, our proposed imputation using data from public databases is far superior to imputation within a logical set, sometimes to an astonishing degree. Furthermore, the imputation root mean square error for significant genes is generally a lot less than that of non-significant ones. Since downstream analysis of significant genes, such as clustering and network analysis, can be very sensitive to small perturbations of estimated gene effects, it is highly recommended that researchers apply reliable data imputation prior to further analysis. Our method can also be applied to cDNA microarray experiments from other species, provided good reference data are available.
A suggested approach for imputation of missing dietary data for young children in daycare.
Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P; Zeng, Donglin; Vaughn, Amber E; Pratt, Charlotte; Ward, Dianne S
2015-01-01
Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare, making accurate proxy reports by parents difficult. The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls). Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES); lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI). From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES)] ratio among non-daycare children on weekdays and the L/(B+D+ES) ratio for all children on weekends. Daytime snack data were used to impute snacks. The reported mean (± standard deviation) weekday intake was lower for daycare children [725 (±324) kcal] compared to non-daycare children [1,048 (±463) kcal]. Weekend intake for all children was 1,173 (±427) kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409) kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.
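The interpolation step reduces to simple arithmetic: a missing daycare lunch is assigned the appropriate mean lunch-to-(B+D+ES) ratio times the child's reported B+D+ES energy. The function and numbers below are hypothetical illustrations of that calculation, not values from the study.

```python
def impute_daycare_lunch(b_d_es_kcal, ratio_weekday, ratio_weekend, weekday=True):
    """Impute a missing daycare lunch as a fraction of reported
    breakfast + dinner + evening snack (B+D+ES) energy.

    ratio_weekday : mean L/(B+D+ES) among non-daycare children on weekdays
    ratio_weekend : mean L/(B+D+ES) among all children on weekends
    """
    ratio = ratio_weekday if weekday else ratio_weekend
    return ratio * b_d_es_kcal

# Illustrative numbers only: a child reporting 700 kcal from B+D+ES and a
# weekday lunch ratio of 0.45 would be assigned a 315 kcal lunch.
print(impute_daycare_lunch(700, ratio_weekday=0.45, ratio_weekend=0.40))
```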
Haji-Maghsoudi, Saiedeh; Haghdoost, Ali-akbar; Rastegari, Azam; Baneshi, Mohammad Reza
2013-01-01
Background: Policy makers need models that can detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets, and the presence of missing data challenges model development. Several studies have suggested that the performance of imputation methods is acceptable when the missing rate is moderate. One issue that has received less attention, and is addressed here, is the role of the pattern of missing data. Methods: We used information on 2720 prisoners. Results derived from fitting a regression model to the whole data served as the gold standard. Missing data were then generated so that 10%, 20% and 50% of data were lost. In scenario 1, we generated missing values, at the above rates, in one variable that was significant in the gold-standard model (age). In scenario 2, a small proportion of each independent variable was dropped. Four imputation methods, under different Events Per Variable (EPV) values, were compared in terms of selection of important variables and parameter estimation. Results: In scenario 2, bias in estimates was low and the performances of all methods for handling missing data were similar. All methods at all missing rates were able to detect the significance of age. In scenario 1, biases in estimates increased, in particular at the 50% missing rate; at EPVs of 10 and 5, the imputation methods failed to capture the effect of age. Conclusion: In scenario 2, all imputation methods at all missing rates were able to detect age as significant. This was not the case in scenario 1. Our results showed that the performance of imputation methods depends on the pattern of missing data. PMID:24596839
Fielding, S; Ogbuagu, A; Sivasubramaniam, S; MacLennan, G; Ramsay, C R
2016-12-01
Missing data are a major problem in the analysis of randomised trials, affecting power and potentially producing biased treatment effects. Focusing specifically on quality of life (QoL) outcomes, we aimed to report the amount of missing data, whether imputation was used and which methods, and whether the missing data mechanism was discussed, in trials from four leading medical journals, and to compare the picture with our previous review nearly a decade ago. A random selection (50%) of all RCTs published during 2013-2014 in BMJ, JAMA, Lancet and NEJM was obtained. RCTs reported in research letters, cluster RCTs, non-randomised designs, review articles and meta-analyses were excluded. We included 87 RCTs in the review; in 35% of these the amount of missing primary QoL data was unclear, and 31 (36%) used imputation. Only 23% discussed the missing data mechanism. Nearly half used complete case analysis. Reporting was even less clear for secondary QoL outcomes. Compared with the previous review, multiple imputation was used more prominently, but mainly in sensitivity analyses. Inadequate reporting and handling of missing QoL data in RCTs are still an issue. There is a large gap between statistical methods research relating to missing data and the use of these methods in applications. A sensitivity analysis should be undertaken to explore the sensitivity of the main results to different missing data assumptions. Medical journals can help to improve the situation by requiring higher standards of reporting and analytical methods to deal with missing data, and by issuing guidance to authors on expected standards.
Kwon, Ji-Sun; Kim, Jihye; Nam, Dougu; Kim, Sangsoo
2012-06-01
Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanisms. We compared the performance of two different GSA implementations that accept GWAS p-values of single nucleotide polymorphisms (SNPs), or gene-by-gene summaries thereof, GSA-SNP and i-GSEA4GWAS, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS: 259,188 and 1,152,947 SNPs from the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. GSA-SNP, on the other hand, reported 94 and 38 hits for the unimputed and imputed datasets, respectively. Similar trends, though to a lesser degree, were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets as well. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on Z-statistics, which are sensitive to variations in the background level of associations. Judicious evaluation of GSA outcomes, perhaps based on multiple programs, is recommended.
Khateeb, O M; Osborne, D; Mulla, Z D
2010-04-01
Invasive group A streptococcal (GAS) disease is a condition of clinical and public health significance. We conducted epidemiological analyses to determine if the presence of gastrointestinal (GI) complaints (diarrhea and/or vomiting) early in the course of invasive GAS disease is associated with either of two severe outcomes: GAS necrotizing fasciitis, or hospital mortality. Subjects were hospitalized for invasive GAS disease throughout the state of Florida, USA, during a 4-year period. Multiple imputation using the Markov chain Monte Carlo method was used to replace missing values with plausible values. Excluding cases with missing data resulted in a sample size of 138 invasive GAS patients (the complete subject analysis) while the imputed datasets contained 257 records. GI symptomatology within 48 h of hospital admission was not associated with hospital mortality in either the complete subject analysis [adjusted odds ratio (aOR) 0.86, 95% confidence interval (CI) 0.31-2.39] or in the imputed datasets. GI symptoms were significantly associated with GAS necrotizing fasciitis in the complete subject analysis (aOR 4.64, 95% CI 1.18-18.23) and in the imputed datasets but only in patients aged <55 years. The common cause of GI symptoms and necrotizing fasciitis may be streptococcal exotoxins. Clinicians who are treating young individuals presumed to be in the early stages of invasive GAS disease should take note of GI symptoms and remain vigilant for the development of a GAS necrotizing soft-tissue infection.
Song, Minsun; Wheeler, William; Caporaso, Neil E; Landi, Maria Teresa; Chatterjee, Nilanjan
2018-03-01
Genome-wide association studies (GWAS) are now routinely imputed for untyped single nucleotide polymorphisms (SNPs) based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele counts for imputed SNPs as the dosage variable is known to produce a valid score test for genetic association. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic associations incorporating gene-environment interactions. We focus on case-control association studies where inference for an underlying logistic regression model can be performed using alternative methods that rely to varying degrees on an assumption of gene-environment independence in the underlying population. As increasingly large-scale GWAS are being performed through consortium efforts where it is preferable to share only summary-level information across studies, we also describe simple mechanisms for implementing score tests based on standard meta-analysis of "one-step" maximum-likelihood estimates across studies. Applications of the methods in simulation studies and a dataset from a GWAS of lung cancer illustrate the ability of the proposed methods to maintain type-I error rates for the underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, the retrospective methods can lead to considerable efficiency gains for modeling of gene-environment interactions under the assumption of gene-environment independence. Methods are made available for public use through the CGEN R software package. © 2017 WILEY PERIODICALS, INC.
Imputation of Missing Genotypes From Sparse to High Density Using Long-Range Phasing
USDA-ARS?s Scientific Manuscript database
Related individuals in a population share long chromosome segments which trace to a common ancestor. We describe a long-range phasing algorithm that makes use of this property to phase whole chromosomes and simultaneously impute a large number of missing markers. We test our method by imputing marke...
Standard and Robust Methods in Regression Imputation
ERIC Educational Resources Information Center
Moraveji, Behjat; Jafarian, Koorosh
2014-01-01
The aim of this paper is to introduce new imputation algorithms for estimating missing values in larger official-statistics data sets during data pre-processing, including data containing outliers. The goal is to propose a new algorithm called IRMI (iterative robust model-based imputation). This algorithm is able to deal with all challenges like…
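Although the abstract is truncated, the iterative model-based idea behind IRMI can be sketched with standard tools: each variable with missing values is regressed on all the others, using a robust estimator so that outliers do not distort the imputations, and the cycle repeats until the fill-ins stabilize. The sketch below uses scikit-learn's IterativeImputer with a Huber regressor as a stand-in, not the IRMI implementation itself.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)

# Toy data: three correlated variables, one gross outlier, ~15% missing
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)
X[0, 0] = 50.0                              # gross outlier
X[rng.random(X.shape) < 0.15] = np.nan      # missing completely at random

# Iteratively regress each incomplete variable on the others; the robust
# estimator limits the influence of the outlier, in the spirit of IRMI.
imputer = IterativeImputer(estimator=HuberRegressor(), max_iter=20,
                           random_state=0)
X_completed = imputer.fit_transform(X)
```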
Sheikh, Mashhood Ahmed; Abelsen, Birgit; Olsen, Jan Abel
2017-11-01
Previous methods for assessing mediation assume no multiplicative interactions. The inverse odds weighting (IOW) approach has been presented as a method that can be used even when interactions exist. The substantive aim of this study was to assess the indirect effect of education on health and well-being via four indicators of adult socioeconomic status (SES): income, management position, occupational hierarchy position and subjective social status. 8516 men and women from the Tromsø Study (Norway) were followed for 17 years. Education was measured at age 25-74 years, while SES and health and well-being were measured at age 42-91 years. Natural direct and indirect effects (NIE) were estimated using weighted Poisson regression models with IOW. Stata code is provided that makes it easy to assess mediation in any multiple imputed dataset with multiple mediators and interactions. Low education was associated with lower SES. Consequently, low SES was associated with being unhealthy and having a low level of well-being. The effect (NIE) of education on health and well-being is mediated by income, management position, occupational hierarchy position and subjective social status. This study contributes to the literature on mediation analysis, as well as the literature on the importance of education for health-related quality of life and subjective well-being. The influence of education on health and well-being had different pathways in this Norwegian sample. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
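The study's Stata code is not reproduced here, but the inverse odds weighting logic can be sketched schematically, assuming a binary exposure and outcome: fit a model for exposure given the mediators, weight exposed subjects by the inverse of their exposure odds, and compare the weighted (direct-effect) and unweighted (total-effect) outcome models. Everything below, including the function name and the use of logistic rather than weighted Poisson models, is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iow_mediation(A, M, C, Y):
    """Schematic inverse odds weighting (IOW) for binary exposure A,
    mediator matrix M, covariate matrix C and binary outcome Y.
    Returns (TE, NDE, NIE) on the log-odds scale; bootstrapping would
    be needed for standard errors."""
    # 1) Model exposure given mediators and covariates
    exp_model = LogisticRegression(max_iter=1000).fit(np.hstack([M, C]), A)
    p = exp_model.predict_proba(np.hstack([M, C]))[:, 1]
    # 2) Inverse odds weights: exposed get P(A=0|M,C)/P(A=1|M,C), unexposed 1
    w = np.where(A == 1, (1 - p) / p, 1.0)
    # 3) Total effect from the unweighted outcome model
    te = LogisticRegression(max_iter=1000).fit(
        np.hstack([A[:, None], C]), Y).coef_[0, 0]
    # 4) Natural direct effect from the IOW-weighted outcome model
    nde = LogisticRegression(max_iter=1000).fit(
        np.hstack([A[:, None], C]), Y, sample_weight=w).coef_[0, 0]
    return te, nde, te - nde    # NIE = TE - NDE
```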
Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets.
Huang, Min-Wei; Lin, Wei-Chao; Tsai, Chih-Fong
2018-01-01
Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem: estimates are provided for the missing values through a reasoning process based on the (complete) observed data. However, if the observed data contain some noisy information or outliers, the estimations of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation for the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results for the mixed data type of medical datasets. However, instance selection does not have a clearly positive impact on the imputation result for categorical medical datasets.
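A minimal sketch of the combination being evaluated, with assumed stand-ins: since DROP3, GA and IB3 have no off-the-shelf scikit-learn implementation, a local outlier factor filter plays the role of instance selection before a KNN imputer is fitted on the retained complete cases.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import LocalOutlierFactor

def select_then_impute(X, n_neighbors=5, contamination=0.05):
    """Filter suspicious complete cases before fitting a KNN imputer.
    LocalOutlierFactor stands in for DROP3/GA/IB3 here; X must contain
    at least some complete rows."""
    complete = ~np.isnan(X).any(axis=1)
    X_complete = X[complete]
    keep = LocalOutlierFactor(
        n_neighbors=n_neighbors,
        contamination=contamination).fit_predict(X_complete) == 1
    # Fit the imputer on the cleaned complete cases, then transform all rows
    imputer = KNNImputer(n_neighbors=n_neighbors).fit(X_complete[keep])
    return imputer.transform(X)
```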
2013-01-01
Background Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry). Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker density, but result in some genotype errors and a large number of missing genotype values. Imputation can reduce the number of missing values and can correct genotyping errors, but current methods of imputation require a reference genome and thus are not an option for most species. Results Genotyping by Sequencing (GBS) was used to produce highly saturated maps for a R. idaeus pseudo-testcross progeny. While low coverage and high variance in sequencing resulted in a large number of missing values for some individuals, a novel method of imputation based on maximum likelihood marker ordering from initial marker segregation overcame the challenge of missing values, and made map construction computationally tractable. The two resulting parental maps contained 4521 and 2391 molecular markers spanning 462.7 and 376.6 cM respectively over seven linkage groups. Detection of precise genomic regions with segregation distortion was possible because of map saturation. Microsatellites (SSRs) linked these results to published maps for cross-validation and map comparison. Conclusions GBS together with genome-independent imputation provides a rapid method for genetic map construction in any pseudo-testcross progeny. Our method of imputation estimates the correct genotype call of missing values and corrects genotyping errors that lead to inflated map size and reduced precision in marker placement. Comparison of SSRs to published R. idaeus maps showed that the linkage maps constructed with GBS and our method of imputation were robust, and marker positioning reliable. The high marker density allowed identification of genomic regions with segregation distortion in R. idaeus, which may help to identify deleterious alleles that are the basis of inbreeding depression in the species. PMID:23324311
Välikangas, Tommi; Suomi, Tomi; Elo, Laura L
2017-05-31
Label-free mass spectrometry (MS) has developed into an important tool applied in various fields of biological and life sciences. Several software tools exist to process the raw MS data into quantified protein abundances, including open source and commercial solutions. Each tool includes a set of unique algorithms for different tasks of the MS data processing workflow. While many of these algorithms have been compared separately, a thorough and systematic evaluation of their overall performance is missing. Moreover, systematic information is lacking about the amount of missing values produced by the different proteomics software and the capabilities of different data imputation methods to account for them. In this study, we evaluated the performance of five popular quantitative label-free proteomics software workflows using four different spike-in data sets. Our extensive testing included the number of proteins quantified and the number of missing values produced by each workflow, the accuracy of detecting differential expression and logarithmic fold change, and the effect of different imputation and filtering methods on the differential expression results. We found that the Progenesis software performed consistently well in the differential expression analysis and produced few missing values. The missing values produced by the other tools decreased their performance, but this difference could be mitigated using proper data filtering or imputation methods. Among the imputation methods, we found that the local least squares (lls) regression imputation consistently increased the performance of the software in the differential expression analysis, and a combination of both data filtering and local least squares imputation increased performance the most in the tested data sets. © The Author 2017. Published by Oxford University Press.
Imputation of missing data in time series for air pollutants
NASA Astrophysics Data System (ADS)
Junger, W. L.; Ponce de Leon, A.
2015-02-01
Missing data are major concerns in epidemiological studies of the health effects of environmental air pollutants. This article presents an imputation-based method that is suitable for multivariate time series data, which uses the EM algorithm under an assumption of normality. Different approaches are considered for filtering the temporal component. A simulation study was performed to assess the validity and performance of the proposed method in comparison with some frequently used methods. Simulations showed that when the amount of missing data was as low as 5%, the complete data analysis yielded satisfactory results regardless of the generating mechanism of the missing data, whereas the validity began to degenerate when the proportion of missing values exceeded 10%. The proposed imputation method exhibited good accuracy and precision in different settings with respect to the patterns of missing observations. Most of the imputations yielded valid results, even under missingness not at random. The methods proposed in this study are implemented as a package called mtsdi for the statistical software system R.
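The core EM iteration for multivariate normal data can be sketched compactly: the E-step replaces missing entries by their conditional means given the observed entries, and the M-step re-estimates the mean and covariance with the appropriate conditional-covariance correction. This is only the plain EM part under an assumed normal model; the temporal filtering that distinguishes the proposed method (and the mtsdi package) is omitted.

```python
import numpy as np

def em_impute(X, n_iter=50):
    """EM imputation under a multivariate normal model. X is an (n, p)
    array with NaN marking missing entries; returns the completed array."""
    X = np.array(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = col_means[np.where(miss)[1]]     # crude starting values
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    for _ in range(n_iter):
        extra = np.zeros((p, p))               # conditional-covariance term
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            B = S[np.ix_(m, o)] @ np.linalg.pinv(S[np.ix_(o, o)])
            # E-step: replace missing entries by their conditional means
            X[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += S[np.ix_(m, m)] - B @ S[np.ix_(o, m)]
        # M-step: update mean and covariance, adding the E-step correction
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False, bias=True) + extra / n
    return X
```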
NASA Astrophysics Data System (ADS)
Zhang, Zhongrong; Yang, Xuan; Li, Hao; Li, Weide; Yan, Haowen; Shi, Fei
2017-10-01
Techniques for data analysis have developed widely in recent years; however, missing data still represent a ubiquitous problem in many scientific fields. In particular, dealing with missing spatiotemporal data presents an enormous challenge. Nonetheless, in recent years, a considerable amount of research has focused on spatiotemporal problems, making spatiotemporal missing data imputation methods increasingly indispensable. In this paper, a novel spatiotemporal hybrid method is proposed to verify and impute spatiotemporal missing values. This new method, termed SOM-FLSSVM, flexibly combines three advanced techniques: self-organizing feature map (SOM) clustering, the fruit fly optimization algorithm (FOA) and the least squares support vector machine (LSSVM). We employ a cross-validation (CV) procedure and a FOA swarm-intelligence optimization strategy that searches the available parameters and determines the optimal imputation model. The spatiotemporal groundwater data for Minqin County, China, were selected to test the reliability and imputation ability of SOM-FLSSVM. We carried out a validation experiment and compared three well-studied models with SOM-FLSSVM using missing-data ratios from 0.1 to 0.8 in the same data set. The results demonstrate that the new hybrid method performs well in terms of both robustness and accuracy for spatiotemporal missing data.
Larmer, S G; Sargolzaei, M; Schenkel, F S
2014-05-01
Genomic selection requires a large reference population to accurately estimate single nucleotide polymorphism (SNP) effects. In some Canadian dairy breeds, the available reference populations are not large enough for accurate estimation of SNP effects for traits of interest. If marker phase is highly consistent across multiple breeds, it is theoretically possible to increase the accuracy of genomic prediction for one or all breeds by pooling several breeds into a common reference population. This study investigated the extent of linkage disequilibrium (LD) in 5 major dairy breeds using a 50,000 (50K) SNP panel and 3 of the same breeds using the 777,000 (777K) SNP panel. Correlation of pair-wise SNP phase was also investigated on both panels. The level of LD was measured using the squared correlation of alleles at 2 loci (r²), and the consistency of SNP gametic phases was assessed using the signed square root of these values. Because of the high cost of the 777K panel, the accuracy of imputation from lower density marker panels [6,000 (6K) or 50K] was examined both within breed and using a multi-breed reference population in Holstein, Ayrshire, and Guernsey. Imputation was carried out using FImpute V2.2 and Beagle 3.3.2 software. Imputation accuracies were then calculated as both the proportion of correct SNPs filled in (concordance rate) and allelic R². Computation time was also explored to determine the efficiency of the different algorithms for imputation. Analysis showed that LD values >0.2 were found in all breeds at distances at or shorter than the average adjacent pair-wise distance between SNPs on the 50K panel. Correlations of r-values, however, did not reach high levels (<0.9) at these distances. High correlation values of SNP phase between breeds were observed (>0.94) when the average pair-wise distances using the 777K SNP panel were examined. High concordance rates (0.968-0.995) and allelic R² (0.946-0.991) were found for all breeds when imputation was carried out with FImpute from 50K to 777K. Imputation accuracy for Guernsey and Ayrshire was slightly lower when using the imputation method in Beagle. Computing time was significantly greater when using Beagle software, with all comparable procedures being 9 to 13 times less efficient, in terms of time, compared with FImpute. These findings suggest that use of a multi-breed reference population might increase prediction accuracy using the 777K SNP panel and that 777K genotypes can be efficiently and effectively imputed using the lower density 50K SNP panel. Copyright © 2014 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
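The two imputation accuracy measures used here are easy to state precisely, assuming genotypes coded as 0/1/2 allele dosages: concordance rate is the proportion of imputed calls matching the true calls, and allelic R² is the squared correlation between true and imputed dosages. The helper functions below are illustrative, not the software's own output.

```python
import numpy as np

def concordance_rate(true_geno, imputed_geno):
    """Proportion of imputed genotype calls (coded 0/1/2) matching truth."""
    return float(np.mean(true_geno == imputed_geno))

def allelic_r2(true_dosage, imputed_dosage):
    """Squared Pearson correlation between true and imputed allele dosages."""
    return float(np.corrcoef(true_dosage, imputed_dosage)[0, 1] ** 2)
```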
Missing data handling in non-inferiority and equivalence trials: A systematic review.
Rabe, Brooke A; Day, Simon; Fiero, Mallorie H; Bell, Melanie L
2018-05-25
Non-inferiority (NI) and equivalence clinical trials test whether a new treatment is therapeutically no worse than, or equivalent to, an existing standard of care. Missing data in clinical trials have been shown to reduce statistical power and potentially bias estimates of effect size; however, in NI and equivalence trials, they present additional issues. For instance, they may decrease sensitivity to differences between treatment groups and bias results toward the alternative hypothesis of NI (or equivalence). Our primary aim was to review the extent of and methods for handling missing data (model-based methods, single imputation, multiple imputation, complete case), the analysis sets used (Intention-To-Treat, Per-Protocol, or both), and whether sensitivity analyses were used to explore departures from assumptions about the missing data. We conducted a systematic review of NI and equivalence trials published between May 2015 and April 2016 by searching the PubMed database. Articles were reviewed primarily by 2 reviewers, with 6 articles reviewed by both reviewers to establish consensus. Of 109 selected articles, 93% reported some missing data in the primary outcome. Among those, 50% reported complete case analysis, and 28% reported single imputation approaches for handling missing data. Only 32% reported conducting analyses of both intention-to-treat and per-protocol populations. Only 11% conducted any sensitivity analyses to test assumptions with respect to missing data. Missing data are common in NI and equivalence trials, and they are often handled by methods which may bias estimates and lead to incorrect conclusions. Copyright © 2018 John Wiley & Sons, Ltd.
Yang, Jian; Bakshi, Andrew; Zhu, Zhihong; Hemani, Gibran; Vinkhuyzen, Anna A.E.; Lee, Sang Hong; Robinson, Matthew R.; Perry, John R.B.; Nolte, Ilja M.; van Vliet-Ostaptchouk, Jana V.; Snieder, Harold; Esko, Tonu; Milani, Lili; Mägi, Reedik; Metspalu, Andres; Hamsten, Anders; Magnusson, Patrik K.E.; Pedersen, Nancy L.; Ingelsson, Erik; Soranzo, Nicole; Keller, Matthew C.; Wray, Naomi R.; Goddard, Michael E.; Visscher, Peter M.
2015-01-01
We propose a method (GREML-LDMS) to estimate heritability for human complex traits in unrelated individuals using whole-genome sequencing (WGS) data. We demonstrate using simulations based on WGS data that ~97% and ~68% of variation at common and rare variants, respectively, can be captured by imputation. Using the GREML-LDMS method, we estimate from 44,126 unrelated individuals that all ~17M imputed variants explain 56% (s.e. = 2.3%) of variance for height and 27% (s.e. = 2.5%) for body mass index (BMI), and find evidence that height- and BMI-associated variants have been under natural selection. Considering imperfect tagging of imputation and potential overestimation of heritability from previous family-based studies, heritability is likely to be 60–70% for height and 30–40% for BMI. Therefore, missing heritability is small for both traits. For further gene discovery of complex traits, a design with SNP arrays followed by imputation is more cost-effective than WGS at current prices. PMID:26323059
Regnerus, Mark
2017-09-01
The study of stigma's influence on health has surged in recent years. Hatzenbuehler et al.'s (2014) study of structural stigma's effect on mortality revealed an average of 12 years' shorter life expectancy for sexual minorities who resided in communities thought to exhibit high levels of anti-gay prejudice, using data from the 1988-2002 administrations of the US General Social Survey linked to mortality outcome data in the 2008 National Death Index. In the original study, the key predictor variable (structural stigma) led to results suggesting the profound negative influence of structural stigma on the mortality of sexual minorities. Attempts to replicate the study, in order to explore alternative hypotheses, repeatedly failed to generate the original study's key finding on structural stigma. Efforts to discern the source of the disparity in results revealed complications in the multiple imputation process for missing values of the components of structural stigma. This prompted efforts at replication using 10 different imputation approaches. Efforts to replicate Hatzenbuehler et al.'s (2014) key finding on structural stigma's notable influence on the premature mortality of sexual minorities, including a more refined imputation strategy than described in the original study, failed. No data imputation approach yielded parameters that supported the original study's conclusions. Alternative hypotheses, which originally motivated the present study, revealed little new information. Ten different approaches to multiple imputation of missing data yielded none in which the effect of structural stigma on the mortality of sexual minorities was statistically significant. Minimally, the original study's structural stigma variable (and hence its key result) is so sensitive to subjective measurement decisions as to be rendered unreliable. Copyright © 2016 The Author. Published by Elsevier Ltd.. All rights reserved.
Toh, Sengwee; García Rodríguez, Luis A; Hernán, Miguel A
2012-05-01
Electronic healthcare databases are commonly used in comparative effectiveness and safety research of therapeutics. Many databases now include additional confounder information in a subset of the study population through data linkage or data collection. We described and compared existing methods for analyzing such datasets. Using data from The Health Improvement Network and the relation between non-steroidal anti-inflammatory drugs and upper gastrointestinal bleeding as an example, we employed several methods to handle partially missing confounder information. The crude odds ratio (OR) of upper gastrointestinal bleeding was 1.50 (95% confidence interval: 0.98, 2.28) among selective cyclo-oxygenase-2 inhibitor initiators (n = 43 569) compared with traditional non-steroidal anti-inflammatory drug initiators (n = 411 616). The OR dropped to 0.81 (0.52, 1.27) upon adjustment for confounders recorded for all patients. When further considering three additional variables missing in 22% of the study population (smoking, alcohol consumption, body mass index), the OR was between 0.80 and 0.83 for the missing-category approach, the missing-indicator approach, single imputation by the most common category, multiple imputation by chained equations, and propensity score calibration. The OR was 0.65 (0.39, 1.09) and 0.67 (0.38, 1.16) for the unweighted and the inverse probability weighted complete-case analysis, respectively. Existing methods for handling partially missing confounder data require different assumptions and may produce different results. The unweighted complete-case analysis, the missing-category/indicator approach, and single imputation require often unrealistic assumptions and should be avoided. In this study, differences across methods were not substantial, likely due to relatively low proportion of missingness and weak confounding effect by the three additional variables upon adjustment for other variables. Copyright © 2012 John Wiley & Sons, Ltd.
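Two of the approaches compared above can be sketched side by side, with hypothetical variable names: the missing-indicator approach (which the authors caution rests on often unrealistic assumptions) and a chained-equations-style multiple imputation, using scikit-learn's IterativeImputer as a stand-in for MICE.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def missing_indicator(df, col):
    """Missing-indicator approach: add a 0/1 flag and zero-fill the gap.
    (Shown for comparison only; the abstract advises against it.)"""
    out = df.copy()
    out[col + "_missing"] = out[col].isna().astype(int)
    out[col] = out[col].fillna(0)
    return out

def mice_like(df, cols, n_imp=5):
    """Chained-equations-style multiple imputation of several confounders;
    each pass samples from the predictive distribution."""
    completed = []
    for k in range(n_imp):
        imp = IterativeImputer(sample_posterior=True, random_state=k)
        arr = imp.fit_transform(df[cols])
        d = df.copy()
        d[cols] = arr
        completed.append(d)
    return completed
```

Each completed dataset would then be analysed separately (here, the outcome model for upper gastrointestinal bleeding) and the estimates pooled with Rubin's rules.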
Yeesoonsang, Seesai; Bilheem, Surichai; McNeil, Edward; Iamsirithaworn, Sophon; Jiraphongsa, Chuleeporn; Sriplung, Hutcha
2017-01-01
Histological specimens are not required for diagnosis of liver and bile duct (LBD) cancer, resulting in a high percentage of unknown histologies. We compared estimates of hepatocellular carcinoma (HCC) and cholangiocarcinoma (CCA) incidence by imputing these unknown histologies. A retrospective study was conducted using data from the Songkhla Cancer Registry, southern Thailand, from 1989 to 2013. Multivariate imputation by chained equations (mice) was used to re-classify the unknown histologies. Age-standardized rates (ASR) of HCC and CCA by sex were calculated and the trends were compared. Of 2,387 LBD cases, 61% had unknown histology. After imputation, the ASR of HCC in males increased from 4 to 10 per 100,000 during 1989 to 2007 and then decreased after 2007. The ASR of CCA increased from 2 to 5.5 per 100,000, while the ASR of HCC in females decreased from 1.5 in 2009 to 1.3 in 2013 and that of CCA increased from less than 1 to 1.9 per 100,000 by 2013. Results of complete case analysis showed somewhat similar, although less dramatic, trends. In Songkhla, the incidence of CCA appears to be stable after increasing for 20 years, whereas the incidence of HCC is now declining. The decline in incidence of HCC among males since 2007 is probably due to implementation of the hepatitis B virus vaccine in the 1990s. The rise in incidence of CCA is a concern and highlights the need for case-control studies to elucidate the risk factors.
Using Audit Information to Adjust Parameter Estimates for Data Errors in Clinical Trials
Shepherd, Bryan E.; Shaw, Pamela A.; Dodd, Lori E.
2013-01-01
Background Audits are often performed to assess the quality of clinical trial data, but beyond detecting fraud or sloppiness, the audit data is generally ignored. In earlier work using data from a non-randomized study, Shepherd and Yu (2011) developed statistical methods to incorporate audit results into study estimates, and demonstrated that audit data could be used to eliminate bias. Purpose In this manuscript we examine the usefulness of audit-based error-correction methods in clinical trial settings where a continuous outcome is of primary interest. Methods We demonstrate the bias of multiple linear regression estimates in general settings with an outcome that may have errors and a set of covariates for which some may have errors and others, including treatment assignment, are recorded correctly for all subjects. We study this bias under different assumptions including independence between treatment assignment, covariates, and data errors (conceivable in a double-blinded randomized trial) and independence between treatment assignment and covariates but not data errors (possible in an unblinded randomized trial). We review moment-based estimators to incorporate the audit data and propose new multiple imputation estimators. The performance of estimators is studied in simulations. Results When treatment is randomized and unrelated to data errors, estimates of the treatment effect using the original error-prone data (i.e., ignoring the audit results) are unbiased. In this setting, both moment and multiple imputation estimators incorporating audit data are more variable than standard analyses using the original data. In contrast, in settings where treatment is randomized but correlated with data errors and in settings where treatment is not randomized, standard treatment effect estimates will be biased. And in all settings, parameter estimates for the original, error-prone covariates will be biased. Treatment and covariate effect estimates can be corrected by incorporating audit data using either the multiple imputation or moment-based approaches. Bias, precision, and coverage of confidence intervals improve as the audit size increases. Limitations The extent of bias and the performance of methods depend on the extent and nature of the error as well as the size of the audit. This work only considers methods for the linear model. Settings much different than those considered here need further study. Conclusions In randomized trials with continuous outcomes and treatment assignment independent of data errors, standard analyses of treatment effects will be unbiased and are recommended. However, if treatment assignment is correlated with data errors or other covariates, naive analyses may be biased. In these settings, and when covariate effects are of interest, approaches for incorporating audit results should be considered. PMID:22848072
Wang, Kevin Yuqi; Vankov, Emilian R; Lin, Doris Da May
2018-02-01
OBJECTIVE Oligodendroglioma is a rare primary CNS neoplasm in the pediatric population, and only a limited number of studies in the literature have characterized this entity. Existing studies are limited by small sample sizes and discrepant interstudy findings in identified prognostic factors. In the present study, the authors aimed to increase the statistical power in evaluating for potential prognostic factors of pediatric oligodendrogliomas and sought to reconcile the discrepant findings present among existing studies by performing an individual-patient-data (IPD) meta-analysis and using multiple imputation to address data not directly available from existing studies. METHODS A systematic search was performed, and all studies found to be related to pediatric oligodendrogliomas and associated outcomes were screened for inclusion. Each study was searched for specific demographic and clinical characteristics of each patient and the duration of event-free survival (EFS) and overall survival (OS). Given that certain demographic and clinical information of each patient was not available within all studies, a multivariable imputation via chained equations model was used to impute missing data after the mechanism of missing data was determined. The primary end points of interest were hazard ratios for EFS and OS, as calculated by the Cox proportional-hazards model. Both univariate and multivariate analyses were performed. The multivariate model was adjusted for age, sex, tumor grade, mixed pathologies, extent of resection, chemotherapy, radiation therapy, tumor location, and initial presentation. A p value of less than 0.05 was considered statistically significant. RESULTS A systematic search identified 24 studies with both time-to-event and IPD characteristics available, and a total of 237 individual cases were available for analysis. A median of 19.4% of the values among clinical, demographic, and outcome variables in the compiled 237 cases were missing. Multivariate Cox regression analysis revealed subtotal resection (p = 0.007 [EFS] and 0.043 [OS]), initial presentation of headache (p = 0.006 [EFS] and 0.004 [OS]), mixed pathologies (p = 0.005 [EFS] and 0.049 [OS]), and location of the tumor in the parietal lobe (p = 0.044 [EFS] and 0.030 [OS]) to be significant predictors of tumor progression or recurrence and death. CONCLUSIONS The use of IPD meta-analysis provides a valuable means for increasing statistical power in investigations of disease entities with a very low incidence. Missing data are common in research, and multiple imputation is a flexible and valid approach for addressing this issue, when it is used conscientiously. Undergoing subtotal resection, having a parietal tumor, having tumors with mixed pathologies, and suffering headaches at the time of diagnosis portended a poorer prognosis in pediatric patients with oligodendroglioma.
Fu, Yong-Bi
2014-01-01
Genotyping by sequencing (GBS) recently has emerged as a promising genomic approach for assessing genetic diversity on a genome-wide scale. However, concerns remain about the uniquely large imbalance in GBS genotype data. Although genotype imputation methods have been proposed to infer missing observations, little is known about the reliability of a genetic diversity analysis of GBS data with up to 90% of observations missing. Here we performed an empirical assessment of accuracy in genetic diversity analysis of highly incomplete single nucleotide polymorphism genotypes with imputation. Three large single-nucleotide polymorphism genotype data sets for corn, wheat, and rice were acquired; data sets with up to 90% missing observations were randomly generated, and the missing genotypes were then imputed with three map-independent imputation methods. Estimating heterozygosity and the inbreeding coefficient from original, missing, and imputed data revealed variable patterns of bias across the assessed levels of missingness and genotype imputation, but the estimation biases were smaller for missing data without genotype imputation. The estimates of genetic differentiation were rather robust up to 90% of missing observations but became substantially biased when missing genotypes were imputed. The estimates of topology accuracy for four representative samples of interested groups generally were reduced with increased levels of missing genotypes. Probabilistic principal component analysis based imputation performed better in terms of topology accuracy than analyses of missing data without genotype imputation. These findings are not only significant for understanding the reliability of genetic diversity analysis with respect to large missing data and genotype imputation but also instructive for performing a proper genetic diversity analysis of highly incomplete GBS or other genotype data. PMID:24626289
Missing value imputation strategies for metabolomics data.
Armitage, Emily Grace; Godzien, Joanna; Alonso-Herranz, Vanesa; López-Gonzálvez, Ángeles; Barbas, Coral
2015-12-01
Missing values can arise for different reasons and, depending on their origin, should be considered and dealt with in different ways. In this research, four methods of imputation have been compared with respect to their effects on the normality and variance of the data, on statistical significance, and on the approximation of a suitable threshold for accepting missing data as truly missing. Additionally, the effects of different strategies for controlling the familywise error rate or false discovery rate, and how they interact with the different strategies for missing value imputation, have been evaluated. Missing values were found to affect the normality and variance of the data, and k-means nearest neighbour imputation was the best method tested for restoring these. Bonferroni correction was the best method for maximizing true positives and minimizing false positives, and it was observed that as little as 40% missing data could be truly missing. The range between 40 and 70% missing values was defined as a "gray area", and therefore a strategy has been proposed that balances the optimal imputation strategy, k-means nearest neighbour, with the best approximation for positioning real zeros. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
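A minimal sketch of the kind of pipeline evaluated here, with assumed stand-ins: scikit-learn's KNNImputer (a plain k-nearest-neighbour imputer standing in for the paper's k-means nearest neighbour variant) followed by per-metabolite t-tests with Bonferroni control of the familywise error rate.

```python
from scipy.stats import ttest_ind
from sklearn.impute import KNNImputer
from statsmodels.stats.multitest import multipletests

def impute_and_test(X_case, X_ctrl, k=5, alpha=0.05):
    """KNN-impute case and control metabolite matrices (samples x
    metabolites, NaN = missing), then run per-metabolite t-tests with
    Bonferroni correction."""
    imputer = KNNImputer(n_neighbors=k)
    Xc = imputer.fit_transform(X_case)
    Xt = imputer.fit_transform(X_ctrl)
    pvals = ttest_ind(Xc, Xt, axis=0).pvalue
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha,
                                        method="bonferroni")
    return reject, p_adj
```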
ERIC Educational Resources Information Center
Golino, Hudson F.; Gomes, Cristiano M. A.
2016-01-01
This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…
David M. Bell; Matthew J. Gregory; Janet L. Ohmann
2015-01-01
Imputation provides a useful method for mapping forest attributes across broad geographic areas based on field plot measurements and Landsat multi-spectral data, but the resulting map products may be of limited use without corresponding analyses of uncertainties in predictions. In the case of k-nearest neighbor (kNN) imputation with k = 1, such as the Gradient Nearest...
LinkImputeR: user-guided genotype calling and imputation for non-model organisms.
Money, Daniel; Migicovsky, Zoë; Gardner, Kyle; Myles, Sean
2017-07-10
Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms. Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored, and makes use of all available DNA sequence information for the purposes of genotype calling and imputation. It is specifically designed for non-model organisms since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of genotypes generated from LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude, can improve genotyping accuracy by several percent and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis. By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from NGS technologies. It enables the user to quickly and easily examine the effects of varying thresholds and filters on the number and quality of the resulting genotype calls. In this manner, users can decide on thresholds that are most suitable for their purposes. We show that LinkImputeR can significantly augment the value and utility of NGS data sets, especially in non-model organisms with poor genomic resources.
de Vocht, Frank; Lee, Brian
2014-08-01
Studies have suggested that residential exposure to extremely low frequency (50 Hz) electromagnetic fields (ELF-EMF) from high voltage cables, overhead power lines, electricity substations or towers is associated with reduced birth weight and may be associated with adverse birth outcomes or even miscarriages. We previously conducted a study of 140,356 singleton live births between 2004 and 2008 in Northwest England, which suggested that close residential proximity (≤ 50 m) to ELF-EMF sources was associated with a 212 g reduction in average birth weight (95% CI: -395 to -29 g) but not with statistically significant increased risks for other adverse perinatal outcomes. However, the cohort was limited by missing data for most potentially confounding variables, including maternal smoking during pregnancy, which was only available for a small subgroup, while residual confounding also could not be excluded. This study, using the same cohort, was conducted to minimize the effects of these problems using multiple imputation to address missing data and propensity score matching to minimize residual confounding. Missing data were imputed using multiple imputation by chained equations to generate five datasets. For each dataset 115 exposed women (residing ≤ 50 m from a residential ELF-EMF source) were propensity score matched to 1150 unexposed women. After doubly robust confounder adjustment, close proximity to a residential ELF-EMF source remained associated with a reduction in birth weight of 116 g (95% confidence interval: -224 to -7 g). No effect was found for proximity ≤ 100 m compared to women living further away. These results indicate that although the effect size was about half of the effect previously reported, close maternal residential proximity to sources of ELF-EMF remained associated with suboptimal fetal growth. Copyright © 2014 Elsevier Ltd. All rights reserved.
Hydrologic Response to Climate Change: Missing Precipitation Data Matters for Computed Timing Trends
NASA Astrophysics Data System (ADS)
Daniels, B.
2016-12-01
This work demonstrates the derivation of climate timing statistics and their application to determine resulting hydroclimate impacts. Long-term daily precipitation observations from 50 California stations were used to compute climate trends of precipitation event Intensity, event Duration and Pause between events. Each precipitation event trend was then applied as input to a PRMS hydrology model which showed hydrology changes to recharge, baseflow, streamflow, etc. An important concern was precipitation uncertainty induced by missing observation values, causing errors in the quantification of precipitation trends. Many standard statistical techniques such as ARIMA and simple endogenous or even exogenous imputation were applied but failed to resolve these uncertainties. What did resolve them was the use of multiple imputation techniques, which involved fitting Weibull probability distributions to the multiple imputed values for the three precipitation trends. Permutation resampling techniques using Monte Carlo processing were then applied to the multiple imputation values to derive significance p-values for each trend. Significance at the 95% level was found for Intensity at 11 of the 50 stations, for Duration at 16, and for Pause at 19, of which 12 were significant at the 99% level. The significance-weighted trends for California are Intensity -4.61% per decade, Duration +3.49% per decade, and Pause +3.58% per decade. Two California basins with PRMS hydrologic models were studied: Feather River in the northern Sierra Nevada mountains and the central coast Soquel-Aptos. Each local trend was changed without changing the other trends or the total precipitation. Feather River Basin's critical supply to Lake Oroville and the State Water Project benefited from a total streamflow increase of 1.5%. The Soquel-Aptos Basin water supply was impacted by a total groundwater recharge decrease of -7.5% and a streamflow decrease of -3.2%.
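The permutation resampling step can be sketched in a few lines, under assumed inputs: the observed trend slope is compared with the distribution of slopes obtained after randomly permuting the year labels, giving a Monte Carlo p-value.

```python
import numpy as np

rng = np.random.default_rng(1)

def trend_pvalue(years, values, n_perm=10000):
    """Monte Carlo permutation p-value for a linear trend: shuffle the
    year labels and compare permuted slopes with the observed slope."""
    slope = np.polyfit(years, values, 1)[0]
    perm = np.array([np.polyfit(rng.permutation(years), values, 1)[0]
                     for _ in range(n_perm)])
    return float(np.mean(np.abs(perm) >= np.abs(slope)))
```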
CHAI, Lian En; LAW, Chow Kuan; MOHAMAD, Mohd Saberi; CHONG, Chuii Khim; CHOON, Yee Wen; DERIS, Safaai; ILLIAS, Rosli Md
2014-01-01
Background: Gene expression data often contain missing expression values. Therefore, several imputation methods have been applied to estimate the missing values, including k-nearest neighbour (kNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). However, the effects of these imputation methods on the modelling of gene regulatory networks from gene expression data have rarely been investigated and analysed using a dynamic Bayesian network (DBN). Methods: In the present study, we separately imputed datasets of the Escherichia coli S.O.S. DNA repair pathway and the Saccharomyces cerevisiae cell cycle pathway with kNN, LLS, and BPCA, and subsequently used these to generate gene regulatory networks (GRNs) using a discrete DBN. We made comparisons on the basis of previous studies in order to select the gene network with the least error. Results: We found that BPCA and LLS performed better on larger networks (based on the S. cerevisiae dataset), whereas kNN performed better on smaller networks (based on the E. coli dataset). Conclusion: The results suggest that the performance of each imputation method depends on the size of the dataset, and this subsequently affects the modelling of the resultant GRNs using a DBN. In addition, on the basis of these results, a DBN has the capacity to discover potential edges, as well as display interactions, between genes. PMID:24876803
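For readers unfamiliar with kNN imputation, a minimal sketch using scikit-learn's KNNImputer as a stand-in for the kNN method compared in the study (the authors' implementation may differ):

```python
# Illustrative kNN imputation on a small expression-like matrix.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 1.9, 3.0],
              [0.9, np.nan, 2.9],
              [5.0, 6.0, 7.0]])

# Each missing entry is replaced by the distance-weighted mean of the
# corresponding feature in the k most similar rows (here k = 2).
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```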
Krausch-Hofmann, Stefanie; Bogaerts, Kris; Hofmann, Michael; de Almeida Mello, Johanna; Fávaro Moreira, Nádia Cristina; Lesaffre, Emmanuel; Declerck, Dominique; Declercq, Anja; Duyck, Joke
2015-01-01
Missing data within the comprehensive geriatric assessment of the interRAI suite of assessment instruments potentially imply the under-detection of conditions that require care as well as the risk of biased statistical results. Impaired oral health in older individuals has to be registered accurately, as it causes pain and discomfort and is related to the general health status. This study was based on interRAI-Home Care (HC) baseline data from 7590 subjects (mean age 81.2 years, SD 6.9) in Belgium. It was investigated whether missingness of the oral health-related items was associated with selected variables of general health. It was also determined whether multiple imputation of missing data affected the associations between oral and general health. Multivariable logistic regression was used to determine whether the prevalence of missingness in the oral health-related variables was associated with activities of daily life (ADLH), cognitive performance (CPS2) and depression (DRS). Associations between oral health and ADLH, CPS2 and DRS were determined, with missing data treated by (1) the complete-case technique and (2) multiple imputation, and the results were compared. The individual oral health-related variables had a similar proportion of missing values, ranging from 16.3% to 17.2%. The prevalence of missing data in all oral health-related variables was significantly associated with symptoms of depression (dental prosthesis use OR 1.66, CI 1.41-1.95; damaged teeth OR 1.74, CI 1.48-2.04; chewing problems OR 1.74, CI 1.47-2.05; dry mouth OR 1.65, CI 1.40-1.94). Missingness in damaged teeth (OR 1.27, CI 1.08-1.48), chewing problems (OR 1.22, CI 1.04-1.44) and dry mouth (OR 1.23, CI 1.05-1.44) occurred more frequently in cognitively impaired subjects. ADLH was not associated with the prevalence of missing data. When comparing the complete-case technique with the multiple imputation approach, nearly identical odds ratios characterized the associations between oral and general health. Cognitively impaired and depressive individuals had a higher risk of missing oral health-related information. Associations between oral health and ADLH, CPS2 and DRS were not influenced by multiple imputation of missing data. Further research should concentrate on the mechanisms that mediate the occurrence of missingness in order to develop preventative strategies.
Nohara, Ryuki; Endo, Yui; Murai, Akihiko; Takemura, Hiroshi; Kouchi, Makiko; Tada, Mitsunori
2016-08-01
Individual human models are usually created by direct 3D scanning or by deforming a template model according to measured dimensions. In this paper, we propose a method to estimate all the dimensions necessary for human model individualization (the full set) from a small number of measured dimensions (a subset) and a human dimension database. For this purpose, we solved a multiple regression equation from the dimension database, with the full-set dimensions as the objective variables and the subset dimensions as the explanatory variables. The full-set dimensions are thus obtained by simply multiplying the subset dimensions by the coefficient matrix of the regression equation. We verified the accuracy of our method by imputing hand, foot, and whole-body dimensions from their dimension databases, using leave-one-out cross validation. The mean absolute errors (MAE) between the measured and the estimated dimensions were computed from four dimensions (hand length, hand breadth, middle finger breadth at proximal, and middle finger depth at proximal) in the hand, three dimensions (foot length, foot breadth, and lateral malleolus height) in the foot, and two measurements (height and weight) for the whole body. The average MAE of the non-measured dimensions was 4.58% in the hand, 4.42% in the foot, and 3.54% in the whole body, while that of the measured dimensions was 0.00%.
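A hedged sketch of the regression idea on synthetic data: a multi-output linear regression maps the measured subset to estimates of the full set, which amounts to multiplying the subset by a coefficient matrix. All names, sizes, and data below are illustrative, not the paper's database.

```python
# Predict a full set of body dimensions from a measured subset.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
subset = rng.normal(size=(n, 4))              # e.g. 4 measured hand dimensions
B = rng.normal(size=(4, 20))                  # latent linear structure
fullset = subset @ B + rng.normal(scale=0.1, size=(n, 20))  # 20 target dims

model = LinearRegression().fit(subset, fullset)   # one regression per target
new_subject = rng.normal(size=(1, 4))             # a new subject's 4 measurements
estimated_dimensions = model.predict(new_subject) # all 20 dimensions at once
print(estimated_dimensions.shape)                 # -> (1, 20)
```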
The impact of missing trauma data on predicting massive transfusion
Trickey, Amber W.; Fox, Erin E.; del Junco, Deborah J.; Ning, Jing; Holcomb, John B.; Brasel, Karen J.; Cohen, Mitchell J.; Schreiber, Martin A.; Bulger, Eileen M.; Phelan, Herb A.; Alarcon, Louis H.; Myers, John G.; Muskat, Peter; Cotton, Bryan A.; Wade, Charles E.; Rahbar, Mohammad H.
2013-01-01
INTRODUCTION Missing data are inherent in clinical research and may be especially problematic for trauma studies. This study describes a sensitivity analysis to evaluate the impact of missing data on clinical risk prediction algorithms. Three blood transfusion prediction models were evaluated using an observational trauma dataset with naturally occurring missing data. METHODS The PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study included patients requiring ≥ 1 unit of red blood cells (RBC) at 10 participating U.S. Level I trauma centers from July 2009 to October 2010. Physiologic, laboratory, and treatment data were collected prospectively up to 24 h after hospital admission. Subjects who received ≥ 10 RBC units within 24 h of admission were classified as massive transfusion (MT) patients. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation. A sensitivity analysis for missing data was conducted to determine the upper and lower bounds for correct classification percentages. RESULTS PROMMTT enrolled 1,245 subjects. MT was received by 297 patients (24%). Missing data percentages ranged from 2.2% (heart rate) to 45% (respiratory rate). The proportions of complete cases utilized in the MT prediction models ranged from 41% to 88%. All models demonstrated similar correct classification percentages using complete case analysis and multiple imputation. In the sensitivity analysis, the upper-lower bound ranges of correct classification per model were 4%, 10%, and 12%. Predictive accuracy for all models using PROMMTT data was lower than reported in the original datasets. CONCLUSIONS Evaluating the accuracy of clinical prediction models with missing data can be misleading, especially with many predictor variables and moderate levels of missingness per variable. The proposed sensitivity analysis describes the influence of missing data on risk prediction algorithms. Reporting upper/lower bounds for percent correct classification may be more informative than multiple imputation, which provided results similar to complete case analysis in this study. PMID:23778514
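A toy illustration of the bounding logic, assuming a made-up prediction rule and simulated data (not the PROMMTT models): cases with a missing predictor are counted as all misclassified for the lower bound and all correctly classified for the upper bound.

```python
# Upper/lower bounds on correct classification under missing predictors.
import numpy as np

rng = np.random.default_rng(1)
n = 500
sbp = rng.normal(110, 25, n)            # systolic blood pressure, complete
hr = rng.normal(100, 20, n)             # heart rate
hr[rng.random(n) < 0.3] = np.nan        # ~30% missing, moderate missingness
mt = (sbp < 90).astype(int)             # toy "true" massive-transfusion flag

def predict(sbp_i, hr_i):
    """A fixed, hypothetical classification rule."""
    return int(sbp_i < 100 and hr_i > 105)

observed = ~np.isnan(hr)
correct = sum(predict(s, h) == y
              for s, h, y in zip(sbp[observed], hr[observed], mt[observed]))
n_missing = int((~observed).sum())

lower = correct / n                     # every incomplete case misclassified
upper = (correct + n_missing) / n       # every incomplete case correct
print(f"correct classification bounded by [{lower:.1%}, {upper:.1%}]")
```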
Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny
2016-01-01
Atheoretical large-scale data mining techniques using machine learning algorithms have promise for the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9, and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and between total bilirubin and current smokers (p < 0.001). The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.
Tang, Yongqiang
2018-04-30
The controlled imputation method refers to a class of pattern mixture models that have been commonly used in recent years as sensitivity analyses for longitudinal clinical trials with nonignorable dropout. These pattern mixture models assume that participants in the experimental arm after dropout have response profiles similar to those of control participants, or have worse outcomes than otherwise similar participants who remain on the experimental treatment. In spite of its popularity, controlled imputation has not been formally developed for longitudinal binary and ordinal outcomes, partly due to the lack of a natural multivariate distribution for such endpoints. In this paper, we propose two approaches for implementing controlled imputation for binary and ordinal data, based respectively on sequential logistic regression and the multivariate probit model. Efficient Markov chain Monte Carlo algorithms are developed for missing data imputation, using the monotone data augmentation technique for the sequential logistic regression and a parameter-expanded monotone data augmentation scheme for the multivariate probit model. We assess the performance of the proposed procedures by simulation and by the analysis of a schizophrenia clinical trial, and compare them with the fully conditional specification, last observation carried forward, and baseline observation carried forward imputation methods. Copyright © 2018 John Wiley & Sons, Ltd.
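A much-simplified sketch of the copy-reference flavour of controlled imputation for a single binary endpoint: dropouts in the experimental arm are imputed from a response model fitted to the control arm. The paper's methods use sequential models over visits with MCMC, so this toy version, with entirely synthetic data and names, only conveys the core idea.

```python
# Copy-reference controlled imputation for one binary endpoint (toy version).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 600
arm = rng.integers(0, 2, n)                          # 0 = control, 1 = treated
baseline = rng.normal(size=n)
p_resp = 1 / (1 + np.exp(-(0.5 * baseline + 0.8 * arm - 0.2)))
y = (rng.random(n) < p_resp).astype(float)
y[(arm == 1) & (rng.random(n) < 0.3)] = np.nan       # 30% dropout on treatment

ctrl = arm == 0                                      # control arm is complete here
ref_model = LogisticRegression().fit(baseline[ctrl].reshape(-1, 1), y[ctrl])

# Impute treated dropouts as draws from the control-arm (reference) model.
miss = np.isnan(y)
p_ref = ref_model.predict_proba(baseline[miss].reshape(-1, 1))[:, 1]
y_imp = y.copy()
y_imp[miss] = (rng.random(miss.sum()) < p_ref).astype(float)  # draw, not round
print(f"treated-arm responder rate after controlled imputation: "
      f"{y_imp[arm == 1].mean():.2f}")
```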
A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes
2011-01-01
Background Knowing the phase of marker genotype data can be useful in genome-wide association studies, because it makes it possible to use analysis frameworks that account for identity by descent or parent of origin of alleles and it can lead to a large increase in data quantities via genotype or sequence imputation. Long-range phasing and haplotype library imputation constitute a fast and accurate method to impute phase for SNP data. Methods A long-range phasing and haplotype library imputation algorithm was developed. It combines information from surrogate parents and long haplotypes to resolve phase in a manner that is not dependent on the family structure of a dataset or on the presence of pedigree information. Results The algorithm performed well in both simulated and real livestock and human datasets in terms of both phasing accuracy and computation efficiency. The percentage of alleles that could be phased in both simulated and real datasets of varying size generally exceeded 98% while the percentage of alleles incorrectly phased in simulated data was generally less than 0.5%. The accuracy of phasing was affected by dataset size, with lower accuracy for dataset sizes less than 1000, but was not affected by effective population size, family data structure, presence or absence of pedigree information, and SNP density. The method was computationally fast. In comparison to a commonly used statistical method (fastPHASE), the current method made about 8% less phasing mistakes and ran about 26 times faster for a small dataset. For larger datasets, the differences in computational time are expected to be even greater. A computer program implementing these methods has been made available. Conclusions The algorithm and software developed in this study make feasible the routine phasing of high-density SNP chips in large datasets. PMID:21388557
Yang, Jian; Bakshi, Andrew; Zhu, Zhihong; Hemani, Gibran; Vinkhuyzen, Anna A E; Lee, Sang Hong; Robinson, Matthew R; Perry, John R B; Nolte, Ilja M; van Vliet-Ostaptchouk, Jana V; Snieder, Harold; Esko, Tonu; Milani, Lili; Mägi, Reedik; Metspalu, Andres; Hamsten, Anders; Magnusson, Patrik K E; Pedersen, Nancy L; Ingelsson, Erik; Soranzo, Nicole; Keller, Matthew C; Wray, Naomi R; Goddard, Michael E; Visscher, Peter M
2015-10-01
We propose a method (GREML-LDMS) to estimate heritability for human complex traits in unrelated individuals using whole-genome sequencing data. We demonstrate using simulations based on whole-genome sequencing data that ∼97% and ∼68% of variation at common and rare variants, respectively, can be captured by imputation. Using the GREML-LDMS method, we estimate from 44,126 unrelated individuals that all ∼17 million imputed variants explain 56% (standard error (s.e.) = 2.3%) of variance for height and 27% (s.e. = 2.5%) of variance for body mass index (BMI), and we find evidence that height- and BMI-associated variants have been under natural selection. Considering the imperfect tagging of imputation and potential overestimation of heritability from previous family-based studies, heritability is likely to be 60-70% for height and 30-40% for BMI. Therefore, the missing heritability is small for both traits. For further discovery of genes associated with complex traits, a study design with SNP arrays followed by imputation is more cost-effective than whole-genome sequencing at current prices.
Ahmad, Meraj; Sinha, Anubhav; Ghosh, Sreya; Kumar, Vikrant; Davila, Sonia; Yajnik, Chittaranjan S; Chandak, Giriraj R
2017-07-27
Imputation is a computational method, based on the principle of haplotype sharing, that allows enrichment of genome-wide association study datasets. It depends on the haplotype structure of the population and the density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels that have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared imputation accuracy using the 1000 Genomes phase 3 reference panel and a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using genome-wide data from 1880 individuals, we demonstrate that the WIP works better than the 1000 Genomes phase 3 panel and, when merged with it, significantly improves imputation accuracy throughout the minor allele frequency range. We also show that imputation using only the South Asian component of the 1000 Genomes phase 3 panel works as well as the merged panel, making imputation computationally less intensive. Thus, our study stresses that imputation accuracy using the 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.
New Insights into Handling Missing Values in Environmental Epidemiological Studies
Roda, Célina; Nicolis, Ioannis; Momas, Isabelle; Guihenneuc, Chantal
2014-01-01
Missing data are unavoidable in environmental epidemiologic surveys. The aim of this study was to compare methods for handling large amounts of missing values: omission of missing values, single and multiple imputation (through linear regression or partial least squares regression), and a fully Bayesian approach. These methods were applied to the PARIS birth cohort, where indoor domestic pollutant measurements were performed in a random sample of babies' dwellings. A simulation study was conducted to assess the performance of the different approaches with a high proportion of missing values (from 50% to 95%). Different simulation scenarios were carried out, controlling the true value of the association (odds ratio of 1.0, 1.2, and 1.4) and varying the health outcome prevalence. When a large amount of data was missing, omitting the missing data reduced statistical power and inflated standard errors, which affected the significance of the association. Single imputation underestimated the variability and considerably increased the risk of type I error. All approaches were conservative, except the Bayesian joint model. In the case of a common health outcome, the fully Bayesian approach is the most efficient (low root mean square error, reasonable type I error, and high statistical power). Nevertheless, for a less prevalent event, the type I error is increased and the statistical power reduced. The estimated posterior distribution of the OR is useful for refining the conclusion. Among the methods for handling missing values, no approach is absolutely the best, but when the usual approaches (e.g., single imputation) are not sufficient and large amounts of data are missing, jointly modelling the missingness process and the health association is more efficient. PMID:25226278
Evaluation and application of summary statistic imputation to discover new height-associated loci.
Rüeger, Sina; McDaid, Aaron; Kutalik, Zoltán
2018-05-01
As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome, as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and its practical utility have not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error and better distinguishes true associations from null ones: we observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, and 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.
Fiero, Mallorie H; Hsu, Chiu-Hsieh; Bell, Melanie L
2017-11-20
We extend the pattern-mixture approach to handle missing continuous outcome data in longitudinal cluster randomized trials, which randomize groups of individuals to treatment arms rather than the individuals themselves. Individuals who drop out at the same time point are grouped into the same dropout pattern. We approach extrapolation of the pattern-mixture model by applying multilevel multiple imputation, which imputes missing values while appropriately accounting for the hierarchical data structure found in cluster randomized trials. To assess parameters of interest under various missing data assumptions, imputed values are multiplied by a sensitivity parameter, k, which scales them up or down. Using simulated data, we show that estimates of parameters of interest can vary widely under differing missing data assumptions. We conduct a sensitivity analysis using real data from a cluster randomized trial by increasing k until the treatment effect inference changes. By performing a sensitivity analysis for missing data, researchers can assess whether certain missing data assumptions are reasonable for their cluster randomized trial. Copyright © 2017 John Wiley & Sons, Ltd.
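A single-level toy version of the sensitivity-parameter idea (the paper's imputation is multilevel and model-based, which this sketch does not reproduce): imputed values are scaled by k and the treatment effect is re-estimated over a grid of k values.

```python
# Sensitivity analysis: scale imputed outcomes by k, re-estimate the effect.
import numpy as np

rng = np.random.default_rng(7)
n = 400
arm = rng.integers(0, 2, n)                          # 0 = control, 1 = treated
y = 1.0 + 0.5 * arm + rng.normal(0, 1, n)            # true effect = 0.5
missing = rng.random(n) < 0.25                       # 25% dropout

for k in (1.0, 0.8, 0.6, 0.4):
    y_sens = y.copy()
    # Impute dropouts from the observed arm means, then scale by k.
    for a in (0, 1):
        donor_mean = y[(arm == a) & ~missing].mean()
        y_sens[(arm == a) & missing] = k * donor_mean
    effect = y_sens[arm == 1].mean() - y_sens[arm == 0].mean()
    print(f"k={k:.1f}: estimated treatment effect {effect:.3f}")
```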
Webb-Robertson, Bobbie-Jo M.; Wiberg, Holli K.; Matzke, Melissa M.; ...
2015-04-09
In this review, we apply selected imputation strategies to label-free liquid chromatography–mass spectrometry (LC–MS) proteomics datasets to evaluate their accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches on their individual merits and discuss the caveats of each approach with respect to the example LC–MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performance with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases performing classification without imputation yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. In summary, on the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.
van Walraven, Carl
2017-04-01
Diagnostic codes used in administrative databases cause bias due to misclassification of patient disease status. It is unclear which methods minimize this bias. Serum creatinine measures were used to determine severe renal failure status in 50,074 hospitalized patients. The true prevalence of severe renal failure and its association with covariates were measured. These were compared with results in which renal failure status was determined using surrogate measures, including: (1) diagnostic codes; (2) categorization of probability estimates of renal failure determined from a previously validated model; or (3) bootstrap-based imputation of disease status using model-derived probability estimates. Bias in estimates of severe renal failure prevalence and its association with covariates was minimal when bootstrap methods were used to impute renal failure status from model-based probability estimates. In contrast, biases were extensive when renal failure status was determined using codes or methods in which the model-based condition probability was categorized. Bias due to misclassification from inaccurate diagnostic codes can be minimized by using bootstrap methods to impute condition status from multivariable model-derived probability estimates. Copyright © 2017 Elsevier Inc. All rights reserved.
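A hedged numeric sketch of the bootstrap imputation idea: within each bootstrap replicate, each patient's condition status is drawn as a Bernoulli variable from a model-derived probability rather than being thresholded. The probabilities below are simulated, not the paper's model outputs.

```python
# Bootstrap imputation of a binary condition from model-based probabilities.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
p_renal_failure = rng.beta(0.5, 8, n)   # model-derived probability per patient

prevalences = []
for _ in range(200):                    # bootstrap replicates
    idx = rng.integers(0, n, n)         # resample patients with replacement
    # Impute each resampled patient's status as a draw from his or her
    # model-based probability (preserves uncertainty, unlike thresholding).
    status = rng.random(n) < p_renal_failure[idx]
    prevalences.append(status.mean())

est = np.mean(prevalences)
lo, hi = np.percentile(prevalences, [2.5, 97.5])
print(f"prevalence {est:.3f} (95% interval {lo:.3f}-{hi:.3f})")
```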
Comulada, W. Scott
2015-01-01
Stata's mi commands provide powerful tools for conducting multiple imputation in the presence of ignorable missing data. In this article, I present Stata code to extend the capabilities of the mi commands to address two areas of statistical inference where results are not easily aggregated across imputed datasets. First, mi commands are restricted to covariate selection; I show how to assess model fit in order to correctly specify a model. Second, the mi commands readily aggregate model-based standard errors; I show how standard errors can be bootstrapped for situations where model assumptions may not be met. I illustrate model specification and bootstrapping on frequency counts of the number of times alcohol was consumed, in data with missing observations from a behavioral intervention. PMID:26973439
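The aggregation such commands perform follows Rubin's rules. A minimal Python analogue with made-up per-imputation estimates, shown only to make the pooling arithmetic concrete:

```python
# Rubin's rules: pool point estimates and SEs across m imputed datasets.
import numpy as np

betas = np.array([0.52, 0.47, 0.55, 0.50, 0.49])   # estimate per imputation
ses = np.array([0.10, 0.11, 0.10, 0.12, 0.10])     # model-based SE per imputation

m = len(betas)
beta_bar = betas.mean()                 # pooled point estimate
W = np.mean(ses**2)                     # within-imputation variance
B = betas.var(ddof=1)                   # between-imputation variance
T = W + (1 + 1 / m) * B                 # total variance (Rubin, 1987)
print(f"pooled beta = {beta_bar:.3f}, pooled SE = {np.sqrt(T):.3f}")
```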
Alternative Methods for Handling Attrition
Foster, E. Michael; Fang, Grace Y.
2009-01-01
Using data from the evaluation of the Fast Track intervention, this article illustrates three methods for handling attrition. Multiple imputation and ignorable maximum likelihood estimation produce estimates that are similar to those based on listwise-deleted data. A panel selection model that allows for selective dropout reveals that highly aggressive boys accumulate in the treatment group over time and produces a larger estimate of treatment effect. In contrast, this model produces a smaller treatment effect for girls. The article's conclusion discusses the strengths and weaknesses of the alternative approaches and outlines ways in which researchers might improve their handling of attrition. PMID:15358906
ERIC Educational Resources Information Center
Köse, Alper
2014-01-01
The primary objective of this study was to examine the effect of missing data on goodness of fit statistics in confirmatory factor analysis (CFA). For this aim, four missing data handling methods; listwise deletion, full information maximum likelihood, regression imputation and expectation maximization (EM) imputation were examined in terms of…
Investigation of Missing Responses in Implementation of Cognitive Diagnostic Models
ERIC Educational Resources Information Center
Dai, Shenghai
2017-01-01
This dissertation is aimed at investigating the impact of missing data and evaluating the performance of five selected methods for handling missing responses in the implementation of Cognitive Diagnostic Models (CDMs). The five methods are: a) treating missing data as incorrect (IN), b) person mean imputation (PM), c) two-way imputation (TW), d)…
Schminkey, Donna L; von Oertzen, Timo; Bullock, Linda
2016-08-01
With increasing access to population-based data and electronic health records for secondary analysis, missing data are common. In the social and behavioral sciences, missing data frequently are handled with multiple imputation methods or full information maximum likelihood (FIML) techniques, but healthcare researchers have not embraced these methodologies to the same extent and more often use either traditional imputation techniques or complete case analysis, which can compromise power and introduce unintended bias. This article is a review of options for handling missing data, concluding with a case study demonstrating the utility of multilevel structural equation modeling using full information maximum likelihood (MSEM with FIML) to handle large amounts of missing data. MSEM with FIML is a parsimonious and hypothesis-driven strategy to cope with large amounts of missing data without compromising power or introducing bias. This technique is relevant for nurse researchers faced with ever-increasing amounts of electronic data and decreasing research budgets. © 2016 Wiley Periodicals, Inc.
METHODS FOR CLUSTERING TIME SERIES DATA ACQUIRED FROM MOBILE HEALTH APPS.
Tignor, Nicole; Wang, Pei; Genes, Nicholas; Rogers, Linda; Hershman, Steven G; Scott, Erick R; Zweig, Micol; Yvonne Chan, Yu-Feng; Schadt, Eric E
2017-01-01
In our recent Asthma Mobile Health Study (AMHS), thousands of asthma patients across the country contributed medical data through the iPhone Asthma Health App on a daily basis for an extended period of time. The collected data included daily self-reported asthma symptoms, symptom triggers, and real-time geographic location information. The AMHS is just one of many studies now occurring in the context of the thousands of mobile health apps aimed at improving wellness and better managing chronic disease conditions by leveraging the passive and active collection of data from mobile, handheld smart devices. The ability to identify patient groups or patterns of symptoms that might predict adverse outcomes, such as asthma exacerbations or hospitalizations, from these types of large, prospectively collected data sets would be of significant general interest. However, conventional clustering methods cannot be applied to these types of longitudinally collected data, especially survey data actively collected from app users, given heterogeneous patterns of missing values due to: (1) varying survey response rates among different users, (2) varying survey response rates over time for each user, and (3) non-overlapping periods of enrollment among different users. To handle such a complicated missing data structure, we proposed a probability imputation model to infer missing data. We also employed a consensus clustering strategy in tandem with the multiple imputation procedure. Through simulation studies under a range of scenarios reflecting real data conditions, we identified favorable performance of the proposed method over strategies that impute missing values through low-rank matrix completion. When applying the proposed new method to study asthma triggers and symptoms collected as part of the AMHS, we identified several patient groups with distinct phenotype patterns. Further validation of the methods described in this paper might be used to identify clinically important patterns in large data sets with complicated missing data structures, improving the ability to use such data sets to identify at-risk populations for potential intervention.
NASA Astrophysics Data System (ADS)
Xiao, Q.; Liu, Y.
2017-12-01
Satellite aerosol optical depth (AOD) has been used to assess fine particulate matter (PM2.5) pollution worldwide. However, non-random missing AOD due to cloud cover or high surface reflectance can cause up to 80% data loss and bias model-estimated spatial and temporal trends of PM2.5. Previous studies filled the data gap largely by spatial smoothing, which ignores the impact of cloud cover and meteorology on aerosol loadings and has been shown to perform poorly when monitoring stations are sparse or when there is seasonal large-scale missingness. Using the Yangtze River Delta of China as an example, we present a flexible Multiple Imputation (MI) method that combines cloud fraction, elevation, humidity, temperature, and spatiotemporal trends to impute the missing AOD. A two-stage statistical model driven by gap-filled AOD, meteorology and land use information was then fitted to estimate daily ground PM2.5 concentrations in 2013 and 2014 at 1 km resolution, with complete coverage in space and time. The daily MI models have an average R2 of 0.77, with an inter-quartile range of 0.71 to 0.82 across days. The overall 10-fold cross-validation R2 values of the model were 0.81 and 0.73 for 2013 and 2014, respectively. Predictions with only observational AOD or only imputed AOD showed similar accuracy. This method provides reliable PM2.5 predictions with complete coverage at high resolution. By including all the pixels of all days in model development, this method corrects the sampling bias in satellite-driven air pollution modelling due to non-random missingness in AOD. Compared with previously reported gap-filling methods, the MI method has the strength of not relying on ground PM2.5 measurements, and therefore allows the prediction of historical PM2.5 levels prior to the establishment of regular ground monitoring networks.
Holman, Rebecca; Glas, Cees AW; Lindeboom, Robert; Zwinderman, Aeilko H; de Haan, Rob J
2004-01-01
Background Whenever questionnaires are used to collect data on constructs, such as functional status or health related quality of life, it is unlikely that all respondents will respond to all items. This paper examines ways of dealing with responses in a 'not applicable' category to items included in the AMC Linear Disability Score (ALDS) project item bank. Methods The data examined in this paper come from the responses of 392 respondents to 32 items and form part of the calibration sample for the ALDS item bank. The data are analysed using the one-parameter logistic item response theory model. The four practical strategies for dealing with this type of response are: cold deck imputation; hot deck imputation; treating the missing responses as if these items had never been offered to those individual patients; and using a model which takes account of the 'tendency to respond to items'. Results The item and respondent population parameter estimates were very similar for the strategies involving hot deck imputation; treating the missing responses as if these items had never been offered to those individual patients; and using a model which takes account of the 'tendency to respond to items'. The estimates obtained using the cold deck imputation method were substantially different. Conclusions The cold deck imputation method was not considered suitable for use in the ALDS item bank. The other three methods described can be usefully implemented in the ALDS item bank, depending on the purpose of the data analysis to be carried out. These three methods may be useful for other data sets examining similar constructs, when item response theory based methods are used. PMID:15200681
Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data.
Wei, Runmin; Wang, Jingye; Su, Mingming; Jia, Erik; Chen, Shaoqiu; Chen, Tianlu; Ni, Yan
2018-01-12
Missing values are widespread in mass spectrometry (MS)-based metabolomics data. Various methods have been applied for handling missing values, but the choice of method can significantly affect subsequent data analyses. Typically, there are three types of missing values: missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis were used to evaluate the overall sample distribution. Student's t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed best for MCAR/MAR and QRILC was the favored method for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a publicly accessible web tool for the application of missing value imputation in metabolomics ( https://metabolomics.cc.hawaii.edu/software/MetImp/ ).
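A compact sketch of the masking-and-scoring scheme used for comparisons like this, with synthetic data and only three of the eight methods (zero, half-minimum, mean); RF, SVD, kNN and QRILC are omitted for brevity.

```python
# Mask known entries, impute, and rank methods by NRMSE on the masked cells.
import numpy as np

rng = np.random.default_rng(5)
X = np.exp(rng.normal(size=(100, 30)))          # toy metabolite intensities
mask = rng.random(X.shape) < 0.2                # hide 20% of entries (MCAR)
X_obs = np.where(mask, np.nan, X)

def nrmse(imputed):
    err = imputed[mask] - X[mask]
    return np.sqrt(np.mean(err**2)) / np.std(X[mask])

col_mean = np.nanmean(X_obs, axis=0)
half_min = np.nanmin(X_obs, axis=0) / 2         # proxy for left-censored values

for name, fill in [("zero", np.zeros(30)), ("half-min", half_min),
                   ("mean", col_mean)]:
    imputed = np.where(mask, fill, X_obs)       # fill broadcasts per column
    print(f"{name:>8}: NRMSE = {nrmse(imputed):.3f}")
```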
High-density marker imputation accuracy in sixteen French cattle breeds
2013-01-01
Background Genotyping with the medium-density Bovine SNP50 BeadChip® (50K) is now standard in cattle. The high-density BovineHD BeadChip®, which contains 777 609 single nucleotide polymorphisms (SNPs), was developed in 2010. Increasing marker density increases the level of linkage disequilibrium between quantitative trait loci (QTL) and SNPs, and the accuracy of QTL localization and genomic selection. However, re-genotyping all animals with the high-density chip is not economically feasible. An alternative strategy is to genotype part of the animals with the high-density chip and to impute high-density genotypes for animals already genotyped with the 50K chip. Thus, it is necessary to investigate the error rate when imputing from the 50K to the high-density chip. Methods Five thousand one hundred and fifty-three animals from 16 breeds (89 to 788 per breed) were genotyped with the high-density chip. Imputation error rates from the 50K to the high-density chip were computed for each breed with a validation set that included the 20% youngest animals. Marker genotypes were masked for animals in the validation population in order to mimic 50K genotypes. Imputation was carried out using the Beagle 3.3.0 software. Results Mean allele imputation error rates ranged from 0.31% to 2.41% depending on the breed. In total, 1980 SNPs had high imputation error rates in several breeds, which is probably due to genome assembly errors, and we recommend discarding these SNPs in future studies. Differences in imputation accuracy between breeds were related to the size of the high-density-genotyped sample and to the genetic relationship between the reference and validation populations, whereas differences in effective population size and level of linkage disequilibrium showed limited effects. Accordingly, imputation accuracy was higher in breeds with large populations and in dairy breeds than in beef breeds. More than 99% of the alleles were correctly imputed if more than 300 animals were genotyped at high density. No improvement was observed when multi-breed imputation was performed. Conclusion In all breeds, imputation accuracy was higher than 97%, which indicates that imputation to the high-density chip was accurate. Imputation accuracy depends mainly on the size of the reference population and the relationship between the reference and target populations. PMID:24004563
Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel.
Huang, Jie; Howie, Bryan; McCarthy, Shane; Memari, Yasin; Walter, Klaudia; Min, Josine L; Danecek, Petr; Malerba, Giovanni; Trabetti, Elisabetta; Zheng, Hou-Feng; Gambaro, Giovanni; Richards, J Brent; Durbin, Richard; Timpson, Nicholas J; Marchini, Jonathan; Soranzo, Nicole
2015-09-14
Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.
Genotype imputation in the domestic dog
Meurs, K. M.
2016-01-01
Application of imputation methods to accurately predict a dense array of SNP genotypes in the dog could provide an important supplement to current analyses of array-based genotyping data. Here, we developed a reference panel of 4,885,283 SNPs in 83 dogs across 15 breeds using whole genome sequencing. We used this panel to predict the genotypes of 268 dogs across three breeds with 84,193 SNP array-derived genotypes as inputs. We then (1) performed breed clustering of the actual and imputed data; (2) evaluated several reference panel breed combinations to determine an optimal reference panel composition; and (3) compared the accuracy of two commonly used software algorithms (Beagle and IMPUTE2). Breed clustering was well preserved in the imputation process across eigenvalues representing 75 % of the variation in the imputed data. Using Beagle with a target panel from a single breed, genotype concordance was highest using a multi-breed reference panel (92.4 %) compared to a breed-specific reference panel (87.0 %) or a reference panel containing no breeds overlapping with the target panel (74.9 %). This finding was confirmed using target panels derived from two other breeds. Additionally, using the multi-breed reference panel, genotype concordance was slightly higher with IMPUTE2 (94.1 %) compared to Beagle; Pearson correlation coefficients were slightly higher for both software packages (0.946 for Beagle, 0.961 for IMPUTE2). Our findings demonstrate that genotype imputation from SNP array-derived data to whole genome-level genotypes is both feasible and accurate in the dog with appropriate breed overlap between the target and reference panels. PMID:27129452
The Effects of Methods of Imputation for Missing Values on the Validity and Reliability of Scales
ERIC Educational Resources Information Center
Cokluk, Omay; Kayri, Murat
2011-01-01
The main aim of this study is the comparative examination of the factor structures, corrected item-total correlations, and Cronbach-alpha internal consistency coefficients obtained by different methods used in imputation for missing values in conditions of not having missing values, and having missing values of different rates in terms of testing…
Ellerbe, Caitlyn; Lawson, Andrew B.; Alia, Kassandra A.; Meyers, Duncan C.; Coulon, Sandra M.; Lawman, Hannah G.
2013-01-01
Background This study examined the effects of imputation modeling choices on estimates of spatial proximity and social factors associated with walking in African American adults. Purpose Models were compared that examined relationships between household proximity to a walking trail and social factors in determining walking status. Methods Participants (N=133; 66% female; mean age=55 yrs) were recruited to a police-supported walking and social marketing intervention. Bayesian modeling was used to identify predictors of walking at 12 months. Results Sensitivity analyses using different imputation approaches and spatial contextual effects were compared. All the imputation methods showed that social life and income were significant predictors of walking; however, the complete data approach was the best model, indicating that Age (OR 1.04, 95% CI: 1.00, 1.08), Social Life (OR 0.83, 95% CI: 0.69, 0.98) and Income > $10,000 (OR 0.10, 95% CI: 0.01, 0.97) were all predictors of walking. Conclusions The complete data approach was the best model of predictors of walking in African Americans. PMID:23481250
Bianca N. I. Eskelson; Hailemariam Temesgen; Valerie Lemay; Tara M. Barrett; Nicholas L. Crookston; Andrew T. Hudak
2009-01-01
Almost universally, forest inventory and monitoring databases are incomplete, ranging from missing data for only a few records and a few variables, common for small land areas, to missing data for many observations and many variables, common for large land areas. For a wide variety of applications, nearest neighbor (NN) imputation methods have been developed to fill in...
Bianca N.I. Eskelson; Hailemariam Temesgen; Tara M. Barrett
2009-01-01
Cavity tree and snag abundance data are highly variable and contain many zero observations. We predict cavity tree and snag abundance from variables that are readily available from forest cover maps or remotely sensed data using negative binomial (NB), zero-inflated NB, and zero-altered NB (ZANB) regression models as well as nearest neighbor (NN) imputation methods....
State Alcohol-Impaired-Driving Estimates
... For more information on multiple imputation, see NHTSA's Technical Report (DOT HS 809 403, www-nrd.nhtsa. ...); and NHTSA's National Center for Statistics and Analysis, 1200 New Jersey Avenue SE, Washington, DC 20590 ...
A multiple imputation strategy for sequential multiple assignment randomized trials
Shortreed, Susan M.; Laber, Eric; Stroup, T. Scott; Pineau, Joelle; Murphy, Susan A.
2014-01-01
Sequential multiple assignment randomized trials (SMARTs) are increasingly being used to inform clinical and intervention science. In a SMART, each patient is repeatedly randomized over time. Each randomization occurs at a critical decision point in the treatment course. These critical decision points often correspond to milestones in the disease process or other changes in a patient’s health status. Thus, the timing and number of randomizations may vary across patients and depend on evolving patient-specific information. This presents unique challenges when analyzing data from a SMART in the presence of missing data. This paper presents the first comprehensive discussion of missing data issues typical of SMART studies: we describe five specific challenges, and propose a flexible imputation strategy to facilitate valid statistical estimation and inference using incomplete data from a SMART. To illustrate these contributions, we consider data from the Clinical Antipsychotic Trial of Intervention and Effectiveness (CATIE), one of the most well-known SMARTs to date. PMID:24919867
Nuclear Forensics Analysis with Missing and Uncertain Data
Langan, Roisin T.; Archibald, Richard K.; Lamberti, Vincent
2015-10-05
We have applied a new imputation-based method for analyzing incomplete data, called Monte Carlo Bayesian Database Generation (MCBDG), to the Spent Fuel Isotopic Composition (SFCOMPO) database. About 60% of the entries are absent for SFCOMPO. The method estimates missing values of a property from a probability distribution created from the existing data for the property, and then generates multiple instances of the completed database for training a machine learning algorithm. Uncertainty in the data is represented by an empirical or an assumed error distribution. The method makes few assumptions about the underlying data, and compares favorably against results obtained by replacing missing information with constant values.
DAMBE7: New and Improved Tools for Data Analysis in Molecular Biology and Evolution.
Xia, Xuhua
2018-06-01
DAMBE is a comprehensive software package for genomic and phylogenetic data analysis on Windows, Linux, and Macintosh computers. New functions include imputing missing distances and building the phylogeny simultaneously (paving the way to building large phage and transposon trees), new bootstrapping/jackknifing methods for PhyPA (phylogenetics from pairwise alignments), and an improved function for fast and accurate estimation of the shape parameter of the gamma distribution for fitting rate heterogeneity over sites. The previous method corrected multiple hits for each site independently; DAMBE's new method uses all sites simultaneously for the correction. DAMBE, featuring a user-friendly graphic interface, is freely available from http://dambe.bio.uottawa.ca (last accessed April 17, 2018).
DISSCO: direct imputation of summary statistics allowing covariates
Xu, Zheng; Duan, Qing; Yan, Song; Chen, Wei; Li, Mingyao; Lange, Ethan; Li, Yun
2015-01-01
Background: Imputation of individual level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual level genotype data are not available. Two methods (DIST and ImpG-Summary/LD), which assume a multivariate Gaussian distribution for the association summary statistics, have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates. Methods: We analytically show that in the absence of covariates, correlation among association summary statistics is indeed the same as that among the corresponding genotypes, thus serving as a theoretical justification for the recently proposed methods. We continue to prove that in the presence of covariates, correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO). Results: We consider two real-life scenarios where the correlation and partial correlation likely make a practical difference: (i) association studies in admixed populations; (ii) association studies in the presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9–15.2% for variants with minor allele frequency <5%. Availability and implementation: http://www.unc.edu/~yunmli/DISSCO. Contact: yunli@med.unc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25810429
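The key quantity here, the partial correlation of genotypes controlling for covariates, can be illustrated with a short residualization sketch; the data are simulated and only the formula is the point.

```python
# Partial correlation of two genotype vectors controlling for a covariate.
import numpy as np

rng = np.random.default_rng(11)
n = 1_000
ancestry = rng.normal(size=(n, 1))                    # confounding covariate
g1 = 0.8 * ancestry[:, 0] + rng.normal(size=n)        # two genotype dosages
g2 = 0.8 * ancestry[:, 0] + rng.normal(size=n)        # sharing the confounder

def residualize(y, C):
    """Residuals of y after least-squares regression on covariates C."""
    C1 = np.column_stack([np.ones(len(y)), C])        # add intercept
    beta, *_ = np.linalg.lstsq(C1, y, rcond=None)
    return y - C1 @ beta

r_marginal = np.corrcoef(g1, g2)[0, 1]                # inflated by ancestry
r_partial = np.corrcoef(residualize(g1, ancestry),
                        residualize(g2, ancestry))[0, 1]
print(f"marginal r = {r_marginal:.2f}, partial r = {r_partial:.2f}")
```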
NASA Astrophysics Data System (ADS)
Ho, M. W.; Lall, U.; Cook, E. R.
2015-12-01
Advances in paleoclimatology in the past few decades have provided opportunities to expand the temporal perspective of the hydrological and climatological variability across the world. The North American region is particularly fortunate in this respect where a relatively dense network of high resolution paleoclimate proxy records have been assembled. One such network is the annually-resolved Living Blended Drought Atlas (LBDA): a paleoclimate reconstruction of the Palmer Drought Severity Index (PDSI) that covers North America on a 0.5° × 0.5° grid based on tree-ring chronologies. However, the use of the LBDA to assess North American streamflow variability requires a model by which streamflow may be reconstructed. Paleoclimate reconstructions have typically used models that first seek to quantify the relationship between the paleoclimate variable and the environmental variable of interest before extrapolating the relationship back in time. In contrast, the pre-instrumental streamflow is here considered as "missing" data. A method of imputing the "missing" streamflow data, prior to the instrumental record, is applied through multiple imputation using chained equations for streamflow in the Missouri River Basin. In this method, the distribution of the instrumental streamflow and LBDA is used to estimate sets of plausible values for the "missing" streamflow data resulting in a ~600 year-long streamflow reconstruction. Past research into external climate forcings, oceanic-atmospheric variability and its teleconnections, and assessments of rare multi-centennial instrumental records demonstrate that large temporal oscillations in hydrological conditions are unlikely to be captured in most instrumental records. The reconstruction of multi-centennial records of streamflow will enable comprehensive assessments of current and future water resource infrastructure and operations under the existing scope of natural climate variability.
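A minimal sketch of treating the pre-instrumental period as missing data to be multiply imputed, using scikit-learn's IterativeImputer (with sample_posterior=True to obtain distinct draws) as a stand-in for the chained-equations software actually used; the proxy and streamflow series below are synthetic.

```python
# Multiply impute pre-instrumental streamflow from a paleoclimate proxy.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(8)
years = 600
pdsi = rng.normal(size=years)                        # LBDA-like proxy, complete
flow = 2.0 + 1.5 * pdsi + rng.normal(0, 0.5, years)  # streamflow
flow[:500] = np.nan                                  # only last 100 yr "observed"
data = np.column_stack([pdsi, flow])

# sample_posterior=True draws from the conditional distribution, so each
# run yields one member of a multiple-imputation ensemble.
imputations = [IterativeImputer(sample_posterior=True, random_state=i)
               .fit_transform(data)[:, 1] for i in range(5)]
print(np.mean(imputations, axis=0)[:5])  # ensemble mean, first 5 "years"
```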
Clustering with Missing Values: No Imputation Required
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri
2004-01-01
Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.
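For contrast with the paper's constraint-based KSC algorithm, the following is a toy marginalization baseline: k-means in which distances are computed over observed features only and rescaled. This is explicitly not KSC, just the kind of "ignore, don't impute" strategy such work is compared against; all data here are simulated.

```python
# K-means variant that marginalizes over missing features: distances use
# only observed dimensions, rescaled to the full dimensionality.
import numpy as np

def kmeans_marginal(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(X)                     # observation mask
    n_obs = np.maximum(obs.sum(1), 1)      # guard against all-missing rows
    centers = np.nan_to_num(X[rng.choice(len(X), k, replace=False)])
    for _ in range(iters):
        # Squared distance over observed dims, scaled to full dimension.
        d = np.stack([np.where(obs, (X - c) ** 2, 0.0).sum(1) * X.shape[1] / n_obs
                      for c in centers])
        labels = d.argmin(0)
        for j in range(k):                 # update using observed values only
            mask, pts = obs[labels == j], np.nan_to_num(X[labels == j])
            cnt = mask.sum(0)
            centers[j] = np.where(cnt > 0, pts.sum(0) / np.maximum(cnt, 1),
                                  centers[j])
    return labels, centers

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan      # 20% of values missing
labels, _ = kmeans_marginal(X, k=3)
```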
A SPATIOTEMPORAL APPROACH FOR HIGH RESOLUTION TRAFFIC FLOW IMPUTATION
DOE Office of Scientific and Technical Information (OSTI.GOV)
Han, Lee; Chin, Shih-Miao; Hwang, Ho-Ling
Along with the rapid development of Intelligent Transportation Systems (ITS), traffic data collection technologies have been evolving dramatically. The emergence of innovative data collection technologies such as the Remote Traffic Microwave Sensor (RTMS), Bluetooth sensors, the GPS-based floating car method, automated license plate recognition (ALPR) (1), etc., has created an explosion of traffic data, which brings transportation engineering into the new era of Big Data. However, despite the advance of these technologies, the missing data issue is still inevitable and has posed great challenges for research such as traffic forecasting, real-time incident detection and management, dynamic route guidance, and massive evacuation optimization, because the degree of success of these endeavors depends on the timely availability of relatively complete and reasonably accurate traffic data. A thorough literature review suggests that most current imputation models, if not all, focus largely on the temporal nature of the traffic data, failing to consider that traffic stream characteristics at a certain location are closely related to those at neighboring locations and to utilize these correlations for data imputation. To this end, this paper presents a Kriging-based spatiotemporal data imputation approach that is able to fully utilize the spatiotemporal information underlying traffic data. The imputation performance of the proposed approach was tested using simulated scenarios and achieved stable imputation accuracy. Moreover, the proposed Kriging imputation model is more flexible than current models.
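A toy spatial kriging interpolation for a missing detector reading, using an exponential covariance over detector spacing. The sill/range values and the station layout are invented for illustration; the paper's model is also spatiotemporal, not purely spatial as here.

```python
# Simple kriging of a missing traffic count from neighboring detectors.
import numpy as np

def krige(x_obs, y_obs, x_new, sill=1.0, rng_param=2.0, nugget=1e-6):
    cov = lambda a, b: sill * np.exp(-np.abs(a[:, None] - b[None, :]) / rng_param)
    K = cov(x_obs, x_obs) + nugget * np.eye(len(x_obs))   # obs-obs covariance
    k = cov(x_obs, np.atleast_1d(np.float64(x_new)))      # obs-target covariance
    w = np.linalg.solve(K, k)[:, 0]                       # kriging weights
    mu = y_obs.mean()                                     # known-mean trend
    return mu + w @ (y_obs - mu)

# Detectors at milepost locations; the detector at milepost 5.0 is missing.
x = np.array([1.0, 2.5, 4.0, 6.0, 7.5])
flow = np.array([950.0, 1010.0, 1100.0, 1080.0, 990.0])
print(krige(x, flow, 5.0))
```

The weights automatically favor nearby detectors, which is exactly the spatial correlation the abstract argues temporal-only imputation models ignore.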
HLA imputation in an admixed population: An assessment of the 1000 Genomes data as a training set.
Nunes, Kelly; Zheng, Xiuwen; Torres, Margareth; Moraes, Maria Elisa; Piovezan, Bruno Z; Pontes, Gerlandia N; Kimura, Lilian; Carnavalli, Juliana E P; Mingroni Netto, Regina C; Meyer, Diogo
2016-03-01
Methods to impute HLA alleles based on dense single nucleotide polymorphism (SNP) data provide a valuable resource for association studies and evolutionary investigation of the MHC region. The availability of appropriate training sets is critical to the accuracy of HLA imputation, and the inclusion of samples with various ancestries is an important prerequisite in studies of admixed populations. We assess the accuracy of HLA imputation using 1000 Genomes Project data as a training set, applying it to a highly admixed Brazilian population, the Quilombos from the state of São Paulo. To assess accuracy, we compared imputed and experimentally determined genotypes for 146 samples at 4 classical HLA loci. We found imputation accuracies of 82.9%, 81.8%, 94.8% and 86.6% for HLA-A, -B, -C and -DRB1, respectively (two-field resolution). Accuracies were improved when we included a subset of Quilombo individuals in the training set. We conclude that the 1000 Genomes data are a valuable resource for the construction of training sets due to the diversity of ancestries and the potential for a large overlap of SNPs with the target population. We also show that tailoring training sets to features of the target population substantially enhances imputation accuracy.
USDA-ARS?s Scientific Manuscript database
Next-generation sequencing technology such as genotyping-by-sequencing (GBS) made low-cost, but often low-coverage, whole-genome sequencing widely available. Extensive inbreeding in crop plants provides an untapped, high quality source of phased haplotypes for imputing missing genotypes. We introduc...
Identifying Heat Waves in Florida: Considerations of Missing Weather Data
Leary, Emily; Young, Linda J.; DuClos, Chris; Jordan, Melissa M.
2015-01-01
Background Using current climate models, regional-scale changes for Florida over the next 100 years are predicted to include warming over terrestrial areas and very likely increases in the number of high temperature extremes. No uniform definition of a heat wave exists. Most past research on heat waves has focused on evaluating the aftermath of known heat waves, with minimal consideration of missing exposure information. Objectives To identify and discuss methods of handling and imputing missing weather data and how those methods can affect identified periods of extreme heat in Florida. Methods In addition to ignoring missing data, temporal, spatial, and spatio-temporal models are described and utilized to impute missing historical weather data from 1973 to 2012 from 43 Florida weather monitors. Calculated thresholds are used to define periods of extreme heat across Florida. Results Modeling of missing data and imputing missing values can affect the identified periods of extreme heat, through the missing data itself or through the computed thresholds. The differences observed are related to the amount of missingness during June, July, and August, the warmest months of the warm season (April through September). Conclusions Missing data considerations are important when defining periods of extreme heat. Spatio-temporal methods are recommended for data imputation. A heat wave definition that incorporates information from all monitors is advised. PMID:26619198
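A minimal sketch of the downstream step the abstract describes: fill gaps in a monitor's temperature series and then flag extreme-heat periods against a computed threshold. The interpolation here is purely temporal and the file name, 95th-percentile cutoff, and 3-day run rule are illustrative assumptions; the paper recommends spatio-temporal imputation, which would also borrow information from nearby monitors.

```python
# Impute short gaps in daily max temperature, then flag extreme-heat runs.
import pandas as pd

tmax = pd.read_csv("monitor_tmax.csv", index_col="date",
                   parse_dates=True)["tmax"]          # hypothetical file
filled = tmax.interpolate(method="time", limit=3)     # fill short gaps only

threshold = filled.quantile(0.95)                     # extreme-heat cutoff
hot = filled >= threshold
# Heat wave here: 3+ consecutive days at or above the threshold.
runs = hot.groupby((hot != hot.shift()).cumsum()).transform("size")
heat_wave_days = hot & (runs >= 3)
print(heat_wave_days.sum(), "heat-wave days identified")
```

Note that the imputed values feed into the threshold itself, which is precisely the pathway by which the abstract says missing-data handling can change the identified periods of extreme heat.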
Meseck, Kristin; Jankowska, Marta M.; Schipperijn, Jasper; Natarajan, Loki; Godbole, Suneeta; Carlson, Jordan; Takemoto, Michelle; Crist, Katie; Kerr, Jacqueline
2016-01-01
The main purpose of the present study was to assess the impact of global positioning system (GPS) signal lapses on physical activity analyses, to discover any existing associations between missing GPS data and environmental and demographic attributes, and to determine whether imputation is an accurate and viable method for correcting GPS data loss. Accelerometer and GPS data of 782 participants from 8 studies were pooled to represent a range of lifestyles and interactions with the built environment. Periods of GPS signal lapse were identified and extracted. Generalised linear mixed models were run with the number of lapses and the length of lapses as outcomes. The signal lapses were imputed using a simple ruleset, and the imputation was validated against person-worn camera imagery. A final generalised linear mixed model was used to identify the difference between the amount of GPS minutes pre- and post-imputation for the activity categories of sedentary, light, and moderate-to-vigorous physical activity. Over 17% of the dataset consisted of GPS data lapses. No strong associations were found between increasing lapse length or number of lapses and the demographic and built environment variables. A significant difference was found between the pre- and post-imputation minutes for each activity category. No demographic or environmental bias was found for length or number of lapses, but imputation of GPS data may make a significant difference for the inclusion of physical activity data that occurred during a lapse. Imputing GPS data lapses is a viable technique for returning spatial context to accelerometer data and improving the completeness of the dataset. PMID:27245796
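A minimal version of what a rule-based GPS lapse imputation might look like: if a lapse is short and starts and ends near the same place, assume the wearer stayed put and carry the last fix forward. The thresholds, 1 Hz fill rate, and column names are invented for illustration; the study's actual ruleset is not reproduced here.

```python
# Rule-based imputation of short, stationary GPS signal lapses.
import numpy as np
import pandas as pd

def impute_lapses(df, max_gap_s=600, max_drift_m=50):
    """df has columns t (datetime), x, y (projected coordinates in meters)."""
    df = df.sort_values("t").reset_index(drop=True)
    out = df.copy()
    gaps = df["t"].diff().dt.total_seconds()
    for i in np.flatnonzero(gaps > 1):            # any non-contiguous epoch
        prev, nxt = df.iloc[i - 1], df.iloc[i]
        drift = np.hypot(nxt.x - prev.x, nxt.y - prev.y)
        if gaps[i] <= max_gap_s and drift <= max_drift_m:
            # Stationary rule: insert fixes at the last known location.
            ts = pd.date_range(prev.t, nxt.t, freq="1s", inclusive="neither")
            fill = pd.DataFrame({"t": ts, "x": prev.x, "y": prev.y})
            out = pd.concat([out, fill])
    return out.sort_values("t").reset_index(drop=True)
```

Validating such fills against person-worn camera imagery, as the study did, is what turns a plausible rule into an evidenced one.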
Joint modelling rationale for chained equations
2014-01-01
Background Chained equations imputation is widely used in medical research. It uses a set of conditional models, so is more flexible than joint modelling imputation for the imputation of different types of variables (e.g. binary, ordinal or unordered categorical). However, chained equations imputation does not correspond to drawing from a joint distribution when the conditional models are incompatible. Concurrently with our work, other authors have shown the equivalence of the two imputation methods in finite samples. Methods Taking a different approach, we prove, in finite samples, sufficient conditions for chained equations and joint modelling to yield imputations from the same predictive distribution. Further, we apply this proof in four specific cases and conduct a simulation study which explores the consequences when the conditional models are compatible but the conditions otherwise are not satisfied. Results We provide an additional “non-informative margins” condition which, together with compatibility, is sufficient. We show that the non-informative margins condition is not satisfied, despite compatible conditional models, in a situation as simple as two continuous variables and one binary variable. Our simulation study demonstrates that as a consequence of this violation order effects can occur; that is, systematic differences depending upon the ordering of the variables in the chained equations algorithm. However, the order effects appear to be small, especially when associations between variables are weak. Conclusions Since chained equations is typically used in medical research for datasets with different types of variables, researchers must be aware that order effects are likely to be ubiquitous, but our results suggest they may be small enough to be negligible. PMID:24559129
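One practical way to probe for the order effects the paper describes is to run the same chained-equations imputer with different visit orders and compare the completed values. A minimal sketch with scikit-learn's IterativeImputer on simulated continuous data; the paper's simplest violating case (two continuous variables plus one binary) would need a mixed imputation model instead.

```python
# Probe for order effects: same data, different chained-equations orders.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0, 0],
                            [[1, .6, .3], [.6, 1, .5], [.3, .5, 1]], 500)
X[rng.random(X.shape) < 0.2] = np.nan      # 20% of values missing at random

completed = {}
for order in ("ascending", "descending", "random"):
    imp = IterativeImputer(imputation_order=order, random_state=0, max_iter=25)
    completed[order] = imp.fit_transform(X)

# Systematic differences between orders would signal an order effect.
print(np.nanmax(np.abs(completed["ascending"] - completed["descending"])))
```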
Impact of Missing Data for Body Mass Index in an Epidemiologic Study.
Razzaghi, Hilda; Tinker, Sarah C; Herring, Amy H; Howards, Penelope P; Waller, D Kim; Johnson, Candice Y
2016-07-01
Objective To assess the potential impact of missing data on body mass index (BMI) on the association between prepregnancy obesity and specific birth defects. Methods Data from the National Birth Defects Prevention Study (NBDPS) were analyzed. We assessed the factors associated with missing BMI data among mothers of infants without birth defects. Four analytic methods were then used to assess the impact of missing BMI data on the association between maternal prepregnancy obesity and three birth defects: spina bifida, gastroschisis, and cleft lip with/without cleft palate. The analytic methods were: (1) complete case analysis; (2) assignment of missing values to either obese or normal BMI; (3) multiple imputation; and (4) probabilistic sensitivity analysis. Logistic regression was used to estimate crude and adjusted odds ratios (aOR) and 95% confidence intervals (CI). Results Of NBDPS control mothers, 4.6% were missing BMI data, and most of the missing values were attributable to missing height (~90%). Missing BMI data was associated with birth outside of the US (aOR 8.6; 95% CI 5.5, 13.4), interview in Spanish (aOR 2.4; 95% CI 1.8, 3.2), Hispanic ethnicity (aOR 2.0; 95% CI 1.2, 3.4), and <12 years of education (aOR 2.3; 95% CI 1.7, 3.1). Overall, the results of the multiple imputation and probabilistic sensitivity analysis were similar to the complete case analysis. Conclusions Although in some scenarios missing BMI data can bias the magnitude of association, it does not appear likely to have impacted conclusions from a traditional complete case analysis of these data.
Impact of pre-imputation SNP-filtering on genotype imputation results
2014-01-01
Background Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, research on and understanding of the impact of initial SNP-data quality control on imputation results is still limited. In this paper, we aim to evaluate the effect of different strategies of pre-imputation quality filtering on the performance of the widely used imputation algorithms MaCH and IMPUTE2. Results We considered three scenarios: imputation of partially missing genotypes with use of an external reference panel, without use of an external reference panel, as well as imputation of completely untyped SNPs using an external reference panel. We first created various datasets applying different SNP quality filters and masking certain percentages of randomly selected high-quality SNPs. We imputed these SNPs and compared the results between the different filtering scenarios by using established and newly proposed measures of imputation quality. While the established measures assess the certainty of imputation results, our newly proposed measures focus on the agreement with true genotypes. These measures showed that pre-imputation SNP-filtering can be detrimental to imputation quality. Moreover, the strongest drivers of imputation quality were in general the burden of missingness and the number of SNPs used for imputation. We also found that using a reference panel always improves the imputation quality of partially missing genotypes. MaCH performed slightly better than IMPUTE2 in most of our scenarios. Again, these results were more pronounced when using our newly defined measures of imputation quality. Conclusion Even moderate filtering has a detrimental effect on imputation quality. Therefore little or no SNP filtering prior to imputation appears to be the best strategy for imputing small to moderately sized datasets. Our results also showed that for these datasets, MaCH performs slightly better than IMPUTE2 in most scenarios at the cost of increased computing time. PMID:25112433
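A sketch of the masking experiment behind the paper's "agreement with true genotypes" measures: hide a fraction of high-quality genotype calls, impute, and score concordance against the hidden truth rather than the imputer's own certainty. The per-SNP modal-genotype imputer below is a stand-in for MaCH/IMPUTE2, which the authors actually evaluated; all data are simulated.

```python
# Mask-and-score evaluation of imputation quality (genotypes coded 0/1/2).
import numpy as np

rng = np.random.default_rng(7)
truth = rng.integers(0, 3, size=(500, 200)).astype(float)  # samples x SNPs
masked = truth.copy()
hide = rng.random(truth.shape) < 0.05                      # mask 5% of calls
masked[hide] = np.nan

# Stand-in imputer: per-SNP modal genotype among observed calls.
imputed = masked.copy()
for j in range(masked.shape[1]):
    col = masked[:, j]
    mode = np.bincount(col[~np.isnan(col)].astype(int), minlength=3).argmax()
    imputed[np.isnan(col), j] = mode

concordance = (imputed[hide] == truth[hide]).mean()   # agreement with truth
print(f"genotype concordance on masked calls: {concordance:.3f}")
```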
Loneliness in senior housing communities.
Taylor, Harry Owen; Wang, Yi; Morrow-Howell, Nancy
2018-05-23
There are many studies on loneliness among community-dwelling older adults; however, there is limited research examining the extent and correlates of loneliness among older adults who reside in senior housing communities. This study examines the extent and correlates of loneliness in three public senior housing communities in the St. Louis area. Data for this project were collected with survey questionnaires, with a total sample size of 148 respondents. Loneliness was measured using the Hughes 3-item loneliness scale. Additionally, the questionnaire contained measures on socio-demographics, health/mental health, social engagement, and social support. Missing data for the hierarchical multivariate regression models were imputed using multiple imputation methods. Results showed that approximately 30.8% of the sample was not lonely, 42.7% was moderately lonely, and 26.6% was severely lonely. In the multivariate analyses, loneliness was primarily associated with depressive symptoms. Contrary to popular opinion, our study found that the prevalence of loneliness was high in senior housing communities. Nevertheless, senior housing communities could be ideal locations for reducing loneliness among older adults. Interventions should focus on concomitantly addressing both an individual's loneliness and mental health.
DISSCO: direct imputation of summary statistics allowing covariates.
Xu, Zheng; Duan, Qing; Yan, Song; Chen, Wei; Li, Mingyao; Lange, Ethan; Li, Yun
2015-08-01
Imputation of individual level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual level genotype data are not available. Two methods (DIST and ImpG-Summary/LD), both of which assume a multivariate Gaussian distribution for the association summary statistics, have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates. We analytically show that in the absence of covariates, the correlation among association summary statistics is indeed the same as that among the corresponding genotypes, thus providing a theoretical justification for the recently proposed methods. We further prove that in the presence of covariates, the correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO). We consider two real-life scenarios where the correlation and partial correlation likely make a practical difference: (i) association studies in admixed populations; (ii) association studies in the presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9-15.2% for variants with minor allele frequency <5%.
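The key quantity in the paper's covariate result is the partial correlation of the genotypes given the covariates, which can be computed by residualizing the genotypes on the covariates. A minimal numpy illustration under simulated data; the covariates standing in for, say, ancestry principal components are an assumption of this sketch.

```python
# Partial correlation of genotype columns, controlling for covariates.
import numpy as np

def partial_corr(G, C):
    """Partial correlation matrix of columns of G (n x p), controlling
    for covariate columns C (n x q); an intercept is added internally."""
    Z = np.column_stack([np.ones(len(C)), C])
    beta, *_ = np.linalg.lstsq(Z, G, rcond=None)
    R = G - Z @ beta                     # residualize genotypes on covariates
    R = (R - R.mean(0)) / R.std(0)       # standardize residuals
    return (R.T @ R) / len(R)

rng = np.random.default_rng(0)
C = rng.normal(size=(1000, 2))                          # e.g. ancestry PCs
G = C @ rng.normal(size=(2, 5)) + rng.normal(size=(1000, 5))
print(np.round(partial_corr(G, C), 2))                  # vs np.corrcoef(G.T)
```

Comparing the output with the plain correlation matrix of G shows how confounding inflates the naive correlations that DIST and ImpG-Summary/LD would use.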
USDA-ARS?s Scientific Manuscript database
The objective of this study was to investigate alternative methods for designing and utilizing reduced single nucleotide polymorphism (SNP) panels for imputing SNP genotypes. Two purebred Hereford populations, an experimental population known as Line 1 Hereford (L1, N=240) and registered Hereford wi...
TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION
Allen, Genevera I.; Tibshirani, Robert
2015-01-01
Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility. PMID:26877823
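For orientation, here is a bare-bones EM imputation under an ordinary multivariate normal, the simpler multivariate framework the paper extends. It includes the conditional-covariance correction in the M-step but none of the transposable row-and-column structure or the regularization that are the paper's actual contribution.

```python
# EM imputation for a multivariate-normal data matrix with missing entries.
import numpy as np

def em_mvn_impute(X, iters=50):
    X = X.copy()
    n, p = X.shape
    miss = np.isnan(X)
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])  # crude start
    mu, S = X.mean(0), np.cov(X, rowvar=False, bias=True)
    for _ in range(iters):
        scatter_fix = np.zeros((p, p))
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            coef = S[np.ix_(m, o)] @ np.linalg.inv(S[np.ix_(o, o)])
            X[i, m] = mu[m] + coef @ (X[i, o] - mu[o])    # E-step: cond. mean
            C = S[np.ix_(m, m)] - coef @ S[np.ix_(o, m)]  # cond. covariance
            scatter_fix[np.ix_(m, m)] += C
        mu = X.mean(0)                                    # M-step updates
        S = ((X - mu).T @ (X - mu) + scatter_fix) / n
    return X, mu, S
```

Without the scatter_fix term the procedure degenerates to iterated conditional-mean filling, which systematically understates the covariance; adding penalties to the inverse covariances, as the paper does, is what keeps the estimates non-singular when p exceeds n.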
Missing portion sizes in FFQ--alternatives to use of standard portions.
Køster-Rasmussen, Rasmus; Siersma, Volkert; Halldorsson, Thorhallur I; de Fine Olivarius, Niels; Henriksen, Jan E; Heitmann, Berit L
2015-08-01
Standard portions or substitution of missing portion sizes with medians may generate bias when quantifying the dietary intake from FFQ. The present study compared four different methods of including portion sizes in FFQ. We evaluated three stochastic methods for imputation of portion sizes based on information about anthropometry, sex, physical activity and age. Energy intakes computed with standard portion sizes, defined as sex-specific medians (median), or with portion sizes estimated with multinomial logistic regression (MLR), 'comparable categories' (Coca) or k-nearest neighbours (KNN) were compared with a reference based on self-reported portion sizes (quantified by a photographic food atlas embedded in the FFQ). The Danish Health Examination Survey 2007-2008. The study included 3728 adults with complete portion size data. Compared with the reference, the root-mean-square errors of the mean daily total energy intake (in kJ) computed with portion sizes estimated by the four methods were (men; women): median (1118; 1061), MLR (1060; 1051), Coca (1230; 1146), KNN (1281; 1181). The equivalent biases (mean error) were (in kJ): median (579; 469), MLR (248; 178), Coca (234; 188), KNN (-340; 218). The MLR and Coca methods provided the best agreement with the reference. The stochastic methods allowed for estimation of meaningful portion sizes by conditioning on information about physiology, and they were suitable for multiple imputation. We propose using MLR or Coca to substitute missing portion size values, or when portion sizes need to be included in FFQ without portion size data.
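A sketch of the k-nearest-neighbours idea: borrow portion sizes from respondents who are similar in sex, age, anthropometry and activity. It uses scikit-learn's KNNImputer as a generic stand-in; the study's KNN, MLR and Coca procedures are specific to their FFQ and are not reproduced here, and the respondent data below are invented.

```python
# KNN-style imputation of a missing portion size from similar respondents.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({                       # hypothetical respondent data
    "sex": [0, 1, 0, 1, 0, 1],
    "age": [34, 51, 29, 62, 45, 38],
    "bmi": [22.1, 27.4, 24.3, 30.2, 21.8, 26.0],
    "activity": [2, 1, 3, 1, 2, 2],
    "bread_portion_g": [80, np.nan, 60, 120, np.nan, 100],
})
imputer = KNNImputer(n_neighbors=3, weights="distance")
df[:] = imputer.fit_transform(df)         # fills the missing portions
print(df["bread_portion_g"])
```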
Improving record linkage performance in the presence of missing linkage data.
Ong, Toan C; Mannino, Michael V; Schilling, Lisa M; Kahn, Michael G
2014-12-01
Existing record linkage methods do not handle missing linking field values in an efficient and effective manner. The objective of this study is to investigate three novel methods for improving the accuracy and efficiency of record linkage when record linkage fields have missing values. By extending the Fellegi-Sunter scoring implementations available in the open-source Fine-grained Record Linkage (FRIL) software system we developed three novel methods to solve the missing data problem in record linkage, which we refer to as: Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data from the set of quasi-identifiers and redistributes the weight from the missing attribute based on relative proportions across the remaining available linkage fields. Distance Imputation imputes the distance between the missing data fields rather than imputing the missing data value. Linkage Expansion adds previously considered non-linkage fields to the linkage field set to compensate for the missing information in a linkage field. We tested the linkage methods using simulated data sets with varying field value corruption rates. The methods developed had sensitivity ranging from .895 to .992 and positive predictive values (PPV) ranging from .865 to 1 in data sets with low corruption rates. Increased corruption rates lead to decreased sensitivity for all methods. These new record linkage algorithms show promise in terms of accuracy and efficiency and may be valuable for combining large data sets at the patient level to support biomedical and clinical research.
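A minimal sketch of the Weight Redistribution idea in a Fellegi-Sunter-style score: when a linkage field is missing, drop it and scale up the remaining field weights so the attainable total stays comparable. Field names and agreement weights are invented; FRIL's internals are not reproduced.

```python
# Fellegi-Sunter-style scoring with weight redistribution for missing fields.
def link_score(rec_a, rec_b, weights):
    usable = {f: w for f, w in weights.items()
              if rec_a.get(f) is not None and rec_b.get(f) is not None}
    if not usable:
        return 0.0
    scale = sum(weights.values()) / sum(usable.values())  # redistribute weight
    return sum(w * scale * (rec_a[f] == rec_b[f]) for f, w in usable.items())

weights = {"last_name": 5.0, "dob": 4.0, "zip": 2.0}
a = {"last_name": "chen", "dob": "1980-02-01", "zip": None}   # zip missing
b = {"last_name": "chen", "dob": "1980-02-01", "zip": "80045"}
print(link_score(a, b, weights))  # agreement on available fields -> 11.0
```

Scaling by the ratio of total to available weight keeps a record pair with one missing field on the same score scale as fully observed pairs, which is the point of the redistribution.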
Tackling Missing Data in Community Health Studies Using Additive LS-SVM Classifier.
Wang, Guanjin; Deng, Zhaohong; Choi, Kup-Sze
2018-03-01
Missing data is a common issue in community health and epidemiological studies. Direct removal of samples with missing data can lead to reduced sample size and information bias, which deteriorates the significance of the results. While data imputation methods are available to deal with missing data, they are limited in performance and could introduce noise into the dataset. Instead of data imputation, a novel method based on the additive least square support vector machine (LS-SVM) is proposed in this paper for predictive modeling when the input features of the model contain missing data. The method also determines simultaneously the influence of the features with missing values on the classification accuracy using the fast leave-one-out cross-validation strategy. The performance of the method is evaluated by applying it to predict the quality of life (QOL) of elderly people using health data collected in the community. The dataset involves demographics, socioeconomic status, health history, and the outcomes of health assessments of 444 community-dwelling elderly people, with 5% to 60% of data missing in some of the input features. The QOL is measured using a standard questionnaire of the World Health Organization. Results show that the proposed method outperforms four conventional methods for handling missing data: case deletion, feature deletion, mean imputation, and K-nearest neighbor imputation, with the average QOL prediction accuracy reaching 0.7418. It is potentially a promising technique for tackling missing data in community health research and other applications.
Missing value imputation: with application to handwriting data
NASA Astrophysics Data System (ADS)
Xu, Zhen; Srihari, Sargur N.
2015-01-01
Missing values make pattern analysis difficult, particularly with limited available data. In longitudinal research, missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal data with missing values in handwriting analysis. In the task of studying the development of individuality of handwriting, we encountered the fact that feature values are missing for several individuals at several time instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation, and three methods based on Bayesian networks (static Bayesian network, parameter EM, and structural EM), are compared on children's handwriting data. We evaluate the accuracy and robustness of the algorithms under different ratios of missing data and missing values, and draw useful conclusions. Specifically, the static Bayesian network is used for our data, which contain around 5% missing values, as it provides adequate accuracy at low computational cost.
Fedko, Iryna O; Hottenga, Jouke-Jan; Medina-Gomez, Carolina; Pappa, Irene; van Beijsterveldt, Catharina E M; Ehli, Erik A; Davies, Gareth E; Rivadeneira, Fernando; Tiemeier, Henning; Swertz, Morris A; Middeldorp, Christel M; Bartels, Meike; Boomsma, Dorret I
2015-09-01
Combining genotype data across cohorts increases the power to estimate the heritability due to common single nucleotide polymorphisms (SNPs), based on analyzing a Genetic Relationship Matrix (GRM). However, the combination of SNP data across multiple cohorts may lead to stratification, when, for example, different genotyping platforms are used. In the current study, we address issues of combining SNP data from different cohorts, the Netherlands Twin Register (NTR) and the Generation R (GENR) study. Both cohorts include children of Northern European Dutch background (N = 3102 and 2826, respectively) who were genotyped on different platforms. We explore imputation and phasing as a tool and compare three GRM-building strategies, in which data from the two cohorts are (1) simply combined, (2) pre-combined and cross-platform imputed and (3) cross-platform imputed and post-combined. We test these three strategies with data on childhood height for unrelated individuals (N = 3124, average age 6.7 years) to explore their effect on SNP-heritability estimates and compare the results to those obtained from the independent studies. All combination strategies result in SNP-heritability estimates with a standard error smaller than those of the independent studies. We did not observe a significant difference in estimates of SNP-heritability based on the various cross-platform imputed GRMs. The SNP-heritability of childhood height was on average estimated as 0.50 (SE = 0.10). Introducing cohort as a covariate resulted in a drop of ≈2%. Principal components (PCs) adjustment resulted in SNP-heritability estimates of about 0.39 (SE = 0.11). Strikingly, we did not find a significant difference between cross-platform imputed and combined GRMs. All estimates were significant regardless of the use of PCs adjustment. Based on these analyses we conclude that imputation with a reference set helps to increase the power to estimate SNP-heritability by combining cohorts of the same ethnicity genotyped on different platforms. However, important factors should be taken into account, such as remaining cohort stratification after imputation and/or phenotypic heterogeneity between and within cohorts. Whether one should use imputation or just combine the genotype data depends on the number of overlapping SNPs in relation to the total number of genotyped SNPs for both cohorts, and on their ability to tag all the genetic variance related to the specific trait of interest.
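The GRM at the heart of this kind of SNP-heritability analysis is built from per-SNP standardized genotypes: A = ZZ'/m. A small numpy sketch under simulated dosages standing in for a combined, cross-platform-imputed genotype matrix; the dimensions and frequencies are illustrative.

```python
# Build a genetic relationship matrix from standardized genotype dosages.
import numpy as np

rng = np.random.default_rng(42)
freqs = rng.uniform(0.05, 0.5, size=1000)                   # SNP allele freqs
G = rng.binomial(2, freqs, size=(200, 1000)).astype(float)  # 200 samples

p = G.mean(0) / 2.0                                 # estimated allele freqs
Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))          # standardize per SNP
A = Z @ Z.T / Z.shape[1]                            # GRM: A = ZZ'/m
print(A.shape, A.diagonal().mean())                 # diagonal ~ 1 on average
```

Platform-driven stratification shows up in this matrix as structure aligned with cohort membership, which is why the abstract examines cohort covariates and PC adjustment alongside the imputation strategies.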
Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications.
Wu, Xiao-Lin; Xu, Jiaqi; Feng, Guofei; Wiggans, George R; Taylor, Jeremy F; He, Jun; Qian, Changsong; Qiu, Jiansheng; Simpson, Barry; Walker, Jeremy; Bauck, Stewart
2016-01-01
Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, and the latter is computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE with ≤1,000 SNPs, but required considerably more computing time. Nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs that were obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function that was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and imputation error rate was extremely low for chips with ≥3,000 SNPs. Assuming that genotyping or imputation error occurs at random, imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of imputation error rate was propagated to genomic prediction in an Angus population. The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNP selected based on SNP-trait association in U.S. Holstein animals. With this MOLO algorithm, both imputation error rate and genomic prediction error rate were minimal.
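A sketch of the locus-averaged Shannon entropy (LASE) term in the MOLO objective, with a greedy toy selection of the most informative SNPs. The real algorithm also maximizes non-gap map length, enforces spacing via the Beta-guided frame, and honors obligatory SNPs; none of that is reproduced here, and the genotype matrix is simulated.

```python
# Per-SNP Shannon entropy; the panel's mean is its locus-averaged entropy.
import numpy as np

def snp_entropies(G):
    """Shannon entropy (bits) of genotype frequencies per SNP column."""
    ent = []
    for j in range(G.shape[1]):
        _, counts = np.unique(G[:, j], return_counts=True)
        f = counts / counts.sum()
        ent.append(-(f * np.log2(f)).sum())
    return np.array(ent)

rng = np.random.default_rng(3)
G = rng.binomial(2, rng.uniform(0.02, 0.5, 5000), size=(300, 5000))
per_snp = snp_entropies(G)
panel = np.argsort(per_snp)[-1000:]      # keep the 1,000 most informative SNPs
print("panel LASE:", per_snp[panel].mean())
```

Pure entropy ranking like this tends to bunch SNPs in variable regions, which is exactly why MOLO trades entropy off against uniform spacing.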
Lazy collaborative filtering for data sets with missing values.
Ren, Yongli; Li, Gang; Zhang, Jun; Zhou, Wanlei
2013-12-01
As one of the biggest challenges in research on recommender systems, the data sparsity issue is mainly caused by the fact that users tend to rate a small proportion of items from the huge number of available items. This issue becomes even more problematic for the neighborhood-based collaborative filtering (CF) methods, as there are even lower numbers of ratings available in the neighborhood of the query item. In this paper, we aim to address the data sparsity issue in the context of neighborhood-based CF. For a given query (user, item), a set of key ratings is first identified by taking the historical information of both the user and the item into account. Then, an auto-adaptive imputation (AutAI) method is proposed to impute the missing values in the set of key ratings. We present a theoretical analysis to show that the proposed imputation method effectively improves the performance of the conventional neighborhood-based CF methods. The experimental results show that our new method of CF with AutAI outperforms six existing recommendation methods in terms of accuracy.
Missing value imputation for microarray data: a comprehensive comparison study and a web tool.
Chiu, Chia-Chun; Chan, Shih-Yao; Wang, Chung-Ching; Wu, Wei-Sheng
2013-01-01
Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of dataset but not on the species the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets. In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and the imputed results can then be downloaded for the downstream data analyses.
Survival analysis in hematologic malignancies: recommendations for clinicians
Delgado, Julio; Pereira, Arturo; Villamor, Neus; López-Guillermo, Armando; Rozman, Ciril
2014-01-01
The widespread availability of statistical packages has undoubtedly helped hematologists worldwide in the analysis of their data, but has also led to the inappropriate use of statistical methods. In this article, we review some basic concepts of survival analysis and also make recommendations about how and when to perform each particular test using SPSS, Stata and R. In particular, we describe a simple way of defining cut-off points for continuous variables and the appropriate and inappropriate uses of the Kaplan-Meier method and Cox proportional hazards regression models. We also provide practical advice on how to check the proportional hazards assumption and briefly review the role of relative survival and multiple imputation. PMID:25176982
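One of the article's recommendations, checking the proportional hazards assumption, can be illustrated with the Python lifelines package (a counterpart to the SPSS/Stata/R workflows the authors describe). The dataset below is a bundled lifelines example, not hematology data.

```python
# Fit a Cox model and check the proportional hazards assumption.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                        # duration 'week', event 'arrest'
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

# Schoenfeld-residual-based tests, plus printed diagnostics and advice.
cph.check_assumptions(df, p_value_threshold=0.05)
```

A covariate that fails the test typically calls for stratification or a time-varying effect rather than silently reporting a single hazard ratio.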
Sun, Wanjie; Larsen, Michael D; Lachin, John M
2014-04-15
In longitudinal studies, a quantitative outcome (such as blood pressure) may be altered during follow-up by the administration of a non-randomized, non-trial intervention (such as anti-hypertensive medication) that may seriously bias the study results. Current methods mainly address this issue for cross-sectional studies. For longitudinal data, the current methods are either restricted to a specific longitudinal data structure or are valid only under special circumstances. We propose two new methods for estimating covariate effects on the underlying (untreated) general longitudinal outcomes: a single imputation method employing a modified expectation-maximization (EM)-type algorithm, and a multiple imputation (MI) method utilizing a modified Monte Carlo EM-MI algorithm. Each method can be implemented as a one-step, two-step, or full-iteration algorithm. They combine the advantages of the current statistical methods while reducing their restrictive assumptions and generalizing them to realistic scenarios. The proposed methods replace intractable numerical integration of a multi-dimensionally censored MVN posterior distribution with a simplified, sufficiently accurate approximation. The approach is particularly attractive when outcomes reach a plateau after intervention for various reasons. The methods are studied via simulation and applied to data from the Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications study of treatment for type 1 diabetes. The methods proved to be robust to high dimensions, large amounts of censored data, and low within-subject correlation, as well as to settings where subjects receive a non-trial intervention to treat the underlying condition only (with high Y), or where treatment of the majority of subjects (with high Y) is combined with prevention for a small fraction of subjects (with normal Y).
Sulovari, Arvis; Li, Dawei
2014-07-19
Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires the same allele definition (nomenclature) and genome build among individual studies. Similarly, imputation, commonly used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in the use of different genome builds and allele definitions. Incorrect assumptions of identical allele definitions among combined GWAS lead to a large portion of discarded genotypes or incorrect association findings. There is no published tool that predicts and converts among all major allele definitions. In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease in imputation quality (indeed, quality was significantly higher than for imputation with singletons in the reference), especially for rare SNPs. GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP-arrays or deep sequencing, particularly for data from the dbGaP and other public databases. http://www.uvm.edu/genomics/software/gact.
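A small piece of the allele-definition problem GACT addresses: detecting whether two studies report a SNP on opposite strands, and flagging A/T and C/G SNPs, for which a strand flip is undetectable from the alleles alone. This is a simplified illustration only; GACT itself also handles genome-build conversion.

```python
# Classify how two studies' allele pairs for one SNP relate to each other.
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify(alleles_1, alleles_2):
    a1, a2 = set(alleles_1), set(alleles_2)
    flipped = {COMP[x] for x in a1}
    if a1 == flipped:
        return "ambiguous (A/T or C/G)"   # strand flip is undetectable
    if a1 == a2:
        return "same strand"
    if flipped == a2:
        return "strand flip"
    return "allele mismatch"

print(classify(("A", "G"), ("T", "C")))   # -> strand flip
print(classify(("A", "T"), ("A", "T")))   # -> ambiguous (A/T or C/G)
```

The ambiguous class is why the abstract singles out A/T and C/G SNPs: no amount of allele comparison can resolve their strand, so tools must fall back on frequency or simply drop them.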
Salim, Agus; Mackinnon, Andrew; Christensen, Helen; Griffiths, Kathleen
2008-09-30
The pre-test-post-test design (PPD) is predominant in trials of psychotherapeutic treatments. Missing data due to withdrawals present an even bigger challenge in assessing treatment effectiveness under the PPD than under designs with more observations, since dropout implies an absence of information about response to treatment. When confronted with missing data, it is often reasonable to assume that the mechanism underlying missingness is related to observed but not to unobserved outcomes (missing at random, MAR). Previous simulation and theoretical studies have shown that, under MAR, modern techniques such as maximum-likelihood (ML) based methods and multiple imputation (MI) can be used to produce unbiased estimates of treatment effects. In practice, however, ad hoc methods such as last observation carried forward (LOCF) imputation and complete-case (CC) analysis continue to be used. In order to better understand the behaviour of these methods in the PPD, we compare the performance of the traditional approaches (LOCF, CC) and the theoretically sound techniques (MI, ML) under various MAR mechanisms. We show that the LOCF method is seriously biased and conclude that its use should be abandoned. Complete-case analysis produces unbiased estimates only when the dropout mechanism does not depend on pre-test values, even when dropout is related to fixed covariates including treatment group (covariate-dependent, CD). However, CC analysis is generally biased under MAR. The magnitude of the bias is largest when the correlation of post- and pre-test is relatively low.
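A tiny simulation of the LOCF failure mode described here: subjects improve over time, dropout depends on the pre-test value (MAR given baseline), and carrying the last observation forward distorts the post-test mean. The effect sizes and dropout model are invented for illustration, not taken from the paper's simulations.

```python
# Why LOCF is biased in a pre-test/post-test design with MAR dropout.
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
pre = rng.normal(50, 10, n)                   # pre-test severity score
post = pre - 8 + rng.normal(0, 5, n)          # everyone improves by 8 on average
# Dropout probability increases with pre-test severity (MAR on baseline).
drop = rng.random(n) < 1 / (1 + np.exp(-(pre - 50) / 5))

locf = np.where(drop, pre, post)              # LOCF fills post-test with pre-test
cc = post[~drop]                              # complete-case analysis
print(f"truth {post.mean():.1f} | LOCF {locf.mean():.1f} | CC {cc.mean():.1f}")
```

Running this shows LOCF pulled upward (dropouts contribute their unimproved baselines) and the complete-case mean pulled downward (completers are the less severe subjects), matching the paper's conclusion that both ad hoc approaches can mislead.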
Jiang, Z; Dou, Z; Song, W L; Xu, J; Wu, Z Y
2017-11-10
Objective: To compare the results of different methods for handling HIV viral load (VL) data with missing values under different missing data mechanisms. Methods: We used SPSS 17.0 to simulate complete and missing datasets, under different missing value mechanisms, from HIV viral load data collected from MSM in 16 cities in China in 2013. Maximum likelihood using the expectation-maximization algorithm (EM), a regression method, mean imputation, deletion, and Markov chain Monte Carlo (MCMC) were each used to fill in the missing data. The results of the different methods were compared with respect to distribution characteristics, accuracy and precision. Results: The HIV VL data could not be transformed to a normal distribution. All methods performed well for data missing completely at random (MCAR). For the other types of missing data, the regression and MCMC methods preserved the main characteristics of the original data. The means of the datasets imputed by the different methods were all close to the original mean. EM, the regression method, mean imputation, and deletion underestimated VL, while MCMC overestimated it. Conclusion: MCMC can be used as the main imputation method for missing HIV VL data. The imputed data can be used as a reference for estimating the mean HIV VL among the investigated population.
Molgenis-impute: imputation pipeline in a box.
Kanterakis, Alexandros; Deelen, Patrick; van Dijk, Freerk; Byelas, Heorhiy; Dijkstra, Martijn; Swertz, Morris A
2015-08-19
Genotype imputation is an important procedure in current genomic analyses such as genome-wide association studies, meta-analyses and fine mapping. Although high quality tools are available that perform the steps of this process, considerable effort and expertise are required to set up and run a best practice imputation pipeline, particularly for larger genotype datasets, where imputation has to scale out in parallel on computer clusters. Here we present MOLGENIS-impute, an 'imputation in a box' solution that seamlessly and transparently automates the set up and running of all the steps of the imputation process. These steps include genome build liftover, genotype phasing with SHAPEIT2, quality control, sample and chromosomal chunking/merging, and imputation with IMPUTE2. MOLGENIS-impute builds on MOLGENIS-compute, a simple pipeline management platform for the submission and monitoring of bioinformatics tasks in High Performance Computing (HPC) environments such as local/cloud servers, clusters and grids. All the required tools, data and scripts are downloaded and installed in a single step. Researchers with diverse backgrounds and expertise have tested MOLGENIS-impute in different locations and have imputed over 30,000 samples so far using the 1,000 Genomes Project and new Genome of the Netherlands data as the imputation reference. The tests have been performed on PBS/SGE clusters, cloud VMs and in a grid HPC environment. MOLGENIS-impute gives priority to the ease of setting up, configuring and running an imputation. It has minimal dependencies and wraps the pipeline in a simple command line interface, without sacrificing the flexibility to adapt or limiting the options of the underlying imputation tools. It does not require knowledge of a workflow system or programming, and is targeted at researchers who just want to apply best practices in imputation via simple commands. It is built on the MOLGENIS-compute workflow framework to enable customization with additional computational steps, or it can be included in other bioinformatics pipelines. It is available as open source from: https://github.com/molgenis/molgenis-imputation.
Southam, Lorraine; Panoutsopoulou, Kalliope; Rayner, N William; Chapman, Kay; Durrant, Caroline; Ferreira, Teresa; Arden, Nigel; Carr, Andrew; Deloukas, Panos; Doherty, Michael; Loughlin, John; McCaskie, Andrew; Ollier, William E R; Ralston, Stuart; Spector, Timothy D; Valdes, Ana M; Wallis, Gillian A; Wilkinson, J Mark; Marchini, Jonathan; Zeggini, Eleftheria
2011-05-01
Imputation is an extremely valuable tool in conducting and synthesising genome-wide association studies (GWASs). Directly typed SNP quality control (QC) is thought to affect imputation quality. It is, therefore, common practice to use quality-controlled (QCed) data as an input for imputing genotypes. This study aims to determine the effect of commonly applied QC steps on imputation outcomes. We performed several iterations of imputing SNPs across chromosome 22 in a dataset consisting of 3177 samples with Illumina 610k (Illumina, San Diego, CA, USA) GWAS data, applying different QC steps each time. The imputed genotypes were compared with the directly typed genotypes. In addition, we investigated the correlation between alternatively QCed data. We also applied a series of post-imputation QC steps balancing the elimination of poorly imputed SNPs against information loss. We found that the difference in imputation outcome between the unQCed data and the fully QCed data was minimal. Our study shows that imputation of common variants is generally very accurate and robust to GWAS QC, which is not a major factor affecting imputation outcome. A minority of common-frequency SNPs with particular properties cannot be accurately imputed regardless of QC stringency. These findings may not generalise to the imputation of low frequency and rare variants.
Kenneth B. Pierce; Janet L. Ohmann; Michael C. Wimberly; Matthew J. Gregory; Jeremy S. Fried
2009-01-01
Land managers need consistent information about the geographic distribution of wildland fuels and forest structure over large areas to evaluate fire risk and plan fuel treatments. We compared spatial predictions for 12 fuel and forest structure variables across three regions in the western United States using gradient nearest neighbor (GNN) imputation, linear models (...
Andrew T. Hudak; Nicholas L. Crookston; Jeffrey S. Evans; David E. hall; Michael J. Falkowski
2009-01-01
The authors regret that an error was discovered in the code within the R software package, yaImpute (Crookston & Finley, 2008), which led to incorrect results reported in the above article. The Most Similar Neighbor (MSN) method computes the distance between reference observations and target observations in a projected space defined using canonical correlation...
Spatial Copula Model for Imputing Traffic Flow Data from Remote Microwave Sensors.
Ma, Xiaolei; Luan, Sen; Du, Bowen; Yu, Bin
2017-09-21
Issues of missing data have become increasingly serious with the rapid increase in usage of traffic sensors. Analyses of the Beijing ring expressway have shown that up to 50% of microwave sensors have missing values. The imputation of missing traffic data urgently needs to be solved, although a precise solution cannot be easily achieved given the significant number of missing portions. In this study, copula-based models are proposed for the spatial interpolation of traffic flow from remote traffic microwave sensors. Most existing interpolation methods rely only on covariance functions to depict spatial correlation and are unsuitable for coping with anomalies due to the Gaussian assumption. Copula theory overcomes this issue and provides a connection between the correlation function and the marginal distribution function of traffic flow. To validate copula-based models, a comparison with three kriging methods is conducted. Results indicate that copula-based models outperform kriging methods, especially on roads with irregular traffic patterns. Copula-based models demonstrate significant potential to impute missing data in large-scale transportation networks.
Coyle, Kathryn; Carrier, Marc; Lazo-Langner, Alejandro; Shivakumar, Sudeep; Zarychanski, Ryan; Tagalakis, Vicky; Solymoss, Susan; Routhier, Nathalie; Douketis, James; Coyle, Douglas
2017-03-01
Unprovoked venous thromboembolism (VTE) can be the first manifestation of cancer. It is unclear whether extensive screening for occult cancer, including a comprehensive computed tomography (CT) scan of the abdomen/pelvis, is cost-effective in this patient population. We aimed to assess the health care related costs, the number of missed cancer cases, and the health related utility values of a limited screening strategy with and without the addition of a comprehensive CT scan of the abdomen/pelvis, and to identify to what extent testing should be done in these circumstances to allow early detection of occult cancers. The current analyses used a cost-effectiveness analysis of data collected alongside the SOME randomized controlled trial, which compared extensive occult cancer screening including a CT of the abdomen/pelvis with a more limited screening strategy in patients with a first unprovoked VTE. Analyses were conducted with a one-year time horizon from a Canadian health care perspective. The primary analysis was based on complete cases, with a sensitivity analysis using appropriate multiple imputation methods to account for missing data. Data from a total of 854 patients with a first unprovoked VTE were included in these analyses. The addition of a comprehensive CT scan was associated with higher costs ($551 CDN) with no improvement in utility values or in the number of missed cancers. Results were consistent when adopting multiple imputation methods. The addition of a comprehensive CT scan of the abdomen/pelvis for the screening of occult cancer in patients with unprovoked VTE is not cost-effective, as it is both more costly and no more effective in detecting occult cancer.
Blood pressure and neuropsychological test performance in healthy postmenopausal women.
Alsumali, Adnan; Mekary, Rania A; Seeger, John; Regestein, Quentin
2016-06-01
To study the association between blood pressure and neuropsychological test performance in healthy postmenopausal women. Data from 88 healthy postmenopausal women aged 46-73 years, who were not experiencing hot flashes and who had participated in a prior drug trial, were analyzed to determine whether baseline blood pressure was associated with impaired performance on neuropsychological testing done at 3 follow-up visits separated by 4 weeks. Factor analysis was used to reduce the dimensions of neuropsychological test performance. Mixed linear modeling was used to evaluate the association between baseline blood pressure and repeatedly measured neuropsychological test performance at follow-up in a complete case analysis (n=53). In a sensitivity analysis (n=88), multiple imputation using the Markov Chain Monte Carlo method was used to account for missing data (blood pressure results) at some visits. The variables recording neuropsychological test performance were reduced to two main factors (Factor 1=selective attention; Factor 2=complex processing). In the complete case analysis, the association between a 20-mmHg increase in diastolic blood pressure and Factor 1 remained statistically significant after adjusting for potential confounders, both before adjusting for systolic blood pressure (slope=0.60; 95% CI=0.04, 1.16) and after adjusting for systolic blood pressure (slope=0.76; 95% CI=0.06, 1.47). The positive slopes indicated an increase in the time spent performing a given task (i.e., a decrease in neuropsychological test performance). No other significant associations were found between systolic blood pressure and either factor. The results did not materially change after applying the multiple imputation method. An increase in diastolic blood pressure was associated with a decrease in neuropsychological test performance among older healthy postmenopausal women not experiencing hot flashes. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
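For readers unfamiliar with how results from multiply imputed datasets are combined, the following Python sketch runs m imputations and pools a regression slope with Rubin's rules. It uses scikit-learn's IterativeImputer with sample_posterior=True as a generic stand-in for the Markov Chain Monte Carlo imputation used in the study; the data, variable names and choice of m are hypothetical.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 88
# Hypothetical data: diastolic BP (partly missing) and a task-time outcome.
dbp = rng.normal(80.0, 10.0, n)
factor1 = 0.03 * dbp + rng.normal(0.0, 1.0, n)   # outcome, fully observed
dbp[rng.random(n) < 0.3] = np.nan                # ~30% of BP readings missing
X = np.column_stack([dbp, factor1])

m = 20
betas, variances = [], []
for k in range(m):
    imputed = IterativeImputer(sample_posterior=True,
                               random_state=k).fit_transform(X)
    A = np.column_stack([np.ones(n), imputed[:, 0]])   # intercept + imputed DBP
    coef, res, *_ = np.linalg.lstsq(A, imputed[:, 1], rcond=None)
    cov = (res[0] / (n - 2)) * np.linalg.inv(A.T @ A)  # OLS coefficient covariance
    betas.append(coef[1])
    variances.append(cov[1, 1])

# Rubin's rules: total variance = within + (1 + 1/m) * between.
qbar, ubar = np.mean(betas), np.mean(variances)
b = np.var(betas, ddof=1)
t = ubar + (1.0 + 1.0 / m) * b
print(f"pooled slope {qbar:.3f}, pooled SE {np.sqrt(t):.3f}")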
Kupek, Emil; de Assis, Maria Alice A
2016-09-01
External validation of 24-h food recall in schoolchildren is often restricted to eating events in schools and is based on direct observation as the reference method. The aim of this study was to estimate dietary intake out of school, and consequently the bias in research designs based on only part-time validated food recall, using multiple imputation (MI) conditioned on information about child age, sex, BMI, family income, parental education and the school attended. The previous-day, web-based questionnaire WebCAAFE, structured as six meals/snacks and thirty-two foods/beverages, was answered by a sample of 7-11-year-old Brazilian schoolchildren (n=602) from five public schools. Food/beverage intake recalled by children was compared with the records provided by trained observers during school meals. Sensitivity analysis was performed with artificial data emulating those recalled by children on WebCAAFE in order to evaluate the impact of both differential and non-differential bias. Estimated bias was within a ±30% interval for 84.4% of the thirty-two foods/beverages evaluated in WebCAAFE, and half of the latter reached statistical significance (P<0.05). Rarely (<3%) consumed dietary items were often under-reported (fish/seafood, vegetable soup, cheese bread, French fries), whereas some of the most frequently reported items (meat, bread/biscuits, fruits) showed large overestimation. Compared with the analysis restricted to fully validated data, MI reduced differential bias in the sensitivity analysis, but the bias remained large in most cases. MI provided a suitable statistical framework for part-time validation designs of dietary intake over six daily eating events.
Missing value imputation for microarray data: a comprehensive comparison study and a web tool
2013-01-01
Background Microarray data are usually peppered with missing values due to various reasons. However, most downstream analyses for microarray data require complete datasets. Accurate algorithms for missing value estimation are therefore needed to improve the performance of microarray data analyses. Although many algorithms have been developed, there is still debate over the selection of the optimal algorithm, and existing comparisons of algorithm performance remain incomplete, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. Results In this paper, we performed a comprehensive comparison using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have different impacts on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm depends mainly on the type of dataset rather than on the species from which the samples come. In addition to the statistical measure, two other measures with biological meaning are useful for reflecting the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets. Conclusions In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers can choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms can be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and the imputed results can then be downloaded for downstream data analyses. PMID:24565220
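The local-correlation idea behind the recommended local-least-squares methods can be illustrated with a simpler neighbor-based imputer. The Python sketch below masks 5% of a synthetic expression matrix, imputes each gene from its most similar genes with scikit-learn's KNNImputer (a stand-in, not one of the LLS algorithms benchmarked in the paper), and scores RMSE on the masked entries. All data and settings are invented for illustration.

import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
# Hypothetical log-expression matrix: 200 genes x 20 arrays, 5% missing.
expr = rng.normal(0.0, 1.0, (200, 20)) + rng.normal(0.0, 1.0, (200, 1))
mask = rng.random(expr.shape) < 0.05
expr_missing = np.where(mask, np.nan, expr)

# With genes as rows, each gene's missing entries are imputed from its
# k most similar genes, echoing the local-correlation idea of KNN/LLS.
expr_imputed = KNNImputer(n_neighbors=10).fit_transform(expr_missing)

rmse = np.sqrt(np.mean((expr_imputed[mask] - expr[mask]) ** 2))
print(f"RMSE on the artificially masked entries: {rmse:.3f}")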
Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J
2015-12-01
In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. (c) 2015 APA, all rights reserved.
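A minimal version of the weighting scheme can be sketched in Python: a classification tree models retention at follow-up from baseline covariates, and complete cases are weighted by the inverse of their predicted retention probability. The dropout rule, covariates and tree settings below are invented for illustration; the paper's simulations and pruning strategy are more involved.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 1000
# Hypothetical baseline covariates and a nonlinear, interactive dropout rule.
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p_drop = 1.0 / (1.0 + np.exp(-(x1 * x2 + 0.5 * (x1 > 1))))
observed = rng.random(n) > p_drop          # True = retained at follow-up

# A lightly pruned CART model of retention; depth is an assumed tuning choice.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0)
tree.fit(np.column_stack([x1, x2]), observed)
p_retain = tree.predict_proba(np.column_stack([x1, x2]))[:, 1]

# Inverse-probability weights for the retained (complete) cases only.
w = np.zeros(n)
w[observed] = 1.0 / p_retain[observed]
print("weight range among completers:",
      w[observed].min().round(2), w[observed].max().round(2))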
A guide to missing data for the pediatric nephrologist.
Larkins, Nicholas G; Craig, Jonathan C; Teixeira-Pinto, Armando
2018-03-13
Missing data is an important and common source of bias in clinical research. Readers should be alert to and consider the impact of missing data when reading studies. Beyond preventing missing data in the first place, through good study design and conduct, there are different strategies available to handle data containing missing observations. Complete case analysis is often biased unless data are missing completely at random. Better methods of handling missing data include multiple imputation and models using likelihood-based estimation. With advancing computing power and modern statistical software, these methods are within the reach of clinician-researchers under guidance of a biostatistician. As clinicians reading papers, we need to continue to update our understanding of statistical methods, so that we understand the limitations of these techniques and can critically interpret literature.
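The claim that complete case analysis is biased unless data are missing completely at random is easy to demonstrate. In the hypothetical Python simulation below, the outcome is missing more often at higher values of a fully observed covariate (a MAR mechanism); the complete-case mean is visibly biased, while a simple regression-based imputation recovers the truth. This illustrates the principle only, not any analysis in the article.

import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)                   # always observed (e.g., age)
y = 2.0 + 1.0 * x + rng.normal(size=n)   # true mean of y is 2.0

# MAR: y is missing more often for larger x, depending only on observed x.
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))
y_obs = np.where(miss, np.nan, y)

cc_mean = np.nanmean(y_obs)              # complete-case estimate: biased
# Conditional-mean imputation from the y|x regression fitted on complete
# cases (consistent here because missingness depends only on x).
b1, b0 = np.polyfit(x[~miss], y[~miss], 1)
y_imp = np.where(miss, b0 + b1 * x, y)
print(f"complete-case mean: {cc_mean:.3f}, "
      f"imputation-based mean: {np.mean(y_imp):.3f}")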
32 CFR 776.29 - Imputed disqualification: General rule.
Code of Federal Regulations, 2012 CFR
2012-07-01
... their federal, state, and local bar rules governing the representation of multiple or adverse clients within the same office before such representation is initiated, as such representation may expose them to... military (or Government) service may require representation of opposing sides by covered USG attorneys...
CGDSNPdb: a database resource for error-checked and imputed mouse SNPs.
Hutchins, Lucie N; Ding, Yueming; Szatkiewicz, Jin P; Von Smith, Randy; Yang, Hyuna; de Villena, Fernando Pardo-Manuel; Churchill, Gary A; Graber, Joel H
2010-07-06
The Center for Genome Dynamics Single Nucleotide Polymorphism Database (CGDSNPdb) is an open-source value-added database with more than nine million mouse single nucleotide polymorphisms (SNPs), drawn from multiple sources, with genotypes assigned to multiple inbred strains of laboratory mice. All SNPs are checked for accuracy and annotated for properties specific to the SNP as well as those implied by changes to overlapping protein-coding genes. CGDSNPdb serves as the primary interface to two unique data sets, the 'imputed genotype resource' in which a Hidden Markov Model was used to assess local haplotypes and the most probable base assignment at several million genomic loci in tens of strains of mice, and the Affymetrix Mouse Diversity Genotyping Array, a high density microarray with over 600,000 SNPs and over 900,000 invariant genomic probes. CGDSNPdb is accessible online through either a web-based query tool or a MySQL public login. Database URL: http://cgd.jax.org/cgdsnpdb/
Descalzo, Miguel Á; Garcia, Virginia Villaverde; González-Alvaro, Isidoro; Carbonell, Jordi; Balsa, Alejandro; Sanmartí, Raimon; Lisbona, Pilar; Hernandez-Barrera, Valentín; Jiménez-Garcia, Rodrigo; Carmona, Loreto
2013-02-01
To describe the results of different statistical approaches to a radiographic outcome affected by missing data (the multiple imputation technique, inverse probability weighting and complete case analysis), using data from an observational study. A random sample of 96 RA patients was selected for a follow-up study in which radiographs of hands and feet were scored. Radiographic progression was tested by comparing the change in the total Sharp-van der Heijde radiographic score (TSS) and the joint erosion score (JES) from baseline to the end of the second year of follow-up. The MI technique, inverse probability weighting in a weighted estimating equation (WEE) and CC analysis were used to fit a negative binomial regression. Major predictors of radiographic progression were JES and joint space narrowing (JSN) at baseline, together with baseline disease activity measured by DAS28 for TSS and MTX use for JES. Results from the CC analysis show larger coefficients and standard errors compared with the MI and weighted techniques. The results from the WEE model were quite in line with those of MI. If it seems plausible that either CC or MI analysis may be valid, then MI should be preferred because of its greater efficiency. CC analysis resulted in inefficient estimates or, translated into non-statistical terminology, could lead to inaccurate results and unsound conclusions. The methods discussed here will contribute to the use of alternative approaches for tackling missing data in observational studies.
2013-01-01
Background It is important to know the impact of Very Preterm (VP) birth or Very Low Birth Weight (VLBW). The purpose of this study is to evaluate changes in Health-Related Quality of Life (HRQoL) of adults born VP or with a VLBW between age 19 and age 28. Methods The 1983 nationwide Dutch Project On Preterm and Small for gestational age infants (POPS) cohort of 1338 VP (gestational age <32 weeks) or VLBW (<1500 g) infants was contacted to complete online questionnaires at age 28. In total, 33.8% of eligible participants completed the Health Utilities Index (HUI3), the London Handicap Scale (LHS) and the WHOQoL-BREF. Multiple imputation was applied to correct for missing data and non-response. Results The mean HUI3 and LHS scores did not change significantly from age 19 to age 28. However, after multiple imputation, a significant, though not clinically relevant, increase of 0.02 in the overall HUI3 score was found. The mean HRQoL score measured with the HUI3 increased from 0.83 at age 19 to 0.85 at age 28. The lowest score on the WHOQoL-BREF was in the psychological domain (74.4). Conclusions Overall, no important changes in HRQoL between age 19 and age 28 were found in the POPS cohort. Psychological and emotional problems stand out, from which recommendations for interventions could be derived. PMID:23531081
Janet L. Ohmann; Matthew J. Gregory
2002-01-01
Spatially explicit information on the species composition and structure of forest vegetation is needed at broad spatial scales for natural resource policy analysis and ecological research. We present a method for predictive vegetation mapping that applies direct gradient analysis and nearest-neighbor imputation to ascribe detailed ground attributes of vegetation to...
NASA Astrophysics Data System (ADS)
Manago, K. F.; Hogue, T. S.; Hering, A. S.
2014-12-01
In the City of Los Angeles, groundwater accounts for 11% of the total water supply on average, and 30% during drought years. Due to ongoing drought in California, increased reliance on local water supply highlights the need for better understanding of regional groundwater dynamics and estimating sustainable groundwater supply. However, in an urban setting, such as Los Angeles, understanding or modeling groundwater levels is extremely complicated due to various anthropogenic influences such as groundwater pumping, artificial recharge, landscape irrigation, leaking infrastructure, seawater intrusion, and extensive impervious surfaces. This study analyzes anthropogenic effects on groundwater levels using groundwater monitoring well data from the County of Los Angeles Department of Public Works. The groundwater data is irregularly sampled with large gaps between samples, resulting in a sparsely populated dataset. A multiple imputation method is used to fill the missing data, allowing for multiple ensembles and improved error estimates. The filled data is interpolated to create spatial groundwater maps utilizing information from all wells. The groundwater data is evaluated at a monthly time step over the last several decades to analyze the effect of land cover and identify other influencing factors on groundwater levels spatially and temporally. Preliminary results show irrigated parks have the largest influence on groundwater fluctuations, resulting in large seasonal changes, exceeding changes in spreading grounds. It is assumed that these fluctuations are caused by watering practices required to sustain non-native vegetation. Conversely, high intensity urbanized areas resulted in muted groundwater fluctuations and behavior decoupling from climate patterns. These results provide improved understanding of anthropogenic effects on groundwater levels, in addition to providing high-quality datasets for the validation of regional groundwater models.
Normal Theory Two-Stage ML Estimator When Data Are Missing at the Item Level
Savalei, Victoria; Rhemtulla, Mijke
2017-01-01
In many modeling contexts, the variables in the model are linear composites of the raw items measured for each participant; for instance, regression and path analysis models rely on scale scores, and structural equation models often use parcels as indicators of latent constructs. Currently, no analytic estimation method exists to appropriately handle missing data at the item level. Item-level multiple imputation (MI), however, can handle such missing data straightforwardly. In this article, we develop an analytic approach for dealing with item-level missing data—that is, one that obtains a unique set of parameter estimates directly from the incomplete data set and does not require imputations. The proposed approach is a variant of the two-stage maximum likelihood (TSML) methodology, and it is the analytic equivalent of item-level MI. We compare the new TSML approach to three existing alternatives for handling item-level missing data: scale-level full information maximum likelihood, available-case maximum likelihood, and item-level MI. We find that the TSML approach is the best analytic approach, and its performance is similar to item-level MI. We recommend its implementation in popular software and its further study. PMID:29276371
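As a rough illustration of what stage 1 of a TSML-style analysis involves, the Python sketch below computes saturated maximum likelihood estimates of the mean vector and covariance matrix from incomplete multivariate normal data using the classic EM algorithm; in stage 2, the model of interest would then be fitted to these estimates. This is a bare-bones sketch on simulated data, without the standard errors, convergence checks or corrections the authors develop.

import numpy as np

def em_mvnormal(X, n_iter=200):
    """Saturated ML mean/covariance from data with NaNs (EM, MVN model)."""
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        s1 = np.zeros(p)
        s2 = np.zeros((p, p))
        for i in range(n):
            o = ~np.isnan(X[i])
            m = ~o
            xi = X[i].copy()
            ci = np.zeros((p, p))
            if m.any():
                # E-step: conditional mean and covariance of missing items.
                so_inv = np.linalg.inv(sigma[np.ix_(o, o)])
                reg = sigma[np.ix_(m, o)] @ so_inv
                xi[m] = mu[m] + reg @ (X[i, o] - mu[o])
                ci[np.ix_(m, m)] = sigma[np.ix_(m, m)] - reg @ sigma[np.ix_(o, m)]
            s1 += xi
            s2 += np.outer(xi, xi) + ci
        # M-step: update mean and covariance from expected sufficient stats.
        mu = s1 / n
        sigma = s2 / n - np.outer(mu, mu)
    return mu, sigma

rng = np.random.default_rng(4)
true_cov = np.array([[1.0, 0.5], [0.5, 2.0]])
X = rng.multivariate_normal([0.0, 1.0], true_cov, size=500)
X[rng.random(500) < 0.3, 1] = np.nan        # item 2 missing for ~30% of cases
mu_hat, sigma_hat = em_mvnormal(X)
print(mu_hat)
print(sigma_hat)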
Pasaniuc, Bogdan; Zaitlen, Noah; Lettre, Guillaume; Chen, Gary K; Tandon, Arti; Kao, W H Linda; Ruczinski, Ingo; Fornage, Myriam; Siscovick, David S; Zhu, Xiaofeng; Larkin, Emma; Lange, Leslie A; Cupples, L Adrienne; Yang, Qiong; Akylbekova, Ermeg L; Musani, Solomon K; Divers, Jasmin; Mychaleckyj, Joe; Li, Mingyao; Papanicolaou, George J; Millikan, Robert C; Ambrosone, Christine B; John, Esther M; Bernstein, Leslie; Zheng, Wei; Hu, Jennifer J; Ziegler, Regina G; Nyante, Sarah J; Bandera, Elisa V; Ingles, Sue A; Press, Michael F; Chanock, Stephen J; Deming, Sandra L; Rodriguez-Gil, Jorge L; Palmer, Cameron D; Buxbaum, Sarah; Ekunwe, Lynette; Hirschhorn, Joel N; Henderson, Brian E; Myers, Simon; Haiman, Christopher A; Reich, David; Patterson, Nick; Wilson, James G; Price, Alkes L
2011-04-01
While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.
de Vries, Paul S; Sabater-Lleal, Maria; Chasman, Daniel I; Trompet, Stella; Ahluwalia, Tarunveer S; Teumer, Alexander; Kleber, Marcus E; Chen, Ming-Huei; Wang, Jie Jin; Attia, John R; Marioni, Riccardo E; Steri, Maristella; Weng, Lu-Chen; Pool, Rene; Grossmann, Vera; Brody, Jennifer A; Venturini, Cristina; Tanaka, Toshiko; Rose, Lynda M; Oldmeadow, Christopher; Mazur, Johanna; Basu, Saonli; Frånberg, Mattias; Yang, Qiong; Ligthart, Symen; Hottenga, Jouke J; Rumley, Ann; Mulas, Antonella; de Craen, Anton J M; Grotevendt, Anne; Taylor, Kent D; Delgado, Graciela E; Kifley, Annette; Lopez, Lorna M; Berentzen, Tina L; Mangino, Massimo; Bandinelli, Stefania; Morrison, Alanna C; Hamsten, Anders; Tofler, Geoffrey; de Maat, Moniek P M; Draisma, Harmen H M; Lowe, Gordon D; Zoledziewska, Magdalena; Sattar, Naveed; Lackner, Karl J; Völker, Uwe; McKnight, Barbara; Huang, Jie; Holliday, Elizabeth G; McEvoy, Mark A; Starr, John M; Hysi, Pirro G; Hernandez, Dena G; Guan, Weihua; Rivadeneira, Fernando; McArdle, Wendy L; Slagboom, P Eline; Zeller, Tanja; Psaty, Bruce M; Uitterlinden, André G; de Geus, Eco J C; Stott, David J; Binder, Harald; Hofman, Albert; Franco, Oscar H; Rotter, Jerome I; Ferrucci, Luigi; Spector, Tim D; Deary, Ian J; März, Winfried; Greinacher, Andreas; Wild, Philipp S; Cucca, Francesco; Boomsma, Dorret I; Watkins, Hugh; Tang, Weihong; Ridker, Paul M; Jukema, Jan W; Scott, Rodney J; Mitchell, Paul; Hansen, Torben; O'Donnell, Christopher J; Smith, Nicholas L; Strachan, David P; Dehghan, Abbas
2017-01-01
An increasing number of genome-wide association (GWA) studies are now using the higher resolution 1000 Genomes Project reference panel (1000G) for imputation, with the expectation that 1000G imputation will lead to the discovery of additional associated loci when compared to HapMap imputation. In order to assess the improvement of 1000G over HapMap imputation in identifying associated loci, we compared the results of GWA studies of circulating fibrinogen based on the two reference panels. Using both HapMap and 1000G imputation we performed a meta-analysis of 22 studies comprising the same 91,953 individuals. We identified six additional signals using 1000G imputation, while 29 loci were associated using both HapMap and 1000G imputation. One locus identified using HapMap imputation was not significant using 1000G imputation. The genome-wide significance threshold of 5×10^-8 is based on the number of independent statistical tests using HapMap imputation, and 1000G imputation may lead to further independent tests that should be corrected for. When using a stricter Bonferroni correction for the 1000G GWA study (P-value < 2.5×10^-8), the number of loci significant only using HapMap imputation increased to 4 while the number of loci significant only using 1000G decreased to 5. In conclusion, 1000G imputation enabled the identification of 20% more loci than HapMap imputation, although the advantage of 1000G imputation became less clear when a stricter Bonferroni correction was used. More generally, our results provide insights that are applicable to the implementation of other dense reference panels that are under development.
Wilson, Dawn K; Ellerbe, Caitlyn; Lawson, Andrew B; Alia, Kassandra A; Meyers, Duncan C; Coulon, Sandra M; Lawman, Hannah G
2013-03-01
This study examined the effects of imputation modeling on spatial proximity and social factors as predictors of walking in African American adults. Models were compared that examined relationships between household proximity to a walking trail and social factors in determining walking status. Participants (N=133; 66% female; mean age=55 years) were recruited to a police-supported walking and social marketing intervention. Bayesian modeling was used to identify predictors of walking at 12 months. Sensitivity analyses using different imputation approaches, and spatial contextual effects, were compared. All the imputation methods showed that social life and income were significant predictors of walking; however, the complete data approach provided the best model, indicating that Age (OR=1.04; 95% CI: 1.00, 1.08), Social Life (OR=0.83; 95% CI: 0.69, 0.98) and Income <$10,000 (OR=0.10; 95% CI: 0.01, 0.97) were all predictors of walking. The complete data approach was the best model of predictors of walking in African Americans. Copyright © 2012 Elsevier Ltd. All rights reserved.
DrImpute: imputing dropout events in single cell RNA sequencing data.
Gong, Wuming; Kwak, Il-Youp; Pota, Pruthvi; Koyano-Nakagawa, Naoko; Garry, Daniel J
2018-06-08
The single cell RNA sequencing (scRNA-seq) technique has begun a new era by allowing the observation of gene expression at the single cell level. However, there is also a large amount of technical and biological noise. Because of the low amount of RNA transcripts and the stochastic nature of the gene expression pattern, there is a high chance that nonzero entries are recorded as zero; these are called dropout events. We developed DrImpute to impute dropout events in scRNA-seq data. We show that DrImpute has significantly better performance in separating dropout zeros from true zeros than existing imputation algorithms. We also demonstrate that DrImpute can significantly improve the performance of existing tools for clustering, visualization and lineage reconstruction of nine published scRNA-seq datasets. DrImpute can serve as a very useful addition to the currently existing statistical tools for single cell RNA-seq analysis. DrImpute is implemented in R and is available at https://github.com/gongx030/DrImpute .
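DrImpute itself averages imputations over many cell clusterings; the hypothetical Python miniature below shows the core idea with a single k-means clustering, replacing zeros with the per-gene mean of nonzero values within the same cluster. It is a simplification for illustration, not the package's algorithm.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Hypothetical log-normalized counts: 300 cells x 50 genes, two cell types.
base = np.repeat(rng.gamma(2.0, 1.0, (2, 50)), 150, axis=0)
expr = base + rng.normal(0.0, 0.2, base.shape)
dropout = rng.random(expr.shape) < 0.3
counts = np.where(dropout, 0.0, expr)        # dropouts recorded as zeros

# Simplified DrImpute-style step: cluster cells, then replace zeros with
# the mean expression of the same gene within the cell's cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(counts)
imputed = counts.copy()
for c in np.unique(labels):
    rows = labels == c
    block = counts[rows]                      # boolean indexing returns a copy
    nonzero = np.where(block > 0, block, np.nan)
    means = np.nanmean(nonzero, axis=0)       # per-gene mean over nonzero cells
    zeros = block == 0
    block[zeros] = np.take(means, np.where(zeros)[1])
    imputed[rows] = block
print("zeros before:", int((counts == 0).sum()),
      "after:", int((imputed == 0).sum()))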
Purposeful Variable Selection and Stratification to Impute Missing FAST Data in Trauma Research
Fuchs, Paul A.; del Junco, Deborah J.; Fox, Erin E.; Holcomb, John B.; Rahbar, Mohammad H.; Wade, Charles A.; Alarcon, Louis H.; Brasel, Karen J.; Bulger, Eileen M.; Cohen, Mitchell J.; Myers, John G.; Muskat, Peter; Phelan, Herb A.; Schreiber, Martin A.; Cotton, Bryan A.
2013-01-01
Background The Focused Assessment with Sonography for Trauma (FAST) exam is an important variable in many retrospective trauma studies. The purpose of this study was to devise an imputation method to overcome missing data for the FAST exam. Due to variability in patients’ injuries and trauma care, these data are unlikely to be missing completely at random (MCAR), raising concern for validity when analyses exclude patients with missing values. Methods Imputation was conducted under a less restrictive, more plausible missing at random (MAR) assumption. Patients with missing FAST exams had available data on alternate, clinically relevant elements that were strongly associated with FAST results in complete cases, especially when considered jointly. Subjects with missing data (32.7%) were divided into eight mutually exclusive groups based on selected variables that both described the injury and were associated with missing FAST values. Additional variables were selected within each group to classify missing FAST values as positive or negative, and correct FAST exam classification based on these variables was determined for patients with non-missing FAST values. Results Severe head/neck injury (odds ratio, OR=2.04), severe extremity injury (OR=4.03), severe abdominal injury (OR=1.94), no injury (OR=1.94), other abdominal injury (OR=0.47), other head/neck injury (OR=0.57) and other extremity injury (OR=0.45) groups had significant ORs for missing data; the other group odds ratio was not significant (OR=0.84). All 407 missing FAST values were imputed, with 109 classified as positive. Correct classification of non-missing FAST results using the alternate variables was 87.2%. Conclusions Purposeful imputation for missing FAST exams based on interactions among selected variables assessed by simple stratification may be a useful adjunct to sensitivity analysis in the evaluation of imputation strategies under different missing data mechanisms. This approach has the potential for widespread application in clinical and translational research and validation is warranted. Level of Evidence Level II Prognostic or Epidemiological PMID:23778515
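The stratified logic can be mimicked in a few lines of Python with pandas: patients are grouped by injury pattern and missing binary exam results are classified within each stratum. Here the within-stratum rule is simply the majority result among complete cases, a crude stand-in for the authors' variable-based classification; the data frame and group labels are hypothetical.

import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 1000
df = pd.DataFrame({
    "injury_group": rng.choice(
        ["severe_abdominal", "severe_extremity", "other", "none"], n),
    "fast": rng.choice([0.0, 1.0], n, p=[0.8, 0.2]),
})
df.loc[rng.random(n) < 0.3, "fast"] = np.nan    # ~30% missing FAST exams

# Within each stratum, classify missing exams by the majority result
# among complete cases in that stratum.
def fill_group(s):
    majority = s.dropna().mode().iloc[0]
    return s.fillna(majority)

df["fast_imputed"] = df.groupby("injury_group")["fast"].transform(fill_group)
print(int(df["fast_imputed"].isna().sum()), "missing after stratified imputation")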
USDA-ARS's Scientific Manuscript database
Microsatellite markers (MS) have traditionally been used for parental verification and are still the international standard in spite of their higher cost, error rate, and turnaround time compared with Single Nucleotide Polymorphisms (SNP)-based assays. Despite domestic and international demands fro...
Fast imputation using medium- or low-coverage sequence data
USDA-ARS's Scientific Manuscript database
Direct imputation from raw sequence reads can be more accurate than calling genotypes first and then imputing, especially if read depth is low or error rates high, but different imputation strategies are required than those used for data from genotyping chips. A fast algorithm to impute from lower t...
Chou, Wen-Chi; Zheng, Hou-Feng; Cheng, Chia-Ho; Yan, Han; Wang, Li; Han, Fang; Richards, J. Brent; Karasik, David; Kiel, Douglas P.; Hsu, Yi-Hsiang
2016-01-01
Imputation using the 1000 Genomes haplotype reference panel has been widely adopted to estimate genotypes in genome-wide association studies. To evaluate imputation quality with a relatively larger reference panel and a reference panel composed of different ethnic populations, we conducted imputations in the Framingham Heart Study and the North Chinese Study using a combined reference panel from the 1000 Genomes (N = 1,092) and UK10K (N = 3,781) projects. For rare variants with 0.01% < MAF ≤ 0.5%, imputation in the Framingham Heart Study with the combined reference panel increased well-imputed genotypes (with imputation quality score ≥0.4) from 62.9% to 76.1% when compared to imputation with the 1000 Genomes. For the North Chinese samples, imputation of rare variants with 0.01% < MAF ≤ 0.5% with the combined reference panel increased well-imputed genotypes from 49.8% to 61.8%. The predominant European ancestry of the UK10K and the combined reference panels may explain why there was less of an increase in imputation success in the North Chinese samples. Our results underscore the importance and potential of larger reference panels to impute rare variants, while recognizing that increasing ethnic-specific variants in reference panels may result in better imputation of genotypes in some ethnic groups. PMID:28004816
Estimates of cancer incidence, mortality and survival in aboriginal people from NSW, Australia
2012-01-01
Background Aboriginal status has been unreliably and incompletely recorded in health and vital registration data collections for the most populous areas of Australia, including NSW where 29% of Australian Aboriginal people reside. This paper reports an assessment of Aboriginal status recording in NSW cancer registrations and estimates incidence, mortality and survival from cancer in NSW Aboriginal people using multiple imputation of missing Aboriginal status in NSW Central Cancer Registry (CCR) records. Methods Logistic regression modelling and multiple imputation were used to assign Aboriginal status to those records of cancer diagnosed from 1999 to 2008 with missing Aboriginality (affecting 12-18% of NSW cancers registered in this period). Estimates of incidence, mortality and survival from cancer in NSW Aboriginal people were compared with the NSW total population, as standardised incidence and mortality ratios, and with the non-Aboriginal population. Results Following imputation, 146 (12.2%) extra cancers in Aboriginal males and 140 (12.5%) in Aboriginal females were found for 1999-2007. Mean annual cancer incidence in NSW Aboriginal people was estimated to be 660 per 100,000 and 462 per 100,000, 9% and 6% higher than all NSW males and females respectively. Mean annual cancer mortality in NSW Aboriginal people was estimated to be 373 per 100,000 in males and 240 per 100,000 in females, 68% and 73% higher than for all NSW males and females respectively. Despite similar incidence of localised cancer, mortality from localised cancer in Aboriginal people is significantly higher than in non-Aboriginal people, as is mortality from cancers with regional, distant and unknown degree of spread at diagnosis. Cancer survival in Aboriginal people is significantly lower: 51% of males and 43% of females had died of the cancer by 5 years following diagnosis, compared to 36% and 33% of non-Aboriginal males and females respectively. Conclusion The present study is the first to produce valid and reliable estimates of cancer incidence, survival and mortality in Australian Aboriginal people from NSW. Despite somewhat higher cancer incidence in Aboriginal than in non-Aboriginal people, substantially higher mortality and lower survival in Aboriginal people is only partly explained by more advanced cancer at diagnosis. PMID:22559220
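A stripped-down version of imputing a missing binary indicator from registry covariates might look like the Python sketch below: a logistic model is fitted on records with known status, and each of m imputed datasets draws the missing statuses from the predicted probabilities. This is "improper" MI (parameter uncertainty is not redrawn between imputations) on invented data, so it illustrates the mechanics only, not the article's model.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5000
age = rng.uniform(20, 80, n)
remote = (rng.random(n) < 0.2).astype(float)        # hypothetical predictor
p_status = 1.0 / (1.0 + np.exp(-(-3.0 + 1.5 * remote + 0.01 * age)))
status = (rng.random(n) < p_status).astype(float)   # 1 = Aboriginal
status[rng.random(n) < 0.15] = np.nan               # ~15% unrecorded

X = np.column_stack([age, remote])
obs = ~np.isnan(status)
model = LogisticRegression().fit(X[obs], status[obs])
p_hat = model.predict_proba(X)[:, 1]

m = 20
rates = []
for _ in range(m):
    filled = status.copy()
    draw = (rng.random(n) < p_hat).astype(float)    # stochastic, not best-guess
    filled[~obs] = draw[~obs]
    rates.append(filled.mean())
print(f"imputed prevalence: {np.mean(rates):.4f} "
      f"(between-imputation SD {np.std(rates, ddof=1):.4f})")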
Rosner, Bernard; Colditz, Graham A.
2011-01-01
Purpose Age at menopause, a major marker in the reproductive life, may bias results for evaluation of breast cancer risk after menopause. Methods We follow 38,948 premenopausal women in 1980 and identify 2,586 who reported hysterectomy without bilateral oophorectomy, and 31,626 who reported natural menopause during 22 years of follow-up. We evaluate risk factors for natural menopause, impute age at natural menopause for women reporting hysterectomy without bilateral oophorectomy and estimate the hazard of reaching natural menopause in the next 2 years. We apply this imputed age at menopause both to increase sample size and to evaluate the relation between postmenopausal exposures and risk of breast cancer. Results Age, cigarette smoking, age at menarche, pregnancy history, body mass index, history of benign breast disease, and history of breast cancer were each significantly related to age at natural menopause; duration of oral contraceptive use and family history of breast cancer were not. The imputation increased the sample size substantially, and although associations for some postmenopausal risk factors (height and alcohol use) were weaker in the expanded model, the estimate for use of hormone therapy is less biased. Conclusions Imputing age at menopause increases sample size, broadens generalizability by making results applicable to women with hysterectomy, and reduces bias. PMID:21441037
Untreated brain arteriovenous malformation
Al-Shahi Salman, Rustam; McCulloch, Charles E.; Stapf, Christian; Young, William L.
2014-01-01
Objective: To identify risk factors for intracranial hemorrhage in the natural history course of brain arteriovenous malformations (AVMs) using individual patient data meta-analysis of 4 existing cohorts. Methods: We harmonized data from Kaiser Permanente of Northern California (n = 856), University of California San Francisco (n = 787), Columbia University (n = 672), and the Scottish Intracranial Vascular Malformation Study (n = 210). We censored patients at first treatment, death, last visit, or 10-year follow-up, and performed stratified Cox regression analysis of time-to-hemorrhage after evaluating hemorrhagic presentation, sex, age at diagnosis, deep venous drainage, and AVM size as predictors. Multiple imputation was performed to assess impact of missing data. Results: A total of 141 hemorrhage events occurred during 6,074 patient-years of follow-up (annual rate of 2.3%, 95% confidence interval [CI] 2.0%–2.7%), higher for ruptured (4.8%, 3.9%–5.9%) than unruptured (1.3%, 1.0%–1.7%) AVMs at presentation. Hemorrhagic presentation (hazard ratio 3.86, 95% CI 2.42–6.14) and increasing age (1.34 per decade, 1.17–1.53) independently predicted hemorrhage and remained significant predictors in the imputed dataset. Female sex (1.49, 95% CI 0.96–2.30) and exclusively deep venous drainage (1.60, 0.95–2.68, p = 0.02 in imputed dataset) may be additional predictors. AVM size was not associated with intracerebral hemorrhage in multivariable models (p > 0.5). Conclusion: This large, individual patient data meta-analysis identified hemorrhagic presentation and increasing age as independent predictors of hemorrhage during follow-up. Additional AVM cohort data may further improve precision of estimates, identify new risk factors, and allow validation of prediction models. PMID:25015366
Applied Missing Data Analysis. Methodology in the Social Sciences Series
ERIC Educational Resources Information Center
Enders, Craig K.
2010-01-01
Walking readers step by step through complex concepts, this book translates missing data techniques into something that applied researchers and graduate students can understand and utilize in their own research. Enders explains the rationale and procedural details for maximum likelihood estimation, Bayesian estimation, multiple imputation, and…
NASA Astrophysics Data System (ADS)
Deo, Ram K.
Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands and threats to forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly and as a consequence spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest based k Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake States region. Targeting small-area application of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive a standing volume map in a cost-effective way. The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.
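The RF-kNN idea, measuring similarity between target pixels and reference plots by how often a random forest routes them to the same terminal nodes, can be sketched in Python with scikit-learn. The predictors, plot data and choice of k below are hypothetical, and operational implementations (e.g., the yaImpute package mentioned earlier in this collection) add much more.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
# Hypothetical reference plots: spectral predictors -> measured biomass.
X_ref = rng.normal(size=(400, 6))
biomass = 50 + 10 * X_ref[:, 0] - 5 * X_ref[:, 1] + rng.normal(0, 5, 400)
X_target = rng.normal(size=(10, 6))        # map pixels with predictors only

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_ref, biomass)

# RF proximity: fraction of trees in which two observations share a leaf.
leaves_ref = rf.apply(X_ref)               # (n_ref, n_trees) leaf indices
leaves_tgt = rf.apply(X_target)
k = 5
for j, row in enumerate(leaves_tgt):
    prox = (leaves_ref == row).mean(axis=1)        # proximity to each plot
    nn = np.argsort(prox)[-k:]                     # k most similar plots
    print(f"pixel {j}: imputed biomass = {biomass[nn].mean():.1f}")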
Baccini, Michela; Carreras, Giulia
2014-10-01
This paper describes the methods used to investigate variations in total alcoholic beverage consumption as related to selected control intervention policies and other socioeconomic factors (unplanned factors) within 12 European countries involved in the AMPHORA project. The analysis presented several critical points: the presence of missing values, strong correlation among the unplanned factors, and long-term waves or trends in both the time series of alcohol consumption and the time series of the main explanatory variables. These difficulties were addressed by implementing a multiple imputation procedure for filling in missing values, and then specifying for each country a multiple regression model that accounted for time trend, policy measures and a limited set of unplanned factors selected in advance on the basis of sociological and statistical considerations. This approach allowed estimation of the "net" effect of the selected control policies on alcohol consumption, but not of the association between each unplanned factor and the outcome.
Sung, Yun J; Gu, C Charles; Tiwari, Hemant K; Arnett, Donna K; Broeckel, Ulrich; Rao, Dabeeru C
2012-07-01
Genotype imputation provides imputation of untyped single nucleotide polymorphisms (SNPs) that are present on a reference panel such as those from the HapMap Project. It is popular for increasing statistical power and comparing results across studies using different platforms. Imputation for African American populations is challenging because their linkage disequilibrium blocks are shorter and also because no ideal reference panel is available due to admixture. In this paper, we evaluated three imputation strategies for African Americans. The intersection strategy used a combined panel consisting of SNPs polymorphic in both CEU and YRI. The union strategy used a panel consisting of SNPs polymorphic in either CEU or YRI. The merge strategy merged results from two separate imputations, one using CEU and the other using YRI. Because recent investigators are increasingly using the data from the 1000 Genomes (1KG) Project for genotype imputation, we evaluated both 1KG-based imputations and HapMap-based imputations. We used 23,707 SNPs from chromosomes 21 and 22 on Affymetrix SNP Array 6.0 genotyped for 1,075 HyperGEN African Americans. We found that 1KG-based imputations provided a substantially larger number of variants than HapMap-based imputations, about three times as many common variants and eight times as many rare and low-frequency variants. This higher yield is expected because the 1KG panel includes more SNPs. Accuracy rates using 1KG data were slightly lower than those using HapMap data before filtering, but slightly higher after filtering. The union strategy provided the highest imputation yield with next highest accuracy. The intersection strategy provided the lowest imputation yield but the highest accuracy. The merge strategy provided the lowest imputation accuracy. We observed that SNPs polymorphic only in CEU had much lower accuracy, reducing the accuracy of the union strategy. Our findings suggest that 1KG-based imputations can facilitate discovery of significant associations for SNPs across the whole MAF spectrum. Because the 1KG Project is still under way, we expect that later versions will provide better imputation performance. © 2012 Wiley Periodicals, Inc.
Taylor, Sandra L; Ruhaak, L Renee; Weiss, Robert H; Kelly, Karen; Kim, Kyoungmi
2017-01-01
High-throughput mass spectrometry (MS) is now being used to profile small molecular compounds across multiple biological sample types from the same subjects, with the goal of leveraging information across biospecimens. Multivariate statistical methods that combine information from all biospecimens could be more powerful than the usual univariate analyses. However, missing values are common in MS data, and imputation can impact between-biospecimen correlation and multivariate analysis results. We propose two multivariate two-part statistics that accommodate missing values and combine data from all biospecimens to identify differentially regulated compounds. Statistical significance is determined using a multivariate permutation null distribution. Relative to univariate tests, the multivariate procedures detected more significant compounds in three biological datasets. In a simulation study, we showed that multi-biospecimen testing procedures were more powerful than single-biospecimen methods when compounds are differentially regulated in multiple biospecimens, but univariate methods can be more powerful if compounds are differentially regulated in only one biospecimen. We provide R functions to implement and illustrate our method as supplementary information. Contact: sltaylor@ucdavis.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
2014-01-01
Background Although the X chromosome is the second largest bovine chromosome, markers on the X chromosome are not used for genomic prediction in some countries and populations. In this study, we presented a method for computing genomic relationships using X chromosome markers, investigated the accuracy of imputation from a low density (7K) to the 54K SNP (single nucleotide polymorphism) panel, and compared the accuracy of genomic prediction with and without using X chromosome markers. Methods The impact of considering X chromosome markers on prediction accuracy was assessed using data from Nordic Holstein bulls and different sets of SNPs: (a) the 54K SNPs for reference and test animals, (b) SNPs imputed from the 7K to the 54K SNP panel for test animals, (c) SNPs imputed from the 7K to the 54K panel for half of the reference animals, and (d) the 7K SNP panel for all animals. Beagle and Findhap were used for imputation. GBLUP (genomic best linear unbiased prediction) models with or without X chromosome markers and with or without a residual polygenic effect were used to predict genomic breeding values for 15 traits. Results Averaged over the two imputation datasets, correlation coefficients between imputed and true genotypes for autosomal markers, pseudo-autosomal markers, and X-specific markers were 0.971, 0.831 and 0.935 when using Findhap, and 0.983, 0.856 and 0.937 when using Beagle. Estimated reliabilities of genomic predictions based on the imputed datasets using Findhap or Beagle were very close to those using the real 54K data. Genomic prediction using all markers gave slightly higher reliabilities than predictions without X chromosome markers. Based on our data which included only bulls, using a G matrix that accounted for sex-linked relationships did not improve prediction, compared with a G matrix that did not account for sex-linked relationships. A model that included a polygenic effect did not recover the loss of prediction accuracy from exclusion of X chromosome markers. Conclusions The results from this study suggest that markers on the X chromosome contribute to accuracy of genomic predictions and should be used for routine genomic evaluation. PMID:25080199
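For readers unfamiliar with genomic relationship matrices, the Python sketch below builds the common VanRaden (method 1) G matrix from 0/1/2 allele counts. How X-specific markers are coded (e.g., 0/1 dosages in males) is the crux of the paper and is not reproduced here; the sketch only shows the baseline computation, on simulated genotypes.

import numpy as np

def vanraden_g(M):
    """VanRaden (method 1) genomic relationship matrix.

    M: (n_animals, n_markers) allele counts coded 0/1/2. Any special
    coding for X-specific markers is assumed to happen upstream.
    """
    p = M.mean(axis=0) / 2.0                 # estimated allele frequencies
    Z = M - 2.0 * p                          # center each marker by 2p
    denom = 2.0 * np.sum(p * (1.0 - p))      # scaling to unit expected diagonal
    return (Z @ Z.T) / denom

rng = np.random.default_rng(9)
geno = rng.binomial(2, 0.3, size=(50, 1000)).astype(float)
G = vanraden_g(geno)
print(G.shape, round(float(np.mean(np.diag(G))), 3))   # diagonal ~1 on average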
Testing independence of bivariate interval-censored data using modified Kendall's tau statistic.
Kim, Yuneung; Lim, Johan; Park, DoHwan
2015-11-01
In this paper, we study a nonparametric procedure to test independence of bivariate interval-censored data, for both current status data (case 1 interval-censored data) and case 2 interval-censored data. To do this, we propose a score-based modification of the Kendall's tau statistic for bivariate interval-censored data. Our modification defines the Kendall's tau statistic using expected numbers of concordant and discordant pairs. The performance of the modified approach is illustrated by simulation studies and an application to an AIDS study. We compare our method to alternative approaches such as the two-stage estimation method of Sun et al. (Scandinavian Journal of Statistics, 2006) and the multiple imputation method of Betensky and Finkelstein (Statistics in Medicine, 1999b). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
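A simplified relative of the proposed statistic can be computed by counting only pairs whose ordering is unambiguous in both margins, i.e., pairs whose intervals do not overlap. The Python sketch below does exactly that on simulated interval-censored data; the authors' score-based version instead works with expected numbers of concordant and discordant pairs, so this is a stand-in for intuition only.

import numpy as np

def interval_kendall_tau(lx, rx, ly, ry):
    """Kendall's tau-style statistic for bivariate interval-censored data,
    counting only pairs that are strictly ordered in both coordinates."""
    n = len(lx)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Sign of the comparison in each margin; 0 if intervals overlap.
            sx = 1 if rx[i] < lx[j] else (-1 if rx[j] < lx[i] else 0)
            sy = 1 if ry[i] < ly[j] else (-1 if ry[j] < ly[i] else 0)
            if sx and sy:
                conc += sx == sy
                disc += sx != sy
    total = conc + disc
    return (conc - disc) / total if total else np.nan

rng = np.random.default_rng(10)
t1 = rng.exponential(1.0, 100)
t2 = t1 + rng.normal(0.0, 0.3, 100)              # positively associated times
lx, rx = t1 - rng.random(100) * 0.2, t1 + rng.random(100) * 0.2
ly, ry = t2 - rng.random(100) * 0.2, t2 + rng.random(100) * 0.2
print(round(interval_kendall_tau(lx, rx, ly, ry), 3))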
Order-restricted inference for means with missing values.
Wang, Heng; Zhong, Ping-Shou
2017-09-01
Missing values appear very often in many applications, but the problem of missing values has not received much attention in testing order-restricted alternatives. Under the missing at random (MAR) assumption, we impute the missing values nonparametrically using kernel regression. For data with imputation, the classical likelihood ratio test designed for testing the order-restricted means is no longer applicable since the likelihood does not exist. This article proposes a novel method for constructing test statistics for assessing means with an increasing order or a decreasing order based on jackknife empirical likelihood (JEL) ratio. It is shown that the JEL ratio statistic evaluated under the null hypothesis converges to a chi-bar-square distribution, whose weights depend on missing probabilities and nonparametric imputation. Simulation study shows that the proposed test performs well under various missing scenarios and is robust for normally and nonnormally distributed data. The proposed method is applied to an Alzheimer's disease neuroimaging initiative data set for finding a biomarker for the diagnosis of the Alzheimer's disease. © 2017, The International Biometric Society.
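The kernel regression imputation step assumed in this approach can be sketched directly. The following Python function fills missing responses with a Nadaraya-Watson estimate fitted on the observed pairs; the Gaussian kernel and fixed bandwidth are arbitrary choices for illustration, and the JEL test built on top of the imputed values is not shown.

import numpy as np

def nw_impute(x, y, bandwidth=0.5):
    """Fill NaNs in y with a Nadaraya-Watson estimate m(x) fitted on the
    observed pairs (the nonparametric imputation step assumed under MAR)."""
    obs = ~np.isnan(y)
    xo, yo = x[obs], y[obs]
    y_filled = y.copy()
    for i in np.where(~obs)[0]:
        w = np.exp(-0.5 * ((x[i] - xo) / bandwidth) ** 2)   # Gaussian kernel
        y_filled[i] = np.sum(w * yo) / np.sum(w)
    return y_filled

rng = np.random.default_rng(11)
x = rng.uniform(0.0, 4.0, 300)
y = np.sin(x) + rng.normal(0.0, 0.2, 300)
y[rng.random(300) < 0.25] = np.nan
print(int(np.isnan(nw_impute(x, y)).sum()), "missing values remain")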
Identity-by-Descent-Based Phasing and Imputation in Founder Populations Using Graphical Models
Palin, Kimmo; Campbell, Harry; Wright, Alan F; Wilson, James F; Durbin, Richard
2011-01-01
Accurate knowledge of haplotypes, the combination of alleles co-residing on a single copy of a chromosome, enables powerful gene mapping and sequence imputation methods. Since humans are diploid, haplotypes must be derived from genotypes by a phasing process. In this study, we present a new computational model for haplotype phasing based on pairwise sharing of haplotypes inferred to be Identical-By-Descent (IBD). We apply the Bayesian network based model in a new phasing algorithm, called systematic long-range phasing (SLRP), that can capitalize on the close genetic relationships in isolated founder populations, and show with simulated and real genome-wide genotype data that SLRP substantially reduces the rate of phasing errors compared to previous phasing algorithms. Furthermore, the method accurately identifies regions of IBD, enabling linkage-like studies without pedigrees, and can be used to impute most genotypes with very low error rate. Genet. Epidemiol. 35:853-860, 2011. © 2011 Wiley Periodicals, Inc. PMID:22006673
Microarray missing data imputation based on a set theoretic framework and biological knowledge.
Gan, Xiangchao; Liew, Alan Wee-Chung; Yan, Hong
2006-01-01
Gene expressions measured using microarrays usually suffer from the missing value problem. However, many data analysis methods require a complete data matrix. Although existing missing value imputation algorithms have shown good performance in dealing with missing values, they also have their limitations. For example, some algorithms perform well only when strong local correlation exists in the data, while others perform best when the data are dominated by global structure. In addition, these algorithms do not take any biological constraint into account in their imputation. In this paper, we propose a set theoretic framework based on projection onto convex sets (POCS) for missing data imputation. POCS allows us to incorporate different types of a priori knowledge about missing values into the estimation process. The main idea of POCS is to formulate every piece of prior knowledge into a corresponding convex set and then use a convergence-guaranteed iterative procedure to obtain a solution in the intersection of all these sets. In this work, we design several convex sets that take the biological characteristics of the data into consideration: the first set mainly exploits the local correlation structure among genes in microarray data, while the second set captures the global correlation structure among arrays. The third set (actually a series of sets) exploits synchronization loss, a common phenomenon in cyclic systems, and is constructed specifically for our POCS imputation algorithm. Experiments show that our algorithm can achieve a significant reduction of error compared to the KNNimpute, SVDimpute and LSimpute methods.
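The mechanics of POCS, alternating projections until the iterate lies in the intersection of the constraint sets, can be shown with two simple sets. In the R sketch below, one set fixes the observed entries and the other is a linear subspace estimated from complete rows; the paper's biologically motivated sets (local correlation, global structure, synchronization loss) are richer, so treat this only as the skeleton of the algorithm.

```r
# Illustrative sketch (R) of the POCS idea: alternate projections between
# C1 = matrices agreeing with the observed entries and C2 = matrices whose rows
# lie in a fixed linear subspace (span of singular vectors of the complete rows).
set.seed(4)
true <- outer(rnorm(100), rnorm(20))                 # rank-1 "expression" matrix
X <- true + matrix(rnorm(2000, sd = 0.1), 100, 20)
mask <- matrix(runif(2000) < 0.1, 100, 20)           # 10% of entries missing
X[mask] <- NA

complete_rows <- complete.cases(X)
V <- svd(X[complete_rows, ])$v[, 1:3, drop = FALSE]  # subspace basis

Y <- X; Y[mask] <- mean(X, na.rm = TRUE)             # initialize missing entries
for (it in 1:50) {
  Y <- Y %*% V %*% t(V)                              # project rows onto span(V) (C2)
  Y[!mask] <- X[!mask]                               # restore observed entries  (C1)
}
sqrt(mean((Y[mask] - true[mask])^2))                 # RMSE on the imputed entries
```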
Performance of genotype imputation for low frequency and rare variants from the 1000 genomes.
Zheng, Hou-Feng; Rong, Jing-Jing; Liu, Ming; Han, Fang; Zhang, Xing-Wei; Richards, J Brent; Wang, Li
2015-01-01
Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most imputations have been run using HapMap samples as the reference, and imputation of low-frequency and rare variants (minor allele frequency (MAF) < 5%) has not been systematically assessed. With the emergence of next-generation sequencing, large reference panels (such as the 1000 Genomes panel) are available to facilitate imputation of these variants. Therefore, to assess the performance of imputation for low-frequency and rare variants, we imputed 153 individuals, each genotyped on 3 different arrays (317K, 610K and 1M SNPs), against three different reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1), using IMPUTE version 2. The differences between these three releases of the 1000 Genomes data are the sample size, ancestry diversity, number of variants and their frequency spectrum. We found that both the reference panel and the GWAS chip density affect the imputation of low-frequency and rare variants. 1KGphase1 outperformed the other 2 panels, with a higher concordance rate, a higher proportion of well-imputed variants (info > 0.4) and a higher mean info score in each MAF bin. Similarly, the 1M array outperformed the 610K and 317K arrays. However, for very rare variants (MAF ≤ 0.3%), only 0-1% of the variants were well imputed. We conclude that the imputation of low-frequency and rare variants improves with larger reference panels and higher-density genome-wide genotyping arrays. Yet, despite a large reference panel size and dense genotyping density, very rare variants remain difficult to impute.
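The two accuracy summaries this literature leans on, concordance rate and dosage correlation within MAF bins, are easy to compute once imputed and true genotypes are in hand. The R sketch below does so on simulated genotypes; the 90% correct-imputation rate and the bin edges are assumptions, and IMPUTE2 info scores are not recomputed.

```r
# Minimal sketch (R): concordance and dosage correlation between imputed and
# true genotypes within MAF bins, on simulated data.
set.seed(5)
m <- 5000
maf <- runif(m, 0.001, 0.5)
true_g <- rbinom(m, 2, maf)                                  # true genotypes
imp_g  <- ifelse(runif(m) < 0.9, true_g, rbinom(m, 2, maf))  # 90% imputed correctly

bins <- cut(maf, c(0, 0.003, 0.01, 0.05, 0.5))
data.frame(
  bin         = levels(bins),
  concordance = tapply(true_g == imp_g, bins, mean),
  dosage_r    = tapply(seq_len(m), bins,
                       function(i) cor(true_g[i], imp_g[i]))
)
```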
Widaman, Keith F.; Grimm, Kevin J.; Early, Dawnté R.; Robins, Richard W.; Conger, Rand D.
2013-01-01
Difficulties arise in multiple-group evaluations of factorial invariance if particular manifest variables are missing completely in certain groups. Ad hoc analytic alternatives can be used in such situations (e.g., deleting manifest variables), but some common approaches, such as multiple imputation, are not viable. At least 3 solutions to this problem are viable: analyzing differing sets of variables across groups, using pattern mixture approaches, and a new method using random number generation. The latter solution, proposed in this article, is to generate pseudo-random normal deviates for all observations for manifest variables that are missing completely in a given sample and then to specify multiple-group models in a way that respects the random nature of these values. An empirical example is presented in detail comparing the 3 approaches. The proposed solution can enable quantitative comparisons at the latent variable level between groups using programs that require the same number of manifest variables in each group. PMID:24019738
Mitt, Mario; Kals, Mart; Pärn, Kalle; Gabriel, Stacey B; Lander, Eric S; Palotie, Aarno; Ripatti, Samuli; Morris, Andrew P; Metspalu, Andres; Esko, Tõnu; Mägi, Reedik; Palta, Priit
2017-06-01
Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF)≥5% and low-frequency variants (0.5≤MAF<5%) across diverse populations, but the imputation of rare variation (MAF<0.5%) is still rather limited. In the current study, we compare the imputation accuracy achieved with reference panels from diverse populations to that achieved with a population-specific high-coverage (30×) whole-genome sequencing (WGS) based reference panel comprising 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants was significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies. PMID:28401899
[Recognition of occupational cancers: review of existing methods and perspectives].
Vandentorren, Stéphanie; Salmi, L Rachid; Brochard, Patrick
2005-09-01
Occupational risk factors account for a significant share of cancer causes and are involved in all types of cancer. Nonetheless, the frequency of these cancers is largely under-estimated. Parallel to the (collective) epidemiological approach, the concept of occupational cancer is often linked, at the individual level, to the compensation of occupational diseases. To give rise to financial compensation, the occupational origin of the exposure has to be established for a given cancer. Whatever the method used to explore an occupational cause, the approach is one of imputation. The aim of this work is to synthesize and describe the main principles of recognition of occupational cancers, to discuss the limits of available methods, and to consider the research needed to improve these methods. In France, recognition of a cancer's occupational origin rests on tables of occupational diseases that are based on a presumption of causality. These tables specify the medical, technical and administrative conditions that are necessary and sufficient for the recognition of an occupational disease and its financial compensation. Whenever the presumption of causality does not apply, imputation is based on case analyses conducted by experts within regional committees for the recognition of occupational diseases; these analyses lack reproducibility, do not allow statistical quantification, and do not always take into account the weight of associated factors. Nonetheless, the reliability and validity of the expert assessment could be reinforced by the use of formal consensus techniques. This process could ideally lead to decision-making algorithms that guide the user towards the decision of whether or not to impute the cancer to an occupational exposure, and would be suited to the development of new tables. The imputation process would be better represented by statistical methods based on Bayes' theorem. The application of these methods to occupational cancers is promising but remains limited by the lack of epidemiological data. Acquiring these data and disseminating these methods should become research and development priorities in the cancer field.
Welch, Catherine A; Petersen, Irene; Bartlett, Jonathan W; White, Ian R; Marston, Louise; Morris, Richard W; Nazareth, Irwin; Walters, Kate; Carpenter, James
2014-01-01
Most implementations of multiple imputation (MI) of missing data are designed for simple rectangular data structures ignoring temporal ordering of data. Therefore, when applying MI to longitudinal data with intermittent patterns of missing data, some alternative strategies must be considered. One approach is to divide data into time blocks and implement MI independently at each block. An alternative approach is to include all time blocks in the same MI model. With increasing numbers of time blocks, this approach is likely to break down because of co-linearity and over-fitting. The new two-fold fully conditional specification (FCS) MI algorithm addresses these issues, by only conditioning on measurements, which are local in time. We describe and report the results of a novel simulation study to critically evaluate the two-fold FCS algorithm and its suitability for imputation of longitudinal electronic health records. After generating a full data set, approximately 70% of selected continuous and categorical variables were made missing completely at random in each of ten time blocks. Subsequently, we applied a simple time-to-event model. We compared efficiency of estimated coefficients from a complete records analysis, MI of data in the baseline time block and the two-fold FCS algorithm. The results show that the two-fold FCS algorithm maximises the use of data available, with the gain relative to baseline MI depending on the strength of correlations within and between variables. Using this approach also increases plausibility of the missing at random assumption by using repeated measures over time of variables whose baseline values may be missing. PMID:24782349
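The local-in-time conditioning that defines the two-fold FCS idea can be approximated with standard software. The R sketch below uses the mice predictorMatrix so that block t is imputed from blocks t-1, t and t+1 only; variable names, the one-variable-per-block layout and the data are invented, and the published algorithm's within- and between-block iteration order is not reproduced.

```r
# Sketch (R): impute each time block conditioning only on adjacent blocks,
# via the mice predictorMatrix (rows = targets, columns = predictors).
library(mice)
set.seed(6)
blocks <- 5
M <- matrix(rnorm(200 * blocks), 200, blocks)
M[sample(length(M), 150)] <- NA                        # sprinkle missing values
dat <- as.data.frame(M)
names(dat) <- paste0("x_t", seq_len(blocks))

pred <- matrix(0, blocks, blocks,
               dimnames = list(names(dat), names(dat)))
for (t in seq_len(blocks)) {
  idx <- intersect(seq_len(blocks), (t - 1):(t + 1))   # local-in-time window
  pred[t, idx] <- 1
}
diag(pred) <- 0                                        # a variable never predicts itself

imp <- mice(dat, m = 5, predictorMatrix = pred, printFlag = FALSE)
head(complete(imp, 1))                                 # first completed dataset
```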
Moore, Lynne; Hanley, James A; Lavoie, André; Turgeon, Alexis
2009-05-01
The National Trauma Data Bank (NTDB) is plagued by the problem of missing physiological data. The Glasgow Coma Scale score, Respiratory Rate and Systolic Blood Pressure are an essential part of risk adjustment strategies for trauma system evaluation and clinical research. Missing data on these variables may compromise the feasibility and the validity of trauma group comparisons. The objective was to evaluate the validity of Multiple Imputation (MI) for completing missing physiological data in the NTDB, by assessing the impact of MI on 1) frequency distributions, 2) associations with mortality, and 3) risk adjustment. Analyses were based on 170,956 NTDB observations with complete physiological data (observed data set). Missing physiological data were artificially imposed on this data set and then imputed using MI (MI data set). To assess the impact of MI on risk adjustment, 100 pairs of hospitals were randomly selected with replacement and compared using adjusted Odds Ratios (OR) of mortality. ORs generated by the observed data set were then compared to those generated by the MI data set. Frequency distributions and associations with mortality were preserved following MI. The median absolute difference between adjusted ORs of mortality generated by the observed data set and by the MI data set was 3.6% (inter-quartile range: 2.4%-6.1%). This study suggests that, provided it is implemented with care, MI of missing physiological data in the NTDB leads to valid frequency distributions, preserves associations with mortality, and does not compromise risk adjustment in inter-hospital comparisons of mortality.
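The validation design, impose missingness on complete records, impute, and compare adjusted odds ratios, has a simple skeleton. The R sketch below mirrors that logic on simulated data; the variables, missingness rate and model are assumptions, not the NTDB analysis itself.

```r
# Sketch (R): mask complete records, impute with mice, compare adjusted ORs
# of mortality before and after imputation (simulated data).
library(mice)
set.seed(7)
n <- 2000
gcs <- sample(3:15, n, replace = TRUE); sbp <- rnorm(n, 120, 25)
death <- rbinom(n, 1, plogis(-2 - 0.2 * (gcs - 15) - 0.01 * (sbp - 120)))
full <- data.frame(gcs, sbp, death)

or_full <- exp(unname(coef(glm(death ~ gcs + sbp,
                               family = binomial, data = full))["gcs"]))

dat <- full
dat$gcs[sample(n, 0.25 * n)] <- NA                   # impose 25% MCAR on GCS
imp <- mice(dat, m = 5, printFlag = FALSE)
pooled <- summary(pool(with(imp, glm(death ~ gcs + sbp, family = binomial))))
or_mi <- exp(pooled$estimate[pooled$term == "gcs"])

c(observed = or_full, imputed = or_mi)               # the two ORs should be close
```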
Geeleher, Paul; Zhang, Zhenyu; Wang, Fan; Gruener, Robert F; Nath, Aritro; Morrison, Gladys; Bhutra, Steven; Grossman, Robert L; Huang, R Stephanie
2017-10-01
Obtaining accurate drug response data in large cohorts of cancer patients is very challenging; thus, most cancer pharmacogenomics discovery is conducted in preclinical studies, typically using cell lines and mouse models. However, these platforms suffer from serious limitations, including small sample sizes. Here, we have developed a novel computational method that allows us to impute drug response in very large clinical cancer genomics data sets, such as The Cancer Genome Atlas (TCGA). The approach works by creating statistical models relating gene expression to drug response in large panels of cancer cell lines and applying these models to tumor gene expression data in the clinical data sets (e.g., TCGA). This yields an imputed drug response for every drug in each patient. These imputed drug response data are then associated with somatic genetic variants measured in the clinical cohort, such as copy number changes or mutations in protein coding genes. These analyses recapitulated drug associations for known clinically actionable somatic genetic alterations and identified new predictive biomarkers for existing drugs. © 2017 Geeleher et al.; Published by Cold Spring Harbor Laboratory Press.
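The core imputation step, train an expression-to-response model in cell lines and apply it to tumors, reduces to a penalized regression. A hedged R sketch with glmnet ridge regression follows; the data are simulated, and the published pipeline's normalization and cohort details are omitted.

```r
# Minimal sketch (R): fit drug response ~ gene expression in cell lines, then
# impute a response for each tumor from its expression profile.
library(glmnet)
set.seed(8)
g <- 500                                             # genes
expr_cl  <- matrix(rnorm(300 * g), 300, g)           # cell-line expression
beta     <- c(rnorm(20), rep(0, g - 20))             # few truly predictive genes
resp_cl  <- drop(expr_cl %*% beta) + rnorm(300)      # measured response (IC50-like)
expr_tum <- matrix(rnorm(150 * g), 150, g)           # tumor expression, no response

fit <- cv.glmnet(expr_cl, resp_cl, alpha = 0)        # ridge regression
imputed_resp <- predict(fit, newx = expr_tum, s = "lambda.min")
head(drop(imputed_resp))                             # imputed response per tumor
```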
Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.
Ritz, Cecilia; Edén, Patrik
2008-01-19
For 2-dye microarray platforms, some missing values may arise from an un-measurably low RNA expression in one channel only. Information about such "one-channel depletion" has so far not been included in algorithms for imputing missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation introduces a systematic bias in the imputed expression values of one-channel-depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates was reduced by up to 51%, depending on the dataset. By including more information in the imputation step, we more accurately estimate missing expression values.
Genotype imputation in a tropical crossbred dairy cattle population.
Oliveira Júnior, Gerson A; Chud, Tatiane C S; Ventura, Ricardo V; Garrick, Dorian J; Cole, John B; Munari, Danísio P; Ferraz, José B S; Mullart, Erik; DeNise, Sue; Smith, Shannon; da Silva, Marcos Vinícius G B
2017-12-01
The objective of this study was to investigate different strategies for genotype imputation in a population of crossbred Girolando (Gyr × Holstein) dairy cattle. The data set consisted of 478 Girolando, 583 Gyr, and 1,198 Holstein sires genotyped at high density with the Illumina BovineHD (Illumina, San Diego, CA) panel, which includes ∼777K markers. The accuracy of imputation from low (20K) and medium densities (50K and 70K) to the HD panel density and from low to 50K density was investigated. Seven scenarios using different reference populations (RPop), considering the Girolando, Gyr, and Holstein breeds separately or in combination, were tested for imputing genotypes of 166 randomly chosen Girolando animals. Genotype imputation was performed using FImpute. Imputation accuracy was measured as the correlation between observed and imputed genotypes (CORR) and also as the proportion of genotypes that were imputed correctly (CR). This is the first paper on imputation accuracy in a Girolando population. The sample-specific imputation accuracies ranged from 0.38 to 0.97 (CORR) and from 0.49 to 0.96 (CR) when imputing from low and medium densities to HD, and from 0.41 to 0.95 (CORR) and from 0.50 to 0.94 (CR) for imputation from 20K to 50K. CORRanim exceeded 0.96 (for the 50K and 70K panels) when only Girolando animals were included in RPop (S1). CORRanim was smaller when Gyr (S2) was used instead of Holstein (S3) as RPop. The same behavior was observed between S4 (Gyr + Girolando) and S5 (Holstein + Girolando), because the target animals were more related to the Holstein population than to the Gyr population. The highest imputation accuracies were observed for scenarios including Girolando animals in the reference population, whereas using only Gyr animals resulted in low imputation accuracies, suggesting that the haplotypes segregating in the Girolando population had a greater effect on accuracy than the purebred haplotypes. All chromosomes had similar imputation accuracies (CORRsnp) within each scenario. Crossbred (Girolando) animals must be included in the reference population to provide the best imputation accuracies. Copyright © 2017 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses
Park, Danny S.; Brown, Brielin; Eng, Celeste; Huntsman, Scott; Hu, Donglei; Torgerson, Dara G.; Burchard, Esteban G.; Zaitlen, Noah
2015-01-01
Motivation: Approaches to identifying new risk loci, training risk prediction models, imputing untyped variants and fine-mapping causal variants from summary statistics of genome-wide association studies are playing an increasingly important role in the human genetics community. Current summary statistics-based methods rely on global ‘best guess’ reference panels to model the genetic correlation structure of the dataset being studied. This approach, especially in admixed populations, has the potential to produce misleading results, ignores variation in local structure and is not feasible when appropriate reference panels are missing or small. Here, we develop a method, Adapt-Mix, that combines information across all available reference panels to produce estimates of local genetic correlation structure for summary statistics-based methods in arbitrary populations. Results: We applied Adapt-Mix to estimate the genetic correlation structure of both admixed and non-admixed individuals using simulated and real data. We evaluated our method by measuring the performance of two summary statistics-based methods: imputation and joint-testing. When using our method as opposed to the current standard of ‘best guess’ reference panels, we observed a 28% decrease in mean-squared error for imputation and a 73.7% decrease in mean-squared error for joint-testing. Availability and implementation: Our method is publicly available in a software package called ADAPT-Mix available at https://github.com/dpark27/adapt_mix. Contact: noah.zaitlen@ucsf.edu PMID:26072481
Estimating the Imputed Social Cost of Errors of Measurement.
1983-10-01
…social cost of an error of measurement in the score on a unidimensional test, an asymptotic method, based on item response theory, is developed for… (Frederic M. Lord; report RR-83-33-ONR; this research was sponsored in part by the Personnel and Training Research Programs…)
Louzoun, Yoram; Alter, Idan; Gragert, Loren; Albrecht, Mark; Maiers, Martin
2018-05-01
Regardless of sampling depth, accurate genotype imputation is limited in regions of high polymorphism which often have a heavy-tailed haplotype frequency distribution. Many rare haplotypes are thus unobserved. Statistical methods to improve imputation by extending reference haplotype distributions using linkage disequilibrium patterns that relate allele and haplotype frequencies have not yet been explored. In the field of unrelated stem cell transplantation, imputation of highly polymorphic human leukocyte antigen (HLA) genes has an important application in identifying the best-matched stem cell donor when searching large registries totaling over 28,000,000 donors worldwide. Despite these large registry sizes, a significant proportion of searched patients present novel HLA haplotypes. Supporting this observation, HLA population genetic models have indicated that many extant HLA haplotypes remain unobserved. The absent haplotypes are a significant cause of error in haplotype matching. We have applied a Bayesian inference methodology for extending haplotype frequency distributions, using a model where new haplotypes are created by recombination of observed alleles. Applications of this joint probability model offer significant improvement in frequency distribution estimates over the best existing alternative methods, as we illustrate using five-locus HLA frequency data from the National Marrow Donor Program registry. Transplant matching algorithms and disease association studies involving phasing and imputation of rare variants may benefit from this statistical inference framework.
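A caricature of the central idea, giving unseen haplotypes mass by recombining observed alleles, can be written in a few lines. In the R toy below, unobserved haplotypes receive prior mass proportional to the product of their allele frequencies and estimates are shrunk toward the observed counts; the prior weight alpha is an arbitrary assumption, and this is not the paper's actual Bayesian estimator.

```r
# Toy sketch (R): extend a two-locus haplotype frequency distribution so that a
# never-observed recombinant haplotype receives nonzero mass.
freq1 <- c(A1 = 0.6, A2 = 0.4)                       # allele frequencies, locus 1
freq2 <- c(B1 = 0.7, B2 = 0.3)                       # allele frequencies, locus 2
observed <- c("A1-B1" = 500, "A1-B2" = 80, "A2-B1" = 120)  # "A2-B2" never seen

grid <- expand.grid(a = names(freq1), b = names(freq2),
                    stringsAsFactors = FALSE)
hap   <- paste(grid$a, grid$b, sep = "-")
prior <- freq1[grid$a] * freq2[grid$b]               # linkage-equilibrium expectation

counts <- setNames(observed[hap], hap)
counts[is.na(counts)] <- 0
alpha <- 50                                          # prior weight (assumed)
post  <- (counts + alpha * prior) / (sum(counts) + alpha)
round(post, 4)                                       # "A2-B2" now carries nonzero mass
```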
Ratcliffe, B; El-Dien, O G; Klápště, J; Porth, I; Chen, C; Jaquish, B; El-Kassaby, Y A
2015-01-01
Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated. PMID:26126540
ERIC Educational Resources Information Center
Pfaffel, Andreas; Spiel, Christiane
2016-01-01
Approaches to correcting correlation coefficients for range restriction have been developed under the framework of large sample theory. The accuracy of missing data techniques for correcting correlation coefficients for range restriction has thus far only been investigated with relatively large samples. However, researchers and evaluators are…
The Agricultural Health Study (AHS), a large prospective cohort, was designed to elucidate associations between pesticide use and other agricultural exposures and health outcomes. The cohort includes 57,310 pesticide applicators who were enrolled between 1993 and 1997 in Iowa and...
USDA-ARS?s Scientific Manuscript database
Microsatellite markers (MS) have traditionally been used for parental verification and are still the international standard in spite of their higher cost, error rate, and turnaround time compared with Single Nucleotide Polymorphisms (SNP) -based assays. Despite domestic and international demands fr...
ERIC Educational Resources Information Center
Asendorpf, Jens B.; van de Schoot, Rens; Denissen, Jaap J. A.; Hutteman, Roos
2014-01-01
Most longitudinal studies are plagued by drop-out related to variables at earlier assessments (systematic attrition). Although systematic attrition is often analysed in longitudinal studies, surprisingly few researchers attempt to reduce biases due to systematic attrition, even though this is possible and nowadays technically easy. This is…
The Technical Report of NAEP's 1990 Trial State Assessment Program.
ERIC Educational Resources Information Center
Koffler, Stephen L.; And Others
This report documents the design and data analysis procedures of the Trial State Assessment Program of the National Assessment of Educational Progress (NAEP). NAEP is currently the only survey that uses the plausible-values methodology, a multiple imputation procedure, in a psychometric context. The 1990 Trial State Assessment collected…
Family structure as a predictor of screen time among youth.
McMillan, Rachel; McIsaac, Michael; Janssen, Ian
2015-01-01
The family plays a central role in the development of health-related behaviors among youth. The objective of this study was to determine whether non-traditional parental structure and shared custody arrangements predict how much time youth spend watching television, using a computer recreationally, and playing video games. Participants were a nationally representative sample of Canadian youth (N = 26,068) in grades 6-10 who participated in the 2009/10 Health Behaviour in School-aged Children Survey. Screen time in youth from single parent and reconstituted families, with or without regular visitation with their non-residential parent, was compared to that of youth from traditional dual-parent families. Multiple imputation was used to account for missing data. After multiple imputation, the relative odds of being in the highest television, computer use, video game, and total screen time quartiles were not different in boys and girls from non-traditional families by comparison to boys and girls from traditional dual-parent families. In conclusion, parental structure and child custody arrangements did not have a meaningful impact on screen time among youth.
Linear Regression with a Randomly Censored Covariate: Application to an Alzheimer's Study.
Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A
2017-01-01
The association between maternal age of onset of dementia and amyloid deposition (measured by in vivo positron emission tomography (PET) imaging) in cognitively normal older offspring is of interest. In a regression model for amyloid, special methods are required due to the random right censoring of the covariate of maternal age of onset of dementia. Prior literature has proposed methods to address the problem of censoring due to assay limit of detection, but not random censoring. We propose imputation methods and a survival regression method that do not require parametric assumptions about the distribution of the censored covariate. Existing imputation methods address missing covariates, but not right censored covariates. In simulation studies, we compare these methods to the simple, but inefficient complete case analysis, and to thresholding approaches. We apply the methods to the Alzheimer's study.
Genotyping by sequencing for genomic prediction in a soybean breeding population.
Jarquín, Diego; Kocak, Kyle; Posadas, Luis; Hyma, Katie; Jedlicka, Joseph; Graef, George; Lorenz, Aaron
2014-08-29
Advances in genotyping technology, such as genotyping by sequencing (GBS), are making genomic prediction more attractive to reduce breeding cycle times and costs associated with phenotyping. Genomic prediction and selection have been studied in several crop species, but no reports exist for soybean. The objectives of this study were (i) to evaluate prospects for genomic selection using GBS in a typical soybean breeding program and (ii) to evaluate the effect of GBS marker selection and imputation on genomic prediction accuracy. To achieve these objectives, a set of soybean lines sampled from the University of Nebraska Soybean Breeding Program were genotyped using GBS and evaluated for yield and other agronomic traits at multiple Nebraska locations. Genotyping by sequencing scored 16,502 single nucleotide polymorphisms (SNPs) with minor-allele frequency (MAF) > 0.05 and percentage of missing values ≤ 5% on 301 elite soybean breeding lines. When SNPs with up to 80% missing values were included, 52,349 SNPs were scored. Prediction accuracy for grain yield, assessed using cross validation, was estimated to be 0.64, indicating good potential for using genomic selection for grain yield in soybean. Filtering SNPs based on missing data percentage had little to no effect on prediction accuracy, especially when random forest imputation was used to impute missing values. The highest accuracies were observed when random forest imputation was used on all SNPs, but differences were not significant. A standard additive G-BLUP model was robust; modeling additive-by-additive epistasis did not provide any improvement in prediction accuracy. The effect of training population size on accuracy began to plateau around 100, but accuracy steadily climbed until the largest possible size was used in this analysis. Including only SNPs with MAF > 0.30 provided higher accuracies when training populations were smaller. Using GBS for genomic prediction in soybean holds good potential to expedite genetic gain. Our results suggest that standard additive G-BLUP models can be used on unfiltered, imputed GBS data without loss in accuracy.
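Random forest imputation of a sparse genotype matrix, the step credited above with stabilizing prediction accuracy, can be sketched with the missForest package. Everything below is simulated and far smaller than real GBS data; treat it as the shape of the workflow, not the study's pipeline.

```r
# Minimal sketch (R): random-forest imputation of masked genotype calls and the
# concordance of the recovered calls, via missForest.
library(missForest)
set.seed(10)
n <- 100; m <- 60
G <- matrix(rbinom(n * m, 2, 0.3), n, m)             # 0/1/2 genotype calls
G_miss <- G
G_miss[sample(length(G), 0.2 * length(G))] <- NA     # 20% missing calls

imp <- missForest(as.data.frame(G_miss))             # iterative RF imputation
G_imp <- round(as.matrix(imp$ximp))                  # back to genotype calls
mean(G_imp[is.na(G_miss)] == G[is.na(G_miss)])       # concordance at masked entries
```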
Johnson, Eric O; Hancock, Dana B; Levy, Joshua L; Gaddis, Nathan C; Saccone, Nancy L; Bierut, Laura J; Page, Grier P
2013-05-01
A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2% of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30% of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.
Pelé, Fabienne; Bajeux, Emma; Gendron, Hélène; Monfort, Christine; Rouget, Florence; Multigner, Luc; Viel, Jean-François; Cordier, Sylvaine
2013-12-02
Environmental exposures, including dietary contaminants, may influence the developing immune system. This study assesses the association between maternal pre-parturition consumption of seafood and wheeze, eczema, and food allergy in preschool children. Fish and shellfish were studied separately as they differ according to their levels of omega-3 polyunsaturated fatty acids (which have anti-allergic properties) and their levels of contaminants. The PELAGIE cohort included 3421 women recruited at the beginning of pregnancy. Maternal fish and shellfish intake was measured at inclusion by a food frequency questionnaire. Wheeze, eczema, and food allergy were evaluated by a questionnaire completed by the mother when the child was 2 years old (n = 1500). Examination of the associations between seafood intake and outcomes took major confounders into account. Complementary sensitivity analyses with multiple imputation enabled us to handle missing data, due mostly to attrition. Moderate maternal pre-parturition fish intake (1 to 4 times a month) was, at borderline significance, associated with a lower risk of wheeze (adjusted OR = 0.69 (0.45-1.05)) before age 2, compared with low intake (< once/month). This result was not, however, consistent: after multiple imputation, the adjusted OR was 0.86 (0.63-1.17). Shellfish intake at least once a month was associated with a higher risk of food allergy before age 2 (adjusted OR = 1.62 (1.11-2.37)) compared to low or no intake (< once/month). Multiple imputation confirmed this association (adjusted OR = 1.52 (1.05-2.21)). This study suggests that maternal pre-parturition shellfish consumption may increase the risk of food allergy. Further large-scale epidemiologic studies are needed to corroborate these results, identify the contaminants or components of shellfish responsible for the effects observed, determine the persistence of the associations seen at age 2, and investigate potential associations with health effects observable at later ages, such as allergic asthma.
Gottlieb, Alice B; Blauvelt, Andrew; Prinz, Jörg C; Papanastasiou, Philemon; Pathan, Rashidkhan; Nyirady, Judit; Fox, Todd; Papavassilis, Charis
2016-10-01
Secukinumab, a human monoclonal antibody that selectively targets interleukin-17A, is highly efficacious in the treatment of moderate-to-severe psoriasis, starting at early time points, with a sustained effect and a favorable safety profile. Patients with moderate-to-severe plaque psoriasis were randomized to secukinumab 300 mg, secukinumab 150 mg, or placebo self-administered by prefilled syringe at baseline, weeks 1, 2, and 3, and then every four weeks from week 4 to 48. Efficacy responses (≥ 75/90/100% improvement in Psoriasis Area and Severity Index [PASI 75/90/100] and clear/almost clear skin by Investigator's Global Assessment 2011 modified version [IGA mod 2011 0/1]) were measured to week 52. Patient-reported usability of the prefilled syringe was evaluated by the Self-Injection Assessment Questionnaire to week 48. The efficacy of secukinumab increased to week 16 and was maintained to week 52. With secukinumab 300 mg at week 52, PASI 75/90/100 and IGA mod 2011 0/1 responses were achieved by 83.5%/68.0%/47.5% and 71.5% of patients when analyzed by multiple imputation, respectively, and by 75.9%/62.1%/43.1% and 63.8% of patients when analyzed by nonresponder imputation, respectively. With secukinumab 150 mg at week 52, PASI 75/90/100 and IGA mod 2011 0/1 responses were achieved by 63.5%/50.3%/31.1% and 43.6% of patients when analyzed by multiple imputation, respectively, and by 61.0%/49.2%/30.5% and 42.4% of patients when analyzed by nonresponder imputation, respectively. Self-reported acceptability of the prefilled syringe was high throughout the study. The incidence of adverse events (AE) was well balanced between groups, with AEs reported in 74.4% of patients receiving secukinumab 300 mg and 77.3% of patients receiving secukinumab 150 mg. Nasopharyngitis was the most common AE across both secukinumab groups. Self-administration of secukinumab by prefilled syringe was associated with robust and sustained efficacy and a favorable safety profile up to week 52.
J Drugs Dermatol. 2016;15(10):1226-1234.
Hedden, Sarra L; Woolson, Robert F; Carter, Rickey E; Palesch, Yuko; Upadhyaya, Himanshu P; Malcolm, Robert J
2009-07-01
"Loss to follow-up" can be substantial in substance abuse clinical trials. When extensive losses to follow-up occur, one must cautiously analyze and interpret the findings of a research study. Aims of this project were to introduce the types of missing data mechanisms and describe several methods for analyzing data with loss to follow-up. Furthermore, a simulation study compared Type I error and power of several methods when missing data amount and mechanism varies. Methods compared were the following: Last observation carried forward (LOCF), multiple imputation (MI), modified stratified summary statistics (SSS), and mixed effects models. Results demonstrated nominal Type I error for all methods; power was high for all methods except LOCF. Mixed effect model, modified SSS, and MI are generally recommended for use; however, many methods require that the data are missing at random or missing completely at random (i.e., "ignorable"). If the missing data are presumed to be nonignorable, a sensitivity analysis is recommended.
Comparison of methods for dealing with missing values in the EPV-R.
Paniagua, David; Amor, Pedro J; Echeburúa, Enrique; Abad, Francisco J
2017-08-01
The development of an effective instrument to assess the risk of partner violence is a topic of great social relevance. This study evaluates the "Predicción del Riesgo de Violencia Grave Contra la Pareja" –Revisada– scale (EPV-R, Severe Intimate Partner Violence Risk Prediction Scale-Revised), a tool developed in Spain that faces the problem of how to handle the high rate of missing values typical of this type of scale. First, responses to the EPV-R in a sample of 1215 male abusers who were reported to the police were used to analyze the patterns of occurrence of missing values, as well as the factor structure. Second, we analyzed the performance of various imputation methods using simulated data that emulate the missing data mechanism found in the empirical database. The imputation procedure originally proposed by the authors of the scale provides acceptable results, although the application of a method based on Item Response Theory could provide greater accuracy and offers some additional advantages. Item Response Theory appears to be a useful tool for imputing missing data in this type of questionnaire.
Improving cluster-based missing value estimation of DNA microarray data.
Brás, Lígia P; Menezes, José C
2007-06-01
We present a modification of the weighted K-nearest neighbours imputation method (KNNimpute) for estimating missing values (MVs) in microarray data, based on the reuse of estimated data. The method was called iterative KNN imputation (IKNNimpute) as the estimation is performed iteratively using the recently estimated values. The estimation efficiency of IKNNimpute was assessed under different conditions (data type, fraction and structure of missing data) by the normalized root mean squared error (NRMSE) and the correlation coefficients between estimated and true values, and compared with that of other cluster-based estimation methods (KNNimpute and sequential KNN). We further investigated the influence of imputation on the detection of differentially expressed genes using SAM by examining the differentially expressed genes that are lost after MV estimation. The performance measures give consistent results, indicating that the iterative procedure of IKNNimpute can enhance the prediction ability of cluster-based methods in the presence of high missing rates, in non-time series experiments and in data sets comprising both time series and non-time series data, because the information of the genes having MVs is used more efficiently and the iterative procedure allows refining the MV estimates. More importantly, IKNNimpute has a smaller detrimental effect on the detection of differentially expressed genes.
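The iterative refinement that distinguishes IKNNimpute from plain KNNimpute can be sketched directly: initialize the missing entries, then repeatedly re-estimate each one from the k nearest genes of the current completed matrix. The R code below is a simplified, unweighted version on simulated data, not the authors' implementation.

```r
# Sketch (R): iterative KNN imputation -- distances and neighbour averages are
# recomputed from the current estimates until they stabilize.
set.seed(12)
X <- matrix(rnorm(100 * 15), 100, 15)                # genes x arrays
miss <- matrix(runif(length(X)) < 0.05, 100, 15)     # 5% missing values
Xm <- X; Xm[miss] <- NA

Y <- Xm
Y[miss] <- rowMeans(Xm, na.rm = TRUE)[row(Xm)[miss]] # row-mean initialization
k <- 10
for (it in 1:10) {
  D <- as.matrix(dist(Y))                            # distances use current estimates
  Y_new <- Y
  for (g in which(rowSums(miss) > 0)) {
    nb <- order(D[g, ])[2:(k + 1)]                   # k nearest genes (skip self)
    Y_new[g, miss[g, ]] <- colMeans(Y[nb, miss[g, ], drop = FALSE])
  }
  if (max(abs(Y_new - Y)) < 1e-4) break              # estimates have stabilized
  Y <- Y_new
}
sqrt(mean((Y[miss] - X[miss])^2))                    # RMSE on the masked entries
```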
NASA Astrophysics Data System (ADS)
Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.
2017-06-01
The recent popularity of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods for generating and analyzing such data, but these almost always rely on asymptotic distributions and are consequently not adequate for small-sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based on exact distributions, this procedure may be used even on small-sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with those obtained from an adaptation of Reiter's combination rule to multiply imputed synthetic datasets, and an application to the 2000 Current Population Survey is discussed.
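The Posterior Predictive Sampling generation step that the exact-inference procedure is built around can be illustrated for a normal linear regression. In the R sketch below, (beta, sigma^2) are drawn from their posterior under a noninformative prior and the responses are redrawn; the data are simulated, and the exact inferential procedure itself is not reproduced.

```r
# Minimal sketch (R): generate one synthetic dataset for a linear regression by
# Posterior Predictive Sampling under a noninformative prior.
set.seed(13)
n <- 50
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)

fit <- lm(y ~ x)
s2  <- summary(fit)$sigma^2
df  <- fit$df.residual
sigma2_draw <- df * s2 / rchisq(1, df)               # sigma^2 | data (scaled inv-chi-sq)
V <- solve(crossprod(X))                             # (X'X)^{-1}
beta_draw <- MASS::mvrnorm(1, coef(fit), sigma2_draw * V)  # beta | sigma^2, data

y_syn <- drop(X %*% beta_draw) + rnorm(n, sd = sqrt(sigma2_draw))  # synthetic responses
coef(lm(y_syn ~ x))                                  # analyst's fit on synthetic data
```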
Missing data exploration: highlighting graphical presentation of missing pattern.
Zhang, Zhongheng
2015-12-01
Functions shipped with R base can fulfill many tasks of missing data handling. However, because the data volume of electronic medical record (EMR) systems is typically very large, more sophisticated methods may be helpful in data management. This article focuses on handling missing data with more advanced techniques. There are three types of missing data: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). This classification depends on how the missing values are generated. Two packages, Multivariate Imputation by Chained Equations (MICE) and Visualization and Imputation of Missing Values (VIM), provide sophisticated functions to explore the missing data pattern. In particular, the VIM package is especially helpful for visual inspection of missing data. Finally, correlation analysis provides information on the dependence of missing data on other variables. Such information is useful in subsequent imputations. PMID:26807411
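The workflow described here, inspect the missing pattern with mice and VIM, then impute by chained equations, runs in a few lines on the airquality dataset shipped with R (which has missing Ozone and Solar.R values):

```r
# Explore the missing pattern, visualize it, then impute by chained equations.
library(mice); library(VIM)

md.pattern(airquality)                               # tabulate missing-data patterns
aggr(airquality, numbers = TRUE)                     # VIM: share of missing per variable
marginplot(airquality[, c("Solar.R", "Ozone")])      # joint missingness of two variables

imp <- mice(airquality, m = 5, method = "pmm", seed = 1, printFlag = FALSE)
summary(complete(imp, 1)$Ozone)                      # first completed dataset
```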
Data estimation and prediction for natural resources public data
Hans T. Schreuder; Robin M. Reich
1998-01-01
A key product of both Forest Inventory and Analysis (FIA) of the USDA Forest Service and the Natural Resources Inventory (NRI) of the Natural Resources Conservation Service is a scientific data base that should be defensible in court. Multiple imputation procedures (MIPs) have been proposed both for missing value estimation and prediction of non-remeasured cells in...
Evaluating Multiple Imputation Models for the Southern Annual Forest Inventory
Gregory A. Reams; Joseph M. McCollum
1999-01-01
The USDA Forest Service's Southern Research Station is implementing an annualized forest survey in thirteen states. The sample design is a systematic sample of five interpenetrating grids (panels), where each panel is measured sequentially. For example, panel one information is collected in year one, and panel five in year five. The area representative and time...
ERIC Educational Resources Information Center
Li, Tiandong
2012-01-01
In large-scale assessments, such as the National Assessment of Educational Progress (NAEP), plausible values based on Multiple Imputations (MI) have been used to estimate population characteristics for latent constructs under complex sample designs. Mislevy (1991) derived a closed-form analytic solution for a fixed-effect model in creating…
Statistical primer: how to deal with missing data in scientific research?
Papageorgiou, Grigorios; Grant, Stuart W; Takkenberg, Johanna J M; Mokhles, Mostafa M
2018-05-10
Missing data are a common challenge encountered in research which can compromise the results of statistical inference when not handled appropriately. This paper aims to introduce basic concepts of missing data to a non-statistical audience, to list and compare some of the most popular approaches for handling missing data in practice, and to provide guidelines and recommendations for dealing with and reporting missing data in scientific research. Complete case analysis and single imputation are simple approaches for handling missing data and are popular in practice; however, in most cases they are not guaranteed to provide valid inferences. Multiple imputation is a robust and general alternative which is appropriate for data missing at random, overcoming the disadvantages of the simpler approaches, but it should always be conducted with care. The aforementioned approaches are illustrated and compared in an example application using Cox regression.
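The paper's illustration, multiple imputation followed by a Cox regression with Rubin's rules pooling, has a compact template in R. The sketch below uses simulated survival data in place of the application:

```r
# Sketch (R): multiple imputation of a missing covariate, Cox regression on each
# completed dataset, Rubin's rules pooling via mice.
library(mice); library(survival)
set.seed(15)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
time   <- rexp(n, rate = exp(0.5 * x1 - 0.3 * x2) / 10)
status <- rbinom(n, 1, 0.8)
dat <- data.frame(time, status, x1, x2)
dat$x2[sample(n, 60)] <- NA                          # 20% missing covariate

imp  <- mice(dat, m = 10, printFlag = FALSE)
fits <- with(imp, coxph(Surv(time, status) ~ x1 + x2))
summary(pool(fits))                                  # pooled log hazard ratios
```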
Genotype imputation in a coalescent model with infinitely-many-sites mutation
Huang, Lucy; Buzbas, Erkan O.; Rosenberg, Noah A.
2012-01-01
Empirical studies have identified population-genetic factors as important determinants of the properties of genotype-imputation accuracy in imputation-based disease association studies. Here, we develop a simple coalescent model of three sequences that we use to explore the theoretical basis for the influence of these factors on genotype-imputation accuracy, under the assumption of infinitely-many-sites mutation. Employing a demographic model in which two populations diverged at a given time in the past, we derive the approximate expectation and variance of imputation accuracy in a study sequence sampled from one of the two populations, choosing between two reference sequences, one sampled from the same population as the study sequence and the other sampled from the other population. We show that under this model, imputation accuracy—as measured by the proportion of polymorphic sites that are imputed correctly in the study sequence—increases in expectation with the mutation rate, the proportion of the markers in a chromosomal region that are genotyped, and the time to divergence between the study and reference populations. Each of these effects derives largely from an increase in information available for determining the reference sequence that is genetically most similar to the sequence targeted for imputation. We analyze as a function of divergence time the expected gain in imputation accuracy in the target using a reference sequence from the same population as the target rather than from the other population. Together with a growing body of empirical investigations of genotype imputation in diverse human populations, our modeling framework lays a foundation for extending imputation techniques to novel populations that have not yet been extensively examined. PMID:23079542
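The mechanism the model formalizes, copying untyped alleles from whichever reference sequence is most similar to the study sequence at the typed markers, can be seen in a toy simulation. In the R sketch below, the identity levels (90% for the same-population reference, 70% for the diverged one) are arbitrary assumptions standing in for the coalescent quantities derived in the paper.

```r
# Toy sketch (R): choose the closer of two reference sequences at genotyped sites,
# copy its alleles at untyped sites, and measure imputation accuracy.
set.seed(16)
m <- 500
study     <- rbinom(m, 1, 0.5)
ref_same  <- ifelse(runif(m) < 0.9, study, 1 - study)  # same-population reference
ref_other <- ifelse(runif(m) < 0.7, study, 1 - study)  # diverged-population reference
typed   <- sort(sample(m, 100))                        # the genotyped markers
untyped <- setdiff(seq_len(m), typed)

refs <- cbind(ref_same, ref_other)
best <- which.max(colMeans(refs[typed, ] == study[typed]))  # closest at typed sites
mean(refs[untyped, best] == study[untyped])            # accuracy at imputed sites
```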
High-density marker imputation accuracy in sixteen French cattle breeds.
Hozé, Chris; Fouilloux, Marie-Noëlle; Venot, Eric; Guillaume, François; Dassonneville, Romain; Fritz, Sébastien; Ducrocq, Vincent; Phocas, Florence; Boichard, Didier; Croiseau, Pascal
2013-09-03
Genotyping with the medium-density Bovine SNP50 BeadChip® (50K) is now standard in cattle. The high-density BovineHD BeadChip®, which contains 777,609 single nucleotide polymorphisms (SNPs), was developed in 2010. Increasing marker density increases the level of linkage disequilibrium between quantitative trait loci (QTL) and SNPs and the accuracy of QTL localization and genomic selection. However, re-genotyping all animals with the high-density chip is not economically feasible. An alternative strategy is to genotype part of the animals with the high-density chip and to impute high-density genotypes for animals already genotyped with the 50K chip. Thus, it is necessary to investigate the error rate when imputing from the 50K to the high-density chip. Five thousand one hundred and fifty-three animals from 16 breeds (89 to 788 per breed) were genotyped with the high-density chip. Imputation error rates from the 50K to the high-density chip were computed for each breed with a validation set that included the 20% youngest animals. Marker genotypes were masked for animals in the validation population in order to mimic 50K genotypes. Imputation was carried out using the Beagle 3.3.0 software. Mean allele imputation error rates ranged from 0.31% to 2.41% depending on the breed. In total, 1980 SNPs had high imputation error rates in several breeds, which is probably due to genome assembly errors, and we recommend discarding these in future studies. Differences in imputation accuracy between breeds were related to the high-density-genotyped sample size and to the genetic relationship between reference and validation populations, whereas differences in effective population size and level of linkage disequilibrium showed limited effects. Accordingly, imputation accuracy was higher in breeds with large populations and in dairy breeds than in beef breeds. More than 99% of the alleles were correctly imputed if more than 300 animals were genotyped at high-density. No improvement was observed when multi-breed imputation was performed. In all breeds, imputation accuracy was higher than 97%, which indicates that imputation to the high-density chip was accurate. Imputation accuracy depends mainly on the size of the reference population and the relationship between reference and target populations.
Traffic speed data imputation method based on tensor completion.
Ran, Bin; Tan, Huachun; Feng, Jianshuai; Liu, Ying; Wang, Wuhong
2015-01-01
Traffic speed data plays a key role in Intelligent Transportation Systems (ITS); however, missing traffic data would affect the performance of ITS as well as Advanced Traveler Information Systems (ATIS). In this paper, we handle this issue by a novel tensor-based imputation approach. Specifically, a tensor pattern is adopted for modeling traffic speed data, and then High Accuracy Low Rank Tensor Completion (HaLRTC), an efficient tensor completion method, is employed to estimate the missing traffic speed data. The proposed method is able to recover missing entries from the observed entries, which may be noisy given that traffic speed fluctuates more severely than traffic volume. The proposed method is evaluated on the Performance Measurement System (PeMS) database, and the experimental results show the superiority of the proposed approach over state-of-the-art baseline approaches. PMID:25866501
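HaLRTC minimizes nuclear norms of the tensor's unfoldings; a matrix analogue conveys the flavor. The R sketch below completes a simulated day-by-time-slot speed matrix by iterative SVD soft-thresholding with observed entries held fixed; the threshold lambda and the data are assumptions, and this is not the HaLRTC algorithm itself.

```r
# Simplified sketch (R): "SoftImpute"-style matrix completion as a stand-in for
# tensor completion -- soft-threshold singular values, keep observed speeds fixed.
set.seed(17)
true <- 60 + outer(rnorm(30, sd = 5), rep(1, 48)) +  # 30 days x 48 time slots
        outer(rep(1, 30), 10 * sin(seq(0, 2 * pi, length.out = 48)))
obs <- matrix(runif(length(true)) > 0.3, 30, 48)     # 30% of readings missing
X <- ifelse(obs, true, NA)

Y <- X; Y[!obs] <- mean(X, na.rm = TRUE)
lambda <- 5
for (it in 1:100) {
  s <- svd(Y)
  d <- pmax(s$d - lambda, 0)                         # soft-threshold singular values
  Y_new <- s$u %*% diag(d) %*% t(s$v)
  Y_new[obs] <- X[obs]                               # keep observed speeds fixed
  if (max(abs(Y_new - Y)) < 1e-3) break
  Y <- Y_new
}
sqrt(mean((Y[!obs] - true[!obs])^2))                 # RMSE on the missing readings
```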
Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.
Ernst, Jason; Kellis, Manolis
2015-04-01
With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.
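The prediction step, regress a held-out signal track on correlated observed tracks with an ensemble of trees, can be mimicked with a random forest. The R sketch below is a stand-in for ChromImpute's own regression-tree ensemble, on simulated binned signals; real inputs would be genome-wide coverage tracks across marks and samples.

```r
# Minimal sketch (R): predict a held-out epigenomic track from other observed
# tracks with a tree ensemble (random forest as a stand-in).
library(randomForest)
set.seed(18)
bins <- 2000
observed_tracks <- matrix(rnorm(bins * 8), bins, 8)  # 8 observed marks/samples
target <- drop(observed_tracks %*% runif(8)) + rnorm(bins, sd = 0.5)

train <- sample(bins, 1000)                          # bins where the mark was assayed
fit  <- randomForest(x = observed_tracks[train, ], y = target[train], ntree = 200)
pred <- predict(fit, observed_tracks[-train, ])
cor(pred, target[-train])                            # imputed vs held-out signal
```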
The use of multiple imputation in the Southern Annual Forest Inventory System
Gregory A. Reams; Joseph M. McCollum
2000-01-01
The Southern Research Station is currently implementing an annual forest survey in 7 of the 13 States that it is responsible for surveying. The Southern Annual Forest Inventory System (SAFIS) sampling design is a systematic sample of five interpenetrating grids, whereby an equal number of plots are measured each year. The area-representative and time-series...
ERIC Educational Resources Information Center
Monahan, Kathryn C.; Lee, Joanna M.; Steinberg, Laurence
2011-01-01
The impact of part-time employment on adolescent functioning remains unclear because most studies fail to adequately control for differential selection into the workplace. The present study reanalyzes data from L. Steinberg, S. Fegley, and S. M. Dornbusch (1993) using multiple imputation, which minimizes bias in effect size estimation, and 2 types…
Mercer, Theresa G; Frostick, Lynne E; Walmsley, Anthony D
2011-10-15
This paper presents a statistical technique that can be applied to environmental chemistry data in which missing values and limit-of-detection levels prevent the application of standard statistics. A working example is taken from an environmental leaching study that was set up to determine whether there were significant differences in levels of leached arsenic (As), chromium (Cr) and copper (Cu) between lysimeters containing preservative-treated wood waste and those containing untreated wood. Fourteen lysimeters were set up and left in natural conditions for 21 weeks. The resultant leachate was analysed by ICP-OES to determine the As, Cr and Cu concentrations. However, due to the variation inherent in each lysimeter combined with the limits of detection offered by ICP-OES, the collected quantitative data were somewhat incomplete, and initial data analysis was hampered by the number of 'missing values'. To recover the dataset, the statistical tool of Statistical Multiple Imputation (SMI) was applied, and the data were re-analysed successfully. It was demonstrated that using SMI did not affect the variance in the data but facilitated analysis of the complete dataset. Copyright © 2011 Elsevier B.V. All rights reserved.
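One common way to multiply impute values reported only as below a limit of detection (LOD), sketched below, is to fit a distribution to the detected concentrations and draw each censored value from that distribution truncated above at the LOD. The lognormal choice, the concentrations, and the LOD are illustrative assumptions; the paper's exact SMI procedure may differ.

```python
# Sketch: multiple imputation of below-LOD concentrations via a
# truncated lognormal fitted to the detected values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
detected = np.array([2.1, 3.4, 1.8, 5.0, 2.7, 4.2])  # hypothetical ug/L
n_censored, lod = 4, 1.5                             # four values "<1.5"

mu = np.mean(np.log(detected))
sigma = np.std(np.log(detected), ddof=1)

def draw_censored(m):
    """Draw m values from logN(mu, sigma) truncated above at the LOD."""
    p_lod = stats.norm.cdf((np.log(lod) - mu) / sigma)
    u = rng.uniform(0, p_lod, size=m)      # inverse-CDF sampling of the tail
    return np.exp(mu + sigma * stats.norm.ppf(u))

# Five imputed datasets; each is analysed separately and results pooled.
imputations = [np.concatenate([detected, draw_censored(n_censored)])
               for _ in range(5)]
print([round(d.mean(), 2) for d in imputations])
```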
Keshavarzi, Sareh; Ayatollahi, Seyyed Mohammad Taghi; Zare, Najaf; Pakfetrat, Maryam
2012-01-01
BACKGROUND. In many studies with longitudinal data, time-dependent covariates can only be measured intermittently (not at all observation times), which presents difficulties for standard statistical analyses. This situation is common in medical studies, and methods that deal with this challenge would be useful. METHODS. In this study, we fitted seemingly unrelated regression (SUR)-based models with respect to each observation time in longitudinal data with intermittently observed time-dependent covariates, and compared these models with mixed-effect regression models (MRMs) under three classic imputation procedures. Simulation studies were performed to compare the sampling properties of the estimated coefficients for the different modeling choices. RESULTS. In general, the proposed models performed well in the presence of intermittently observed time-dependent covariates. However, when we considered only the observed values of the covariate without any imputation, the resulting biases were greater. The performance of the proposed SUR-based models was nearly identical to that of MRMs using classic imputation methods, with approximately equal amounts of bias and MSE. CONCLUSION. The simulation study suggests that the SUR-based models work as efficiently as MRMs in the case of intermittently observed time-dependent covariates and can therefore be used as an alternative to MRMs.
Cleveland, M A; Hickey, J M
2013-08-01
Genomic selection can be implemented in pig breeding at a reduced cost using genotype imputation. Accuracy of imputation and its impact on the resulting genomic breeding values (gEBV) were investigated. High-density genotype data were available for 4,763 animals from a single pig line. Three low-density genotype panels were constructed with SNP densities of 450 (L450), 3,071 (L3k) and 5,963 (L6k). Accuracy of imputation was determined using 184 test individuals with no genotyped descendants in the data but with parents and grandparents genotyped using the Illumina PorcineSNP60 BeadChip. Alternative genotyping scenarios were created in which parents, grandparents, and individuals that were not direct ancestors of test animals (Other) were genotyped at high density (S1), grandparents were not genotyped (S2), dams and granddams were not genotyped (S3), and dams and granddams were genotyped at low density (S4). Four additional scenarios were created by excluding Other animal genotypes. Test individuals were always genotyped at low density. Imputation was performed with AlphaImpute. Genomic breeding values were calculated using single-step genomic evaluation. Test animals were evaluated for the information retained in the gEBV, calculated as the correlation between gEBV using imputed genotypes and gEBV using true genotypes. Accuracy of imputation was high for all scenarios but decreased with fewer SNP on the low-density panel (0.995 to 0.965 for S1) and with reduced genotyping of ancestors, where the largest changes were for L450 (0.965 in S1 to 0.914 in S3). Exclusion of genotypes for Other animals resulted in only small accuracy decreases. Imputation accuracy was not consistent across the genome. Information retained in the gEBV was related to genotyping scenario and thus to imputation accuracy. Reducing the number of SNP on the low-density panel reduced the information retained in the gEBV, with the largest decrease observed from L3k to L450. Excluding Other animal genotypes had little impact on imputation accuracy but caused large decreases in the information retained in the gEBV. These results indicate that the accuracy of gEBV from imputed genotypes depends on the level of genotyping in close relatives and the size of the genotyped dataset. Fewer high-density genotyped individuals are needed to obtain accurate imputation than are needed to obtain accurate gEBV. Strategies to optimize development of low-density panels can improve both imputation and gEBV accuracy.
Hensman, James; Lawrence, Neil D; Rattray, Magnus
2013-08-20
Time course data from microarrays and high-throughput sequencing experiments require simple, computationally efficient and powerful statistical models to extract meaningful biological signal, and for tasks such as data fusion and clustering. Existing methodologies fail to capture either the temporal or replicated nature of the experiments, and often impose constraints on the data collection process, such as regularly spaced samples, or similar sampling schema across replications. We propose hierarchical Gaussian processes as a general model of gene expression time-series, with application to a variety of problems. In particular, we illustrate the method's capacity for missing data imputation, data fusion and clustering. The method can impute data which are missing both systematically and at random: in a hold-out test on real data, performance is significantly better than commonly used imputation methods. The method's ability to model inter- and intra-cluster variance leads to more biologically meaningful clusters. The approach removes the necessity for evenly spaced samples, an advantage illustrated on a developmental Drosophila dataset with irregular replications. The hierarchical Gaussian process model provides an excellent statistical basis for several gene-expression time-series tasks. It has only a few additional parameters over a regular GP, has negligible additional complexity, is easily implemented and can be integrated into several existing algorithms. Our experiments were implemented in Python, and are available from the authors' website: http://staffwww.dcs.shef.ac.uk/people/J.Hensman/.
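The full hierarchical model is beyond a short snippet, but plain Gaussian-process regression already shows how irregularly sampled time courses can be imputed with uncertainty: fit a GP to the observed time points and read off the posterior at the missing ones. The kernel choices and data below are illustrative assumptions, not the authors' hierarchical formulation.

```python
# Sketch: GP imputation of an irregularly sampled expression time course.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
t_obs = np.sort(rng.uniform(0, 10, size=12))           # irregular time points
y_obs = np.sin(t_obs) + 0.1 * rng.standard_normal(12)  # one gene's profile

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=2.0) + WhiteKernel(noise_level=0.01))
gp.fit(t_obs.reshape(-1, 1), y_obs)

t_missing = np.array([[2.5], [7.3]])       # times to impute
mean, std = gp.predict(t_missing, return_std=True)
print(mean, std)                           # imputed values with uncertainty
```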
Badke, Yvonne M; Bates, Ronald O; Ernst, Catherine W; Fix, Justin; Steibel, Juan P
2014-04-16
Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate the accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65-0.68). Using genotypes imputed from a large reference panel (accuracy: R(2) = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R(2) = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy. On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for selection results in lower accuracy of genomic evaluation.
Zhu, Lin; Guo, Wei-Li; Lu, Canyi; Huang, De-Shuang
2016-12-01
Although the newly available ChIP-seq data provide immense opportunities for comparative study of regulatory activities across different biological conditions, due to cost, time or sample material availability it is not always possible for researchers to obtain binding profiles for every protein in every sample of interest, which considerably limits the power of integrative studies. Recently, by leveraging related information from measured data, Ernst et al. proposed ChromImpute for predicting additional ChIP-seq and other types of datasets, and demonstrated that the imputed signal tracks accurately approximate the experimentally measured signals and could thereby enhance the power of integrative analysis. Despite the success of ChromImpute, in this paper we reexamine its learning process and show that its performance may degrade substantially, and that it may sometimes even fail to output a prediction, when the available data are scarce. This limitation could hurt its applicability to important predictive tasks, such as the imputation of TF binding data. To alleviate this problem, we propose a novel method called Local Sensitive Unified Embedding (LSUE) for imputing new ChIP-seq datasets. In LSUE, the ChIP-seq data compendium is fused together by mapping proteins, samples, and genomic positions simultaneously into Euclidean space, thereby making their underlying associations directly evaluable using simple calculations. In contrast to ChromImpute, which mainly makes use of the local correlations between available datasets, LSUE can better estimate the overall data structure by formulating the representation learning of all involved entities as a single unified optimization problem. Meanwhile, a novel form of local sensitive low-rank regularization is also proposed to further improve the performance of LSUE. Experimental evaluations on the ENCODE TF ChIP-seq data illustrate the performance of the proposed model. The code of LSUE is available at https://github.com/ekffar/LSUE.
Hieke, Stefanie; Benner, Axel; Schlenk, Richard F; Schumacher, Martin; Bullinger, Lars; Binder, Harald
2016-08-30
High-throughput technology allows for genome-wide measurements at different molecular levels for the same patient, e.g. single nucleotide polymorphisms (SNPs) and gene expression. Correspondingly, it might be beneficial to integrate complementary information from different molecular levels when building multivariable risk prediction models for a clinical endpoint, such as treatment response or survival. Unfortunately, such a high-dimensional modeling task will often be complicated by a limited overlap of molecular measurements at different levels between patients, i.e. measurements from all molecular levels are available only for a smaller proportion of patients. We propose a sequential strategy for building clinical risk prediction models that integrate genome-wide measurements from two molecular levels in a complementary way. To deal with partial overlap, we develop an imputation approach that allows us to use all available data. This approach is investigated in two acute myeloid leukemia applications combining gene expression with either SNP or DNA methylation data. After obtaining a sparse risk prediction signature, e.g. from the SNP data (an automatically selected set of prognostic SNPs identified by componentwise likelihood-based boosting), imputation is performed for the corresponding linear predictor by a linking model that incorporates, e.g., gene expression measurements. The imputed linear predictor is then used for adjustment when building a prognostic signature from the gene expression data. For evaluation, we consider stability, as quantified by inclusion frequencies across resampling data sets. Despite an extremely small overlap in the application example with gene expression and SNPs, several genes are seen to be more stably identified when taking the (imputed) linear predictor from the SNP data into account. In the application with gene expression and DNA methylation, prediction performance with respect to survival also indicates that the proposed approach might work well. We consider imputation of linear predictor values to be a feasible and sensible approach for dealing with partial overlap in complementary integrative analysis of molecular measurements at different levels. More generally, these results indicate that a complementary strategy for integrating different molecular levels can result in more stable risk prediction signatures, potentially providing a more reliable insight into the underlying biology.
Nelson, Sarah C.; Stilp, Adrienne M.; Papanicolaou, George J.; Taylor, Kent D.; Rotter, Jerome I.; Thornton, Timothy A.; Laurie, Cathy C.
2016-01-01
Imputation is commonly used in genome-wide association studies to expand the set of genetic variants available for analysis. Larger and more diverse reference panels, such as the final Phase 3 of the 1000 Genomes Project, hold promise for improving imputation accuracy in genetically diverse populations such as Hispanics/Latinos in the USA. Here, we sought to empirically evaluate imputation accuracy when imputing to a 1000 Genomes Phase 3 versus a Phase 1 reference, using participants from the Hispanic Community Health Study/Study of Latinos. Our assessments included calculating the correlation between imputed and observed allelic dosage in a subset of samples genotyped on a supplemental array. We observed that the Phase 3 reference yielded higher accuracy at rare variants, but that the two reference panels were comparable at common variants. At a sample level, the Phase 3 reference improved imputation accuracy in Hispanic/Latino samples from the Caribbean more than for Mainland samples, which we attribute primarily to the additional reference panel samples available in Phase 3. We conclude that a 1000 Genomes Project Phase 3 reference panel can yield improved imputation accuracy compared with Phase 1, particularly for rare variants and for samples of certain genetic ancestry compositions. Our findings can inform imputation design for other genome-wide association studies of participants with diverse ancestries, especially as larger and more diverse reference panels continue to become available. PMID:27346520
Genotype Imputation with Thousands of Genomes
Howie, Bryan; Marchini, Jonathan; Stephens, Matthew
2011-01-01
Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package. PMID:22384356
Shults, Ruth A.; Banerjee, Tanima; Perry, Timothy
2017-01-01
Objectives We examined associations among race/ethnicity, socioeconomic factors, and driving status in a nationally representative sample of >26,000 U.S. high school seniors. Methods Weighted data from the 2012 and 2013 Monitoring the Future surveys were combined and analyzed. We imputed missing values using fully conditional specification multiple imputation methods. Multivariate logistic regression modeling was conducted to explore associations among race/ethnicity, socioeconomic factors, and driving status, while accounting for selected student behaviors and location. Lastly, odds ratios were converted to prevalence ratios. Results 23% of high school seniors did not drive during an average week; 14% of white students were nondrivers compared to 40% of black students. Multivariate analysis revealed that minority students were 1.8 to 2.5 times more likely to be nondrivers than their white counterparts, and students who had no earned income were 2.8 times more likely to be nondrivers than those earning an average of ≥$36 a week. Driving status also varied considerably by student academic performance, number of parents in the household, parental education, census region, and urbanicity. Conclusions Our findings suggest that resources—both financial and time—influence when or whether a teen will learn to drive. Many young people from minority or lower socioeconomic families who learn to drive may be doing so after their 18th birthday and therefore would not take advantage of the safety benefits provided by graduated driver licensing. Innovative approaches may be needed to improve safety for these young novice drivers. PMID:27064697
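The final analysis step above, converting odds ratios to prevalence ratios, is commonly done with the formula of Zhang and Yu (1998): PR = OR / ((1 - P0) + P0 * OR), where P0 is the outcome prevalence in the reference group. A minimal sketch, using the abstract's 14% reference prevalence of nondriving among white students and an illustrative odds ratio:

```python
def or_to_pr(odds_ratio, p0):
    """Convert an odds ratio to a prevalence ratio, given the outcome
    prevalence p0 in the reference (unexposed) group (Zhang & Yu, 1998)."""
    return odds_ratio / ((1 - p0) + p0 * odds_ratio)

# With P0 = 0.14 (white students, from the abstract), an OR of 4.0
# corresponds to a noticeably smaller prevalence ratio of about 2.8.
print(round(or_to_pr(4.0, 0.14), 2))
```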
Reading Profiles in Multi-Site Data With Missingness.
Eckert, Mark A; Vaden, Kenneth I; Gebregziabher, Mulugeta
2018-01-01
Children with reading disability exhibit varied deficits in reading and cognitive abilities that contribute to their reading comprehension problems. Some children exhibit primary deficits in phonological processing, while others can exhibit deficits in oral language and executive functions that affect comprehension. This behavioral heterogeneity is problematic when missing data prevent the characterization of different reading profiles, which often occurs in retrospective data sharing initiatives without coordinated data collection. Here we show that reading profiles can be reliably identified based on Random Forest classification of incomplete behavioral datasets, after the missForest method is used to multiply impute missing values. Results from simulation analyses showed that reading profiles could be accurately classified across degrees of missingness (e.g., ∼5% classification error for 30% missingness across the sample). The application of missForest to a real multi-site dataset with missingness (n = 924) showed that reading disability profiles significantly and consistently differed in reading and cognitive abilities for cases with and without missing data. The results of validation analyses indicated that the reading profiles (cases with and without missing data) exhibited significant differences for an independent set of behavioral variables that were not used to classify reading profiles. Together, the results show how multiple imputation can be applied to the classification of cases with missing data and can increase the integrity of results from multi-site open access datasets.
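missForest itself is an R package; in Python, a missForest-style imputer is often approximated with scikit-learn's IterativeImputer wrapping a random-forest regressor, which iteratively regresses each variable on the others. A hedged sketch on made-up behavioral scores (the variable count and missingness rate are illustrative):

```python
# Sketch: missForest-style imputation via iterative random-forest regression.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
scores = rng.standard_normal((200, 6))            # reading/cognitive measures
scores[rng.random(scores.shape) < 0.3] = np.nan   # ~30% missingness

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
completed = imputer.fit_transform(scores)         # ready for classification
```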
genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools.
Lemieux Perreault, Louis-Philippe; Legault, Marc-André; Asselin, Géraldine; Dubé, Marie-Pierre
2016-12-01
Genotype imputation is now commonly performed following genome-wide genotyping experiments. Imputation increases the density of analyzed genotypes in the dataset, enabling fine-mapping across the genome. However, the process of imputation using the most recent publicly available reference datasets can require considerable computation power and the management of hundreds of large intermediate files. We have developed genipe, a complete genome-wide imputation pipeline which includes automatic reporting, imputed data indexing and management, and a suite of statistical tests for imputed data commonly used in genetic epidemiology (Sequence Kernel Association Test, Cox proportional hazards for survival analysis, and linear mixed models for repeated measurements in longitudinal studies). The genipe package is an open source Python software and is freely available for non-commercial use (CC BY-NC 4.0) at https://github.com/pgxcentre/genipe. Documentation and tutorials are available at http://pgxcentre.github.io/genipe. Contact: louis-philippe.lemieux.perreault@statgen.org or marie-pierre.dube@statgen.org. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Trends in Allergic Conditions among Children: United States, 1997-2011
...and imputed family income (13). Data source and methods: Prevalence estimates for allergic conditions were obtained from ... sample design of NHIS. The Taylor series linearization method was chosen for variance estimation. Differences between percentages...
[Prevention and handling of missing data in clinical trials].
Jiang, Zhi-wei; Li, Chan-juan; Wang, Ling; Xia, Jie-lai
2015-11-01
Missing data are a common and often unavoidable issue in clinical trials; they not only lower the trial's power but also bias its results. Therefore, on the one hand, missing data handling methods are employed in the data analysis; on the other hand, it is vital to prevent missing data during the trial, and prevention should take priority. From the data perspective, measures should first be taken at the stages of protocol design, data collection and data checking to enhance patient compliance and reduce unnecessary missing data. Second, the causes of the missing data that do occur should be noted and recorded in detail, because they are essential for determining the missing data mechanism and choosing suitable handling methods, e.g., last observation carried forward (LOCF), multiple imputation (MI), or the mixed-effect model repeated measures (MMRM) approach.
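Of the handling methods listed, last observation carried forward is the simplest to show concretely: within each subject, a missing visit takes the most recent observed value, as sketched below with illustrative column names. LOCF can itself bias results, which is one reason MI and MMRM are often preferred.

```python
# Sketch: last observation carried forward (LOCF) within each subject.
import numpy as np
import pandas as pd

visits = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2],
    "visit":   [1, 2, 3, 1, 2, 3],
    "score":   [10.0, np.nan, np.nan, 8.0, 9.0, np.nan],
})
visits["score_locf"] = visits.groupby("subject")["score"].ffill()
print(visits)
```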
Wang, Chaolong; Schroeder, Kari B.; Rosenberg, Noah A.
2012-01-01
Allelic dropout is a commonly observed source of missing data in microsatellite genotypes, in which one or both allelic copies at a locus fail to be amplified by the polymerase chain reaction. Especially for samples with poor DNA quality, this problem causes a downward bias in estimates of observed heterozygosity and an upward bias in estimates of inbreeding, owing to mistaken classifications of heterozygotes as homozygotes when one of the two copies drops out. One general approach for avoiding allelic dropout involves repeated genotyping of homozygous loci to minimize the effects of experimental error. Existing computational alternatives often require replicate genotyping as well. These approaches, however, are costly and are suitable only when enough DNA is available for repeated genotyping. In this study, we propose a maximum-likelihood approach together with an expectation-maximization algorithm to jointly estimate allelic dropout rates and allele frequencies when only one set of nonreplicated genotypes is available. Our method considers estimates of allelic dropout caused by both sample-specific factors and locus-specific factors, and it allows for deviation from Hardy–Weinberg equilibrium owing to inbreeding. Using the estimated parameters, we correct the bias in the estimation of observed heterozygosity through the use of multiple imputations of alleles in cases where dropout might have occurred. With simulated data, we show that our method can (1) effectively reproduce patterns of missing data and heterozygosity observed in real data; (2) correctly estimate model parameters, including sample-specific dropout rates, locus-specific dropout rates, and the inbreeding coefficient; and (3) successfully correct the downward bias in estimating the observed heterozygosity. We find that our method is fairly robust to violations of model assumptions caused by population structure and by genotyping errors from sources other than allelic dropout. Because the data sets imputed under our model can be investigated in additional subsequent analyses, our method will be useful for preparing data for applications in diverse contexts in population genetics and molecular ecology. PMID:22851645
ERIC Educational Resources Information Center
Smith, Nichole Danielle
2017-01-01
According to the few quasi-experimental studies examining course outcomes for community college (Xu & Jaggars, 2011a, 2011b, 2013, 2014) and for-profit students (Bettinger, Fox, Loeb, & Taylor, 2014), there is a significant penalty for online students. No comparable research has been conducted on public four-year university students to…
Auerbach, Benjamin M
2011-05-01
One of the greatest limitations to the application of the revised Fully anatomical stature estimation method is the inability to measure some of the skeletal elements required in its calculation. These element dimensions cannot be obtained due to taphonomic factors, incomplete excavation, or disease processes, and result in missing data. This study examines methods of imputing these missing dimensions using observable Fully measurements from the skeleton, and the accuracy of incorporating these missing element estimations into anatomical stature reconstruction. These are further assessed against stature estimations obtained from mathematical regression formulae for the lower limb bones (femur and tibia). Two thousand seven hundred and seventeen North and South American indigenous skeletons were measured, and subsets of these with observable Fully dimensions were used to simulate missing elements and create estimation methods and equations. Comparisons were made directly between anatomically reconstructed statures and mathematically derived statures, as well as with anatomically derived statures with imputed missing dimensions. These analyses demonstrate that, while mathematical stature estimations are more accurate, anatomical statures incorporating missing dimensions are not appreciably less accurate and are more precise. The anatomical stature estimation method using imputed missing dimensions is supported. Missing element estimation, however, is limited to the vertebral column (only when lumbar vertebrae are present) and to talocalcaneal height (only when femora and tibiae are present). Crania, entire vertebral columns, and femoral or tibial lengths cannot be reliably estimated. The applicability of these methods is further discussed. Copyright © 2011 Wiley-Liss, Inc.
Similarity indices of meteo-climatic gauging stations: definition and comparison.
Barca, Emanuele; Bruno, Delia Evelina; Passarella, Giuseppe
2016-07-01
Space-time dependencies among monitoring network stations have been investigated to detect and quantify similarity relationships among gauging stations. In this work, besides the well-known rank correlation index, two new similarity indices were defined and applied to compute the similarity matrix for the Apulian meteo-climatic monitoring network. Such similarity matrices can be used to address reliably the issue of missing data in space-time series. To establish the effectiveness of the similarity indices, a simulation test was designed and performed with the aim of estimating missing monthly rainfall rates in a suitably selected gauging station, and the multiple imputation by chained equations method was used as a benchmark to provide an absolute yardstick for comparing the outcomes of the test. In conclusion, the newly proposed multiplicative similarity index proved at least as reliable as the selected benchmark.
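As a sketch of the rank-correlation side of this idea, one can build a station-by-station Spearman similarity matrix from overlapping monthly records and fill a gap from the most similar neighbour. The station names, gamma-distributed rainfall, and naive mean-ratio adjustment below are assumptions for illustration, not the paper's new indices.

```python
# Sketch: rank-correlation similarity between gauging stations,
# used to impute a missing monthly rainfall value.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
rain = pd.DataFrame(rng.gamma(2.0, 30.0, size=(120, 4)),
                    columns=["S1", "S2", "S3", "S4"])   # monthly rainfall, mm
rain.loc[10, "S1"] = np.nan                             # a gap to fill

sim = rain.corr(method="spearman")       # pairwise station similarity matrix
donor = sim["S1"].drop("S1").idxmax()    # most similar station to S1

# Naive adjustment: scale the donor value by the ratio of station means.
ratio = rain["S1"].mean() / rain[donor].mean()
rain.loc[10, "S1"] = rain.loc[10, donor] * ratio
```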
Strategies for genotype imputation in composite beef cattle.
Chud, Tatiane C S; Ventura, Ricardo V; Schenkel, Flavio S; Carvalheiro, Roberto; Buzanskas, Marcos E; Rosa, Jaqueline O; Mudadu, Maurício de Alvarenga; da Silva, Marcos Vinicius G B; Mokry, Fabiana B; Marcondes, Cintia R; Regitano, Luciana C A; Munari, Danísio P
2015-08-07
Genotype imputation has been used to increase genomic information, allow more animals in genome-wide analyses, and reduce genotyping costs. In Brazilian beef cattle production, many animals result from crossbreeding, and such an event may alter linkage disequilibrium patterns. Thus, the challenge is to obtain accurately imputed genotypes in crossbred animals. The objective of this study was to evaluate the best fitting and most accurate imputation strategy for the MA genetic group (the progeny of a Charolais sire mated with crossbred Canchim X Zebu cows) and Canchim cattle. The data set contained 400 animals (born between 1999 and 2005) genotyped with the Illumina BovineHD panel. Imputation accuracy of genotypes from the Illumina-Bovine3K (3K), Illumina-BovineLD (6K), GeneSeek-Genomic-Profiler (GGP) BeefLD (GGP9K), GGP-IndicusLD (GGP20Ki), Illumina-BovineSNP50 (50K), GGP-IndicusHD (GGP75Ki), and GGP-BeefHD (GGP80K) to Illumina-BovineHD (HD) SNP panels was investigated. Seven scenarios for reference and target populations were tested; the animals were grouped according to birth year (S1), genetic groups (S2 and S3), genetic groups and birth year (S4 and S5), gender (S6), and gender and birth year (S7). Analyses were performed using FImpute and BEAGLE software, and computation run-time was recorded. Genotype imputation accuracy was measured by concordance rate (CR) and allelic R square (R(2)). The highest imputation accuracy scenario consisted of a reference population with males and females and a target population with young females. Among the SNP panels in the tested scenarios, the 50K, GGP75Ki, and GGP80K were the most adequate for imputing to HD in Canchim cattle. FImpute reduced the computation run-time for imputing genotypes by a factor of 20 to 100 compared with BEAGLE. Genotyping panels possessing at least 50 thousand markers are suitable for genotype imputation to HD with acceptable accuracy. The FImpute algorithm demonstrated higher efficiency of imputed markers, especially for lower-density panels. These considerations may assist in increasing genotypic information, reducing genotyping costs, and aiding genomic selection evaluations in crossbred animals.
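The two accuracy metrics used above are easy to state concretely. Assuming genotypes coded 0/1/2, concordance rate is the fraction of masked genotypes recovered exactly, and allelic R-squared is taken here as the squared Pearson correlation between true and imputed allele dosages (one common definition; imputation packages differ in the exact estimator).

```python
# Sketch: concordance rate (CR) and allelic R-squared for imputed genotypes.
import numpy as np

def concordance_rate(true_geno, imputed_geno):
    return np.mean(true_geno == imputed_geno)

def allelic_r2(true_dosage, imputed_dosage):
    r = np.corrcoef(true_dosage, imputed_dosage)[0, 1]
    return r ** 2

true_geno = np.array([0, 1, 2, 1, 0, 2, 1, 1])
imputed   = np.array([0, 1, 2, 2, 0, 2, 1, 0])
print(concordance_rate(true_geno, imputed), allelic_r2(true_geno, imputed))
```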
The utility of low-density genotyping for imputation in the Thoroughbred horse
2014-01-01
Background Despite the dramatic reduction in the cost of high-density genotyping that has occurred over the last decade, it remains one of the limiting factors for obtaining the large datasets required for genomic studies of disease in the horse. In this study, we investigated the potential for low-density genotyping and subsequent imputation to address this problem. Results Using the haplotype phasing and imputation program, BEAGLE, it is possible to impute genotypes from low- to high-density (50K) in the Thoroughbred horse with reasonable to high accuracy. Analysis of the sources of variation in imputation accuracy revealed dependence both on the minor allele frequency of the single nucleotide polymorphisms (SNPs) being imputed and on the underlying linkage disequilibrium structure. Whereas equidistant spacing of the SNPs on the low-density panel worked well, optimising SNP selection to increase their minor allele frequency was advantageous, even when the panel was subsequently used in a population of different geographical origin. Replacing base pair position with linkage disequilibrium map distance reduced the variation in imputation accuracy across SNPs. Whereas a 1K SNP panel was generally sufficient to ensure that more than 80% of genotypes were correctly imputed, other studies suggest that a 2K to 3K panel is more efficient to minimize the subsequent loss of accuracy in genomic prediction analyses. The relationship between accuracy and genotyping costs for the different low-density panels suggests that a 2K SNP panel would represent good value for money. Conclusions Low-density genotyping with a 2K SNP panel followed by imputation provides a compromise between cost and accuracy that could promote more widespread genotyping, and hence the use of genomic information in horses. In addition to offering a low cost alternative to high-density genotyping, imputation provides a means to combine datasets from different genotyping platforms, which is becoming necessary since researchers are starting to use the recently developed equine 70K SNP chip. However, more work is needed to evaluate the impact of between-breed differences on imputation accuracy. PMID:24495673
FCMPSO: An Imputation for Missing Data Features in Heart Disease Classification
NASA Astrophysics Data System (ADS)
Salleh, Mohd Najib Mohd; Ashikin Samat, Nurul
2017-08-01
The application of data mining and machine learning to uncover hidden knowledge in clinical research is becoming greatly influential in medicine. Heart disease is a leading cause of death around the world, and early prevention through efficient methods can help to reduce mortality. Medical data often contain uncertainty, being fuzzy and vague in nature, and imprecise features such as empty or missing values can degrade the quality of classification results, even though the remaining complete features still carry useful information. Therefore, an imputation approach based on Fuzzy C-Means and Particle Swarm Optimization (FCMPSO) is developed in the preprocessing stage to fill in the missing values. The completed dataset is then used to train a decision tree classifier. The experiments are conducted on a heart disease dataset, and performance is analysed using accuracy, precision, and ROC values. Results show that the performance of the decision tree is increased after the application of FCMPSO for imputation.
An estimate of the cost of administering intravenous biological agents in Spanish day hospitals
Nolla, Joan Miquel; Martín, Esperanza; Llamas, Pilar; Manero, Javier; Rodríguez de la Serna, Arturo; Fernández-Miera, Manuel Francisco; Rodríguez, Mercedes; López, José Manuel; Ivanova, Alexandra; Aragón, Belén
2017-01-01
Objective To estimate the unit costs of administering intravenous (IV) biological agents in day hospitals (DHs) in the Spanish National Health System. Patients and methods Data were obtained from 188 patients with rheumatoid arthritis, collected from nine DHs, receiving one of the following IV therapies: infliximab (n=48), rituximab (n=38), abatacept (n=41), or tocilizumab (n=61). The fieldwork was carried out between March 2013 and March 2014. Three groups of costs were considered: 1) structural costs, 2) material costs, and 3) staff costs. Staff costs were treated as a fixed cost and estimated according to the DH's theoretical level of activity, which covers the DH's general activities as well as the personal care of each patient (complete imputation method, CIM). An alternative calculation was also performed, in which staff costs were treated as a variable cost imputed according to the time spent on direct care (partial imputation method, PIM). All costs were expressed in euros for the reference year 2014. Results The average total cost was €146.12 per infusion (standard deviation [SD] ±87.11; CIM) and €29.70 per infusion (SD ±11.42; PIM). The structure-related costs per infusion varied between €2.23 and €62.35 per patient and DH; the cost of consumables ranged between €3.48 and €20.34 per patient and DH. In terms of the care process, the average difference between the shortest and the longest time taken by different hospitals to administer an IV biological therapy was 113 minutes. Conclusion The average total cost per infusion was lower than the figures from secondary sources normally used in economic evaluation models, and lower still when staff costs were imputed according to the PIM. A high degree of variability was observed between DHs in the cost of consumables, in the structure-related costs, and in the care process. PMID:28356746
Spiliopoulou, Athina; Colombo, Marco; Orchard, Peter; Agakov, Felix; McKeigue, Paul
2017-01-01
We address the task of genotype imputation to a dense reference panel given genotype likelihoods computed from ultralow coverage sequencing as inputs. In this setting, the data have a high level of missingness or uncertainty, and are thus more amenable to a probabilistic representation. Most existing imputation algorithms are not well suited for this situation, as they rely on prephasing for computational efficiency, and, without definite genotype calls, the prephasing task becomes computationally expensive. We describe GeneImp, a program for genotype imputation that does not require prephasing and is computationally tractable for whole-genome imputation. GeneImp does not explicitly model recombination; instead, it capitalizes on the existence of large reference panels, comprising thousands of reference haplotypes, and assumes that the reference haplotypes can adequately represent the target haplotypes over short regions unaltered. We validate GeneImp on data from ultralow coverage sequencing (0.5×), and compare its performance to the most recent version of BEAGLE that can perform this task. We show that GeneImp achieves imputation quality very close to that of BEAGLE, using one to two orders of magnitude less time, without an increase in memory complexity. Therefore, GeneImp is the first practical choice for whole-genome imputation to a dense reference panel when prephasing cannot be applied, for instance, in datasets produced via ultralow coverage sequencing. A related future application for GeneImp is whole-genome imputation based on the off-target reads from deep whole-exome sequencing. PMID:28348060
Xu, Stanley; Clarke, Christina L; Newcomer, Sophia R; Daley, Matthew F; Glanz, Jason M
2018-05-16
Vaccine safety studies are often electronic health record (EHR)-based observational studies. These studies face significant methodological challenges, including confounding and misclassification of adverse events. Vaccine safety researchers use the self-controlled case series (SCCS) design to handle confounding and employ medical chart review to ascertain cases identified from EHR data. However, for common adverse events, limited resources often make it impossible to adjudicate all events observed in the electronic data. In this paper, we considered four approaches for analyzing SCCS data with confirmation rates estimated from an internal validation sample: (1) observed cases, (2) confirmed cases only, (3) known confirmation rate, and (4) multiple imputation (MI). We conducted a simulation study to evaluate these four approaches in terms of type I error rates, percent bias, and empirical power. Our simulation results suggest that when misclassification of adverse events is present, the observed-cases, confirmed-cases-only, and known-confirmation-rate approaches may inflate the type I error, yield biased point estimates, and affect statistical power. The multiple imputation approach accounts for the uncertainty of the confirmation rates estimated from an internal validation sample, and yields a proper type I error rate, a largely unbiased point estimate, a proper variance estimate, and adequate statistical power. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
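A hedged sketch of the multiple-imputation idea described above: the confirmation rate is estimated from an internal validation sample, its uncertainty is propagated by drawing a rate per imputation from a Beta posterior, and unadjudicated events are thinned accordingly; estimates are then pooled with Rubin's rules. The analysis step here is a placeholder (a simple mean event count), not a full SCCS likelihood, and all numbers are illustrative.

```python
# Sketch: MI over an uncertain confirmation rate, pooled by Rubin's rules.
import numpy as np

rng = np.random.default_rng(6)
n_reviewed, n_confirmed = 100, 80      # internal validation sample
events = rng.poisson(2, size=500)      # unadjudicated event counts

estimates, variances = [], []
for _ in range(20):                    # 20 imputations
    p = rng.beta(n_confirmed + 1, n_reviewed - n_confirmed + 1)
    confirmed = rng.binomial(events, p)          # impute true case counts
    estimates.append(confirmed.mean())           # placeholder analysis
    variances.append(confirmed.var(ddof=1) / len(confirmed))

m = len(estimates)
q_bar = np.mean(estimates)                       # pooled estimate
total_var = np.mean(variances) + (1 + 1 / m) * np.var(estimates, ddof=1)
print(q_bar, np.sqrt(total_var))
```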
Methods to control for unmeasured confounding in pharmacoepidemiology: an overview.
Uddin, Md Jamal; Groenwold, Rolf H H; Ali, Mohammed Sanni; de Boer, Anthonius; Roes, Kit C B; Chowdhury, Muhammad A B; Klungel, Olaf H
2016-06-01
Background Unmeasured confounding is one of the principal problems in pharmacoepidemiologic studies. Several methods have been proposed to detect or control for unmeasured confounding, either at the study design phase or at the data analysis phase. Aim of the Review To provide an overview of commonly used methods to detect or control for unmeasured confounding and to provide recommendations for their proper application in pharmacoepidemiology. Methods/Results Methods to control for unmeasured confounding in the design phase of a study are case-only designs (e.g., case-crossover, case-time-control, self-controlled case series) and the prior event rate ratio adjustment method. Methods that can be applied in the data analysis phase include the negative control method, the perturbation variable method, instrumental variable methods, sensitivity analysis, and ecological analysis. A separate group of methods are those in which additional information on confounders is collected from a substudy; this group includes external adjustment, propensity score calibration, two-stage sampling, and multiple imputation. Conclusion As the performance and application of the methods to handle unmeasured confounding may differ across studies and across databases, we stress the importance of using both statistical evidence and substantial clinical knowledge for the interpretation of study results.
16 CFR 1115.11 - Imputed knowledge.
Code of Federal Regulations, 2014 CFR
2014-01-01
...care to ascertain the truth of complaints or other representations. This includes the knowledge a firm... Title 16, Commercial Practices: SUBSTANTIAL PRODUCT HAZARD REPORTS, General Interpretation, § 1115.11 Imputed knowledge. (a) In evaluating whether or...
16 CFR 1115.11 - Imputed knowledge.
Code of Federal Regulations, 2010 CFR
2010-01-01
...care to ascertain the truth of complaints or other representations. This includes the knowledge a firm... Title 16, Commercial Practices: SUBSTANTIAL PRODUCT HAZARD REPORTS, General Interpretation, § 1115.11 Imputed knowledge. (a) In evaluating whether or...
16 CFR 1115.11 - Imputed knowledge.
Code of Federal Regulations, 2011 CFR
2011-01-01
...care to ascertain the truth of complaints or other representations. This includes the knowledge a firm... Title 16, Commercial Practices: SUBSTANTIAL PRODUCT HAZARD REPORTS, General Interpretation, § 1115.11 Imputed knowledge. (a) In evaluating whether or...
16 CFR 1115.11 - Imputed knowledge.
Code of Federal Regulations, 2012 CFR
2012-01-01
...care to ascertain the truth of complaints or other representations. This includes the knowledge a firm... Title 16, Commercial Practices: SUBSTANTIAL PRODUCT HAZARD REPORTS, General Interpretation, § 1115.11 Imputed knowledge. (a) In evaluating whether or...
Fu, Liya; Wang, You-Gan
2011-02-15
Environmental data usually include measurements, such as water quality data, that fall below detection limits because of limitations of the instruments or of certain analytical methods used. The fact that some responses are not detected needs to be properly taken into account in the statistical analysis of such data. However, it is well known that analyzing a dataset with detection limits is challenging, and we often have to rely on traditional parametric methods or simple imputation methods. Distributional assumptions can lead to biased inference, and justifying a distribution is often not possible when the data are correlated and a large proportion of the data lie below detection limits; the extent of the bias is usually unknown. To draw valid conclusions and hence provide useful advice for environmental management authorities, it is essential to develop and apply an appropriate statistical methodology. This paper proposes rank-based procedures for analyzing non-normally distributed data collected at different sites over a period of time in the presence of multiple detection limits. To take account of temporal correlations within each site, we propose an optimal linear combination of estimating functions and apply the induced smoothing method to reduce the computational burden. Finally, we apply the proposed method to water quality data collected in the Susquehanna River Basin in the United States of America, which clearly demonstrates the advantages of the rank regression models.
...and imputed family income (10). Data source and methods: All ADHD prevalence estimates were obtained from the ... sample design of NHIS. The Taylor series linearization method was chosen for variance estimation. Differences between percentages...
16 CFR § 1115.11 - Imputed knowledge.
Code of Federal Regulations, 2013 CFR
2013-01-01
...due care to ascertain the truth of complaints or other representations. This includes the knowledge a... Title 16, Commercial Practices: SUBSTANTIAL PRODUCT HAZARD REPORTS, General Interpretation, § 1115.11 Imputed knowledge. (a) In evaluating...
5 CFR 919.630 - May the OPM impute conduct of one person to another?
Code of Federal Regulations, 2010 CFR
2010-01-01
...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an... individual to whom the improper conduct is imputed either participated in, had knowledge of, or reason to...
48 CFR 1830.7002-4 - Determining imputed cost of money.
Code of Federal Regulations, 2012 CFR
2012-10-01
...Federal Acquisition Regulations System, NATIONAL AERONAUTICS... § 1830.7002-4 Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...
48 CFR 1830.7002-4 - Determining imputed cost of money.
Code of Federal Regulations, 2013 CFR
2013-10-01
...Federal Acquisition Regulations System, NATIONAL AERONAUTICS... § 1830.7002-4 Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...
48 CFR 1830.7002-4 - Determining imputed cost of money.
Code of Federal Regulations, 2014 CFR
2014-10-01
...Federal Acquisition Regulations System, NATIONAL AERONAUTICS... § 1830.7002-4 Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...
48 CFR 1830.7002-4 - Determining imputed cost of money.
Code of Federal Regulations, 2010 CFR
2010-10-01
...Federal Acquisition Regulations System, NATIONAL AERONAUTICS AND... § 1830.7002-4 Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...
2010-01-01
Background Participant nonresponse in an HIV serosurvey can affect estimates of HIV prevalence. Nonresponse can arise from a participant's refusal to provide a blood sample or the failure to trace a sampled individual. In a serosurvey conducted by the African Population and Health Research Center and Kenya Medical Research Centre in the slums of Nairobi, 43% of sampled individuals did not provide a blood sample. This paper describes selective participation in the serosurvey and estimates bias in HIV prevalence figures. Methods The paper uses data derived from an HIV serosurvey nested in an ongoing demographic surveillance system. Nonresponse was assessed using logistic regression, and multiple imputation methods were used to impute missing data for HIV status using a set of common variables available for all sampled participants. Results Age, residence, high mobility, wealth, and ethnicity were independent predictors of a sampled individual not being contacted. Individuals aged 30-34 years, females, individuals of Kikuyu and Kamba ethnicity, married participants, and residents of Viwandani were all less likely to accept HIV testing when contacted. Although men were less likely to be contacted, those who were found were more willing to be tested than females were. The overall observed HIV prevalence was overestimated by 2%. The observed prevalence for male participants was underestimated by about 1%, and that for females was overestimated by 3%. These differences were small and did not affect the overall estimate substantially, as the observed estimates fell within the confidence limits of the corrected prevalence estimate. Conclusions Nonresponse in the HIV serosurvey in the two informal settlements was high; however, the effect on the overall prevalence estimate was minimal. PMID:20649957
Cognitive behavior therapy for fear of flying: sustainability of treatment gains after September 11.
Anderson, Page; Jacobs, Carli H; Lindner, Gretchen K; Edwards, Shannan; Zimand, Elana; Hodges, Larry; Rothbaum, Barbara Olasov
2006-03-01
This study examines the long-term efficacy of cognitive-behavioral therapy (CBT) for fear of flying (FOF) after a catastrophic fear-relevant event, the September 11, 2001, terrorist attacks. Participants (N = 115) were randomly assigned to and completed treatment for FOF using 8 sessions of either virtual reality exposure therapy (VRE) or standard exposure therapy (SE) prior to September 11, 2001. Individuals were reassessed in June, 2002, an average of 2.3 years after treatment, with a response rate of 48% (n = 55). Analyses were run on the original data and, using multiple imputation procedures, on imputed data for the full sample. Individuals maintained or improved upon gains made in treatment as measured by standardized FOF questionnaires and by number of flights taken. There were no differences between VRE and SE. Thus, results suggest that individuals previously treated for FOF with cognitive-behavioral therapy can maintain treatment gains in the face of a catastrophic fear-relevant event, even years after treatment is completed.
Integrative approaches for large-scale transcriptome-wide association studies
Gusev, Alexander; Ko, Arthur; Shi, Huwenbo; Bhatia, Gaurav; Chung, Wonil; Penninx, Brenda W J H; Jansen, Rick; de Geus, Eco JC; Boomsma, Dorret I; Wright, Fred A; Sullivan, Patrick F; Nikkola, Elina; Alvarez, Marcus; Civelek, Mete; Lusis, Aldons J.; Lehtimäki, Terho; Raitoharju, Emma; Kähönen, Mika; Seppälä, Ilkka; Raitakari, Olli T.; Kuusisto, Johanna; Laakso, Markku; Price, Alkes L.; Pajukanta, Päivi; Pasaniuc, Bogdan
2016-01-01
Many genetic variants influence complex traits by modulating gene expression, thus altering the abundance levels of one or multiple proteins. Here, we introduce a powerful strategy that integrates gene expression measurements with summary association statistics from large-scale genome-wide association studies (GWAS) to identify genes whose cis-regulated expression is associated with complex traits. We leverage expression imputation to perform a transcriptome-wide association scan (TWAS) and identify significant expression-trait associations. We applied our approaches to expression data from blood and adipose tissue measured in ~3,000 individuals overall. We imputed gene expression into GWAS data from over 900,000 phenotype measurements to identify 69 novel genes significantly associated with obesity-related traits (BMI, lipids, and height). Many of the novel genes are associated with relevant phenotypes in the Hybrid Mouse Diversity Panel. Our results showcase the power of integrating genotype, gene expression and phenotype to gain insights into the genetic basis of complex traits. PMID:26854917
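A hedged sketch of the expression-imputation step of a TWAS: fit a sparse linear model of a gene's expression on nearby (cis) genotypes in a reference panel, predict ("impute") expression into a genotyped cohort, and test the imputed expression against the trait. The simulated data, the elastic-net settings, and the marginal correlation test are illustrative simplifications, not the authors' pipeline.

```python
# Sketch: impute cis-regulated expression into a GWAS cohort and test it.
import numpy as np
from sklearn.linear_model import ElasticNet
from scipy import stats

rng = np.random.default_rng(8)
n_ref, n_gwas, n_snps = 300, 1000, 50
G_ref = rng.binomial(2, 0.3, size=(n_ref, n_snps)).astype(float)
w_true = np.zeros(n_snps)
w_true[:3] = 0.5                                   # three causal cis-SNPs
expr = G_ref @ w_true + rng.standard_normal(n_ref)

model = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(G_ref, expr)  # cis model

G_gwas = rng.binomial(2, 0.3, size=(n_gwas, n_snps)).astype(float)
trait = 0.2 * (G_gwas @ w_true) + rng.standard_normal(n_gwas)
expr_hat = model.predict(G_gwas)                   # imputed expression
r, pval = stats.pearsonr(expr_hat, trait)          # expression-trait test
print(r, pval)
```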
The Oxytocin Receptor Gene (OXTR) and Face Recognition.
Verhallen, Roeland J; Bosten, Jenny M; Goodbourn, Patrick T; Lawrance-Owen, Adam J; Bargary, Gary; Mollon, J D
2017-01-01
A recent study has linked individual differences in face recognition to rs237887, a single-nucleotide polymorphism (SNP) of the oxytocin receptor gene (OXTR; Skuse et al., 2014). In that study, participants were assessed using the Warrington Recognition Memory Test for Faces, but performance on Warrington's test has been shown not to rely purely on face recognition processes. We administered the widely used Cambridge Face Memory Test, a purer test of face recognition, to 370 participants. Performance was not significantly associated with rs237887, with 16 other SNPs of OXTR that we genotyped, or with a further 75 imputed SNPs. We also administered three other tests of face processing (the Mooney Face Test, the Glasgow Face Matching Test, and the Composite Face Test), but performance was never significantly associated with rs237887 or with any of the other genotyped or imputed SNPs, after corrections for multiple testing. In addition, we found no associations between OXTR and Autism-Spectrum Quotient scores.
The association of workplace hazards and smoking in a U.S. multiethnic working-class population.
Okechukwu, Cassandra A; Krieger, Nancy; Chen, Jarvis; Sorensen, Glorian; Li, Yi; Barbeau, Elizabeth M
2010-01-01
We investigated the extent to which smoking status was associated with exposure to occupational (e.g., dust, chemicals, noise, and ergonomic strain) and social (e.g., abuse, sexual harassment, and racial discrimination) workplace hazards in a sample of U.S. multiethnic working-class adults. United for Health is a cross-sectional study designed to investigate the combined burden of occupational and social workplace hazards in relation to race/ethnicity, gender, and wage, and to evaluate related health effects in a working-class population. Using validated measures, we collected data from 1,282 multiethnic working-class participants using audio computer-assisted interviews. We used multiple imputation methods to impute values for participants with missing data. Crude and adjusted logistic regression models were used to estimate odds ratios (ORs) and 95% confidence intervals (CIs). The prevalence of smoking was highest among non-Hispanic white workers (38.3%) and lowest among foreign-born workers (13.1%). We found an association between racial discrimination and smoking (OR = 1.12, 95% CI 1.01, 1.25). The relationship between smoking and sexual harassment, although not significant, differed for black women compared with men (OR = 1.79, 95% CI 0.99, 3.22). We did not find any associations with workplace abuse or with any of the occupational hazards. These results indicate that racial discrimination might be related to smoking in working-class populations and should be considered in tobacco-control efforts that target this high-risk population.
Multiple hot-deck imputation for network inference from RNA sequencing data.
Imbert, Alyssa; Valsesia, Armand; Le Gall, Caroline; Armenise, Claudia; Lefebvre, Gregory; Gourraud, Pierre-Antoine; Viguerie, Nathalie; Villa-Vialaneix, Nathalie
2018-05-15
Network inference provides a global view of the relations existing between gene expression levels in a given transcriptomic experiment (often only for a restricted list of chosen genes). However, it remains a challenging problem: even though the cost of sequencing techniques has decreased in recent years, the number of samples in a given experiment is still (very) small compared to the number of genes. We propose a method to increase the reliability of the inference when RNA-seq expression data have been measured together with an auxiliary dataset that can provide external information on gene expression similarity between samples. Our statistical approach, hd-MI, is based on imputation for samples without available RNA-seq data, which are treated as missing data but are observed on the secondary dataset. hd-MI can improve the reliability of the inference for missing rates up to 30% and provides more stable networks with a smaller number of false positive edges. From a biological point of view, hd-MI also proved relevant for inferring networks from RNA-seq data acquired in adipose tissue during a nutritional intervention in obese individuals. In these networks, novel links between genes were highlighted, as well as an improved comparability between the two steps of the nutritional intervention. Software and sample data are available as an R package, RNAseqNet, that can be downloaded from the Comprehensive R Archive Network (CRAN). alyssa.imbert@inra.fr or nathalie.villa-vialaneix@inra.fr. Supplementary data are available at Bioinformatics online.
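hd-MI is distributed as the R package RNAseqNet; purely to illustrate the hot-deck idea in this document's one code language, here is a Python sketch: a sample missing RNA-seq data borrows the expression profile of a donor chosen among its nearest neighbours on the auxiliary dataset, and repeating the random draw yields multiple imputed datasets. The distance metric, donor-pool size, and data are illustrative assumptions.

```python
# Sketch: multiple hot-deck imputation driven by an auxiliary dataset.
import numpy as np

rng = np.random.default_rng(7)
n, p_aux, p_expr = 50, 20, 100
aux = rng.standard_normal((n, p_aux))     # auxiliary data, all samples
expr = rng.standard_normal((n, p_expr))   # RNA-seq expression (log scale)
missing = np.zeros(n, dtype=bool)
missing[:10] = True                       # 10 samples lack RNA-seq

imputed_sets = []
for _ in range(5):                        # 5 hot-deck imputations
    completed = expr.copy()
    for i in np.where(missing)[0]:
        d = np.linalg.norm(aux[~missing] - aux[i], axis=1)
        pool = np.argsort(d)[:5]          # 5 nearest observed donors
        donor = rng.choice(pool)          # random donor from the pool
        completed[i] = expr[~missing][donor]
    imputed_sets.append(completed)
```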
26 CFR 1.401(a)(4)-7 - Imputation of permitted disparity.
Code of Federal Regulations, 2010 CFR
2010-04-01
...+ permitted disparity rate. (3) Employees whose plan year compensation exceeds the taxable wage base. If an... Title 26, Internal Revenue: § 1.401(a)(4)-7 Imputation of permitted disparity. (a) Introduction. In determining whether a plan satisfies section 401(a)(4...
31 CFR 19.630 - May the Department of the Treasury impute conduct of one person to another?
Code of Federal Regulations, 2010 CFR
2010-07-01
...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an... individual to whom the improper conduct is imputed either participated in, had knowledge of, or reason to...
22 CFR 1508.630 - May the African Development Foundation impute conduct of one person to another?
Code of Federal Regulations, 2010 CFR
2010-04-01
... knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an organization to an... improper conduct is imputed either participated in, had knowledge of, or reason to know of the improper...
Code of Federal Regulations, 2010 CFR
2010-04-01
...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an... individual to whom the improper conduct is imputed either participated in, had knowledge of, or reason to...