Wahl, Simone; Boulesteix, Anne-Laure; Zierer, Astrid; Thorand, Barbara; van de Wiel, Mark A
2016-10-26
Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation. In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI, internal validation followed by MI on the training and test parts separately; MI-Val, MI on the full data set followed by internal validation; and MI(-y)-Val, MI on the full data set omitting the outcome followed by internal validation. Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adapt a strategy for confidence interval construction to incomplete data. Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, number of covariates and decreasing sample size. 
In Val-MI, accuracy of the estimate is more strongly improved by increasing the number of bootstrap draws rather than the number of imputations. With a simple integrated approach, valid confidence intervals for performance estimates can be obtained. When prognostic models are developed on incomplete data, Val-MI represents a valid strategy to obtain estimates of predictive performance measures.
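The bootstrap 0.632+ estimate named in the abstract above can be illustrated with a minimal sketch. It follows the error-rate form of Efron and Tibshirani (1997) as recalled here, not the authors' AUC-specific implementation, which the abstract does not describe. The function name err632plus and its arguments are hypothetical; err_app is the apparent (resubstitution) error, err_oob the average out-of-bag bootstrap error, and gamma the no-information error rate.

```python
def err632plus(err_app, err_oob, gamma):
    """Bootstrap 0.632+ error estimate (error-rate form; an illustrative sketch).

    err_app: apparent error on the data used for fitting
    err_oob: mean error on out-of-bag observations across bootstrap draws
    gamma:   no-information error rate (error expected if predictors and outcome were unrelated)
    """
    # Clip the out-of-bag error at the no-information rate.
    err_oob_c = min(err_oob, gamma)
    # Relative overfitting rate, set to 0 when there is no sign of overfitting.
    if err_oob_c > err_app and gamma > err_app:
        r = (err_oob_c - err_app) / (gamma - err_app)
    else:
        r = 0.0
    # Plain 0.632 estimate plus the correction term that grows with overfitting.
    err632 = 0.368 * err_app + 0.632 * err_oob
    return err632 + (err_oob_c - err_app) * (0.368 * 0.632 * r) / (1 - 0.368 * r)


# Hypothetical inputs, for illustration only.
estimate = err632plus(err_app=0.10, err_oob=0.30, gamma=0.50)
no_overfit = err632plus(err_app=0.20, err_oob=0.20, gamma=0.50)
```

With no overfitting (r = 0) the result reduces to the ordinary 0.632 estimate; as r approaches 1 the weight shifts entirely to the out-of-bag error. In a Val-MI setting, err_app and err_oob would each be computed on separately imputed training and test parts, which this sketch leaves out.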
Ma, Yan; Zhang, Wei; Lyman, Stephen; Huang, Yihe
2018-06-01
To identify the most appropriate imputation method for missing data in the HCUP State Inpatient Databases (SID) and assess the impact of different missing data methods on racial disparities research. HCUP SID. A novel simulation study compared four imputation methods (random draw, hot deck, joint multiple imputation [MI], conditional MI) for missing values for multiple variables, including race, gender, admission source, median household income, and total charges. The simulation was built on real data from the SID to retain their hierarchical data structures and missing data patterns. Additional predictive information from the U.S. Census and American Hospital Association (AHA) database was incorporated into the imputation. Conditional MI prediction was equivalent or superior to the best performing alternatives for all missing data structures and substantially outperformed each of the alternatives in various scenarios. Conditional MI substantially improved statistical inferences for racial health disparities research with the SID. © Health Research and Educational Trust.
Hayati Rezvan, Panteha; Lee, Katherine J; Simpson, Julie A
2015-04-07
Missing data are common in medical research, which can lead to a loss in statistical power and potentially biased results if not handled appropriately. Multiple imputation (MI) is a statistical method, widely adopted in practice, for dealing with missing data. Many academic journals now emphasise the importance of reporting information regarding missing data and proposed guidelines for documenting the application of MI have been published. This review evaluated the reporting of missing data, the application of MI including the details provided regarding the imputation model, and the frequency of sensitivity analyses within the MI framework in medical research articles. A systematic review of articles published in the Lancet and New England Journal of Medicine between January 2008 and December 2013 in which MI was implemented was carried out. We identified 103 papers that used MI, with the number of papers increasing from 11 in 2008 to 26 in 2013. Nearly half of the papers specified the proportion of complete cases or the proportion with missing data by each variable. In the majority of the articles (86%) the imputed variables were specified. Of the 38 papers (37%) that stated the method of imputation, 20 used chained equations, 8 used multivariate normal imputation, and 10 used alternative methods. Very few articles (9%) detailed how they handled non-normally distributed variables during imputation. Thirty-nine papers (38%) stated the variables included in the imputation model. Less than half of the papers (46%) reported the number of imputations, and only two papers compared the distribution of imputed and observed data. Sixty-six papers presented the results from MI as a secondary analysis. Only three articles carried out a sensitivity analysis following MI to assess departures from the missing at random assumption, with details of the sensitivity analyses only provided by one article. 
This review outlined deficiencies in the documenting of missing data and the details provided about imputation. Furthermore, only a few articles performed sensitivity analyses following MI even though this is strongly recommended in guidelines. Authors are encouraged to follow the available guidelines and provide information on missing data and the imputation process.
A bias-corrected estimator in multiple imputation for missing data.
Tomita, Hiroaki; Fujisawa, Hironori; Henmi, Masayuki
2018-05-29
Multiple imputation (MI) is one of the most popular methods to deal with missing data, and its use has been rapidly increasing in medical studies. Although MI is rather appealing in practice since it is possible to use ordinary statistical methods for a complete data set once the missing values are fully imputed, the method of imputation is still problematic. If the missing values are imputed from some parametric model, the validity of imputation is not necessarily ensured, and the final estimate for a parameter of interest can be biased unless the parametric model is correctly specified. Nonparametric methods have been also proposed for MI, but it is not so straightforward as to produce imputation values from nonparametrically estimated distributions. In this paper, we propose a new method for MI to obtain a consistent (or asymptotically unbiased) final estimate even if the imputation model is misspecified. The key idea is to use an imputation model from which the imputation values are easily produced and to make a proper correction in the likelihood function after the imputation by using the density ratio between the imputation model and the true conditional density function for the missing variable as a weight. Although the conditional density must be nonparametrically estimated, it is not used for the imputation. The performance of our method is evaluated by both theory and simulation studies. A real data analysis is also conducted to illustrate our method by using the Duke Cardiac Catheterization Coronary Artery Disease Diagnostic Dataset. Copyright © 2018 John Wiley & Sons, Ltd.
Multiple imputation of missing data in nested case-control and case-cohort studies.
Keogh, Ruth H; Seaman, Shaun R; Bartlett, Jonathan W; Wood, Angela M
2018-06-05
The nested case-control and case-cohort designs are two main approaches for carrying out a substudy within a prospective cohort. This article adapts multiple imputation (MI) methods for handling missing covariates in full-cohort studies for nested case-control and case-cohort studies. We consider data missing by design and data missing by chance. MI analyses that make use of full-cohort data and MI analyses based on substudy data only are described, alongside an intermediate approach in which the imputation uses full-cohort data but the analysis uses only the substudy. We describe adaptations to two imputation methods: the approximate method (MI-approx) of White and Royston and the "substantive model compatible" (MI-SMC) method of Bartlett et al. We also apply the "MI matched set" approach of Seaman and Keogh to nested case-control studies, which does not require any full-cohort information. The methods are investigated using simulation studies and all perform well when their assumptions hold. Substantial gains in efficiency can be made by imputing data missing by design using the full-cohort approach or by imputing data missing by chance in analyses using the substudy only. The intermediate approach brings greater gains in efficiency relative to the substudy approach and is more robust to imputation model misspecification than the full-cohort approach. The methods are illustrated using the ARIC Study cohort. Supplementary Materials provide R and Stata code. © 2018, The International Biometric Society.
MacNeil Vroomen, Janet; Eekhout, Iris; Dijkgraaf, Marcel G; van Hout, Hein; de Rooij, Sophia E; Heymans, Martijn W; Bosmans, Judith E
2016-11-01
Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing cost-effectiveness data in a randomized controlled trial. Three incomplete data sets were generated from a complete reference data set with 17, 35 and 50 % missing data in effects and costs. The strategies evaluated included complete case analysis (CCA), multiple imputation with predictive mean matching (MI-PMM), MI-PMM on log-transformed costs (log MI-PMM), and a two-step MI. Mean cost and effect estimates, standard errors and incremental net benefits were compared with the results of the analyses on the complete reference data set. The CCA, MI-PMM, and the two-step MI strategy diverged from the results for the reference data set when the amount of missing data increased. In contrast, the estimates of the Log MI-PMM strategy remained stable irrespective of the amount of missing data. MI provided better estimates than CCA in all scenarios. With low amounts of missing data the MI strategies appeared equivalent but we recommend using the log MI-PMM with missing data greater than 35 %.
Variable selection under multiple imputation using the bootstrap in a prognostic study
Heymans, Martijn W; van Buuren, Stef; Knol, Dirk L; van Mechelen, Willem; de Vet, Henrica CW
2007-01-01
Background: Missing data are a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty, which allows for adequate statistical testing. We developed and tested a methodology combining MI with bootstrapping techniques for studying prognostic variable selection. Method: In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Among the outcome and prognostic variables, the proportion of missing data ranged from 0 to 48.1%. We used four methods to investigate the influence of sampling and imputation variation: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of prognostic models developed by the four methods were assessed at different inclusion levels. Results: We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined, at inclusion levels for variable selection from 0% (full model) to 90%, bootstrap-corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86 were found. Conclusion: We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive to apply to data sets with missing values. PMID:17629912
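The inclusion frequency described in this abstract (the proportion of fitted models in which a variable appears) is easy to compute once the selected variable sets have been collected. The sketch below assumes that each imputed and/or bootstrapped data set has already yielded one set of selected variable names; the function names and the example variable names are hypothetical and not taken from the study.

```python
from collections import Counter


def inclusion_frequencies(selected_sets):
    """Return, per variable, the proportion of models in which it was selected."""
    n_models = len(selected_sets)
    if n_models == 0:
        raise ValueError("no models supplied")
    # set(s) guards against a variable being listed twice within one model.
    counts = Counter(v for s in selected_sets for v in set(s))
    return {v: c / n_models for v, c in counts.items()}


def select_at_level(freqs, level):
    """Keep variables whose inclusion frequency reaches the given level (0 to 1)."""
    return sorted(v for v, f in freqs.items() if f >= level)


# Hypothetical selections from four imputed/bootstrapped data sets.
runs = [{"age", "pain"}, {"age"}, {"age", "pain", "work"}, {"age", "work"}]
freqs = inclusion_frequencies(runs)
kept_half = select_at_level(freqs, 0.5)
kept_strict = select_at_level(freqs, 0.75)
```

A level of 0 keeps every variable that was ever selected, which corresponds to the "full model" end of the range mentioned in the abstract; raising the level yields progressively sparser models.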
Palmer, Cameron; Pe’er, Itsik
2016-01-01
Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in the place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data. PMID:27310603
Rendall, Michael S.; Ghosh-Dastidar, Bonnie; Weden, Margaret M.; Baker, Elizabeth H.; Nazarov, Zafar
2013-01-01
Within-survey multiple imputation (MI) methods are adapted to pooled-survey regression estimation where one survey has more regressors, but typically fewer observations, than the other. This adaptation is achieved through: (1) larger numbers of imputations to compensate for the higher fraction of missing values; (2) model-fit statistics to check the assumption that the two surveys sample from a common universe; and (3) specifying the analysis model completely from variables present in the survey with the larger set of regressors, thereby excluding variables never jointly observed. In contrast to the typical within-survey MI context, cross-survey missingness is monotonic and easily satisfies the Missing At Random (MAR) assumption needed for unbiased MI. Large efficiency gains and substantial reduction in omitted variable bias are demonstrated in an application to sociodemographic differences in the risk of child obesity estimated from two nationally-representative cohort surveys. PMID:24223447
Missing Data and Multiple Imputation: An Unbiased Approach
NASA Technical Reports Server (NTRS)
Foy, M.; VanBaalen, M.; Wear, M.; Mendez, C.; Mason, S.; Meyers, V.; Alexander, D.; Law, J.
2014-01-01
The default method of dealing with missing data in statistical analyses is to only use the complete observations (complete case analysis), which can lead to unexpected bias when data do not meet the assumption of missing completely at random (MCAR). For the assumption of MCAR to be met, missingness cannot be related to either the observed or unobserved variables. A less stringent assumption, missing at random (MAR), requires that missingness not be associated with the value of the missing variable itself, but can be associated with the other observed variables. When data are truly MAR as opposed to MCAR, the default complete case analysis method can lead to biased results. There are statistical options available to adjust for data that are MAR, including multiple imputation (MI) which is consistent and efficient at estimating effects. Multiple imputation uses informing variables to determine statistical distributions for each piece of missing data. Then multiple datasets are created by randomly drawing on the distributions for each piece of missing data. Since MI is efficient, only a limited number, usually less than 20, of imputed datasets are required to get stable estimates. Each imputed dataset is analyzed using standard statistical techniques, and then results are combined to get overall estimates of effect. A simulation study will be demonstrated to show the results of using the default complete case analysis, and MI in a linear regression of MCAR and MAR simulated data. Further, MI was successfully applied to the association study of CO2 levels and headaches when initial analysis showed there may be an underlying association between missing CO2 levels and reported headaches. Through MI, we were able to show that there is a strong association between average CO2 levels and the risk of headaches. Each unit increase in CO2 (mmHg) resulted in a doubling in the odds of reported headaches.
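The step in which "results are combined to get overall estimates of effect" is conventionally done with Rubin's rules. The following is a minimal sketch under that assumption; the abstract does not state which combining rule the authors used, and the function name pool_rubin and the example numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool M per-imputation estimates and their squared standard errors.

    Returns the pooled point estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = statistics.fmean(estimates)   # pooled point estimate
    w = statistics.fmean(variances)       # within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b               # total variance
    return q_bar, math.sqrt(t)


# Hypothetical regression coefficients and squared standard errors from three imputed data sets.
pooled_est, pooled_se = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The factor (1 + 1/m) inflates the between-imputation variance to account for using a finite number of imputations, which is why the pooled standard error shrinks only slowly once m is moderately large.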
Andridge, Rebecca R.
2011-01-01
In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller ICCs lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared. PMID:21259309
Simons, Claire L; Rivero-Arias, Oliver; Yu, Ly-Mee; Simon, Judit
2015-04-01
Missing data are a well-known and widely documented problem in cost-effectiveness analyses alongside clinical trials using individual patient-level data. Current methodological research recommends multiple imputation (MI) to deal with missing health outcome data, but there is little guidance on whether MI for multi-attribute questionnaires, such as the EQ-5D-3L, should be carried out at domain or at summary score level. In this paper, we evaluated the impact of imputing individual domains versus imputing index values to deal with missing EQ-5D-3L data using a simulation study and developed recommendations for future practice. We simulated missing data in a patient-level dataset with complete EQ-5D-3L data at one point in time from a large multinational clinical trial (n = 1,814). Different proportions of missing data were generated using a missing at random (MAR) mechanism and three different scenarios were studied. The performance of each method was evaluated using root mean squared error and mean absolute error of the actual versus predicted EQ-5D-3L indices. In large samples (n > 500) with a missing data pattern that follows mainly unit non-response, imputing domains or the index produced similar results. However, domain imputation became more accurate than index imputation when the pattern of missingness followed item non-response. For smaller sample sizes (n < 100), index imputation was more accurate. When MI models were misspecified, both domain and index imputations were inaccurate for any proportion of missing data. The decision between imputing the domains or the EQ-5D-3L index scores depends on the observed missing data pattern and the sample size available for analysis. Analysts conducting this type of exercise should also evaluate the sensitivity of the analysis to the MAR assumption and whether the imputation model is correctly specified.
Comulada, W. Scott
2015-01-01
Stata’s mi commands provide powerful tools to conduct multiple imputation in the presence of ignorable missing data. In this article, I present Stata code to extend the capabilities of the mi commands to address two areas of statistical inference where results are not easily aggregated across imputed datasets. First, mi commands are restricted to covariate selection. I show how to address model fit to correctly specify a model. Second, the mi commands readily aggregate model-based standard errors. I show how standard errors can be bootstrapped for situations where model assumptions may not be met. I illustrate model specification and bootstrapping on frequency counts for the number of times that alcohol was consumed in data with missing observations from a behavioral intervention. PMID:26973439
Handling Missing Data: Analysis of a Challenging Data Set Using Multiple Imputation
ERIC Educational Resources Information Center
Pampaka, Maria; Hutcheson, Graeme; Williams, Julian
2016-01-01
Missing data is endemic in much educational research. However, practices such as step-wise regression common in the educational research literature have been shown to be dangerous when significant data are missing, and multiple imputation (MI) is generally recommended by statisticians. In this paper, we provide a review of these advances and their…
Reporting the Use of Multiple Imputation for Missing Data in Higher Education Research
ERIC Educational Resources Information Center
Manly, Catherine A.; Wells, Ryan S.
2015-01-01
Higher education researchers using survey data often face decisions about handling missing data. Multiple imputation (MI) is considered by many statisticians to be the most appropriate technique for addressing missing data in many circumstances. In particular, it has been shown to be preferable to listwise deletion, which has historically been a…
Jolani, Shahab
2018-03-01
In health and medical sciences, multiple imputation (MI) is now becoming popular to obtain valid inferences in the presence of missing data. However, MI of clustered data such as multicenter studies and individual participant data meta-analysis requires advanced imputation routines that preserve the hierarchical structure of data. In clustered data, a specific challenge is the presence of systematically missing data, when a variable is completely missing in some clusters, and sporadically missing data, when it is partly missing in some clusters. Unfortunately, little is known about how to perform MI when both types of missing data occur simultaneously. We develop a new class of hierarchical imputation approach based on chained equations methodology that simultaneously imputes systematically and sporadically missing data while allowing for arbitrary patterns of missingness among them. Here, we use a random effect imputation model and adopt a simplification over fully Bayesian techniques such as Gibbs sampler to directly obtain draws of parameters within each step of the chained equations. We justify through theoretical arguments and extensive simulation studies that the proposed imputation methodology has good statistical properties in terms of bias and coverage rates of parameter estimates. An illustration is given in a case study with eight individual participant datasets. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Moore, Lynne; Hanley, James A; Lavoie, André; Turgeon, Alexis
2009-05-01
The National Trauma Data Bank (NTDB) is plagued by the problem of missing physiological data. The Glasgow Coma Scale score, Respiratory Rate and Systolic Blood Pressure are an essential part of risk adjustment strategies for trauma system evaluation and clinical research. Missing data on these variables may compromise the feasibility and the validity of trauma group comparisons. To evaluate the validity of Multiple Imputation (MI) for completing missing physiological data in the National Trauma Data Bank (NTDB), by assessing the impact of MI on 1) frequency distributions, 2) associations with mortality, and 3) risk adjustment. Analyses were based on 170,956 NTDB observations with complete physiological data (observed data set). Missing physiological data were artificially imposed on this data set and then imputed using MI (MI data set). To assess the impact of MI on risk adjustment, 100 pairs of hospitals were randomly selected with replacement and compared using adjusted Odds Ratios (OR) of mortality. OR generated by the observed data set were then compared to those generated by the MI data set. Frequency distributions and associations with mortality were preserved following MI. The median absolute difference between adjusted OR of mortality generated by the observed data set and by the MI data set was 3.6% (inter-quartile range: 2.4%-6.1%). This study suggests that, provided it is implemented with care, MI of missing physiological data in the NTDB leads to valid frequency distributions, preserves associations with mortality, and does not compromise risk adjustment in inter-hospital comparisons of mortality.
Tang, Yongqiang
2017-12-01
Control-based pattern mixture models (PMM) and delta-adjusted PMMs are commonly used as sensitivity analyses in clinical trials with non-ignorable dropout. These PMMs assume that the statistical behavior of outcomes varies by pattern in the experimental arm in the imputation procedure, but the imputed data are typically analyzed by a standard method such as the primary analysis model. In the multiple imputation (MI) inference, Rubin's variance estimator is generally biased when the imputation and analysis models are uncongenial. One objective of the article is to quantify the bias of Rubin's variance estimator in the control-based and delta-adjusted PMMs for longitudinal continuous outcomes. These PMMs assume the same observed data distribution as the mixed effects model for repeated measures (MMRM). We derive analytic expressions for the MI treatment effect estimator and the associated Rubin's variance in these PMMs and MMRM as functions of the maximum likelihood estimator from the MMRM analysis and the observed proportion of subjects in each dropout pattern when the number of imputations is infinite. The asymptotic bias is generally small or negligible in the delta-adjusted PMM, but can be sizable in the control-based PMM. This indicates that the inference based on Rubin's rule is approximately valid in the delta-adjusted PMM. A simple variance estimator is proposed to ensure asymptotically valid MI inferences in these PMMs, and compared with the bootstrap variance. The proposed method is illustrated by the analysis of an antidepressant trial, and its performance is further evaluated via a simulation study. © 2017, The International Biometric Society.
Jiao, S; Tiezzi, F; Huang, Y; Gray, K A; Maltecca, C
2016-02-01
Obtaining accurate individual feed intake records is the key first step in achieving genetic progress toward more efficient nutrient utilization in pigs. Feed intake records collected by electronic feeding systems contain errors (erroneous and abnormal values exceeding certain cutoff criteria), which are due to feeder malfunction or animal-feeder interaction. In this study, we examined the use of a novel data-editing strategy involving multiple imputation to minimize the impact of errors and missing values on the quality of feed intake data collected by an electronic feeding system. Accuracy of feed intake data adjustment obtained from the conventional linear mixed model (LMM) approach was compared with 2 alternative implementations of multiple imputation by chained equation, denoted as MI (multiple imputation) and MICE (multiple imputation by chained equation). The 3 methods were compared under 3 scenarios, where 5, 10, and 20% feed intake error rates were simulated. Each of the scenarios was replicated 5 times. Accuracy of the alternative error adjustment was measured as the correlation between the true daily feed intake (DFI; daily feed intake in the testing period) or true ADFI (the mean DFI across testing period) and the adjusted DFI or adjusted ADFI. In the editing process, error cutoff criteria are used to define if a feed intake visit contains errors. To investigate the possibility that the error cutoff criteria may affect any of the 3 methods, the simulation was repeated with 2 alternative error cutoff values. Multiple imputation methods outperformed the LMM approach in all scenarios with mean accuracies of 96.7, 93.5, and 90.2% obtained with MI and 96.8, 94.4, and 90.1% obtained with MICE compared with 91.0, 82.6, and 68.7% using LMM for DFI. Similar results were obtained for ADFI. Furthermore, multiple imputation methods consistently performed better than LMM regardless of the cutoff criteria applied to define errors. 
In conclusion, multiple imputation is proposed as a more accurate and flexible method for error adjustments in feed intake data collected by electronic feeders.
Voillet, Valentin; Besse, Philippe; Liaubet, Laurence; San Cristobal, Magali; González, Ignacio
2016-10-03
In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose a multiple imputation (MI) approach in a multivariate framework. In this study, we focus on multiple factor analysis (MFA) as a tool to compare and integrate multiple layers of information. MI involves filling the missing rows with plausible values, resulting in M completed datasets. MFA is then applied to each completed dataset to produce M different configurations (the matrices of coordinates of individuals). Finally, the M configurations are combined to yield a single consensus solution. We assessed the performance of our method, named MI-MFA, on two real omics datasets. Incomplete artificial datasets with different patterns of missingness were created from these data. The MI-MFA results were compared with two other approaches i.e., regularized iterative MFA (RI-MFA) and mean variable imputation (MVI-MFA). For each configuration resulting from these three strategies, the suitability of the solution was determined against the true MFA configuration obtained from the original data and a comprehensive graphical comparison showing how the MI-, RI- or MVI-MFA configurations diverge from the true configuration was produced. Two approaches i.e., confidence ellipses and convex hulls, to visualize and assess the uncertainty due to missing values were also described. We showed how the areas of ellipses and convex hulls increased with the number of missing individuals. A free and easy-to-use code was proposed to implement the MI-MFA method in the R statistical environment. We believe that MI-MFA provides a useful and attractive method for estimating the coordinates of individuals on the first MFA components despite missing rows. 
MI-MFA configurations were close to the true configuration even when many individuals were missing in several data tables. This method takes into account the uncertainty of MI-MFA configurations induced by the missing rows, thereby allowing the reliability of the results to be evaluated.
A comparison of multiple imputation methods for incomplete longitudinal binary data.
Yamaguchi, Yusuke; Misumi, Toshihiro; Maruo, Kazushi
2018-01-01
Longitudinal binary data are commonly encountered in clinical trials. Multiple imputation is an approach for obtaining valid estimates of treatment effects under the assumption of a missing at random mechanism. Although there are a variety of multiple imputation methods for longitudinal binary data, few studies have reported on the relative performance of the methods. Moreover, when focusing on the treatment effect throughout a period, an endpoint often used in clinical evaluations of specific disease areas, no definitive investigations comparing the methods have been available. We conducted an extensive simulation study to examine the comparative performance of six multiple imputation methods available in the SAS MI procedure for longitudinal binary data, where two endpoints of responder rates, at a specified time point and throughout a period, were assessed. The simulation study suggested that results from naive approaches, a single imputation with non-responders and a complete case analysis, could be very sensitive to missing data. The multiple imputation methods using a monotone method and a full conditional specification with a logistic regression imputation model were recommended for obtaining unbiased and robust estimates of the treatment effect. The methods were illustrated with data from a mental health study.
Karakaya, Jale; Karabulut, Erdem; Yucel, Recai M.
2015-01-01
Modern statistical methods using incomplete data have been increasingly applied in a wide variety of substantive problems. Similarly, receiver operating characteristic (ROC) analysis, a method used in evaluating diagnostic tests or biomarkers in medical research, has also become increasingly popular in both its development and application. While missing-data methods have been applied in ROC analysis, the impact of model mis-specification and/or assumptions (e.g. missing at random) underlying the missing data has not been thoroughly studied. In this work, we study the performance of multiple imputation (MI) inference in ROC analysis. In particular, we investigate parametric and non-parametric techniques for MI inference under common missingness mechanisms. Depending on the coherency of the imputation model with the underlying data generation mechanism, our results show that MI generally leads to well-calibrated inferences under ignorable missingness mechanisms. PMID:26379316
Multiple imputation in the presence of non-normal data.
Lee, Katherine J; Carlin, John B
2017-02-20
Multiple imputation (MI) is becoming increasingly popular for handling missing data. Standard approaches for MI assume normality for continuous variables (conditionally on the other variables in the imputation model). However, it is unclear how to impute non-normally distributed continuous variables. Using simulation and a case study, we compared various transformations applied prior to imputation, including a novel non-parametric transformation, to imputation on the raw scale and using predictive mean matching (PMM) when imputing non-normal data. We generated data from a range of non-normal distributions, and set 50% to missing completely at random or missing at random. We then imputed missing values on the raw scale, following a zero-skewness log, Box-Cox or non-parametric transformation and using PMM with both type 1 and 2 matching. We compared inferences regarding the marginal mean of the incomplete variable and the association with a fully observed outcome. We also compared results from these approaches in the analysis of depression and anxiety symptoms in parents of very preterm compared with term-born infants. The results provide novel empirical evidence that the decision regarding how to impute a non-normal variable should be based on the nature of the relationship between the variables of interest. If the relationship is linear in the untransformed scale, transformation can introduce bias irrespective of the transformation used. However, if the relationship is non-linear, it may be important to transform the variable to accurately capture this relationship. A useful alternative is to impute the variable using PMM with type 1 matching. Copyright © 2016 John Wiley & Sons, Ltd.
Welch, Catherine A; Petersen, Irene; Bartlett, Jonathan W; White, Ian R; Marston, Louise; Morris, Richard W; Nazareth, Irwin; Walters, Kate; Carpenter, James
2014-01-01
Most implementations of multiple imputation (MI) of missing data are designed for simple rectangular data structures ignoring temporal ordering of data. Therefore, when applying MI to longitudinal data with intermittent patterns of missing data, some alternative strategies must be considered. One approach is to divide data into time blocks and implement MI independently at each block. An alternative approach is to include all time blocks in the same MI model. With increasing numbers of time blocks, this approach is likely to break down because of co-linearity and over-fitting. The new two-fold fully conditional specification (FCS) MI algorithm addresses these issues by conditioning only on measurements that are local in time. We describe and report the results of a novel simulation study to critically evaluate the two-fold FCS algorithm and its suitability for imputation of longitudinal electronic health records. After generating a full data set, approximately 70% of selected continuous and categorical variables were made missing completely at random in each of ten time blocks. Subsequently, we applied a simple time-to-event model. We compared the efficiency of estimated coefficients from a complete records analysis, MI of data in the baseline time block and the two-fold FCS algorithm. The results show that the two-fold FCS algorithm maximises the use of data available, with the gain relative to baseline MI depending on the strength of correlations within and between variables. Using this approach also increases plausibility of the missing at random assumption by using repeated measures over time of variables whose baseline values may be missing. PMID:24782349
Zhang, Zhaoyang; Fang, Hua; Wang, Honggang
2016-06-01
Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Building on our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we proposed a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, and more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
Zhang, Zhaoyang; Wang, Honggang
2016-01-01
Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Building on our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we proposed a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, and more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services. PMID:27126063
Mukaka, Mavuto; White, Sarah A; Terlouw, Dianne J; Mwapasa, Victor; Kalilani-Phiri, Linda; Faragher, E Brian
2016-07-22
Missing outcomes can seriously impair the ability to make correct inferences from randomized controlled trials (RCTs). Complete case (CC) analysis is commonly used, but it reduces sample size and is perceived to lead to reduced statistical efficiency of estimates while increasing the potential for bias. As multiple imputation (MI) methods preserve sample size, they are generally viewed as the preferred analytical approach. We examined this assumption, comparing the performance of CC and MI methods to determine risk difference (RD) estimates in the presence of missing binary outcomes. We conducted simulation studies of 5000 simulated data sets with 50 imputations of RCTs with one primary follow-up endpoint at different underlying levels of RD (3-25 %) and missing outcomes (5-30 %). For missing at random (MAR) or missing completely at random (MCAR) outcomes, CC method estimates generally remained unbiased and achieved precision similar to or better than MI methods, and high statistical coverage. Missing not at random (MNAR) scenarios yielded invalid inferences with both methods. Effect size estimate bias was reduced in MI methods by always including group membership even if this was unrelated to missingness. Surprisingly, under MAR and MCAR conditions in the assessed scenarios, MI offered no statistical advantage over CC methods. While MI must inherently accompany CC methods for intention-to-treat analyses, these findings endorse CC methods for per protocol risk difference analyses in these conditions. These findings provide an argument for the use of the CC approach to always complement MI analyses, with the usual caveat that the validity of the mechanism for missingness be thoroughly discussed. More importantly, researchers should strive to collect as much data as possible.
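As an illustration of how risk difference estimates from several imputed data sets are typically combined, the following is a minimal sketch of pooling with Rubin's rules. The abstract does not state which pooling method was used, so Rubin's rules are an assumption here, and the function names `risk_difference` and `pool_rubin` are hypothetical.

```python
from statistics import mean, variance


def risk_difference(events_t, n_t, events_c, n_c):
    """Return (RD, variance of RD) for two independent binomial arms."""
    p_t = events_t / n_t
    p_c = events_c / n_c
    rd = p_t - p_c
    var = p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c
    return rd, var


def pool_rubin(estimates, variances):
    """Pool M per-imputation estimates with Rubin's rules.

    Returns (pooled estimate, total variance), where the total variance
    is W + (1 + 1/M) * B, with W the mean within-imputation variance and
    B the between-imputation sample variance (divisor M - 1).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)
    within = mean(variances)
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return q_bar, total
```

In use, `risk_difference` would be applied to each completed data set and the resulting lists of estimates and variances passed to `pool_rubin`; a complete case analysis would call `risk_difference` once on the observed records only.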
Using full-cohort data in nested case-control and case-cohort studies by multiple imputation.
Keogh, Ruth H; White, Ian R
2013-10-15
In many large prospective cohorts, expensive exposure measurements cannot be obtained for all individuals. Exposure-disease association studies are therefore often based on nested case-control or case-cohort studies in which complete information is obtained only for sampled individuals. However, in the full cohort, there may be a large amount of information on cheaply available covariates and possibly a surrogate of the main exposure(s), which typically goes unused. We view the nested case-control or case-cohort study plus the remainder of the cohort as a full-cohort study with missing data. Hence, we propose using multiple imputation (MI) to utilise information in the full cohort when data from the sub-studies are analysed. We use the fully observed data to fit the imputation models. We consider using approximate imputation models and also using rejection sampling to draw imputed values from the true distribution of the missing values given the observed data. Simulation studies show that using MI to utilise full-cohort information in the analysis of nested case-control and case-cohort studies can result in important gains in efficiency, particularly when a surrogate of the main exposure is available in the full cohort. In simulations, this method outperforms counter-matching in nested case-control studies and a weighted analysis for case-cohort studies, both of which use some full-cohort information. Approximate imputation models perform well except when there are interactions or non-linear terms in the outcome model, where imputation using rejection sampling works well. Copyright © 2013 John Wiley & Sons, Ltd.
Tian, Ting; McLachlan, Geoffrey J.; Dieters, Mark J.; Basford, Kaye E.
2015-01-01
It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways: multiple agglomerative hierarchical clustering, normal distribution model, normal regression model, and predictive mean matching. The latter three models used both Bayesian analysis and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing, and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis had slightly higher estimation accuracy than those using non-Bayesian analysis, but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the best overall performance. PMID:26689369
Capers, Patrice L.; Brown, Andrew W.; Dawson, John A.; Allison, David B.
2015-01-01
Background: Meta-research can involve manual retrieval and evaluation of research, which is resource intensive. Creation of high throughput methods (e.g., search heuristics, crowdsourcing) has improved feasibility of large meta-research questions, but possibly at the cost of accuracy. Objective: To evaluate the use of double sampling combined with multiple imputation (DS + MI) to address meta-research questions, using as an example adherence of PubMed entries to two simple consolidated standards of reporting trials guidelines for titles and abstracts. Methods: For the DS large sample, we retrieved all PubMed entries satisfying the filters: RCT, human, abstract available, and English language (n = 322,107). For the DS subsample, we randomly sampled 500 entries from the large sample. The large sample was evaluated with a lower rigor, higher throughput (RLOTHI) method using search heuristics, while the subsample was evaluated using a higher rigor, lower throughput (RHITLO) human rating method. Multiple imputation of the missing-completely-at-random RHITLO data for the large sample was informed by: RHITLO data from the subsample; RLOTHI data from the large sample; whether a study was an RCT; and country and year of publication. Results: The RHITLO and RLOTHI methods in the subsample largely agreed (phi coefficients: title = 1.00, abstract = 0.92). Compliance with abstract and title criteria has increased over time, with non-US countries improving more rapidly. DS + MI logistic regression estimates were more precise than subsample estimates (e.g., 95% CI for change in title and abstract compliance by year: subsample RHITLO 1.050–1.174 vs. DS + MI 1.082–1.151). As evidence of improved accuracy, DS + MI coefficient estimates were closer to RHITLO than the large sample RLOTHI. Conclusion: Our results support our hypothesis that DS + MI would result in improved precision and accuracy. 
This method is flexible and may provide a practical way to examine large corpora of literature. PMID:25988135
De Silva, Anurika Priyanjali; Moreno-Betancur, Margarita; De Livera, Alysha Madhu; Lee, Katherine Jane; Simpson, Julie Anne
2017-07-25
Missing data is a common problem in epidemiological studies, and is particularly prominent in longitudinal data, which involve multiple waves of data collection. Traditional multiple imputation (MI) methods (fully conditional specification (FCS) and multivariate normal imputation (MVNI)) treat repeated measurements of the same time-dependent variable as just another 'distinct' variable for imputation and therefore do not make the most of the longitudinal structure of the data. Only a few studies have explored extensions to the standard approaches to account for the temporal structure of longitudinal data. One suggestion is the two-fold fully conditional specification (two-fold FCS) algorithm, which restricts the imputation of a time-dependent variable to time blocks where the imputation model includes measurements taken at the specified and adjacent times. To date, no study has investigated the performance of two-fold FCS and standard MI methods for handling missing data in a time-varying covariate with a non-linear trajectory over time, a commonly encountered scenario in epidemiological studies. We simulated 1000 datasets of 5000 individuals based on the Longitudinal Study of Australian Children (LSAC). Three missing data mechanisms, missing completely at random (MCAR) and weak and strong missing at random (MAR) scenarios, were used to impose missingness on body mass index (BMI) for age z-scores, a continuous time-varying exposure variable with a non-linear trajectory over time. We evaluated the performance of FCS, MVNI, and two-fold FCS for handling up to 50% of missing data when assessing the association between childhood obesity and sleep problems. The standard two-fold FCS produced slightly more biased and less precise estimates than FCS and MVNI. We observed slight improvements in bias and precision when using a time window width of two for the two-fold FCS algorithm compared to the standard width of one. 
We recommend the use of FCS or MVNI in a similar longitudinal setting, and when encountering convergence issues due to a large number of time points or variables with missing values, the two-fold FCS with exploration of a suitable time window.
Descalzo, Miguel Á; Garcia, Virginia Villaverde; González-Alvaro, Isidoro; Carbonell, Jordi; Balsa, Alejandro; Sanmartí, Raimon; Lisbona, Pilar; Hernandez-Barrera, Valentín; Jiménez-Garcia, Rodrigo; Carmona, Loreto
2013-02-01
To describe the results of different statistical ways of addressing radiographic outcome affected by missing data--multiple imputation technique, inverse probability weights and complete case analysis--using data from an observational study. A random sample of 96 RA patients was selected for a follow-up study in which radiographs of hands and feet were scored. Radiographic progression was tested by comparing the change in the total Sharp-van der Heijde radiographic score (TSS) and the joint erosion score (JES) from baseline to the end of the second year of follow-up. MI technique, inverse probability weights in weighted estimating equation (WEE) and CC analysis were used to fit a negative binomial regression. Major predictors of radiographic progression were JES and joint space narrowing (JSN) at baseline, together with baseline disease activity measured by DAS28 for TSS and MTX use for JES. Results from CC analysis show larger coefficients and s.e.s compared with MI and weighted techniques. The results from the WEE model were quite in line with those of MI. If it seems plausible that CC or MI analysis may be valid, then MI should be preferred because of its greater efficiency. CC analysis resulted in inefficient estimates or, translated into non-statistical terminology, could guide us into inaccurate results and unwise conclusions. The methods discussed here will contribute to the use of alternative approaches for tackling missing data in observational studies.
Methods for Mediation Analysis with Missing Data
ERIC Educational Resources Information Center
Zhang, Zhiyong; Wang, Lijuan
2013-01-01
Despite wide applications of both mediation models and missing data techniques, formal discussion of mediation analysis with missing data is still rare. We introduce and compare four approaches to dealing with missing data in mediation analysis including listwise deletion, pairwise deletion, multiple imputation (MI), and a two-stage maximum…
NASA Astrophysics Data System (ADS)
Hasan, Haliza; Ahmad, Sanizah; Osman, Balkish Mohd; Sapri, Shamsiah; Othman, Nadirah
2017-08-01
In regression analysis, missing covariate data has been a common problem. Many researchers use ad hoc methods to overcome this problem due to the ease of implementation. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation maximization (EM) algorithm and Multiple Imputation (MI) are more promising when dealing with difficulties caused by missing data. Then again, inappropriate methods of missing value imputation can lead to serious bias that severely affects the parameter estimates. The main objective of this study is to provide a better understanding of missing data concepts that can assist the researcher in selecting appropriate missing data imputation methods. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated using an underlying multivariate normal distribution and the dependent variable was generated as a combination of explanatory variables. Missing values in the covariates were simulated using a missing at random (MAR) mechanism. Four levels of missingness (10%, 20%, 30% and 40%) were imposed. ML and MI techniques available within SAS software were investigated. A linear regression model was fitted and the model performance measures, MSE and R-squared, were obtained. Results of the analysis showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE when the percentage of missingness is less than 30%. Neither method is able to handle missingness levels above 30%.
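To make the MAR mechanism concrete, the following is a minimal sketch of imposing missingness on one covariate with a probability that depends only on a second, fully observed covariate. The abstract does not specify the missingness model, so the logistic form with unit slope, the bisection used to hit the target rate, and the name `impose_mar` are all assumptions for illustration.

```python
import math
import random


def impose_mar(target, driver, rate, rng):
    """Return a copy of `target` with entries set to None under MAR.

    The probability that target[i] is missing is a logistic function of
    the standardised, fully observed driver[i]; it never depends on the
    value of target[i] itself. The intercept is tuned by bisection so
    that the average missingness probability is close to `rate`.
    """
    n = len(driver)
    mu = sum(driver) / n
    sd = math.sqrt(sum((d - mu) ** 2 for d in driver) / n)
    if sd == 0:
        raise ValueError("driver must not be constant")
    z = [(d - mu) / sd for d in driver]

    def mean_prob(intercept):
        return sum(1 / (1 + math.exp(-(intercept + zi))) for zi in z) / n

    lo, hi = -20.0, 20.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if mean_prob(mid) < rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2

    return [
        None if rng.random() < 1 / (1 + math.exp(-(intercept + zi))) else t
        for t, zi in zip(target, z)
    ]
```

Calling the function with `rate` set to 0.1, 0.2, 0.3 and 0.4 would mirror the four missingness levels described above; passing a seeded `random.Random` instance keeps a simulation reproducible.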
Normal Theory Two-Stage ML Estimator When Data Are Missing at the Item Level
Savalei, Victoria; Rhemtulla, Mijke
2017-01-01
In many modeling contexts, the variables in the model are linear composites of the raw items measured for each participant; for instance, regression and path analysis models rely on scale scores, and structural equation models often use parcels as indicators of latent constructs. Currently, no analytic estimation method exists to appropriately handle missing data at the item level. Item-level multiple imputation (MI), however, can handle such missing data straightforwardly. In this article, we develop an analytic approach for dealing with item-level missing data—that is, one that obtains a unique set of parameter estimates directly from the incomplete data set and does not require imputations. The proposed approach is a variant of the two-stage maximum likelihood (TSML) methodology, and it is the analytic equivalent of item-level MI. We compare the new TSML approach to three existing alternatives for handling item-level missing data: scale-level full information maximum likelihood, available-case maximum likelihood, and item-level MI. We find that the TSML approach is the best analytic approach, and its performance is similar to item-level MI. We recommend its implementation in popular software and its further study. PMID:29276371
Reuse of imputed data in microarray analysis increases imputation efficiency
Kim, Ki-Yeol; Kim, Byoung-Jin; Yi, Gwan-Su
2004-01-01
Background: The imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analyses require a complete data set. A few imputation methods for DNA microarray data have been introduced, but the efficiency of the methods was low and the validity of imputed values in these methods had not been fully checked. Results: We developed a new cluster-based imputation method called the sequential K-nearest neighbor (SKNN) method. This imputes the missing values sequentially, starting from the gene having the fewest missing values, and uses the imputed values for the later imputation. Although it uses the imputed values, the efficiency of this new method is greatly improved in its accuracy and computational complexity over the conventional KNN-based method and other methods based on maximum likelihood estimation. SKNN outperformed other imputation methods in particular for data with high missing rates and a large number of experiments. Application of Expectation Maximization (EM) to the SKNN method improved the accuracy, but increased computational time in proportion to the number of iterations. The Multiple Imputation (MI) method, which is well known but had not previously been applied to microarray data, showed accuracy similarly high to that of the SKNN method, with slightly higher dependency on the types of data sets. Conclusions: Sequential reuse of imputed data in KNN-based imputation greatly increases the efficiency of imputation. The SKNN method should be practically useful for salvaging the data of microarray experiments that have large numbers of missing entries. The SKNN method generates reliable imputed values which can be used for further cluster-based analysis of microarray data. PMID:15504240
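The sequential reuse described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes Euclidean distance over the target gene's observed columns, an unweighted mean of the k nearest neighbours, and at least one gene with no missing values to seed the reference pool. The name `sknn_impute` is hypothetical.

```python
import math


def sknn_impute(rows, k=2):
    """Sequential KNN imputation sketch; rows are genes, None marks missing.

    Genes are imputed in order of increasing number of missing values.
    Each completed gene joins the reference pool, so its imputed values
    can be reused when imputing later genes.
    """
    filled = [list(r) for r in rows]  # work on a copy
    pool = [i for i, r in enumerate(filled) if None not in r]
    if not pool:
        raise ValueError("at least one complete row is required")
    todo = sorted(
        (i for i, r in enumerate(filled) if None in r),
        key=lambda i: filled[i].count(None),
    )
    for i in todo:
        row = filled[i]
        observed = [j for j, v in enumerate(row) if v is not None]
        missing = [j for j, v in enumerate(row) if v is None]

        def dist(p):
            # Distance uses only the columns observed in the target row.
            return math.sqrt(sum((row[j] - filled[p][j]) ** 2 for j in observed))

        nearest = sorted(pool, key=dist)[:k]
        for j in missing:
            row[j] = sum(filled[p][j] for p in nearest) / len(nearest)
        pool.append(i)  # imputed gene becomes a candidate neighbour
    return filled
```

A conventional KNN imputation would keep `pool` fixed at the initially complete genes; the single `pool.append(i)` line is what makes this sketch sequential.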
ERIC Educational Resources Information Center
Li, Tiandong
2012-01-01
In large-scale assessments, such as the National Assessment of Educational Progress (NAEP), plausible values based on Multiple Imputations (MI) have been used to estimate population characteristics for latent constructs under complex sample designs. Mislevy (1991) derived a closed-form analytic solution for a fixed-effect model in creating…
ERIC Educational Resources Information Center
Wolgast, Anett; Schwinger, Malte; Hahnel, Carolin; Stiensmeier-Pelster, Joachim
2017-01-01
Introduction: Multiple imputation (MI) is one of the most highly recommended methods for replacing missing values in research data. The scope of this paper is to demonstrate missing data handling in SEM by analyzing two modified data examples from educational psychology, and to give practical recommendations for applied researchers. Method: We…
Hamel, J F; Sebille, V; Le Neel, T; Kubis, G; Boyer, F C; Hardouin, J B
2017-12-01
Subjective health measurements using Patient Reported Outcomes (PRO) are increasingly used in randomized trials, particularly for patient group comparisons. Two main types of analytical strategies can be used for such data: Classical Test Theory (CTT) and Item Response Theory (IRT) models. These two strategies display very similar characteristics when data are complete, but in the common case when data are missing, whether IRT or CTT would be the most appropriate remains unknown and was investigated using simulations. We simulated PRO data such as quality of life data. Missing responses to items were simulated as being completely random, depending on an observable covariate or on an unobserved latent trait. The considered CTT-based methods allowed comparing scores using complete-case analysis, personal mean imputations or multiple-imputations based on a two-way procedure. The IRT-based method was the Wald test on a Rasch model including a group covariate. The IRT-based method and the multiple-imputations-based method for CTT displayed the highest observed power and were the only unbiased methods whatever the kind of missing data. Online software and Stata® modules compatible with the built-in mi impute suite are provided for performing such analyses. Traditional procedures (listwise deletion and personal mean imputations) should be avoided, due to inevitable problems of bias and lack of power.
Belger, Mark; Haro, Josep Maria; Reed, Catherine; Happich, Michael; Kahle-Wrobleski, Kristin; Argimon, Josep Maria; Bruno, Giuseppe; Dodel, Richard; Jones, Roy W; Vellas, Bruno; Wimo, Anders
2016-07-18
Missing data are a common problem in prospective studies with a long follow-up, and the volume, pattern and reasons for missing data may be relevant when estimating the cost of illness. We aimed to evaluate the effects of different methods for dealing with missing longitudinal cost data and for costing caregiver time on total societal costs in Alzheimer's disease (AD). GERAS is an 18-month observational study of costs associated with AD. Total societal costs included patient health and social care costs, and caregiver health and informal care costs. Missing data were classified as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). Simulation datasets were generated from baseline data with 10-40 % missing total cost data for each missing data mechanism. Datasets were also simulated to reflect the missing cost data pattern at 18 months using MAR and MNAR assumptions. Naïve and multiple imputation (MI) methods were applied to each dataset and results compared with complete GERAS 18-month cost data. Opportunity and replacement cost approaches were used for caregiver time, which was costed with and without supervision included and with time for working caregivers only being costed. Total costs were available for 99.4 % of 1497 patients at baseline. For MCAR datasets, naïve methods performed as well as MI methods. For MAR, MI methods performed better than naïve methods. All imputation approaches were poor for MNAR data. For all approaches, percentage bias increased with missing data volume. For datasets reflecting 18-month patterns, a combination of imputation methods provided more accurate cost estimates (e.g. bias: -1 % vs -6 % for single MI method), although different approaches to costing caregiver time had a greater impact on estimated costs (29-43 % increase over base case estimate). 
Methods used to impute missing cost data in AD will affect the accuracy of cost estimates, although varying the approach to costing informal caregiver time has the greatest impact on total costs. Tailoring imputation methods to the reason for missing data will further our understanding of the best analytical approach for studies involving cost outcomes.
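The three missing data mechanisms named in this abstract can be illustrated with a small simulation. The following is only a minimal sketch, not the GERAS procedure: the function name `simulate_missingness`, the `severity` covariate and the linear scaling of missingness probabilities are all assumptions made for illustration. Under MCAR every value has the same chance of being missing, under MAR the chance depends on an observed covariate, and under MNAR it depends on the (unobserved) cost itself.

```python
import random


def simulate_missingness(costs, severity, mechanism, rate=0.3, seed=0):
    """Return a copy of `costs` with some entries replaced by None.

    mechanism is "MCAR", "MAR" or "MNAR" (hypothetical helper, illustration only).
    `rate` is the target average missingness and must not exceed 0.5 so that
    every per-record probability stays within [0, 1].
    """
    if not 0 <= rate <= 0.5:
        raise ValueError("rate must lie in [0, 0.5]")
    rng = random.Random(seed)

    def scaled(values):
        # Rescale to [0, 1]; constant inputs map to 0.5.
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]

    if mechanism == "MCAR":
        # Same probability for everyone, unrelated to any data.
        probs = [rate] * len(costs)
    elif mechanism == "MAR":
        # Probability depends only on an observed covariate.
        probs = [2 * rate * s for s in scaled(severity)]
    elif mechanism == "MNAR":
        # Probability depends on the value that goes missing.
        probs = [2 * rate * s for s in scaled(costs)]
    else:
        raise ValueError("mechanism must be MCAR, MAR or MNAR")
    return [None if rng.random() < p else c for c, p in zip(costs, probs)]
```

With this construction the MNAR case removes high costs preferentially, so the mean of the remaining observed costs understates the true mean, which is the kind of bias that imputation based on observed data alone cannot repair.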
Seaman, Shaun R; Hughes, Rachael A
2018-06-01
Estimating the parameters of a regression model of interest is complicated by missing data on the variables in that model. Multiple imputation is commonly used to handle these missing data. Joint model multiple imputation and full-conditional specification multiple imputation are known to yield imputed data with the same asymptotic distribution when the conditional models of full-conditional specification are compatible with that joint model. We show that this asymptotic equivalence of imputation distributions does not imply that joint model multiple imputation and full-conditional specification multiple imputation will also yield asymptotically equally efficient inference about the parameters of the model of interest, nor that they will be equally robust to misspecification of the joint model. When the conditional models used by full-conditional specification multiple imputation are linear, logistic and multinomial regressions, these are compatible with a restricted general location joint model. We show that multiple imputation using the restricted general location joint model can be substantially more asymptotically efficient than full-conditional specification multiple imputation, but this typically requires very strong associations between variables. When associations are weaker, the efficiency gain is small. Moreover, when there is substantial missingness in the outcome variable, full-conditional specification multiple imputation is shown to be potentially much more robust to misspecification of the restricted general location model than joint model multiple imputation using that model.
Kupek, Emil; de Assis, Maria Alice A
2016-09-01
External validation of food recall over 24 h in schoolchildren is often restricted to eating events in schools and is based on direct observation as the reference method. The aim of this study was to estimate the dietary intake out of school, and consequently the bias in such research design based on only part-time validated food recall, using multiple imputation (MI) conditioned on the information on child age, sex, BMI, family income, parental education and the school attended. The previous-day, web-based questionnaire WebCAAFE, structured as six meals/snacks and thirty-two foods/beverage, was answered by a sample of 7-11-year-old Brazilian schoolchildren (n 602) from five public schools. Food/beverage intake recalled by children was compared with the records provided by trained observers during school meals. Sensitivity analysis was performed with artificial data emulating those recalled by children on WebCAAFE in order to evaluate the impact of both differential and non-differential bias. Estimated bias was within ±30 % interval for 84·4 % of the thirty-two foods/beverages evaluated in WebCAAFE, and half of the latter reached statistical significance (P<0·05). Rarely (<3 %) consumed dietary items were often under-reported (fish/seafood, vegetable soup, cheese bread, French fries), whereas some of those most frequently reported (meat, bread/biscuits, fruits) showed large overestimation. Compared with the analysis restricted to fully validated data, MI reduced differential bias in sensitivity analysis but the bias still remained large in most cases. MI provided a suitable statistical framework for part-time validation design of dietary intake over six daily eating events.
The multiple imputation method: a case study involving secondary data analysis.
Walani, Salimah R; Cleland, Charles M
2015-05-01
This paper illustrates, with the example of a secondary data analysis study, the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. Data came from the 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiply imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiply imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to apply this technique to large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
Should multiple imputation be the method of choice for handling missing data in randomized trials?
Sullivan, Thomas R; White, Ian R; Salter, Amy B; Ryan, Philip; Lee, Katherine J
2016-01-01
The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see multiple imputation used to handle missing data. However in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation is carried out separately by randomized group. PMID:28034175
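The recommendation to impute separately by randomized group can be sketched as follows. This is a hedged illustration, not the authors' simulation code: the function name `impute_by_arm` is hypothetical, the imputation model is swapped for a simple approximate Bayesian bootstrap (a hot-deck scheme that resamples observed outcomes), and no covariates are used. The only point it demonstrates is that donors for a missing outcome are drawn exclusively from the same arm.

```python
import random


def impute_by_arm(outcomes, arms, m=5, seed=0):
    """Create m completed copies of `outcomes`, imputing within each arm.

    Missing outcomes are marked as None. For every arm and every imputation,
    a donor pool is first resampled with replacement from that arm's observed
    outcomes (approximate Bayesian bootstrap), and each missing value is then
    drawn from that pool. Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    completed_sets = []
    for _ in range(m):
        completed = list(outcomes)
        for arm in sorted(set(arms)):
            idx = [i for i, a in enumerate(arms) if a == arm]
            observed = [outcomes[i] for i in idx if outcomes[i] is not None]
            if not observed:
                raise ValueError(f"arm {arm!r} has no observed outcomes")
            # Resampling the donor pool propagates uncertainty about the
            # arm-specific outcome distribution into the imputations.
            donors = rng.choices(observed, k=len(observed))
            for i in idx:
                if outcomes[i] is None:
                    completed[i] = rng.choice(donors)
        completed_sets.append(completed)
    return completed_sets
```

Because each arm is handled on its own, any treatment-by-covariate interaction present in the data is not averaged away by the imputation step, which is the property the abstract relies on when the analysis model overlooks such an interaction.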
NASA Astrophysics Data System (ADS)
Xiao, Q.; Liu, Y.
2017-12-01
Satellite aerosol optical depth (AOD) has been used to assess fine particulate matter (PM2.5) pollution worldwide. However, non-random missing AOD due to cloud cover or high surface reflectance can cause up to 80% data loss and bias model-estimated spatial and temporal trends of PM2.5. Previous studies filled the data gap largely by spatial smoothing, which ignored the impact of cloud cover and meteorology on aerosol loadings and has been shown to exhibit poor performance when monitoring stations are sparse or when there is seasonal large-scale missingness. Using the Yangtze River Delta of China as an example, we present a flexible Multiple Imputation (MI) method that combines cloud fraction, elevation, humidity, temperature, and spatiotemporal trends to impute the missing AOD. A two-stage statistical model driven by gap-filled AOD, meteorology and land use information was then fitted to estimate daily ground PM2.5 concentrations in 2013 and 2014 at 1 km resolution with complete coverage in space and time. The daily MI models have an average R2 of 0.77, with an inter-quartile range of 0.71 to 0.82 across days. The overall model 10-fold cross-validation R2 values were 0.81 and 0.73 for 2013 and 2014, respectively. Predictions with only observational AOD or only imputed AOD showed similar accuracy. This method provides reliable PM2.5 predictions with complete coverage at high resolution. By including all the pixels of all days in model development, this method corrected the sampling bias in satellite-driven air pollution modelling due to non-random missingness in AOD. Compared with previously reported gap-filling methods, the MI method has the strength of not relying on ground PM2.5 measurements and therefore allows the prediction of historical PM2.5 levels prior to the establishment of regular ground monitoring networks.
Multiple hot-deck imputation for network inference from RNA sequencing data.
Imbert, Alyssa; Valsesia, Armand; Le Gall, Caroline; Armenise, Claudia; Lefebvre, Gregory; Gourraud, Pierre-Antoine; Viguerie, Nathalie; Villa-Vialaneix, Nathalie
2018-05-15
Network inference provides a global view of the relations existing between gene expression in a given transcriptomic experiment (often only for a restricted list of chosen genes). However, it is still a challenging problem: even if the cost of sequencing techniques has decreased over the last years, the number of samples in a given experiment is still (very) small compared to the number of genes. We propose a method to increase the reliability of the inference when RNA-seq expression data have been measured together with an auxiliary dataset that can provide external information on gene expression similarity between samples. Our statistical approach, hd-MI, is based on imputation for samples without available RNA-seq data that are considered as missing data but are observed on the secondary dataset. hd-MI can improve the reliability of the inference for missing rates up to 30% and provides more stable networks with a smaller number of false positive edges. From a biological point of view, hd-MI was also found relevant to infer networks from RNA-seq data acquired in adipose tissue during a nutritional intervention in obese individuals. In these networks, novel links between genes were highlighted, as well as an improved comparability between the two steps of the nutritional intervention. Software and sample data are available as an R package, RNAseqNet, that can be downloaded from the Comprehensive R Archive Network (CRAN). Contact: alyssa.imbert@inra.fr or nathalie.villa-vialaneix@inra.fr. Supplementary data are available at Bioinformatics online.
Karasek, R A; Theorell, T; Schwartz, J E; Schnall, P L; Pieper, C F; Michela, J L
1988-08-01
Associations between psychosocial job characteristics and past myocardial infarction (MI) prevalence for employed males were tested with the Health Examination Survey (HES) 1960-61, N = 2,409, and the Health and Nutrition Examination Survey (HANES) 1971-75, N = 2,424. A new estimation method is used which imputes to census occupation codes, job characteristic information from national surveys of job characteristics (US Department of Labor, Quality of Employment Surveys). Controlling for age, we find that employed males with jobs which are simultaneously low in decision latitude and high in psychological work load (a multiplicative product term isolating 20 per cent of the population) have a higher prevalence of myocardial infarction in both data bases. In a logistic regression analysis, using job measures adjusted for demographic factors and controlling for age, race, education, systolic blood pressure, serum cholesterol, smoking (HANES only), and physical exertion, we find a low decision latitude/high psychological demand multiplicative product term associated with MI in both data bases. Additional multiple logistic regressions show that low decision latitude is associated with increased prevalence of MI in both the HES and the HANES. Psychological workload and physical exertion are significant only in the HANES. PMID:3389427
Multiple imputation for handling missing outcome data when estimating the relative risk.
Sullivan, Thomas R; Lee, Katherine J; Ryan, Philip; Salter, Amy B
2017-09-06
Multiple imputation is a popular approach to handling missing data in medical research, yet little is known about its applicability for estimating the relative risk. Standard methods for imputing incomplete binary outcomes involve logistic regression or an assumption of multivariate normality, whereas relative risks are typically estimated using log binomial models. It is unclear whether misspecification of the imputation model in this setting could lead to biased parameter estimates. Using simulated data, we evaluated the performance of multiple imputation for handling missing data prior to estimating adjusted relative risks from a correctly specified multivariable log binomial model. We considered an arbitrary pattern of missing data in both outcome and exposure variables, with missing data induced under missing at random mechanisms. Focusing on standard model-based methods of multiple imputation, missing data were imputed using multivariate normal imputation or fully conditional specification with a logistic imputation model for the outcome. Multivariate normal imputation performed poorly in the simulation study, consistently producing estimates of the relative risk that were biased towards the null. Despite outperforming multivariate normal imputation, fully conditional specification also produced somewhat biased estimates, with greater bias observed for higher outcome prevalences and larger relative risks. Deleting imputed outcomes from analysis datasets did not improve the performance of fully conditional specification. Both multivariate normal imputation and fully conditional specification produced biased estimates of the relative risk, presumably since both use a misspecified imputation model. Based on simulation results, we recommend researchers use fully conditional specification rather than multivariate normal imputation and retain imputed outcomes in the analysis when estimating relative risks. 
However, fully conditional specification is not without its shortcomings, and so further research is needed to identify optimal approaches for relative risk estimation within the multiple imputation framework.
Alternative Multiple Imputation Inference for Mean and Covariance Structure Modeling
ERIC Educational Resources Information Center
Lee, Taehun; Cai, Li
2012-01-01
Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure…
Should "Multiple Imputations" Be Treated as "Multiple Indicators"?
ERIC Educational Resources Information Center
Mislevy, Robert J.
1993-01-01
Multiple imputations for latent variables are constructed so that analyses treating them as true variables have the correct expectations for population characteristics. Analyzing multiple imputations in accordance with their construction yields correct estimates of population characteristics, whereas analyzing them as multiple indicators generally…
Seaman, Shaun R; White, Ian R; Carpenter, James R
2015-01-01
Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation. Imputation of partially observed covariates is complicated if the substantive model is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms, and standard software implementations of multiple imputation may impute covariates from models that are incompatible with such substantive models. We show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model. We investigate through simulation the performance of this proposal, and compare it with existing approaches. Simulation results suggest our proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible. Stata software implementing the approach is freely available. PMID:24525487
Suyundikov, Anvar; Stevens, John R.; Corcoran, Christopher; Herrick, Jennifer; Wolff, Roger K.; Slattery, Martha L.
2015-01-01
Missing data can arise in bioinformatics applications for a variety of reasons, and imputation methods are frequently applied to such data. We are motivated by a colorectal cancer study where miRNA expression was measured in paired tumor-normal samples of hundreds of patients, but data for many normal samples were missing due to lack of tissue availability. We compare the precision and power performance of several imputation methods, and draw attention to the statistical dependence induced by K-Nearest Neighbors (KNN) imputation. This imputation-induced dependence has not previously been addressed in the literature. We demonstrate how to account for this dependence, and show through simulation how the choice to ignore or account for this dependence affects both power and type I error rate control. PMID:25849489
Siddique, Juned; Harel, Ofer; Crespi, Catherine M.; Hedeker, Donald
2014-01-01
The true missing data mechanism is never known in practice. We present a method for generating multiple imputations for binary variables that formally incorporates missing data mechanism uncertainty. Imputations are generated from a distribution of imputation models rather than a single model, with the distribution reflecting subjective notions of missing data mechanism uncertainty. Parameter estimates and standard errors are obtained using rules for nested multiple imputation. Using simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal smoking cessation trial where nonignorably missing data were a concern. Our method provides a simple approach for formalizing subjective notions regarding nonresponse and can be implemented using existing imputation software. PMID:24634315
Leyrat, Clémence; Seaman, Shaun R; White, Ian R; Douglas, Ian; Smeeth, Liam; Kim, Joseph; Resche-Rigon, Matthieu; Carpenter, James R; Williamson, Elizabeth J
2017-01-01
Inverse probability of treatment weighting is a popular propensity score-based approach to estimate marginal treatment effects in observational studies at risk of confounding bias. A major issue when estimating the propensity score is the presence of partially observed covariates. Multiple imputation is a natural approach to handle missing data on covariates: covariates are imputed and a propensity score analysis is performed in each imputed dataset to estimate the treatment effect. The treatment effect estimates from each imputed dataset are then combined to obtain an overall estimate. We call this method MIte. However, an alternative approach has been proposed, in which the propensity scores are combined across the imputed datasets (MIps). Therefore, there are remaining uncertainties about how to implement multiple imputation for propensity score analysis: (a) should we apply Rubin's rules to the inverse probability of treatment weighting treatment effect estimates or to the propensity score estimates themselves? (b) does the outcome have to be included in the imputation model? (c) how should we estimate the variance of the inverse probability of treatment weighting estimator after multiple imputation? We studied the consistency and balancing properties of the MIte and MIps estimators and performed a simulation study to empirically assess their performance for the analysis of a binary outcome. We also compared the performance of these methods to complete case analysis and the missingness pattern approach, which uses a different propensity score model for each pattern of missingness, and a third multiple imputation approach in which the propensity score parameters are combined rather than the propensity scores themselves (MIpar). 
Under a missing at random mechanism, complete case and missingness pattern analyses were biased in most cases for estimating the marginal treatment effect, whereas multiple imputation approaches were approximately unbiased as long as the outcome was included in the imputation model. Only MIte was unbiased in all the studied scenarios and Rubin's rules provided good variance estimates for MIte. The propensity score estimated in the MIte approach showed good balancing properties. In conclusion, when using multiple imputation in the inverse probability of treatment weighting context, MIte with the outcome included in the imputation model is the preferred approach.
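The pooling step that the MIte approach relies on, Rubin's rules, can be written out in a few lines. This is a minimal sketch under stated assumptions: the function name `pool_rubin` is hypothetical, the inputs are taken to be one treatment effect estimate and one variance estimate per imputed dataset, and the degrees of freedom use the classical large-sample formula without small-sample adjustment.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules.

    estimates: point estimates, one per imputed dataset (at least two).
    variances: the corresponding squared standard errors.
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    if between == 0:
        df = math.inf              # no between-imputation variability
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return q_bar, total, df
```

For example, `pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` gives a pooled estimate of 2.0 with total variance 0.5 + (4/3) × 1.0. In the MIps and MIpar alternatives described above, the quantities averaged across imputations are the propensity scores or their model parameters rather than the treatment effect estimates.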
Kordas, Katarzyna; Ardoino, Graciela; Coffman, Donna L.; Queirolo, Elena I.; Ciccariello, Daniela; Mañay, Nelly; Ettinger, Adrienne S.
2015-01-01
While it is known that toxic metals contribute individually to child cognitive and behavioral deficits, we still know little about the effects of exposure to multiple metals, particularly when exposures are low. We studied the association between children's blood lead and hair arsenic, cadmium, and manganese and their performance on the Bayley Scales of Infant Development III. Ninety-two preschool children (age 13–42 months) from Montevideo, Uruguay, provided a hair sample and 78 had a blood lead level (BLL) measurement. Using latent class analysis (LCA), we identified four groups of exposure based on metal concentrations: (1) low metals, (2) low-to-moderate metals, (3) high lead and cadmium, and (4) high metals. Using the four-group exposure variable as the main predictor, and fitting raw scores on the cognitive, receptive vocabulary, and expressive vocabulary scales as dependent variables, both complete-case and multiple imputation (MI) analyses were conducted. We found no association between multiple-metal exposures and neurodevelopment in covariate-adjusted models. This study demonstrates the use of LCA together with MI to determine patterns of exposure to multiple toxic metals and relate these to child neurodevelopment. However, because the overall study population was small, other studies with larger sample sizes are needed to investigate these associations. PMID:25694786
Stanley J. Zarnoch; H. Ken Cordell; Carter J. Betz; John C. Bergstrom
2010-01-01
Multiple imputation is used to create values for missing family income data in the National Survey on Recreation and the Environment. We present an overview of the survey and a description of the missingness pattern for family income and other key variables. We create a logistic model for the multiple imputation process and to impute data sets for family income. We...
Peterson, Josh F.; Eden, Svetlana K.; Moons, Karel G.; Ikizler, T. Alp; Matheny, Michael E.
2013-01-01
Background and objectives: Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m2 (eGFR 75). Design, setting, participants, and measurements: From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with highest likelihood of missing BCr among 13,003 patients with known BCr to simulate a “missing” data scenario while preserving actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI was compared with that of eGFR 75. Results: All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; P<0.001). Improvements in misclassification were greater in patients with impaired kidney function (full multiple imputation + serum creatinine) (15.3%) versus eGFR 75 (40.5%; P<0.001). Multiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreasing sensitivity relative to eGFR 75. Conclusions: Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods. PMID:23037980
Multiple imputation methods for bivariate outcomes in cluster randomised trials.
DiazOrdaz, K; Kenward, M G; Gomes, M; Grieve, R
2016-09-10
Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single-level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost-effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing-at-random clustered data scenarios were simulated following a full-factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed-effects multiple imputation and too low following single-level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Multiple Imputation of Multilevel Missing Data-Rigor versus Simplicity
ERIC Educational Resources Information Center
Drechsler, Jörg
2015-01-01
Multiple imputation is widely accepted as the method of choice to address item-nonresponse in surveys. However, research on imputation strategies for the hierarchical structures that are typically found in the data in educational contexts is still limited. While a multilevel imputation model should be preferred from a theoretical point of view if…
A Comparison of Item-Level and Scale-Level Multiple Imputation for Questionnaire Batteries
ERIC Educational Resources Information Center
Gottschall, Amanda C.; West, Stephen G.; Enders, Craig K.
2012-01-01
Behavioral science researchers routinely use scale scores that sum or average a set of questionnaire items to address their substantive questions. A researcher applying multiple imputation to incomplete questionnaire data can either impute the incomplete items prior to computing scale scores or impute the scale scores directly from other scale…
ERIC Educational Resources Information Center
van Ginkel, Joost R.; van der Ark, L. Andries; Sijtsma, Klaas
2007-01-01
The performance of five simple multiple imputation methods for dealing with missing data was compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmark, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at…
Combining multiple imputation and meta-analysis with individual participant data
Burgess, Stephen; White, Ian R; Resche-Rigon, Matthieu; Wood, Angela M
2013-01-01
Multiple imputation is a strategy for the analysis of incomplete data such that the impact of the missingness on the power and bias of estimates is mitigated. When data from multiple studies are collated, we can propose both within-study and multilevel imputation models to impute missing data on covariates. It is not clear how to choose between imputation models or how to combine imputation and inverse-variance weighted meta-analysis methods. This is especially important as often different studies measure data on different variables, meaning that we may need to impute data on a variable which is systematically missing in a particular study. In this paper, we consider a simulation analysis of sporadically missing data in a single covariate with a linear analysis model and discuss how the results would be applicable to the case of systematically missing data. We find in this context that ensuring the congeniality of the imputation and analysis models is important to give correct standard errors and confidence intervals. For example, if the analysis model allows between-study heterogeneity of a parameter, then we should incorporate this heterogeneity into the imputation model to maintain the congeniality of the two models. In an inverse-variance weighted meta-analysis, we should impute missing data and apply Rubin's rules at the study level prior to meta-analysis, rather than meta-analyzing each of the multiple imputations and then combining the meta-analysis estimates using Rubin's rules. We illustrate the results using data from the Emerging Risk Factors Collaboration. PMID:23703895
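The recommended order of operations (Rubin's rules within each study first, inverse-variance weighting across studies second) can be sketched as below. This is an illustrative simplification, not the authors' analysis: the function names are hypothetical, and the meta-analysis step is a fixed-effect model, whereas the abstract also considers between-study heterogeneity, which would call for a random-effects weighting instead.

```python
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Pool one study's per-imputation estimates with Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    total_var = mean(variances) + (1 + 1 / m) * variance(estimates)
    return mean(estimates), total_var


def meta_analyse_after_mi(studies):
    """Fixed-effect inverse-variance meta-analysis of multiply imputed studies.

    studies: list of (estimates, variances) pairs, one pair per study, each
    holding one entry per imputed dataset of that study.
    Step 1 pools within each study; step 2 weights studies by 1 / variance.
    Returns (overall estimate, variance of the overall estimate).
    """
    pooled = [rubin_pool(est, var) for est, var in studies]
    weights = [1 / total_var for _, total_var in pooled]
    overall = sum(w * q for w, (q, _) in zip(weights, pooled)) / sum(weights)
    return overall, 1 / sum(weights)
```

The alternative order, which the abstract advises against, would run a separate meta-analysis on each set of imputed datasets and only then combine the meta-analysis estimates with Rubin's rules.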
Gaussian-based routines to impute categorical variables in health surveys.
Yucel, Recai M; He, Yulei; Zaslavsky, Alan M
2011-12-20
The multivariate normal (MVN) distribution is arguably the most popular parametric model used in imputation and is available in most software packages (e.g., SAS PROC MI, R package norm). When it is applied to categorical variables as an approximation, practitioners often either apply simple rounding techniques for ordinal variables or create a distinct 'missing' category and/or exclude the nominal variable from the imputation phase. All of these practices can potentially lead to biased and/or uninterpretable inferences. In this work, we develop a new rounding methodology calibrated to preserve observed distributions to multiply impute missing categorical covariates. The major attractiveness of this method is its flexibility to use any 'working' imputation software, particularly those based on MVN, allowing practitioners to obtain usable imputations with small biases. A simulation study demonstrates the clear advantage of the proposed method in rounding ordinal variables and, in some scenarios, its plausibility in imputing nominal variables. We illustrate our methods on a widely used National Survey of Children with Special Health Care Needs where incomplete values on race posed a valid threat to inferences pertaining to disparities. Copyright © 2011 John Wiley & Sons, Ltd.
A New MI-Based Visualization Aided Validation Index for Mining Big Longitudinal Web Trial Data
Zhang, Zhaoyang; Fang, Hua; Wang, Honggang
2016-01-01
Web-delivered clinical trials generate big complex data. To help untangle the heterogeneity of treatment effects, unsupervised learning methods have been widely applied. However, identifying valid patterns is a high-priority but challenging issue for these methods. This paper, built upon our previous research on multiple imputation (MI)-based fuzzy clustering and validation, proposes a new MI-based Visualization-aided validation index (MIVOOS) to determine the optimal number of clusters for big incomplete longitudinal Web-trial data with inflated zeros. Unlike a recently developed fuzzy clustering validation index, MIVOOS uses more suitable overlap and separation measures for Web-trial data and does not depend on the choice of fuzzifiers, as the widely used Xie and Beni (XB) index does. Through optimizing the view angles of 3-D projections using Sammon mapping, the optimal 2-D projection-guided MIVOOS is obtained to better visualize and verify the patterns in conjunction with trajectory patterns. Compared with XB and VOS, our newly proposed MIVOOS shows its robustness in validating big Web-trial data under different missing data mechanisms using real and simulated Web-trial data. PMID:27482473
Amene, E; Horn, B; Pirie, R; Lake, R; Döpfer, D
2016-09-06
Data containing notified cases of disease are often compromised by incomplete or partial information related to individual cases. In an effort to enhance the value of information from enteric disease notifications in New Zealand, this study explored the use of Bayesian and Multiple Imputation (MI) models to fill risk factor data gaps. As a test case, overseas travel as a risk factor for infection with campylobacteriosis has been examined. Two methods, namely Bayesian Specification (BAS) and Multiple Imputation (MI), were compared regarding predictive performance for various levels of artificially induced missingness of overseas travel status in campylobacteriosis notification data. Predictive performance of the models was assessed through the Brier Score, the Area Under the ROC Curve and the Percent Bias of regression coefficients. Finally, the best model was selected and applied to predict missing overseas travel status of campylobacteriosis notifications. While no difference was observed in the predictive performance of the BAS and MI methods at a lower rate of missingness (<10 %), the BAS approach performed better than MI at higher rates of missingness (50 %, 65 %, 80 %). The estimated proportion (95 % Credibility Intervals) of travel related cases was greatest in highly urban District Health Boards (DHBs) in Counties Manukau, Auckland and Waitemata, at 0.37 (0.12, 0.57), 0.33 (0.13, 0.55) and 0.28 (0.10, 0.49), whereas the lowest proportion was estimated for more rural West Coast, Northland and Tairawhiti DHBs at 0.02 (0.01, 0.05), 0.03 (0.01, 0.08) and 0.04 (0.01, 0.06), respectively. The national rate of travel related campylobacteriosis cases was estimated at 0.16 (0.02, 0.48). The use of BAS offers a flexible approach to data augmentation particularly when the missing rate is very high and when the Missing At Random (MAR) assumption holds. 
High rates of travel associated cases in urban regions of New Zealand predicted by this approach are plausible given the high rate of travel in these regions, including destinations with higher risk of infection. The added advantage of using a Bayesian approach is that the model's prediction can be improved whenever new information becomes available.
Bernhardt, Paul W; Wang, Huixia Judy; Zhang, Daowen
2014-01-01
Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.
Meta‐analysis of test accuracy studies using imputation for partial reporting of multiple thresholds
Deeks, J.J.; Martin, E.C.; Riley, R.D.
2017-01-01
Introduction For tests reporting continuous results, primary studies usually provide test performance at multiple but often different thresholds. This creates missing data when performing a meta‐analysis at each threshold. A standard meta‐analysis (no imputation [NI]) ignores such missing data. A single imputation (SI) approach was recently proposed to recover missing threshold results. Here, we propose a new method that performs multiple imputation of the missing threshold results using discrete combinations (MIDC). Methods The new MIDC method imputes missing threshold results by randomly selecting from the set of all possible discrete combinations which lie between the results for 2 known bounding thresholds. Imputed and observed results are then synthesised at each threshold. This is repeated multiple times, and the multiple pooled results at each threshold are combined using Rubin's rules to give final estimates. We compared the NI, SI, and MIDC approaches via simulation. Results Both imputation methods outperform the NI method in simulations. There was generally little difference in the SI and MIDC methods, but the latter was noticeably better in terms of estimating the between‐study variances and generally gave better coverage, due to slightly larger standard errors of pooled estimates. Given selective reporting of thresholds, the imputation methods also reduced bias in the summary receiver operating characteristic curve. Simulations demonstrate the imputation methods rely on an equal threshold spacing assumption. A real example is presented. Conclusions The SI and, in particular, MIDC methods can be used to examine the impact of missing threshold results in meta‐analysis of test accuracy studies. PMID:29052347
Salim, Agus; Mackinnon, Andrew; Christensen, Helen; Griffiths, Kathleen
2008-09-30
The pre-test-post-test design (PPD) is predominant in trials of psychotherapeutic treatments. Missing data due to withdrawals present an even bigger challenge in assessing treatment effectiveness under the PPD than under designs with more observations since dropout implies an absence of information about response to treatment. When confronted with missing data, often it is reasonable to assume that the mechanism underlying missingness is related to observed but not to unobserved outcomes (missing at random, MAR). Previous simulation and theoretical studies have shown that, under MAR, modern techniques such as maximum-likelihood (ML) based methods and multiple imputation (MI) can be used to produce unbiased estimates of treatment effects. In practice, however, ad hoc methods such as last observation carried forward (LOCF) imputation and complete-case (CC) analysis continue to be used. In order to better understand the behaviour of these methods in the PPD, we compare the performance of traditional approaches (LOCF, CC) and theoretically sound techniques (MI, ML), under various MAR mechanisms. We show that the LOCF method is seriously biased and conclude that its use should be abandoned. Complete-case analysis produces unbiased estimates only when the dropout mechanism does not depend on pre-test values even when dropout is related to fixed covariates including treatment group (covariate-dependent: CD). However, CC analysis is generally biased under MAR. The magnitude of the bias is largest when the correlation of post- and pre-test is relatively low.
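To make concrete why LOCF is problematic in a pre-test-post-test design, the following minimal sketch implements last observation carried forward on a list of scores. It is an illustration only: the function name `locf` and the example scores are hypothetical and are not taken from the study above.

```python
from typing import List, Optional


def locf(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing value (None) with the most recent observed value.

    Leading missing values stay None because nothing has been observed yet.
    """
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v          # remember the latest observed score
        filled.append(last)   # carry it forward into any gap
    return filled


# Hypothetical participant who dropped out before the post-test.
pre_post = locf([24.0, None])
change = pre_post[1] - pre_post[0]   # LOCF forces the change score to zero
```

With only two measurement occasions, every dropout is assigned a change of exactly zero, which is one way to see how LOCF can distort an estimated treatment effect.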
Xu, Stanley; Clarke, Christina L; Newcomer, Sophia R; Daley, Matthew F; Glanz, Jason M
2018-05-16
Vaccine safety studies are often electronic health record (EHR)-based observational studies. These studies often face significant methodological challenges, including confounding and misclassification of adverse events. Vaccine safety researchers use the self-controlled case series (SCCS) study design to handle confounding and employ medical chart review to ascertain cases that are identified using EHR data. However, for common adverse events, limited resources often make it impossible to adjudicate all adverse events observed in electronic data. In this paper, we considered four approaches for analyzing SCCS data with confirmation rates estimated from an internal validation sample: (1) observed cases, (2) confirmed cases only, (3) known confirmation rate, and (4) multiple imputation (MI). We conducted a simulation study to evaluate these four approaches using type I error rates, percent bias, and empirical power. Our simulation results suggest that when misclassification of adverse events is present, approaches such as observed cases, confirmed cases only, and known confirmation rate may inflate the type I error, yield biased point estimates, and affect statistical power. The multiple imputation approach accounts for the uncertainty of estimated confirmation rates from an internal validation sample and yields a proper type I error rate, a largely unbiased point estimate, a proper variance estimate, and statistical power. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Gandhi, Mihir; Teivaanmaki, Tiina; Maleta, Kenneth; Duan, Xiaolian; Ashorn, Per; Cheung, Yin Bun
2013-01-01
This study aimed to examine the association between child development at 5 years of age and mathematics ability and schooling outcomes at 12 years of age in Malawian children. A prospective cohort study looking at 609 rural Malawian children. Outcome measures were percentage of correctly answered mathematics questions, highest school grade completed and number of times repeating school grades at 12 years of age. A child development summary score obtained at 5 years of age was the main exposure variable. Regression analyses were used to estimate the association and adjust for confounders. Sensitivity analysis was performed by handling losses to follow-up with the multiple imputation (MI) method. The summary score was positively associated with percentage of correctly answered mathematics questions (p = 0.057; p = 0.031 MI) and with highest school grade completed (p = 0.096; p = 0.070 MI), and negatively associated with number of times repeating school grades (p = 0.834; p = 0.339 MI). Fine motor score at 5 years was independently associated with the mathematics score (p = 0.032; p = 0.011 MI). The association between child development and mathematics ability did not depend on school attendance. Child development at 5 years of age showed signs of positive association with mathematics ability and possibly with highest school grade completed at 12 years of age. © 2012 The Author(s)/Acta Paediatrica © 2012 Foundation Acta Paediatrica.
Graffelman, Jan; Sánchez, Milagros; Cook, Samantha; Moreno, Victor
2013-01-01
In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missings are typically lowered with respect to inbreeding coefficients estimated by discarding the missings. Accounting for missings by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions.
ERIC Educational Resources Information Center
Mistler, Stephen A.; Enders, Craig K.
2017-01-01
Multiple imputation methods can generally be divided into two broad frameworks: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution, whereas FCS imputes variables one at a time from a series of univariate conditional…
Hedden, Sarra L; Woolson, Robert F; Carter, Rickey E; Palesch, Yuko; Upadhyaya, Himanshu P; Malcolm, Robert J
2009-07-01
"Loss to follow-up" can be substantial in substance abuse clinical trials. When extensive losses to follow-up occur, one must cautiously analyze and interpret the findings of a research study. Aims of this project were to introduce the types of missing data mechanisms and describe several methods for analyzing data with loss to follow-up. Furthermore, a simulation study compared Type I error and power of several methods when the amount and mechanism of missing data vary. Methods compared were the following: last observation carried forward (LOCF), multiple imputation (MI), modified stratified summary statistics (SSS), and mixed effects models. Results demonstrated nominal Type I error for all methods; power was high for all methods except LOCF. Mixed effects models, modified SSS, and MI are generally recommended for use; however, many methods require that the data are missing at random or missing completely at random (i.e., "ignorable"). If the missing data are presumed to be nonignorable, a sensitivity analysis is recommended.
Nested case-control studies: should one break the matching?
Borgan, Ørnulf; Keogh, Ruth
2015-10-01
In a nested case-control study, controls are selected for each case from the individuals who are at risk at the time at which the case occurs. We say that the controls are matched on study time. To adjust for possible confounding, it is common to match on other variables as well. The standard analysis of nested case-control data is based on a partial likelihood which compares the covariates of each case to those of its matched controls. It has been suggested that one may break the matching of nested case-control data and analyse them as case-cohort data using an inverse probability weighted (IPW) pseudo likelihood. Further, when some covariates are available for all individuals in the cohort, multiple imputation (MI) makes it possible to use all available data in the cohort. In the paper we review the standard method and the IPW and MI approaches, and compare their performance using simulations that cover a range of scenarios, including one and two endpoints.
ERIC Educational Resources Information Center
Si, Yajuan; Reiter, Jerome P.
2013-01-01
In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian,…
Quartagno, M; Carpenter, J R
2016-07-30
Recently, multiple imputation has been proposed as a tool for individual patient data meta-analysis with sporadically missing observations, and it has been suggested that within-study imputation is usually preferable. However, such within study imputation cannot handle variables that are completely missing within studies. Further, if some of the contributing studies are relatively small, it may be appropriate to share information across studies when imputing. In this paper, we develop and evaluate a joint modelling approach to multiple imputation of individual patient data in meta-analysis, with an across-study probability distribution for the study specific covariance matrices. This retains the flexibility to allow for between-study heterogeneity when imputing while allowing (i) sharing information on the covariance matrix across studies when this is appropriate, and (ii) imputing variables that are wholly missing from studies. Simulation results show both equivalent performance to the within-study imputation approach where this is valid, and good results in more general, practically relevant, scenarios with studies of very different sizes, non-negligible between-study heterogeneity and wholly missing variables. We illustrate our approach using data from an individual patient data meta-analysis of hypertension trials. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Multiple imputation of missing covariates for the Cox proportional hazards cure model
Beesley, Lauren J; Bartlett, Jonathan W; Wolf, Gregory T; Taylor, Jeremy M G
2016-01-01
We explore several approaches for imputing partially observed covariates when the outcome of interest is a censored event time and when there is an underlying subset of the population that will never experience the event of interest. We call these subjects “cured,” and we consider the case where the data are modeled using a Cox proportional hazards (CPH) mixture cure model. We study covariate imputation approaches using fully conditional specification (FCS). We derive the exact conditional distribution and suggest a sampling scheme for imputing partially observed covariates in the CPH cure model setting. We also propose several approximations to the exact distribution that are simpler and more convenient to use for imputation. A simulation study demonstrates that the proposed imputation approaches outperform existing imputation approaches for survival data without a cure fraction in terms of bias in estimating CPH cure model parameters. We apply our multiple imputation techniques to a study of patients with head and neck cancer. PMID:27439726
A Method for Imputing Response Options for Missing Data on Multiple-Choice Assessments
ERIC Educational Resources Information Center
Wolkowitz, Amanda A.; Skorupski, William P.
2013-01-01
When missing values are present in item response data, there are a number of ways one might impute a correct or incorrect response to a multiple-choice item. There are significantly fewer methods for imputing the actual response option an examinee may have provided if he or she had not omitted the item either purposely or accidentally. This…
Bouwman, Aniek C; Veerkamp, Roel F
2014-10-03
The aim of this study was to determine the consequences of splitting sequencing effort over multiple breeds for imputation accuracy from a high-density SNP chip towards whole-genome sequence. Such information would assist, for instance, numerically smaller cattle breeds, but also pig and chicken breeders, who have to choose wisely how to spend their sequencing efforts over all the breeds or lines they evaluate. Sequence data from cattle breeds was used, because there are currently relatively many individuals from several breeds sequenced within the 1,000 Bull Genomes project. The advantage of whole-genome sequence data is that it carries the causal mutations, but the question is whether it is possible to impute the causal variants accurately. This study therefore focussed on imputation accuracy of variants with low minor allele frequency and breed specific variants. Imputation accuracy was assessed for chromosomes 1 and 29 as the correlation between observed and imputed genotypes. For chromosome 1, the average imputation accuracy was 0.70 with a reference population of 20 Holstein, and increased to 0.83 when the reference population was increased by including 3 other dairy breeds with 20 animals each. When the same number of animals from the Holstein breed were added the accuracy improved to 0.88, while adding the 3 other breeds to the reference population of 80 Holstein improved the average imputation accuracy marginally to 0.89. For chromosome 29, the average imputation accuracy was lower. Some variants benefitted from the inclusion of other breeds in the reference population, initially determined by the MAF of the variant in each breed, but even Holstein specific variants did gain imputation accuracy from the multi-breed reference population. 
This study shows that splitting sequencing effort over multiple breeds and combining the reference populations is a good strategy for imputation from high-density SNP panels towards whole-genome sequence when reference populations are small and sequencing effort is limiting. When sequencing effort is limiting and interest lies in multiple breeds or lines, this provides imputation of each breed.
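The accuracy measure used in the abstract above, the correlation between observed and imputed genotypes, can be sketched as a plain Pearson correlation. This is a minimal sketch under assumptions not stated in the abstract: genotypes are assumed to be coded as allele counts (0, 1, 2), the function name `imputation_accuracy` is hypothetical, and the example values are illustrative rather than data from the study.

```python
from math import sqrt
from typing import Sequence


def imputation_accuracy(observed: Sequence[float], imputed: Sequence[float]) -> float:
    """Pearson correlation between observed and imputed genotypes.

    Both sequences hold one value per animal for a single variant,
    assumed here to be allele counts or dosages on a 0-2 scale.
    """
    n = len(observed)
    if n != len(imputed) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_obs = sum(observed) / n
    mean_imp = sum(imputed) / n
    cov = sum((o - mean_obs) * (i - mean_imp) for o, i in zip(observed, imputed))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_imp = sum((i - mean_imp) ** 2 for i in imputed)
    if ss_obs == 0 or ss_imp == 0:
        # A monomorphic variant has no variance, so the correlation is undefined.
        raise ValueError("correlation undefined when one sequence is constant")
    return cov / sqrt(ss_obs * ss_imp)


# Illustrative values only: five animals, one variant, one imputation error.
accuracy = imputation_accuracy([0, 1, 2, 1, 0], [0, 1, 2, 2, 0])
```

Averaging this quantity over variants would give a chromosome-level summary; whether the study averaged per variant or per animal is not stated in the abstract, so that choice is left open here.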
Howie, Bryan N.; Donnelly, Peter; Marchini, Jonathan
2009-01-01
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions. PMID:19543373
Multiple imputation of missing fMRI data in whole brain analysis
Vaden, Kenneth I.; Gebregziabher, Mulugeta; Kuchinsky, Stefanie E.; Eckert, Mark A.
2012-01-01
Whole brain fMRI analyses rarely include the entire brain because of missing data that result from data acquisition limits and susceptibility artifact, in particular. This missing data problem is typically addressed by omitting voxels from analysis, which may exclude brain regions that are of theoretical interest and increase the potential for Type II error at cortical boundaries or Type I error when spatial thresholds are used to establish significance. Imputation could significantly expand statistical map coverage, increase power, and enhance interpretations of fMRI results. We examined multiple imputation for group level analyses of missing fMRI data using methods that leverage the spatial information in fMRI datasets for both real and simulated data. Available case analysis, neighbor replacement, and regression based imputation approaches were compared in a general linear model framework to determine the extent to which these methods quantitatively (effect size) and qualitatively (spatial coverage) increased the sensitivity of group analyses. In both real and simulated data analysis, multiple imputation provided 1) variance that was most similar to estimates for voxels with no missing data, 2) fewer false positive errors in comparison to mean replacement, and 3) fewer false negative errors in comparison to available case analysis. Compared to the standard analysis approach of omitting voxels with missing data, imputation methods increased brain coverage in this study by 35% (from 33,323 to 45,071 voxels). In addition, multiple imputation increased the size of significant clusters by 58% and number of significant clusters across statistical thresholds, compared to the standard voxel omission approach. While neighbor replacement produced similar results, we recommend multiple imputation because it uses an informed sampling distribution to deal with missing data across subjects that can include neighbor values and other predictors. 
Multiple imputation is anticipated to be particularly useful for 1) large fMRI data sets with inconsistent missing voxels across subjects and 2) addressing the problem of increased artifact at ultra-high field, which significantly limits the extent of whole brain coverage and interpretations of results. PMID:22500925
Coquet, Julia Becaria; Tumas, Natalia; Osella, Alberto Ruben; Tanzi, Matteo; Franco, Isabella; Diaz, Maria Del Pilar
2016-01-01
A number of studies have evidenced the effect of modifiable lifestyle factors such as diet, breastfeeding and nutritional status on breast cancer risk. However, none have addressed the missing data problem in nutritional epidemiologic research in South America. Missing data is a frequent problem in breast cancer studies and epidemiological settings in general. Estimates of effect obtained from these studies may be biased, if no appropriate method for handling missing data is applied. We performed Multiple Imputation for missing values on covariates in a breast cancer case-control study of Córdoba (Argentina) to optimize risk estimates. Data was obtained from a breast cancer case control study from 2008 to 2015 (318 cases, 526 controls). Complete case analysis and multiple imputation using chained equations were the methods applied to estimate the effects of a Traditional dietary pattern and other recognized factors associated with breast cancer. Physical activity and socioeconomic status were imputed. Logistic regression models were performed. When complete case analysis was performed only 31% of women were considered. Although a positive association of Traditional dietary pattern and breast cancer was observed from both approaches (complete case analysis OR=1.3, 95%CI=1.0-1.7; multiple imputation OR=1.4, 95%CI=1.2-1.7), effects of other covariates, like BMI and breastfeeding, were only identified when multiple imputation was considered. A Traditional dietary pattern, BMI and breastfeeding are associated with the occurrence of breast cancer in this Argentinean population when multiple imputation is appropriately performed. Multiple Imputation is suggested in Latin America’s epidemiologic studies to optimize effect estimates in the future. PMID:27892664
Taylor, Sandra L; Ruhaak, L Renee; Kelly, Karen; Weiss, Robert H; Kim, Kyoungmi
2017-03-01
With expanded access to, and decreased costs of, mass spectrometry, investigators are collecting and analyzing multiple biological matrices from the same subject such as serum, plasma, tissue and urine to enhance biomarker discoveries, understanding of disease processes and identification of therapeutic targets. Commonly, each biological matrix is analyzed separately, but multivariate methods such as MANOVAs that combine information from multiple biological matrices are potentially more powerful. However, mass spectrometric data typically contain large amounts of missing values, and imputation is often used to create complete data sets for analysis. The effects of imputation on multiple biological matrix analyses have not been studied. We investigated the effects of seven imputation methods (half minimum substitution, mean substitution, k-nearest neighbors, local least squares regression, Bayesian principal components analysis, singular value decomposition and random forest), on the within-subject correlation of compounds between biological matrices and its consequences on MANOVA results. Through analysis of three real omics data sets and simulation studies, we found the amount of missing data and imputation method to substantially change the between-matrix correlation structure. The magnitude of the correlations was generally reduced in imputed data sets, and this effect increased with the amount of missing data. Significant results from MANOVA testing also were substantially affected. In particular, the number of false positives increased with the level of missing data for all imputation methods. No one imputation method was universally the best, but the simple substitution methods (Half Minimum and Mean) consistently performed poorly. © The Author 2016. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
Sajobi, Tolulope T; Lix, Lisa M; Singh, Gurbakhshash; Lowerison, Mark; Engbers, Jordan; Mayo, Nancy E
2015-03-01
Response shift (RS) is an important phenomenon that influences the assessment of longitudinal changes in health-related quality of life (HRQOL) studies. Given that RS effects are often small, missing data due to attrition or item non-response can contribute to failure to detect RS effects. Since missing data are often encountered in longitudinal HRQOL data, effective strategies to deal with missing data are important to consider. This study aims to compare different imputation methods on the detection of reprioritization RS in the HRQOL of caregivers of stroke survivors. Data were from a Canadian multi-center longitudinal study of caregivers of stroke survivors over a one-year period. The Stroke Impact Scale physical function score at baseline, with a cutoff of 75, was used to measure patient stroke severity for the reprioritization RS analysis. Mean imputation, likelihood-based expectation-maximization imputation, and multiple imputation methods were compared in test procedures based on changes in relative importance weights to detect RS in SF-36 domains over a 6-month period. Monte Carlo simulation methods were used to compare the statistical powers of relative importance test procedures for detecting RS in incomplete longitudinal data under different missing data mechanisms and imputation methods. Of the 409 caregivers, 15.9 and 31.3 % had missing data at baseline and 6 months, respectively. There were no statistically significant changes in relative importance weights on any of the domains when complete-case analysis was adopted. However, statistically significant changes were detected on the physical functioning and/or vitality domains when mean imputation or EM imputation was adopted. There were also statistically significant changes in relative importance weights for the physical functioning, mental health, and vitality domains when the multiple imputation method was adopted. 
Our simulations revealed that relative importance test procedures were least powerful under complete-case analysis method and most powerful when a mean imputation or multiple imputation method was adopted for missing data, regardless of the missing data mechanism and proportion of missing data. Test procedures based on relative importance measures are sensitive to the type and amount of missing data and imputation method. Relative importance test procedures based on mean imputation and multiple imputation are recommended for detecting RS in incomplete data.
Multiple Imputation of a Randomly Censored Covariate Improves Logistic Regression Analysis.
Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A
2016-01-01
Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete case and single imputation or substitution, suffer from inefficiency and bias. They make strong parametric assumptions or they consider limit of detection censoring only. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.
Aßmann, C
2016-06-01
Besides large efforts regarding field work, provision of valid databases requires statistical and informational infrastructure to enable long-term access to longitudinal data sets on height, weight and related issues. To foster use of longitudinal data sets within the scientific community, provision of valid databases has to address data-protection regulations. It is, therefore, of major importance to hinder identifiability of individuals from publicly available databases. To reach this goal, one possible strategy is to provide a synthetic database to the public allowing for pretesting strategies for data analysis. The synthetic databases can be established using multiple imputation tools. Given the approval of the strategy, verification is based on the original data. Multiple imputation by chained equations is illustrated to facilitate provision of synthetic databases as it allows for capturing a wide range of statistical interdependencies. Also missing values, typically occurring within longitudinal databases for reasons of item non-response, can be addressed via multiple imputation when providing databases. The provision of synthetic databases using multiple imputation techniques is one possible strategy to ensure data protection, increase visibility of longitudinal databases and enhance the analytical potential.
Multiple Imputation of Cognitive Performance as a Repeatedly Measured Outcome
Rawlings, Andreea M.; Sang, Yingying; Sharrett, A. Richey; Coresh, Josef; Griswold, Michael; Kucharska-Newton, Anna M.; Palta, Priya; Wruck, Lisa M.; Gross, Alden L.; Deal, Jennifer A.; Power, Melinda C.; Bandeen-Roche, Karen
2016-01-01
Background Longitudinal studies of cognitive performance are sensitive to dropout, as participants experiencing cognitive deficits are less likely to attend study visits, which may bias estimated associations between exposures of interest and cognitive decline. Multiple imputation is a powerful tool for handling missing data; however, its use for missing cognitive outcome measures in longitudinal analyses remains limited. Methods We use multiple imputation by chained equations (MICE) to impute cognitive performance scores of participants who did not attend the 2011-2013 exam of the Atherosclerosis Risk in Communities Study. We examined the validity of imputed scores using observed and simulated data under varying assumptions. We examined differences in the estimated association between diabetes at baseline and 20-year cognitive decline with and without imputed values. Lastly, we discuss how different analytic methods (mixed models and models fit using generalized estimating equations) and choice of for whom to impute result in different estimands. Results Validation using observed data showed MICE produced unbiased imputations. Simulations showed a substantial reduction in the bias of the 20-year association between diabetes and cognitive decline comparing MICE (3-4% bias) to analyses of available data only (16-23% bias) in a construct where missingness was strongly informative but realistic. Associations between diabetes and 20-year cognitive decline were substantially stronger with MICE than in available-case analyses. Conclusions Our study suggests when informative data are available for non-examined participants, MICE can be an effective tool for imputing cognitive performance and improving assessment of cognitive decline, though careful thought should be given to target imputation population and analytic model chosen, as they may yield different estimands. PMID:27619926
SPSS Syntax for Missing Value Imputation in Test and Questionnaire Data
ERIC Educational Resources Information Center
van Ginkel, Joost R.; van der Ark, L. Andries
2005-01-01
A well-known problem in the analysis of test and questionnaire data is that some item scores may be missing. Advanced methods for the imputation of missing data are available, such as multiple imputation under the multivariate normal model and imputation under the saturated logistic model (Schafer, 1997). Accompanying software was made available…
Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M
2018-06-01
A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.
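The masking evaluation described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `mask_values`, `mean_impute` and `normalized_rmse` are hypothetical helper names, the mean imputer is a simple stand-in (not MICE, GP or 3D-MICE), and normalizing the RMSE by the standard deviation of the masked true values is an assumed convention, since the abstract does not state how the error was normalized.

```python
import math
import random


def mask_values(values, frac, seed=0):
    """Hide a random fraction of the observed entries (None = missing)."""
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    k = max(1, round(frac * len(observed_idx)))
    masked_idx = sorted(rng.sample(observed_idx, k))
    masked = list(values)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def mean_impute(values):
    """Stand-in imputer: fill every missing entry with the observed mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def normalized_rmse(truth, predicted):
    """RMSE divided by the standard deviation of the true values (assumed)."""
    n = len(truth)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, predicted)) / n)
    mu = sum(truth) / n
    sd = math.sqrt(sum((t - mu) ** 2 for t in truth) / n)
    if sd == 0:
        raise ValueError("true values have zero variance")
    return rmse / sd


# Hypothetical series of one analyte; None marks originally missing results.
sodium = [139.0, 141.0, None, 137.0, 140.0, 143.0, 138.0, None, 142.0, 136.0]
masked, idx = mask_values(sodium, frac=0.25, seed=1)
imputed = mean_impute(masked)
score = normalized_rmse([sodium[i] for i in idx], [imputed[i] for i in idx])
```

Only the deliberately masked positions enter the score, because those are the only missing entries for which a measured value is available for comparison.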
Bozio, Catherine H; Flanders, W Dana; Finelli, Lyn; Bramley, Anna M; Reed, Carrie; Gandhi, Neel R; Vidal, Jorge E; Erdman, Dean; Levine, Min Z; Lindstrom, Stephen; Ampofo, Krow; Arnold, Sandra R; Self, Wesley H; Williams, Derek J; Grijalva, Carlos G; Anderson, Evan J; McCullers, Jonathan A; Edwards, Kathryn M; Pavia, Andrew T; Wunderink, Richard G; Jain, Seema
2018-04-01
Real-time polymerase chain reaction (PCR) on respiratory specimens and serology on paired blood specimens are used to determine the etiology of respiratory illnesses for research studies. However, convalescent serology is often not collected. We used multiple imputation to assign values for missing serology results to estimate virus-specific prevalence among pediatric and adult community-acquired pneumonia hospitalizations using data from an active population-based surveillance study. Presence of adenoviruses, human metapneumovirus, influenza viruses, parainfluenza virus types 1-3, and respiratory syncytial virus was defined by positive PCR on nasopharyngeal/oropharyngeal specimens or a 4-fold rise in paired serology. We performed multiple imputation by developing a multivariable regression model for each virus using data from patients with available serology results. We calculated absolute and relative differences in the proportion of each virus detected comparing the imputed to observed (nonimputed) results. Among 2222 children and 2259 adults, 98.8% and 99.5% had nasopharyngeal/oropharyngeal specimens and 43.2% and 37.5% had paired serum specimens, respectively. Imputed results increased viral etiology assignments by an absolute difference of 1.6%-4.4% and 0.8%-2.8% in children and adults, respectively; relative differences were 1.1-3.0 times higher. Multiple imputation can be used when serology results are missing, to refine virus-specific prevalence estimates, and these will likely increase estimates.
Sun, Wanjie; Larsen, Michael D; Lachin, John M
2014-04-15
In longitudinal studies, a quantitative outcome (such as blood pressure) may be altered during follow-up by the administration of a non-randomized, non-trial intervention (such as anti-hypertensive medication) that may seriously bias the study results. Current methods mainly address this issue for cross-sectional studies. For longitudinal data, the current methods are either restricted to a specific longitudinal data structure or are valid only under special circumstances. We propose two new methods for estimation of covariate effects on the underlying (untreated) general longitudinal outcomes: a single imputation method employing a modified expectation-maximization (EM)-type algorithm and a multiple imputation (MI) method utilizing a modified Monte Carlo EM-MI algorithm. Each method can be implemented as one-step, two-step, and full-iteration algorithms. They combine the advantages of the current statistical methods while reducing their restrictive assumptions and generalizing them to realistic scenarios. The proposed methods replace intractable numerical integration of a multi-dimensionally censored MVN posterior distribution with a simplified, sufficiently accurate approximation. It is particularly attractive when outcomes reach a plateau after intervention due to various reasons. Methods are studied via simulation and applied to data from the Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications study of treatment for type 1 diabetes. Methods proved to be robust to high dimensions, large amounts of censored data, low within-subject correlation, and when subjects receive non-trial intervention to treat the underlying condition only (with high Y), or for treatment in the majority of subjects (with high Y) in combination with prevention for a small fraction of subjects (with normal Y). Copyright © 2013 John Wiley & Sons, Ltd.
Statistical Methods for Generalized Linear Models with Covariates Subject to Detection Limits.
Bernhardt, Paul W; Wang, Huixia J; Zhang, Daowen
2015-05-01
Censored observations are a common occurrence in biomedical data sets. Although a large amount of research has been devoted to estimation and inference for data with censored responses, very little research has focused on proper statistical procedures when predictors are censored. In this paper, we consider statistical methods for dealing with multiple predictors subject to detection limits within the context of generalized linear models. We investigate and adapt several conventional methods and develop a new multiple imputation approach for analyzing data sets with predictors censored due to detection limits. We establish the consistency and asymptotic normality of the proposed multiple imputation estimator and suggest a computationally simple and consistent variance estimator. We also demonstrate that the conditional mean imputation method often leads to inconsistent estimates in generalized linear models, while several other methods are either computationally intensive or lead to parameter estimates that are biased or more variable compared to the proposed multiple imputation estimator. In an extensive simulation study, we assess the bias and variability of different approaches within the context of a logistic regression model and compare variance estimation methods for the proposed multiple imputation estimator. Lastly, we apply several methods to analyze the data set from a recently-conducted GenIMS study.
Ambler, Gareth; Omar, Rumana Z; Royston, Patrick
2007-06-01
Risk models that aim to predict the future course and outcome of disease processes are increasingly used in health research, and it is important that they are accurate and reliable. Most of these risk models are fitted using routinely collected data in hospitals or general practices. Clinical outcomes such as short-term mortality will be near-complete, but many of the predictors may have missing values. A common approach to dealing with this is to perform a complete-case analysis. However, this may lead to overfitted models and biased estimates if entire patient subgroups are excluded. The aim of this paper is to investigate a number of methods for imputing missing data to evaluate their effect on risk model estimation and the reliability of the predictions. Multiple imputation methods, including hotdecking and multiple imputation by chained equations (MICE), were investigated along with several single imputation methods. A large national cardiac surgery database was used to create simulated yet realistic datasets. The results suggest that complete case analysis may produce unreliable risk predictions and should be avoided. Conditional mean imputation performed well in our scenario, but may not be appropriate if using variable selection methods. MICE was amongst the best performing multiple imputation methods with regard to the quality of the predictions. Additionally, it produced the least biased estimates, with good coverage, and hence is recommended for use in practice.
Ipsative imputation for a 15-item Geriatric Depression Scale in community-dwelling elderly people.
Imai, Hissei; Furukawa, Toshiaki A; Kasahara, Yoriko; Ishimoto, Yasuko; Kimura, Yumi; Fukutomi, Eriko; Chen, Wen-Ling; Tanaka, Mire; Sakamoto, Ryota; Wada, Taizo; Fujisawa, Michiko; Okumiya, Kiyohito; Matsubayashi, Kozo
2014-09-01
Missing data are inevitable in almost all medical studies. Imputation methods using the probabilistic model are common, but they cannot impute individual data and require special software. In contrast, the ipsative imputation method, which substitutes the missing items by the mean of the remaining items within the individual, is easy, does not need any special software, and can provide individual scores. The aim of the present study was to evaluate the validity of the ipsative imputation method using data involving the 15-item Geriatric Depression Scale. Participants were community-dwelling elderly individuals (n = 1178). A structural equation model was constructed. The model fit indexes were calculated to assess the validity of the imputation method when it is used for individuals who were missing 20% of data or less and 40% of data or less, depending on whether we assumed that their correlation coefficients were the same as the dataset with no missing items. Finally, we compared path coefficients of the dataset imputed by ipsative imputation with those by multiple imputation. When compared with the assumption that the datasets differed, all of the model fit indexes were better under the assumption that the dataset without missing data is the same as the dataset that was missing 20% of data or less. However, by the same assumption, the model fit indexes were worse in the dataset that was missing 40% of data or less. The path coefficients of the dataset imputed by ipsative imputation and by multiple imputation were compatible with each other if the proportion of missing items was 20% or less. Ipsative imputation appears to be a valid imputation method and can be used to impute data in studies using the 15-item Geriatric Depression Scale, if the percentage of its missing items is 20% or less. © 2014 The Authors. Psychogeriatrics © 2014 Japanese Psychogeriatric Society.
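Ipsative imputation as described above (each missing item replaced by the mean of the same respondent's answered items) is simple enough to sketch directly. The function name `ipsative_impute`, the `max_missing` parameter and the example responses are illustrative assumptions; the 20% default mirrors the threshold the abstract reports as acceptable, not a rule taken from the paper's code.

```python
from typing import List, Optional


def ipsative_impute(items: List[Optional[float]],
                    max_missing: float = 0.2) -> Optional[List[float]]:
    """Fill missing items (None) with the respondent's own item mean.

    Returns None when no item was answered or when the share of
    missing items exceeds max_missing.
    """
    n_missing = sum(1 for x in items if x is None)
    if n_missing == len(items) or n_missing / len(items) > max_missing:
        return None
    observed = [x for x in items if x is not None]
    person_mean = sum(observed) / len(observed)
    return [person_mean if x is None else x for x in items]


# Hypothetical 15-item response (1 = depressive answer), 3 items missing (20%).
responses = [1, 0, 0, None, 1, 0, 0, 0, None, 1, 0, 0, None, 0, 0]
filled = ipsative_impute(responses)
total_score = sum(filled)  # 3 observed points + 3 * 0.25 imputed = 3.75
```

Because the fill value comes from the individual's own answers, no model fitting or special software is needed, which is the practical advantage the abstract emphasizes.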
Multiple imputation for cure rate quantile regression with censored data.
Wu, Yuanshan; Yin, Guosheng
2017-03-01
The main challenge in the context of cure rate analysis is that one never knows whether censored subjects are cured or uncured, or whether they are susceptible or insusceptible to the event of interest. Considering the susceptible indicator as missing data, we propose a multiple imputation approach to cure rate quantile regression for censored data with a survival fraction. We develop an iterative algorithm to estimate the conditionally uncured probability for each subject. By utilizing this estimated probability and Bernoulli sample imputation, we can classify each subject as cured or uncured, and then employ the locally weighted method to estimate the quantile regression coefficients with only the uncured subjects. Repeating the imputation procedure multiple times and taking an average over the resultant estimators, we obtain consistent estimators for the quantile regression coefficients. Our approach relaxes the usual global linearity assumption, so that we can apply quantile regression to any particular quantile of interest. We establish asymptotic properties for the proposed estimators, including both consistency and asymptotic normality. We conduct simulation studies to assess the finite-sample performance of the proposed multiple imputation method and apply it to a lung cancer study as an illustration. © 2016, The International Biometric Society.
Multiple imputation of missing passenger boarding data in the national census of ferry operators
DOT National Transportation Integrated Search
2008-08-01
This report presents findings from the 2006 National Census of Ferry Operators (NCFO) augmented with imputed values for passengers and passenger miles. Due to the imputation procedures used to calculate missing data, totals in Table 1 may not corre...
Kudo, Daisuke; Hayakawa, Mineji; Ono, Kota; Yamakawa, Kazuma
2018-03-01
Anticoagulant therapy for patients with sepsis is not recommended in the latest Surviving Sepsis Campaign guidelines, and non-anticoagulant therapy is the global standard treatment approach at present. We aimed at elucidating the effect of non-anticoagulant therapy on patients with sepsis-induced disseminated intravascular coagulation (DIC), as evidence on this topic has remained inconclusive. Data from 3195 consecutive adult patients admitted to 42 intensive care units for the treatment of severe sepsis were retrospectively analyzed via propensity score analyses with and without multiple imputation. The primary outcome was in-hospital all-cause mortality. Among 1784 patients with sepsis-induced DIC, 745 (41.8%) were not treated with anticoagulants. The inverse probability of treatment-weighted (with and without multiple imputation) and quintile-stratified propensity score analyses (without multiple imputation) indicated a significant association between non-anticoagulant therapy and higher in-hospital all-cause mortality (odds ratio [95% confidence interval]: 1.59 [1.19-2.12], 1.32 [1.02-1.81], and 1.32 [1.03-1.69], respectively). However, quintile-stratified propensity score analyses with multiple imputation and propensity score matching analysis with and without multiple imputation did not show this association. Survival duration was not significantly different between patients in the propensity score-matched non-anticoagulant therapy group and those in the anticoagulant therapy group (Cox regression analysis with and without multiple imputation: hazard ratio [95% confidence interval]: 1.26 [1.00-1.60] and 1.22 [0.93-1.59], respectively). It remains controversial if non-anticoagulant therapy is harmful, equivalent, or beneficial compared with anticoagulant therapy in the treatment of patients with sepsis-induced DIC. Copyright © 2018 Elsevier Ltd. All rights reserved.
Missing Data and Multiple Imputation in the Context of Multivariate Analysis of Variance
ERIC Educational Resources Information Center
Finch, W. Holmes
2016-01-01
Multivariate analysis of variance (MANOVA) is widely used in educational research to compare means on multiple dependent variables across groups. Researchers faced with the problem of missing data often use multiple imputation of values in place of the missing observations. This study compares the performance of 2 methods for combining p values in…
Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective.
ERIC Educational Resources Information Center
Schafer, Joseph L.; Olsen, Maren K.
1998-01-01
The key ideas of multiple imputation for multivariate missing data problems are reviewed. Software programs available for this analysis are described, and their use is illustrated with data from the Adolescent Alcohol Prevention Trial (W. Hansen and J. Graham, 1991). (SLD)
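One of the key ideas of multiple imputation is the pooling step: the analysis is run on each imputed data set and the results are combined with Rubin's rules. The sketch below shows that combination for a single scalar estimate; the function name `pool_rubin` and the numbers are illustrative and are not taken from the article or from the software it describes.

```python
from typing import List, Tuple


def pool_rubin(estimates: List[float],
               variances: List[float]) -> Tuple[float, float]:
    """Combine m per-imputation estimates and their squared standard errors.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation
    variance and B the between-imputation variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical regression coefficient from m = 3 imputed data sets.
pooled, total_var = pool_rubin([0.50, 0.60, 0.40], [0.04, 0.05, 0.03])
```

The between-imputation term is what separates multiple from single imputation: it carries the uncertainty about the missing values into the final standard error.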
Statistical methods for incomplete data: Some results on model misspecification.
McIsaac, Michael; Cook, R J
2017-02-01
Inverse probability weighted estimating equations and multiple imputation are two of the most studied frameworks for dealing with incomplete data in clinical and epidemiological research. We examine the limiting behaviour of estimators arising from inverse probability weighted estimating equations, augmented inverse probability weighted estimating equations and multiple imputation when the requisite auxiliary models are misspecified. We compute limiting values for settings involving binary responses and covariates and illustrate the effects of model misspecification using simulations based on data from a breast cancer clinical trial. We demonstrate that, even when both auxiliary models are misspecified, the asymptotic biases of double-robust augmented inverse probability weighted estimators are often smaller than the asymptotic biases of estimators arising from complete-case analyses, inverse probability weighting or multiple imputation. We further demonstrate that use of inverse probability weighting or multiple imputation with slightly misspecified auxiliary models can actually result in greater asymptotic bias than the use of naïve, complete case analyses. These asymptotic results are shown to be consistent with empirical results from simulation studies.
Limitations in Using Multiple Imputation to Harmonize Individual Participant Data for Meta-Analysis.
Siddique, Juned; de Chavez, Peter J; Howe, George; Cruden, Gracelyn; Brown, C Hendricks
2018-02-01
Individual participant data (IPD) meta-analysis is a meta-analysis in which the individual-level data for each study are obtained and used for synthesis. A common challenge in IPD meta-analysis is when variables of interest are measured differently in different studies. The term harmonization has been coined to describe the procedure of placing variables on the same scale in order to permit pooling of data from a large number of studies. Using data from an IPD meta-analysis of 19 adolescent depression trials, we describe a multiple imputation approach for harmonizing 10 depression measures across the 19 trials by treating those depression measures that were not used in a study as missing data. We then apply diagnostics to address the fit of our imputation model. Even after reducing the scale of our application, we were still unable to produce accurate imputations of the missing values. We describe those features of the data that made it difficult to harmonize the depression measures and provide some guidelines for using multiple imputation for harmonization in IPD meta-analysis.
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Cui, Jonathan J; Basques, Bryce A; Albert, Todd J; Grauer, Jonathan N
2018-04-09
The presence of missing data is a limitation of large datasets, including the National Surgical Quality Improvement Program (NSQIP). In addressing this issue, most studies use complete case analysis, which excludes cases with missing data, thus potentially introducing selection bias. Multiple imputation, a statistically rigorous approach that approximates missing data and preserves sample size, may be an improvement over complete case analysis. The present study aims to evaluate the impact of using multiple imputation in comparison with complete case analysis for assessing the associations between preoperative laboratory values and adverse outcomes following anterior cervical discectomy and fusion (ACDF) procedures. This is a retrospective review of prospectively collected data. Patients undergoing one-level ACDF were identified in NSQIP 2012-2015. Perioperative adverse outcome variables assessed included the occurrence of any adverse event, severe adverse events, and hospital readmission. Missing preoperative albumin and hematocrit values were handled using complete case analysis and multiple imputation. These preoperative laboratory levels were then tested for associations with 30-day postoperative outcomes using logistic regression. A total of 11,999 patients were included. Of this cohort, 63.5% of patients had missing preoperative albumin and 9.9% had missing preoperative hematocrit. When using complete case analysis, only 4,311 patients were studied. The removed patients were significantly younger, healthier, of a common body mass index, and male. Logistic regression analysis failed to identify either preoperative hypoalbuminemia or preoperative anemia as significantly associated with adverse outcomes. When employing multiple imputation, all 11,999 patients were included. Preoperative hypoalbuminemia was significantly associated with the occurrence of any adverse event and severe adverse events. 
Preoperative anemia was significantly associated with the occurrence of any adverse event, severe adverse events, and hospital readmission. Multiple imputation is a rigorous statistical procedure that is being increasingly used to address missing values in large datasets. Using this technique for ACDF avoided the loss of cases that may have affected the representativeness and power of the study and led to different results than complete case analysis. Multiple imputation should be considered for future spine studies. Copyright © 2018 Elsevier Inc. All rights reserved.
Obtaining Predictions from Models Fit to Multiply Imputed Data
ERIC Educational Resources Information Center
Miles, Andrew
2016-01-01
Obtaining predictions from regression models fit to multiply imputed data can be challenging because treatments of multiple imputation seldom give clear guidance on how predictions can be calculated, and because available software often does not have built-in routines for performing the necessary calculations. This research note reviews how…
Bozio, Catherine H; Flanders, W Dana; Finelli, Lyn; Bramley, Anna M; Reed, Carrie; Gandhi, Neel R; Vidal, Jorge E; Erdman, Dean; Levine, Min Z; Lindstrom, Stephen; Ampofo, Krow; Arnold, Sandra R; Self, Wesley H; Williams, Derek J; Grijalva, Carlos G; Anderson, Evan J; McCullers, Jonathan A; Edwards, Kathryn M; Pavia, Andrew T; Wunderink, Richard G; Jain, Seema
2018-01-01
Abstract Background Real-time polymerase chain reaction (PCR) on respiratory specimens and serology on paired blood specimens are used to determine the etiology of respiratory illnesses for research studies. However, convalescent serology is often not collected. We used multiple imputation to assign values for missing serology results to estimate virus-specific prevalence among pediatric and adult community-acquired pneumonia hospitalizations using data from an active population-based surveillance study. Methods Presence of adenoviruses, human metapneumovirus, influenza viruses, parainfluenza virus types 1–3, and respiratory syncytial virus was defined by positive PCR on nasopharyngeal/oropharyngeal specimens or a 4-fold rise in paired serology. We performed multiple imputation by developing a multivariable regression model for each virus using data from patients with available serology results. We calculated absolute and relative differences in the proportion of each virus detected comparing the imputed to observed (nonimputed) results. Results Among 2222 children and 2259 adults, 98.8% and 99.5% had nasopharyngeal/oropharyngeal specimens and 43.2% and 37.5% had paired serum specimens, respectively. Imputed results increased viral etiology assignments by an absolute difference of 1.6%–4.4% and 0.8%–2.8% in children and adults, respectively; relative differences were 1.1–3.0 times higher. Conclusions Multiple imputation can be used when serology results are missing, to refine virus-specific prevalence estimates, and these will likely increase estimates.
Genotype Imputation for Latinos Using the HapMap and 1000 Genomes Project Reference Panels.
Gao, Xiaoyi; Haritunians, Talin; Marjoram, Paul; McKean-Cowdin, Roberta; Torres, Mina; Taylor, Kent D; Rotter, Jerome I; Gauderman, William J; Varma, Rohit
2012-01-01
Genotype imputation is a vital tool in genome-wide association studies (GWAS) and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angeles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR + CEU + YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation based analysis in Latinos.
NASA Technical Reports Server (NTRS)
Xiao, Qingyang; Wang, Yujie; Chang, Howard H.; Meng, Xia; Geng, Guannan; Lyapustin, Alexei Ivanovich; Liu, Yang
2017-01-01
Satellite aerosol optical depth (AOD) has been used to assess population exposure to fine particulate matter (PM2.5). The emerging high-resolution satellite aerosol product, Multi-Angle Implementation of Atmospheric Correction (MAIAC), provides a valuable opportunity to characterize local-scale PM2.5 at 1-km resolution. However, non-random missing AOD due to cloud or snow cover or high surface reflectance makes this task challenging. Previous studies filled the data gap by spatially interpolating neighboring PM2.5 measurements or predictions. This strategy ignored the effect of cloud cover on aerosol loadings and has been shown to exhibit poor performance when monitoring stations are sparse or when there is seasonal large-scale missingness. Using the Yangtze River Delta of China as an example, we present a Multiple Imputation (MI) method that combines the MAIAC high-resolution satellite retrievals with chemical transport model (CTM) simulations to fill missing AOD. A two-stage statistical model driven by gap-filled AOD, meteorology and land use information was then fitted to estimate daily ground PM2.5 concentrations in 2013 and 2014 at 1 km resolution with complete coverage in space and time. The daily MI models have an average R^2 of 0.77, with an inter-quartile range of 0.71 to 0.82 across days. The overall MI model 10-fold cross-validation R^2 (root mean square error) values were 0.81 (25 μg/m^3) and 0.73 (18 μg/m^3) for 2013 and 2014, respectively. Predictions with only observational AOD or only imputed AOD showed similar accuracy. Compared with previous gap-filling methods, the MI method presented in this study performed better, with higher coverage, higher accuracy, and the ability to fill missing PM2.5 predictions without ground PM2.5 measurements. This method can provide reliable PM2.5 predictions with complete coverage that can reduce bias in exposure assessment in air pollution and health studies.
Cox regression analysis with missing covariates via nonparametric multiple imputation.
Hsu, Chiu-Hsieh; Yu, Mandi
2018-01-01
We consider the situation of estimating Cox regression in which some covariates are subject to missing, and there exists additional information (including observed event time, censoring indicator and fully observed covariates) which may be predictive of the missing covariates. We propose to use two working regression models: one for predicting the missing covariates and the other for predicting the missing probabilities. For each missing covariate observation, these two working models are used to define a nearest neighbor imputing set. This set is then used to non-parametrically impute covariate values for the missing observation. Upon the completion of imputation, Cox regression is performed on the multiply imputed datasets to estimate the regression coefficients. In a simulation study, we compare the nonparametric multiple imputation approach with the augmented inverse probability weighted (AIPW) method, which directly incorporates the two working models into estimation of Cox regression, and the predictive mean matching imputation (PMM) method. We show that all approaches can reduce bias due to non-ignorable missing mechanism. The proposed nonparametric imputation method is robust to mis-specification of either one of the two working models and robust to mis-specification of the link function of the two working models. In contrast, the PMM method is sensitive to misspecification of the covariates included in imputation. The AIPW method is sensitive to the selection probability. We apply the approaches to a breast cancer dataset from Surveillance, Epidemiology and End Results (SEER) Program.
Jones, Rachael M; Stayner, Leslie T; Demirtas, Hakan
2014-10-01
Drinking water may contain pollutants that harm human health. Pollutant monitoring may occur quarterly, annually, or less frequently, depending upon the pollutant, the pollutant concentration, and community water system. However, birth and other health outcomes are associated with narrow time-windows of exposure. Infrequent monitoring impedes linkage between water quality and health outcomes for epidemiological analyses. Our objective was to evaluate the performance of multiple imputation to fill in water quality values between measurements in community water systems (CWSs). The multiple imputation method was implemented in a simulated setting using data from the Atrazine Monitoring Program (AMP, 2006-2009 in five Midwestern states). Values were deleted from the AMP data to leave one measurement per month. Four patterns reflecting drinking water monitoring regulations were used to delete months of data in each CWS: three patterns were missing at random and one pattern was missing not at random. Synthetic health outcome data were created using a linear and a Poisson exposure-response relationship with five levels of hypothesized association, respectively. The multiple imputation method was evaluated by comparing the exposure-response relationships estimated based on multiply imputed data with the hypothesized association. The four patterns deleted 65-92% of months of atrazine observations in AMP data. Even with these high rates of missing information, our procedure was able to recover most of the missing information when the synthetic health outcome was included for missing at random patterns and for missing not at random patterns with low-to-moderate exposure-response relationships. Multiple imputation appears to be an effective method for filling in water quality values between measurements. Copyright © 2014 Elsevier Inc. All rights reserved.
Lee, Minjung; Dignam, James J.; Han, Junhee
2014-01-01
We propose a nonparametric approach for cumulative incidence estimation when causes of failure are unknown or missing for some subjects. Under the missing at random assumption, we estimate the cumulative incidence function using multiple imputation methods. We develop asymptotic theory for the cumulative incidence estimators obtained from multiple imputation methods. We also discuss how to construct confidence intervals for the cumulative incidence function and perform a test for comparing the cumulative incidence functions in two samples with missing cause of failure. Through simulation studies, we show that the proposed methods perform well. The methods are illustrated with data from a randomized clinical trial in early stage breast cancer. PMID:25043107
Generating Multiple Imputations for Matrix Sampling Data Analyzed with Item Response Models.
ERIC Educational Resources Information Center
Thomas, Neal; Gan, Nianci
1997-01-01
Describes and assesses missing data methods currently used to analyze data from matrix sampling designs implemented by the National Assessment of Educational Progress. Several improved methods are developed, and these models are evaluated using an EM algorithm to obtain maximum likelihood estimates followed by multiple imputation of complete data…
Filling the gap in functional trait databases: use of ecological hypotheses to replace missing data.
Taugourdeau, Simon; Villerd, Jean; Plantureux, Sylvain; Huguenin-Elie, Olivier; Amiaud, Bernard
2014-04-01
Functional trait databases are powerful tools in ecology, though most of them contain large amounts of missing values. The goal of this study was to test the effect of imputation methods on the evaluation of trait values at species level and on the subsequent calculation of functional diversity indices at community level using functional trait databases. Two simple imputation methods (average and median), two methods based on ecological hypotheses, and one multiple imputation method were tested using a large plant trait database, together with the influence of the percentage of missing data and differences between functional traits. At community level, the complete-case approach and three functional diversity indices calculated from grassland plant communities were included. At the species level, one of the methods based on ecological hypothesis was for all traits more accurate than imputation with average or median values, but the multiple imputation method was superior for most of the traits. The method based on functional proximity between species was the best method for traits with an unbalanced distribution, while the method based on the existence of relationships between traits was the best for traits with a balanced distribution. The ranking of the grassland communities for their functional diversity indices was not robust with the complete-case approach, even for low percentages of missing data. With the imputation methods based on ecological hypotheses, functional diversity indices could be computed with a maximum of 30% of missing data, without affecting the ranking between grassland communities. The multiple imputation method performed well, but not better than single imputation based on ecological hypothesis and adapted to the distribution of the trait values for the functional identity and range of the communities. 
Ecological studies using functional trait databases have to deal with missing data using imputation methods corresponding to their specific needs and making the most out of the information available in the databases. Within this framework, this study indicates the possibilities and limits of single imputation methods based on ecological hypothesis and concludes that they could be useful when studying the ranking of communities for their functional diversity indices. PMID:24772273
Meta-analysis with missing study-level sample variance data.
Chowdhry, Amit K; Dworkin, Robert H; McDermott, Michael P
2016-07-30
We consider a study-level meta-analysis with a normally distributed outcome variable and possibly unequal study-level variances, where the object of inference is the difference in means between a treatment and control group. A common complication in such an analysis is missing sample variances for some studies. A frequently used approach is to impute the weighted (by sample size) mean of the observed variances (mean imputation). Another approach is to include only those studies with variances reported (complete case analysis). Both mean imputation and complete case analysis are only valid under the missing-completely-at-random assumption, and even then the inverse variance weights produced are not necessarily optimal. We propose a multiple imputation method employing gamma meta-regression to impute the missing sample variances. Our method takes advantage of study-level covariates that may be used to provide information about the missing data. Through simulation studies, we show that multiple imputation, when the imputation model is correctly specified, is superior to competing methods in terms of confidence interval coverage probability and type I error probability when testing a specified group difference. Finally, we describe a similar approach to handling missing variances in cross-over studies. Copyright © 2016 John Wiley & Sons, Ltd.
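The mean-imputation baseline described in this abstract (filling each missing variance with the sample-size-weighted mean of the observed variances) can be illustrated with a minimal Python sketch. The function name and the example numbers are hypothetical and are not taken from the paper; the authors' proposed gamma meta-regression method is not shown here.

```python
from typing import List, Optional, Sequence


def impute_variances_weighted_mean(
    variances: Sequence[Optional[float]],
    sample_sizes: Sequence[int],
) -> List[float]:
    """Replace missing (None) variances with the sample-size-weighted
    mean of the variances that were reported."""
    observed = [(v, n) for v, n in zip(variances, sample_sizes) if v is not None]
    if not observed:
        raise ValueError("at least one study must report a variance")
    total_n = sum(n for _, n in observed)
    weighted_mean = sum(v * n for v, n in observed) / total_n
    return [weighted_mean if v is None else v for v in variances]


# Hypothetical three-study example; the second study reports no variance.
variances = [4.0, None, 9.0]
sample_sizes = [10, 25, 30]
completed = impute_variances_weighted_mean(variances, sample_sizes)
```

Because every missing study receives the same single value, this baseline ignores imputation uncertainty, which is the limitation the abstract attributes to it relative to multiple imputation.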
Janet L. Ohmann; Matthew J. Gregory; Emilie B. Henderson; Heather M. Roberts
2011-01-01
Question: How can nearest-neighbour (NN) imputation be used to develop maps of multiple species and plant communities? Location: Western and central Oregon, USA, but methods are applicable anywhere. Methods: We demonstrate NN imputation by mapping woody plant communities for >100 000 km2 of diverse forests and woodlands. Species abundances on...
Attrition Bias Related to Missing Outcome Data: A Longitudinal Simulation Study.
Lewin, Antoine; Brondeel, Ruben; Benmarhnia, Tarik; Thomas, Frédérique; Chaix, Basile
2018-01-01
Most longitudinal studies do not address potential selection biases due to selective attrition. Using empirical data and simulating additional attrition, we investigated the effectiveness of common approaches to handle missing outcome data from attrition in the association between individual education level and change in body mass index (BMI). Using data from the two waves of the French RECORD Cohort Study (N = 7,172), we first examined how inverse probability weighting (IPW) and multiple imputation handled missing outcome data from attrition in the observed data (stage 1). Second, simulating additional missing data in BMI at follow-up under various missing-at-random scenarios, we quantified the impact of attrition and assessed how multiple imputation performed compared to complete case analysis and to a perfectly specified IPW model as a gold standard (stage 2). With the observed data in stage 1, we found an inverse association between individual education and change in BMI, with complete case analysis, as well as with IPW and multiple imputation. When we simulated additional attrition under a missing-at-random pattern (stage 2), the bias increased with the magnitude of selective attrition, and multiple imputation did not correct it. Our simulations revealed that selective attrition in the outcome heavily biased the association of interest. The present article contributes to raising awareness that for missing outcome data, multiple imputation does not do better than complete case analysis. More effort is thus needed during the design phase to understand attrition mechanisms by collecting information on the reasons for dropout.
Imputing data that are missing at high rates using a boosting algorithm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cauthen, Katherine Regina; Lambert, Gregory; Ray, Jaideep
Traditional multiple imputation approaches may perform poorly for datasets with high rates of missingness unless many m imputations are used. This paper implements an alternative machine learning-based approach to imputing data that are missing at high rates. Here, we use boosting to create a strong learner from a weak learner fitted to a dataset missing many observations. This approach may be applied to a variety of types of learners (models). The approach is demonstrated by application to a spatiotemporal dataset for predicting dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal CAR model is boosted to produce imputations, and the overall RMSE from a k-fold cross-validation is used to assess imputation accuracy.
Hopke, P K; Liu, C; Rubin, D B
2001-03-01
Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.
Estimating Interaction Effects With Incomplete Predictor Variables
Enders, Craig K.; Baraldi, Amanda N.; Cham, Heining
2014-01-01
The existing missing data literature does not provide a clear prescription for estimating interaction effects with missing data, particularly when the interaction involves a pair of continuous variables. In this article, we describe maximum likelihood and multiple imputation procedures for this common analysis problem. We outline 3 latent variable model specifications for interaction analyses with missing data. These models apply procedures from the latent variable interaction literature to analyses with a single indicator per construct (e.g., a regression analysis with scale scores). We also discuss multiple imputation for interaction effects, emphasizing an approach that applies standard imputation procedures to the product of 2 raw score predictors. We thoroughly describe the process of probing interaction effects with maximum likelihood and multiple imputation. For both missing data handling techniques, we outline centering and transformation strategies that researchers can implement in popular software packages, and we use a series of real data analyses to illustrate these methods. Finally, we use computer simulations to evaluate the performance of the proposed techniques. PMID:24707955
ERIC Educational Resources Information Center
Aßmann, Christian; Würbach, Ariane; Goßmann, Solange; Geissler, Ferdinand; Bela, Anika
2017-01-01
Large-scale surveys typically exhibit data structures characterized by rich mutual dependencies between surveyed variables and individual-specific skip patterns. Despite high efforts in fieldwork and questionnaire design, missing values inevitably occur. One approach for handling missing values is to provide multiply imputed data sets, thus…
Resche-Rigon, Matthieu; White, Ian R
2018-06-01
In multilevel settings such as individual participant data meta-analysis, a variable is 'systematically missing' if it is wholly missing in some clusters and 'sporadically missing' if it is partly missing in some clusters. Previously proposed methods to impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and show that even the simplest joint multilevel model implies conditional models which involve cluster means and heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in a multilevel MICE procedure, even when cluster means are not included in the imputation models.
A Hot-Deck Multiple Imputation Procedure for Gaps in Longitudinal Recurrent Event Histories
Wang, Chia-Ning; Little, Roderick; Nan, Bin; Harlow, Siobán D.
2012-01-01
Summary We propose a regression-based hot deck multiple imputation method for gaps of missing data in longitudinal studies, where subjects experience a recurrent event process and a terminal event. Examples are repeated asthma episodes and death, or menstrual periods and the menopause, as in our motivating application. Research interest concerns the onset time of a marker event, defined by the recurrent-event process, or the duration from this marker event to the final event. Gaps in the recorded event history make it difficult to determine the onset time of the marker event, and hence, the duration from onset to the final event. Simple approaches such as jumping gap times or dropping cases with gaps have obvious limitations. We propose a procedure for imputing information in the gaps by substituting information in the gap from a matched individual with a completely recorded history in the corresponding interval. Predictive Mean Matching is used to incorporate information on longitudinal characteristics of the repeated process and the final event time. Multiple imputation is used to propagate imputation uncertainty. The procedure is applied to an important data set for assessing the timing and duration of the menopausal transition. The performance of the proposed method is assessed by a simulation study. PMID:21361886
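As a rough illustration of the predictive mean matching idea mentioned in this abstract, the following Python sketch selects a donor at random from the k complete cases whose predicted means lie closest to the predicted mean of the incomplete case. This is generic predictive mean matching under simplifying assumptions, not the authors' gap-filling procedure; the function name, the donor tuples, and the value of k are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def pmm_draw(
    predicted_for_missing: float,
    donors: Sequence[Tuple[float, float]],
    k: int,
    rng: random.Random,
) -> float:
    """Draw one imputed value by predictive mean matching.

    Each donor is a (predicted_mean, observed_value) pair from a case
    with a complete record. The k donors with the closest predicted
    means form the donor pool, and one observed value is drawn from it.
    """
    if k < 1 or not donors:
        raise ValueError("need at least one donor and k >= 1")
    pool = sorted(donors, key=lambda d: abs(d[0] - predicted_for_missing))[:k]
    return rng.choice(pool)[1]


# Hypothetical donors: (predicted mean, observed value).
donors = [(1.0, 10.0), (2.0, 20.0), (5.0, 50.0)]
rng = random.Random(0)

# Repeating the draw yields multiple imputations for the same gap.
imputations: List[float] = [pmm_draw(2.2, donors, k=2, rng=rng) for _ in range(5)]
```

Drawing repeatedly from the donor pool, rather than always taking the single nearest donor, is what lets the imputations vary and so carry imputation uncertainty into the analysis.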
ERIC Educational Resources Information Center
Fish, Laurel J.; Halcoussis, Dennis; Phillips, G. Michael
2017-01-01
The Monte Carlo method and related multiple imputation methods are traditionally used in math, physics and science to estimate and analyze data and are now becoming standard tools in analyzing business and financial problems. However, few sources explain the application of the Monte Carlo method for individuals and business professionals who are…
Multiple Imputation for Incomplete Data in Epidemiologic Studies
Harel, Ofer; Mitchell, Emily M; Perkins, Neil J; Cole, Stephen R; Tchetgen Tchetgen, Eric J; Sun, BaoLuo; Schisterman, Enrique F
2018-01-01
Abstract Epidemiologic studies are frequently susceptible to missing information. Omitting observations with missing variables remains a common strategy in epidemiologic studies, yet this simple approach can often severely bias parameter estimates of interest if the values are not missing completely at random. Even when missingness is completely random, complete-case analysis can reduce the efficiency of estimated parameters, because large amounts of available data are simply tossed out with the incomplete observations. Alternative methods for mitigating the influence of missing information, such as multiple imputation, are becoming an increasingly popular strategy in order to retain all available information, reduce potential bias, and improve efficiency in parameter estimation. In this paper, we describe the theoretical underpinnings of multiple imputation, and we illustrate application of this method as part of a collaborative challenge to assess the performance of various techniques for dealing with missing data (Am J Epidemiol. 2018;187(3):568–575). We detail the steps necessary to perform multiple imputation on a subset of data from the Collaborative Perinatal Project (1959–1974), where the goal is to estimate the odds of spontaneous abortion associated with smoking during pregnancy. PMID:29165547
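The abstract does not give formulas, but the pooling step of multiple imputation is conventionally done with Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal Python sketch follows; the function name and the numbers are hypothetical and are not drawn from the Collaborative Perinatal Project analysis.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(
    estimates: Sequence[float],
    within_variances: Sequence[float],
) -> Tuple[float, float]:
    """Pool m per-imputation estimates with Rubin's rules.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    pooled = mean(estimates)
    within = mean(within_variances)   # average squared standard error
    between = variance(estimates)     # sample variance, denominator m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical log odds ratios and squared standard errors from m = 3
# imputed data sets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

For odds ratios, pooling is normally carried out on the log scale, with the pooled estimate and its confidence limits exponentiated afterwards.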
Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C
2008-01-10
Gene expression data frequently contain missing values, however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remains largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures x time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. 
Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better in data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
Pappas, D J; Lizee, A; Paunic, V; Beutner, K R; Motyer, A; Vukcevic, D; Leslie, S; Biesiada, J; Meller, J; Taylor, K D; Zheng, X; Zhao, L P; Gourraud, P-A; Hollenbach, J A; Mack, S J; Maiers, M
2018-05-22
Four single nucleotide polymorphism (SNP)-based human leukocyte antigen (HLA) imputation methods (e-HLA, HIBAG, HLA*IMP:02 and MAGPrediction) were trained using 1000 Genomes SNP and HLA genotypes and assessed for their ability to accurately impute molecular HLA-A, -B, -C and -DRB1 genotypes in the Human Genome Diversity Project cell panel. Imputation concordance was high (>89%) across all methods for both HLA-A and HLA-C, but HLA-B and HLA-DRB1 proved generally difficult to impute. Overall, <27.8% of subjects were correctly imputed for all HLA loci by any method. Concordance across all loci was not enhanced via the application of confidence thresholds; reliance on confidence scores across methods only led to noticeable improvement (+3.2%) for HLA-DRB1. As the HLA complex is highly relevant to the study of human health and disease, a standardized assessment of SNP-based HLA imputation methods is crucial for advancing genomic research. Considerable room remains for the improvement of HLA-B and especially HLA-DRB1 imputation methods, and no imputation method is as accurate as molecular genotyping. The application of large, ancestrally diverse HLA and SNP reference data sets and multiple imputation methods has the potential to make SNP-based HLA imputation methods a tractable option for determining HLA genotypes.
Bounthavong, Mark; Watanabe, Jonathan H; Sullivan, Kevin M
2015-04-01
The complete capture of all values for each variable of interest in pharmacy research studies remains aspirational. The absence of these possibly influential values is a common problem for pharmacist investigators. Failure to account for missing data may translate to biased study findings and conclusions. Our goal in this analysis was to apply validated statistical methods for missing data to a previously analyzed data set and compare results when missing data methods were implemented versus standard analytics that ignore missing data effects. Using data from a retrospective cohort study, the statistical method of multiple imputation was used to provide regression-based estimates of the missing values to improve available data usable for study outcomes measurement. These findings were then contrasted with a complete-case analysis that restricted estimation to subjects in the cohort that had no missing values. Odds ratios were compared to assess differences in findings of the analyses. A nonadjusted regression analysis ("crude analysis") was also performed as a reference for potential bias. Veterans Integrated Systems Network that includes VA facilities in the Southern California and Nevada regions. New statin users between November 30, 2006, and December 2, 2007, with a diagnosis of dyslipidemia. We compared the odds ratios (ORs) and 95% confidence intervals (CIs) for the crude, complete-case, and multiple imputation analyses for the end points of a 25% or greater reduction in atherogenic lipids. Data were missing for 21.5% of identified patients (1665 subjects of 7739). Regression model results were similar for the crude, complete-case, and multiple imputation analyses with overlap of 95% confidence limits at each end point. The crude, complete-case, and multiple imputation ORs (95% CIs) for a 25% or greater reduction in low-density lipoprotein cholesterol were 3.5 (95% CI 3.1-3.9), 4.3 (95% CI 3.8-4.9), and 4.1 (95% CI 3.7-4.6), respectively. 
The crude, complete-case, and multiple imputation ORs (95% CIs) for a 25% or greater reduction in non-high-density lipoprotein cholesterol were 3.5 (95% CI 3.1-3.9), 4.5 (95% CI 4.0-5.2), and 4.4 (95% CI 3.9-4.9), respectively. The crude, complete-case, and multiple imputation ORs (95% CIs) for 25% or greater reduction in TGs were 3.1 (95% CI 2.8-3.6), 4.0 (95% CI 3.5-4.6), and 4.1 (95% CI 3.6-4.6), respectively. The use of the multiple imputation method to account for missing data did not alter conclusions based on a complete-case analysis. Given the frequency of missing data in research using electronic health records and pharmacy claims data, multiple imputation may play an important role in the validation of study findings. © 2015 Pharmacotherapy Publications, Inc.
Covariate Selection for Multilevel Models with Missing Data
Marino, Miguel; Buxton, Orfeu M.; Li, Yi
2017-01-01
Missing covariate data hampers variable selection in multilevel regression settings. Current variable selection techniques for multiply-imputed data commonly address missingness in the predictors through list-wise deletion and stepwise-selection methods which are problematic. Moreover, most variable selection methods are developed for independent linear regression models and do not accommodate multilevel mixed effects regression models with incomplete covariate data. We develop a novel methodology that is able to perform covariate selection across multiply-imputed data for multilevel random effects models when missing data is present. Specifically, we propose to stack the multiply-imputed data sets from a multiple imputation procedure and to apply a group variable selection procedure through group lasso regularization to assess the overall impact of each predictor on the outcome across the imputed data sets. Simulations confirm the advantageous performance of the proposed method compared with the competing methods. We applied the method to reanalyze the Healthy Directions-Small Business cancer prevention study, which evaluated a behavioral intervention program targeting multiple risk-related behaviors in a working-class, multi-ethnic population. PMID:28239457
ERIC Educational Resources Information Center
Acock, Alan C.
2005-01-01
Less than optimum strategies for missing values can produce biased estimates, distorted statistical power, and invalid conclusions. After reviewing traditional approaches (listwise, pairwise, and mean substitution), selected alternatives are covered including single imputation, multiple imputation, and full information maximum likelihood…
Caron, Alexandre; Clement, Guillaume; Heyman, Christophe; Aernout, Eva; Chazard, Emmanuel; Le Tertre, Alain
2015-01-01
Incompleteness of epidemiological databases is a major drawback when it comes to analyzing data. We conceived an epidemiological study to assess the association between newborn thyroid function and exposure to perchlorates found in the tap water of the mother's home. Perchlorate exposure was known for only 9% of newborns. The aim of our study was to design, test and evaluate an original method for imputing the perchlorate exposure of newborns based on their maternity of birth. A first database contained an exhaustive collection of newborns' thyroid function measurements taken during systematic neonatal screening. In this database, the municipality of residence of the newborn's mother was available only for 2012. Between 2004 and 2011, the closest data available was the municipality of the maternity of birth. Exposure was assessed using a second database, which contained the perchlorate levels for each municipality. We computed the catchment area of every maternity ward based on the French nationwide exhaustive database of inpatient stays. Municipality, and consequently perchlorate exposure, was imputed by a weighted draw in the catchment area. Missing values for the remaining covariates were imputed by chained equations. A linear mixture model was computed on each imputed dataset. We compared odds ratios (ORs) and 95% confidence intervals (95% CI) estimated on real versus imputed 2012 data. The same model was then fitted to the whole imputed database. The ORs estimated on 36,695 observations by our multiple imputation method are comparable to those from the real 2012 data. On the 394,979 observations of the whole database, the ORs remain stable but the 95% CIs tighten considerably. The model estimates computed on imputed data are similar to those calculated on real data. The main advantage of multiple imputation is to provide unbiased estimates of the ORs while maintaining their variances.
Thus, our method will be used to increase the statistical power of future studies by including all 394,979 newborns.
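The weighted draw described in this abstract can be illustrated with a minimal sketch. It assumes that each maternity ward's catchment area is summarised as a count of inpatient stays per municipality; the function name `impute_municipality`, the example municipalities, and the counts are all hypothetical and are not taken from the study.

```python
import random


def impute_municipality(catchment, rng):
    """Draw one municipality with probability proportional to its weight.

    `catchment` maps municipality name -> weight (here, a count of stays).
    """
    names = list(catchment)
    weights = [catchment[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical catchment area of one maternity ward.
catchment_ward_a = {"Town A": 700, "Town B": 250, "Town C": 50}

# Repeating the draw with different random states yields several imputed
# data sets, which is what makes the procedure a multiple imputation.
rng = random.Random(0)
draws = [impute_municipality(catchment_ward_a, rng) for _ in range(5)]
print(draws)
```

Once a municipality has been drawn, the perchlorate level recorded for it would be assigned to the newborn, as the abstract describes.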
Phung, Dung; Mueller, Jochen; Lai, Foon Yin; O'Brien, Jake; Dang, Nhung; Morawska, Lidia; Thai, Phong K
2017-07-01
Ambient temperature is known to affect population health, but assessing its impact with the traditional cohort approach is resource intensive. Wastewater-based epidemiology (WBE) could be an alternative to the traditional approach. This study provides the first evaluation of whether WBE can be used to assess the impact of temperature exposure on a population in South East Queensland, Australia, using selected pharmaceuticals and personal care products (PPCPs) as biomarkers. Daily loads of eight PPCPs in wastewater collected from a wastewater treatment plant were measured from February 2011 to June 2012. Corresponding daily weather data were obtained from the closest weather station. Missing PPCP data were handled using multiple imputation (MI); we then used a one-way between-groups analysis of variance to examine the seasonal effect on daily variation of PPCPs. Finally, MI estimation was performed to evaluate the continuous relationship between daily average temperature and each multiply-imputed PPCP using time-series regression analysis. The results indicated that an increase of 1°C in average temperature was associated with a decrease of 1.3 g/d (95% CI: -2.2 to -0.4, p<0.05) for atenolol, an increase of 36.5 g/d (95% CI: 25.2 to 47.8, p<0.01) for acesulfame, and an increase of 0.8 g/d (95% CI: 0.02 to 1.55, p=0.05) for naproxen. No significant association was observed between temperature and the remaining PPCPs: caffeine, carbamazepine, codeine, hydrochlorothiazide, and salicylic acid. The findings suggest that consumption of sweetened drinks, risk of worsening cardiovascular conditions, and pain are associated with variation in ambient temperature. WBE can thus be used as a complementary method to traditional cohort studies in the epidemiological evaluation of associations between environmental factors and health outcomes, provided that specific biomarkers of such health outcomes can be identified. Copyright © 2017 Elsevier Inc.
Zhou, Hanzhi; Elliott, Michael R.; Raghunathan, Trivellore E.
2017-01-01
Multistage sampling is often employed in survey samples for cost and convenience. However, accounting for clustering features when generating datasets for multiple imputation is a nontrivial task, particularly when, as is often the case, cluster sampling is accompanied by unequal probabilities of selection, necessitating case weights. Thus, multiple imputation often ignores complex sample designs and assumes simple random sampling when generating imputations, even though failing to account for complex sample design features is known to yield biased estimates and confidence intervals that have incorrect nominal coverage. In this article, we extend a recently developed, weighted, finite-population Bayesian bootstrap procedure to generate synthetic populations conditional on complex sample design data that can be treated as simple random samples at the imputation stage, obviating the need to directly model design features for imputation. We develop two forms of this method: one where the probabilities of selection are known at the first and second stages of the design, and the other, more common in public use files, where only the final weight based on the product of the two probabilities is known. We show that this method has advantages in terms of bias, mean square error, and coverage properties over methods where sample designs are ignored, with little loss in efficiency, even when compared with correct fully parametric models. An application is made using the National Automotive Sampling System Crashworthiness Data System, a multistage, unequal probability sample of U.S. passenger vehicle crashes, which suffers from a substantial amount of missing data in “Delta-V,” a key crash severity measure. PMID:29226161
Martín-Merino, Elisa; Calderón-Larrañaga, Amaia; Hawley, Samuel; Poblador-Plou, Beatriz; Llorente-García, Ana; Petersen, Irene; Prieto-Alhambra, Daniel
2018-01-01
Background: Missing data are often an issue in electronic medical records (EMRs) research. However, there are many ways that people deal with missing data in drug safety studies. Aim: To compare the risk estimates resulting from different strategies for the handling of missing data in the study of venous thromboembolism (VTE) risk associated with antiosteoporotic medications (AOM). Methods: New users of AOM (alendronic acid, other bisphosphonates, strontium ranelate, selective estrogen receptor modulators, teriparatide, or denosumab) aged ≥50 years during 1998–2014 were identified in two Spanish (the Base de datos para la Investigación Farmacoepidemiológica en Atención Primaria [BIFAP] and EpiChron cohort) and one UK (Clinical Practice Research Datalink [CPRD]) EMR. Hazard ratios (HRs) according to AOM (with alendronic acid as reference) were calculated adjusting for VTE risk factors, body mass index (that was missing in 61% of patients included in the three databases), and smoking (that was missing in 23% of patients) in the year of AOM therapy initiation. HRs and standard errors obtained using cross-sectional multiple imputation (MI) (reference method) were compared to complete case (CC) analysis – using only patients with complete data – and longitudinal MI – adding to the cross-sectional MI model the body mass index/smoking values as recorded in the year before and after therapy initiation. Results: Overall, 422/95,057 (0.4%), 19/12,688 (0.1%), and 2,051/161,202 (1.3%) VTE cases/participants were seen in BIFAP, EpiChron, and CPRD, respectively. HRs moved from 100.00% underestimation to 40.31% overestimation in CC compared with cross-sectional MI, while longitudinal MI methods provided similar risk estimates compared with cross-sectional MI. Precision for HR improved in cross-sectional MI versus CC by up to 160.28%, while longitudinal MI improved precision (compared with cross-sectional) only minimally (up to 0.80%).
Conclusion: CC may substantially affect relative risk estimation in EMR-based drug safety studies, since missing data are not often completely at random. Little improvement was seen in these data in terms of power with the inclusion of longitudinal MI compared with cross-sectional MI. The strategy for handling missing data in drug safety studies can have a large impact on both risk estimates and precision.
Falcaro, Milena; Carpenter, James R
2017-06-01
Population-based net survival by tumour stage at diagnosis is a key measure in cancer surveillance. Unfortunately, data on tumour stage are often missing for a non-negligible proportion of patients and the mechanism giving rise to the missingness is usually anything but completely at random. In this setting, restricting analysis to the subset of complete records gives typically biased results. Multiple imputation is a promising practical approach to the issues raised by the missing data, but its use in conjunction with the Pohar-Perme method for estimating net survival has not been formally evaluated. We performed a resampling study using colorectal cancer population-based registry data to evaluate the ability of multiple imputation, used along with the Pohar-Perme method, to deliver unbiased estimates of stage-specific net survival and recover missing stage information. We created 1000 independent data sets, each containing 5000 patients. Stage data were then made missing at random under two scenarios (30% and 50% missingness). Complete records analysis showed substantial bias and poor confidence interval coverage. Across both scenarios our multiple imputation strategy virtually eliminated the bias and greatly improved confidence interval coverage. In the presence of missing stage data complete records analysis often gives severely biased results. We showed that combining multiple imputation with the Pohar-Perme estimator provides a valid practical approach for the estimation of stage-specific colorectal cancer net survival. As usual, when the percentage of missing data is high the results should be interpreted cautiously and sensitivity analyses are recommended. Copyright © 2017 Elsevier Ltd. All rights reserved.
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Su, Edwin P; Grauer, Jonathan N
2018-03-01
Despite the advantages of large, national datasets, one continuing concern is missing data values. Complete case analysis, where only cases with complete data are analyzed, is commonly used rather than more statistically rigorous approaches such as multiple imputation. This study characterizes the potential selection bias introduced using complete case analysis and compares the results of common regressions using both techniques following unicompartmental knee arthroplasty. Patients undergoing unicompartmental knee arthroplasty were extracted from the 2005 to 2015 National Surgical Quality Improvement Program. As examples, the demographics of patients with and without missing preoperative albumin and hematocrit values were compared. Missing data were then treated with both complete case analysis and multiple imputation (an approach that reproduces the variation and associations that would have been present in a full dataset) and the conclusions of common regressions for adverse outcomes were compared. A total of 6117 patients were included, of which 56.7% were missing at least one value. Younger, female, and healthier patients were more likely to have missing preoperative albumin and hematocrit values. The use of complete case analysis removed 3467 patients from the study in comparison with multiple imputation which included all 6117 patients. The 2 methods of handling missing values led to differing associations of low preoperative laboratory values with commonly studied adverse outcomes. The use of complete case analysis can introduce selection bias and may lead to different conclusions in comparison with the statistically rigorous multiple imputation approach. Joint surgeons should consider the methods of handling missing values when interpreting arthroplasty research. Copyright © 2017 Elsevier Inc. All rights reserved.
Liu, Benmei; Yu, Mandi; Graubard, Barry I; Troiano, Richard P; Schenker, Nathaniel
2016-01-01
The Physical Activity Monitor (PAM) component was introduced into the 2003-2004 National Health and Nutrition Examination Survey (NHANES) to collect objective information on physical activity including both movement intensity counts and ambulatory steps. Due to an error in the accelerometer device initialization process, the steps data were missing for all participants in several primary sampling units (PSUs), typically a single county or group of contiguous counties, who had intensity count data from their accelerometers. To avoid potential bias and loss in efficiency in estimation and inference involving the steps data, we considered methods to accurately impute the missing values for steps collected in the 2003-2004 NHANES. The objective was to come up with an efficient imputation method which minimized model-based assumptions. We adopted a multiple imputation approach based on Additive Regression, Bootstrapping and Predictive mean matching (ARBP) methods. This method fits alternative conditional expectation (ace) models, which use an automated procedure to estimate optimal transformations for both the predictor and response variables. This paper describes the approaches used in this imputation and evaluates the methods by comparing the distributions of the original and the imputed data. A simulation study using the observed data is also conducted as part of the model diagnostics. Finally some real data analyses are performed to compare the before and after imputation results. PMID:27488606
[Prevention and handling of missing data in clinical trials].
Jiang, Zhi-wei; Li, Chan-juan; Wang, Ling; Xia, Jie-lai
2015-11-01
Missing data are a common and often unavoidable issue in clinical trials. They not only lower trial power but also introduce bias into trial results. Therefore, on the one hand, missing data handling methods are employed in data analysis; on the other hand, it is vital to prevent missing data in the trial itself. Prevention of missing data should come first. From the perspective of data, measures should first be taken at the stages of protocol design, data collection and data checking to enhance patient compliance and reduce unnecessary missing data. Second, the causes of confirmed missing data in the trial should be reported and recorded in detail; these are very important for determining the missing data mechanism and choosing a suitable handling method, e.g., last observation carried forward (LOCF), multiple imputation (MI), or a mixed-effects model for repeated measures (MMRM).
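Of the handling methods named in this abstract, last observation carried forward (LOCF) is the simplest to show concretely. The sketch below is illustrative only: it assumes one participant's repeated measurements are stored in visit order with `None` marking a missed visit, and the function name `locf` and the example values are hypothetical.

```python
def locf(values):
    """Replace each missing value (None) with the most recent observed value.

    Values that are missing before any observation exists stay None,
    because there is nothing to carry forward.
    """
    filled = []
    last_observed = None
    for value in values:
        if value is not None:
            last_observed = value
        filled.append(last_observed)
    return filled


# Hypothetical outcome measured at five visits; visits 2, 4 and 5 were missed.
visits = [7.2, None, 6.8, None, None]
print(locf(visits))  # [7.2, 7.2, 6.8, 6.8, 6.8]
```

LOCF is a single imputation: it fills each gap with one fixed value and so does not reflect the uncertainty about the missing measurement, which is one reason the choice of method should follow from the recorded causes of missingness.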
[Imputing missing data in public health: general concepts and application to dichotomous variables].
Hernández, Gilma; Moriña, David; Navarro, Albert
Missing data in collected variables are common in health surveys, but imputing them at the time of analysis is not. Working with imputed data may have certain benefits regarding the precision of the estimators and the unbiased identification of associations between variables. The imputation process is probably still little understood by many non-statisticians, who view it as highly complex and with an uncertain goal. To clarify these questions, this note aims to provide a straightforward, non-exhaustive overview of the imputation process to enable public health researchers to ascertain its strengths, in the context of dichotomous variables, which are commonplace in public health. To illustrate these concepts, an example in which missing data are handled by means of simple and multiple imputation is introduced. Copyright © 2017 SESPAS. Published by Elsevier España, S.L.U. All rights reserved.
Godin, Judith; Keefe, Janice; Andrew, Melissa K
2017-04-01
Missing values are commonly encountered on the Mini Mental State Examination (MMSE), particularly when administered to frail older people. This presents challenges for MMSE scoring in research settings. We sought to describe missingness in MMSEs administered in long-term-care facilities (LTCF) and to compare and contrast approaches to dealing with missing items. As part of the Care and Construction project in Nova Scotia, Canada, LTCF residents completed an MMSE. Different methods of dealing with missing values (e.g., use of raw scores, raw scores/number of items attempted, scale-level multiple imputation [MI], and blended approaches) are compared to item-level MI. The MMSE was administered to 320 residents living in 23 LTCF. The sample was predominately female (73%), and 38% of participants were aged >85 years. At least one item was missing from 122 (38.2%) of the MMSEs. Data were not Missing Completely at Random (MCAR), χ²(1110) = 1,351, p < 0.001. Using raw scores for those missing <6 items in combination with scale-level MI resulted in the regression coefficients and standard errors closest to item-level MI. Patterns of missing items often suggest systematic problems, such as trouble with manual dexterity, literacy, or visual impairment. While these observations may be relatively easy to take into account in clinical settings, non-random missingness presents challenges for research and must be considered in statistical analyses. We present suggestions for dealing with missing MMSE data based on the extent of missingness and the goal of analyses. Copyright © 2016 The Authors. Production and hosting by Elsevier B.V. All rights reserved.
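One of the simpler scoring approaches listed in this abstract, "raw scores/number of items attempted", can be sketched as follows. This is an illustration under stated assumptions, not the authors' scoring code: it assumes every item is scored 0 or 1 and that an item that was not attempted is recorded as `None`. The function name `score_per_attempted_item` and the responses are hypothetical.

```python
def score_per_attempted_item(items):
    """Return the raw score divided by the number of items attempted.

    `items` holds 1 (correct), 0 (incorrect) or None (not attempted).
    Returns None when no item was attempted, since the ratio is undefined.
    """
    attempted = [item for item in items if item is not None]
    if not attempted:
        return None
    return sum(attempted) / len(attempted)


# Hypothetical responses: four items, one of them not attempted.
responses = [1, 0, None, 1]
print(score_per_attempted_item(responses))
```

This ratio treats skipped items as uninformative, which is questionable when items are skipped for systematic reasons such as visual impairment, the situation the abstract warns about.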
Wallert, John; Tomasoni, Mattia; Madison, Guy; Held, Claes
2017-07-05
Machine learning algorithms hold potential for improved prediction of all-cause mortality in cardiovascular patients, yet have not previously been developed with high-quality population data. This study compared four popular machine learning algorithms trained on unselected, nation-wide population data from Sweden to solve the binary classification problem of predicting survival versus non-survival 2 years after first myocardial infarction (MI). This prospective national registry study for prognostic accuracy validation of predictive models used data from 51,943 complete first MI cases as registered during 6 years (2006-2011) in the national quality register SWEDEHEART/RIKS-HIA (90% coverage of all MIs in Sweden) with follow-up in the Cause of Death register (> 99% coverage). Primary outcome was AUROC (C-statistic) performance of each model on the untouched test set (40% of cases) after model development on the training set (60% of cases) with the full (39) predictor set. Model AUROCs were bootstrapped and compared, correcting the P-values for multiple comparisons with the Bonferroni method. Secondary outcomes were derived when varying sample size (1-100% of total) and predictor sets (39, 10, and 5) for each model. Analyses were repeated on 79,869 completed cases after multivariable imputation of predictors. A Support Vector Machine with a radial basis kernel developed on 39 predictors had the highest complete cases performance on the test set (AUROC = 0.845, PPV = 0.280, NPV = 0.966) outperforming Boosted C5.0 (0.845 vs. 0.841, P = 0.028) but not significantly higher than Logistic Regression or Random Forest. Models converged to the point of algorithm indifference with increased sample size and predictors. Using the top five predictors also produced good classifiers. Imputed analyses had slightly higher performance. 
Improved mortality prediction at hospital discharge after first MI is important for identifying high-risk individuals eligible for intensified treatment and care. All models performed accurately and similarly and because of the superior national coverage, the best model can potentially be used to better differentiate new patients, allowing for improved targeting of limited resources. Future research should focus on further model development and investigate possibilities for implementation.
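The Bonferroni correction mentioned in this abstract multiplies each P-value by the number of comparisons and caps the result at 1. The sketch below shows only that arithmetic; the function name `bonferroni` and the P-values are hypothetical and are not the study's results.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted P-values: min(1, p * number of comparisons)."""
    n_comparisons = len(p_values)
    return [min(1.0, p * n_comparisons) for p in p_values]


# Hypothetical raw P-values from three pairwise model comparisons.
raw = [0.01, 0.04, 0.5]
adjusted = bonferroni(raw)
print(adjusted)
```

An adjusted value is then compared with the usual significance level, for example 0.05.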
References for Haplotype Imputation in the Big Data Era
Li, Wenzhi; Xu, Wei; Li, Qiling; Ma, Li; Song, Qing
2016-01-01
Imputation is a powerful in silico approach for filling in missing values in big datasets. This process requires a reference panel, which is a collection of big data from which the missing information can be extracted and imputed. Haplotype imputation requires ethnicity-matched references; a mismatched reference panel will significantly reduce the quality of imputation. However, existing big datasets cover only a small number of ethnicities, and ethnicity-matched references are lacking for many ethnic populations in the world, which has hampered haplotype imputation and its downstream applications. To solve this issue, several approaches have been proposed and explored, including the mixed reference panel, the internal reference panel and the genotype-converted reference panel. This review article describes and compares these approaches. Increasing evidence shows that gene activity and function are not dictated by just one or two genetic elements; instead, they are dictated by cis-interactions of multiple elements. Cis-interactions require the interacting elements to be on the same chromosome molecule; therefore, haplotype analysis is essential for investigating cis-interactions among multiple genetic variants at different loci, and appears to be especially important for studying common diseases. It will be valuable in a wide spectrum of applications, from academic research to clinical diagnosis, prevention, treatment, and the pharmaceutical industry. PMID:27274952
Liu, Siwei; Molenaar, Peter C M
2014-12-01
This article introduces iVAR, an R program for imputing missing data in multivariate time series on the basis of vector autoregressive (VAR) models. We conducted a simulation study to compare iVAR with three methods for handling missing data: listwise deletion, imputation with sample means and variances, and multiple imputation ignoring time dependency. The results showed that iVAR produces better estimates for the cross-lagged coefficients than do the other three methods. We demonstrate the use of iVAR with an empirical example of time series electrodermal activity data and discuss the advantages and limitations of the program.
Fang, Lingzhao; Sørensen, Peter; Sahana, Goutam; Panitz, Frank; Su, Guosheng; Zhang, Shengli; Yu, Ying; Li, Bingjie; Ma, Li; Liu, George; Lund, Mogens Sandø; Thomsen, Bo
2018-06-19
MicroRNAs (miRNA) are key modulators of gene expression and so act as putative fine-tuners of complex phenotypes. Here, we hypothesized that causal variants of complex traits are enriched in miRNAs and miRNA-target networks. First, we conducted a genome-wide association study (GWAS) for seven functional and milk production traits using imputed sequence variants (13~15 million) and >10,000 animals from three dairy cattle breeds, i.e., Holstein (HOL), Nordic red cattle (RDC) and Jersey (JER). Second, we analyzed for enrichments of association signals in miRNAs and their miRNA-target networks. Our results demonstrated that genomic regions harboring miRNA genes were significantly (P < 0.05) enriched with GWAS signals for milk production traits and mastitis, and that enrichments within miRNA-target gene networks were significantly higher than in random gene-sets for the majority of traits. Furthermore, most between-trait and across-breed correlations of enrichments with miRNA-target networks were significantly greater than with random gene-sets, suggesting pleiotropic effects of miRNAs. Intriguingly, genes that were differentially expressed in response to mammary gland infections were significantly enriched in the miRNA-target networks associated with mastitis. All these findings were consistent across three breeds. Collectively, our observations demonstrate the importance of miRNAs and their targets for the expression of complex traits.
Wallace, Meredith L; Anderson, Stewart J; Mazumdar, Sati
2010-12-20
Missing covariate data present a challenge to tree-structured methodology due to the fact that a single tree model, as opposed to an estimated parameter value, may be desired for use in a clinical setting. To address this problem, we suggest a multiple imputation algorithm that adds draws of stochastic error to a tree-based single imputation method presented by Conversano and Siciliano (Technical Report, University of Naples, 2003). Unlike previously proposed techniques for accommodating missing covariate data in tree-structured analyses, our methodology allows the modeling of complex and nonlinear covariate structures while still resulting in a single tree model. We perform a simulation study to evaluate our stochastic multiple imputation algorithm when covariate data are missing at random and compare it to other currently used methods. Our algorithm is advantageous for identifying the true underlying covariate structure when complex data and larger percentages of missing covariate observations are present. It is competitive with other current methods with respect to prediction accuracy. To illustrate our algorithm, we create a tree-structured survival model for predicting time to treatment response in older, depressed adults. Copyright © 2010 John Wiley & Sons, Ltd.
DOT National Transportation Integrated Search
2002-01-01
The National Center for Statistics and Analysis (NCSA) of the National Highway Traffic Safety Administration (NHTSA) has undertaken several approaches to remedy the problem of missing blood alcohol test results in the Fatality Analysis Reporting ...
Missing continuous outcomes under covariate dependent missingness in cluster randomised trials
Hossain, Anower; Diaz-Ordaz, Karla; Bartlett, Jonathan W
2016-01-01
Attrition is a common occurrence in cluster randomised trials which leads to missing outcome data. Two approaches for analysing such trials are cluster-level analysis and individual-level analysis. This paper compares the performance of unadjusted cluster-level analysis, baseline covariate adjusted cluster-level analysis and linear mixed model analysis, under baseline covariate dependent missingness in continuous outcomes, in terms of bias, average estimated standard error and coverage probability. The methods of complete records analysis and multiple imputation are used to handle the missing outcome data. We considered four scenarios, with the missingness mechanism and baseline covariate effect on outcome either the same or different between intervention groups. We show that both unadjusted cluster-level analysis and baseline covariate adjusted cluster-level analysis give unbiased estimates of the intervention effect only if both intervention groups have the same missingness mechanisms and there is no interaction between baseline covariate and intervention group. Linear mixed model and multiple imputation give unbiased estimates under all four considered scenarios, provided that an interaction of intervention and baseline covariate is included in the model when appropriate. Cluster mean imputation has been proposed as a valid approach for handling missing outcomes in cluster randomised trials. We show that cluster mean imputation only gives unbiased estimates when missingness mechanism is the same between the intervention groups and there is no interaction between baseline covariate and intervention group. Multiple imputation shows overcoverage for small number of clusters in each intervention group. PMID:27177885
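Cluster mean imputation, which this abstract evaluates, replaces each missing outcome with the mean of the observed outcomes in the same cluster. The sketch below is a minimal illustration; the data layout (a dictionary from cluster label to a list of outcomes, with `None` for a missing outcome), the function name `cluster_mean_impute`, and the values are assumptions made for the example.

```python
def cluster_mean_impute(outcomes_by_cluster):
    """Fill each missing outcome (None) with its cluster's observed mean.

    Clusters with no observed outcome are returned unchanged, because no
    cluster mean can be computed for them.
    """
    imputed = {}
    for cluster, outcomes in outcomes_by_cluster.items():
        observed = [y for y in outcomes if y is not None]
        if not observed:
            imputed[cluster] = list(outcomes)
            continue
        cluster_mean = sum(observed) / len(observed)
        imputed[cluster] = [cluster_mean if y is None else y for y in outcomes]
    return imputed


# Hypothetical continuous outcomes in two clusters.
trial = {"clinic_1": [1.0, None, 3.0], "clinic_2": [4.0, 6.0, None, None]}
print(cluster_mean_impute(trial))
```

As the abstract reports, this approach gives unbiased estimates only when the missingness mechanism is the same in both intervention groups and there is no interaction between the baseline covariate and intervention group.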
Missing continuous outcomes under covariate dependent missingness in cluster randomised trials.
Hossain, Anower; Diaz-Ordaz, Karla; Bartlett, Jonathan W
2017-06-01
Attrition is a common occurrence in cluster randomised trials which leads to missing outcome data. Two approaches for analysing such trials are cluster-level analysis and individual-level analysis. This paper compares the performance of unadjusted cluster-level analysis, baseline covariate adjusted cluster-level analysis and linear mixed model analysis, under baseline covariate dependent missingness in continuous outcomes, in terms of bias, average estimated standard error and coverage probability. The methods of complete records analysis and multiple imputation are used to handle the missing outcome data. We considered four scenarios, with the missingness mechanism and baseline covariate effect on outcome either the same or different between intervention groups. We show that both unadjusted cluster-level analysis and baseline covariate adjusted cluster-level analysis give unbiased estimates of the intervention effect only if both intervention groups have the same missingness mechanisms and there is no interaction between baseline covariate and intervention group. Linear mixed model and multiple imputation give unbiased estimates under all four considered scenarios, provided that an interaction of intervention and baseline covariate is included in the model when appropriate. Cluster mean imputation has been proposed as a valid approach for handling missing outcomes in cluster randomised trials. We show that cluster mean imputation only gives unbiased estimates when missingness mechanism is the same between the intervention groups and there is no interaction between baseline covariate and intervention group. Multiple imputation shows overcoverage for small number of clusters in each intervention group.
An Introduction to Modern Missing Data Analyses
ERIC Educational Resources Information Center
Baraldi, Amanda N.; Enders, Craig K.
2010-01-01
A great deal of recent methodological research has focused on two modern missing data analysis methods: maximum likelihood and multiple imputation. These approaches are advantageous to traditional techniques (e.g. deletion and mean imputation techniques) because they require less stringent assumptions and mitigate the pitfalls of traditional…
A comprehensive SNP and indel imputability database.
Duan, Qing; Liu, Eric Yi; Croteau-Chonka, Damien C; Mohlke, Karen L; Li, Yun
2013-02-15
Genotype imputation has become an indispensable step in genome-wide association studies (GWAS). Imputation accuracy, which directly influences downstream analysis, has been shown to improve with re-sequencing-based reference panels; however, this comes at the cost of high computational burden due to the huge number of potentially imputable markers (tens of millions) discovered through sequencing a large number of individuals. Therefore, there is an increasing need for access to imputation quality information without actually conducting imputation. To facilitate this process, we have established a publicly available SNP and indel imputability database, aiming to provide direct access to imputation accuracy information for markers identified by the 1000 Genomes Project across four major populations and covering multiple GWAS genotyping platforms. SNP and indel imputability information can be retrieved through a user-friendly interface by providing the ID(s) of the desired variant(s) or by specifying the desired genomic region. The query results can be refined by selecting relevant GWAS genotyping platform(s). This is the first database providing variant imputability information specific to each continental group and to each genotyping platform. In Filipino individuals from the Cebu Longitudinal Health and Nutrition Survey, our database can achieve an area under the receiver-operating characteristic curve of 0.97, 0.91, 0.88 and 0.79 for markers with minor allele frequency >5%, 3-5%, 1-3% and 0.5-1%, respectively. Specifically, by filtering out 48.6% of markers (corresponding to a reduction of up to 48.6% in computational costs for actual imputation) based on the imputability information in our database, we can remove 77%, 58%, 51% and 42% of the poorly imputed markers at the cost of only 0.3%, 0.8%, 1.5% and 4.6% of the well-imputed markers with minor allele frequency >5%, 3-5%, 1-3% and 0.5-1%, respectively. http://www.unc.edu/∼yunmli/imputability.html
Wang, Guoshen; Pan, Yi; Seth, Puja; Song, Ruiguang; Belcher, Lisa
2017-01-01
Missing data create challenges for determining progress made in linking HIV-positive persons to HIV medical care. Statistical methods are not used to address missing program data on linkage. In 2014, 61 health department jurisdictions were funded by Centers for Disease Control and Prevention (CDC) and submitted data on HIV testing, newly diagnosed HIV-positive persons, and linkage to HIV medical care. Missing or unusable data existed in our data set. A new approach using multiple imputation to address missing linkage data was proposed, and results were compared to the current approach that uses data with complete information. There were 12,472 newly diagnosed HIV-positive persons from CDC-funded HIV testing events in 2014. Using multiple imputation, 94.1% (95% confidence interval (CI): [93.7%, 94.6%]) of newly diagnosed persons were referred to HIV medical care, 88.6% (95% CI: [88.0%, 89.1%]) were linked to care within any time frame, and 83.6% (95% CI: [83.0%, 84.3%]) were linked to care within 90 days. Multiple imputation is recommended for addressing missing linkage data in future analyses when the missing percentage is high. The use of multiple imputation for missing values can result in a better understanding of how programs are performing on key HIV testing and HIV service delivery indicators.
Lissåker, Claudia; Madison, Guy; Held, Claes; Olsson, Erik
2017-01-01
Background Cognitive ability (CA) is positively related to later health, health literacy, health behaviours and longevity. Accordingly, a lower CA is expected to be associated with poorer adherence to medication. We investigated the long-term role of CA in adherence to prescribed statins in male patients after a first myocardial infarction (MI). Methods CA was estimated at 18–20 years of age from Military Conscript Register data for first MI male patients (≤60 years) and was related to the one- and two-year post-MI statin adherence on average 30 years later. Background and clinical data were retrieved through register linkage with the unselected national quality register SWEDEHEART for acute coronary events (Register of Information and Knowledge about Swedish Heart Intensive Care Admissions) and secondary prevention (Secondary Prevention after Heart Intensive Care Admission). Previous and present statin prescription data were obtained from the Prescribed Drug Register and adherence was calculated as ≥80% of prescribed dispensations assuming standard dosage. Logistic regression was used to estimate crude and adjusted associations. The primary analyses used 2613 complete cases and imputing incomplete cases rendered a sample of 4061 cases for use in secondary (replicated) analyses. Results One standard deviation increase in CA was positively associated with both one-year (OR 1.15 (CI 1.01–1.31), P < 0.05) and two-year (OR 1.14 (CI 1.02–1.27), P < 0.05) adherence to prescribed statins. Only smoking attenuated the CA–adherence association after adjustment for a range of > 20 covariates. Imputed and complete case analyses yielded very similar results. Conclusions CA estimated on average 30 years earlier in young adulthood is a risk indicator for statin adherence in first MI male patients aged ≤60 years. Future research should include older and female patients and more socioeconomic variables. PMID:28195516
Newgard, Craig; Malveau, Susan; Staudenmayer, Kristan; Wang, N. Ewen; Hsia, Renee Y.; Mann, N. Clay; Holmes, James F.; Kuppermann, Nathan; Haukoos, Jason S.; Bulger, Eileen M.; Dai, Mengtao; Cook, Lawrence J.
2012-01-01
Objectives The objective was to evaluate the process of using existing data sources, probabilistic linkage, and multiple imputation to create large population-based injury databases matched to outcomes. Methods This was a retrospective cohort study of injured children and adults transported by 94 emergency medical systems (EMS) agencies to 122 hospitals in seven regions of the western United States over a 36-month period (2006 to 2008). All injured patients evaluated by EMS personnel within specific geographic catchment areas were included, regardless of field disposition or outcome. The authors performed probabilistic linkage of EMS records to four hospital and postdischarge data sources (emergency department [ED] data, patient discharge data, trauma registries, and vital statistics files) and then handled missing values using multiple imputation. The authors compare and evaluate matched records, match rates (proportion of matches among eligible patients), and injury outcomes within and across sites. Results There were 381,719 injured patients evaluated by EMS personnel in the seven regions. Among transported patients, match rates ranged from 14.9% to 87.5% and were directly affected by the availability of hospital data sources and proportion of missing values for key linkage variables. For vital statistics records (1-year mortality), estimated match rates ranged from 88.0% to 98.7%. Use of multiple imputation (compared to complete case analysis) reduced bias for injury outcomes, although sample size, percentage missing, type of variable, and combined-site versus single-site imputation models all affected the resulting estimates and variance. Conclusions This project demonstrates the feasibility and describes the process of constructing population-based injury databases across multiple phases of care using existing data sources and commonly available analytic methods. 
Attention to key linkage variables and decisions for handling missing values can be used to increase match rates between data sources, minimize bias, and preserve sampling design. PMID:22506952
McClure, Matthew C.; Sonstegard, Tad S.; Wiggans, George R.; Van Eenennaam, Alison L.; Weber, Kristina L.; Penedo, Cecilia T.; Berry, Donagh P.; Flynn, John; Garcia, Jose F.; Carmo, Adriana S.; Regitano, Luciana C. A.; Albuquerque, Milla; Silva, Marcos V. G. B.; Machado, Marco A.; Coffey, Mike; Moore, Kirsty; Boscher, Marie-Yvonne; Genestout, Lucie; Mazza, Raffaele; Taylor, Jeremy F.; Schnabel, Robert D.; Simpson, Barry; Marques, Elisa; McEwan, John C.; Cromie, Andrew; Coutinho, Luiz L.; Kuehn, Larry A.; Keele, John W.; Piper, Emily K.; Cook, Jim; Williams, Robert; Van Tassell, Curtis P.
2013-01-01
To assist cattle producers in the transition from microsatellite (MS) to single nucleotide polymorphism (SNP) genotyping for parental verification, we previously devised an effective and inexpensive method to impute MS alleles from SNP haplotypes. While the reported method was verified with only a limited data set (N = 479) from Brown Swiss, Guernsey, Holstein, and Jersey cattle, some of the MS-SNP haplotype associations were concordant across these phylogenetically diverse breeds. This implied that some haplotypes predate modern breed formation and remain in strong linkage disequilibrium. To expand the utility of MS allele imputation across breeds, MS and SNP data from more than 8000 animals representing 39 breeds (Bos taurus and B. indicus) were used to predict 9410 SNP haplotypes, incorporating an average of 73 SNPs per haplotype, for which alleles from 12 MS markers could be accurately imputed. Approximately 25% of the MS-SNP haplotypes were present in multiple breeds (N = 2 to 36 breeds). These shared haplotypes allowed for MS imputation in breeds that were not represented in the reference population with only a small increase in Mendelian inheritance inconsistencies. Our reported reference haplotypes can be used for any cattle breed and the reported methods can be applied to any species to aid the transition from MS to SNP genetic markers. While ~91% of the animals with imputed alleles for 12 MS markers had ≤1 Mendelian inheritance conflicts with their parents' reported MS genotypes, this figure was 96% for our reference animals, indicating potential errors in the reported MS genotypes. The workflow we suggest autocorrects for genotyping errors and rare haplotypes, by MS genotyping animals whose imputed MS alleles fail parentage verification, and then incorporating those animals into the reference dataset. PMID:24065982
Missing Data in Alcohol Clinical Trials with Binary Outcomes
Hallgren, Kevin A.; Witkiewitz, Katie; Kranzler, Henry R.; Falk, Daniel E.; Litten, Raye Z.; O’Malley, Stephanie S.; Anton, Raymond F.
2017-01-01
Background Missing data are common in alcohol clinical trials for both continuous and binary endpoints. Approaches to handle missing data have been explored for continuous outcomes, yet no studies have compared missing data approaches for binary outcomes (e.g., abstinence, no heavy drinking days). The present study compares approaches to modeling binary outcomes with missing data in the COMBINE study. Method We included participants in the COMBINE Study who had complete drinking data during treatment and who were assigned to active medication or placebo conditions (N=1146). Using simulation methods, missing data were introduced under common scenarios with varying sample sizes and amounts of missing data. Logistic regression was used to estimate the effect of naltrexone (vs. placebo) in predicting any drinking and any heavy drinking outcomes at the end of treatment using four analytic approaches: complete case analysis (CCA), last observation carried forward (LOCF), the worst-case scenario of missing equals any drinking or heavy drinking (WCS), and multiple imputation (MI). In separate analyses, these approaches were compared when drinking data were manually deleted for those participants who discontinued treatment but continued to provide drinking data. Results WCS produced the greatest amount of bias in treatment effect estimates. MI usually yielded less biased estimates than WCS and CCA in the simulated data, and performed considerably better than LOCF when estimating treatment effects among individuals who discontinued treatment. Conclusions Missing data can introduce bias in treatment effect estimates in alcohol clinical trials. Researchers should utilize modern missing data methods, including MI, and avoid WCS and CCA when analyzing binary alcohol clinical trial outcomes. PMID:27254113
Missing Data in Alcohol Clinical Trials with Binary Outcomes.
Hallgren, Kevin A; Witkiewitz, Katie; Kranzler, Henry R; Falk, Daniel E; Litten, Raye Z; O'Malley, Stephanie S; Anton, Raymond F
2016-07-01
Missing data are common in alcohol clinical trials for both continuous and binary end points. Approaches to handle missing data have been explored for continuous outcomes, yet no studies have compared missing data approaches for binary outcomes (e.g., abstinence, no heavy drinking days). This study compares approaches to modeling binary outcomes with missing data in the COMBINE study. We included participants in the COMBINE study who had complete drinking data during treatment and who were assigned to active medication or placebo conditions (N = 1,146). Using simulation methods, missing data were introduced under common scenarios with varying sample sizes and amounts of missing data. Logistic regression was used to estimate the effect of naltrexone (vs. placebo) in predicting any drinking and any heavy drinking outcomes at the end of treatment using 4 analytic approaches: complete case analysis (CCA), last observation carried forward (LOCF), the worst case scenario (WCS) of missing equals any drinking or heavy drinking, and multiple imputation (MI). In separate analyses, these approaches were compared when drinking data were manually deleted for those participants who discontinued treatment but continued to provide drinking data. WCS produced the greatest amount of bias in treatment effect estimates. MI usually yielded less biased estimates than WCS and CCA in the simulated data and performed considerably better than LOCF when estimating treatment effects among individuals who discontinued treatment. Missing data can introduce bias in treatment effect estimates in alcohol clinical trials. Researchers should utilize modern missing data methods, including MI, and avoid WCS and CCA when analyzing binary alcohol clinical trial outcomes. Copyright © 2016 by the Research Society on Alcoholism.
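The three single-value approaches named in the abstract above (CCA, LOCF, WCS) can be illustrated with a minimal sketch. The function names and the toy visit histories below are hypothetical and are not taken from the COMBINE study; the outcome is coded 1 = any drinking, 0 = abstinent, and None marks a missing visit.

```python
from typing import List, Optional

def complete_case_rate(final: List[Optional[int]]) -> float:
    """CCA: use only participants whose end-of-treatment outcome was observed."""
    observed = [y for y in final if y is not None]
    return sum(observed) / len(observed)

def worst_case_rate(final: List[Optional[int]]) -> float:
    """WCS: count every missing end-of-treatment outcome as drinking (1)."""
    filled = [1 if y is None else y for y in final]
    return sum(filled) / len(filled)

def locf_rate(histories: List[List[Optional[int]]]) -> float:
    """LOCF: carry each participant's last observed value forward to the final visit."""
    finals = []
    for visits in histories:
        last = None
        for y in visits:
            if y is not None:
                last = y
        if last is not None:  # participants with no observed visit are dropped
            finals.append(last)
    return sum(finals) / len(finals)

# Hypothetical data: one row per participant, one column per visit.
histories = [
    [0, 0, 0],
    [0, 1, None],
    [0, 0, None],
    [1, 1, 1],
    [0, None, None],
]
final = [visits[-1] for visits in histories]

print(complete_case_rate(final))  # 0.5
print(worst_case_rate(final))     # 0.8
print(locf_rate(histories))       # 0.4
```

The same five participants yield three different drinking rates, which is the kind of divergence the simulation in the abstract quantifies as bias. Multiple imputation is not shown here because it requires an imputation model.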
Missing data and multiple imputation in clinical epidemiological research.
Pedersen, Alma B; Mikkelsen, Ellen M; Cronin-Fenton, Deirdre; Kristensen, Nickolaj R; Pham, Tra My; Pedersen, Lars; Petersen, Irene
2017-01-01
Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can constitute considerable challenges in the analyses and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analyses, missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method to deal with missing data, which accounts for the uncertainty associated with missing data. Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data.
Missing data and multiple imputation in clinical epidemiological research
Pedersen, Alma B; Mikkelsen, Ellen M; Cronin-Fenton, Deirdre; Kristensen, Nickolaj R; Pham, Tra My; Pedersen, Lars; Petersen, Irene
2017-01-01
Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can constitute considerable challenges in the analyses and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analyses, missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method to deal with missing data, which accounts for the uncertainty associated with missing data. Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data. PMID:28352203
Imputation method for lifetime exposure assessment in air pollution epidemiologic studies
2013-01-01
Background Environmental epidemiology, when focused on the life course of exposure to a specific pollutant, requires historical exposure estimates that are difficult to obtain for the full time period due to gaps in the historical record, especially in earlier years. We show that these gaps can be filled by applying multiple imputation methods to a formal risk equation that incorporates lifetime exposure. We also address challenges that arise, including choice of imputation method, potential bias in regression coefficients, and uncertainty in age-at-exposure sensitivities. Methods During time periods when parameters needed in the risk equation are missing for an individual, the parameters are filled by an imputation model using group level information or interpolation. A random component is added to match the variance found in the estimates for study subjects not needing imputation. The process is repeated to obtain multiple data sets, whose regressions against health data can be combined statistically to develop confidence limits using Rubin’s rules to account for the uncertainty introduced by the imputations. To test for possible recall bias between cases and controls, which can occur when historical residence location is obtained by interview, and which can lead to misclassification of imputed exposure by disease status, we introduce an “incompleteness index,” equal to the percentage of dose imputed (PDI) for a subject. “Effective doses” can be computed using different functional dependencies of relative risk on age of exposure, allowing intercomparison of different risk models. To illustrate our approach, we quantify lifetime exposure (dose) from traffic air pollution in an established case–control study on Long Island, New York, where considerable in-migration occurred over a period of many decades. Results The major result is the described approach to imputation. 
The illustrative example revealed potential recall bias, suggesting that regressions against health data should be done as a function of PDI to check for consistency of results. The 1% of study subjects who lived for long durations near heavily trafficked intersections, had very high cumulative exposures. Thus, imputation methods must be designed to reproduce non-standard distributions. Conclusions Our approach meets a number of methodological challenges to extending historical exposure reconstruction over a lifetime and shows promise for environmental epidemiology. Application to assessment of breast cancer risks will be reported in a subsequent manuscript. PMID:23919666
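Rubin's rules, which the abstract above uses to combine regressions across imputed data sets, can be sketched as follows. This is a generic illustration, not the authors' code; the function name and the five example estimates are hypothetical.

```python
import math
from statistics import mean, variance
from typing import List, Tuple

def pool_rubin(estimates: List[float], variances: List[float]) -> Tuple[float, float]:
    """Pool m point estimates and their squared standard errors with Rubin's rules.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation variance
    and B is the between-imputation (sample) variance of the estimates.
    """
    m = len(estimates)
    q_bar = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # divides by m - 1
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical regression coefficients from m = 5 imputed data sets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.012, 0.011, 0.009, 0.008]

q_bar, total_var = pool_rubin(estimates, variances)
print(round(q_bar, 4))                 # 0.5
print(round(total_var, 4))             # 0.0112
print(round(math.sqrt(total_var), 4))  # pooled standard error
```

Confidence limits are then formed around the pooled estimate using the pooled standard error. The reference t distribution and its degrees of freedom are omitted from this sketch.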
Time-dependent summary receiver operating characteristics for meta-analysis of prognostic studies.
Hattori, Satoshi; Zhou, Xiao-Hua
2016-11-20
Prognostic studies are widely conducted to examine whether biomarkers are associated with patients' prognoses and play important roles in medical decisions. Because findings from one prognostic study may be very limited, meta-analyses may be useful to obtain sound evidence. However, prognostic studies are often analyzed by relying on a study-specific cut-off value, which can lead to difficulty in applying the standard meta-analysis techniques. In this paper, we propose two methods to estimate a time-dependent version of the summary receiver operating characteristics curve for meta-analyses of prognostic studies with a right-censored time-to-event outcome. We introduce a bivariate normal model for the pair of time-dependent sensitivity and specificity and propose a method to form inferences based on summary statistics reported in published papers. This method provides a valid inference asymptotically. In addition, we consider a bivariate binomial model. To draw inferences from this bivariate binomial model, we introduce a multiple imputation method. The multiple imputation is found to be approximately proper multiple imputation, and thus the standard Rubin's variance formula is justified from a Bayesian view point. Our simulation study and application to a real dataset revealed that both methods work well with a moderate or large number of studies and the bivariate binomial model coupled with the multiple imputation outperforms the bivariate normal model with a small number of studies. Copyright © 2016 John Wiley & Sons, Ltd.
A nonparametric multiple imputation approach for missing categorical data.
Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu-Hsieh
2017-06-06
Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. 
We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.
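The donor-selection step described in the abstract above can be sketched as follows, assuming the two working-model scores have already been computed for every record. The weighted Euclidean distance, the default weight of 0.8, the donor-set size k, and all names are assumptions made for illustration; the abstract does not give the exact form of the distance function.

```python
import random
from typing import List, Optional, Sequence

def nn_impute_once(
    outcome: Sequence[Optional[str]],
    score_outcome: Sequence[float],   # hypothetical score from the outcome model
    score_missing: Sequence[float],   # hypothetical score from the missingness model
    w_outcome: float = 0.8,           # assumed weight; the other model gets 1 - w_outcome
    k: int = 3,                       # assumed donor-set size
    rng: Optional[random.Random] = None,
) -> List[Optional[str]]:
    """Return one completed copy of `outcome`, filling each None from a random near donor."""
    rng = rng or random.Random()
    w_missing = 1.0 - w_outcome
    donors = [j for j, y in enumerate(outcome) if y is not None]
    completed = list(outcome)
    for i, y in enumerate(outcome):
        if y is not None:
            continue
        # Weighted distance between record i and each candidate donor j.
        def dist(j: int) -> float:
            d_out = score_outcome[i] - score_outcome[j]
            d_mis = score_missing[i] - score_missing[j]
            return (w_outcome * d_out ** 2 + w_missing * d_mis ** 2) ** 0.5
        nearest = sorted(donors, key=dist)[:k]
        completed[i] = outcome[rng.choice(nearest)]
    return completed

# Hypothetical three-category outcome with one missing value.
outcome = ["a", "b", None, "c", "a"]
score_outcome = [0.10, 0.50, 0.52, 0.90, 0.15]
score_missing = [0.20, 0.30, 0.35, 0.80, 0.10]

# Repeating the draw m times gives m completed data sets for multiple imputation.
rng = random.Random(2017)
imputed_sets = [nn_impute_once(outcome, score_outcome, score_missing, rng=rng) for _ in range(5)]
print(imputed_sets[0])
```

Category proportions would then be estimated in each completed data set and pooled across the m sets.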
Liang, Yulan; Kelemen, Arpad
2017-01-01
Genetic and environmental (behavior, clinical, and demographic) factors are associated with increased risks of both myocardial infarction (MI) and high cholesterol (HC). It is known that HC is a major risk factor that may cause MI. However, whether there are common single nucleotide polymorphisms (SNPs) associated with both MI and HC is not firmly established, and whether there are modulating and modifying effects (interactions of genetic and known environmental factors) on either HC or MI, and whether these joint effects improve the predictions of MI, is understudied. The purpose of this study is to identify novel shared SNPs and modifiable environmental factors on MI and HC. We assess whether SNPs from a metabolic pathway related to MI may relate to HC; whether there are moderating effects among SNPs, lifestyle (smoking and drinking), HC, and MI after controlling for other factors [gender, body mass index (BMI), and hypertension (HTN)]; and evaluate the predictive power of the joint and modulating genetic and environmental factors influencing MI and HC. This is a retrospective study with residents of Erie and Niagara counties in New York with a history of MI or with no history of MI. The data set includes environmental variables (demographic, clinical, lifestyle). Thirty-one tagSNPs from a metabolic pathway related to MI are genotyped. Generalized linear models (GLMs) with imputation-based analysis are conducted for examining the common effects of tagSNPs and environmental exposures and their interactions on having a history of HC or MI. MI, BMI, and HTN are significant risk factors for HC. HC shows the strongest effect on risk of MI, in addition to HTN, gender, and smoking status, while drinking status shows a protective effect on MI.
rs16944 (gene IL-1β) and rs17222772 (gene ALOX) increase the risks of HC, while rs17231896 (gene CETP) has protective effects on HC either with or without the clinical, behavioral, demographic factors with different effect sizes that may indicate the existence of moderate or modifiable effects. Further analysis with the inclusions of gene–gene and gene–environmental interactions shows interactions between rs17231896 (CETP) and rs17222772 (ALOX); rs17231896 (CETP) and gender. rs17237890 (CETP) and rs2070744 (NOS3) are found to be significantly associated with risks of MI adjusted by both SNPs and environmental factors. After multiple testing adjustments, these effects diminished as expected. In addition, an interaction between drinking and smoking status is significant. Overall, the prediction power in successfully classifying MI status is increased to 80% with inclusions of all significant tagSNPs and environmental factors and their interactions compared with environmental factors only (72%). Having a history of either HC or MI has significant effects on each other in both directions, in addition to HTN and gender. Genes/SNPs identified from this analysis that are associated with HC may be potentially linked to MI, which could be further examined and validated through haplotype-pairs analysis with appropriate population stratification corrections, and function/pathway regulation analysis to eliminate the limitations of the current analysis. PMID:28906356
Liang, Yulan; Kelemen, Arpad
2017-09-01
Genetic and environmental (behavior, clinical, and demographic) factors are associated with increased risks of both myocardial infarction (MI) and high cholesterol (HC). It is known that HC is a major risk factor that may cause MI. However, whether there are common single nucleotide polymorphisms (SNPs) associated with both MI and HC is not firmly established, and whether there are modulating and modifying effects (interactions of genetic and known environmental factors) on either HC or MI, and whether these joint effects improve the predictions of MI, is understudied. The purpose of this study is to identify novel shared SNPs and modifiable environmental factors on MI and HC. We assess whether SNPs from a metabolic pathway related to MI may relate to HC; whether there are moderating effects among SNPs, lifestyle (smoking and drinking), HC, and MI after controlling for other factors [gender, body mass index (BMI), and hypertension (HTN)]; and evaluate the predictive power of the joint and modulating genetic and environmental factors influencing MI and HC. This is a retrospective study with residents of Erie and Niagara counties in New York with a history of MI or with no history of MI. The data set includes environmental variables (demographic, clinical, lifestyle). Thirty-one tagSNPs from a metabolic pathway related to MI are genotyped. Generalized linear models (GLMs) with imputation-based analysis are conducted for examining the common effects of tagSNPs and environmental exposures and their interactions on having a history of HC or MI. MI, BMI, and HTN are significant risk factors for HC. HC shows the strongest effect on risk of MI, in addition to HTN, gender, and smoking status, while drinking status shows a protective effect on MI.
rs16944 (gene IL-1β) and rs17222772 (gene ALOX) increase the risks of HC, while rs17231896 (gene CETP) has protective effects on HC either with or without the clinical, behavioral, demographic factors with different effect sizes that may indicate the existence of moderate or modifiable effects. Further analysis with the inclusions of gene-gene and gene-environmental interactions shows interactions between rs17231896 (CETP) and rs17222772 (ALOX); rs17231896 (CETP) and gender. rs17237890 (CETP) and rs2070744 (NOS3) are found to be significantly associated with risks of MI adjusted by both SNPs and environmental factors. After multiple testing adjustments, these effects diminished as expected. In addition, an interaction between drinking and smoking status is significant. Overall, the prediction power in successfully classifying MI status is increased to 80% with inclusions of all significant tagSNPs and environmental factors and their interactions compared with environmental factors only (72%). Having a history of either HC or MI has significant effects on each other in both directions, in addition to HTN and gender. Genes/SNPs identified from this analysis that are associated with HC may be potentially linked to MI, which could be further examined and validated through haplotype-pairs analysis with appropriate population stratification corrections, and function/pathway regulation analysis to eliminate the limitations of the current analysis.
Imputation for multisource data with comparison and assessment techniques
DOE Office of Scientific and Technical Information (OSTI.GOV)
Casleton, Emily Michele; Osthus, David Allen; Van Buren, Kendra Lu
2017-12-27
Missing data are a prevalent issue in analyses involving data collection. The problem of missing data is exacerbated for multisource analysis, where data from multiple sensors are combined to arrive at a single conclusion. In this scenario, missing data are more likely to occur and can lead to discarding a large amount of the data collected; however, the information from observed sensors can be leveraged to estimate those values not observed. We propose two methods for imputation of multisource data, both of which take advantage of potential correlation between data from different sensors, through ridge regression and a state-space model. These methods, as well as the common median imputation, are applied to data collected from a variety of sensors monitoring an experimental facility. Performance of imputation methods is compared with the mean absolute deviation; however, rather than using this metric solely to rank the methods, we also propose an approach to identify significant differences. Imputation techniques will also be assessed by their ability to produce appropriate confidence intervals, through coverage and length, around the imputed values. Finally, performance of imputed datasets is compared with a marginalized dataset through a weighted k-means clustering. In general, we found that imputation through a dynamic linear model tended to be the most accurate and to produce the most precise confidence intervals, and that imputing the missing values and down weighting them with respect to observed values in the analysis led to the most accurate performance.
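The idea of borrowing strength from a correlated sensor through ridge regression, and scoring the result with the mean absolute deviation, can be sketched as follows. This is a minimal illustration with a single predictor sensor and an unpenalized intercept; the function names, the penalty `lam`, and the toy readings are assumptions made for the example and are not taken from the paper.

```python
from statistics import mean

def ridge_impute(target, predictor, lam=1.0):
    """Fill None entries of `target` from `predictor` with a one-predictor ridge fit.

    Hypothetical helper: the fit uses only the rows where `target` is observed.
    """
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    slope = sxy / (sxx + lam)  # the penalty shrinks the slope toward zero
    return [y if y is not None else y_bar + slope * (x - x_bar)
            for x, y in zip(predictor, target)]

def mean_absolute_deviation(truth, imputed, missing_idx):
    """Average absolute error over the positions that were imputed."""
    return mean(abs(truth[i] - imputed[i]) for i in missing_idx)

# Toy readings (made up): sensor B is exactly twice sensor A, last value withheld.
sensor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sensor_b_truth = [2.0, 4.0, 6.0, 8.0, 10.0]
sensor_b_observed = [2.0, 4.0, 6.0, 8.0, None]

for lam in (0.0, 5.0):
    filled = ridge_impute(sensor_b_observed, sensor_a, lam=lam)
    print(lam, filled[4], mean_absolute_deviation(sensor_b_truth, filled, [4]))
```

With `lam=0.0` the fit reduces to ordinary least squares; larger penalties pull the imputed value toward the observed mean of the target sensor.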
Approaches in Characterizing Genetic Structure and Mapping in a Rice Multiparental Population.
Raghavan, Chitra; Mauleon, Ramil; Lacorte, Vanica; Jubay, Monalisa; Zaw, Hein; Bonifacio, Justine; Singh, Rakesh Kumar; Huang, B Emma; Leung, Hei
2017-06-07
Multi-parent Advanced Generation Intercross (MAGIC) populations are fast becoming mainstream tools for research and breeding, along with the technology and tools for analysis. This paper demonstrates the analysis of a rice MAGIC population from data filtering to imputation and processing of genetic data to characterizing genomic structure, and finally quantitative trait loci (QTL) mapping. In this study, 1316 S6:8 indica MAGIC (MI) lines and the eight founders were sequenced using Genotyping by Sequencing (GBS). As the GBS approach often includes missing data, the first step was to impute the missing SNPs. The observable number of recombinations in the population was then explored. Based on this case study, a general outline of procedures for a MAGIC analysis workflow is provided, as well as for QTL mapping of agronomic traits and biotic and abiotic stress, using the results from both association and interval mapping approaches. QTL for agronomic traits (yield, flowering time, and plant height), physical (grain length and grain width) and cooking properties (amylose content) of the rice grain, abiotic stress (submergence tolerance), and biotic stress (brown spot disease) were mapped. Through presenting this extensive analysis in the MI population in rice, we highlight important considerations when choosing analytical approaches. The methods and results reported in this paper will provide a guide to future genetic analysis methods applied to multi-parent populations. Copyright © 2017 Raghavan et al.
Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data
Andrew T. Hudak; Nicholas L. Crookston; Jeffrey S. Evans; David E. Hall; Michael J. Falkowski
2008-01-01
Meaningful relationships between forest structure attributes measured in representative field plots on the ground and remotely sensed data measured comprehensively across the same forested landscape facilitate the production of maps of forest attributes such as basal area (BA) and tree density (TD). Because imputation methods can efficiently predict multiple response...
USDA-ARS's Scientific Manuscript database
Genotyping-by-sequencing allows for large-scale genetic analyses in plant species with no reference genome, creating the challenge of sound inference in the presence of uncertain genotypes. Here we report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundina...
USDA-ARS's Scientific Manuscript database
Genotyping by sequencing allows for large-scale genetic analyses in plant species with no reference genome, but sets the challenge of sound inference in presence of uncertain genotypes. We report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundinacea L., P...
[Imputation methods for missing data in educational diagnostic evaluation].
Fernández-Alonso, Rubén; Suárez-Álvarez, Javier; Muñiz, José
2012-02-01
In the diagnostic evaluation of educational systems, self-reports are commonly used to collect data, both cognitive and orectic. For various reasons, in these self-reports, some of the students' data are frequently missing. The main goal of this research is to compare the performance of different imputation methods for missing data in the context of the evaluation of educational systems. On an empirical database of 5,000 subjects, 72 conditions were simulated: three levels of missing data, three types of loss mechanisms, and eight methods of imputation. The levels of missing data were 5%, 10%, and 20%. The loss mechanisms were set at: Missing completely at random, moderately conditioned, and strongly conditioned. The eight imputation methods used were: listwise deletion, replacement by the mean of the scale, by the item mean, the subject mean, the corrected subject mean, multiple regression, and Expectation-Maximization (EM) algorithm, with and without auxiliary variables. The results indicate that the recovery of the data is more accurate when using an appropriate combination of different methods of recovering lost data. When a case is incomplete, the mean of the subject works very well, whereas for completely lost data, multiple imputation with the EM algorithm is recommended. The use of this combination is especially recommended when data loss is greater and its loss mechanism is more conditioned. Lastly, the results are discussed, and some future lines of research are analyzed.
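Two of the simple replacement rules compared above, the subject mean and the item mean, can be sketched as below. This is only an illustration on a made-up response matrix (rows are subjects, columns are items, `None` marks a missing answer); the function names are hypothetical and the study's corrected subject mean is not reproduced here.

```python
from statistics import mean

def impute_subject_mean(responses):
    """Replace None with the mean of the same subject's observed items."""
    out = []
    for row in responses:
        observed = [v for v in row if v is not None]
        fill = mean(observed) if observed else None  # a fully missing row stays missing
        out.append([fill if v is None else v for v in row])
    return out

def impute_item_mean(responses):
    """Replace None with the mean of the same item across subjects."""
    fills = []
    for col in zip(*responses):
        observed = [v for v in col if v is not None]
        fills.append(mean(observed) if observed else None)
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in responses]

# Made-up answers on a 1-5 scale.
answers = [
    [4, None, 2],
    [1, 3, None],
    [5, 5, 5],
]
print(impute_subject_mean(answers))
print(impute_item_mean(answers))
```

The subject mean only works when the case is incomplete rather than empty, which matches the abstract's distinction between incomplete cases and completely lost data.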
Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data.
Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha
2015-12-01
Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead the measurement of a variable such as blood glucose may depend on its prior values as well as that of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships as well as multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and the Fourier transform. This enables imputation of missing values even when all data at a time point is missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring) the proposed method has the highest imputation accuracy. This was true for up to half the data being missing and when consecutive missing values are a significant fraction of the overall time series length. Copyright © 2015 Elsevier Inc. All rights reserved.
Chan, Kelvin K W; Xie, Feng; Willan, Andrew R; Pullenayegum, Eleanor M
2017-04-01
Parameter uncertainty in value sets of multiattribute utility-based instruments (MAUIs) has received little attention previously. This false precision leads to underestimation of the uncertainty of the results of cost-effectiveness analyses. The aim of this study is to examine the use of multiple imputation as a method to account for this uncertainty of MAUI scoring algorithms. We fitted a Bayesian model with random effects for respondents and health states to the data from the original US EQ-5D-3L valuation study, thereby estimating the uncertainty in the EQ-5D-3L scoring algorithm. We applied these results to EQ-5D-3L data from the Commonwealth Fund (CWF) Survey for Sick Adults ( n = 3958), comparing the standard error of the estimated mean utility in the CWF population using the predictive distribution from the Bayesian mixed-effect model (i.e., incorporating parameter uncertainty in the value set) with the standard error of the estimated mean utilities based on multiple imputation and the standard error using the conventional approach of using MAUI (i.e., ignoring uncertainty in the value set). The mean utility in the CWF population based on the predictive distribution of the Bayesian model was 0.827 with a standard error (SE) of 0.011. When utilities were derived using the conventional approach, the estimated mean utility was 0.827 with an SE of 0.003, which is only 25% of the SE based on the full predictive distribution of the mixed-effect model. Using multiple imputation with 20 imputed sets, the mean utility was 0.828 with an SE of 0.011, which is similar to the SE based on the full predictive distribution. Ignoring uncertainty of the predicted health utilities derived from MAUIs could lead to substantial underestimation of the variance of mean utilities. Multiple imputation corrects for this underestimation so that the results of cost-effectiveness analyses using MAUIs can report the correct degree of uncertainty.
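The abstract does not say how the estimates from the 20 imputed sets were combined. A common choice is Rubin's rules, sketched here under that assumption; the function name and the input numbers are illustrative and are not the study's data.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool per-imputation estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                    # pooled point estimate
    within = mean(se ** 2 for se in std_errors)  # average within-imputation variance
    between = variance(estimates)              # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)

# Illustrative inputs: three imputed analyses, each with a small
# conventional standard error but differing point estimates.
pooled_mean, pooled_se = pool_rubin([0.82, 0.83, 0.84], [0.003, 0.003, 0.003])
print(pooled_mean, pooled_se)
```

The pooled standard error exceeds every per-imputation standard error because the between-imputation spread is added, which is the mechanism by which multiple imputation can restore uncertainty that a fixed value set ignores.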
Habbous, Steven; Chu, Karen P.; Lau, Harold; Schorr, Melissa; Belayneh, Mathieos; Ha, Michael N.; Murray, Scott; O’Sullivan, Brian; Huang, Shao Hui; Snow, Stephanie; Parliament, Matthew; Hao, Desiree; Cheung, Winson Y.; Xu, Wei; Liu, Geoffrey
2017-01-01
BACKGROUND: The incidence of oropharyngeal cancer has risen over the past 2 decades. This rise has been attributed to human papillomavirus (HPV), but information on temporal trends in incidence of HPV-associated cancers across Canada is limited. METHODS: We collected social, clinical and demographic characteristics and p16 protein status (p16-positive or p16-negative, using this immunohistochemistry variable as a surrogate marker of HPV status) for 3643 patients with oropharyngeal cancer diagnosed between 2000 and 2012 at comprehensive cancer centres in British Columbia (6 centres), Edmonton, Calgary, Toronto and Halifax. We used receiver operating characteristic curves and multiple imputation to estimate the p16 status for missing values. We chose a best-imputation probability cut point on the basis of accuracy in samples with known p16 status and through an independent relation between p16 status and overall survival. We used logistic and Cox proportional hazard regression. RESULTS: We found no temporal changes in p16-positive status initially, but there was significant selection bias, with p16 testing significantly more likely to be performed in males, lifetime never-smokers, patients with tonsillar or base-of-tongue tumours and those with nodal involvement (p < 0.05 for each variable). We used the following variables associated with p16-positive status for multiple imputation: male sex, tonsillar or base-of-tongue tumours, smaller tumours, nodal involvement, less smoking and lower alcohol consumption (p < 0.05 for each variable). Using sensitivity analyses, we showed that different imputation probability cut points for p16-positive status each identified a rise from 2000 to 2012, with the best-probability cut point identifying an increase from 47.3% in 2000 to 73.7% in 2012 (p < 0.001). INTERPRETATION: Across multiple centres in Canada, there was a steady rise in the proportion of oropharyngeal cancers attributable to HPV from 2000 to 2012. PMID:28808115
Incomplete Data in Smart Grid: Treatment of Values in Electric Vehicle Charging Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Majipour, Mostafa; Chu, Peter; Gadh, Rajit
2014-11-03
In this paper, five imputation methods, namely Constant (zero), Mean, Median, Maximum Likelihood, and Multiple Imputation, have been applied to compensate for missing values in Electric Vehicle (EV) charging data. The outcome of each of these methods has been used as the input to a prediction algorithm to forecast the EV load in the next 24 hours at each individual outlet. The data are real-world data at the outlet level from the UCLA campus parking lots. Given the sparsity of the data, both Median and Constant (=zero) imputations improved the prediction results. Since in most missing value cases in our database, all values of that instance are missing, the multivariate imputation methods did not improve the results significantly compared to univariate approaches.
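The three univariate rules named above (constant zero, mean, median) can be sketched as below. The helper name and the load values are made up for illustration; they are not the UCLA data.

```python
from statistics import mean, median

def impute_univariate(series, method):
    """Fill None entries of one series with a constant, the mean, or the median."""
    observed = [v for v in series if v is not None]
    if method == "constant":
        fill = 0.0
    elif method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in series]

# Hypothetical sparse hourly load at one outlet (kWh): mostly zero, one charging event.
load = [0.0, 0.0, None, 0.0, 6.0, None, 0.0]
for method in ("constant", "mean", "median"):
    print(method, impute_univariate(load, method))
```

On a series that is mostly zero, the median and the zero constant give the same fill while the mean is pulled upward by the rare charging events, which is one plausible reading of why the two behaved alike on sparse data.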
Kontopantelis, Evangelos; Parisi, Rosa; Springate, David A; Reeves, David
2017-01-13
In modern health care systems, the computerization of all aspects of clinical care has led to the development of large data repositories. For example, in the UK, large primary care databases hold millions of electronic medical records, with detailed information on diagnoses, treatments, outcomes and consultations. Careful analyses of these observational datasets of routinely collected data can complement evidence from clinical trials or even answer research questions that cannot be addressed in an experimental setting. However, 'missingness' is a common problem for routinely collected data, especially for biological parameters over time. Absence of complete data for the whole of an individual's study period is a potential bias risk and standard complete-case approaches may lead to biased estimates. However, the structure of the data values makes standard cross-sectional multiple-imputation approaches unsuitable. In this paper we propose and evaluate mibmi, a new command for cleaning and imputing longitudinal body mass index data. The regression-based data cleaning aspects of the algorithm can be useful when researchers analyze messy longitudinal data. Although the multiple imputation algorithm is computationally expensive, it performed similarly to, or even better than, existing alternatives when interpolating observations. The mibmi algorithm can be a useful tool for analyzing longitudinal body mass index data, or other longitudinal data with very low individual-level variability.
Zhang, Guosheng; Huang, Kuan-Chieh; Xu, Zheng; Tzeng, Jung-Ying; Conneely, Karen N; Guan, Weihua; Kang, Jian; Li, Yun
2016-05-01
DNA methylation is a key epigenetic mark involved in both normal development and disease progression. Recent advances in high-throughput technologies have enabled genome-wide profiling of DNA methylation. However, DNA methylation profiling often employs different designs and platforms with varying resolution, which hinders joint analysis of methylation data from multiple platforms. In this study, we propose a penalized functional regression model to impute missing methylation data. By incorporating functional predictors, our model utilizes information from nonlocal probes to improve imputation quality. Here, we compared the performance of our functional model to linear regression and the best single probe surrogate in real data and via simulations. Specifically, we applied different imputation approaches to an acute myeloid leukemia dataset consisting of 194 samples and our method showed higher imputation accuracy, manifested, for example, by a 94% relative increase in information content and up to 86% more CpG sites passing post-imputation filtering. Our simulated association study further demonstrated that our method substantially improves the statistical power to identify trait-associated methylation loci. These findings indicate that the penalized functional regression model is a convenient and valuable imputation tool for methylation data, and it can boost statistical power in downstream epigenome-wide association study (EWAS). © 2016 WILEY PERIODICALS, INC.
Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials.
Hori, Tomoaki; Montcho, David; Agbangla, Clement; Ebana, Kaworu; Futakuchi, Koichi; Iwata, Hiroyoshi
2016-11-01
A method based on a multi-task Gaussian process using self-measuring similarity gave increased accuracy for imputing missing phenotypic data in multi-trait and multi-environment trials. Multi-environmental trial (MET) data often encounter the problem of missing data. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to understand. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally performed for multiple traits that are correlated to each other. Correlation among traits can be useful information for imputation, but single-trait-based methods cannot utilize information shared by traits that are correlated. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels reflecting relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data and marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for the generation of missing data at various missing rates were considered. The MTGP methods achieved better imputation accuracy than regularized PCA, especially at high missing rates. Under the 'uniform' scenario, in which missing data arise randomly, inclusion of marker genotype data in the imputation increased the imputation accuracy at high missing rates. Under the 'fiber' scenario, in which missing data arise in all traits for some combinations between genotypes and environments, the inclusion of marker genotype data decreased the imputation accuracy for most traits while increasing the accuracy in a few traits remarkably. The proposed methods will be useful for solving the missing data problem in MET data.
Raebel, Marsha A; Shetterly, Susan; Lu, Christine Y; Flory, James; Gagne, Joshua J; Harrell, Frank E; Haynes, Kevin; Herrinton, Lisa J; Patorno, Elisabetta; Popovic, Jennifer; Selvan, Mano; Shoaibi, Azadeh; Wang, Xingmei; Roy, Jason
2016-07-01
Our purpose was to quantify missing baseline laboratory results, assess predictors of missingness, and examine performance of missing data methods. Using the Mini-Sentinel Distributed Database from three sites, we selected three exposure-outcome scenarios with laboratory results as baseline confounders. We compared hazard ratios (HRs) or risk differences (RDs) and 95% confidence intervals (CIs) from models that omitted laboratory results, included only available results (complete cases), and included results after applying missing data methods (multiple imputation [MI] regression, MI predictive mean matching [PMM] indicator). Scenario 1 considered glucose among second-generation antipsychotic users and diabetes. Across sites, glucose was available for 27.7-58.9%. Results differed between complete case and missing data models (e.g., olanzapine: HR 0.92 [CI 0.73, 1.12] vs 1.02 [0.90, 1.16]). Across-site models employing different MI approaches provided similar HR and CI; site-specific models provided differing estimates. Scenario 2 evaluated creatinine among individuals starting high versus low dose lisinopril and hyperkalemia. Creatinine availability: 44.5-79.0%. Results differed between complete case and missing data models (e.g., HR 0.84 [CI 0.77, 0.92] vs. 0.88 [0.83, 0.94]). HR and CI were identical across MI methods. Scenario 3 examined international normalized ratio (INR) among warfarin users starting interacting versus noninteracting antimicrobials and bleeding. INR availability: 20.0-92.9%. Results differed between ignoring INR versus including INR using missing data methods (e.g., RD 0.05 [CI -0.03, 0.13] vs 0.09 [0.00, 0.18]). Indicator and PMM methods gave similar estimates. Multi-site studies must consider site variability in missing data. Different missing data methods performed similarly. Copyright © 2016 John Wiley & Sons, Ltd.
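Predictive mean matching (PMM), one of the methods named above, fills a missing value with an observed value borrowed from a case whose model prediction is close. The sketch below is a deliberately simplified single-predictor version: it omits the draw of regression parameters that a full multiple-imputation PMM would include, and the function name, `k`, and the toy data are assumptions for illustration only.

```python
import random
from statistics import mean

def pmm_impute(x, y, k=3, seed=0):
    """Simplified PMM: impute None in `y` by sampling from the k nearest donors.

    Donors are observed cases ranked by how close their fitted value is
    to the fitted value of the case being imputed.
    """
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = mean(xi for xi, _ in obs)
    y_bar = mean(yi for _, yi in obs)
    sxx = sum((xi - x_bar) ** 2 for xi, _ in obs)  # assumes x varies among observed cases
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in obs) / sxx

    def predict(xi):
        return y_bar + slope * (xi - x_bar)

    out = []
    for xi, yi in zip(x, y):
        if yi is not None:
            out.append(yi)
            continue
        donors = sorted(obs, key=lambda d: abs(predict(d[0]) - predict(xi)))[:k]
        out.append(rng.choice(donors)[1])  # borrow an actually observed value
    return out

# Made-up covariate and lab value with two missing results.
covariate = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
lab_value = [1.1, 1.9, None, 4.2, 5.0, None]
print(pmm_impute(covariate, lab_value, k=2))
```

Because every imputed value is copied from a donor, the filled-in results always stay within the range of values that were actually observed, unlike imputation directly from the regression line.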
Markkula, Niina; Suvisaari, Jaana; Saarni, Samuli I; Pirkola, Sami; Peña, Sebastian; Saarni, Suoma; Ahola, Kirsi; Mattila, Aino K; Viertiö, Satu; Strehle, Jens; Koskinen, Seppo; Härkänen, Tommi
2015-03-01
Up-to-date epidemiological data on depressive disorders is needed to understand changes in population health and health care utilization. This study aims to assess the prevalence of major depressive disorder (MDD) and dysthymia in the Finnish population and possible changes during the past 11 years. In a nationally representative sample of Finns aged 30 and above (BRIF8901), depressive disorders were diagnosed with the Composite International Diagnostic Interview (M-CIDI) in 2000 and 2011. To account for nonresponse, two methods were compared: multiple imputation (MI) utilizing data from the hospital discharge register and from the interview in 2000 and statistical weighting. The MI-corrected 12-month prevalence of MDD was 7.4% (95% CI 5.7-9.0) and of dysthymia was 4.5% (95% CI 3.1-5.9), whereas the corresponding figures using weights were 5.4% (95% CI 4.7-6.1) for MDD and 2.0% (95% CI 1.6-2.4) for dysthymia. Women (OR 2.33, 95% CI 1.6-3.4) and unmarried people (OR 1.54, 95% CI 1.2-2.0) had a higher risk of depressive disorders. There was a significant increase in the prevalence of depressive disorders during the follow-up period from 7.3% in 2000 to 9.6% in 2011. Prevalences were two percentage points higher, on average, when using MI compared to weighting. Hospital treatments for depressive disorders and other mental disorders were strongly associated with nonparticipation. The CIDI response rate dropped from 75% in 2000 to 57% in 2011, but this was accounted for by MI and weighting. Depressive disorders are a growing public health concern in Finland. Non-participation of persons with severe mental disorders may bias the prevalence estimates of mental disorders in population-based studies. Copyright © 2014 Elsevier B.V. All rights reserved.
ERIC Educational Resources Information Center
Bokossa, Maxime C.; Huang, Gary G.
This report describes the imputation procedures used to deal with missing data in the National Education Longitudinal Study of 1988 (NELS:88), the only current National Center for Education Statistics (NCES) dataset that contains scores from cognitive tests given the same set of students at multiple time points. As is inevitable, cognitive test…
Shara, Nawar; Yassin, Sayf A.; Valaitis, Eduardas; Wang, Hong; Howard, Barbara V.; Wang, Wenyu; Lee, Elisa T.; Umans, Jason G.
2015-01-01
Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989–1991), 2 (1993–1995), and 3 (1998–1999) was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results. PMID:26414328
Advancing US GHG Inventory by Incorporating Survey Data using Machine-Learning Techniques
NASA Astrophysics Data System (ADS)
Alsaker, C.; Ogle, S. M.; Breidt, J.
2017-12-01
Crop management data are used in the National Greenhouse Gas Inventory that is compiled annually and reported to the United Nations Framework Convention on Climate Change. Emissions for carbon stock change and N2O emissions for US agricultural soils are estimated using the USDA National Resources Inventory (NRI). NRI provides basic information on land use and cropping histories, but it does not provide much detail on other management practices. In contrast, the Conservation Effects Assessment Project (CEAP) survey collects detailed crop management data that could be used in the GHG Inventory. The survey data were collected from NRI survey locations that are a subset of the NRI every 10 years. Therefore, imputation of the CEAP data is needed to represent the management practices across all NRI survey locations both spatially and temporally. Predictive mean matching and artificial neural network methods have been applied to develop an imputation model under a multiple imputation framework. Temporal imputation involves adjusting the imputation model using state-level USDA Agricultural Resource Management Survey data. Distributional and predictive accuracy is assessed for the imputed data, providing not only management data needed for the inventory but also rigorous estimates of uncertainty.
Penalized regression procedures for variable selection in the potential outcomes framework
Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L.
2015-01-01
A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple ‘impute, then select’ class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems, and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data and imputation are drawn. A difference LASSO algorithm is defined, along with its multiple imputation analogues. The procedures are illustrated using a well-known right heart catheterization dataset. PMID:25628185
Chaurasia, Ashok; Harel, Ofer
2015-02-10
Tests for regression coefficients such as global, local, and partial F-tests are common in applied research. In the framework of multiple imputation, there are several papers addressing tests for regression coefficients. However, for simultaneous hypothesis testing, the existing methods are computationally intensive because they involve calculation with vectors and (inversion of) matrices. In this paper, we propose a simple method based on the scalar entity, coefficient of determination, to perform (global, local, and partial) F-tests with multiply imputed data. The proposed method is evaluated using simulated data and applied to suicide prevention data. Copyright © 2014 John Wiley & Sons, Ltd.
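For orientation, the textbook complete-data link between the coefficient of determination and the F-statistics is written out below. These are standard identities for a linear model with an intercept, n observations, and p predictors, with q denoting the number of coefficients under test; how the paper combines R² across imputed data sets is not stated in the abstract and is not shown here.

```latex
% Standard complete-data identities (not the paper's pooling procedure).
% n: observations, p: predictors in the full model, q: coefficients tested.
\[
F_{\text{global}} = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}
\]
\[
F_{\text{partial}} =
  \frac{\bigl(R^2_{\text{full}} - R^2_{\text{reduced}}\bigr) / q}
       {\bigl(1 - R^2_{\text{full}}\bigr) / (n - p - 1)}
\]
```

Both statistics depend on the data only through scalar R² values, which is what makes a scalar-based approach cheaper than one that manipulates coefficient vectors and covariance matrices.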
NASA Astrophysics Data System (ADS)
Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
2016-04-01
Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is an inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have been recently tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31900 km2). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation with Chained Equations (MICE) at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. 
We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix and in selected bivariate trait relationships. MICE yielded imputations which better preserved the variability and covariance structure of the data and provided an estimate of between-imputation uncertainty. We found that adding species identity as a predictor in MICE and kNN improved imputation for all traits, but adding climate did not lead to any appreciable improvement. However, forest structure and spatial structure did reduce imputation errors in maximum height and in leaf biomass to sapwood area ratios, respectively. Although species mean imputations showed the lowest error for 3 out of the 5 studied traits, dataset-averaged errors were lowest for MICE imputations with all additional predictors, when missing data levels were 50% or lower. Species mean imputations always resulted in larger errors in the correlation matrix and appreciably altered the studied bivariate trait relationships. In conclusion, MICE imputations using species identity, climate, forest structure and spatial structure as predictors emerged as the most suitable method of the ones tested here, but it was also evident that imputation performance deteriorates at high levels of missing data (80%).
Gottfredson, Nisha C; Sterba, Sonya K; Jackson, Kristina M
2017-01-01
Random coefficient-dependent (RCD) missingness is a non-ignorable mechanism through which missing data can arise in longitudinal designs. RCD, for which we cannot test, is a problematic form of missingness that occurs if subject-specific random effects correlate with propensity for missingness or dropout. Particularly when covariate missingness is a problem, investigators typically handle missing longitudinal data by using single-level multiple imputation procedures implemented with long-format data, which ignores within-person dependency entirely, or implemented with wide-format (i.e., multivariate) data, which ignores some aspects of within-person dependency. When either of these standard approaches to handling missing longitudinal data is used, RCD missingness leads to parameter bias and incorrect inference. We explain why multilevel multiple imputation (MMI) should alleviate bias induced by a RCD missing data mechanism under conditions that contribute to stronger determinacy of random coefficients. We evaluate our hypothesis with a simulation study. Three design factors are considered: intraclass correlation (ICC; ranging from .25 to .75), number of waves (ranging from 4 to 8), and percent of missing data (ranging from 20 to 50%). We find that MMI greatly outperforms the single-level wide-format (multivariate) method for imputation under a RCD mechanism. For the MMI analyses, bias was most alleviated when the ICC is high, there were more waves of data, and when there was less missing data. Practical recommendations for handling longitudinal missing data are suggested.
Simultaneous inhibition of multiple oncogenic miRNAs by a multi-potent microRNA sponge.
Jung, Jaeyun; Yeom, Chanjoo; Choi, Yeon-Sook; Kim, Sinae; Lee, EunJi; Park, Min Ji; Kang, Sang Wook; Kim, Sung Bae; Chang, Suhwan
2015-08-21
The roles of oncogenic miRNAs are widely recognized in many cancers. Inhibition of a single miRNA using an antagomiR can efficiently knock down a specific miRNA. However, the effect is transient and often results in a subtle phenotype, as other miRNAs also contribute to tumorigenesis. Here we report a multi-potent miRNA sponge that inhibits multiple miRNAs simultaneously. As a model system, we targeted miR-21, miR-155 and miR-221/222, known as oncogenic miRNAs in multiple tumors including breast and pancreatic cancers. To achieve efficient knockdown, we generated perfect and bulged-matched miRNA binding sites (MBS) and introduced multiple copies of MBS, ranging from one to five, in the multi-potent miRNA sponge. A luciferase reporter assay showed that the multi-potent miRNA sponge efficiently inhibited 4 miRNAs in breast and pancreatic cancer cells. Furthermore, a stable and inducible version of the multi-potent miRNA sponge cell line showed that the miRNA sponge efficiently reduces the levels of the 4 target miRNAs and increases the levels of the target proteins of these oncogenic miRNAs. Finally, we showed that the miRNA sponge sensitizes cells to a cancer drug and attenuates cell migratory activity. Altogether, our study demonstrates that the multi-potent miRNA sponge is a useful tool to examine the functional impact of simultaneous inhibition of multiple miRNAs and suggests therapeutic potential.
Prediction of regulatory gene pairs using dynamic time warping and gene ontology.
Yang, Andy C; Hsu, Hui-Huang; Lu, Ming-Da; Tseng, Vincent S; Shih, Timothy K
2014-01-01
Selecting informative genes is the most important task for data analysis on microarray gene expression data. In this work, we aim at identifying regulatory gene pairs from microarray gene expression data. However, microarray data often contain multiple missing expression values. Missing value imputation is thus needed before further processing for regulatory gene pairs becomes possible. We develop a novel approach to first impute missing values in microarray time series data by combining k-Nearest Neighbour (KNN), Dynamic Time Warping (DTW) and Gene Ontology (GO). After missing values are imputed, we then perform gene regulation prediction based on our proposed DTW-GO distance measurement of gene pairs. Experimental results show that our approach is more accurate when compared with existing missing value imputation methods on real microarray data sets. Furthermore, our approach can also discover more regulatory gene pairs that are known in the literature than other methods.
NASA Astrophysics Data System (ADS)
Yozgatligil, Ceylan; Aslan, Sipan; Iyigun, Cem; Batmaz, Inci
2013-04-01
This study aims to compare several imputation methods to complete the missing values of spatio-temporal meteorological time series. To this end, six imputation methods are assessed with respect to various criteria including accuracy, robustness, precision, and efficiency for artificially created missing data in monthly total precipitation and mean temperature series obtained from the Turkish State Meteorological Service. Of these methods, simple arithmetic average, normal ratio (NR), and NR weighted with correlations comprise the simple ones, whereas multilayer perceptron type neural network and multiple imputation strategy adopted by Monte Carlo Markov Chain based on expectation-maximization (EM-MCMC) are computationally intensive ones. In addition, we propose a modification on the EM-MCMC method. Besides using a conventional accuracy measure based on squared errors, we also suggest the correlation dimension (CD) technique of nonlinear dynamic time series analysis which takes spatio-temporal dependencies into account for evaluating imputation performances. Depending on the detailed graphical and quantitative analysis, it can be said that although computational methods, particularly EM-MCMC method, are computationally inefficient, they seem favorable for imputation of meteorological time series with respect to different missingness periods considering both measures and both series studied. To conclude, using the EM-MCMC algorithm for imputing missing values before conducting any statistical analyses of meteorological data will definitely decrease the amount of uncertainty and give more robust results. Moreover, the CD measure can be suggested for the performance evaluation of missing data imputation particularly with computational methods since it gives more precise results in meteorological time series.
Lazar, Cosmin; Gatto, Laurent; Ferro, Myriam; Bruley, Christophe; Burger, Thomas
2016-04-01
Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the different statistical methods to conduct imputation, compared them on real or simulated data sets, and recommended a list of missing value imputation methods for proteomics applications. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views: For instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate a few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.
Examining solutions to missing data in longitudinal nursing research.
Roberts, Mary B; Sullivan, Mary C; Winchester, Suzy B
2017-04-01
Longitudinal studies are highly valuable in pediatrics because they provide useful data about developmental patterns of child health and behavior over time. When data are missing, the value of the research is impacted. The study's purpose was to (1) introduce a three-step approach to assess and address missing data and (2) illustrate this approach using categorical and continuous-level variables from a longitudinal study of premature infants. A three-step approach with simulations was followed to assess the amount and pattern of missing data and to determine the most appropriate imputation method for the missing data. Patterns of missingness were Missing Completely at Random, Missing at Random, and Not Missing at Random. Missing continuous-level data were imputed using mean replacement, stochastic regression, multiple imputation, and fully conditional specification (FCS). Missing categorical-level data were imputed using last value carried forward, hot-decking, stochastic regression, and FCS. Simulations were used to evaluate these imputation methods under different patterns of missingness at different levels of missing data. The rate of missingness was 16-23% for continuous variables and 1-28% for categorical variables. FCS imputation provided the least difference in mean and standard deviation estimates for continuous measures. FCS imputation was acceptable for categorical measures. Results obtained through simulation reinforced and confirmed these findings. Significant investments are made in the collection of longitudinal data. The prudent handling of missing data can protect these investments and potentially improve the scientific information contained in pediatric longitudinal studies. © 2017 Wiley Periodicals, Inc.
Lapidus, Nathanael; Chevret, Sylvie; Resche-Rigon, Matthieu
2014-12-30
Agreement between two assays is usually based on the concordance correlation coefficient (CCC), estimated from the means, standard deviations, and correlation coefficient of these assays. However, such data will often suffer from left-censoring because of lower limits of detection of these assays. To handle such data, we propose to extend a multiple imputation approach by chained equations (MICE) developed in a close setting of one left-censored assay. The performance of this two-step approach is compared with that of a previously published maximum likelihood estimation through a simulation study. Results show close estimates of the CCC by both methods, although the coverage is improved by our MICE proposal. An application to cytomegalovirus quantification data is provided. Copyright © 2014 John Wiley & Sons, Ltd.
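The complete-data concordance correlation coefficient mentioned above can be sketched as follows. This is a minimal illustration of Lin's CCC computed from means, variances and covariance, assuming fully observed paired measurements; it does not implement the left-censoring or MICE steps of the paper, and the function name is hypothetical.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for paired, fully observed data.

    Uses population (divide-by-n) moments, as in Lin's original definition.
    Hypothetical helper: it does not handle left-censored values.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # CCC = 2*cov / (var_x + var_y + squared difference in means)
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
```

A constant shift between two otherwise perfectly correlated assays lowers the CCC below 1, which is the property that distinguishes it from the Pearson correlation.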
Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis.
Beaulieu-Jones, Brett K; Lavage, Daniel R; Snyder, John W; Moore, Jason H; Pendergrass, Sarah A; Bauer, Christopher R
2018-02-23
Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results. The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling). Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available. 
©Brett K Beaulieu-Jones, Daniel R Lavage, John W Snyder, Jason H Moore, Sarah A Pendergrass, Christopher R Bauer. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 23.02.2018.
SparRec: An effective matrix completion framework of missing data imputation for GWAS
NASA Astrophysics Data System (ADS)
Jiang, Bo; Ma, Shiqian; Causey, Jason; Qiao, Linbo; Hardin, Matthew Price; Bitts, Ian; Johnson, Daniel; Zhang, Shuzhong; Huang, Xiuzhen
2016-10-01
Genome-wide association studies present computational challenges for missing data imputation, while the advances of genotype technologies are generating datasets of large sample sizes with sample sets genotyped on multiple SNP chips. We present a new framework SparRec (Sparse Recovery) for imputation, with the following properties: (1) The optimization models of SparRec, based on low-rank and low number of co-clusters of matrices, are different from current statistics methods. While our low-rank matrix completion (LRMC) model is similar to Mendel-Impute, our matrix co-clustering factorization (MCCF) model is completely new. (2) SparRec, as other matrix completion methods, is flexible to be applied to missing data imputation for large meta-analysis with different cohorts genotyped on different sets of SNPs, even when there is no reference panel. This kind of meta-analysis is very challenging for current statistics based methods. (3) SparRec has consistent performance and achieves high recovery accuracy even when the missing data rate is as high as 90%. Compared with Mendel-Impute, our low-rank based method achieves similar accuracy and efficiency, while the co-clustering based method has advantages in running time. The testing results show that SparRec has significant advantages and competitive performance over other state-of-the-art existing statistics methods including Beagle and fastPhase.
Examining Solutions to Missing Data in Longitudinal Nursing Research
Roberts, Mary B.; Sullivan, Mary C.; Winchester, Suzy B.
2017-01-01
Purpose: Longitudinal studies are highly valuable in pediatrics because they provide useful data about developmental patterns of child health and behavior over time. When data are missing, the value of the research is impacted. The study’s purpose was to: (1) introduce a 3-step approach to assess and address missing data; (2) illustrate this approach using categorical and continuous level variables from a longitudinal study of premature infants. Methods: A three-step approach with simulations was followed to assess the amount and pattern of missing data and to determine the most appropriate imputation method for the missing data. Patterns of missingness were Missing Completely at Random, Missing at Random, and Not Missing at Random. Missing continuous-level data were imputed using mean replacement, stochastic regression, multiple imputation, and fully conditional specification. Missing categorical-level data were imputed using last value carried forward, hot-decking, stochastic regression, and fully conditional specification. Simulations were used to evaluate these imputation methods under different patterns of missingness at different levels of missing data. Results: The rate of missingness was 16–23% for continuous variables and 1–28% for categorical variables. Fully conditional specification imputation provided the least difference in mean and standard deviation estimates for continuous measures. Fully conditional specification imputation was acceptable for categorical measures. Results obtained through simulation reinforced and confirmed these findings. Practice Implications: Significant investments are made in the collection of longitudinal data. The prudent handling of missing data can protect these investments and potentially improve the scientific information contained in pediatric longitudinal studies. PMID:28425202
Regnerus, Mark
2017-09-01
The study of stigma's influence on health has surged in recent years. Hatzenbuehler et al.'s (2014) study of structural stigma's effect on mortality revealed an average of 12 years' shorter life expectancy for sexual minorities who resided in communities thought to exhibit high levels of anti-gay prejudice, using data from the 1988-2002 administrations of the US General Social Survey linked to mortality outcome data in the 2008 National Death Index. In the original study, the key predictor variable (structural stigma) led to results suggesting the profound negative influence of structural stigma on the mortality of sexual minorities. Attempts to replicate the study, in order to explore alternative hypotheses, repeatedly failed to generate the original study's key finding on structural stigma. Efforts to discern the source of the disparity in results revealed complications in the multiple imputation process for missing values of the components of structural stigma. This prompted efforts at replication using 10 different imputation approaches. Efforts to replicate Hatzenbuehler et al.'s (2014) key finding on structural stigma's notable influence on the premature mortality of sexual minorities, including a more refined imputation strategy than described in the original study, failed. No data imputation approach yielded parameters that supported the original study's conclusions. Alternative hypotheses, which originally motivated the present study, revealed little new information. Ten different approaches to multiple imputation of missing data yielded none in which the effect of structural stigma on the mortality of sexual minorities was statistically significant. Minimally, the original study's structural stigma variable (and hence its key result) is so sensitive to subjective measurement decisions as to be rendered unreliable. Copyright © 2016 The Author. Published by Elsevier Ltd.. All rights reserved.
2014-01-01
BACKGROUND: Elevated blood pressure (BP), a heritable risk factor for many age-related disorders, is commonly investigated in population and genetic studies, but antihypertensive use can confound study results. Routine methods to adjust for antihypertensives may not sufficiently account for newer treatment protocols (i.e., combination or multiple drug therapy) found in contemporary cohorts. METHODS: We refined an existing method to impute unmedicated BP in individuals on antihypertensives by incorporating new treatment trends. We assessed BP and antihypertensive use in male twins (n = 1,237) from the Vietnam Era Twin Study of Aging: 36% reported antihypertensive use; 52% of those treated were on multiple drugs. RESULTS: Estimated heritability was 0.43 (95% confidence interval (CI) = 0.20–0.50) and 0.44 (95% CI = 0.22–0.61) for measured systolic BP (SBP) and diastolic BP (DBP), respectively. We imputed unmedicated BP for individuals on antihypertensives by 3 approaches: (i) addition of a fixed value of 10/5 mm Hg to measured SBP/DBP; (ii) incremented addition of mm Hg to BP based on number of medications; and (iii) a refined approach adding mm Hg based on antihypertensive drug class and ethnicity. The imputations did not significantly affect estimated heritability of BP. However, use of our most refined imputation method and other methods resulted in significantly increased phenotypic correlations between BP and body mass index, a trait known to be correlated with BP. CONCLUSIONS: This study highlights the potential usefulness of applying a representative adjustment for medication use, such as by considering drug class, ethnicity, and the combination of drugs when assessing the relationship between BP and risk factors. PMID:24532572
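Approach (i), the fixed-value adjustment, is simple enough to sketch directly. The function name and signature below are hypothetical; only the 10/5 mm Hg increments come from the abstract. Approaches (ii) and (iii) are not shown because the abstract does not give their increments.

```python
def impute_unmedicated_bp(sbp, dbp, on_antihypertensives,
                          sbp_add=10.0, dbp_add=5.0):
    """Return (SBP, DBP) in mm Hg with a fixed increment added for treated people.

    Hypothetical helper illustrating approach (i): add 10/5 mm Hg to the
    measured SBP/DBP of anyone reporting antihypertensive use; untreated
    individuals keep their measured values.
    """
    if on_antihypertensives:
        return sbp + sbp_add, dbp + dbp_add
    return sbp, dbp
```

For a treated participant measured at 130/80 mm Hg, this sketch returns 140/85 mm Hg.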
Doidge, James C
2018-02-01
Population-based cohort studies are invaluable to health research because of the breadth of data collection over time, and the representativeness of their samples. However, they are especially prone to missing data, which can compromise the validity of analyses when data are not missing at random. Having many waves of data collection presents opportunity for participants' responsiveness to be observed over time, which may be informative about missing data mechanisms and thus useful as an auxiliary variable. Modern approaches to handling missing data such as multiple imputation and maximum likelihood can be difficult to implement with the large numbers of auxiliary variables and large amounts of non-monotone missing data that occur in cohort studies. Inverse probability-weighting can be easier to implement but conventional wisdom has stated that it cannot be applied to non-monotone missing data. This paper describes two methods of applying inverse probability-weighting to non-monotone missing data, and explores the potential value of including measures of responsiveness in either inverse probability-weighting or multiple imputation. Simulation studies are used to compare methods and demonstrate that responsiveness in longitudinal studies can be used to mitigate bias induced by missing data, even when data are not missing at random.
Mallinckrodt, C H; Lin, Q; Molenberghs, M
2013-01-01
The objective of this research was to demonstrate a framework for drawing inference from sensitivity analyses of incomplete longitudinal clinical trial data via a re-analysis of data from a confirmatory clinical trial in depression. A likelihood-based approach that assumed missing at random (MAR) was the primary analysis. Robustness to departure from MAR was assessed by comparing the primary result to those from a series of analyses that employed varying missing not at random (MNAR) assumptions (selection models, pattern mixture models and shared parameter models) and to MAR methods that used inclusive models. The key sensitivity analysis used multiple imputation assuming that after dropout the trajectory of drug-treated patients was that of placebo-treated patients with a similar outcome history (placebo multiple imputation). This result was used as the worst reasonable case to define the lower limit of plausible values for the treatment contrast. The endpoint contrast from the primary analysis was -2.79 (p = .013). In placebo multiple imputation, the result was -2.17. Results from the other sensitivity analyses ranged from -2.21 to -3.87 and were symmetrically distributed around the primary result. Hence, no clear evidence of bias from missing not at random data was found. In the worst reasonable case scenario, the treatment effect was 80% of the magnitude of the primary result. Therefore, it was concluded that a treatment effect existed. The structured sensitivity framework of using a worst reasonable case result based on a controlled imputation approach with transparent and debatable assumptions, supplemented by a series of plausible alternative models under varying assumptions, was useful in this specific situation and holds promise as a generally useful framework. Copyright © 2012 John Wiley & Sons, Ltd.
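The "80% of the magnitude" figure can be checked from the two contrasts reported in the abstract; the ratio below uses only those published values and rounds to roughly the stated percentage.

```latex
\frac{\lvert -2.17 \rvert}{\lvert -2.79 \rvert} = \frac{2.17}{2.79} \approx 0.78 \quad (\text{reported as about } 80\%)
```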
de Vocht, Frank; Lee, Brian
2014-08-01
Studies have suggested that residential exposure to extremely low frequency (50 Hz) electromagnetic fields (ELF-EMF) from high voltage cables, overhead power lines, electricity substations or towers is associated with reduced birth weight and may be associated with adverse birth outcomes or even miscarriages. We previously conducted a study of 140,356 singleton live births between 2004 and 2008 in Northwest England, which suggested that close residential proximity (≤ 50 m) to ELF-EMF sources was associated with a reduction in average birth weight of 212 g (95% CI: -395 to -29 g) but not with statistically significant increased risks for other adverse perinatal outcomes. However, the cohort was limited by missing data for most potentially confounding variables, including maternal smoking during pregnancy, which was only available for a small subgroup, and residual confounding could not be excluded. This study, using the same cohort, was conducted to minimize the effects of these problems using multiple imputation to address missing data and propensity score matching to minimize residual confounding. Missing data were imputed by multiple imputation using chained equations to generate five datasets. For each dataset, 115 exposed women (residing ≤ 50 m from a residential ELF-EMF source) were propensity score matched to 1150 unexposed women. After doubly robust confounder adjustment, close proximity to a residential ELF-EMF source remained associated with a reduction in birth weight of 116 g (95% CI: -224 to -7 g). No effect was found for proximity ≤ 100 m compared to women living further away. These results indicate that although the effect size was about half of the effect previously reported, close maternal residential proximity to sources of ELF-EMF remained associated with suboptimal fetal growth. Copyright © 2014 Elsevier Ltd. All rights reserved.
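When an analysis is repeated in each of several imputed datasets, as with the five datasets here, the per-dataset estimates are commonly combined with Rubin's rules. The sketch below shows that standard pooling step under the assumption that each dataset yields one point estimate and its squared standard error. The abstract does not state how its estimates were pooled, and the function name is hypothetical.

```python
from statistics import fmean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates with Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least 2 imputations with matching variances")
    pooled = fmean(estimates)            # average of the point estimates
    within = fmean(variances)            # average within-imputation variance
    between = variance(estimates)        # between-imputation variance (n - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

The factor (1 + 1/m) inflates the between-imputation variance to account for using a finite number of imputations.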
Hydrologic Response to Climate Change: Missing Precipitation Data Matters for Computed Timing Trends
NASA Astrophysics Data System (ADS)
Daniels, B.
2016-12-01
This work demonstrates the derivation of climate timing statistics and their application to determine the resulting hydroclimate impacts. Long-term daily precipitation observations from 50 California stations were used to compute climate trends of precipitation event Intensity, event Duration and Pause between events. Each precipitation event trend was then applied as input to a PRMS hydrology model, which showed hydrology changes to recharge, baseflow, streamflow, etc. An important concern was precipitation uncertainty induced by missing observation values, which causes errors in the quantification of precipitation trends. Many standard statistical techniques such as ARIMA and simple endogenous or even exogenous imputation were applied but failed to help resolve these uncertainties. What helped resolve these uncertainties was the use of multiple imputation techniques. This involved fitting Weibull probability distributions to multiple imputed values for the three precipitation trends. Permutation resampling techniques using Monte Carlo processing were then applied to the multiple imputation values to derive significance p-values for each trend. Significance at the 95% level for Intensity was found for 11 of the 50 stations, Duration from 16 of the 50, and Pause from 19, of which 12 were 99% significant. The significance weighted trends for California are Intensity -4.61% per decade, Duration +3.49% per decade, and Pause +3.58% per decade. Two California basins with PRMS hydrologic models were studied: Feather River in the northern Sierra Nevada mountains and the central coast Soquel-Aptos. Each local trend was changed without changing the other trends or the total precipitation. Feather River Basin's critical supply to Lake Oroville and the State Water Project benefited from a total streamflow increase of 1.5%. The Soquel-Aptos Basin water supply was impacted by a total groundwater recharge decrease of -7.5% and streamflow decrease of -3.2%.
Zhang, Haixia; Zhao, Junkang; Gu, Caijiao; Cui, Yan; Rong, Huiying; Meng, Fanlong; Wang, Tong
2015-05-01
A study of medical expenditure and its influencing factors among students enrolled in Urban Resident Basic Medical Insurance (URBMI) in Taiyuan indicated that non-response bias and selection bias coexist in the dependent variable of the survey data. Unlike previous studies, which focused on only one missingness mechanism, this study proposes a two-stage method that deals with two missingness mechanisms simultaneously by combining multiple imputation with a sample selection model. A total of 1 190 questionnaires were returned by the students (or their parents) selected in child care settings, schools and universities in Taiyuan by stratified cluster random sampling in 2012. In the returned questionnaires, the dependent variable was not missing at random (NMAR) in 2.52% and missing at random (MAR) in 7.14%. First, multiple imputation was conducted for the MAR values by using the completed data; then a sample selection model was used to correct for NMAR in the multiple imputation, and a multi-factor analysis model was established. Based on 1 000 resamplings, the best scheme for filling in the randomly missing values at the observed missing proportion was the predictive mean matching (PMM) method. With this optimal scheme, a two-stage survey was conducted. Finally, it was found that the factors influencing annual medical expenditure among the students enrolled in URBMI in Taiyuan included population group, annual household gross income, affordability of medical insurance expenditure, chronic disease, seeking medical care in hospital, seeking medical care in a community health center or private clinic, hospitalization, hospitalization canceled for some reason, self-medication and acceptable proportion of self-paid medical expenditure. The two-stage method combining multiple imputation with a sample selection model can deal effectively with non-response bias and selection bias in the dependent variable of the survey data.
Peyre, Hugo; Leplège, Alain; Coste, Joël
2011-03-01
Missing items are common in quality of life (QoL) questionnaires and present a challenge for research in this field. It remains unclear which of the various methods proposed to deal with missing data performs best in this context. We compared personal mean score, full information maximum likelihood, multiple imputation, and hot deck techniques using various realistic simulation scenarios of item missingness in QoL questionnaires constructed within the framework of classical test theory. Samples of 300 and 1,000 subjects were randomly drawn from the 2003 INSEE Decennial Health Survey (of 23,018 subjects representative of the French population and having completed the SF-36) and various patterns of missing data were generated according to three different item non-response rates (3, 6, and 9%) and three types of missing data (Little and Rubin's "missing completely at random," "missing at random," and "missing not at random"). The missing data methods were evaluated in terms of accuracy and precision for the analysis of one descriptive and one association parameter for three different scales of the SF-36. For all item non-response rates and types of missing data, multiple imputation and full information maximum likelihood appeared superior to the personal mean score and especially to hot deck in terms of accuracy and precision; however, the use of personal mean score was associated with insignificant bias (relative bias <2%) in all studied situations. Whereas multiple imputation and full information maximum likelihood are confirmed as reference methods, the personal mean score appears nonetheless appropriate for dealing with items missing from completed SF-36 questionnaires in most situations of routine use. These results can reasonably be extended to other questionnaires constructed according to classical test theory.
Krausch-Hofmann, Stefanie; Bogaerts, Kris; Hofmann, Michael; de Almeida Mello, Johanna; Fávaro Moreira, Nádia Cristina; Lesaffre, Emmanuel; Declerck, Dominique; Declercq, Anja; Duyck, Joke
2015-01-01
Missing data within the comprehensive geriatric assessment of the interRAI suite of assessment instruments potentially imply the under-detection of conditions that require care as well as the risk of biased statistical results. Impaired oral health in older individuals has to be registered accurately as it causes pain and discomfort and is related to the general health status. This study was based on interRAI-Home Care (HC) baseline data from 7590 subjects (mean age 81.2 years, SD 6.9) in Belgium. It was investigated if missingness of the oral health-related items was associated with selected variables of general health. It was also determined if multiple imputation of missing data affected the associations between oral and general health. Multivariable logistic regression was used to determine if the prevalence of missingness in the oral health-related variables was associated with activities of daily life (ADLH), cognitive performance (CPS2) and depression (DRS). Associations between oral health and ADLH, CPS2 and DRS were determined, with missing data treated by 1. the complete-case technique and 2. by multiple imputation, and results were compared. The individual oral health-related variables had a similar proportion of missing values, ranging from 16.3% to 17.2%. The prevalence of missing data in all oral health-related variables was significantly associated with symptoms of depression (dental prosthesis use OR 1.66, CI 1.41-1.95; damaged teeth OR 1.74, CI 1.48-2.04; chewing problems OR 1.74, CI 1.47-2.05; dry mouth OR 1.65, CI 1.40-1.94). Missingness in damaged teeth (OR 1.27, CI 1.08-1.48), chewing problems (OR 1.22, CI 1.04-1.44) and dry mouth (OR 1.23, CI 1.05-1.44) occurred more frequently in cognitively impaired subjects. ADLH was not associated with the prevalence of missing data. When comparing the complete-case technique with the multiple imputation approach, nearly identical odds ratios characterized the associations between oral and general health. 
Cognitively impaired and depressive individuals had a higher risk of missing oral health-related information. Associations between oral health and ADLH, CPS2 and DRS were not influenced by multiple imputation of missing data. Further research should concentrate on the mechanisms that mediate the occurrence of missingness to develop preventative strategies.
Multivariate missing data in hydrology - Review and applications
NASA Astrophysics Data System (ADS)
Ben Aissia, Mohamed-Aymen; Chebana, Fateh; Ouarda, Taha B. M. J.
2017-12-01
Water resources planning and management require complete data sets of a number of hydrological variables, such as flood peaks and volumes. However, hydrologists are often faced with the problem of missing data (MD) in hydrological databases. Several methods are used to deal with the imputation of MD. During the last decade, multivariate approaches have gained popularity in the field of hydrology, especially in hydrological frequency analysis (HFA). However, the treatment of MD remains neglected in the multivariate HFA literature, where the focus has been mainly on the modeling component. For a complete analysis and in order to optimize the use of data, MD should also be treated in the multivariate setting prior to modeling and inference. Imputation of MD in the multivariate hydrological framework can have direct implications on the quality of the estimation. Indeed, the dependence between the series represents important additional information that can be included in the imputation process. The objective of the present paper is to highlight the importance of treating MD in multivariate hydrological frequency analysis by reviewing and applying multivariate imputation methods and by comparing univariate and multivariate imputation methods. An application is carried out for multiple flood attributes on three sites in order to evaluate the performance of the different methods based on the leave-one-out procedure. The results indicate that the performance of imputation methods can be improved by adopting the multivariate setting, compared to mean substitution and interpolation methods, especially when using the copula-based approach.
Probability genotype imputation method and integrated weighted lasso for QTL identification.
Demetrashvili, Nino; Van den Heuvel, Edwin R; Wit, Ernst C
2013-12-30
Many QTL studies have two common features: (1) often there is missing marker information, (2) among many markers involved in the biological process only a few are causal. In statistics, the second issue falls under the headings "sparsity" and "causal inference". The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. The outcomes of the proposed imputation method are probabilities, which serve as weights in the second step, a weighted lasso. The sparse phenotype inference is employed to select a set of predictive markers for the trait of interest. Simulation studies validate the proposed methodology under a wide range of realistic settings. Furthermore, the methodology outperforms alternative imputation and variable selection methods in such studies. The methodology was applied to an Arabidopsis experiment, containing 69 markers for 165 recombinant inbred lines of an F8 generation. The results confirm previously identified regions; however, several new markers are also found. On the basis of the inferred ROC behavior these markers show good potential for being real, especially for the germination trait Gmax. Our imputation method shows higher accuracy in terms of sensitivity and specificity compared to alternative imputation methods. Also, the proposed weighted lasso outperforms commonly practiced multiple regression as well as the traditional lasso and adaptive lasso with three weighting schemes. This means that under realistic missing data settings this methodology can be used for QTL identification.
Data imputation analysis for Cosmic Rays time series
NASA Astrophysics Data System (ADS)
Fernandes, R. C.; Lucio, P. S.; Fernandez, J. H.
2017-05-01
The occurrence of missing data in Galactic Cosmic Rays (GCR) time series is inevitable, since data are lost through mechanical and human failure, technical problems and differing periods of operation of GCR stations. The aim of this study was to perform multiple dataset imputation in order to depict the observational dataset. The study used the monthly time series of GCR Climax (CLMX) and Roma (ROME) from 1960 to 2004 to simulate scenarios of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of missing data compared to the observed ROME series, with 50 replicates. The CLMX station was then used as a proxy for allocation of these scenarios. Three different methods for monthly dataset imputation were selected: Amelia II, which runs the bootstrap Expectation Maximization algorithm; MICE, which runs an algorithm via Multivariate Imputation by Chained Equations; and MTSDI, an Expectation Maximization algorithm-based method for imputation of missing values in multivariate normal time series. The synthetic time series were compared with the observed ROME series and evaluated using several skill measures such as RMSE, NRMSE, Agreement Index, R, R2, F-test and t-test. The results showed that for CLMX and ROME, the R2 and R statistics were equal to 0.98 and 0.96, respectively. It was observed that increases in the number of gaps generate loss of quality of the time series. Data imputation was more efficient with the MTSDI method, with negligible errors and the best skill coefficients. The results suggest a limit of about 60% of missing data for imputation of monthly averages. It is noteworthy that the CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed reconstructing 43 time series.
MiR-191 Regulates Primary Human Fibroblast Proliferation and Directly Targets Multiple Oncogenes
Polioudakis, Damon; Abell, Nathan S.; Iyer, Vishwanath R.
2015-01-01
miRNAs play a central role in numerous pathologies including multiple cancer types. miR-191 has predominantly been studied as an oncogene, but the role of miR-191 in the proliferation of primary cells is not well characterized, and the miR-191 targetome has not been experimentally profiled. Here we utilized RNA induced silencing complex immunoprecipitations as well as gene expression profiling to construct a genome wide miR-191 target profile. We show that miR-191 represses proliferation in primary human fibroblasts, identify multiple proto-oncogenes as novel miR-191 targets, including CDK9, NOTCH2, and RPS6KA3, and present evidence that miR-191 extensively mediates target expression through coding sequence (CDS) pairing. Our results provide a comprehensive genome wide miR-191 target profile, and demonstrate miR-191’s regulation of primary human fibroblast proliferation. PMID:25992613
Fiero, Mallorie H; Hsu, Chiu-Hsieh; Bell, Melanie L
2017-11-20
We extend the pattern-mixture approach to handle missing continuous outcome data in longitudinal cluster randomized trials, which randomize groups of individuals to treatment arms, rather than the individuals themselves. Individuals who drop out at the same time point are grouped into the same dropout pattern. We approach extrapolation of the pattern-mixture model by applying multilevel multiple imputation, which imputes missing values while appropriately accounting for the hierarchical data structure found in cluster randomized trials. To assess parameters of interest under various missing data assumptions, imputed values are multiplied by a sensitivity parameter, k, which increases or decreases imputed values. Using simulated data, we show that estimates of parameters of interest can vary widely under differing missing data assumptions. We conduct a sensitivity analysis using real data from a cluster randomized trial by increasing k until the treatment effect inference changes. By performing a sensitivity analysis for missing data, researchers can assess whether certain missing data assumptions are reasonable for their cluster randomized trial. Copyright © 2017 John Wiley & Sons, Ltd.
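The sensitivity analysis described in this abstract (imputed values multiplied by a parameter k until the treatment-effect inference changes) can be sketched in a few lines of Python. This is only an illustrative outline under strong simplifying assumptions: a single already-imputed data set, a plain difference in arm means as the "treatment effect", and no adjustment for clustering or pooling across imputations. The helper names `scale_imputed` and `mean_difference` are hypothetical and do not come from the paper.

```python
import numpy as np

def scale_imputed(y_imputed, was_missing, k):
    """Return a copy of y in which only the imputed entries are multiplied by k."""
    y = np.array(y_imputed, dtype=float)
    y[np.asarray(was_missing, dtype=bool)] *= k
    return y

def mean_difference(y, arm):
    """Difference in mean outcome between arm 1 and arm 0 (ignores clustering)."""
    arm = np.asarray(arm)
    return y[arm == 1].mean() - y[arm == 0].mean()

# Toy data: outcomes after imputation, an indicator of which values were
# imputed, and the treatment arm of each individual (all values illustrative).
y_imp = np.array([10.0, 12.0, 11.0, 9.0, 14.0, 15.0, 13.0, 16.0])
missing = np.array([0, 0, 1, 0, 0, 1, 1, 0], dtype=bool)
arm = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Vary k and track how the estimated effect moves; k = 1 leaves the
# imputations unchanged.
for k in (0.8, 1.0, 1.2):
    effect = mean_difference(scale_imputed(y_imp, missing, k), arm)
    print(f"k={k:.1f}  estimated effect={effect:.3f}")
```

In an actual analysis the estimate for each k would come from the trial's analysis model fitted to every imputed data set and pooled, rather than from a raw difference in means.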
Multiple imputation to account for measurement error in marginal structural models
Edwards, Jessie K.; Cole, Stephen R.; Westreich, Daniel; Crane, Heidi; Eron, Joseph J.; Mathews, W. Christopher; Moore, Richard; Boswell, Stephen L.; Lesko, Catherine R.; Mugavero, Michael J.
2015-01-01
Background Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and non-differential measurement error in a marginal structural model. Methods We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. Results In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality [hazard ratio (HR): 1.2 (95% CI: 0.6, 2.3)]. The HR for current smoking and therapy (0.4 (95% CI: 0.2, 0.7)) was similar to the HR for no smoking and therapy (0.4; 95% CI: 0.2, 0.6). Conclusions Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies. PMID:26214338
Comparing multiple imputation methods for systematically missing subject-level data.
Kline, David; Andridge, Rebecca; Kaizar, Eloise
2017-06-01
When conducting research synthesis, the collection of studies that will be combined often do not measure the same set of variables, which creates missing data. When the studies to combine are longitudinal, missing data can occur on the observation-level (time-varying) or the subject-level (non-time-varying). Traditionally, the focus of missing data methods for longitudinal data has been on missing observation-level variables. In this paper, we focus on missing subject-level variables and compare two multiple imputation approaches: a joint modeling approach and a sequential conditional modeling approach. We find the joint modeling approach to be preferable to the sequential conditional approach, except when the covariance structure of the repeated outcome for each individual has homogeneous variance and exchangeable correlation. Specifically, the regression coefficient estimates from an analysis incorporating imputed values based on the sequential conditional method are attenuated and less efficient than those from the joint method. Remarkably, the estimates from the sequential conditional method are often less efficient than a complete case analysis, which, in the context of research synthesis, implies that we lose efficiency by combining studies. Copyright © 2015 John Wiley & Sons, Ltd.
Theory of Multiple Intelligences: Is It a Scientific Theory?
ERIC Educational Resources Information Center
Chen, Jie-Qi
2004-01-01
This essay discusses the status of multiple intelligences (MI) theory as a scientific theory by addressing three issues: the empirical evidence Gardner used to establish MI theory, the methodology he employed to validate MI theory, and the purpose or function of MI theory.
Gu, Chunming; Li, Tianfu; Yin, Zhao; Chen, Shengting; Fei, Jia; Shen, Jianping; Zhang, Yuan
2017-05-01
Berberine (BBR), a traditional Chinese herbal medicine compound, has emerged as a novel class of anti-tumor agent. Our previous microRNA (miRNA) microarray demonstrated that miR-106b/25 was significantly down-regulated in BBR-treated multiple myeloma (MM) cells. Here, systematic integration showed that the miR-106b/25 cluster is involved in multiple cancer-related signaling pathways and tumorigenesis. The MiREnvironment database revealed that multiple environmental factors (drug, ionizing radiation, hypoxia) affected miR-106b/25 cluster expression. By targeting the seed region in the miRNA, tiny anti-mir106b/25 cluster (t-anti-mir106b/25 cluster) significantly suppressed cell viability and colony formation. Western blot validated that t-anti-miR-106b/25 cluster effectively inhibited the expression of P38 MAPK and phospho-P38 MAPK in MM cells. These findings indicate that the miR-106b/25 cluster functions as an oncogene and might provide a novel molecular insight into MM.
miR-17-92 cluster microRNAs confers tumorigenicity in multiple myeloma.
Chen, Lijuan; Li, Chunming; Zhang, Run; Gao, Xiao; Qu, Xiaoyan; Zhao, Min; Qiao, Chun; Xu, Jiaren; Li, Jianyong
2011-10-01
miRNAs play important roles in the regulation of cell proliferation, differentiation and apoptosis. The deregulation of miRNA expression contributes to tumorigenesis by modulating oncogenic and tumor suppressor signaling pathways. The oncogenic transcription factor Myc can control expression of a large set of microRNAs (miRNAs). Previous studies have shown that the expression of the miR-17-92 cluster, a polycistron encoding six miRNAs, is closely related to the expression of Myc. In the current study, silencing Myc in multiple myeloma (MM) cells induced cell death and growth inhibition, and downregulated expression of the miR-17-92 cluster. Overexpression of miR-17 or miR-18 could partly abrogate Myc-knockdown-induced MM cell apoptosis. One of the mechanisms by which Myc inhibits MM cell apoptosis is that Myc activates the miR-17-92 cluster, which subsequently down-modulates the proapoptotic protein Bim. Although the miR-17-92 cluster is located at 13q31.3, the expression of miR-18, miR-19 and miR-20 (especially miR-19) in patients with del(13q14) was higher than in those without del(13q14). Patients with high expression of miR-17, miR-20 and miR-92 had shorter PFS than those with low expression of miR-17, miR-20 and miR-92. These results suggest that the Myc-inducible miR-17-92 cluster miRNAs contribute to tumorigenesis and poor prognosis in multiple myeloma. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
Missing data imputation of solar radiation data under different atmospheric conditions.
Turrado, Concepción Crespo; López, María Del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; Juez, Francisco Javier de Cos
2014-10-29
Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW.
Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions
Turrado, Concepción Crespo; López, María del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; de Cos Juez, Francisco Javier
2014-01-01
Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37% while that for the MLR it was 28.19%, and 31.68% for the IDW. PMID:25356644
Two-pass imputation algorithm for missing value estimation in gene expression time series.
Tsiporkova, Elena; Boeva, Veselka
2007-10-01
Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. 
The software also provides for a choice between three different initial rough imputation methods.
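The Dynamic Time Warping (DTW) distance on which the DTWimpute algorithms rely can be illustrated with the textbook dynamic-programming recurrence. The Python sketch below is not the authors' Perl or C++ code; it assumes the absolute difference as the local cost and applies no warping-window constraint, and the function name `dtw_distance` is chosen here only for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance with absolute-difference cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # D[i, j] = minimal cumulative cost of aligning a[:i] with b[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Two profiles that differ only by a local stretch in time, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have DTW distance zero, which is why the measure suits expression profiles whose responses are shifted or stretched relative to one another.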
State Alcohol-Impaired-Driving Estimates
... For more information on multiple imputation see NHTSA’s Technical Report (DOT HS 809 403, www- nrd. nhtsa. ... involvement); and NHTSA’s National Center for Statistics and Analysis 1200 New Jersey Avenue SE., Washington, DC 20590 ...
A multiple imputation strategy for sequential multiple assignment randomized trials
Shortreed, Susan M.; Laber, Eric; Stroup, T. Scott; Pineau, Joelle; Murphy, Susan A.
2014-01-01
Sequential multiple assignment randomized trials (SMARTs) are increasingly being used to inform clinical and intervention science. In a SMART, each patient is repeatedly randomized over time. Each randomization occurs at a critical decision point in the treatment course. These critical decision points often correspond to milestones in the disease process or other changes in a patient’s health status. Thus, the timing and number of randomizations may vary across patients and depend on evolving patient-specific information. This presents unique challenges when analyzing data from a SMART in the presence of missing data. This paper presents the first comprehensive discussion of missing data issues typical of SMART studies: we describe five specific challenges, and propose a flexible imputation strategy to facilitate valid statistical estimation and inference using incomplete data from a SMART. To illustrate these contributions, we consider data from the Clinical Antipsychotic Trial of Intervention and Effectiveness (CATIE), one of the most well-known SMARTs to date. PMID:24919867
Determinants of High-School Dropout: A Longitudinal Study in a Deprived Area of Japan.
Tabuchi, Takahiro; Fujihara, Sho; Shinozaki, Tomohiro; Fukuhara, Hiroyuki
2018-05-19
Our objective in this study was to find determinants of high-school dropout in a deprived area of Japan using longitudinal data, including socio-demographic and junior high school-period information. We followed 695 students who graduated from a junior high school located in a deprived area of Japan between 2002 and 2010 for 3 years after graduation (614 students: follow-up rate, 88.3%). Multivariable log-binomial regression models were used to calculate the prevalence ratios (PRs) for high-school dropout, using multiple imputation (MI) to account for non-response at follow-up. The MI model estimated that 18.7% of students dropped out of high school in approximately 3 years. In the covariates-adjusted model, three factors were significantly associated with high-school dropout: ≥10 days of tardy arrival in junior high school (PR 6.44; 95% confidence interval [CI], 1.69-24.6 for "10-29 days of tardy arrival" and PR 8.01; 95% CI, 2.05-31.3 for "≥30 days of tardy arrival" compared with "0 day of tardy arrival"), daily smoking (PR 2.01; 95% CI, 1.41-2.86) and severe problems, such as abuse and neglect (PR 1.66; 95% CI, 1.16-2.39). Among students with ≥30 days of tardy arrival in addition to daily smoking or experience of severe problems, ≥50% high-school dropout rates were observed. Three determinants of high-school dropout were found: smoking, tardy arrival, and experience of severe problems. These factors were correlated and should be treated as warning signs of complex behavioral and academic problems. Parents, educators, and policy makers should work together to implement effective strategies to prevent school dropout.
Factors Influencing Early Feeding of Foods and Drinks Containing Free Sugars-A Birth Cohort Study.
Ha, Diep H; Do, Loc G; Spencer, Andrew John; Thomson, William Murray; Golley, Rebecca K; Rugg-Gunn, Andrew J; Levy, Steven M; Scott, Jane A
2017-10-23
Early feeding of free sugars to young children can increase the preference for sweetness and the risk of consuming a cariogenic diet high in free sugars later in life. This study aimed to investigate early life factors influencing early introduction of foods/drinks containing free sugars. Data from an ongoing population-based birth cohort study in Australia were used. Mothers of newborn children completed questionnaires at birth and subsequently at ages 3, 6, 12, and 24 months. The outcome was reported feeding (Yes/No) at age 6-9 months of common food/drink sources of free sugars (hereafter referred to as foods/drinks with free sugars). Household income quartiles, mother's sugar-sweetened beverage (SSB) consumption, and other maternal factors were exposure variables. Analysis was conducted progressively from bivariate to multivariable log-binomial regression with robust standard error estimation to calculate prevalence ratios (PR) of being fed foods/drinks with free sugars at an early age (by 6-9 months). Models for both complete cases and with multiple imputations (MI) for missing data were generated. Of 1479 mother/child dyads, 21% of children had been fed foods/drinks with free sugars. There was a strong income gradient and a significant positive association with maternal SSB consumption. In the complete-case model, income Q1 and Q2 had PRs of 1.9 (1.2-3.1) and 1.8 (1.2-2.6) against Q4, respectively. The PR for mothers ingesting SSB every day was 1.6 (1.2-2.3). The PR for children who had been breastfed to at least three months was 0.6 (0.5-0.8). Similar findings were observed in the MI model. Household income at birth and maternal behaviours were significant determinants of early feeding of foods/drinks with free sugars.
Xue, Xiaonan; Shore, Roy E; Ye, Xiangyang; Kim, Mimi Y
2004-10-01
Occupational exposures are often recorded as zero when the exposure is below the minimum detection level (BMDL). This can lead to an underestimation of the doses received by individuals and can lead to biased estimates of risk in occupational epidemiologic studies. The extent of the exposure underestimation is increased with the magnitude of the minimum detection level (MDL) and the frequency of monitoring. This paper uses multiple imputation methods to impute values for the missing doses due to BMDL. A Gibbs sampling algorithm is developed to implement the method, which is applied to two distinct scenarios: when dose information is available for each measurement (but BMDL is recorded as zero or some other arbitrary value), or when the dose information available represents the summation of a series of measurements (e.g., only yearly cumulative exposure is available but based on, say, weekly measurements). Then the average of the multiple imputed exposure realizations for each individual is used to obtain an unbiased estimate of the relative risk associated with exposure. Simulation studies are used to evaluate the performance of the estimators. As an illustration, the method is applied to a sample of historical occupational radiation exposure data from the Oak Ridge National Laboratory.
Aloisio, Kathryn M.; Swanson, Sonja A.; Micali, Nadia; Field, Alison; Horton, Nicholas J.
2015-01-01
Clustered data arise in many settings, particularly within the social and biomedical sciences. As an example, multiple–source reports are commonly collected in child and adolescent psychiatric epidemiologic studies where researchers use various informants (e.g. parent and adolescent) to provide a holistic view of a subject’s symptomatology. Fitzmaurice et al. (1995) have described estimation of multiple source models using a standard generalized estimating equation (GEE) framework. However, these studies often have missing data due to the additional stages of consent and assent required. The usual GEE is unbiased when missingness is Missing Completely at Random (MCAR) in the sense of Little and Rubin (2002). This is a strong assumption that may not be tenable. Other options such as weighted generalized estimating equations (WEEs) are computationally challenging when missingness is non–monotone. Multiple imputation is an attractive method to fit incomplete data models while only requiring the less restrictive Missing at Random (MAR) assumption. Previously, estimation with partially observed clustered data was computationally challenging; however, recent developments in Stata have facilitated its use in practice. We demonstrate how to utilize multiple imputation in conjunction with a GEE to investigate the prevalence of disordered eating symptoms in adolescents reported by parents and adolescents as well as factors associated with concordance and prevalence. The methods are motivated by the Avon Longitudinal Study of Parents and their Children (ALSPAC), a cohort study that enrolled more than 14,000 pregnant mothers in 1991–92 and has followed the health and development of their children at regular intervals. While point estimates were fairly similar to the GEE under MCAR, the MAR model had smaller standard errors, while requiring less stringent assumptions regarding missingness. PMID:25642154
Multiple Intelligences: From the Ivory Tower to the Dusty Classroom - But Why?
ERIC Educational Resources Information Center
Kornhaber, Mindy L.
2004-01-01
This article draws on research conducted over a 10-year period in an attempt to answer three central questions about the widespread adoption of Gardner's theory of multiple intelligences (MI): Why do educators adopt MI? Once MI is adopted, does anything really change in practice? When educators claim MI is working, what is happening in practice?
Taking a multiple intelligences (MI) perspective.
Gardner, Howard
2017-01-01
The theory of multiple intelligences (MI) seeks to describe and encompass the range of human cognitive capacities. In challenging the concept of general intelligence, we can apply an MI perspective that may provide a more useful approach to cognitive differences within and across species.
Preuss, Michael; König, Inke R; Thompson, John R; Erdmann, Jeanette; Absher, Devin; Assimes, Themistocles L; Blankenberg, Stefan; Boerwinkle, Eric; Chen, Li; Cupples, L Adrienne; Hall, Alistair S; Halperin, Eran; Hengstenberg, Christian; Holm, Hilma; Laaksonen, Reijo; Li, Mingyao; März, Winfried; McPherson, Ruth; Musunuru, Kiran; Nelson, Christopher P; Burnett, Mary Susan; Epstein, Stephen E; O'Donnell, Christopher J; Quertermous, Thomas; Rader, Daniel J; Roberts, Robert; Schillert, Arne; Stefansson, Kari; Stewart, Alexandre F R; Thorleifsson, Gudmar; Voight, Benjamin F; Wells, George A; Ziegler, Andreas; Kathiresan, Sekar; Reilly, Muredach P; Samani, Nilesh J; Schunkert, Heribert
2010-10-01
Recent genome-wide association studies (GWAS) of myocardial infarction (MI) and other forms of coronary artery disease (CAD) have led to the discovery of at least 13 genetic loci. In addition to the effect size, power to detect associations is largely driven by sample size. Therefore, to maximize the chance of finding novel susceptibility loci for CAD and MI, the Coronary ARtery DIsease Genome-wide Replication And Meta-analysis (CARDIoGRAM) consortium was formed. CARDIoGRAM combines data from all published and several unpublished GWAS in individuals with European ancestry; includes >22 000 cases with CAD, MI, or both and >60 000 controls; and unifies samples from the Atherosclerotic Disease VAscular functioN and genetiC Epidemiology study, CADomics, Cohorts for Heart and Aging Research in Genomic Epidemiology, deCODE, the German Myocardial Infarction Family Studies I, II, and III, Ludwigshafen Risk and Cardiovascular Health Study/AtheroRemo, MedStar, Myocardial Infarction Genetics Consortium, Ottawa Heart Genomics Study, PennCath, and the Wellcome Trust Case Control Consortium. Genotyping was carried out on Affymetrix or Illumina platforms followed by imputation of genotypes in most studies. On average, 2.2 million single nucleotide polymorphisms were generated per study. The results from each study are combined using meta-analysis. As proof of principle, we meta-analyzed risk variants at 9p21 and found that rs1333049 confers a 29% increase in risk for MI per copy (P=2×10⁻²⁰). CARDIoGRAM is poised to contribute to our understanding of the role of common genetic variation on risk for CAD and MI.
Design of the Coronary ARtery DIsease Genome-Wide Replication And Meta-Analysis (CARDIoGRAM) Study
Preuss, Michael; König, Inke R.; Thompson, John R.; Erdmann, Jeanette; Absher, Devin; Assimes, Themistocles L.; Blankenberg, Stefan; Boerwinkle, Eric; Chen, Li; Cupples, L. Adrienne; Hall, Alistair S.; Halperin, Eran; Hengstenberg, Christian; Holm, Hilma; Laaksonen, Reijo; Li, Mingyao; März, Winfried; McPherson, Ruth; Musunuru, Kiran; Nelson, Christopher P.; Burnett, Mary Susan; Epstein, Stephen E.; O’Donnell, Christopher J.; Quertermous, Thomas; Rader, Daniel J.; Roberts, Robert; Schillert, Arne; Stefansson, Kari; Stewart, Alexandre F.R.; Thorleifsson, Gudmar; Voight, Benjamin F.; Wells, George A.; Ziegler, Andreas; Kathiresan, Sekar; Reilly, Muredach P.; Samani, Nilesh J.; Schunkert, Heribert
2011-01-01
Background Recent genome-wide association studies (GWAS) of myocardial infarction (MI) and other forms of coronary artery disease (CAD) have led to the discovery of at least 13 genetic loci. In addition to the effect size, power to detect associations is largely driven by sample size. Therefore, to maximize the chance of finding novel susceptibility loci for CAD and MI, the Coronary ARtery DIsease Genome-wide Replication And Meta-analysis (CARDIoGRAM) consortium was formed. Methods and Results CARDIoGRAM combines data from all published and several unpublished GWAS in individuals with European ancestry; includes >22 000 cases with CAD, MI, or both and >60 000 controls; and unifies samples from the Atherosclerotic Disease VAscular functioN and genetiC Epidemiology study, CADomics, Cohorts for Heart and Aging Research in Genomic Epidemiology, deCODE, the German Myocardial Infarction Family Studies I, II, and III, Ludwigshafen Risk and Cardiovascular Health Study/AtheroRemo, MedStar, Myocardial Infarction Genetics Consortium, Ottawa Heart Genomics Study, PennCath, and the Wellcome Trust Case Control Consortium. Genotyping was carried out on Affymetrix or Illumina platforms followed by imputation of genotypes in most studies. On average, 2.2 million single nucleotide polymorphisms were generated per study. The results from each study are combined using meta-analysis. As proof of principle, we meta-analyzed risk variants at 9p21 and found that rs1333049 confers a 29% increase in risk for MI per copy (P=2×10⁻²⁰). Conclusion CARDIoGRAM is poised to contribute to our understanding of the role of common genetic variation on risk for CAD and MI. PMID:20923989
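The abstract above says only that per-study results "are combined using meta-analysis" and does not name the method. As an illustration of one common choice, the following is a minimal sketch of fixed-effect inverse-variance pooling. The function name `fixed_effect_meta` and the effect sizes are hypothetical and are not taken from CARDIoGRAM.

```python
import math

def fixed_effect_meta(betas, ses):
    """Pool per-study effect estimates with inverse-variance weights.

    betas: per-study effect estimates (e.g. log odds ratios).
    ses:   their standard errors.
    Returns the pooled estimate and its standard error.
    """
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical log odds ratios from two studies of equal precision.
betas = [0.20, 0.30]
ses = [0.10, 0.10]
pooled, pooled_se = fixed_effect_meta(betas, ses)
print(pooled, pooled_se)
```

With equal standard errors the pooled estimate is the simple average of the two studies, and the pooled standard error is smaller than either input, which is the sense in which combining cohorts increases power.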
Arabidopsis ARGONAUTE7 selects miR390 through multiple checkpoints during RISC assembly.
Endo, Yayoi; Iwakawa, Hiro-oki; Tomari, Yukihide
2013-07-01
Plant ARGONAUTE7 (AGO7) assembles RNA-induced silencing complex (RISC) specifically with miR390 and regulates the auxin-signalling pathway via production of TAS3 trans-acting siRNAs (tasiRNAs). However, how AGO7 discerns miR390 among other miRNAs remains unclear. Here, we show that the 5' adenosine of miR390 and the central region of miR390/miR390* duplex are critical for the specific interaction with AGO7. Furthermore, despite the existence of mismatches in the seed and central regions of the duplex, cleavage of the miR390* strand is required for maturation of AGO7-RISC. These findings suggest that AGO7 uses multiple checkpoints to select miR390, thereby circumventing promiscuous tasiRNA production.
yaImpute: An R package for kNN imputation
Nicholas L. Crookston; Andrew O. Finley
2008-01-01
This article introduces yaImpute, an R package for nearest neighbor search and imputation. Although nearest neighbor imputation is used in a host of disciplines, the methods implemented in the yaImpute package are tailored to imputation-based forest attribute estimation and mapping. The impetus to writing the yaImpute is a growing interest in nearest neighbor...
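To make the idea of nearest neighbor imputation concrete, here is a minimal sketch in Python. It is not the yaImpute API (yaImpute is an R package); the function `knn_impute`, the Euclidean distance, and the toy data are assumptions chosen for illustration.

```python
import math

def knn_impute(reference, targets, k=1):
    """Impute a response for each target from its k nearest reference rows.

    reference: list of (features, response) pairs with observed responses.
    targets:   list of feature vectors whose response is missing.
    """
    imputed = []
    for x in targets:
        # Rank reference rows by Euclidean distance to the target.
        ranked = sorted(reference, key=lambda row: math.dist(row[0], x))
        neighbours = ranked[:k]
        # Average the responses of the k nearest neighbours.
        imputed.append(sum(y for _, y in neighbours) / len(neighbours))
    return imputed

# Hypothetical reference observations: (features, observed response).
reference = [((0.0, 0.0), 10.0), ((1.0, 0.0), 20.0), ((5.0, 5.0), 100.0)]
targets = [(0.1, 0.0), (4.0, 4.0)]
result = knn_impute(reference, targets, k=1)
print(result)
```

With `k=1` each target simply inherits the response of its closest reference observation; larger `k` averages over several neighbours.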
ERIC Educational Resources Information Center
McNamee, Paul; Madden, Dave; McNamee, Frank; Wall, John; Hurst, Alan; Vrasidas, Charalambos; Chanquoy, Lucile; Baccino, Thierry; Acar, Emrah; Onwy-Yazici, Ela; Jordan, Ann
2009-01-01
This paper describes an ongoing EU project concerned with developing an instructional design framework for virtual classes (VC) that is based on the theory of Multiple Intelligences (MI) (1983). The psychological theory of Multiple Intelligences (Gardner 1983) has received much credence within instructional design since its inception and has been…
A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets.
Carrig, Madeline M; Manrique-Vallier, Daniel; Ranby, Krista W; Reiter, Jerome P; Hoyle, Rick H
2015-01-01
Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches.
Best practices for missing data management in counseling psychology.
Schlomer, Gabriel L; Bauman, Sheri; Card, Noel A
2010-01-01
This article urges counseling psychology researchers to recognize and report how missing data are handled, because consumers of research cannot accurately interpret findings without knowing the amount and pattern of missing data or the strategies that were used to handle those data. Patterns of missing data are reviewed, and some of the common strategies for dealing with them are described. The authors provide an illustration in which data were simulated and evaluate 3 methods of handling missing data: mean substitution, multiple imputation, and full information maximum likelihood. Results suggest that mean substitution is a poor method for handling missing data, whereas both multiple imputation and full information maximum likelihood are recommended alternatives to this approach. The authors suggest that researchers fully consider and report the amount and pattern of missing data and the strategy for handling those data in counseling psychology research and that editors advise researchers of this expectation.
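One reason mean substitution performs poorly can be seen in a small sketch: filling every gap with the observed mean leaves the mean unchanged but shrinks the spread of the variable. The data and the helper `mean_substitute` below are hypothetical and are not the simulation used in the article.

```python
import statistics

def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical scores with two missing values.
scores = [2.0, 4.0, None, 6.0, None, 8.0]
observed = [v for v in scores if v is not None]
filled = mean_substitute(scores)

# Sample standard deviation before and after mean substitution.
sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(filled)
print(sd_observed, sd_filled)
```

The filled series has a smaller standard deviation than the observed values alone, so analyses that rely on it understate variability.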
Considerations of multiple imputation approaches for handling missing data in clinical trials.
Quan, Hui; Qi, Li; Luo, Xiaodong; Darchy, Loic
2018-07-01
Missing data exist in all clinical trials, and they pose a serious problem for the interpretability of trial results. There is no universally applicable solution for all missing data problems. The methods used to handle missing data depend on the circumstances, particularly the assumptions about the missing data mechanism. In recent years, when the missing at random mechanism cannot be assumed, conservative approaches such as control-based and return-to-baseline multiple imputation have been applied. In this paper, we focus on the variability in data analysis of these approaches. As demonstrated by examples, the choice of the variability can impact the conclusion of the analysis. Besides the methods for continuous endpoints, we also discuss methods for binary and time-to-event endpoints as well as considerations for non-inferiority assessment. Copyright © 2018. Published by Elsevier Inc.
A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets
Carrig, Madeline M.; Manrique-Vallier, Daniel; Ranby, Krista W.; Reiter, Jerome P.; Hoyle, Rick H.
2015-01-01
Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches. PMID:26257437
Impact of pre-imputation SNP-filtering on genotype imputation results
2014-01-01
Background Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, research and understanding of the impact of initial SNP-data quality control on imputation results is still limited. In this paper, we aim to evaluate the effect of different strategies of pre-imputation quality filtering on the performance of the widely used imputation algorithms MaCH and IMPUTE. Results We considered three scenarios: imputation of partially missing genotypes with usage of an external reference panel, without usage of an external reference panel, as well as imputation of completely un-typed SNPs using an external reference panel. We first created various datasets applying different SNP quality filters and masking certain percentages of randomly selected high-quality SNPs. We imputed these SNPs and compared the results between the different filtering scenarios by using established and newly proposed measures of imputation quality. While the established measures assess certainty of imputation results, our newly proposed measures focus on the agreement with true genotypes. These measures showed that pre-imputation SNP-filtering might be detrimental regarding imputation quality. Moreover, the strongest drivers of imputation quality were in general the burden of missingness and the number of SNPs used for imputation. We also found that using a reference panel always improves imputation quality of partially missing genotypes. MaCH performed slightly better than IMPUTE2 in most of our scenarios. Again, these results were more pronounced when using our newly defined measures of imputation quality. Conclusion Even a moderate filtering has a detrimental effect on the imputation quality. Therefore little or no SNP filtering prior to imputation appears to be the best strategy for imputing small to moderately sized datasets. 
Our results also showed that for these datasets, MaCH performs slightly better than IMPUTE2 in most scenarios at the cost of increased computing time. PMID:25112433
Liu, Cuilian; Zhang, Song; Wang, Qizhi; Zhang, Xiaobo
2017-01-01
Cancer progression depends on tumor growth and metastasis, which are activated or suppressed by multiple genes. An individual microRNA may target multiple genes, suggesting that a miRNA may suppress tumor growth and metastasis via simultaneously targeting different genes. However, thus far, this issue has not been explored. In the present study, the findings showed that miR-1 could simultaneously inhibit tumor growth and metastasis of gastric and breast cancers by targeting multiple genes. The results indicated that miR-1 was significantly downregulated in cancer tissues compared with normal tissues. The miR-1 overexpression led to cell cycle arrest in the G1 phase in gastric and breast cancer cells but not in normal cells. Furthermore, the miR-1 overexpression significantly inhibited the metastasis of gastric and breast cancer cells. An analysis of the underlying mechanism revealed that the simultaneous inhibition of tumor growth and metastasis mediated by miR-1 was due to the synchronous targeting of 6 miR-1 target genes encoding cyclin dependent kinase 4, twinfilin actin binding protein 1, calponin 3, coronin 1C, WAS protein family member 2 and thymosin beta 4, X-linked. In vivo assays demonstrated that miR-1 efficiently inhibited tumor growth and metastasis of gastric and breast cancers in nude mice. Therefore, our study contributed novel insights into the miR-1's roles in tumorigenesis of gastric and breast cancers. PMID:28159933
Liu, Cuilian; Zhang, Song; Wang, Qizhi; Zhang, Xiaobo
2017-06-27
Cancer progression depends on tumor growth and metastasis, which are activated or suppressed by multiple genes. An individual microRNA may target multiple genes, suggesting that a miRNA may suppress tumor growth and metastasis via simultaneously targeting different genes. However, thus far, this issue has not been explored. In the present study, the findings showed that miR-1 could simultaneously inhibit tumor growth and metastasis of gastric and breast cancers by targeting multiple genes. The results indicated that miR-1 was significantly downregulated in cancer tissues compared with normal tissues. The miR-1 overexpression led to cell cycle arrest in the G1 phase in gastric and breast cancer cells but not in normal cells. Furthermore, the miR-1 overexpression significantly inhibited the metastasis of gastric and breast cancer cells. An analysis of the underlying mechanism revealed that the simultaneous inhibition of tumor growth and metastasis mediated by miR-1 was due to the synchronous targeting of 6 miR-1 target genes encoding cyclin dependent kinase 4, twinfilin actin binding protein 1, calponin 3, coronin 1C, WAS protein family member 2 and thymosin beta 4, X-linked. In vivo assays demonstrated that miR-1 efficiently inhibited tumor growth and metastasis of gastric and breast cancers in nude mice. Therefore, our study contributed novel insights into the miR-1's roles in tumorigenesis of gastric and breast cancers.
Strategies for Dealing with Missing Accelerometer Data.
Stephens, Samantha; Beyene, Joseph; Tremblay, Mark S; Faulkner, Guy; Pullnayegum, Eleanor; Feldman, Brian M
2018-05-01
Missing data is a universal research problem that can affect studies examining the relationship between physical activity measured with accelerometers and health outcomes. Statistical techniques are available to deal with missing data; however, available techniques have not been synthesized. A scoping review was conducted to summarize the advantages and disadvantages of identified methods of dealing with missing data from accelerometers. Missing data poses a threat to the validity and interpretation of trials using physical activity data from accelerometry. Imputation using multiple imputation techniques is recommended to deal with missing data and improve the validity and interpretation of studies using accelerometry. Copyright © 2018 Elsevier Inc. All rights reserved.
MiR-218 Mediates tumorigenesis and metastasis: Perspectives and implications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lu, Ying-fei; Department of Orthopaedics & Traumatology, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, Hong Kong; Zhang, Li
2015-05-15
MicroRNAs (miRNAs) are a class of small non-coding RNAs that negatively regulate gene expression at the post-transcriptional level. As a highly conserved miRNA across a variety of species, microRNA-218 (miR-218) was found to play pivotal roles in tumorigenesis and progression. A body of evidence has demonstrated that miR-218 acts as a tumor suppressor by targeting many oncogenes related to proliferation, apoptosis and invasion. In this review, we provide a complex overview of miR-218, including its regulatory mechanisms, known functions in cancer and future challenges as a potential therapeutic target in human cancers. - Highlights: • miR-218 is frequently downregulated in multiple cancers. • miR-218 plays pivotal roles in carcinogenesis. • miR-218 mediates proliferation, apoptosis, metastasis, invasion, etc. • miR-218 mediates tumorigenesis and metastasis via multiple pathways.
Fedko, Iryna O; Hottenga, Jouke-Jan; Medina-Gomez, Carolina; Pappa, Irene; van Beijsterveldt, Catharina E M; Ehli, Erik A; Davies, Gareth E; Rivadeneira, Fernando; Tiemeier, Henning; Swertz, Morris A; Middeldorp, Christel M; Bartels, Meike; Boomsma, Dorret I
2015-09-01
Combining genotype data across cohorts increases power to estimate the heritability due to common single nucleotide polymorphisms (SNPs), based on analyzing a Genetic Relationship Matrix (GRM). However, the combination of SNP data across multiple cohorts may lead to stratification, when for example, different genotyping platforms are used. In the current study, we address issues of combining SNP data from different cohorts, the Netherlands Twin Register (NTR) and the Generation R (GENR) study. Both cohorts include children of Northern European Dutch background (N = 3102 + 2826, respectively) who were genotyped on different platforms. We explore imputation and phasing as a tool and compare three GRM-building strategies, when data from two cohorts are (1) just combined, (2) pre-combined and cross-platform imputed and (3) cross-platform imputed and post-combined. We test these three strategies with data on childhood height for unrelated individuals (N = 3124, average age 6.7 years) to explore their effect on SNP-heritability estimates and compare results to those obtained from the independent studies. All combination strategies result in SNP-heritability estimates with a standard error smaller than those of the independent studies. We did not observe a significant difference in estimates of SNP-heritability based on various cross-platform imputed GRMs. SNP-heritability of childhood height was on average estimated as 0.50 (SE = 0.10). Introducing cohort as a covariate resulted in a ≈2% drop. Principal components (PCs) adjustment resulted in SNP-heritability estimates of about 0.39 (SE = 0.11). Strikingly, we did not find a significant difference between cross-platform imputed and combined GRMs. All estimates were significant regardless of the use of PCs adjustment. Based on these analyses we conclude that imputation with a reference set helps to increase power to estimate SNP-heritability by combining cohorts of the same ethnicity genotyped on different platforms. 
However, important factors should be taken into account such as remaining cohort stratification after imputation and/or phenotypic heterogeneity between and within cohorts. Whether one should use imputation, or just combine the genotype data, depends on the number of overlapping SNPs in relation to the total number of genotyped SNPs for both cohorts, and their ability to tag all the genetic variance related to the specific trait of interest.
Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications.
Wu, Xiao-Lin; Xu, Jiaqi; Feng, Guofei; Wiggans, George R; Taylor, Jeremy F; He, Jun; Qian, Changsong; Qiu, Jiansheng; Simpson, Barry; Walker, Jeremy; Bauck, Stewart
2016-01-01
Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, and the latter is computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE with ≤1,000 SNPs, but required considerably more computing time. Nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs that were obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function that was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and imputation error rate was extremely low for chips with ≥3,000 SNPs. Assuming that genotyping or imputation error occurs at random, imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of imputation error rate was propagated to genomic prediction in an Angus population. 
The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNPs selected based on SNP-trait association in U.S. Holstein animals. With this MOLO algorithm, both imputation error rate and genomic prediction error rate were minimal.
Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications
Wu, Xiao-Lin; Xu, Jiaqi; Feng, Guofei; Wiggans, George R.; Taylor, Jeremy F.; He, Jun; Qian, Changsong; Qiu, Jiansheng; Simpson, Barry; Walker, Jeremy; Bauck, Stewart
2016-01-01
Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, and the latter is computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE with ≤1,000 SNPs, but required considerably more computing time. Nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs that were obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function that was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and imputation error rate was extremely low for chips with ≥3,000 SNPs. Assuming that genotyping or imputation error occurs at random, imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of imputation error rate was propagated to genomic prediction in an Angus population. 
The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNPs selected based on SNP-trait association in U.S. Holstein animals. With this MOLO algorithm, both imputation error rate and genomic prediction error rate were minimal. PMID:27583971
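The abstract names locus-averaged Shannon entropy (LASE) as one measure of the information carried by a SNP chip but does not give its formula. The sketch below shows one plausible reading, assuming biallelic SNPs, base-2 logarithms, and a plain average over loci. The paper's exact definition and its uniformity adjustment are not reproduced here, and the function names are hypothetical.

```python
import math

def locus_entropy(p):
    """Shannon entropy in bits of a biallelic locus with allele frequency p."""
    if p <= 0.0 or p >= 1.0:
        # A monomorphic locus carries no information.
        return 0.0
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

def locus_averaged_entropy(freqs):
    """Average the per-locus entropy over all SNPs on a candidate chip."""
    return sum(locus_entropy(p) for p in freqs) / len(freqs)

# Hypothetical allele frequencies for two candidate panels.
informative_panel = [0.5, 0.5]
mixed_panel = [0.5, 0.0]
print(locus_averaged_entropy(informative_panel))
print(locus_averaged_entropy(mixed_panel))
```

Under these assumptions, SNPs with allele frequencies near 0.5 contribute the most entropy, which is consistent with the abstract's emphasis on selecting highly informative SNPs.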
Corso, Phaedra S.; Ingels, Justin B.; Kogan, Steven M.; Foster, E. Michael; Chen, Yi-Fu; Brody, Gene H.
2013-01-01
Programmatic cost analyses of preventive interventions commonly have a number of methodological difficulties. To determine the mean total costs and properly characterize variability, one often has to deal with small sample sizes, skewed distributions, and especially missing data. Standard approaches for dealing with missing data such as multiple imputation may suffer from a small sample size, a lack of appropriate covariates, or too few details around the method used to handle the missing data. In this study, we estimate total programmatic costs for a prevention trial evaluating the Strong African American Families-Teen program. This intervention focuses on the prevention of substance abuse and risky sexual behavior. To account for missing data in the assessment of programmatic costs we compare multiple imputation to probabilistic sensitivity analysis. The latter approach uses collected cost data to create a distribution around each input parameter. We found that with the multiple imputation approach, the mean (95% confidence interval) incremental difference was $2149 ($397, $3901). With the probabilistic sensitivity analysis approach, the incremental difference was $2583 ($778, $4346). Although the true cost of the program is unknown, probabilistic sensitivity analysis may be a more viable alternative for capturing variability in estimates of programmatic costs when dealing with missing data, particularly with small sample sizes and the lack of strong predictor variables. Further, the larger standard errors produced by the probabilistic sensitivity analysis method may signal its ability to capture more of the variability in the data, thus better informing policymakers on the potentially true cost of the intervention. PMID:23299559
Corso, Phaedra S; Ingels, Justin B; Kogan, Steven M; Foster, E Michael; Chen, Yi-Fu; Brody, Gene H
2013-10-01
Programmatic cost analyses of preventive interventions commonly have a number of methodological difficulties. To determine the mean total costs and properly characterize variability, one often has to deal with small sample sizes, skewed distributions, and especially missing data. Standard approaches for dealing with missing data such as multiple imputation may suffer from a small sample size, a lack of appropriate covariates, or too few details around the method used to handle the missing data. In this study, we estimate total programmatic costs for a prevention trial evaluating the Strong African American Families-Teen program. This intervention focuses on the prevention of substance abuse and risky sexual behavior. To account for missing data in the assessment of programmatic costs we compare multiple imputation to probabilistic sensitivity analysis. The latter approach uses collected cost data to create a distribution around each input parameter. We found that with the multiple imputation approach, the mean (95 % confidence interval) incremental difference was $2,149 ($397, $3,901). With the probabilistic sensitivity analysis approach, the incremental difference was $2,583 ($778, $4,346). Although the true cost of the program is unknown, probabilistic sensitivity analysis may be a more viable alternative for capturing variability in estimates of programmatic costs when dealing with missing data, particularly with small sample sizes and the lack of strong predictor variables. Further, the larger standard errors produced by the probabilistic sensitivity analysis method may signal its ability to capture more of the variability in the data, thus better informing policymakers on the potentially true cost of the intervention.
Jakobsen, Janus Christian; Gluud, Christian; Wetterslev, Jørn; Winkel, Per
2017-12-06
Missing data may seriously compromise inferences from randomised clinical trials, especially if missing data are not handled appropriately. The potential bias due to missing data depends on the mechanism causing the data to be missing, and the analytical methods applied to amend the missingness. Therefore, the analysis of trial data with missing values requires careful planning and attention. The authors had several meetings and discussions considering optimal ways of handling missing data to minimise the bias potential. We also searched PubMed (key words: missing data; randomi*; statistical analysis) and reference lists of known studies for papers (theoretical papers; empirical studies; simulation studies; etc.) on how to deal with missing data when analysing randomised clinical trials. Handling missing data is an important, yet difficult and complex task when analysing results of randomised clinical trials. We consider how to optimise the handling of missing data during the planning stage of a randomised clinical trial and recommend analytical approaches which may prevent bias caused by unavoidable missing data. We consider the strengths and limitations of using best-worst and worst-best sensitivity analyses, multiple imputation, and full information maximum likelihood. We also present practical flowcharts on how to deal with missing data and an overview of the steps that always need to be considered during the analysis stage of a trial. We present a practical guide and flowcharts describing when and how multiple imputation should be used to handle missing data in randomised clinical trials.
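Multiple imputation ends with a pooling step in which the estimates from the imputed data sets are combined. A minimal sketch of the standard pooling rules (Rubin's rules) follows. The function name `pool_rubin` and the numbers are hypothetical, and the sketch omits the degrees-of-freedom calculation needed for confidence intervals.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total

# Hypothetical estimates from m = 3 imputed data sets.
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputations, which is how the pooled result reflects the uncertainty caused by the missing data.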
Arabidopsis ARGONAUTE7 selects miR390 through multiple checkpoints during RISC assembly
Endo, Yayoi; Iwakawa, Hiro-oki; Tomari, Yukihide
2013-01-01
Plant ARGONAUTE7 (AGO7) assembles RNA-induced silencing complex (RISC) specifically with miR390 and regulates the auxin-signalling pathway via production of TAS3 trans-acting siRNAs (tasiRNAs). However, how AGO7 discerns miR390 among other miRNAs remains unclear. Here, we show that the 5′ adenosine of miR390 and the central region of miR390/miR390* duplex are critical for the specific interaction with AGO7. Furthermore, despite the existence of mismatches in the seed and central regions of the duplex, cleavage of the miR390* strand is required for maturation of AGO7–RISC. These findings suggest that AGO7 uses multiple checkpoints to select miR390, thereby circumventing promiscuous tasiRNA production. PMID:23732541
Estimates of alcohol involvement in fatal crashes : new alcohol methodology
DOT National Transportation Integrated Search
2002-01-01
The National Highway Traffic Safety Administration (NHTSA) has adopted a new method to estimate missing blood alcohol concentration (BAC) test result data. This new method, multiple imputation, will be used by NHTSA's National Center for Statis...
Missing data in FFQs: making assumptions about item non-response.
Lamb, Karen E; Olstad, Dana Lee; Nguyen, Cattram; Milte, Catherine; McNaughton, Sarah A
2017-04-01
FFQs are a popular method of capturing dietary information in epidemiological studies and may be used to derive dietary exposures such as nutrient intake or overall dietary patterns and diet quality. As FFQs can involve large numbers of questions, participants may fail to respond to all questions, leaving researchers to decide how to deal with missing data when deriving intake measures. The aim of the present commentary is to discuss the current practice for dealing with item non-response in FFQs and to propose a research agenda for reporting and handling missing data in FFQs. Single imputation techniques, such as zero imputation (assuming no consumption of the item) or mean imputation, are commonly used to deal with item non-response in FFQs. However, single imputation methods make strong assumptions about the missing data mechanism and do not reflect the uncertainty created by the missing data. This can lead to incorrect inference about associations between diet and health outcomes. Although the use of multiple imputation methods in epidemiology has increased, these have seldom been used in the field of nutritional epidemiology to address missing data in FFQs. We discuss methods for dealing with item non-response in FFQs, highlighting the assumptions made under each approach. Researchers analysing FFQs should ensure that missing data are handled appropriately and clearly report how missing data were treated in analyses. Simulation studies are required to enable systematic evaluation of the utility of various methods for handling item non-response in FFQs under different assumptions about the missing data mechanism.
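The two single imputation techniques named above can be contrasted in a few lines. This is a sketch only: the item values are invented, `None` marks a skipped FFQ item, and the helper names are hypothetical.

```python
from statistics import mean


def zero_impute(values):
    """Treat every missing item as 'no consumption'."""
    return [0.0 if v is None else v for v in values]


def mean_impute(values):
    """Replace every missing item with the mean of the observed items."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical weekly servings of one food item across five participants
servings = [2.0, None, 4.0, None, 3.0]
print(mean(zero_impute(servings)))  # pulled towards zero
print(mean(mean_impute(servings)))  # equals the observed mean
```

Both approaches return one filled-in data set and therefore carry no record of how uncertain the filled values are, which is the limitation the commentary raises.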
Imputation approaches for animal movement modeling
Scharf, Henry; Hooten, Mevin B.; Johnson, Devin S.
2017-01-01
The analysis of telemetry data is common in animal ecological studies. While the collection of telemetry data for individual animals has improved dramatically, the methods to properly account for inherent uncertainties (e.g., measurement error, dependence, barriers to movement) have lagged behind. Still, many new statistical approaches have been developed to infer unknown quantities affecting animal movement or predict movement based on telemetry data. Hierarchical statistical models are useful to account for some of the aforementioned uncertainties, as well as provide population-level inference, but they often come with an increased computational burden. For certain types of statistical models, it is straightforward to provide inference if the latent true animal trajectory is known, but challenging otherwise. In these cases, approaches related to multiple imputation have been employed to account for the uncertainty associated with our knowledge of the latent trajectory. Despite the increasing use of imputation approaches for modeling animal movement, the general sensitivity and accuracy of these methods have not been explored in detail. We provide an introduction to animal movement modeling and describe how imputation approaches may be helpful for certain types of models. We also assess the performance of imputation approaches in two simulation studies. Our simulation studies suggest that inference for model parameters directly related to the location of an individual may be more accurate than inference for parameters associated with higher-order processes such as velocity or acceleration. Finally, we apply these methods to analyze a telemetry data set involving northern fur seals (Callorhinus ursinus) in the Bering Sea. Supplementary materials accompanying this paper appear online.
Luo, Chonglin; Tetteh, Paul W; Merz, Patrick R; Dickes, Elke; Abukiwan, Alia; Hotz-Wagenblatt, Agnes; Holland-Cunz, Stefan; Sinnberg, Tobias; Schittek, Birgit; Schadendorf, Dirk; Diederichs, Sven; Eichmüller, Stefan B
2013-03-01
MicroRNAs are small noncoding RNAs that regulate gene expression and have important roles in various types of cancer. Previously, miR-137 was reported to act as a tumor suppressor in different cancers, including malignant melanoma. In this study, we show that low miR-137 expression is correlated with poor survival in stage IV melanoma patients. We identified and validated two genes (c-Met and YB1) as direct targets of miR-137 and confirmed two previously known targets, namely enhancer of zeste homolog 2 (EZH2) and microphthalmia-associated transcription factor (MITF). Functional studies showed that miR-137 suppressed melanoma cell invasion through the downregulation of multiple target genes. The decreased invasion caused by miR-137 overexpression could be phenocopied by small interfering RNA knockdown of EZH2, c-Met, or Y box-binding protein 1 (YB1). Furthermore, miR-137 inhibited melanoma cell migration and proliferation. Finally, miR-137 induced apoptosis in melanoma cell lines and decreased BCL2 levels. In summary, our study confirms that miR-137 acts as a tumor suppressor in malignant melanoma and reveals that miR-137 regulates multiple targets including c-Met, YB1, EZH2, and MITF.
Dore, David D.; Swaminathan, Shailender; Gutman, Roee; Trivedi, Amal N.; Mor, Vincent
2013-01-01
Objective: To compare the assumptions and estimands across three approaches to estimating the effect of erythropoietin-stimulating agents (ESAs) on mortality. Study Design and Setting: Using data from the Renal Management Information System, we conducted two analyses utilizing a change to bundled payment that we hypothesized mimicked random assignment to ESA (pre-post, difference-in-difference, and instrumental variable analyses). A third analysis was based on multiply imputing potential outcomes using propensity scores. Results: There were 311,087 recipients of ESAs and 13,095 non-recipients. In the pre-post comparison, we identified no clear relationship between bundled payment (measured by calendar time) and the incidence of death within six months (risk difference -1.5%; 95% CI -7.0% to 4.0%). In the instrumental variable analysis, the risk of mortality was similar among ESA recipients (risk difference -0.9%; 95% CI -2.1% to 0.3%). In the multiple imputation analysis, we observed a 4.2% (95% CI 3.4% to 4.9%) absolute reduction in mortality risk with use of ESAs, but closer to the null for patients with baseline hematocrit >36%. Conclusion: Methods emanating from different disciplines often rely on different assumptions, but can be informative about a similar causal contrast. The implications of these distinct approaches are discussed. PMID:23849152
Wang, Lan; Zhu, Jiang; Deng, Fei-Yan; Wu, Long-Fei; Mo, Xing-Bo; Zhu, Xiao-Wei; Xia, Wei; Xie, Fang-Fei; He, Pei; Bing, Peng-Fei; Qiu, Ying-Hua; Lin, Xiang; Lu, Xin; Zhang, Lei; Yi, Neng-Jun; Zhang, Yong-Hong; Lei, Shu-Feng
2018-02-01
MicroRNAs (miRNAs) can regulate gene expression through binding to complementary sites in the 3'-untranslated regions of target mRNAs, which leads to correlation in expression between miRNA and mRNA. However, the miRNA-mRNA correlation patterns are complex and remain largely unclear. To establish the global correlation patterns in human peripheral blood mononuclear cells (PBMCs), multiple miRNA-mRNA correlation analyses and expression quantitative trait locus (eQTL) analysis were conducted in this study. We predicted and obtained 861 miRNA-mRNA pairs (65 miRNAs, 412 mRNAs) using multiple bioinformatics programs, and found global negative miRNA-mRNA correlations in PBMCs from all 46 study subjects. Among the 861 pairs of correlations, 19.5% were significant (P < 0.05) and ~70% were negative. The correlation network was complex and highlighted key miRNAs/genes in PBMCs. Some miRNAs, such as hsa-miR-29a and hsa-miR-148a, regulate a cluster of target genes. Some genes, e.g., TNRC6A, are regulated by multiple miRNAs. The identified genes tend to be enriched in molecular functions of DNA and RNA binding, and biological processes such as protein transport, regulation of translation and chromatin modification. The results provide a global view of the miRNA-mRNA expression correlation profile in human PBMCs, which would facilitate in-depth investigation of biological functions of key miRNAs/mRNAs and better understanding of the pathogenesis underlying PBMC-related diseases.
Kessler, Ronald C.; Adler, Lenard A.; Barkley, Russell; Biederman, Joseph; Conners, C. Keith; Faraone, Stephen V.; Greenhill, Laurence L.; Jaeger, Savina; Secnik, Kristina; Spencer, Thomas; Üstün, T. Bedirhan; Zaslavsky, Alan M.
2010-01-01
BACKGROUND Despite growing interest in adult ADHD, little is known about predictors of persistence of childhood cases into adulthood. METHODS A retrospective assessment of childhood ADHD, childhood risk factors, and a screen for adult ADHD were included in a sample of 3197 18–44 year old respondents in the National Comorbidity Survey Replication (NCS-R). Blinded adult ADHD clinical reappraisal interviews were administered to a sub-sample of respondents. Multiple imputation (MI) was used to estimate adult persistence of childhood ADHD. Logistic regression was used to study retrospectively reported childhood predictors of persistence. Potential predictors included socio-demographics, childhood ADHD severity, childhood adversity, traumatic life experiences, and comorbid DSM-IV child-adolescent disorders (anxiety, mood, impulse-control, and substance disorders). RESULTS 36.3% of respondents with retrospectively assessed childhood ADHD were classified by blinded clinical interviews as meeting DSM-IV criteria for current ADHD. Childhood ADHD severity and childhood treatment significantly predicted persistence. Controlling for severity and excluding treatment, none of the other variables significantly predicted persistence even though they were significantly associated with childhood ADHD. CONCLUSIONS No modifiable risk factors were found for adult persistence of ADHD. Further research, ideally based on prospective general population samples, is needed to search for modifiable determinants of adult persistence of ADHD. PMID:15950019
Kessler, Ronald C.; Adler, Lenard; Barkley, Russell; Biederman, Joseph; Conners, C. Keith; Demler, Olga; Faraone, Stephen V.; Greenhill, Laurence L.; Howes, Mary J.; Secnik, Kristina; Spencer, Thomas; Ustun, T. Bedirhan; Walters, Ellen E.; Zaslavsky, Alan M.
2010-01-01
OBJECTIVE Despite growing interest in adult attention-deficit/hyperactivity disorder (ADHD), little is known about prevalence or correlates. METHODS A screen for adult ADHD was included in a probability sub-sample (n = 3199) of 18–44 year old respondents in the National Comorbidity Survey Replication (NCS-R), a nationally representative household survey that used a lay-administered diagnostic interview to assess a wide range of DSM-IV disorders. Blinded clinical follow-up interviews of adult ADHD were carried out with 154 NCS-R respondents, over-sampling those with a positive screen. Multiple imputation (MI) was used to estimate prevalence and correlates of clinician-assessed adult ADHD. RESULTS Estimated prevalence of current adult ADHD is 4.4%. Significant correlates include being male, previously married, unemployed, and Non-Hispanic White. Adult ADHD is highly comorbid with many other NCS-R/DSM-IV disorders and is associated with substantial role impairment. The majority of cases are untreated, although many obtain treatment for other comorbid mental and substance disorders. CONCLUSIONS Efforts are needed to increase the detection and treatment of adult ADHD. Research is needed to determine whether effective treatment would reduce the onset, persistence, and severity of disorders that co-occur with adult ADHD. PMID:16585449
Wang, Xinyu; Ren, Yanli; Wang, Zhiqiong; Xiong, Xiangyu; Han, Sichong; Pan, Wenting; Chen, Hongwei; Zhou, Liqing; Zhou, Changchun; Yuan, Qipeng; Yang, Ming
2015-12-21
5S rRNA plays an important part in ribosome biology and is overexpressed in multiple cancers. In this study, we found that 5S rRNA is a direct target of miR-150 and miR-383 in esophageal squamous cell carcinoma (ESCC). Overexpression of miR-150 and miR-383 inhibited ESCC cell proliferation in vitro and in vivo. Moreover, 5S rRNA silencing by miR-150 and miR-383 might intensify the rpL11-c-Myc interaction, which attenuated the role of c-Myc as an oncogenic transcriptional factor and dysregulation of multiple c-Myc target genes. Taken together, our results highlight the involvement of miRNAs in ribosomal regulation during tumorigenesis. Copyright © 2015 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved.
Lu, Cecilia S; Zhai, Bo; Mauss, Alex; Landgraf, Matthias; Gygi, Stephen; Van Vactor, David
2014-09-26
Neuronal connectivity and specificity rely upon precise coordinated deployment of multiple cell-surface and secreted molecules. MicroRNAs have tremendous potential for shaping neural circuitry by fine-tuning the spatio-temporal expression of key synaptic effector molecules. The highly conserved microRNA miR-8 is required during late stages of neuromuscular synapse development in Drosophila. However, its role in initial synapse formation was previously unknown. Detailed analysis of synaptogenesis in this system now reveals that miR-8 is required at the earliest stages of muscle target contact by RP3 motor axons. We find that the localization of multiple synaptic cell adhesion molecules (CAMs) is dependent on the expression of miR-8, suggesting that miR-8 regulates the initial assembly of synaptic sites. Using stable isotope labelling in vivo and comparative mass spectrometry, we find that miR-8 is required for normal expression of multiple proteins, including the CAMs Fasciclin III (FasIII) and Neuroglian (Nrg). Genetic analysis suggests that Nrg and FasIII collaborate downstream of miR-8 to promote accurate target recognition. Unlike the function of miR-8 at mature larval neuromuscular junctions, at the embryonic stage we find that miR-8 controls key effectors on both sides of the synapse. MiR-8 controls multiple stages of synapse formation through the coordinate regulation of both pre- and postsynaptic cell adhesion proteins.
Lu, Cecilia S.; Zhai, Bo; Mauss, Alex; Landgraf, Matthias; Gygi, Stephen; Van Vactor, David
2014-01-01
Neuronal connectivity and specificity rely upon precise coordinated deployment of multiple cell-surface and secreted molecules. MicroRNAs have tremendous potential for shaping neural circuitry by fine-tuning the spatio-temporal expression of key synaptic effector molecules. The highly conserved microRNA miR-8 is required during late stages of neuromuscular synapse development in Drosophila. However, its role in initial synapse formation was previously unknown. Detailed analysis of synaptogenesis in this system now reveals that miR-8 is required at the earliest stages of muscle target contact by RP3 motor axons. We find that the localization of multiple synaptic cell adhesion molecules (CAMs) is dependent on the expression of miR-8, suggesting that miR-8 regulates the initial assembly of synaptic sites. Using stable isotope labelling in vivo and comparative mass spectrometry, we find that miR-8 is required for normal expression of multiple proteins, including the CAMs Fasciclin III (FasIII) and Neuroglian (Nrg). Genetic analysis suggests that Nrg and FasIII collaborate downstream of miR-8 to promote accurate target recognition. Unlike the function of miR-8 at mature larval neuromuscular junctions, at the embryonic stage we find that miR-8 controls key effectors on both sides of the synapse. MiR-8 controls multiple stages of synapse formation through the coordinate regulation of both pre- and postsynaptic cell adhesion proteins. PMID:25135978
Sehgal, Muhammad Shoaib B; Gondal, Iqbal; Dooley, Laurence S
2005-05-15
Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods. The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values compared with other methods for both series types of data, for the same order of computational complexity.
A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm. The CMVE software is available upon request from the authors.
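The NRMS error used to score the imputation methods above can be illustrated as follows. The abstract does not state the exact normalisation, so this sketch assumes one common form: the root mean square of the imputation error divided by the root mean square of the true values. The function name and data are hypothetical.

```python
import math


def nrms_error(true_values, imputed_values):
    """RMS of (true - imputed) divided by RMS of the true values (assumed form)."""
    if not true_values or len(true_values) != len(imputed_values):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_values)
    rms_diff = math.sqrt(
        sum((t - e) ** 2 for t, e in zip(true_values, imputed_values)) / n
    )
    rms_true = math.sqrt(sum(t ** 2 for t in true_values) / n)
    if rms_true == 0:
        raise ValueError("true values are all zero; NRMS is undefined")
    return rms_diff / rms_true


# Hypothetical expression values that were masked and then re-estimated
truth = [3.0, 4.0]
estimate = [3.0, 3.0]
print(nrms_error(truth, estimate))
```

A value of 0 means the masked entries were recovered exactly; larger values mean larger errors relative to the scale of the data.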
Sulovari, Arvis; Li, Dawei
2014-07-19
Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires the same allele definition (nomenclature) and genome build among individual studies. Similarly, imputation, commonly used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in use of different genome builds and allele definitions. Incorrect assumptions of identical allele definition among combined GWAS lead to a large portion of discarded genotypes or incorrect association findings. There is no published tool that predicts and converts among all major allele definitions. In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease of imputation quality (even significantly higher when compared to the imputation with singletons in the reference), especially for rare SNPs.
GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict, and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP-arrays or deep-sequencing, particularly for data from the dbGaP and other public databases. http://www.uvm.edu/genomics/software/gact.
MicroRNA-133 mediates cardiac diseases: Mechanisms and clinical implications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Yi; Liang, Yan; Zhang, Jin-fang
MicroRNAs (miRNAs) belong to the family of small non-coding RNAs that mediate gene expression by post-transcriptional regulation. Increasing evidence has demonstrated that miR-133 is enriched in muscle tissues and myogenic cells, and its aberrant expression could induce the occurrence and development of cardiac disorders, such as cardiac hypertrophy, heart failure, etc. In this review, we summarized the regulatory roles of miR-133 in cardiac disorders and the underlying mechanisms, which suggest that miR-133 may be a potential diagnostic and therapeutic tool for cardiac disorders. - Highlights: • miR-218 is frequently downregulated in multiple cancers. • miR-218 plays pivotal roles in carcinogenesis. • miR-218 mediates proliferation, apoptosis, metastasis, invasion, etc. • miR-218 mediates tumorigenesis and metastasis via multiple pathways.
Molgenis-impute: imputation pipeline in a box.
Kanterakis, Alexandros; Deelen, Patrick; van Dijk, Freerk; Byelas, Heorhiy; Dijkstra, Martijn; Swertz, Morris A
2015-08-19
Genotype imputation is an important procedure in current genomic analysis such as genome-wide association studies, meta-analyses and fine mapping. Although high quality tools are available that perform the steps of this process, considerable effort and expertise is required to set up and run a best practice imputation pipeline, particularly for larger genotype datasets, where imputation has to scale out in parallel on computer clusters. Here we present MOLGENIS-impute, an 'imputation in a box' solution that seamlessly and transparently automates the set up and running of all the steps of the imputation process. These steps include genome build liftover (liftovering), genotype phasing with SHAPEIT2, quality control, sample and chromosomal chunking/merging, and imputation with IMPUTE2. MOLGENIS-impute builds on MOLGENIS-compute, a simple pipeline management platform for submission and monitoring of bioinformatics tasks in High Performance Computing (HPC) environments like local/cloud servers, clusters and grids. All the required tools, data and scripts are downloaded and installed in a single step. Researchers with diverse backgrounds and expertise have tested MOLGENIS-impute on different locations and imputed over 30,000 samples so far using the 1,000 Genomes Project and new Genome of the Netherlands data as the imputation reference. The tests have been performed on PBS/SGE clusters, cloud VMs and in a grid HPC environment. MOLGENIS-impute gives priority to the ease of setting up, configuring and running an imputation. It has minimal dependencies and wraps the pipeline in a simple command line interface, without sacrificing flexibility to adapt or limiting the options of underlying imputation tools. It does not require knowledge of a workflow system or programming, and is targeted at researchers who just want to apply best practices in imputation via simple commands. 
It is built on the MOLGENIS compute workflow framework to enable customization with additional computational steps or it can be included in other bioinformatics pipelines. It is available as open source from: https://github.com/molgenis/molgenis-imputation.
Southam, Lorraine; Panoutsopoulou, Kalliope; Rayner, N William; Chapman, Kay; Durrant, Caroline; Ferreira, Teresa; Arden, Nigel; Carr, Andrew; Deloukas, Panos; Doherty, Michael; Loughlin, John; McCaskie, Andrew; Ollier, William E R; Ralston, Stuart; Spector, Timothy D; Valdes, Ana M; Wallis, Gillian A; Wilkinson, J Mark; Marchini, Jonathan; Zeggini, Eleftheria
2011-05-01
Imputation is an extremely valuable tool in conducting and synthesising genome-wide association studies (GWASs). Directly typed SNP quality control (QC) is thought to affect imputation quality. It is, therefore, common practise to use quality-controlled (QCed) data as an input for imputing genotypes. This study aims to determine the effect of commonly applied QC steps on imputation outcomes. We performed several iterations of imputing SNPs across chromosome 22 in a dataset consisting of 3177 samples with Illumina 610 k (Illumina, San Diego, CA, USA) GWAS data, applying different QC steps each time. The imputed genotypes were compared with the directly typed genotypes. In addition, we investigated the correlation between alternatively QCed data. We also applied a series of post-imputation QC steps balancing elimination of poorly imputed SNPs and information loss. We found that the difference between the unQCed data and the fully QCed data on imputation outcome was minimal. Our study shows that imputation of common variants is generally very accurate and robust to GWAS QC, which is not a major factor affecting imputation outcome. A minority of common-frequency SNPs with particular properties cannot be accurately imputed regardless of QC stringency. These findings may not generalise to the imputation of low frequency and rare variants.
Ahlbrecht, Jonas; Martino, Filippo; Pul, Refik; Skripuletz, Thomas; Sühs, Kurt-Wolfram; Schauerte, Celina; Yildiz, Özlem; Trebst, Corinna; Tasto, Lars; Thum, Sabrina; Pfanne, Angelika; Roesler, Romy; Lauda, Florian; Hecker, Michael; Zettl, Uwe K; Tumani, Hayrettin; Thum, Thomas; Stangel, Martin
2016-08-01
MiRNA-181c, miRNA-633 and miRNA-922 have been reported to be deregulated in multiple sclerosis. To investigate the association between miRNA-181c, miRNA-633 and miRNA-922 and conversion from clinically isolated syndrome (CIS) to relapsing-remitting multiple sclerosis (RRMS); and to compare microRNAs in cerebrospinal fluid (CSF) and serum with regard to dysfunction of the blood-CSF barrier. CSF and serum miRNA-181c, miRNA-633 and miRNA-922 were retrospectively determined by quantitative real-time polymerase chain reaction in CIS patients with (CIS-RRMS) and without (CIS-CIS) conversion to RRMS within 1 year. Thirty of 58 CIS patients developed RRMS. Cerebrospinal fluid miRNA-922, serum miRNA-922 and cerebrospinal fluid miRNA-181c were significantly higher in CIS-RRMS compared to CIS-CIS (P=0.027, P=0.048, P=0.029, respectively). High levels of cerebrospinal fluid miRNA-181c were independently associated with conversion from CIS to RRMS in multivariate Cox regression analysis (hazard ratio 2.99, 95% confidence interval 1.41-6.34, P=0.005). A combination of high cerebrospinal fluid miRNA-181c, younger age and more than nine lesions on magnetic resonance imaging showed the highest specificity (96%) and positive predictive value (94%) for conversion from CIS to RRMS. MiRNA-181c was higher in serum than in cerebrospinal fluid (P <0.001), while miRNA-633 and miRNA-922 were no different in cerebrospinal fluid and serum. Cerebrospinal fluid/serum albumin quotients did not correlate with microRNAs in cerebrospinal fluid (all P>0.711). Cerebrospinal fluid miRNA-181c might serve as a biomarker for early conversion to RRMS. Moreover, our data suggest an intrathecal origin of microRNAs detected in the cerebrospinal fluid. © The Author(s), 2015.
Exploring the Application of Multiple Intelligences Theory to Career Counseling
ERIC Educational Resources Information Center
Shearer, C. Branton; Luzzo, Darrell Anthony
2009-01-01
This article demonstrates the practical value of applying H. Gardner's (1993) theory of multiple intelligences (MI) to the practice of career counseling. An overview of H. Gardner's MI theory is presented, and the ways in which educational and vocational planning can be augmented by the integration of MI theory in career counseling contexts are…
ERIC Educational Resources Information Center
Hanafin, Joan
2014-01-01
This paper presents findings from an action research project that investigated the application of Multiple Intelligences (MI) theory in classrooms and schools. It shows how MI theory was used in the project as a basis for suggestions to generate classroom practices; how participating teachers evaluated the project; and how teachers responded to…
Implementing Multiple Intelligences: The New City School Experience. Fastback 407.
ERIC Educational Resources Information Center
Hoerr, Thomas R.
This brief reviews the concept of multiple intelligences (MI) and discusses the implementation of the theory of MI in the New City School, an independent school in St. Louis (Missouri). The theory of MI, as developed by Howard Gardner, says that there are at least seven different intelligences: linguistic, logical, musical, bodily-kinesthetic,…
Circular RNA expression in basal cell carcinoma.
Sand, Michael; Bechara, Falk G; Sand, Daniel; Gambichler, Thilo; Hahn, Stephan A; Bromba, Michael; Stockfleth, Eggert; Hessam, Schapoor
2016-05-01
Circular RNAs (circRNAs) are non-protein-coding RNAs consisting of a circular loop with multiple miRNA binding sites, called miRNA response elements (MREs), that function as miRNA sponges. This study was performed to identify differentially expressed circRNAs and their MREs in basal cell carcinoma (BCC). Microarray circRNA expression profiles were acquired from BCC and control samples, followed by qRT-PCR validation. Bioinformatical target prediction revealed multiple MREs. Sequence analysis was performed concerning MRE interaction potential with the BCC miRNome. We identified 23 upregulated and 48 downregulated circRNAs with 354 miRNA response elements capable of sequestering miRNA target sequences of the BCC miRNome. The present study describes a variety of circRNAs that are potentially involved in the molecular pathogenesis of BCC.
Podshivalova, Katie; Salomon, Daniel R.
2014-01-01
MicroRNAs (miRNA) are a class of small non-coding RNAs that constitute an essential and evolutionarily conserved mechanism for post-transcriptional gene regulation. Multiple miRNAs have been described to play key roles in T lymphocyte development, differentiation and function. In this review we highlight the current literature regarding the differential expression of miRNAs in various models of mouse and human T cell biology and emphasize mechanistic understandings of miRNA regulation of thymocyte development, T cell activation, and differentiation into effector and memory subsets. We describe the participation of miRNAs in complex regulatory circuits shaping T cell proteomes in a context-dependent manner. It is striking that some miRNAs regulate multiple processes, while others only appear in limited functional contexts. It is also evident that the expression and function of specific miRNAs can differ between mouse and human systems. Ultimately, it is not always correct to simplify the complex events of T cell biology into a model driven by only one or two master regulator miRNAs. In reality, T cell activation and differentiation involves the expression of multiple miRNAs with many mRNA targets and thus, the true extent of miRNA regulation of T cell biology is likely far more vast than currently appreciated. PMID:24099302
Xu, Peng; Wang, Junhua; Sun, Bo; Xiao, Zhongdang
2018-06-15
Investigating the potential biological function of differentially expressed genes by integrating multiple omics data, including miRNA and mRNA expression profiles, is an active research topic. However, how to evaluate the repression effect on target genes by integrating miRNA and mRNA expression profiles is not fully solved. In this study, we provide an analysis method that integrates miRNA and mRNA expression data simultaneously. Differential analysis was based on the repression score, and significantly repressed mRNAs were then screened out with DEGseq. Pathway analysis of the significantly repressed mRNAs shows that multiple pathways, such as the MAPK signaling pathway and the TGF-beta signaling pathway, may be related to colorectal cancer (CRC). Focusing on the MAPK signaling pathway, a miRNA-mRNA network centered on the cell fate genes was constructed. Finally, the miRNA-mRNA pairs potentially important in CRC carcinogenesis were screened out and scored by impact index. Copyright © 2018 Elsevier B.V. All rights reserved.
Larmer, S G; Sargolzaei, M; Schenkel, F S
2014-05-01
Genomic selection requires a large reference population to accurately estimate single nucleotide polymorphism (SNP) effects. In some Canadian dairy breeds, the available reference populations are not large enough for accurate estimation of SNP effects for traits of interest. If marker phase is highly consistent across multiple breeds, it is theoretically possible to increase the accuracy of genomic prediction for one or all breeds by pooling several breeds into a common reference population. This study investigated the extent of linkage disequilibrium (LD) in 5 major dairy breeds using a 50,000 (50K) SNP panel and 3 of the same breeds using the 777,000 (777K) SNP panel. Correlation of pair-wise SNP phase was also investigated on both panels. The level of LD was measured using the squared correlation of alleles at 2 loci (r(2)), and the consistency of SNP gametic phases was correlated using the signed square root of these values. Because of the high cost of the 777K panel, the accuracy of imputation from lower density marker panels [6,000 (6K) or 50K] was examined both within breed and using a multi-breed reference population in Holstein, Ayrshire, and Guernsey. Imputation was carried out using FImpute V2.2 and Beagle 3.3.2 software. Imputation accuracies were then calculated as both the proportion of correct SNP filled in (concordance rate) and allelic R(2). Computation time was also explored to determine the efficiency of the different algorithms for imputation. Analysis showed that LD values >0.2 were found in all breeds at distances at or shorter than the average adjacent pair-wise distance between SNP on the 50K panel. Correlations of r-values, however, did not reach high levels (<0.9) at these distances. High correlation values of SNP phase between breeds were observed (>0.94) when the average pair-wise distances using the 777K SNP panel were examined. 
High concordance rate (0.968-0.995) and allelic R(2) (0.946-0.991) were found for all breeds when imputation was carried out with FImpute from 50K to 777K. Imputation accuracy for Guernsey and Ayrshire was slightly lower when using the imputation method in Beagle. Computing time was significantly greater when using Beagle software, with all comparable procedures being 9 to 13 times less efficient, in terms of time, compared with FImpute. These findings suggest that use of a multi-breed reference population might increase prediction accuracy using the 777K SNP panel and that 777K genotypes can be efficiently and effectively imputed using the lower density 50K SNP panel. Copyright © 2014 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
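The two accuracy measures named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes genotypes are coded as counts of one allele (0, 1, 2) and that allelic R(2) means the squared Pearson correlation between true and imputed allele dosages. The function names are hypothetical.

```python
def concordance_rate(true_genotypes, imputed_genotypes):
    """Proportion of imputed genotype calls that equal the true calls."""
    if len(true_genotypes) != len(imputed_genotypes) or not true_genotypes:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


def allelic_r2(true_dosages, imputed_dosages):
    """Squared Pearson correlation between true and imputed allele dosages.

    Assumption: this is the definition of allelic R(2) intended here.
    Returns NaN when either input has zero variance.
    """
    if len(true_dosages) != len(imputed_dosages) or not true_dosages:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_dosages)
    mean_t = sum(true_dosages) / n
    mean_i = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(true_dosages, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_dosages)
    var_i = sum((i - mean_i) ** 2 for i in imputed_dosages)
    if var_t == 0 or var_i == 0:
        return float("nan")
    return cov * cov / (var_t * var_i)


# Toy data (invented for illustration): one of eight calls is wrong.
true_calls = [0, 1, 2, 2, 1, 0, 1, 2]
imputed_calls = [0, 1, 2, 1, 1, 0, 1, 2]
rate = concordance_rate(true_calls, imputed_calls)  # 7 of 8 calls match
r2 = allelic_r2(true_calls, imputed_calls)
```

Concordance rewards any exact match equally, whereas a correlation-based measure is less flattered by common alleles that are easy to guess, which may be why both are reported.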
Integrative missing value estimation for microarray data.
Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine
2006-10-12
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in Stanford Microarray Database contain fewer than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets into consideration. To determine whether the given reference data sets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm, by up to 15% in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
miRegulome: a knowledge-base of miRNA regulomics and analysis.
Barh, Debmalya; Kamapantula, Bhanu; Jain, Neha; Nalluri, Joseph; Bhattacharya, Antaripa; Juneja, Lucky; Barve, Neha; Tiwari, Sandeep; Miyoshi, Anderson; Azevedo, Vasco; Blum, Kenneth; Kumar, Anil; Silva, Artur; Ghosh, Preetam
2015-08-05
miRNAs regulate post transcriptional gene expression by targeting multiple mRNAs and hence can modulate multiple signalling pathways, biological processes, and patho-physiologies. Therefore, understanding of miRNA regulatory networks is essential in order to modulate the functions of a miRNA. The focus of several existing databases is to provide information on specific aspects of miRNA regulation. However, an integrated resource on the miRNA regulome is currently not available to facilitate the exploration and understanding of miRNA regulomics. miRegulome attempts to bridge this gap. The current version of miRegulome v1.0 provides details on the entire regulatory modules of miRNAs altered in response to chemical treatments and transcription factors, based on validated data manually curated from published literature. Modules of miRegulome (upstream regulators, downstream targets, miRNA regulated pathways, functions, diseases, etc) are hyperlinked to an appropriate external resource and are displayed visually to provide a comprehensive understanding. Four analysis tools are incorporated to identify relationships among different modules based on user specified datasets. miRegulome and its tools are helpful in understanding the biology of miRNAs and will also facilitate the discovery of biomarkers and therapeutics. With added features in upcoming releases, miRegulome will be an essential resource to the scientific community. http://bnet.egr.vcu.edu/miRegulome.
ERIC Educational Resources Information Center
Kallenbach, Silja; Viens, Julie
The Adult Multiple Intelligences (AMI) Study investigated how multiple intelligences (MI) theory can support instruction and assessment in adult literacy education across different adult learning contexts. Two interwoven qualitative research projects focused on applying MI theory in practice. One involved 10 teacher-conducted and AMI…
Xu, Fang; Dong, Haifeng; Cao, Yu; Lu, Huiting; Meng, Xiangdan; Dai, Wenhao; Zhang, Xueji; Al-Ghanim, Khalid Abdullah; Mahboob, Shahid
2016-12-14
A highly sensitive method for multiplexed microRNA (miRNA) detection is proposed that combines three-dimensional (3D) DNA tetrahedron-structured probes (TSPs), which increase probe reactivity and accessibility, with duplex-specific nuclease (DSN) for signal amplification. Briefly, 3D DNA TSPs labeled with different fluorescent dyes for specific target miRNA recognition were modified on a gold nanoparticle (GNP) surface to increase the reactivity and accessibility. Upon hybridization with a specific target, the TSPs immobilized on the GNP surface hybridized with the corresponding target miRNA to form DNA-RNA heteroduplexes, and the DSN can recognize the formed DNA-RNA heteroduplexes to hydrolyze the DNA in the heteroduplexes to produce a specific fluorescent signal corresponding to a specific miRNA, while the released target miRNA strands can initiate another cycle, resulting in a significant signal amplification for sensitive miRNA detection. Different targets can produce different fluorescent signals, leading to the development of a sensitive detection for multiple miRNAs in a homogeneous solution. Under optimized conditions, the proposed assay can simultaneously detect three different miRNAs in a homogeneous solution with a logarithmic linear range spanning 5 magnitudes (10⁻¹² to 10⁻¹⁶) and achieving a limit of detection down to attomolar concentrations. Meanwhile, the proposed miRNA assay exhibited the capability of discriminating single bases (three bases mismatched miRNAs) and showed good eligibility in the analysis of miRNAs extracted from cell lysates and miRNAs in cell incubation media, which indicates its potential use in biomedical research and clinical analysis.
32 CFR 776.29 - Imputed disqualification: General rule.
Code of Federal Regulations, 2012 CFR
2012-07-01
... their federal, state, and local bar rules governing the representation of multiple or adverse clients within the same office before such representation is initiated, as such representation may expose them to... military (or Government) service may require representation of opposing sides by covered USG attorneys...
32 CFR 776.29 - Imputed disqualification: General rule.
Code of Federal Regulations, 2014 CFR
2014-07-01
... their federal, state, and local bar rules governing the representation of multiple or adverse clients within the same office before such representation is initiated, as such representation may expose them to... military (or Government) service may require representation of opposing sides by covered USG attorneys...
32 CFR 776.29 - Imputed disqualification: General rule.
Code of Federal Regulations, 2013 CFR
2013-07-01
... their federal, state, and local bar rules governing the representation of multiple or adverse clients within the same office before such representation is initiated, as such representation may expose them to... military (or Government) service may require representation of opposing sides by covered USG attorneys...
CGDSNPdb: a database resource for error-checked and imputed mouse SNPs.
Hutchins, Lucie N; Ding, Yueming; Szatkiewicz, Jin P; Von Smith, Randy; Yang, Hyuna; de Villena, Fernando Pardo-Manuel; Churchill, Gary A; Graber, Joel H
2010-07-06
The Center for Genome Dynamics Single Nucleotide Polymorphism Database (CGDSNPdb) is an open-source value-added database with more than nine million mouse single nucleotide polymorphisms (SNPs), drawn from multiple sources, with genotypes assigned to multiple inbred strains of laboratory mice. All SNPs are checked for accuracy and annotated for properties specific to the SNP as well as those implied by changes to overlapping protein-coding genes. CGDSNPdb serves as the primary interface to two unique data sets, the 'imputed genotype resource' in which a Hidden Markov Model was used to assess local haplotypes and the most probable base assignment at several million genomic loci in tens of strains of mice, and the Affymetrix Mouse Diversity Genotyping Array, a high density microarray with over 600,000 SNPs and over 900,000 invariant genomic probes. CGDSNPdb is accessible online through either a web-based query tool or a MySQL public login. Database URL: http://cgd.jax.org/cgdsnpdb/
Ascertainment bias from imputation methods evaluation in wheat.
Brandariz, Sofía P; González Reymúndez, Agustín; Lado, Bettina; Malosetti, Marcos; Garcia, Antonio Augusto Franco; Quincke, Martín; von Zitzewitz, Jarislav; Castro, Marina; Matus, Iván; Del Pozo, Alejandro; Castro, Ariel J; Gutiérrez, Lucía
2016-10-04
Whole-genome genotyping techniques like Genotyping-by-sequencing (GBS) are being used for genetic studies such as Genome-Wide Association (GWAS) and Genomewide Selection (GS), where different strategies for imputation have been developed. Nevertheless, imputation error may lead to poor performance (i.e. smaller power or higher false positive rate) when complete data is not required as it is for GWAS, and each marker is taken at a time. The aim of this study was to compare the performance of GWAS analysis for Quantitative Trait Loci (QTL) of major and minor effect using different imputation methods when no reference panel is available in a wheat GBS panel. In this study, we compared the power and false positive rate of dissecting quantitative traits for imputed and not-imputed marker score matrices in: (1) a complete molecular marker barley panel array, and (2) a GBS wheat panel with missing data. We found that there is an ascertainment bias in imputation method comparisons. Simulating over a complete matrix and creating missing data at random proved that imputation methods have a poorer performance. Furthermore, we found that when QTL were simulated with imputed data, the imputation methods performed better than the not-imputed ones. On the other hand, when QTL were simulated with not-imputed data, the not-imputed method and one of the imputation methods performed better for dissecting quantitative traits. Moreover, larger differences between imputation methods were detected for QTL of major effect than QTL of minor effect. We also compared the different marker score matrices for GWAS analysis in a real wheat phenotype dataset, and we found minimal differences indicating that imputation did not improve the GWAS performance when a reference panel was not available. Poorer performance was found in GWAS analysis when an imputed marker score matrix was used, no reference panel is available, in a wheat GBS panel.
Sakai, Atsushi; Saitow, Fumihito; Maruyama, Motoyo; Miyake, Noriko; Miyake, Koichi; Shimada, Takashi; Okada, Takashi; Suzuki, Hidenori
2017-01-01
miR-17-92 is a microRNA cluster with six distinct members. Here, we show that the miR-17-92 cluster and its individual members modulate chronic neuropathic pain. All cluster members are persistently upregulated in primary sensory neurons after nerve injury. Overexpression of miR-18a, miR-19a, miR-19b and miR-92a cluster members elicits mechanical allodynia in rats, while their blockade alleviates mechanical allodynia in a rat model of neuropathic pain. Plausible targets for the miR-17-92 cluster include genes encoding numerous voltage-gated potassium channels and their modulatory subunits. Single-cell analysis reveals extensive co-expression of miR-17-92 cluster and its predicted targets in primary sensory neurons. miR-17-92 downregulates the expression of potassium channels, and reduced outward potassium currents, in particular A-type currents. Combined application of potassium channel modulators synergistically alleviates mechanical allodynia induced by nerve injury or miR-17-92 overexpression. miR-17-92 cluster appears to cooperatively regulate the function of multiple voltage-gated potassium channel subunits, perpetuating mechanical allodynia. PMID:28677679
de Vries, Paul S; Sabater-Lleal, Maria; Chasman, Daniel I; Trompet, Stella; Ahluwalia, Tarunveer S; Teumer, Alexander; Kleber, Marcus E; Chen, Ming-Huei; Wang, Jie Jin; Attia, John R; Marioni, Riccardo E; Steri, Maristella; Weng, Lu-Chen; Pool, Rene; Grossmann, Vera; Brody, Jennifer A; Venturini, Cristina; Tanaka, Toshiko; Rose, Lynda M; Oldmeadow, Christopher; Mazur, Johanna; Basu, Saonli; Frånberg, Mattias; Yang, Qiong; Ligthart, Symen; Hottenga, Jouke J; Rumley, Ann; Mulas, Antonella; de Craen, Anton J M; Grotevendt, Anne; Taylor, Kent D; Delgado, Graciela E; Kifley, Annette; Lopez, Lorna M; Berentzen, Tina L; Mangino, Massimo; Bandinelli, Stefania; Morrison, Alanna C; Hamsten, Anders; Tofler, Geoffrey; de Maat, Moniek P M; Draisma, Harmen H M; Lowe, Gordon D; Zoledziewska, Magdalena; Sattar, Naveed; Lackner, Karl J; Völker, Uwe; McKnight, Barbara; Huang, Jie; Holliday, Elizabeth G; McEvoy, Mark A; Starr, John M; Hysi, Pirro G; Hernandez, Dena G; Guan, Weihua; Rivadeneira, Fernando; McArdle, Wendy L; Slagboom, P Eline; Zeller, Tanja; Psaty, Bruce M; Uitterlinden, André G; de Geus, Eco J C; Stott, David J; Binder, Harald; Hofman, Albert; Franco, Oscar H; Rotter, Jerome I; Ferrucci, Luigi; Spector, Tim D; Deary, Ian J; März, Winfried; Greinacher, Andreas; Wild, Philipp S; Cucca, Francesco; Boomsma, Dorret I; Watkins, Hugh; Tang, Weihong; Ridker, Paul M; Jukema, Jan W; Scott, Rodney J; Mitchell, Paul; Hansen, Torben; O'Donnell, Christopher J; Smith, Nicholas L; Strachan, David P; Dehghan, Abbas
2017-01-01
An increasing number of genome-wide association (GWA) studies are now using the higher resolution 1000 Genomes Project reference panel (1000G) for imputation, with the expectation that 1000G imputation will lead to the discovery of additional associated loci when compared to HapMap imputation. In order to assess the improvement of 1000G over HapMap imputation in identifying associated loci, we compared the results of GWA studies of circulating fibrinogen based on the two reference panels. Using both HapMap and 1000G imputation we performed a meta-analysis of 22 studies comprising the same 91,953 individuals. We identified six additional signals using 1000G imputation, while 29 loci were associated using both HapMap and 1000G imputation. One locus identified using HapMap imputation was not significant using 1000G imputation. The genome-wide significance threshold of 5×10⁻⁸ is based on the number of independent statistical tests using HapMap imputation, and 1000G imputation may lead to further independent tests that should be corrected for. When using a stricter Bonferroni correction for the 1000G GWA study (P-value < 2.5×10⁻⁸), the number of loci significant only using HapMap imputation increased to 4 while the number of loci significant only using 1000G decreased to 5. In conclusion, 1000G imputation enabled the identification of 20% more loci than HapMap imputation, although the advantage of 1000G imputation became less clear when a stricter Bonferroni correction was used. More generally, our results provide insights that are applicable to the implementation of other dense reference panels that are under development.
de Vries, Paul S.; Sabater-Lleal, Maria; Chasman, Daniel I.; Trompet, Stella; Kleber, Marcus E.; Chen, Ming-Huei; Wang, Jie Jin; Attia, John R.; Marioni, Riccardo E.; Weng, Lu-Chen; Grossmann, Vera; Brody, Jennifer A.; Venturini, Cristina; Tanaka, Toshiko; Rose, Lynda M.; Oldmeadow, Christopher; Mazur, Johanna; Basu, Saonli; Yang, Qiong; Ligthart, Symen; Hottenga, Jouke J.; Rumley, Ann; Mulas, Antonella; de Craen, Anton J. M.; Grotevendt, Anne; Taylor, Kent D.; Delgado, Graciela E.; Kifley, Annette; Lopez, Lorna M.; Berentzen, Tina L.; Mangino, Massimo; Bandinelli, Stefania; Morrison, Alanna C.; Hamsten, Anders; Tofler, Geoffrey; de Maat, Moniek P. M.; Draisma, Harmen H. M.; Lowe, Gordon D.; Zoledziewska, Magdalena; Sattar, Naveed; Lackner, Karl J.; Völker, Uwe; McKnight, Barbara; Huang, Jie; Holliday, Elizabeth G.; McEvoy, Mark A.; Starr, John M.; Hysi, Pirro G.; Hernandez, Dena G.; Guan, Weihua; Rivadeneira, Fernando; McArdle, Wendy L.; Slagboom, P. Eline; Zeller, Tanja; Psaty, Bruce M.; Uitterlinden, André G.; de Geus, Eco J. C.; Stott, David J.; Binder, Harald; Hofman, Albert; Franco, Oscar H.; Rotter, Jerome I.; Ferrucci, Luigi; Spector, Tim D.; Deary, Ian J.; März, Winfried; Greinacher, Andreas; Wild, Philipp S.; Cucca, Francesco; Boomsma, Dorret I.; Watkins, Hugh; Tang, Weihong; Ridker, Paul M.; Jukema, Jan W.; Scott, Rodney J.; Mitchell, Paul; Hansen, Torben; O'Donnell, Christopher J.; Smith, Nicholas L.; Strachan, David P.
2017-01-01
An increasing number of genome-wide association (GWA) studies are now using the higher resolution 1000 Genomes Project reference panel (1000G) for imputation, with the expectation that 1000G imputation will lead to the discovery of additional associated loci when compared to HapMap imputation. In order to assess the improvement of 1000G over HapMap imputation in identifying associated loci, we compared the results of GWA studies of circulating fibrinogen based on the two reference panels. Using both HapMap and 1000G imputation we performed a meta-analysis of 22 studies comprising the same 91,953 individuals. We identified six additional signals using 1000G imputation, while 29 loci were associated using both HapMap and 1000G imputation. One locus identified using HapMap imputation was not significant using 1000G imputation. The genome-wide significance threshold of 5×10⁻⁸ is based on the number of independent statistical tests using HapMap imputation, and 1000G imputation may lead to further independent tests that should be corrected for. When using a stricter Bonferroni correction for the 1000G GWA study (P-value < 2.5×10⁻⁸), the number of loci significant only using HapMap imputation increased to 4 while the number of loci significant only using 1000G decreased to 5. In conclusion, 1000G imputation enabled the identification of 20% more loci than HapMap imputation, although the advantage of 1000G imputation became less clear when a stricter Bonferroni correction was used. More generally, our results provide insights that are applicable to the implementation of other dense reference panels that are under development. PMID:28107422
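The two significance thresholds quoted in this abstract are consistent with a plain Bonferroni correction. The sketch below is illustrative only: it assumes a family-wise error rate of 0.05, and the test counts of one and two million are back-calculated from the thresholds rather than taken from the paper.

```python
def bonferroni_threshold(alpha, n_independent_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_independent_tests <= 0:
        raise ValueError("n_independent_tests must be positive")
    return alpha / n_independent_tests


ALPHA = 0.05  # assumed family-wise error rate

# Assumed test counts, chosen to match the thresholds in the abstract.
hapmap_threshold = bonferroni_threshold(ALPHA, 1_000_000)  # about 5e-8
stricter_threshold = bonferroni_threshold(ALPHA, 2_000_000)  # about 2.5e-8
```

Under these assumptions, halving the threshold corresponds to doubling the assumed number of independent tests, which is the sense in which a denser panel "should be corrected for".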
Fang, Zhi Hong; Wang, Si Li; Zhao, Jin Tao; Lin, Zhi Juan; Chen, Lin Yan; Su, Rui; Xie, Si Ting; Carter, Bing Z; Xu, Bing
2016-01-01
MicroRNAs, a class of small noncoding RNAs, have been implicated to regulate gene expression in virtually all important biological processes. Although accumulating evidence demonstrates that miR-150, an important regulator in hematopoiesis, is deregulated in various types of hematopoietic malignancies, the precise mechanisms of miR-150 action are largely unknown. In this study, we found that miR-150 is downregulated in samples from patients with acute lymphoblastic leukemia, acute myeloid leukemia, and chronic myeloid leukemia, and normalized after patients achieved complete remission. Restoration of miR-150 markedly inhibited growth and induced apoptosis of leukemia cells, and reduced tumorigenicity in a xenograft leukemia murine model. Microarray analysis identified multiple novel targets of miR-150, which were validated by quantitative real-time PCR and luciferase reporter assay. Gene ontology and pathway analysis illustrated potential roles of these targets in small-molecule metabolism, transcriptional regulation, RNA metabolism, proteoglycan synthesis in cancer, mTOR signaling pathway, or Wnt signaling pathway. Interestingly, knockdown of one of four miR-150 targets (EIF4B, FOXO4B, PRKCA, and TET3) showed an antileukemia activity similar to that of miR-150 restoration. Collectively, our study demonstrates that miR-150 functions as a tumor suppressor through multiple mechanisms in human leukemia and provides a rationale for utilizing miR-150 as a novel therapeutic agent for leukemia treatment. PMID:27899822
Guo, Wei-Li; Huang, De-Shuang
2017-08-22
Transcription factors (TFs) are DNA-binding proteins that have a central role in regulating gene expression. Identification of DNA-binding sites of TFs is a key task in understanding transcriptional regulation, cellular processes and disease. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) enables genome-wide identification of in vivo TF binding sites. However, it is still difficult to map every TF in every cell line owing to cost and biological material availability, which poses an enormous obstacle for integrated analysis of gene regulation. To address this problem, we propose a novel computational approach, TFBSImpute, for predicting additional TF binding profiles by leveraging information from available ChIP-seq TF binding data. TFBSImpute fuses the dataset to a 3-mode tensor and imputes missing TF binding signals via simultaneous completion of multiple TF binding matrices with positional consistency. We show that signals predicted by our method achieve overall similarity with experimental data and that TFBSImpute significantly outperforms baseline approaches, by assessing the performance of imputation methods against observed ChIP-seq TF binding profiles. In addition, motif analysis shows that TFBSImpute performs better in capturing binding motifs enriched in observed data compared with baselines, indicating that the higher performance of TFBSImpute is not simply due to averaging related samples. We anticipate that our approach will constitute a useful complement to experimental mapping of TF binding, which is beneficial for further study of regulation mechanisms and disease.
Epidemiologic Evaluation of Measurement Data in the Presence of Detection Limits
Lubin, Jay H.; Colt, Joanne S.; Camann, David; Davis, Scott; Cerhan, James R.; Severson, Richard K.; Bernstein, Leslie; Hartge, Patricia
2004-01-01
Quantitative measurements of environmental factors greatly improve the quality of epidemiologic studies but can pose challenges because of the presence of upper or lower detection limits or interfering compounds, which do not allow for precise measured values. We consider the regression of an environmental measurement (dependent variable) on several covariates (independent variables). Various strategies are commonly employed to impute values for interval-measured data, including assignment of one-half the detection limit to nondetected values or of “fill-in” values randomly selected from an appropriate distribution. On the basis of a limited simulation study, we found that the former approach can be biased unless the percentage of measurements below detection limits is small (5–10%). The fill-in approach generally produces unbiased parameter estimates but may produce biased variance estimates and thereby distort inference when 30% or more of the data are below detection limits. Truncated data methods (e.g., Tobit regression) and multiple imputation offer two unbiased approaches for analyzing measurement data with detection limits. If interest resides solely on regression parameters, then Tobit regression can be used. If individualized values for measurements below detection limits are needed for additional analysis, such as relative risk regression or graphical display, then multiple imputation produces unbiased estimates and nominal confidence intervals unless the proportion of missing data is extreme. We illustrate various approaches using measurements of pesticide residues in carpet dust in control subjects from a case–control study of non-Hodgkin lymphoma. PMID:15579415
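The two simple imputation strategies compared in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' procedure: nondetects are represented as `None`, the fill-in distribution is assumed to be lognormal with parameters `mu` and `sigma` supplied by the caller, and the function names are hypothetical.

```python
import random


def substitute_half_lod(values, lod):
    """Replace each nondetect (None) with one-half the detection limit."""
    return [lod / 2 if v is None else v for v in values]


def fill_in_below_lod(values, lod, mu, sigma, rng):
    """Replace each nondetect with a random lognormal draw below the limit.

    Draws come from lognormal(mu, sigma) restricted to values below `lod`
    by rejection sampling; this is slow if that region has very low
    probability under the assumed distribution.
    """
    filled = []
    for v in values:
        if v is None:
            draw = rng.lognormvariate(mu, sigma)
            while draw >= lod:
                draw = rng.lognormvariate(mu, sigma)
            filled.append(draw)
        else:
            filled.append(v)
    return filled


# Toy measurements (invented): two of four values fall below a limit of 1.0.
measurements = [None, 2.3, None, 1.7]
half_lod = substitute_half_lod(measurements, lod=1.0)

# Repeating the fill-in with fresh draws yields several completed data sets,
# which is the starting point for multiple imputation.
rng = random.Random(0)
completed_sets = [fill_in_below_lod(measurements, 1.0, 0.0, 1.0, rng)
                  for _ in range(5)]
```

A single fill-in treats the drawn values as if they had been measured, which is one way to see why variance estimates can be distorted; analysing each completed set and pooling the results is what multiple imputation adds.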
Study Protocol, Sample Characteristics, and Loss to Follow-Up: The OPPERA Prospective Cohort Study
Bair, Eric; Brownstein, Naomi C.; Ohrbach, Richard; Greenspan, Joel D.; Dubner, Ron; Fillingim, Roger B.; Maixner, William; Smith, Shad; Diatchenko, Luda; Gonzalez, Yoly; Gordon, Sharon; Lim, Pei-Feng; Ribeiro-Dasilva, Margarete; Dampier, Dawn; Knott, Charles; Slade, Gary D.
2013-01-01
When studying incidence of pain conditions such as temporomandibular disorders (TMDs), repeated monitoring is needed in prospective cohort studies. However, monitoring methods usually have limitations and, over a period of years, some loss to follow-up is inevitable. The OPPERA prospective cohort study of first-onset TMD screened for symptoms using quarterly questionnaires and examined symptomatic participants to definitively ascertain TMD incidence. During the median 2.8-year observation period, 16% of the 3,263 enrollees completed no follow-up questionnaires, others provided incomplete follow-up, and examinations were not conducted for one third of symptomatic episodes. Although screening methods and examinations were found to have excellent reliability and validity, they were not perfect. Loss to follow-up varied according to some putative TMD risk factors, although multiple imputation to correct the problem suggested that bias was minimal. A second method of multiple imputation that evaluated bias associated with omitted and dubious examinations revealed a slight underestimate of incidence and some small biases in hazard ratios used to quantify effects of risk factors. Although “bottom line” statistical conclusions were not affected, multiply-imputed estimates should be considered when evaluating the large number of risk factors under investigation in the OPPERA study. Perspective These findings support the validity of the OPPERA prospective cohort study for the purpose of investigating the etiology of first-onset TMD, providing the foundation for other papers investigating risk factors hypothesized in the OPPERA project. PMID:24275220
Wang, Kevin Yuqi; Vankov, Emilian R; Lin, Doris Da May
2018-02-01
OBJECTIVE Oligodendroglioma is a rare primary CNS neoplasm in the pediatric population, and only a limited number of studies in the literature have characterized this entity. Existing studies are limited by small sample sizes and discrepant interstudy findings in identified prognostic factors. In the present study, the authors aimed to increase the statistical power in evaluating for potential prognostic factors of pediatric oligodendrogliomas and sought to reconcile the discrepant findings present among existing studies by performing an individual-patient-data (IPD) meta-analysis and using multiple imputation to address data not directly available from existing studies. METHODS A systematic search was performed, and all studies found to be related to pediatric oligodendrogliomas and associated outcomes were screened for inclusion. Each study was searched for specific demographic and clinical characteristics of each patient and the duration of event-free survival (EFS) and overall survival (OS). Given that certain demographic and clinical information of each patient was not available within all studies, a multivariable imputation via chained equations model was used to impute missing data after the mechanism of missing data was determined. The primary end points of interest were hazard ratios for EFS and OS, as calculated by the Cox proportional-hazards model. Both univariate and multivariate analyses were performed. The multivariate model was adjusted for age, sex, tumor grade, mixed pathologies, extent of resection, chemotherapy, radiation therapy, tumor location, and initial presentation. A p value of less than 0.05 was considered statistically significant. RESULTS A systematic search identified 24 studies with both time-to-event and IPD characteristics available, and a total of 237 individual cases were available for analysis. A median of 19.4% of the values among clinical, demographic, and outcome variables in the compiled 237 cases were missing. 
Multivariate Cox regression analysis revealed subtotal resection (p = 0.007 [EFS] and 0.043 [OS]), initial presentation of headache (p = 0.006 [EFS] and 0.004 [OS]), mixed pathologies (p = 0.005 [EFS] and 0.049 [OS]), and location of the tumor in the parietal lobe (p = 0.044 [EFS] and 0.030 [OS]) to be significant predictors of tumor progression or recurrence and death. CONCLUSIONS The use of IPD meta-analysis provides a valuable means for increasing statistical power in investigations of disease entities with a very low incidence. Missing data are common in research, and multiple imputation is a flexible and valid approach for addressing this issue, when it is used conscientiously. Undergoing subtotal resection, having a parietal tumor, having tumors with mixed pathologies, and suffering headaches at the time of diagnosis portended a poorer prognosis in pediatric patients with oligodendroglioma.
DrImpute: imputing dropout events in single cell RNA sequencing data.
Gong, Wuming; Kwak, Il-Youp; Pota, Pruthvi; Koyano-Nakagawa, Naoko; Garry, Daniel J
2018-06-08
The single cell RNA sequencing (scRNA-seq) technique has begun a new era by allowing the observation of gene expression at the single cell level. However, there is also a large amount of technical and biological noise. Because of the low number of RNA transcriptomes and the stochastic nature of the gene expression pattern, there is a high chance that nonzero entries are observed as zero; these are called dropout events. We develop DrImpute to impute dropout events in scRNA-seq data. We show that DrImpute has significantly better performance on the separation of the dropout zeros from true zeros than existing imputation algorithms. We also demonstrate that DrImpute can significantly improve the performance of existing tools for clustering, visualization and lineage reconstruction of nine published scRNA-seq datasets. DrImpute can serve as a very useful addition to the currently existing statistical tools for single cell RNA-seq analysis. DrImpute is implemented in R and is available at https://github.com/gongx030/DrImpute .
miR-186 inhibits cell proliferation in multiple myeloma by repressing Jagged1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Zengyan; Department of Hematology, Hospital Affiliated to Binzhou Medical University, 661 Second Huanghe Street, Binzhou 256603; Zhang, Guoqiang
2016-01-15
MicroRNAs (miRNAs) are small, noncoding ribonucleic acids that regulate gene expression by targeting mRNAs for translational repression and degradation. Accumulating experimental evidence supports a causal role of miRNAs in hematology tumorigenesis. However, the specific functions of miRNAs in the pathogenesis of multiple myeloma (MM) remain to be established. In this study, we demonstrated that miR-186 is commonly downregulated in MM cell lines and patient MM cells. Ectopic expression of miR-186 significantly inhibited cell growth, both in vitro and in vivo, and induced cell cycle G0/G1 arrest. Furthermore, miR-186 induced downregulation of Jagged1 protein expression by directly targeting its 3′-untranslated region (3′-UTR). Conversely, overexpression of Jagged1 rescued cells from miR-186-induced growth inhibition. Our collective results clearly indicate that miR-186 functions as a tumor suppressor in MM, supporting its potential as a therapeutic target for the disease. Highlights: • miR-186 expression is decreased in MM. • miR-186 inhibits MM cell proliferation in vitro and in vivo. • Jagged1 is regulated by miR-186. • Overexpression of Jagged1 reverses the effects of miR-186.
USDA-ARS's Scientific Manuscript database
Microsatellite markers (MS) have traditionally been used for parental verification and are still the international standard in spite of their higher cost, error rate, and turnaround time compared with Single Nucleotide Polymorphisms (SNP)-based assays. Despite domestic and international demands fro...
TargetCompare: A web interface to compare simultaneous miRNAs targets
Moreira, Fabiano Cordeiro; Dustan, Bruno; Hamoy, Igor G; Ribeiro-dos-Santos, André M; dos Santos, Ândrea Ribeiro
2014-01-01
MicroRNAs (miRNAs) are small non-coding nucleotide sequences between 17 and 25 nucleotides in length that primarily function in the regulation of gene expression. A single miRNA has thousands of predicted targets in a complex, regulatory cell signaling network. Therefore, it is of interest to study multiple target genes simultaneously. Hence, we describe a web tool (developed using the Java programming language and a MySQL database server) to analyse multiple targets of pre-selected miRNAs. We cross-validated the tool on the eight most highly expressed miRNAs in the antrum region of the stomach. This helped to identify 43 potential genes that are targets of at least six of the referred miRNAs. The developed tool aims to reduce the randomness and increase the chance of selecting strong candidate target genes and miRNAs responsible for playing important roles in the studied tissue. Availability http://lghm.ufpa.br/targetcompare PMID:25352731
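The selection step described in the abstract above (genes that are targets of at least six of the eight miRNAs) can be illustrated with a minimal Python sketch. This is not the TargetCompare implementation, which the abstract says is written in Java with MySQL; the function name `shared_targets`, the miRNA labels, and the gene names are all hypothetical.

```python
from collections import Counter

def shared_targets(targets_by_mirna, min_mirnas):
    """Return the genes predicted as targets of at least `min_mirnas` miRNAs."""
    counts = Counter()
    for genes in targets_by_mirna.values():
        counts.update(set(genes))  # count each gene at most once per miRNA
    return sorted(gene for gene, n in counts.items() if n >= min_mirnas)

# Hypothetical toy input: miRNA name -> predicted target genes.
toy = {
    "miR-a": ["GENE1", "GENE2", "GENE3"],
    "miR-b": ["GENE2", "GENE3"],
    "miR-c": ["GENE3", "GENE4"],
}
print(shared_targets(toy, min_mirnas=2))
```

With eight pre-selected miRNAs, the threshold used in the abstract would correspond to `min_mirnas=6`.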
TargetCompare: A web interface to compare simultaneous miRNAs targets.
Moreira, Fabiano Cordeiro; Dustan, Bruno; Hamoy, Igor G; Ribeiro-Dos-Santos, André M; Dos Santos, Andrea Ribeiro
2014-01-01
MicroRNAs (miRNAs) are small non-coding nucleotide sequences between 17 and 25 nucleotides in length that primarily function in the regulation of gene expression. A single miRNA has thousands of predicted targets in a complex, regulatory cell signaling network. Therefore, it is of interest to study multiple target genes simultaneously. Hence, we describe a web tool (developed using the Java programming language and a MySQL database server) to analyse multiple targets of pre-selected miRNAs. We cross-validated the tool on the eight most highly expressed miRNAs in the antrum region of the stomach. This helped to identify 43 potential genes that are targets of at least six of the referred miRNAs. The developed tool aims to reduce the randomness and increase the chance of selecting strong candidate target genes and miRNAs responsible for playing important roles in the studied tissue. http://lghm.ufpa.br/targetcompare.
Multiple imputation of rainfall missing data in the Iberian Mediterranean context
NASA Astrophysics Data System (ADS)
Miró, Juan Javier; Caselles, Vicente; Estrela, María José
2017-11-01
Given the increasing need for complete rainfall data networks, diverse methods for filling gaps in observed precipitation series have been proposed in recent years, progressively more advanced than traditional approaches to the problem. The present study validated 10 methods (6 linear, 2 non-linear and 2 hybrid) that allow multiple imputation, i.e., filling at the same time the missing data of multiple incomplete series in a dense network of neighboring stations. These were applied to daily and monthly rainfall in two sectors of the Júcar River Basin Authority (east Iberian Peninsula), which is characterized by high spatial irregularity and difficulty of rainfall estimation. A classification of precipitation according to its genetic origin was applied as pre-processing, and a quantile-mapping adjustment as a post-processing technique. The results showed in general better performance for the non-linear and hybrid methods, with the non-linear PCA (NLPCA) method considerably outperforming the Self Organizing Maps (SOM) method among the non-linear approaches. Among the linear methods, the Regularized Expectation Maximization method (RegEM) was the best, but far from NLPCA. Applying EOF filtering as post-processing of NLPCA (hybrid approach) yielded the best results.
Fast imputation using medium- or low-coverage sequence data
USDA-ARS's Scientific Manuscript database
Direct imputation from raw sequence reads can be more accurate than calling genotypes first and then imputing, especially if read depth is low or error rates high, but different imputation strategies are required than those used for data from genotyping chips. A fast algorithm to impute from lower t...
A Study of Imputation Algorithms. Working Paper Series.
ERIC Educational Resources Information Center
Hu, Ming-xiu; Salvucci, Sameena
Many imputation techniques and imputation software packages have been developed over the years to deal with missing data. Different methods may work well under different circumstances, and it is advisable to conduct a sensitivity analysis when choosing an imputation method for a particular survey. This study reviewed about 30 imputation methods…
Chou, Wen-Chi; Zheng, Hou-Feng; Cheng, Chia-Ho; Yan, Han; Wang, Li; Han, Fang; Richards, J. Brent; Karasik, David; Kiel, Douglas P.; Hsu, Yi-Hsiang
2016-01-01
Imputation using the 1000 Genomes haplotype reference panel has been widely adopted to estimate genotypes in genome wide association studies. To evaluate imputation quality with a relatively larger reference panel and a reference panel composed of different ethnic populations, we conducted imputations in the Framingham Heart Study and the North Chinese Study using a combined reference panel from the 1000 Genomes (N = 1,092) and UK10K (N = 3,781) projects. For rare variants with 0.01% < MAF ≤ 0.5%, imputation in the Framingham Heart Study with the combined reference panel increased well-imputed genotypes (with imputation quality score ≥0.4) from 62.9% to 76.1% when compared to imputation with the 1000 Genomes. For the North Chinese samples, imputation of rare variants with 0.01% < MAF ≤ 0.5% with the combined reference panel increased well-imputed genotypes from 49.8% to 61.8%. The predominant European ancestry of the UK10K and the combined reference panels may explain why there was less of an increase in imputation success in the North Chinese samples. Our results underscore the importance and potential of larger reference panels to impute rare variants, while recognizing that increasing ethnic specific variants in reference panels may result in better imputation for genotypes in some ethnic groups. PMID:28004816
ERIC Educational Resources Information Center
Viens, Julie; Kallenbach, Silja
2001-01-01
Dr. Howard Gardner's introduction of multiple intelligences theory (MI theory) in 1983 generated considerable interest in the educational community. Multiple intelligences was a provocative new theory, claiming at least seven relatively independent intelligences. MI theory presented a conception of intelligence that was in marked contrast to the…
Identification of microRNA-mRNA modules using microarray data.
Jayaswal, Vivek; Lutherborrow, Mark; Ma, David D F; Yang, Yee H
2011-03-06
MicroRNAs (miRNAs) are post-transcriptional regulators of mRNA expression and are involved in numerous cellular processes. Consequently, miRNAs are an important component of gene regulatory networks and an improved understanding of miRNAs will further our knowledge of these networks. There is a many-to-many relationship between miRNAs and mRNAs because a single miRNA targets multiple mRNAs and a single mRNA is targeted by multiple miRNAs. However, most of the current methods for the identification of regulatory miRNAs and their target mRNAs ignore this biological observation and focus on miRNA-mRNA pairs. We propose a two-step method for the identification of many-to-many relationships between miRNAs and mRNAs. In the first step, we obtain miRNA and mRNA clusters using a combination of miRNA-target mRNA prediction algorithms and microarray expression data. In the second step, we determine the associations between miRNA clusters and mRNA clusters based on changes in miRNA and mRNA expression profiles. We consider the miRNA-mRNA clusters with statistically significant associations to be potentially regulatory and, therefore, of biological interest. Our method reduces the interactions between several hundred miRNAs and several thousand mRNAs to a few miRNA-mRNA groups, thereby facilitating a more meaningful biological analysis and a more targeted experimental validation.
A Review of Methods for Missing Data.
ERIC Educational Resources Information Center
Pigott, Therese D.
2001-01-01
Reviews methods for handling missing data in a research study. Model-based methods, such as maximum likelihood using the EM algorithm and multiple imputation, hold more promise than ad hoc methods. Although model-based methods require more specialized computer programs and assumptions about the nature of missing data, these methods are appropriate…
Applied Missing Data Analysis. Methodology in the Social Sciences Series
ERIC Educational Resources Information Center
Enders, Craig K.
2010-01-01
Walking readers step by step through complex concepts, this book translates missing data techniques into something that applied researchers and graduate students can understand and utilize in their own research. Enders explains the rationale and procedural details for maximum likelihood estimation, Bayesian estimation, multiple imputation, and…
Risk-Stratified Imputation in Survival Analysis
Kennedy, Richard E.; Adragni, Kofi P.; Tiwari, Hemant K.; Voeks, Jenifer H.; Brott, Thomas G.; Howard, George
2013-01-01
Background Censoring that is dependent on covariates associated with survival can arise in randomized trials due to changes in recruitment and eligibility criteria to minimize withdrawals, potentially leading to biased treatment effect estimates. Imputation approaches have been proposed to address censoring in survival analysis; and while these approaches may provide unbiased estimates of treatment effects, imputation of a large number of outcomes may over- or underestimate the associated variance based on the imputation pool selected. Purpose We propose an improved method, risk-stratified imputation, as an alternative to address withdrawal related to the risk of events in the context of time-to-event analyses. Methods Our algorithm performs imputation from a pool of replacement subjects with similar values of both treatment and covariate(s) of interest, that is, from a risk-stratified sample. This stratification prior to imputation addresses the requirement of time-to-event analysis that censored observations are representative of all other observations in the risk group with similar exposure variables. We compared our risk-stratified imputation to case deletion and bootstrap imputation in a simulated dataset in which the covariate of interest (study withdrawal) was related to treatment. A motivating example from a recent clinical trial is also presented to demonstrate the utility of our method. Results In our simulations, risk-stratified imputation gives estimates of treatment effect comparable to bootstrap and auxiliary variable imputation while avoiding inaccuracies of the latter two in estimating the associated variance. Similar results were obtained in analysis of clinical trial data. Limitations Risk-stratified imputation has little advantage over other imputation methods when covariates of interest are not related to treatment, although its performance is superior when covariates are related to treatment. 
Risk-stratified imputation is intended for categorical covariates, and may be sensitive to the width of the matching window if continuous covariates are used. Conclusions The use of the risk-stratified imputation should facilitate the analysis of many clinical trials, in which one group has a higher withdrawal rate that is related to treatment. PMID:23818434
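The idea in the abstract above, drawing replacements only from subjects with similar treatment and covariate values, can be sketched as a within-stratum donor draw. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: the record fields, the exact-match strata on a categorical covariate, and the choice to copy the donor's whole outcome.

```python
import random

def risk_stratified_impute(subjects, rng):
    """Replace each withdrawn subject's outcome with that of a randomly drawn
    completer from the same (treatment, covariate) stratum. Assumed rule."""
    pools = {}
    for s in subjects:
        if not s["withdrawn"]:
            pools.setdefault((s["treatment"], s["covariate"]), []).append(s)
    result = []
    for s in subjects:
        pool = pools.get((s["treatment"], s["covariate"]))
        if s["withdrawn"] and pool:
            donor = rng.choice(pool)
            # Copy the donor's outcome; all other fields stay the subject's own.
            s = {**s, "time": donor["time"], "event": donor["event"], "imputed": True}
        result.append(s)  # subjects without a donor pool are left unchanged
    return result

# Hypothetical toy records.
subjects = [
    {"id": 1, "treatment": "A", "covariate": "high", "time": 5.0, "event": 1, "withdrawn": False},
    {"id": 2, "treatment": "A", "covariate": "high", "time": 2.0, "event": 0, "withdrawn": True},
    {"id": 3, "treatment": "B", "covariate": "low", "time": 9.0, "event": 0, "withdrawn": False},
    {"id": 4, "treatment": "B", "covariate": "high", "time": 1.0, "event": 0, "withdrawn": True},
]
imputed = risk_stratified_impute(subjects, random.Random(0))
```

Restricting the donor pool to the subject's own stratum is what distinguishes this from an unstratified draw over all completers; a continuous covariate would need a matching window instead of exact matching, which the abstract flags as a sensitivity.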
Kwon, Ji-Sun; Kim, Jihye; Nam, Dougu; Kim, Sangsoo
2012-06-01
Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanism. We compared the performance of two different GSA implementations that accept GWAS p-values of single nucleotide polymorphisms (SNPs) or gene-by-gene summaries thereof, GSA-SNP and i-GSEA4GWAS, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS study: 259,188 and 1,152,947 SNPs of the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. On the other hand, GSA-SNP reported 94 and 38 hits, respectively, for the two datasets. Similar trends, though to a lesser degree, were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets as well. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on Z-statistics, which are sensitive to variations in the background level of associations. Judicious evaluation of the GSA outcomes, perhaps based on multiple programs, is recommended.
Khateeb, O M; Osborne, D; Mulla, Z D
2010-04-01
Invasive group A streptococcal (GAS) disease is a condition of clinical and public health significance. We conducted epidemiological analyses to determine if the presence of gastrointestinal (GI) complaints (diarrhea and/or vomiting) early in the course of invasive GAS disease is associated with either of two severe outcomes: GAS necrotizing fasciitis, or hospital mortality. Subjects were hospitalized for invasive GAS disease throughout the state of Florida, USA, during a 4-year period. Multiple imputation using the Markov chain Monte Carlo method was used to replace missing values with plausible values. Excluding cases with missing data resulted in a sample size of 138 invasive GAS patients (the complete subject analysis) while the imputed datasets contained 257 records. GI symptomatology within 48 h of hospital admission was not associated with hospital mortality in either the complete subject analysis [adjusted odds ratio (aOR) 0.86, 95% confidence interval (CI) 0.31-2.39] or in the imputed datasets. GI symptoms were significantly associated with GAS necrotizing fasciitis in the complete subject analysis (aOR 4.64, 95% CI 1.18-18.23) and in the imputed datasets but only in patients aged <55 years. The common cause of GI symptoms and necrotizing fasciitis may be streptococcal exotoxins. Clinicians who are treating young individuals presumed to be in the early stages of invasive GAS disease should take note of GI symptoms and remain vigilant for the development of a GAS necrotizing soft-tissue infection.
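The abstract above reports adjusted odds ratios from multiply imputed datasets but does not say how the per-dataset estimates were combined. The conventional choice is Rubin's rules, sketched below on the log-odds scale as an assumption. The input numbers are invented for illustration, and the interval uses a normal approximation rather than the t reference distribution a full analysis would use.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical log odds ratios and squared standard errors from m = 3 imputations.
log_ors = [1.40, 1.55, 1.48]
ses_sq = [0.45, 0.50, 0.47]
q, t = pool_rubin(log_ors, ses_sq)
se = math.sqrt(t)
pooled_or = math.exp(q)
ci = (math.exp(q - 1.96 * se), math.exp(q + 1.96 * se))  # normal approximation
```

The `(1 + 1/m)` factor inflates the between-imputation variance to account for using a finite number of imputations.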
Smuk, M; Carpenter, J R; Morris, T P
2017-02-06
Within epidemiological and clinical research, missing data are a common issue and often overlooked in publications. When the issue of missing observations is addressed, it is usually assumed that the missing data are 'missing at random' (MAR). This assumption should be checked for plausibility; however, it is untestable, so inferences should be assessed for robustness to departures from missing at random. We highlight the method of pattern mixture sensitivity analysis after multiple imputation using colorectal cancer data as an example. We focus on the Dukes' stage variable, which has the highest proportion of missing observations. First, we find the probability of being in each Dukes' stage given the MAR imputed dataset. We use these probabilities in a questionnaire to elicit prior beliefs from experts on what they believe the probability would be in the missing data. The questionnaire responses are then used in a Dirichlet draw to create a Bayesian 'missing not at random' (MNAR) prior to impute the missing observations. The model of interest is applied and inferences are compared to those from the MAR imputed data. The inferences were largely insensitive to departure from MAR. Inferences under MNAR suggested a smaller association between Dukes' stage and death, though the association remained positive and with similarly low p values. We conclude by discussing the positives and negatives of our method and highlight the importance of making people aware of the need to test the MAR assumption.
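The Dirichlet step in the abstract above can be sketched as follows: expert-elicited stage probabilities are turned into Dirichlet parameters, one probability vector is drawn, and missing stages are imputed from it. How the authors map questionnaire responses to Dirichlet parameters is not stated, so the `concentration` scaling, the elicited values, and the function names are assumptions for illustration only.

```python
import random

def dirichlet_draw(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized gammas."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def impute_stage(stages, elicited_probs, concentration, n_missing, rng):
    """Impute `n_missing` stage labels under an elicited MNAR prior."""
    # Assumed mapping: alpha_k = elicited probability * concentration.
    alphas = [p * concentration for p in elicited_probs]
    probs = dirichlet_draw(alphas, rng)
    return rng.choices(stages, weights=probs, k=n_missing)

# Hypothetical elicited probabilities for Dukes' stages among the missing cases.
stages = ["A", "B", "C", "D"]
elicited = [0.10, 0.30, 0.35, 0.25]
rng = random.Random(42)
draws = impute_stage(stages, elicited, concentration=20.0, n_missing=5, rng=rng)
```

A larger `concentration` keeps each draw close to the elicited probabilities, while a smaller one expresses more uncertainty about the experts' beliefs. Repeating the draw once per imputed dataset propagates that uncertainty into the pooled inference.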
Global exosome transcriptome profiling reveals biomarkers for multiple sclerosis.
Selmaj, Igor; Cichalewska, Maria; Namiecinska, Magdalena; Galazka, Grazyna; Horzelski, Wojciech; Selmaj, Krzysztof W; Mycko, Marcin P
2017-05-01
Accumulating evidence supports a role for exosomes in immune regulation. In this study, we investigated the total circulating exosome transcriptome in relapsing-remitting multiple sclerosis (RRMS) patients and healthy controls (HC). Next generation sequencing (NGS) was used to define the global RNA profile of serum exosomes in 19 RRMS patients (9 in relapse, 10 in remission) and 10 HC. We analyzed 5 million reads and >50,000 transcripts per sample, including a detailed analysis of microRNAs (miRNAs) differentially expressed in RRMS. The discovery set data were validated by quantification using digital quantitative polymerase chain reaction with an independent cohort of 63 RRMS patients (33 in relapse, 30 in remission) and 32 HC. Exosomal RNA NGS revealed that of 15 different classes of transcripts detected, 4 circulating exosomal sequences within the miRNA category were differentially expressed in RRMS patients versus HC: hsa-miR-122-5p, hsa-miR-196b-5p, hsa-miR-301a-3p, and hsa-miR-532-5p. Serum exosomal expression of these miRNAs was significantly decreased during relapse in RRMS. These miRNAs were also decreased in patients with a gadolinium enhancement on brain magnetic resonance imaging. In vitro secretion of these miRNAs by peripheral blood mononuclear cells was also significantly impaired in RRMS. These data show that circulating exosomes have a distinct RNA profile in RRMS. Because putative targets for these miRNAs include the signal transducer and activator of transcription 3 and the cell cycle regulator aryl hydrocarbon receptor, the data suggest a disturbed cell-to-cell communication in this disease. Thus, exosomal miRNAs might represent a useful biomarker to distinguish multiple sclerosis relapse. Ann Neurol 2017;81:703-717. © 2017 American Neurological Association.
Sung, Yun J; Gu, C Charles; Tiwari, Hemant K; Arnett, Donna K; Broeckel, Ulrich; Rao, Dabeeru C
2012-07-01
Genotype imputation provides imputation of untyped single nucleotide polymorphisms (SNPs) that are present on a reference panel such as those from the HapMap Project. It is popular for increasing statistical power and comparing results across studies using different platforms. Imputation for African American populations is challenging because their linkage disequilibrium blocks are shorter and also because no ideal reference panel is available due to admixture. In this paper, we evaluated three imputation strategies for African Americans. The intersection strategy used a combined panel consisting of SNPs polymorphic in both CEU and YRI. The union strategy used a panel consisting of SNPs polymorphic in either CEU or YRI. The merge strategy merged results from two separate imputations, one using CEU and the other using YRI. Because recent investigators are increasingly using the data from the 1000 Genomes (1KG) Project for genotype imputation, we evaluated both 1KG-based imputations and HapMap-based imputations. We used 23,707 SNPs from chromosomes 21 and 22 on Affymetrix SNP Array 6.0 genotyped for 1,075 HyperGEN African Americans. We found that 1KG-based imputations provided a substantially larger number of variants than HapMap-based imputations, about three times as many common variants and eight times as many rare and low-frequency variants. This higher yield is expected because the 1KG panel includes more SNPs. Accuracy rates using 1KG data were slightly lower than those using HapMap data before filtering, but slightly higher after filtering. The union strategy provided the highest imputation yield with next highest accuracy. The intersection strategy provided the lowest imputation yield but the highest accuracy. The merge strategy provided the lowest imputation accuracy. We observed that SNPs polymorphic only in CEU had much lower accuracy, reducing the accuracy of the union strategy. 
Our findings suggest that 1KG-based imputations can facilitate discovery of significant associations for SNPs across the whole MAF spectrum. Because the 1KG Project is still under way, we expect that later versions will provide better imputation performance. © 2012 Wiley Periodicals, Inc.
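The three reference panel strategies compared in the abstract above can be written as set operations on SNP identifiers. The intersection and union definitions follow the abstract directly. The abstract does not say how the two separate imputations are combined in the merge strategy, so the rule below (keep the call with the higher quality score) is an assumption, and all SNP identifiers and scores are hypothetical.

```python
def build_panels(ceu_polymorphic, yri_polymorphic):
    """SNP sets for the intersection and union strategies."""
    ceu, yri = set(ceu_polymorphic), set(yri_polymorphic)
    return {"intersection": ceu & yri, "union": ceu | yri}

def merge_imputations(ceu_result, yri_result):
    """Merge two imputations given as SNP id -> (genotype, quality).

    Assumed rule: where both imputations cover a SNP, keep the higher-quality call.
    """
    merged = dict(ceu_result)
    for snp, call in yri_result.items():
        if snp not in merged or call[1] > merged[snp][1]:
            merged[snp] = call
    return merged

# Hypothetical toy data.
panels = build_panels(["rs1", "rs2", "rs3"], ["rs2", "rs3", "rs4"])
merged = merge_imputations(
    {"rs1": ("AA", 0.90), "rs2": ("AG", 0.60)},
    {"rs2": ("GG", 0.80), "rs4": ("CT", 0.70)},
)
```

The union panel covers more SNPs than the intersection panel by construction, which is consistent with the yield ordering the abstract reports.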
Inami, Shigenobu; Ishibashi, Fumiyuki; Waxman, Sergio; Okamatsu, Kentaro; Seimiya, Koji; Takano, Masamichi; Uemura, Ryota; Sano, Junko; Mizuno, Kyoichi
2008-03-01
Multiple angioscopic yellow plaques are associated with diffuse atherosclerotic plaque and may be prevalent in patients with myocardial infarction (MI). In the present study, the yellow plaques in the coronary arteries of patients with MI were evaluated using quantitative colorimetry and compared with those of patients with stable angina (SA). In the recorded angioscopic images of 3 coronary vessels in 29 patients (15 patients with MI, 14 with SA), yellow plaques were determined as visually yellow regions with b* value >0 (yellow color intensity) measured by the quantitative colorimetric method. A total of 90 yellow plaques were identified (b* = 19.35 ± 8.3, range 3.05-45.35). Yellow plaques were significantly more prevalent in culprit lesions of MI (14 of 15, 93%) than in those of SA (8 of 14, 57%; p=0.03). In non-culprit segments, yellow plaques were similarly prevalent in 13 (87%) patients with MI and 11 (79%) with SA (p=0.65). Overall, multiple (≥2) yellow plaques were prevalent in 13 (87%) patients with MI, similar to the 10 (71%) with SA (p=0.38). The number of yellow plaques was significantly higher in patients with MI (3.8 ± 1.9) than in those with SA (2.4 ± 1.6, p=0.03). The present study suggests that patients with MI tend to have diffuse atherosclerotic plaque in their coronary arteries.
Alarcón-Riquelme, Marta E.; Ziegler, Julie T.; Molineros, Julio; Howard, Timothy D.; Moreno-Estrada, Andrés; Sánchez-Rodríguez, Elena; Ainsworth, Hannah C.; Ortiz-Tello, Patricia; Comeau, Mary E.; Rasmussen, Astrid; Kelly, Jennifer A.; Adler, Adam; Acevedo-Vázquez, Eduardo; Cucho, Jorge Mariano; García-De la Torre, Ignacio; Cardiel, Mario H.; Miranda, Pedro; Catoggio, Luis; Maradiaga-Ceceña, Marco; Gaffney, Patrick; Vyse, Timothy; Criswell, Lindsey A.; Tsao, Betty P.; Sivils, Kathy L.; Bae, Sang-Cheol; James, Judith A.; Kimberly, Robert; Kaufman, Ken; Harley, John B.; Esquivel-Valerio, Jorge; Moctezuma, José F.; García, Mercedes A.; Berbotto, Guillermo; Babini, Alejandra; Scherbarth, Hugo; Toloza, Sergio; Baca, Vicente; Nath, Swapan K.; Salinas, Carlos Aguilar; Orozco, Lorena; Tusié-Luna, Teresa; Zidovetzki, Raphael; Pons-Estel, Bernardo A.; Langefeld, Carl D.; Jacob, Chaim O.
2016-01-01
OBJECTIVES Systemic lupus erythematosus (SLE) is a chronic autoimmune disease with a strong genetic component. Our aim was to perform the first genome-wide association study on individuals from the Americas enriched for Native American heritage. MATERIALS AND METHODS We analyzed 3,710 individuals from four countries of Latin America and the United States diagnosed with SLE and healthy controls. Samples were genotyped with the HumanOmni1 BeadChip. Data for out-of-study controls were obtained for the HumanOmni2.5. Statistical analyses were performed using SNPTEST and SNPGWA. Data were adjusted for genomic control and FDR. Imputation was done using IMPUTE2, and HiBAG for classical HLA alleles. RESULTS The IRF5-TNPO3 region showed the strongest association and largest odds ratio (OR) (rs10488631, Pgcadj = 2.61 × 10^−29, OR = 2.12, 95% CI: 1.88–2.39) followed by the HLA class II on the DQA2-DQB1 loci (rs9275572, Pgcadj = 1.11 × 10^−16, OR = 1.62, 95% CI: 1.46–1.80; rs9271366, Pgcadj = 6.46 × 10^−12, OR = 2.06, 95% CI: 1.71–2.50). Other known SLE loci associated were ITGAM, STAT4, TNIP1, NCF2 and IRAK1. We identified a novel locus on 10q24.33 (rs4917385, Pgcadj = 1.4 × 10^−8) with an eQTL effect (Peqtl = 8.0 × 10^−37 at USMG5/miR1307), and describe novel loci. We corroborate SLE-risk loci previously identified in Europeans and Asians. Local ancestry estimation showed that HLA allele risk contribution is of European ancestral origin. Imputation of HLA alleles suggested that autochthonous Native American haplotypes provide protection. CONCLUSIONS Our results show the insight gained by studying admixed populations to delineate the genetic architecture that underlies autoimmune and complex diseases. PMID:26606652
Factors Influencing Early Feeding of Foods and Drinks Containing Free Sugars—A Birth Cohort Study
Ha, Diep H.; Do, Loc G.; Spencer, Andrew John; Golley, Rebecca K.; Rugg-Gunn, Andrew J.; Levy, Steven M.
2017-01-01
Early feeding of free sugars to young children can increase the preference for sweetness and the risk of consuming a cariogenic diet high in free sugars later in life. This study aimed to investigate early life factors influencing early introduction of foods/drinks containing free sugars. Data from an ongoing population-based birth cohort study in Australia were used. Mothers of newborn children completed questionnaires at birth and subsequently at ages 3, 6, 12, and 24 months. The outcome was reported feeding (Yes/No) at age 6–9 months of common food/drink sources of free sugars (hereafter referred to as foods/drinks with free sugars). Household income quartiles, mother’s sugar-sweetened beverage (SSB) consumption, and other maternal factors were exposure variables. Analysis was conducted progressively from bivariate to multivariable log-binomial regression with robust standard error estimation to calculate prevalence ratios (PR) of being fed foods/drinks with free sugars at an early age (by 6–9 months). Models for both complete cases and with multiple imputations (MI) for missing data were generated. Of 1479 mother/child dyads, 21% of children had been fed foods/drinks with free sugars. There was a strong income gradient and a significant positive association with maternal SSB consumption. In the complete-case model, income Q1 and Q2 had PRs of 1.9 (1.2–3.1) and 1.8 (1.2–2.6) against Q4, respectively. The PR for mothers ingesting SSB every day was 1.6 (1.2–2.3). The PR for children who had been breastfed to at least three months was 0.6 (0.5–0.8). Similar findings were observed in the MI model. Household income at birth and maternal behaviours were significant determinants of early feeding of foods/drinks with free sugars. PMID:29065527
ERIC Educational Resources Information Center
Kallenbach, Silja, Ed.; Viens, Julie, Ed.
This document contains nine papers from a systematic, classroom-based study of multiple intelligences (MI) theory in different adult learning contexts during which adult educators from rural and urban areas throughout the United States conducted independent inquiries into the question of how MI theory can support instruction and assessment in…
Excessive expression of miR-27 impairs Treg-mediated immunological tolerance
Cruz, Leilani O.; Hashemifar, Somaye Sadat; Wu, Cheng-Jang; Cho, Sunglim; Nguyen, Duc T.; Lin, Ling-Li; Khan, Aly Azeem
2017-01-01
MicroRNAs (miRs) are tightly regulated in the immune system, and aberrant expression of miRs often results in hematopoietic malignancies and autoimmune diseases. Previously, it was suggested that elevated levels of miR-27 in T cells isolated from patients with multiple sclerosis facilitate disease progression by inhibiting Th2 immunity and promoting pathogenic Th1 responses. Here we have demonstrated that, although mice with T cell–specific overexpression of miR-27 harbor dysregulated Th1 responses and develop autoimmune pathology, these disease phenotypes are not driven by miR-27 in effector T cells in a cell-autonomous manner. Rather, dysregulation of Th1 responses and autoimmunity resulted from a perturbed Treg compartment. Excessive miR-27 expression in murine T cells severely impaired Treg differentiation. Moreover, Tregs with exaggerated miR-27–mediated gene regulation exhibited diminished homeostasis and suppressor function in vivo. Mechanistically, we determined that miR-27 represses several known as well as previously uncharacterized targets that play critical roles in controlling multiple aspects of Treg biology. Collectively, our data show that miR-27 functions as a key regulator in Treg development and function and suggest that proper regulation of miR-27 is pivotal to safeguarding Treg-mediated immunological tolerance. PMID:28067667
2012-01-01
Background Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation. Methods An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing and haplotype library imputation and segregation analysis. Results Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored. Conclusions The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations. PMID:22462519
Performance of genotype imputation for low frequency and rare variants from the 1000 genomes.
Zheng, Hou-Feng; Rong, Jing-Jing; Liu, Ming; Han, Fang; Zhang, Xing-Wei; Richards, J Brent; Wang, Li
2015-01-01
Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most imputations have been run using HapMap samples as reference, and imputation of low frequency and rare variants (minor allele frequency (MAF) < 5%) has not been systematically assessed. With the emergence of next-generation sequencing, large reference panels (such as the 1000 Genomes panel) are available to facilitate imputation of these variants. Therefore, in order to estimate the performance of low frequency and rare variant imputation, we imputed 153 individuals, each of whom had 3 different genotype array datasets (317k, 610k and 1 million SNPs), to three different reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1), using IMPUTE version 2. The differences between these three releases of the 1000 Genomes data are the sample size, ancestry diversity, number of variants and their frequency spectrum. We found that both reference panel and GWAS chip density affect the imputation of low frequency and rare variants. 1KGphase1 outperformed the other 2 panels, with a higher concordance rate, a higher proportion of well-imputed variants (info>0.4) and a higher mean info score in each MAF bin. Similarly, the 1M chip array outperformed 610K and 317K. However, for very rare variants (MAF ≤ 0.3%), only 0-1% of the variants were well imputed. We conclude that the imputation of low frequency and rare variants improves with larger reference panels and higher density of genome-wide genotyping arrays. Yet, despite a large reference panel size and dense genotyping density, very rare variants remain difficult to impute.
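The per-bin summary used in the abstract above (the proportion of variants with info > 0.4 in each MAF bin) is simple to compute once each variant has a MAF and an info score. The sketch below is illustrative only: the bin edges, the half-open binning convention, and the toy values are assumptions, and only the 0.4 threshold comes from the abstract.

```python
def well_imputed_by_bin(variants, bin_edges, info_threshold=0.4):
    """Proportion of variants with info > threshold in each MAF bin (lo, hi].

    `variants` is an iterable of (maf, info) pairs; empty bins are omitted.
    """
    variants = list(variants)
    result = {}
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        infos = [info for maf, info in variants if lo < maf <= hi]
        if infos:
            result[(lo, hi)] = sum(i > info_threshold for i in infos) / len(infos)
    return result

# Hypothetical (MAF, info) pairs.
toy_variants = [(0.002, 0.10), (0.003, 0.50), (0.02, 0.90), (0.04, 0.30), (0.20, 0.95)]
summary = well_imputed_by_bin(toy_variants, bin_edges=[0.0, 0.005, 0.05, 0.5])
```

Computing the same summary for each reference panel or array density gives the comparison the abstract describes.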
Mitt, Mario; Kals, Mart; Pärn, Kalle; Gabriel, Stacey B; Lander, Eric S; Palotie, Aarno; Ripatti, Samuli; Morris, Andrew P; Metspalu, Andres; Esko, Tõnu; Mägi, Reedik; Palta, Priit
2017-06-01
Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF)≥5% and low-frequency variants (0.5≤MAF<5%) across diverse populations, but the imputation of rare variation (MAF<0.5%) is still rather limited. In the current study, we evaluate imputation accuracy achieved with reference panels from diverse populations with a population-specific high-coverage (30 ×) whole-genome sequencing (WGS) based reference panel, comprising 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants were significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies.
Mitt, Mario; Kals, Mart; Pärn, Kalle; Gabriel, Stacey B; Lander, Eric S; Palotie, Aarno; Ripatti, Samuli; Morris, Andrew P; Metspalu, Andres; Esko, Tõnu; Mägi, Reedik; Palta, Priit
2017-01-01
Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF)≥5% and low-frequency variants (0.5≤MAF<5%) across diverse populations, but the imputation of rare variation (MAF<0.5%) is still rather limited. In the current study, we evaluate imputation accuracy achieved with reference panels from diverse populations with a population-specific high-coverage (30 ×) whole-genome sequencing (WGS) based reference panel, comprising 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants were significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies. PMID:28401899
Xu, Zhengwei; Huang, Chen; Hao, Dingjun
2017-02-01
MicroRNAs (miRNAs) have emerged as important regulators in multiple myeloma (MM). miR-1271 is a tumor suppressor in many cancer types. However, the biological role of miR-1271 in MM remains unclear. In the present study, we elucidated the biological role of miR-1271 in MM. Results showed that miR-1271 was significantly decreased in primary MM cells from MM patients and MM cell lines. Overexpression of miR-1271 inhibited proliferation and promoted apoptosis of MM cells. Conversely, suppression of miR-1271 showed the opposite effect. Bioinformatics algorithm analysis predicted that smoothened (SMO), the activator of Hedgehog (HH) signaling pathway, was a direct target of miR-1271 that was experimentally verified by a dual-luciferase reporter assay. Furthermore, overexpression of miR-1271 inhibited SMO expression and HH signaling pathway. Conversely, the restoration of SMO expression markedly abolished the effect of miR-1271 overexpression on cell proliferation, apoptosis and HH signaling pathway in MM cells. Taken together, the present study suggests that miR-1271 functions as a tumor suppressor that inhibits proliferation and promotes apoptosis of MM cells through inhibiting SMO-mediated HH signaling pathway. This finding implies that miR-1271 is a potential therapeutic target for the treatment of MM.
Missing CD4+ cell response in randomized clinical trials of maraviroc and dolutegravir.
Cuffe, Robert; Barnett, Carly; Granier, Catherine; Machida, Mitsuaki; Wang, Cunshan; Roger, James
2015-10-01
Missing data can compromise inferences from clinical trials, yet the topic has received little attention in the clinical trial community. Shortcomings in methods commonly used to analyze studies with missing data (complete case, last- or baseline-observation carried forward) have been highlighted in a recent Food and Drug Administration-sponsored report. This report recommends how to mitigate the issues associated with missing data. We present an example of the proposed concepts using data from recent clinical trials. CD4+ cell count data from the previously reported SINGLE and MOTIVATE studies of dolutegravir and maraviroc were analyzed using a variety of statistical methods to explore the impact of missing data. Four methodologies were used: complete case analysis, simple imputation, mixed models for repeated measures, and multiple imputation. We compared the sensitivity of conclusions to the volume of missing data and to the assumptions underpinning each method. Rates of missing data were greater in the MOTIVATE studies (35%-68% premature withdrawal) than in SINGLE (12%-20%). The sensitivity of results to assumptions about missing data was related to the volume of missing data. Estimates of treatment differences by various analysis methods ranged across a 61 cells/mm3 window in MOTIVATE and a 22 cells/mm3 window in SINGLE. Where missing data are anticipated, analyses require robust statistical and clinical debate of the necessary but unverifiable underlying statistical assumptions. Multiple imputation makes these assumptions transparent, can accommodate a broad range of scenarios, and is a natural analysis for clinical trials in HIV with missing data.
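The abstract above lists multiple imputation among the compared methods but does not spell out how per-imputation results are combined. A minimal sketch of the standard pooling step (Rubin's rules) follows; the function name `pool_rubin` and the numbers are hypothetical illustrations, not values or code from the trials.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and one variance per estimate")
    pooled = mean(estimates)         # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical treatment differences (cells/mm3) from m = 3 imputed data sets,
# each with a hypothetical squared standard error of 4.0.
pooled_estimate, pooled_variance = pool_rubin([10.0, 12.0, 14.0], [4.0, 4.0, 4.0])
print(pooled_estimate, pooled_variance)
```

The total variance exceeds the average within-imputation variance whenever the imputed data sets disagree, which is how the uncertainty due to the missing values enters the final standard error.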
ERIC Educational Resources Information Center
Peariso, Jamon F.
2008-01-01
Howard Gardner's Multiple Intelligences (MI) theory has been widely accepted in the field of education for the past two decades. Most educators have been subjugated to the MI theory and to the many issues that its implementation in the classroom brings. This is often done without ever looking at or being presented the critic's view or research on…
Loong, Bronwyn; Zaslavsky, Alan M.; He, Yulei; Harrington, David P.
2013-01-01
Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents’ identities and sensitive attributes, by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by CanCORS, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the United States. We review inferential methods for partially synthetic data, and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data, and discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model, to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are suited best for preliminary data analyses purposes. These methods address the requirement to share data in clinical research without compromising confidentiality. PMID:23670983
Nixon, Richard M; Duffy, Stephen W; Fender, Guy R K
2003-09-24
The Anglia Menorrhagia Education Study (AMES) is a randomized controlled trial testing the effectiveness of an education package applied to general practices. Binary data are available from two sources: general practitioner reported referrals to hospital, and referrals to hospital determined by independent audit of the general practices. The former may be regarded as a surrogate for the latter, which is regarded as the true endpoint. Data are only available for the true endpoint on a subset of the practices, but there are surrogate data for almost all of the audited practices and for most of the remaining practices. The aim of this paper was to estimate the treatment effect using data from every practice in the study. Where the true endpoint was not available, it was estimated by three approaches: a regression method, multiple imputation and a full likelihood model. Including the surrogate data in the analysis yielded an estimate of the treatment effect which was more precise than an estimate gained from using the true endpoint data alone. The full likelihood method provides a new imputation tool at the disposal of trials with surrogate data.
Missing data imputation: focusing on single imputation.
Zhang, Zhongheng
2016-01-01
Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias, and some useful information will be omitted from the analysis. Therefore, many imputation methods have been developed to close this gap. The present article focuses on single imputation. Imputations with mean, median and mode are simple but, like complete case analysis, can introduce bias in the mean and deviation. Furthermore, they ignore relationships with other variables. Regression imputation can preserve the relationship between missing values and other variables. Many sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations.
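The article summarized above works in R; as a language-neutral illustration, the following Python sketch shows what mean, median and mode imputation do to a single variable. The helper `impute_single` and the `ages` data are hypothetical and are not taken from the article.

```python
from statistics import mean, median, mode


def impute_single(values, strategy="mean"):
    """Replace every None with one summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    summary = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill = summary(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with two missing entries
ages = [30, None, 40, 40, None, 70]
print(impute_single(ages, "mean"))
print(impute_single(ages, "median"))
print(impute_single(ages, "mode"))
```

Because every missing entry receives the same value, the spread of the completed variable shrinks and its relationship with other variables is ignored, which is the limitation the abstract points out.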
Missing data imputation: focusing on single imputation
2016-01-01
Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias, and some useful information will be omitted from the analysis. Therefore, many imputation methods have been developed to close this gap. The present article focuses on single imputation. Imputations with mean, median and mode are simple but, like complete case analysis, can introduce bias in the mean and deviation. Furthermore, they ignore relationships with other variables. Regression imputation can preserve the relationship between missing values and other variables. Many sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations. PMID:26855945
Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.
Ritz, Cecilia; Edén, Patrik
2008-01-19
For 2-dye microarray platforms, some missing values may arise from an un-measurably low RNA expression in one channel only. Information on such "one-channel depletion" is so far not included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates was reduced by up to 51%, depending on dataset. By including more information in the imputation step, we more accurately estimate missing expression values.
Shu, Le; Zhang, Xiaobo
2017-01-01
Growing evidence has indicated that the innate immune system can be regulated by microRNAs (miRNAs). However, the mechanism underlying miRNA-mediated simultaneous activation of multiple immune pathways remains unknown. To address this issue, the role of host miR-12 in shrimp (Marsupenaeus japonicus) antiviral immune responses was characterized in the present study. The results indicated that miR-12 participated in virus infection, host phagocytosis, and apoptosis in defense against white spot syndrome virus invasion. miR-12 could simultaneously trigger phagocytosis, apoptosis, and antiviral immunity through the synchronous downregulation of the expression of shrimp genes [PTEN (phosphatase and tensin homolog) and BI-1(transmembrane BAX inhibitor motif containing 6)] and the viral gene (wsv024). Further analysis showed that miR-12 could synchronously mediate the 5′–3′ exonucleolytic degradation of its target mRNAs, and this degradation terminated in the vicinity of the 3′ untranslated region sequence complementary to the seed sequence of miR-12. Therefore, the present study showed novel aspects of the miRNA-mediated simultaneous regulation of multiple immune pathways. PMID:28824612
Kou, Shu-Jun; Wu, Xiao-Meng; Liu, Zheng; Liu, Yuan-Long; Xu, Qiang; Guo, Wen-Wu
2012-12-01
miRNAs have recently been reported to modulate somatic embryogenesis (SE), a key pathway of plant regeneration in vitro. For expression level detection and subsequent function dissection of miRNAs in certain biological processes, qRT-PCR is one of the most effective and sensitive techniques, for which suitable reference gene selection is a prerequisite. In this study, three miRNAs and eight non-coding RNAs (ncRNA) were selected as reference candidates, and their expression stability was inspected in developing citrus SE tissues cultured at 20, 25, and 30 °C. Stability of the eight non-miRNA ncRNAs was further validated in five adult tissues without temperature treatment. The best single reference gene for SE tissues was snoR14 or snoRD25, while for the adult tissues the best one was U4; although they were not as stable as the optimal multiple references snoR14 + U6 for SE tissues and snoR14 + U5 for adult tissues. For expression normalization of less abundant miRNAs in SE tissues, miR3954 was assessed as a viable reference. Single reference gene snoR14 outperformed multiple references for the overall SE and adult tissues. As one of the pioneer systematic studies on reference gene identification for plant miRNA normalization, this study benefits future exploration on miRNA function in citrus and provides valuable information for similar studies in other higher plants. Three miRNAs and eight non-coding RNAs were tested as reference candidates on developing citrus SE tissues. Best single references snoR14 or snoRD25 and optimal multiple references snoR14 + U6, snoR14 + U5 were identified.
Evidence That Up-Regulation of MicroRNA-29 Contributes to Postnatal Body Growth Deceleration
Kamran, Fariha; Andrade, Anenisia C.; Nella, Aikaterini A.; Clokie, Samuel J.; Rezvani, Geoffrey; Nilsson, Ola; Baron, Jeffrey
2015-01-01
Body growth is rapid in infancy but subsequently slows and eventually ceases due to a progressive decline in cell proliferation that occurs simultaneously in multiple organs. We previously showed that this decline in proliferation is driven in part by postnatal down-regulation of a large set of growth-promoting genes in multiple organs. We hypothesized that this growth-limiting genetic program is orchestrated by microRNAs (miRNAs). Bioinformatic analysis identified target sequences of the miR-29 family of miRNAs to be overrepresented in age–down-regulated genes. Concomitantly, expression microarray analysis in mouse kidney and lung showed that all members of the miR-29 family, miR-29a, -b, and -c, were strongly up-regulated from 1 to 6 weeks of age. Real-time PCR confirmed that miR-29a, -b, and -c were up-regulated with age in liver, kidney, lung, and heart, and their expression levels were higher in hepatocytes isolated from 5-week-old mice than in hepatocytes from embryonic mouse liver at embryonic day 16.5. We next focused on 3 predicted miR-29 target genes (Igf1, Imp1, and Mest), all of which are growth-promoting. A 3′-untranslated region containing the predicted target sequences from each gene was placed individually in a luciferase reporter construct. Transfection of miR-29 mimics suppressed luciferase gene activity for all 3 genes, and this suppression was diminished by mutating the target sequences, suggesting that these genes are indeed regulated by miR-29. Taken together, the findings suggest that up-regulation of miR-29 during juvenile life drives the down-regulation of multiple growth-promoting genes, thus contributing to physiological slowing and eventual cessation of body growth. PMID:25866874
Evidence That Up-Regulation of MicroRNA-29 Contributes to Postnatal Body Growth Deceleration.
Kamran, Fariha; Andrade, Anenisia C; Nella, Aikaterini A; Clokie, Samuel J; Rezvani, Geoffrey; Nilsson, Ola; Baron, Jeffrey; Lui, Julian C
2015-06-01
Body growth is rapid in infancy but subsequently slows and eventually ceases due to a progressive decline in cell proliferation that occurs simultaneously in multiple organs. We previously showed that this decline in proliferation is driven in part by postnatal down-regulation of a large set of growth-promoting genes in multiple organs. We hypothesized that this growth-limiting genetic program is orchestrated by microRNAs (miRNAs). Bioinformatic analysis identified target sequences of the miR-29 family of miRNAs to be overrepresented in age-down-regulated genes. Concomitantly, expression microarray analysis in mouse kidney and lung showed that all members of the miR-29 family, miR-29a, -b, and -c, were strongly up-regulated from 1 to 6 weeks of age. Real-time PCR confirmed that miR-29a, -b, and -c were up-regulated with age in liver, kidney, lung, and heart, and their expression levels were higher in hepatocytes isolated from 5-week-old mice than in hepatocytes from embryonic mouse liver at embryonic day 16.5. We next focused on 3 predicted miR-29 target genes (Igf1, Imp1, and Mest), all of which are growth-promoting. A 3'-untranslated region containing the predicted target sequences from each gene was placed individually in a luciferase reporter construct. Transfection of miR-29 mimics suppressed luciferase gene activity for all 3 genes, and this suppression was diminished by mutating the target sequences, suggesting that these genes are indeed regulated by miR-29. Taken together, the findings suggest that up-regulation of miR-29 during juvenile life drives the down-regulation of multiple growth-promoting genes, thus contributing to physiological slowing and eventual cessation of body growth.
Genotype imputation in a tropical crossbred dairy cattle population.
Oliveira Júnior, Gerson A; Chud, Tatiane C S; Ventura, Ricardo V; Garrick, Dorian J; Cole, John B; Munari, Danísio P; Ferraz, José B S; Mullart, Erik; DeNise, Sue; Smith, Shannon; da Silva, Marcos Vinícius G B
2017-12-01
The objective of this study was to investigate different strategies for genotype imputation in a population of crossbred Girolando (Gyr × Holstein) dairy cattle. The data set consisted of 478 Girolando, 583 Gyr, and 1,198 Holstein sires genotyped at high density with the Illumina BovineHD (Illumina, San Diego, CA) panel, which includes ∼777K markers. The accuracy of imputation from low (20K) and medium densities (50K and 70K) to the HD panel density and from low to 50K density was investigated. Seven scenarios using different reference populations (RPop) considering Girolando, Gyr, and Holstein breeds separately or combinations of animals of these breeds were tested for imputing genotypes of 166 randomly chosen Girolando animals. The population genotype imputation was performed using FImpute. Imputation accuracy was measured as the correlation between observed and imputed genotypes (CORR) and also as the proportion of genotypes that were imputed correctly (CR). This is the first paper on imputation accuracy in a Girolando population. The sample-specific imputation accuracies ranged from 0.38 to 0.97 (CORR) and from 0.49 to 0.96 (CR) imputing from low and medium densities to HD, and from 0.41 to 0.95 (CORR) and from 0.50 to 0.94 (CR) for imputation from 20K to 50K. The CORR_anim exceeded 0.96 (for 50K and 70K panels) when only Girolando animals were included in RPop (S1). We found smaller CORR_anim when Gyr (S2) was used instead of Holstein (S3) as RPop. The same behavior was observed between S4 (Gyr + Girolando) and S5 (Holstein + Girolando) because the target animals were more related to the Holstein population than to the Gyr population. The highest imputation accuracies were observed for scenarios including Girolando animals in the reference population, whereas using only Gyr animals resulted in low imputation accuracies, suggesting that the haplotypes segregating in the Girolando population had a greater effect on accuracy than the purebred haplotypes. All chromosomes had similar imputation accuracies (CORR_snp) within each scenario. Crossbred animals (Girolando) must be included in the reference population to provide the best imputation accuracies. Copyright © 2017 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
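The two accuracy measures named in the abstract above, the correlation between observed and imputed genotypes (CORR) and the proportion of correctly imputed genotypes (CR), can be sketched as follows. The 0/1/2 allele-count coding, the function names and the toy genotypes are assumptions for illustration; the study's own computation (for example, per animal versus per SNP) may differ in detail.

```python
from math import sqrt


def concordance_rate(observed, imputed):
    """CR: share of genotypes whose imputed value equals the observed value."""
    matches = sum(o == i for o, i in zip(observed, imputed))
    return matches / len(observed)


def genotype_correlation(observed, imputed):
    """CORR: Pearson correlation between observed and imputed genotype codes."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    cov = sum((o - mean_o) * (i - mean_i) for o, i in zip(observed, imputed))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    # Undefined when either vector is constant (zero variance)
    return cov / sqrt(var_o * var_i)


# Hypothetical genotypes coded as counts of one allele (0, 1 or 2)
observed = [0, 1, 2, 1, 0, 2]
imputed = [0, 1, 2, 2, 0, 2]
cr = concordance_rate(observed, imputed)
corr = genotype_correlation(observed, imputed)
print(cr, corr)
```

The two measures can diverge: CR counts every mismatch equally, whereas CORR is also sensitive to how far the imputed code lies from the observed one and to the allele frequency of the marker.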
ERIC Educational Resources Information Center
Pfaffel, Andreas; Spiel, Christiane
2016-01-01
Approaches to correcting correlation coefficients for range restriction have been developed under the framework of large sample theory. The accuracy of missing data techniques for correcting correlation coefficients for range restriction has thus far only been investigated with relatively large samples. However, researchers and evaluators are…
The Agricultural Health Study (AHS), a large prospective cohort, was designed to elucidate associations between pesticide use and other agricultural exposures and health outcomes. The cohort includes 57,310 pesticide applicators who were enrolled between 1993 and 1997 in Iowa and...
USDA-ARS's Scientific Manuscript database
Microsatellite markers (MS) have traditionally been used for parental verification and are still the international standard in spite of their higher cost, error rate, and turnaround time compared with Single Nucleotide Polymorphisms (SNP) -based assays. Despite domestic and international demands fr...
A Statistical Model for Misreported Binary Outcomes in Clustered RCTs of Education Interventions
ERIC Educational Resources Information Center
Schochet, Peter Z.
2013-01-01
In randomized control trials (RCTs) of educational interventions, there is a growing literature on impact estimation methods to adjust for missing student outcome data using such methods as multiple imputation, the construction of nonresponse weights, casewise deletion, and maximum likelihood methods (see, for example, Allison, 2002; Graham, 2009;…
ERIC Educational Resources Information Center
Asendorpf, Jens B.; van de Schoot, Rens; Denissen, Jaap J. A.; Hutteman, Roos
2014-01-01
Most longitudinal studies are plagued by drop-out related to variables at earlier assessments (systematic attrition). Although systematic attrition is often analysed in longitudinal studies, surprisingly few researchers attempt to reduce biases due to systematic attrition, even though this is possible and nowadays technically easy. This is…
The Technical Report of NAEP's 1990 Trial State Assessment Program.
ERIC Educational Resources Information Center
Koffler, Stephen L.; And Others
This report documents the design and data analysis procedures of the Trial State Assessment Program of the National Assessment of Educational Progress (NAEP). Today the NAEP is the only survey using advanced plausible values methodology that uses a multiple imputation procedure in a psychometric context. The 1990 Trial State Assessment collected…
Family structure as a predictor of screen time among youth.
McMillan, Rachel; McIsaac, Michael; Janssen, Ian
2015-01-01
The family plays a central role in the development of health-related behaviors among youth. The objective of this study was to determine whether non-traditional parental structure and shared custody arrangements predict how much time youth spend watching television, using a computer recreationally, and playing video games. Participants were a nationally representative sample of Canadian youth (N = 26,068) in grades 6-10 who participated in the 2009/10 Health Behaviour in School-aged Children Survey. Screen time in youth from single parent and reconstituted families, with or without regular visitation with their non-residential parent, was compared to that of youth from traditional dual-parent families. Multiple imputation was used to account for missing data. After multiple imputation, the relative odds of being in the highest television, computer use, video game, and total screen time quartiles were not different in boys and girls from non-traditional families by comparison to boys and girls from traditional dual-parent families. In conclusion, parental structure and child custody arrangements did not have a meaningful impact on screen time among youth.
Multiple imputation to evaluate the impact of an assay change in national surveys
Sternberg, Maya
2017-01-01
National health surveys, such as the National Health and Nutrition Examination Survey, are used to monitor trends of nutritional biomarkers. These surveys try to maintain the same biomarker assay over time, but there are a variety of reasons why the assay may change. In these cases, it is important to evaluate the potential impact of a change so that any observed fluctuations in concentrations over time are not confounded by changes in the assay. To this end, a subset of stored specimens previously analyzed with the old assay is retested using the new assay. These paired data are used to estimate an adjustment equation, which is then used to ‘adjust’ all the old assay results and convert them into ‘equivalent’ units of the new assay. In this paper, we present a new way of approaching this problem using modern statistical methods designed for missing data. Using simulations, we compare the proposed multiple imputation approach with the adjustment equation approach currently in use. We also compare these approaches using real National Health and Nutrition Examination Survey data for 25-hydroxyvitamin D. PMID:28419523
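The adjustment-equation approach described above as currently in use can be sketched as a regression fitted on the paired specimens and then applied to all old-assay results. The abstract does not state the form of the equation, so the simple linear fit, the function name `fit_adjustment` and the numbers below are assumptions for illustration only.

```python
def fit_adjustment(old, new):
    """Ordinary least squares fit of new = intercept + slope * old on paired specimens."""
    n = len(old)
    mean_old = sum(old) / n
    mean_new = sum(new) / n
    sxy = sum((o - mean_old) * (v - mean_new) for o, v in zip(old, new))
    sxx = sum((o - mean_old) ** 2 for o in old)
    slope = sxy / sxx
    intercept = mean_new - slope * mean_old
    return intercept, slope


# Hypothetical stored specimens measured with both assays
paired_old = [10.0, 20.0, 30.0, 40.0]
paired_new = [12.0, 21.0, 30.0, 39.0]
intercept, slope = fit_adjustment(paired_old, paired_new)

# Convert old-assay results without a retest into "equivalent" new-assay units
old_only = [15.0, 50.0]
adjusted = [intercept + slope * value for value in old_only]
print(intercept, slope, adjusted)
```

A deterministic conversion of this kind treats the adjusted values as if they had been measured, whereas a multiple imputation approach, as proposed in the paper, treats the unobserved new-assay results as missing data.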
Multiple imputation for estimating the risk of developing dementia and its impact on survival.
Yu, Binbing; Saczynski, Jane S; Launer, Lenore
2010-10-01
Dementia, Alzheimer's disease in particular, is one of the major causes of disability and decreased quality of life among the elderly and a leading obstacle to successful aging. Given the profound impact on public health, much research has focused on the age-specific risk of developing dementia and the impact on survival. Early work has discussed various methods of estimating age-specific incidence of dementia, among which the illness-death model is popular for modeling disease progression. In this article we use multiple imputation to fit multi-state models for survival data with interval censoring and left truncation. This approach allows semi-Markov models in which survival after dementia depends on onset age. Such models can be used to estimate the cumulative risk of developing dementia in the presence of the competing risk of dementia-free death. Simulations are carried out to examine the performance of the proposed method. Data from the Honolulu Asia Aging Study are analyzed to estimate the age-specific and cumulative risks of dementia and to examine the effect of major risk factors on dementia onset and death.
Next-generation genotype imputation service and methods.
Das, Sayantan; Forer, Lukas; Schönherr, Sebastian; Sidore, Carlo; Locke, Adam E; Kwong, Alan; Vrieze, Scott I; Chew, Emily Y; Levy, Shawn; McGue, Matt; Schlessinger, David; Stambolian, Dwight; Loh, Po-Ru; Iacono, William G; Swaroop, Anand; Scott, Laura J; Cucca, Francesco; Kronenberg, Florian; Boehnke, Michael; Abecasis, Gonçalo R; Fuchsberger, Christian
2016-10-01
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
Genotype Imputation with Millions of Reference Samples
Browning, Brian L.; Browning, Sharon R.
2016-01-01
We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle’s throughput was more than 100× greater than Impute2’s throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs. PMID:26748515
Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi
2016-06-21
Genotype imputation is an important tool for prediction of unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available and can either employ universal machine learning methods, or deploy algorithms dedicated to inferring missing genotypes. In this research, the performance of eight machine learning methods (Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine, Radial Basis Function, Random Forest, AdaBoost, LogitBoost, and TotalBoost) was compared in terms of the imputation accuracy, computation time and the factors affecting imputation accuracy. The methods were employed using real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The tested methods show that imputation of parent-offspring trios can be accurate. The Random Forest and Support Vector Machine were more accurate than the other machine learning methods. The TotalBoost performed slightly worse than the other methods. The running times were different between methods. The ELM was always the fastest algorithm. As the sample size increases, the RBF requires a long imputation time. The tested methods in this research can be an alternative for imputation of un-typed SNPs in data with a low missing rate. However, it is recommended that other machine learning methods be used for imputation. Copyright © 2016 Elsevier Ltd. All rights reserved.
Scognamiglio, Immacolata; Di Martino, Maria Teresa; Campani, Virginia; Virgilio, Antonella; Galeone, Aldo; Gullà, Annamaria; Gallo Cantafio, Maria Eugenia; Tagliaferri, Pierosandro; Tassone, Pierfrancesco; Caraglia, Michele
2014-01-01
Stable nucleic acid lipid vesicles (SNALPs) encapsulating miR-34a to treat multiple myeloma (MM) were developed. Wild type or completely 2′-O-methylated (OMet) miR-34a was used in this study. Moreover, SNALPs were conjugated with transferrin (Tf) in order to target MM cells overexpressing transferrin receptors (TfRs). The type of miR-34a chemical backbone did not significantly affect the characteristics of SNALPs in terms of mean size, polydispersity index, and zeta potential, while the encapsulation of an OMet miR-34a resulted in a significant increase of miRNA encapsulation into the SNALPs. On the other hand, the chemical conjugation of SNALPs with Tf resulted in a significant decrease of the zeta potential, while size characteristics and miR-34a encapsulation into SNALPs were not significantly affected. In an experimental model of MM, all the animals treated with SNALPs encapsulating miR-34a showed a significant inhibition of the tumor growth. However, the use of SNALPs conjugated with Tf and encapsulating OMet miR-34a resulted in the highest increase of mice survival. These results may represent the proof of concept for the use of SNALPs encapsulating miR-34a for the treatment of MM. PMID:24683542
In Silico Characterization of miRNA and Long Non-Coding RNA Interplay in Multiple Myeloma
Ronchetti, Domenica; Manzoni, Martina; Todoerti, Katia; Neri, Antonino; Agnelli, Luca
2016-01-01
The identification of deregulated microRNAs (miRNAs) and long non-coding RNAs (lncRNAs) in multiple myeloma (MM) has progressively added a further level of complexity to MM biology. In addition, the cross-regulation between lncRNAs and miRNAs has begun to emerge, and theoretical and experimental studies have demonstrated the competing endogenous RNA (ceRNA) activity of lncRNAs as natural miRNA decoys in pathophysiological conditions, including cancer. Currently, information concerning lncRNA and miRNA interplay in MM is virtually absent. Herein, we investigated in silico the lncRNA and miRNA relationship in a representative dataset encompassing 95 MM and 30 plasma cell leukemia patients at diagnosis and four normal controls, whose expression profiles were generated by a custom annotation pipeline to detect specific lncRNAs. We applied target prediction analysis based on the miRanda and RNA22 algorithms to 235 lncRNAs and 459 miRNAs selected for a potential pivotal role in the pathology of MM. Among pairs that showed a significant correlation between lncRNA and miRNA expression levels, we identified 11 lncRNA–miRNA relationships suggestive of a novel ceRNA network with relevance in MM. PMID:27916857
Jeyapalan, Zina; Deng, Zhaoqun; Shatseva, Tatiana; Fang, Ling; He, Chengyan; Yang, Burton B.
2011-01-01
The non-coding 3′-untranslated region (UTR) plays an important role in the regulation of microRNA (miRNA) functions, since it can bind and inactivate multiple miRNAs. Here, we show the 3′-UTR of CD44 is able to antagonize cytoplasmic miRNAs, and result in the increased translation of CD44 and downstream target mRNA, CDC42. A series of cell function assays in the human breast cancer cell line, MT-1, have shown that the CD44 3′-UTR inhibits proliferation, colony formation and tumor growth. Furthermore, it modulated endothelial cell activities, favored angiogenesis, induced tumor cell apoptosis and increased sensitivity to Docetaxel. These results are due to the interaction of the CD44 3′-UTR with multiple miRNAs. Computational algorithms have predicted three miRNAs, miR-216a, miR-330 and miR-608, can bind to both the CD44 and CDC42 3′-UTRs. This was confirmed with luciferase assays, western blotting and immunohistochemical staining and correlated with a series of siRNA assays. Thus, the non-coding CD44 3′-UTR serves as a competitor for miRNA binding and subsequently inactivates miRNA functions, by freeing the target mRNAs from being repressed. PMID:21149267
Genotyping by sequencing for genomic prediction in a soybean breeding population.
Jarquín, Diego; Kocak, Kyle; Posadas, Luis; Hyma, Katie; Jedlicka, Joseph; Graef, George; Lorenz, Aaron
2014-08-29
Advances in genotyping technology, such as genotyping by sequencing (GBS), are making genomic prediction more attractive to reduce breeding cycle times and costs associated with phenotyping. Genomic prediction and selection have been studied in several crop species, but no reports exist in soybean. The objectives of this study were to (i) evaluate prospects for genomic selection using GBS in a typical soybean breeding program and (ii) evaluate the effect of GBS marker selection and imputation on genomic prediction accuracy. To achieve these objectives, a set of soybean lines sampled from the University of Nebraska Soybean Breeding Program were genotyped using GBS and evaluated for yield and other agronomic traits at multiple Nebraska locations. Genotyping by sequencing scored 16,502 single nucleotide polymorphisms (SNPs) with minor-allele frequency (MAF) > 0.05 and percentage of missing values ≤ 5% on 301 elite soybean breeding lines. When SNPs with up to 80% missing values were included, 52,349 SNPs were scored. Prediction accuracy for grain yield, assessed using cross validation, was estimated to be 0.64, indicating good potential for using genomic selection for grain yield in soybean. Filtering SNPs based on missing data percentage had little to no effect on prediction accuracy, especially when random forest imputation was used to impute missing values. The highest accuracies were observed when random forest imputation was used on all SNPs, but differences were not significant. A standard additive G-BLUP model was robust; modeling additive-by-additive epistasis did not provide any improvement in prediction accuracy. The effect of training population size on accuracy began to plateau around 100, but accuracy steadily climbed until the largest possible size was used in this analysis. Including only SNPs with MAF > 0.30 provided higher accuracies when training populations were smaller. 
Using GBS for genomic prediction in soybean holds good potential to expedite genetic gain. Our results suggest that standard additive G-BLUP models can be used on unfiltered, imputed GBS data without loss in accuracy.
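The SNP filtering thresholds quoted in the abstract above (MAF > 0.05, missing values ≤ 5%, or up to 80% in the relaxed set) can be illustrated with a short sketch. This is a sketch under assumptions, not the authors' pipeline: genotypes are assumed to be coded as 0, 1, or 2 copies of the alternate allele with `None` for a missing call, MAF is computed from non-missing calls only, and the function names are hypothetical.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 alternate-allele copies; None = missing call


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of lines with no genotype call at this SNP."""
    return sum(g is None for g in calls) / len(calls)


def minor_allele_frequency(calls: Sequence[Genotype]) -> float:
    """MAF from the non-missing calls only (diploid 0/1/2 coding assumed)."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def keep_snp(calls: Sequence[Genotype],
             min_maf: float = 0.05,
             max_missing: float = 0.05) -> bool:
    """Apply the two thresholds: MAF strictly above min_maf, missingness at most max_missing."""
    return (missing_rate(calls) <= max_missing
            and minor_allele_frequency(calls) > min_maf)


# Made-up calls for one SNP across four lines: half of the calls are missing
sparse_snp = [0, 2, None, None]
strict = keep_snp(sparse_snp)                     # 50% missing exceeds the 5% limit
relaxed = keep_snp(sparse_snp, max_missing=0.80)  # passes the relaxed 80% limit
```

Raising `max_missing` to 0.80 corresponds to the larger marker set described in the abstract, where the retained missing calls would then be imputed.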
Impaired neurosteroid synthesis in multiple sclerosis
Noorbakhsh, Farshid; Ellestad, Kristofor K.; Maingat, Ferdinand; Warren, Kenneth G.; Han, May H.; Steinman, Lawrence; Baker, Glen B.
2011-01-01
High-throughput technologies have led to advances in the recognition of disease pathways and their underlying mechanisms. To investigate the impact of micro-RNAs on the disease process in multiple sclerosis, a prototypic inflammatory neurological disorder, we examined cerebral white matter from patients with or without the disease by micro-RNA profiling, together with confirmatory reverse transcription–polymerase chain reaction analysis, immunoblotting and gas chromatography-mass spectrometry. These observations were verified using the in vivo multiple sclerosis model, experimental autoimmune encephalomyelitis. Brains of patients with or without multiple sclerosis demonstrated differential expression of multiple micro-RNAs, but expression of three neurosteroid synthesis enzyme-specific micro-RNAs (miR-338, miR-155 and miR-491) showed a bias towards induction in patients with multiple sclerosis (P < 0.05). Analysis of the neurosteroidogenic pathways targeted by micro-RNAs revealed suppression of enzyme transcript and protein levels in the white matter of patients with multiple sclerosis (P < 0.05). This was confirmed by firefly/Renilla luciferase micro-RNA target knockdown experiments (P < 0.05) and detection of specific micro-RNAs by in situ hybridization in the brains of patients with or without multiple sclerosis. Levels of important neurosteroids, including allopregnanolone, were suppressed in the white matter of patients with multiple sclerosis (P < 0.05). Induction of the murine micro-RNAs, miR-338 and miR-155, accompanied by diminished expression of neurosteroidogenic enzymes and allopregnanolone, was also observed in the brains of mice with experimental autoimmune encephalomyelitis (P < 0.05). Allopregnanolone treatment of the experimental autoimmune encephalomyelitis mouse model limited the associated neuropathology, including neuroinflammation, myelin and axonal injury and reduced neurobehavioral deficits (P < 0.05). 
These multi-platform studies point to impaired neurosteroidogenesis in both multiple sclerosis and experimental autoimmune encephalomyelitis. The findings also indicate that allopregnanolone and perhaps other neurosteroid-like compounds might represent potential biomarkers or therapies for multiple sclerosis. PMID:21908875
NASA Astrophysics Data System (ADS)
Kobayashi, Y.; Miyata, A.; Nagai, H.; Mano, M.; Yamamoto, S.
2005-12-01
In the last decade, numerous long-term eddy flux measurements have been conducted worldwide to assess annual and seasonal energy, water and carbon exchanges between terrestrial ecosystems and the atmosphere. The FLUXNET communities now appear to be entering a next phase with two objectives: integration of flux data observed at various ecosystems, and inter-site comparative studies. For example, a large research project, "S-1", is ongoing in Japan and other parts of eastern Asia to set up terrestrial carbon management for Asia in the 21st century. One of the highlights of the S-1 project is to provide a carbon budget map of all of Asia based on integrated and inter-compared eddy flux data collected at the 15 S-1 member sites. FLUXNET communities, including the S-1 project, have recognized that integration and inter-comparison of eddy flux data are key to understanding energy, water and carbon budgets at the regional scale. However, these issues are difficult to settle because each flux site applies its own data processing and gap-filling methods, with site-specific classifications and threshold values. To enable appropriate integrative and inter-comparative analysis of eddy flux data, we clarified how differences in data processing methods affect the obtained flux values and searched for a suitable common gap-filling methodology. The effect of differences in data processing methods on the obtained flux data was assessed in a comparative experiment within the S-1 project. For this experiment we prepared one-month common test data sets, consisting of 10 Hz eddy covariance raw data and related half-hourly meteorological data obtained at a larch forest site and a paddy site. The 15 S-1 member sites processed the test data using their own processing methods. 
The results indicated that the combined influences of coordinate rotation, detrending and frequency response correction produced flux discrepancies of up to 10%, and that the forest sites were more sensitive to differences in the data processing methods than the non-forest sites. Multiple imputation (MI), a statistical technique for analyzing incomplete multivariate data sets, is likely to be an easy-to-use and objective gap-filling method for missing eddy flux data. We also assessed the validity of applying MI to fill missing flux data by comparing a gap-filled eddy flux data set obtained by MI with those obtained by the nonlinear regression method and the look-up table method. The comparison showed that, with suitable separation of the periods to be filled and proper selection of reference variables, MI has the potential to be applied as a common method for gap-filling missing flux data, and that MI can be a useful tool for FLUXNET communities in making inter-site comparisons of long-term flux data.
Markkula, Niina; Härkänen, Tommi; Nieminen, Tarja; Peña, Sebastián; Mattila, Aino K; Koskinen, Seppo; Saarni, Samuli I; Suvisaari, Jaana
2016-01-15
Depressive disorders are among the most pressing public health challenges worldwide. Yet, not enough is known about their long-term outcomes. This study examines the course and predictors of different outcomes of depressive disorders in an eleven-year follow-up of a general population sample. In a nationally representative sample of Finns aged 30 and over (BRIF8901), major depressive disorder (MDD) and dysthymia were diagnosed with the Composite International Diagnostic Interview (M-CIDI) in 2000. The participants were followed up in 2011 (n=5733). Outcome measures were diagnostic status, mortality, depressive symptoms and health-related quality of life. Multiple imputation (MI) was used to account for nonresponse. At follow-up, 33.8% of persons with baseline MDD and 42.6% with baseline dysthymia received a diagnosis of depressive, anxiety or alcohol use disorder. Baseline severity of disorder, measured by the Beck Depression Inventory, predicted both persistence of depressive disorder and increased mortality risk. In addition, being never-married, separated or widowed predicted persistence of depressive disorders, whereas somatic and psychiatric comorbidity, childhood adversities and lower social capital did not. Those who received no psychiatric diagnosis at follow-up still had residual symptoms and lower quality of life. We only had one follow-up point at eleven years, and did not collect information on the subjects' health during the follow-up period. Depressive disorders in the general population are associated with multiple negative outcomes. Severity of index episode is the strongest predictor of negative outcomes. More emphasis should be placed on addressing the long-term consequences of depression. Copyright © 2015 Elsevier B.V. All rights reserved.
Fu, Yong-Bi
2014-01-01
Genotyping by sequencing (GBS) has recently emerged as a promising genomic approach for assessing genetic diversity on a genome-wide scale. However, concerns remain about the uniquely large imbalance in GBS genotype data. Although genotype imputation has been proposed to infer missing observations, little is known about the reliability of a genetic diversity analysis of GBS data with up to 90% of observations missing. Here we performed an empirical assessment of accuracy in genetic diversity analysis of highly incomplete single nucleotide polymorphism genotypes with imputations. Three large single-nucleotide polymorphism genotype data sets for corn, wheat, and rice were acquired, and missing data with up to 90% of missing observations were randomly generated and then imputed for missing genotypes with three map-independent imputation methods. Estimating heterozygosity and inbreeding coefficient from original, missing, and imputed data revealed variable patterns of bias from assessed levels of missingness and genotype imputation, but the estimation biases were smaller for missing data without genotype imputation. The estimates of genetic differentiation were rather robust up to 90% of missing observations but became substantially biased when missing genotypes were imputed. The estimates of topology accuracy for four representative samples of groups of interest generally were reduced with increased levels of missing genotypes. Probabilistic principal component analysis based imputation performed better in terms of topology accuracy than analyses of missing data without genotype imputation. These findings are not only significant for understanding the reliability of the genetic diversity analysis with respect to large missing data and genotype imputation but also are instructive for performing a proper genetic diversity analysis of highly incomplete GBS or other genotype data. PMID:24626289
Johnson, Eric O; Hancock, Dana B; Levy, Joshua L; Gaddis, Nathan C; Saccone, Nancy L; Bierut, Laura J; Page, Grier P
2013-05-01
A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2 % of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30 % of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.
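The two input strategies compared in the abstract above (imputing from the union versus the intersection of genotyped SNPs) and the quoted false-positive arithmetic (0.2% of SNPs, about 2,000 per million) can be made concrete with a small sketch. The array names and SNP identifiers below are made up for illustration, and the helper names are hypothetical.

```python
from typing import Dict, Set, Tuple


def imputation_inputs(arrays: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (union, intersection) of the SNP sets genotyped on each array.

    Union: SNPs present on one or more arrays.
    Intersection: SNPs present on all arrays.
    Assumes at least one array is supplied.
    """
    snp_sets = list(arrays.values())
    union = set().union(*snp_sets)
    intersection = set.intersection(*snp_sets)
    return union, intersection


def expected_false_positives(rate_percent: float, n_imputed: int) -> int:
    """Convert a spurious-association rate in percent into a count of SNPs."""
    return round(rate_percent / 100 * n_imputed)


# Hypothetical arrays with made-up SNP identifiers
arrays = {
    "array_a": {"rs1", "rs2", "rs3"},
    "array_b": {"rs2", "rs3", "rs4"},
}
union, intersection = imputation_inputs(arrays)
false_positives = expected_false_positives(0.2, 1_000_000)
```

Restricting the imputation input to `intersection` gives every control sample the same set of typed SNPs, which is the property the abstract associates with the absence of differential imputation error.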
MicroRNA-182 drives metastasis of primary sarcomas by targeting multiple genes
Sachdeva, Mohit; Mito, Jeffrey K.; Lee, Chang-Lung; Zhang, Minsi; Li, Zhizhong; Dodd, Rebecca D.; Cason, David; Luo, Lixia; Ma, Yan; Van Mater, David; Gladdy, Rebecca; Lev, Dina C.; Cardona, Diana M.; Kirsch, David G.
2014-01-01
Metastasis causes most cancer deaths, but is incompletely understood. MicroRNAs can regulate metastasis, but it is not known whether a single miRNA can regulate metastasis in primary cancer models in vivo. We compared the expression of miRNAs in metastatic and nonmetastatic primary mouse sarcomas and found that microRNA-182 (miR-182) was markedly overexpressed in some tumors that metastasized to the lungs. By utilizing genetically engineered mice with either deletion of or overexpression of miR-182 in primary sarcomas, we discovered that deletion of miR-182 substantially decreased, while overexpression of miR-182 considerably increased, the rate of lung metastasis after amputation of the tumor-bearing limb. Additionally, deletion of miR-182 decreased circulating tumor cells (CTCs), while overexpression of miR-182 increased CTCs, suggesting that miR-182 regulates intravasation of cancer cells into the circulation. We identified 4 miR-182 targets that inhibit either the migration of tumor cells or the degradation of the extracellular matrix. Notably, restoration of any of these targets in isolation did not alter the metastatic potential of sarcoma cells injected orthotopically, but the simultaneous restoration of all 4 targets together substantially decreased the number of metastases. These results demonstrate that a single miRNA can regulate metastasis of primary tumors in vivo by coordinated regulation of multiple genes. PMID:25180607
PmiRExAt: plant miRNA expression atlas database and web applications
Gurjar, Anoop Kishor Singh; Panwar, Abhijeet Singh; Gupta, Rajinder; Mantri, Shrikant S.
2016-01-01
High-throughput small RNA (sRNA) sequencing technology enables an entirely new perspective for plant microRNA (miRNA) research and has immense potential to unravel regulatory networks. Novel insights gained through data mining in the rich, publicly available resource of sRNA data will help in designing biotechnology-based approaches for crop improvement to enhance plant yield and nutritional value. Bioinformatics resources enabling meta-analysis of miRNA expression across multiple plant species are still evolving. Here, we report PmiRExAt, a new online database resource that provides a plant miRNA expression atlas. The web-based repository comprises miRNA expression profiles and a query tool for 1859 wheat, 2330 rice and 283 maize miRNAs. The database interface offers open and easy access to miRNA expression profiles and helps in identifying tissue-preferential, differentially and constitutively expressed miRNAs. A feature enabling expression study of conserved miRNAs across multiple species is also implemented. A custom expression analysis feature enables expression analysis of novel miRNAs in a total of 117 datasets. New sRNA datasets can also be uploaded for analysing miRNA expression profiles for 73 plant species. The PmiRExAt application program interface, a simple object access protocol web service, allows other programmers to remotely invoke the methods written for programmatic search operations on the PmiRExAt database. Database URL: http://pmirexat.nabi.res.in. PMID:27081157
Dipnall, Joanna F.
2016-01-01
Background: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. Methods: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. Results: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers were selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin with Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). 
Conclusion: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin. PMID:26848571
The impact of missing trauma data on predicting massive transfusion
Trickey, Amber W.; Fox, Erin E.; del Junco, Deborah J.; Ning, Jing; Holcomb, John B.; Brasel, Karen J.; Cohen, Mitchell J.; Schreiber, Martin A.; Bulger, Eileen M.; Phelan, Herb A.; Alarcon, Louis H.; Myers, John G.; Muskat, Peter; Cotton, Bryan A.; Wade, Charles E.; Rahbar, Mohammad H.
2013-01-01
INTRODUCTION: Missing data are inherent in clinical research and may be especially problematic for trauma studies. This study describes a sensitivity analysis to evaluate the impact of missing data on clinical risk prediction algorithms. Three blood transfusion prediction models were evaluated utilizing an observational trauma dataset with valid missing data. METHODS: The PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study included patients requiring ≥ 1 unit of red blood cells (RBC) at 10 participating U.S. Level I trauma centers from July 2009 – October 2010. Physiologic, laboratory, and treatment data were collected prospectively up to 24h after hospital admission. Subjects who received ≥ 10 RBC units within 24h of admission were classified as massive transfusion (MT) patients. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation. A sensitivity analysis for missing data was conducted to determine the upper and lower bounds for correct classification percentages. RESULTS: PROMMTT enrolled 1,245 subjects. MT was received by 297 patients (24%). Missing percentage ranged from 2.2% (heart rate) to 45% (respiratory rate). Proportions of complete cases utilized in the MT prediction models ranged from 41% to 88%. All models demonstrated similar correct classification percentages using complete case analysis and multiple imputation. In the sensitivity analysis, correct classification upper-lower bound ranges per model were 4%, 10%, and 12%. Predictive accuracy for all models using PROMMTT data was lower than reported in the original datasets. CONCLUSIONS: Evaluating the accuracy of clinical prediction models with missing data can be misleading, especially with many predictor variables and moderate levels of missingness per variable. The proposed sensitivity analysis describes the influence of missing data on risk prediction algorithms. 
Reporting upper/lower bounds for percent correct classification may be more informative than multiple imputation, which provided similar results to complete case analysis in this study. PMID:23778514
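One simple way to picture upper and lower bounds on percent correct classification is to treat every subject whose prediction cannot be computed as either always correct or always incorrect. The sketch below does exactly that; it is an assumption made for illustration and not necessarily the procedure used in the study, and the function name and example counts are hypothetical.

```python
from typing import Tuple


def classification_bounds(n_correct_complete: int,
                          n_complete: int,
                          n_missing: int) -> Tuple[float, float]:
    """Worst-case and best-case proportion correctly classified.

    n_correct_complete: complete cases the model classified correctly
    n_complete:         subjects with all predictors observed
    n_missing:          subjects with at least one predictor missing

    Lower bound: every subject with missing data is counted as misclassified.
    Upper bound: every subject with missing data is counted as correct.
    """
    total = n_complete + n_missing
    lower = n_correct_complete / total
    upper = (n_correct_complete + n_missing) / total
    return lower, upper


# Made-up counts: 80 complete cases (70 classified correctly), 20 with missing predictors
lower, upper = classification_bounds(70, 80, 20)
bound_range = upper - lower
```

Under this construction the width of the interval equals the proportion of subjects with missing data, so it grows as more predictors with missing values enter a model.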
Pelé, Fabienne; Bajeux, Emma; Gendron, Hélène; Monfort, Christine; Rouget, Florence; Multigner, Luc; Viel, Jean-François; Cordier, Sylvaine
2013-12-02
Environmental exposures, including dietary contaminants, may influence the developing immune system. This study assesses the association between maternal pre-parturition consumption of seafood and wheeze, eczema, and food allergy in preschool children. Fish and shellfish were studied separately as they differ according to their levels of omega-3 polyunsaturated fatty acids (which have anti-allergic properties) and their levels of contaminants. The PELAGIE cohort included 3421 women recruited at the beginning of pregnancy. Maternal fish and shellfish intake was measured at inclusion by a food frequency questionnaire. Wheeze, eczema, and food allergy were evaluated by a questionnaire completed by the mother when the child was 2 years old (n = 1500). Examination of the associations between seafood intake and outcomes took major confounders into account. Complementary sensitivity analyses with multiple imputation enabled us to handle missing data, due mostly to attrition. Moderate maternal pre-parturition fish intake (1 to 4 times a month) was, at borderline significance, associated with a lower risk of wheeze (adjusted OR = 0.69 (0.45-1.05)) before age 2, compared with low intake (< once/month). This result was not, however, consistent: after multiple imputation, the adjusted OR was 0.86 (0.63-1.17). Shellfish intake at least once a month was associated with a higher risk of food allergy before age 2 (adjusted OR = 1.62 (1.11-2.37)) compared to low or no intake (< once/month). Multiple imputation confirmed this association (adjusted OR = 1.52 (1.05-2.21)). This study suggests that maternal pre-parturition shellfish consumption may increase the risk of food allergy. 
Further large-scale epidemiologic studies are needed to corroborate these results, identify the contaminants or components of shellfish responsible for the effects observed, determine the persistence of the associations seen at age 2, and investigate potential associations with health effects observable at later ages, such as allergic asthma.
Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny
2016-01-01
Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers were selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin with Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). 
The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.
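When an analysis is repeated across several imputed data sets, as with the 20 imputations mentioned above, the per-imputation estimates are commonly combined with Rubin's rules. The sketch below pools log odds ratios this way. It is a generic illustration, not the authors' code: the input values are made up, the function name is hypothetical, and survey weighting is not modeled.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float]:
    """Pool m per-imputation estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/m) * between. Requires m >= 2.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up log odds ratios and squared standard errors from three imputed data sets
log_ors = [0.1, 0.2, 0.3]
squared_ses = [0.01, 0.01, 0.01]
pooled_log_or, total_var = pool_rubin(log_ors, squared_ses)
pooled_or = math.exp(pooled_log_or)  # back-transform to the odds ratio scale
```

Pooling is done on the log scale because the sampling distribution of a log odds ratio is closer to normal; the odds ratio is recovered by exponentiating at the end.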
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lu, Yinghao; Department of Hematology, Affiliated Hospital of Guizhou Medical University, The Hematopoietic Stem Cell Transplant Center of Guizhou Province, Blood Diseases Diagnosis and Treatment Center of Guizhou Province, Guiyang, 550004, Guizhou Province; Wu, Depei, E-mail: wudepei@medmail.com.cn
2016-05-13
Aberrant expression of microRNAs (miRNAs) is implicated in cancer development and progression. While miR-320a is reported to be deregulated in many malignancy types, its biological role in multiple myeloma (MM) remains unclear. Here, we observed reduced expression of miR-320a in MM samples and cell lines. Ectopic expression of miR-320a dramatically suppressed cell viability and clonogenicity and induced apoptosis in vitro. Mechanistic investigation led to the identification of Pre-B-cell leukemia transcription factor 3 (PBX3) as a novel and direct downstream target of miR-320a. Interestingly, reintroduction of PBX3 abrogated miR-320a-induced MM cell growth inhibition and apoptosis. In a mouse xenograft model, miR-320a overexpression inhibited tumorigenicity and promoted apoptosis. Our findings collectively indicate that miR-320a inhibits cell proliferation and induces apoptosis in MM cells by directly targeting PBX3, supporting its utility as a novel and potential therapeutic agent for miRNA-based MM therapy. -- Highlights: •Expression of miR-320a in MM cell induces apoptosis in vitro. •miR-320a represses PBX3 via targeting specific sequences in the 3′UTR region. •Exogenous expression of PBX3 reverses the effects of miR-320a in inhibiting MM cell growth and promoting apoptosis. •Overexpression of miR-320a inhibits tumor growth and increases apoptosis in vivo.
Hara, Toshifumi; Jones, Matthew F.; Subramanian, Murugan; Li, Xiao Ling; Ou, Oliver; Zhu, Yuelin; Yang, Yuan; Wakefield, Lalage M.; Hussain, S. Perwez; Gaedcke, Jochen; Ried, Thomas; Luo, Ji; Caplen, Natasha J.; Lal, Ashish
2014-01-01
MicroRNAs (miRNAs) regulate the expression of hundreds of genes. However, identifying the critical targets within a miRNA-regulated gene network is challenging. One approach is to identify miRNAs that exert a context-dependent effect, followed by expression profiling to determine how specific targets contribute to this selective effect. In this study, we performed miRNA mimic screens in isogenic KRAS-Wild-type (WT) and KRAS-Mutant colorectal cancer (CRC) cell lines to identify miRNAs selectively targeting KRAS-Mutant cells. One of the miRNAs we identified as a selective inhibitor of the survival of multiple KRAS-Mutant CRC lines was miR-126. In KRAS-Mutant cells, miR-126 over-expression increased the G1 compartment, inhibited clonogenicity and tumorigenicity, while exerting no effect on KRAS-WT cells. Unexpectedly, the miR-126-regulated transcriptome of KRAS-WT and KRAS-Mutant cells showed no significant differences. However, by analyzing the overlap between miR-126 targets with the synthetic lethal genes identified by RNAi in KRAS-Mutant cells, we identified and validated a subset of miR-126-regulated genes selectively required for the survival and clonogenicity of KRAS-Mutant cells. Our strategy therefore identified critical target genes within the miR-126-regulated gene network. We propose that the selective effect of miR-126 on KRAS-Mutant cells could be utilized for the development of targeted therapy for KRAS mutant tumors. PMID:25245095
Gottlieb, Alice B; Blauvelt, Andrew; Prinz, Jörg C; Papanastasiou, Philemon; Pathan, Rashidkhan; Nyirady, Judit; Fox, Todd; Papavassilis, Charis
2016-10-01
Secukinumab, a human monoclonal antibody that selectively targets interleukin-17A, is highly efficacious in the treatment of moderate-to-severe psoriasis, starting at early time points, with a sustained effect and a favorable safety profile. Patients with moderate-to-severe plaque psoriasis were randomized to secukinumab 300 mg, secukinumab 150 mg, or placebo self-administered by prefilled syringe at baseline, weeks 1, 2, and 3, and then every four weeks from week 4 to 48. Efficacy responses (≥ 75/90/100% improvement in Psoriasis Area and Severity Index [PASI 75/90/100] and clear/almost clear skin by Investigator's Global Assessment 2011 modified version [IGA mod 2011 0/1]) were measured to week 52. Patient-reported usability of the prefilled syringe was evaluated by the Self-Injection Assessment Questionnaire to week 48. The efficacy of secukinumab increased to week 16 and was maintained to week 52. With secukinumab 300 mg at week 52, PASI 75/90/100 and IGA mod 2011 0/1 responses were achieved by 83.5%/68.0%/47.5% and 71.5% of patients when analyzed by multiple imputation, respectively, and by 75.9%/62.1%/43.1% and 63.8% of patients when analyzed by nonresponder imputation, respectively. With secukinumab 150 mg at week 52, PASI 75/90/100 and IGA mod 2011 0/1 responses were achieved by 63.5%/50.3%/31.1% and 43.6% of patients when analyzed by multiple imputation, respectively, and by 61.0%/49.2%/30.5% and 42.4% of patients when analyzed by nonresponder imputation, respectively. Self-reported acceptability of the prefilled syringe was high throughout the study. The incidence of adverse events (AE) was well balanced between groups, with AEs reported in 74.4% of patients receiving secukinumab 300 mg and 77.3% of patients receiving secukinumab 150 mg. Nasopharyngitis was the most common AE across both secukinumab groups. Self-administration of secukinumab by prefilled syringe was associated with robust and sustained efficacy and a favorable safety profile up to week 52.
J Drugs Dermatol. 2016;15(10):1226-1234.
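The abstract above reports each response rate twice, once under multiple imputation and once under nonresponder imputation. As a minimal sketch of why nonresponder imputation gives the lower figure, the Python below counts every missing week-52 assessment as a nonresponse and compares that with the rate among observed patients only. The function names and the toy data are hypothetical and are not taken from the trial; multiple imputation itself is not implemented here.

```python
def nonresponder_rate(responses):
    """Response rate with missing assessments (None) counted as nonresponders."""
    if not responses:
        raise ValueError("need at least one patient")
    # Only explicit True values count as responders; None and False do not.
    return sum(1 for r in responses if r is True) / len(responses)


def observed_rate(responses):
    """Response rate among patients whose assessment was actually observed."""
    observed = [r for r in responses if r is not None]
    if not observed:
        raise ValueError("no observed assessments")
    return sum(1 for r in observed if r) / len(observed)


# Hypothetical week-52 PASI 75 outcomes: True = responder, None = missing.
week52 = [True, True, None, False, None]
nri = nonresponder_rate(week52)
obs = observed_rate(week52)
```

Because missing patients enter the denominator but never the numerator, the nonresponder estimate can never exceed the observed-case estimate, which is why it is usually read as the conservative figure.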
Jia, Erik; Chen, Tianlu
2018-01-01
Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered as missing not at random (MNAR). Improper data processing procedures for missing values will adversely affect subsequent statistical analyses. However, few imputation methods have been developed and applied to the situation of MNAR in the field of metabolomics. Thus, a practical left-censored missing value imputation method is urgently needed. We developed an iterative Gibbs sampler-based left-censored missing value imputation approach (GSimp). We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulation dataset using our imputation evaluation pipeline. The results show that GSimp outperforms the other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for dealing with large-scale metabolomics datasets. The R code for GSimp, the evaluation pipeline, a tutorial, and the real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp. PMID:29385130
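To illustrate what "left-censored" imputation means in practice, the sketch below fills values that fell under a detection limit with draws from a normal distribution truncated above at that limit. This is a deliberately simplified, single-variable stand-in and not the GSimp algorithm: the function name, the `lod` parameter, and the choice to estimate the mean and standard deviation from the observed values alone (which ignores the censoring and is therefore biased upward) are all assumptions made for the example.

```python
import random
from statistics import NormalDist, mean, stdev


def impute_left_censored(values, lod, seed=0):
    """Replace None entries with draws from a normal truncated above at `lod`.

    `values` holds observed measurements, with None where the measurement
    fell below the limit of detection `lod`.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    # Crude parameter estimates from observed data only (assumption).
    dist = NormalDist(mean(observed), stdev(observed))
    p_lod = dist.cdf(lod)  # probability mass below the detection limit
    imputed = []
    for v in values:
        if v is None:
            # Inverse-CDF sampling restricted to the lower tail (0, p_lod].
            u = p_lod * (1.0 - rng.random())
            imputed.append(dist.inv_cdf(u))
        else:
            imputed.append(v)
    return imputed


# Hypothetical metabolite concentrations with a detection limit of 2.0.
raw = [None, 2.1, 3.0, None, 2.6, 4.2]
filled = impute_left_censored(raw, lod=2.0)
```

Every imputed value lands at or below the detection limit, which is the property that separates left-censored imputation from generic methods such as mean or k-nearest-neighbour imputation.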
Kumari, Bharti; Jain, Pratistha; Das, Shaoli; Ghosal, Suman; Hazra, Bibhabasu; Trivedi, Ashish Chandra; Basu, Anirban; Chakrabarti, Jayprokas; Vrati, Sudhanshu; Banerjee, Arup
2016-01-01
Microglial cells in the brain play an essential role during Japanese Encephalitis Virus (JEV) infection, which may lead to changes in microRNA (miRNA) and mRNA profiles. These changes may together control disease outcome. Using the Affymetrix microarray platform, we profiled cellular miRNA and mRNA expression at multiple time points during viral infection in human microglial (CHME3) cells. In silico analysis of the microarray data revealed a phased pattern of miRNA expression associated with JEV replication and provided unique signatures of infection. Target prediction and pathway enrichment analysis identified anticorrelation between differentially expressed miRNAs and gene expression at multiple time points, which ultimately affected diverse signaling pathways, including Notch signaling pathways, in microglia. Activation of the Notch pathway during JEV infection was demonstrated in vitro and in vivo. The expression of a subset of miRNAs that target multiple genes in Notch signaling pathways was suppressed, and their overexpression could affect the JEV-induced immune response. Further analysis provided evidence for the possible presence of cellular competing endogenous RNA (ceRNA) associated with the innate immune response. Collectively, our data provide a uniquely comprehensive view of the changes in host miRNAs induced by JEV during cellular infection and implicate the Notch pathway in modulating microglia-mediated inflammation. PMID:26838068
Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
Chan, Ariel W; Hamblin, Martha T; Jannink, Jean-Luc
2016-01-01
Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason.
We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition. PMID:27537694
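The accuracy estimate described above (mask observed genotypes, impute, correlate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `masked_accuracy`, `site_mean_impute`, and the toy dosage matrix are hypothetical names and data, the site-mean imputer is only a placeholder for a real method such as Beagle or glmnet, and the correlation is pooled over all masked cells rather than computed per site or per individual as in the study.

```python
import random
from statistics import mean


def pearson(x, y):
    """Sample Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def site_mean_impute(matrix):
    """Placeholder imputer: fill None with the mean dosage of that site (column)."""
    n_sites = len(matrix[0])
    fills = []
    for j in range(n_sites):
        col = [row[j] for row in matrix if row[j] is not None]
        fills.append(mean(col) if col else 0.0)
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def masked_accuracy(dosages, impute, mask_fraction=0.2, seed=0):
    """Mask a random subset of observed dosages, impute, and correlate."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(dosages)
             for j, v in enumerate(row) if v is not None]
    masked = rng.sample(cells, max(2, int(len(cells) * mask_fraction)))
    working = [list(row) for row in dosages]  # copy so the input is untouched
    for i, j in masked:
        working[i][j] = None
    completed = impute(working)
    truth = [dosages[i][j] for i, j in masked]
    guess = [completed[i][j] for i, j in masked]
    return pearson(truth, guess)


# Toy matrix: 6 individuals (rows) by 5 sites (columns), dosages in {0, 1, 2}.
toy = [[(i + j) % 3 for j in range(5)] for i in range(6)]
r_baseline = masked_accuracy(toy, site_mean_impute, mask_fraction=0.5)
r_oracle = masked_accuracy(toy, lambda m: toy, mask_fraction=0.5)
```

Passing the imputer as a callable keeps the masking and scoring logic independent of the method under test, so two methods can be compared on exactly the same masked cells by reusing the seed.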
VIGAN: Missing View Imputation with Generative Adversarial Networks.
Shang, Chao; Palmer, Aaron; Sun, Jiangwen; Chen, Ko-Shin; Lu, Jin; Bi, Jinbo
2017-01-01
In an era when big data are becoming the norm, there is less concern with the quantity but more with the quality and completeness of the data. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. In particular, when certain samples lack an entire view of data, the missing view problem arises. Classic multiple imputation or matrix completion methods are hardly effective here, because the specific view offers no information on which to base imputation for such samples. The commonly used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. Then, by optimizing the GAN and DAE jointly, our model enables the knowledge integration for domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparing against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further demonstrates the effectiveness and usability of this approach in life science.
Yeesoonsang, Seesai; Bilheem, Surichai; McNeil, Edward; Iamsirithaworn, Sophon; Jiraphongsa, Chuleeporn; Sriplung, Hutcha
2017-01-01
Histological specimens are not required for diagnosis of liver and bile duct (LBD) cancer, resulting in a high percentage of unknown histologies. We compared estimates of hepatocellular carcinoma (HCC) and cholangiocarcinoma (CCA) incidences by imputing these unknown histologies. A retrospective study was conducted using data from the Songkhla Cancer Registry, southern Thailand, from 1989 to 2013. Multivariate imputation by chained equations (mice) was used in re-classification of the unknown histologies. Age-standardized rates (ASR) of HCC and CCA by sex were calculated and the trends were compared. Of 2,387 LBD cases, 61% had unknown histology. After imputation, the ASR of HCC in males during 1989 to 2007 increased from 4 to 10 per 100,000 and then decreased after 2007. The ASR of CCA increased from 2 to 5.5 per 100,000, and the ASR of HCC in females decreased from 1.5 in 2009 to 1.3 in 2013 and that of CCA increased from less than 1 to 1.9 per 100,000 by 2013. Results of complete case analysis showed somewhat similar, although less dramatic, trends. In Songkhla, the incidence of CCA appears to be stable after increasing for 20 years, whereas the incidence of HCC is now declining. The decline in incidence of HCC among males since 2007 is probably due to implementation of the hepatitis B virus vaccine in the 1990s. The rise in incidence of CCA is a concern and highlights the need for case-control studies to elucidate the risk factors.
Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets.
Huang, Min-Wei; Lin, Wei-Chao; Tsai, Chih-Fong
2018-01-01
Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem by providing estimates for the missing values through a reasoning process based on the (complete) observed data. However, if the observed data contain some noisy information or outliers, the estimates of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation over the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results over the mixed data type of medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.
Viral MicroRNAs Repress the Cholesterol Pathway, and 25-Hydroxycholesterol Inhibits Infection.
Serquiña, Anna K P; Kambach, Diane M; Sarker, Ontara; Ziegelbauer, Joseph M
2017-07-11
From various screens, we found that Kaposi's sarcoma-associated herpesvirus (KSHV) viral microRNAs (miRNAs) target several enzymes in the mevalonate/cholesterol pathway. 3-Hydroxy-3-methylglutaryl-coenzyme A (CoA) synthase 1 (HMGCS1), 3-hydroxy-3-methylglutaryl-CoA reductase (HMGCR [a rate-limiting step in the mevalonate pathway]), and farnesyl-diphosphate farnesyltransferase 1 (FDFT1 [a committed step in the cholesterol branch]) are repressed by multiple KSHV miRNAs. Transfection of viral miRNA mimics in primary endothelial cells (human umbilical vein endothelial cells [HUVECs]) is sufficient to reduce intracellular cholesterol levels; however, small interfering RNAs (siRNAs) targeting only HMGCS1 did not reduce cholesterol levels. This suggests that multiple targets are needed to perturb this tightly regulated pathway. We also report here that cholesterol levels were decreased in de novo-infected HUVECs after 7 days. This reduction is at least partially due to viral miRNAs, since the mutant form of KSHV lacking 10 of the 12 miRNA genes had increased cholesterol compared to wild-type infections. We hypothesized that KSHV is downregulating cholesterol to suppress the antiviral response by a modified form of cholesterol, 25-hydroxycholesterol (25HC). We found that the cholesterol 25-hydroxylase (CH25H) gene, which is responsible for generating 25HC, had increased expression in de novo-infected HUVECs but was strongly suppressed in long-term latently infected cell lines. We found that 25HC inhibits KSHV infection when added exogenously prior to de novo infection. In conclusion, we found that multiple KSHV viral miRNAs target enzymes in the mevalonate pathway to modulate cholesterol in infected cells during latency. This repression of cholesterol levels could potentially be beneficial to viral infection by decreasing the levels of 25HC.
IMPORTANCE A subset of viruses express unique microRNAs (miRNAs), which act like cellular miRNAs to generally repress host gene expression. A cancer virus, Kaposi's sarcoma-associated herpesvirus (KSHV, or human herpesvirus 8 [HHV-8]), encodes multiple miRNAs that repress gene expression of multiple enzymes that are important for cholesterol synthesis. In cells with these viral miRNAs or with natural infection, cholesterol levels are reduced, indicating these viral miRNAs decrease cholesterol levels. A modified form of cholesterol, 25-hydroxycholesterol, is generated directly from cholesterol. Addition of 25-hydroxycholesterol to primary cells inhibited KSHV infection of cells, suggesting that viral miRNAs may decrease cholesterol levels to decrease the concentration of 25-hydroxycholesterol and to promote infection. These results suggest a new virus-host relationship and indicate a previously unidentified viral strategy to lower cholesterol levels. Copyright © 2017 Serquiña et al.
Genotype imputation efficiency in Nelore Cattle
USDA-ARS's Scientific Manuscript database
Genotype imputation efficiency in Nelore cattle was evaluated in different scenarios of lower density (LD) chips, imputation methods and sets of animals to have their genotypes imputed. Twelve commercial and virtual custom LD chips with densities varying from 7K to 75K SNPs were tested. Customized L...
Chen, Xing; Niu, Ya-Wei; Wang, Guang-Hui; Yan, Gui-Ying
2017-12-12
As research on microRNAs (miRNAs) continues, a growing body of experimental evidence indicates that miRNAs are associated with the development and progression of various complex human diseases. Hence, predicting disease-associated miRNAs deserves more attention, as it may help in the effective prevention, diagnosis and treatment of human diseases. In particular, computational methods for predicting potential miRNA-disease associations merit further study because of their feasibility and effectiveness. In this work, we developed a novel computational model of multiple kernels learning-based Kronecker regularized least squares for MiRNA-disease association prediction (MKRMDA), which could reveal potential miRNA-disease associations by automatically optimizing the combination of multiple kernels for disease and miRNA. MKRMDA obtained AUCs of 0.9040 and 0.8446 in global and local leave-one-out cross validation, respectively. Meanwhile, MKRMDA achieved an average AUC of 0.8894 ± 0.0015 in fivefold cross validation. Furthermore, we conducted three different kinds of case studies on some important human cancers for further performance evaluation. In the case studies of colonic cancer, esophageal cancer and lymphoma based on known miRNA-disease associations in the HMDDv2.0 database, 76, 94 and 88% of the corresponding top 50 predicted miRNAs were confirmed by experimental reports, respectively. In two further kinds of case studies, one for new diseases without any known associated miRNAs and one for diseases with known associations only in the HMDDv1.0 database, the verified ratios for two different cancers were 88 and 94%, respectively. All the results mentioned above demonstrate the reliable prediction ability of MKRMDA. We anticipate that MKRMDA could serve to facilitate further developments in the field and follow-up investigations by biomedical researchers.
Eun, Jung Woo; Kim, Hyung Seok; Shen, Qingyu; Yang, Hee Doo; Kim, Sang Yean; Yoon, Jung Hwan; Park, Won Sang; Lee, Jung Young; Nam, Suk Woo
2018-01-01
MicroRNAs (miRNAs) engage in complex interactions with the machinery that controls the transcriptome and concurrently target multiple mRNAs. Here, we demonstrate that microRNA-495-3p (miR-495-3p) functions as a potent tumor suppressor by governing ten oncogenic epigenetic modifiers (EMs) in gastric carcinogenesis. From the large cohort transcriptome datasets of gastric cancer (GC) patients available from The Cancer Genome Atlas (TCGA) and the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO), we were able to recapitulate 15 EMs as significantly overexpressed in GC among the 51 EMs that were previously reported to be involved in cancer progression. Computational target prediction yielded miR-495-3p, which targets as many as ten of the 15 candidate oncogenic EMs. Ectopic expression of miRNA mimics in GC cells caused miR-495-3p to suppress ten EMs, and inhibited tumor cell growth and proliferation via caspase-dependent and caspase-independent cell death processing. In addition, in vitro metastasis assays showed that miR-495-3p plays a role in the metastatic behavior of GC cells by regulating SLUG, vimentin, and N-cadherin. Furthermore, treatment of GC cells with 5-aza-2'-deoxycytidine restored miR-495-3p expression; sequence analysis revealed hypermethylation of the miR-495-3p promoter region in GC cells. A negative regulatory loop is proposed, whereby DNMT1, among ten oncogenic EMs, regulates miR-495-3p expression via hypermethylation of the miR-495-3p promoter. Our findings suggest that the functional loss or suppression of miR-495-3p triggers overexpression of multiple oncogenic EMs, and thereby contributes to malignant transformation and growth of gastric epithelial cells. Copyright © 2017 Pathological Society of Great Britain and Ireland. Published by John Wiley & Sons, Ltd.
Fu, Lijuan; Shi, Zhimin; Luo, Guanzheng; Tu, Weihong; Wang, XiuJie; Fang, Zhide; Li, XiaoChing
2014-10-01
Mutations in the human FOXP2 gene cause speech and language impairments. The FOXP2 protein is a transcription factor that regulates the expression of many downstream genes, which may have important roles in nervous system development and function. An adequate amount of functional FOXP2 protein is thought to be critical for the proper development of the neural circuitry underlying speech and language. However, how FOXP2 gene expression is regulated is not clearly understood. The FOXP2 mRNA has an approximately 4-kb-long 3' untranslated region (3' UTR), twice as long as its protein coding region, indicating that FOXP2 can be regulated by microRNAs (miRNAs). We identified multiple miRNAs that regulate the expression of the human FOXP2 gene using sequence analysis and in vitro cell systems. Focusing on let-7a, miR-9, and miR-129-5p, three brain-enriched miRNAs, we show that these miRNAs regulate human FOXP2 expression in a dosage-dependent manner and target specific sequences in the FOXP2 3' UTR. We further show that these three miRNAs are expressed in the cerebellum of the human fetal brain, where FOXP2 is known to be expressed. Our results reveal novel regulatory functions of the human FOXP2 3' UTR sequence and regulatory interactions between multiple miRNAs and the human FOXP2 gene. The expression of let-7a, miR-9, and miR-129-5p in the human fetal cerebellum is consistent with their roles in regulating FOXP2 expression during early cerebellum development. These results suggest that various genetic and environmental factors may contribute to speech and language development and related neural developmental disorders via the miRNA-FOXP2 regulatory network.
Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry
2014-01-01
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914
Genotype Imputation with Millions of Reference Samples.
Browning, Brian L; Browning, Sharon R
2016-01-07
We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle's throughput was more than 100× greater than Impute2's throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs. Copyright © 2016 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
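The abstract above attributes part of the speed-up to linear interpolation at ungenotyped variants. The sketch below shows only that interpolation idea in isolation: given a probability at each genotyped marker, it estimates the value at an intermediate position by weighting the two flanking markers by distance. It is not the Beagle v.4.1 implementation; the function name, the use of physical positions as the distance scale, and the flat extrapolation beyond the outermost markers are assumptions made for the example.

```python
from bisect import bisect_left


def interpolate_probs(positions, probs, targets):
    """Linearly interpolate per-marker probabilities at ungenotyped positions.

    `positions` must be sorted ascending and strictly increasing; `probs[k]`
    is the probability estimated at the genotyped marker at `positions[k]`.
    """
    out = []
    for t in targets:
        k = bisect_left(positions, t)  # index of first marker at or after t
        if k == 0:
            out.append(probs[0])       # left of all markers: hold first value
        elif k == len(positions):
            out.append(probs[-1])      # right of all markers: hold last value
        else:
            x0, x1 = positions[k - 1], positions[k]
            w = (t - x0) / (x1 - x0)   # 0 at left flank, 1 at right flank
            out.append((1 - w) * probs[k - 1] + w * probs[k])
    return out


# Hypothetical genotyped markers (base-pair positions) and probabilities.
marker_pos = [100, 200, 400]
marker_prob = [0.0, 1.0, 0.5]
interp = interpolate_probs(marker_pos, marker_prob, [50, 150, 300, 500])
```

The expensive probability model only has to be evaluated at the genotyped markers; every ungenotyped variant between them then costs a single weighted average.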
Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz
2017-12-01
Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% of the genes were selected at random under two types of missingness mechanism, ignorable and non-ignorable, with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes were tested for their enrichment in important pathways using ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of the various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.
Kattimani, Yogita; Veerappa, Avinash M
2018-04-09
The aim was to identify damaging mutations in microRNAs (miRNAs) and 3' untranslated regions (UTRs) of target genes in order to establish a multiple sclerosis (MS) disease pathway. A 16-year-old female with relapsing-remitting multiple sclerosis (RRMS) presented with initial symptoms of blurred vision, severe immobility, upper and lower limb numbness, and backache. Whole exome sequencing (WES) and disease pathway analysis were performed to identify mutations in miRNAs and UTRs. We identified deleterious/damaging multibase mutations in MIR8485 and NRXN1. miR-8485 was found to carry a frameshift homozygous deletion of bases CA, while NRXN1 was found to carry a nonframeshift homozygous substitution of bases CT to TC in exon 8, replacing serine with leucine. Mutations in miR-8485 and NRXN1 were found to alter calcium homeostasis and NRXN1/NLGN1 cell adhesion molecule binding affinities. The miR-8485 mutation leads to overexpression of NRXN1, altering pre-synaptic Ca2+ homeostasis and inducing neurodegeneration. Copyright © 2018 Elsevier B.V. All rights reserved.
DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts.
Lee, Donghyung; Bigdeli, T Bernard; Williamson, Vernell S; Vladimirov, Vladimir I; Riley, Brien P; Fanous, Ayman H; Bacanu, Silviu-Alin
2015-10-01
To increase the signal resolution for large-scale meta-analyses of genome-wide association studies, genotypes at unmeasured single nucleotide polymorphisms (SNPs) are commonly imputed using large multi-ethnic reference panels. However, the ever increasing size and ethnic diversity of both reference panels and cohorts make genotype imputation computationally challenging for moderately sized computer clusters. Moreover, genotype imputation requires subject-level genetic data, which, unlike the summary statistics provided by virtually all studies, are not publicly available. While there are much less demanding methods which avoid the genotype imputation step by directly imputing SNP statistics, e.g. Directly Imputing summary STatistics (DIST) proposed by our group, their implicit assumptions make them applicable only to ethnically homogeneous cohorts. To decrease computational and access requirements for the analysis of cosmopolitan cohorts, we propose DISTMIX, which extends DIST capabilities to the analysis of mixed ethnicity cohorts. The method uses a relevant reference panel to directly impute unmeasured SNP statistics based only on statistics at measured SNPs and estimated/user-specified ethnic proportions. Simulations show that the proposed method adequately controls the Type I error rates. The 1000 Genomes panel imputation of summary statistics from the ethnically diverse Psychiatric Genetic Consortium Schizophrenia Phase 2 suggests that, when compared to genotype imputation methods, DISTMIX offers comparable imputation accuracy for only a fraction of computational resources. DISTMIX software, its reference population data, and usage examples are publicly available at http://code.google.com/p/distmix. Contact: dlee4@vcu.edu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Chen, Chen; Zhou, Yifan; Wang, Jingqi; Yan, Yaping; Peng, Lisheng; Qiu, Wei
2018-01-01
Multiple sclerosis (MS) is an immune-mediated demyelinating disease of the central nervous system. Growing evidence has proven that T helper 17 (Th17) cells are one of the regulators of neuroinflammation mechanisms in MS disease. Researchers have demonstrated that some microRNAs (miRNAs) are associated with disease activity and duration, even with different MS patterns. miRNAs regulate CD4+ T cells to differentiate toward various T cell subtypes including Th17 cells. In this review, we discuss the possible mechanisms of miRNAs in MS pathophysiology by regulating CD4+ T cell differentiation into Th17 cells, and potential miRNA targets for current disease-modifying treatments.
Conde, João; Edelman, Elazer R; Artzi, Natalie
2015-01-01
microRNAs (miRNAs) show high potential for cancer treatment; however, one of the most significant bottlenecks in enabling miRNA effect is the need for an efficient vehicle capable of selectively targeting tumor cells without disrupting normal cells. Even more challenging is the ability to detect and silence multiple targets simultaneously with high sensitivity while precluding resistance to the therapeutic agents. Focusing on the pervasive role of miRNAs, herein we review the multiple nanomaterial-based systems that encapsulate DNA/RNA for miRNA sensing and inhibition in cancer therapy. Understanding the potential of miRNA detection and silencing while overcoming existing limitations will be critical to the optimization and clinical utilization of this technology. Copyright © 2014 Elsevier B.V. All rights reserved.
Competing targets of microRNA-608 affect anxiety and hypertension
Hanin, Geula; Shenhar-Tsarfaty, Shani; Yayon, Nadav; Hoe, Yau Yin; Bennett, Estelle R.; Sklan, Ella H.; Rao, Dabeeru. C.; Rankinen, Tuomo; Bouchard, Claude; Geifman-Shochat, Susana; Shifman, Sagiv; Greenberg, David S.; Soreq, Hermona
2014-01-01
MicroRNAs (miRNAs) can repress multiple targets, but how a single de-balanced interaction affects others remained unclear. We found that changing a single miRNA–target interaction can simultaneously affect multiple other miRNA–target interactions and modify physiological phenotype. We show that miR-608 targets acetylcholinesterase (AChE) and demonstrate weakened miR-608 interaction with the rs17228616 AChE allele having a single-nucleotide polymorphism (SNP) in the 3′-untranslated region (3′UTR). In cultured cells, this weakened interaction potentiated miR-608-mediated suppression of other targets, including CDC42 and interleukin-6 (IL6). Postmortem human cortices homozygote for the minor rs17228616 allele showed AChE elevation and CDC42/IL6 decreases compared with major allele homozygotes. Additionally, minor allele heterozygote and homozygote subjects showed reduced cortisol and elevated blood pressure, predicting risk of anxiety and hypertension. Parallel suppression of the conserved brain CDC42 activity by intracerebroventricular ML141 injection caused acute anxiety in mice. We demonstrate that SNPs in miRNA-binding regions could cause expanded downstream effects changing important biological pathways. PMID:24722204
Chand, Subodh Kumar; Nanda, Satyabrata; Mishra, Rukmini; Joshi, Raj Kumar
2017-04-01
The basal plate rot fungus, Fusarium oxysporum f. sp. cepae (FOC), is the most devastating pathogen posing a serious threat to garlic (Allium sativum L.) production worldwide. MicroRNAs (miRNAs) are key modulators of gene expression related to development and defense responses in eukaryotes. However, the miRNA species associated with garlic immunity against FOC are yet to be explored. In the present study, a small RNA library developed from an FOC-infected resistant garlic line was sequenced to identify immune-responsive miRNAs. Forty-five miRNAs representing 39 conserved and six novel sequences responsive to FOC were detected. qRT-PCR analyses further classified them into three classes based on their expression patterns in the susceptible line CBT-As11 and in the resistant line CBT-As153. Northern blot analyses of six selected miRNAs confirmed the qRT-PCR results. Expression studies on a selected set of target genes revealed a negative correlation with the complementary miRNAs. Furthermore, transgenic garlic plants overexpressing miR164a, miR168a and miR393 showed enhanced resistance to FOC, as revealed by decreased fungal growth and up-regulated expression of defense-responsive genes. These results indicate that multiple miRNAs are involved in garlic immunity against FOC and that the overexpression of miR164a, miR168a and miR393 can augment garlic resistance to Fusarium basal rot infection. Copyright © 2017 Elsevier B.V. All rights reserved.
The Silkworm (Bombyx mori) microRNAs and Their Expressions in Multiple Developmental Stages
Luo, Qibin; Cai, Yimei; Lin, Wen-chang; Chen, Huan; Yang, Yue; Hu, Songnian; Yu, Jun
2008-01-01
Background MicroRNAs (miRNAs) play crucial roles in various physiological processes through post-transcriptional regulation of gene expressions and are involved in development, metabolism, and many other important molecular mechanisms and cellular processes. The Bombyx mori genome sequence provides opportunities for a thorough survey for miRNAs as well as comparative analyses with other sequenced insect species. Methodology/Principal Findings We identified 114 non-redundant conserved miRNAs and 148 novel putative miRNAs from the B. mori genome with an elaborate computational protocol. We also sequenced 6,720 clones from 14 developmental stage-specific small RNA libraries in which we identified 35 unique miRNAs containing 21 conserved miRNAs (including 17 predicted miRNAs) and 14 novel miRNAs (including 11 predicted novel miRNAs). Among the 114 conserved miRNAs, we found six pairs of clusters evolutionarily conserved cross insect lineages. Our observations on length heterogeneity at 5′ and/or 3′ ends of nine miRNAs between cloned and predicted sequences, and three mature forms deriving from the same arm of putative pre-miRNAs suggest a mechanism by which miRNAs gain new functions. Analyzing development-related miRNAs expression at 14 developmental stages based on clone-sampling and stem-loop RT PCR, we discovered an unusual abundance of 33 sequences representing 12 different miRNAs and sharply fluctuated expression of miRNAs at larva-molting stage. The potential functions of several stage-biased miRNAs were also analyzed in combination with predicted target genes and silkworm's phenotypic traits; our results indicated that miRNAs may play key regulatory roles in specific developmental stages in the silkworm, such as ecdysis. Conclusions/Significance Taking a combined approach, we identified 118 conserved miRNAs and 151 novel miRNA candidates from the B. mori genome sequence. 
Our expression analyses by sampling miRNAs and real-time PCR over multiple developmental stages allowed us to pinpoint molting stages as hotspots of miRNA expression both in sorts and quantities. Based on the analysis of target genes, we hypothesized that miRNAs regulate development through a particular emphasis on complex stages rather than general regulatory mechanisms. PMID:18714353
Data estimation and prediction for natural resources public data
Hans T. Schreuder; Robin M. Reich
1998-01-01
A key product of both Forest Inventory and Analysis (FIA) of the USDA Forest Service and the Natural Resources Inventory (NRI) of the Natural Resources Conservation Service is a scientific data base that should be defensible in court. Multiple imputation procedures (MIPs) have been proposed both for missing value estimation and prediction of non-remeasured cells in...
Evaluating Multiple Imputation Models for the Southern Annual Forest Inventory
Gregory A. Reams; Joseph M. McCollum
1999-01-01
The USDA Forest Service's Southern Research Station is implementing an annualized forest survey in thirteen states. The sample design is a systematic sample of five interpenetrating grids (panels), where each panel is measured sequentially. For example, panel one information is collected in year one, and panel five in year five. The area representative and time...
Pajic, Marina; Froio, Danielle; Daly, Sheridan; Doculara, Louise; Millar, Ewan; Graham, Peter H; Drury, Alison; Steinmann, Angela; de Bock, Charles E; Boulghourjian, Alice; Zaratzian, Anaiis; Carroll, Susan; Toohey, Joanne; O'Toole, Sandra A; Harris, Adrian L; Buffa, Francesca M; Gee, Harriet E; Hollway, Georgina E; Molloy, Timothy J
2018-01-15
Radiotherapy is essential to the treatment of most solid tumors and acquired or innate resistance to this therapeutic modality is a major clinical problem. Here we show that miR-139-5p is a potent modulator of radiotherapy response in breast cancer via its regulation of genes involved in multiple DNA repair and reactive oxygen species defense pathways. Treatment of breast cancer cells with a miR-139-5p mimic strongly synergized with radiation both in vitro and in vivo, resulting in significantly increased oxidative stress, accumulation of unrepaired DNA damage, and induction of apoptosis. Several miR-139-5p target genes were also strongly predictive of outcome in radiotherapy-treated patients across multiple independent breast cancer cohorts. These prognostically relevant miR-139-5p target genes were used as companion biomarkers to identify radioresistant breast cancer xenografts highly amenable to sensitization by cotreatment with a miR-139-5p mimetic. Significance: The microRNA described in this study offers a potentially useful predictive biomarker of radiosensitivity in solid tumors and a generally applicable druggable target for tumor radiosensitization. Cancer Res; 78(2); 501-15. ©2017 American Association for Cancer Research.
Breaking the code: What is the best post-PCI MI definition?
Seto, Arnold H; Kern, Morton J
2017-04-01
Various definitions of post-PCI MI have been recommended by different professional societies and studies. This present study suggests that the troponin-based 3rd universal definition of post-PCI MI has prognostic value for recurrent MI but not mortality alone, unlike the CK-MB based SCAI definition. Absent a consensus on the best definition, clinical trials should report outcomes based on multiple definitions of post-PCI MI. © 2017 Wiley Periodicals, Inc.
Expression patterns of miR-146a and miR-146b in mastitis infected dairy cattle.
Wang, Xing Ping; Luoreng, Zhuo Ma; Zan, Lin Sen; Raza, Sayed Haidar Abbas; Li, Feng; Li, Na; Liu, Shuan
2016-10-01
This study reports a significant up-regulation of bta-miR-146a and bta-miR-146b expression levels in bovine mammary tissues infected with subclinical, clinical and experimental mastitis. Potential target genes are involved in multiple immunological pathways. These results suggest a regulatory function of both miRNAs for the bovine inflammatory response in mammary tissue. Copyright © 2016 Elsevier Ltd. All rights reserved.
Statistical primer: how to deal with missing data in scientific research?
Papageorgiou, Grigorios; Grant, Stuart W; Takkenberg, Johanna J M; Mokhles, Mostafa M
2018-05-10
Missing data are a common challenge encountered in research which can compromise the results of statistical inference when not handled appropriately. This paper aims to introduce basic concepts of missing data to a non-statistical audience, list and compare some of the most popular approaches for handling missing data in practice, and provide guidelines and recommendations for dealing with and reporting missing data in scientific research. Complete case analysis and single imputation are simple approaches for handling missing data and are popular in practice; however, in most cases they are not guaranteed to provide valid inferences. Multiple imputation is a robust and general alternative which is appropriate for data missing at random and overcomes the disadvantages of the simpler approaches, but it should always be conducted with care. The aforementioned approaches are illustrated and compared in an example application using Cox regression.
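The pooling step of multiple imputation is commonly carried out with Rubin's rules. A minimal sketch follows, assuming each imputed data set has already produced a point estimate and its squared standard error; the function name `pool_rubin` and the example numbers are made up for illustration and do not come from the paper's Cox regression example.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules.
    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns (pooled estimate, total variance)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)          # average of the point estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Three imputed data sets (illustrative numbers only).
pooled, total = pool_rubin([0.50, 0.60, 0.40], [0.04, 0.05, 0.03])
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty due to the values being missing, which is one reason single imputation tends to understate standard errors.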
MicroRNA-188 suppresses G1/S transition by targeting multiple cyclin/CDK complexes.
Wu, Jiangbin; Lv, Qing; He, Jie; Zhang, Haoxiang; Mei, Xueshuang; Cui, Kai; Huang, Nunu; Xie, Weidong; Xu, Naihan; Zhang, Yaou
2014-10-11
Accelerated cell cycle progression is a common feature of most cancers. MiRNAs can act as oncogenes or tumor suppressors by directly modulating cell cycle machinery. It has been shown that miR-188 is upregulated in UVB-irradiated mouse skin and human nasopharyngeal carcinoma CNE cells under hypoxic stress. However, little is known about the function of miR-188 in cell proliferation and growth control. Overexpression of miR-188 inhibits cell proliferation, tumor colony formation and G1/S cell cycle transition in human nasopharyngeal carcinoma CNE cells. Using a bioinformatics approach, we identify a series of genes regulating G1/S transition as putative miR-188 targets. MiR-188 inhibits both mRNA and protein expression of CCND1, CCND3, CCNE1, CCNA2, CDK4 and CDK2, suppresses Rb phosphorylation and downregulates E2F transcriptional activity. The expression level of miR-188 also inversely correlates with the expression of miR-188 targets in human nasopharyngeal carcinoma (NPC) tissues. Moreover, studies in a xenograft mouse model reveal that miR-188 is capable of inhibiting tumor initiation and progression by suppressing target gene expression and Rb phosphorylation. This study demonstrates that miR-188 exerts anticancer effects via downregulation of multiple G1/S-related cyclin/CDKs and the Rb/E2F signaling pathway.
Multiple Regression Analysis of mRNA-miRNA Associations in Colorectal Cancer Pathway
Wang, Fengfeng; Wong, S. C. Cesar; Chan, Lawrence W. C.; Cho, William C. S.; Yip, S. P.; Yung, Benjamin Y. M.
2014-01-01
Background. MicroRNA (miRNA) is a short and endogenous RNA molecule that regulates posttranscriptional gene expression. It is an important factor for tumorigenesis of colorectal cancer (CRC), and a potential biomarker for diagnosis, prognosis, and therapy of CRC. Our objective is to identify the related miRNAs and their associations with genes frequently involved in CRC microsatellite instability (MSI) and chromosomal instability (CIN) signaling pathways. Results. A regression model was adopted to identify the significantly associated miRNAs targeting a set of candidate genes frequently involved in colorectal cancer MSI and CIN pathways. Multiple linear regression analysis was used to construct the model and find the significant mRNA-miRNA associations. We identified three significantly associated mRNA-miRNA pairs: BCL2 was positively associated with miR-16 and SMAD4 was positively associated with miR-567 in the CRC tissue, while MSH6 was positively associated with miR-142-5p in the normal tissue. As for the whole model, BCL2 and SMAD4 models were not significant, and MSH6 model was significant. The significant associations were different in the normal and the CRC tissues. Conclusion. Our results have laid down a solid foundation in exploration of novel CRC mechanisms, and identification of miRNA roles as oncomirs or tumor suppressor mirs in CRC. PMID:24895601
MicroRNA-466l inhibits antiviral innate immune response by targeting interferon-alpha
Li, Yingke; Fan, Xiaohua; He, Xingying; Sun, Haijing; Zou, Zui; Yuan, Hongbin; Xu, Haitao; Wang, Chengcai; Shi, Xueyin
2012-01-01
Effective recognition of viral infections and subsequent triggering of antiviral innate immune responses are essential for the host antiviral defense, which is tightly regulated by multiple regulators, including microRNAs (miRNAs). A previous study showed that miR-466l upregulates IL-10 expression in macrophages by antagonizing RNA-binding protein tristetraprolin-mediated IL-10 mRNA degradation. However, the ability of miR-466l to regulate antiviral immune responses remains unknown. Here, we found that interferon-alpha (IFN-α) expression was repressed in Sendai virus (SeV)- and vesicular stomatitis virus (VSV)-infected macrophages and in dendritic cells transfected with miR-466l expression. Moreover, multiple IFN-α species can be directly targeted by miR-466l through their 3′ untranslated region (3′UTR). This study has demonstrated that miR-466l could directly target IFN-α expression to inhibit host antiviral innate immune response. PMID:23042536
A regressive methodology for estimating missing data in rainfall daily time series
NASA Astrophysics Data System (ADS)
Barca, E.; Passarella, G.
2009-04-01
Gaps in environmental data time series are a very common but extremely critical problem, since they can produce biased results (Rubin, 1976). Missing data plague almost all surveys. The problem is how to deal with missing data once it has been deemed impossible to recover the actual missing values. Apart from the amount of missing data, another issue that plays an important role in the choice of any recovery approach is the evaluation of the "missingness" mechanism. When missingness depends on some other variable observed in the data set (Schafer, 1997), the mechanism is called MAR (Missing at Random). Otherwise, when the missingness mechanism depends on the actual value of the missing data, it is called NMAR (Not Missing at Random). The latter is the most difficult condition to model. In the last decade, interest arose in the estimation of missing data by regression (single imputation). More recently, multiple imputation has also become available, which returns a distribution of estimated values (Scheffer, 2002). In this paper an automatic methodology for estimating missing data is presented. In practice, given a gauging station affected by missing data (target station), the methodology checks the randomness of the missing data and classifies the "similarity" between the target station and the other gauging stations spread over the study area. Among the different methods for defining the degree of similarity, whose effectiveness strongly depends on the data distribution, the Spearman correlation coefficient was chosen. Once the similarity matrix had been defined, a suitable nonparametric, univariate, regressive method was applied to estimate missing data in the target station: the Theil method (Theil, 1950). Even though the methodology proved to be rather reliable, the missing data estimation can be improved by a generalization. 
A first possible improvement consists in extending the univariate technique to a multivariate approach. Another approach follows the paradigm of "multiple imputation" (Rubin, 1987; Rubin, 1988), which consists in using a set of "similar stations" rather than only the most similar one. In this way, a sort of estimation range can be determined, allowing the introduction of uncertainty. Finally, time series can be grouped on the basis of monthly rainfall rates, defining classes of wetness (i.e., dry, moderately rainy and rainy), in order to carry out the estimation using homogeneous data subsets. We expect that integrating the methodology with these enhancements will improve its reliability. The methodology was applied to the daily rainfall time series data recorded in the Candelaro River Basin (Apulia, southern Italy) from 1970 to 2001. REFERENCES D.B. Rubin, 1976. Inference and Missing Data. Biometrika 63, 581-592. D.B. Rubin, 1987. Multiple Imputation for Nonresponse in Surveys, New York: John Wiley & Sons, Inc. D.B. Rubin, 1988. An overview of multiple imputation. In Survey Research Section, pp. 79-84, American Statistical Association, 1988. J.L. Schafer, 1997. Analysis of Incomplete Multivariate Data, Chapman & Hall. J. Scheffer, 2002. Dealing with Missing Data. Res. Lett. Inf. Math. Sci. 3, 153-160. Available online at http://www.massey.ac.nz/~wwiims/research/letters/ H. Theil, 1950. A rank-invariant method of linear and polynomial regression analysis. Indagationes Mathematicae, 12, pp. 85-91.
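The Theil regression step named in this abstract can be sketched as follows: the slope is the median of all pairwise slopes and the intercept is the median of the remaining offsets, which makes the fit resistant to outliers. This is a minimal sketch and not the authors' code; the helper names, the use of a single reference station, and the toy values are assumptions.

```python
from itertools import combinations
from statistics import median

def theil_fit(x, y):
    """Theil's estimator: slope is the median of pairwise slopes,
    intercept is the median of y - slope * x."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept

def fill_gaps(target, reference):
    """Fill None entries of `target` from `reference` using a Theil line
    fitted on the days where both stations have observations."""
    pairs = [(r, t) for r, t in zip(reference, target)
             if r is not None and t is not None]
    slope, intercept = theil_fit([r for r, _ in pairs], [t for _, t in pairs])
    return [t if t is not None
            else (None if r is None else slope * r + intercept)
            for r, t in zip(reference, target)]

# Toy series: the target follows 2 * reference + 1, with one outlier (40.0)
# and one gap (None) to be estimated.
reference = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
target = [1.0, 5.0, None, 13.0, 40.0, 21.0]
filled = fill_gaps(target, reference)
```

In this toy series the single outlier does not move the fitted line, so the gap is filled from the underlying linear relation. A real rainfall application would also need to handle zero-rain days and keep estimates non-negative, which this sketch does not attempt.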
Prognostic Value of microRNA-9 in Various Cancers: a Meta-analysis.
Zhang, Yunyuan; Zhou, Jun; Sun, Meiling; Sun, Guirong; Cao, Yongxian; Zhang, Haiping; Tian, Runhua; Zhou, Lan; Duan, Liang; Chen, Xian; Lun, Limin
2017-07-01
Recently, a growing number of studies have revealed an association between microRNA-9 (miR-9) expression and outcome in multiple cancers, but inconsistent results have also been reported. A meta-analysis of all available data is therefore needed to clarify the prognostic role of miR-9. Eligible studies were selected through multiple search strategies and their quality was assessed by MOOSE. Data were extracted from the studies according to the key statistics index. All analyses were performed using STATA software. Twenty studies were included in the meta-analysis to evaluate the prognostic role of miR-9 in multiple tumors. MiR-9 expression level was an independent prognostic biomarker for overall survival (OS) in tumor patients using multivariate and univariate analyses. High expression of miR-9 was associated with poor OS (HR = 2.23, 95% CI: 1.56-3.17, P < 0.05) and poor recurrence-free survival/progression-free survival (RFS/PFS) (HR = 2.08, 95% CI: 1.33-3.27, P < 0.05). Subgroup analysis showed that residence region (China and Japan), sample size, cancer type (solid or leukemia), follow-up months and analysis method (qPCR) did not alter the predictive value of miR-9 on OS in various cancers. Furthermore, no significant associations were detected between miR-9 expression and lymph node metastasis or distant metastasis. The present results suggest that elevated miR-9 expression is associated with poor OS in patients with cancer.
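A reported hazard ratio and its 95% confidence interval can be converted back to the log scale on which such estimates are usually pooled. The sketch below recovers the standard error of the log hazard ratio and the corresponding z statistic from the OS figures quoted in the abstract; it assumes a symmetric normal-approximation interval on the log scale, which the abstract does not state, and the function name is hypothetical.

```python
import math

def log_hr_summary(hr, ci_low, ci_high, z_crit=1.96):
    """Recover log(HR), its standard error and the z statistic from a
    hazard ratio and its confidence limits, assuming the interval is
    log(HR) +/- z_crit * SE (an assumption, not stated in the abstract)."""
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_hr / se
    return log_hr, se, z

# Pooled OS estimate quoted above: HR = 2.23, 95% CI 1.56-3.17.
log_hr, se, z = log_hr_summary(2.23, 1.56, 3.17)
```

A z statistic well above 1.96 is consistent with the reported P < 0.05; the same conversion is what allows individual study results reported as HR and CI to enter an inverse-variance pooled estimate.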
Gaussian mixture clustering and imputation of microarray data.
Ouyang, Ming; Welsh, William J; Georgopoulos, Panos
2004-04-12
In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
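The first evaluation metric, root mean squared error between true and imputed values, can be computed as in the sketch below. It assumes the error is taken only over entries that were masked and then imputed; the function name and the toy numbers are illustrative and not from the study.

```python
import math

def imputation_rmse(true_values, imputed_values, missing_mask):
    """Root mean squared error over the entries flagged as missing.
    missing_mask[i] is True where the value was removed and then imputed."""
    squared_errors = [(t - v) ** 2
                      for t, v, m in zip(true_values, imputed_values, missing_mask)
                      if m]
    if not squared_errors:
        raise ValueError("no imputed entries to evaluate")
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two of four entries were masked and imputed (toy values).
rmse = imputation_rmse([1.0, 2.0, 3.0, 4.0],
                       [1.0, 2.5, 3.0, 3.0],
                       [False, True, False, True])
```

Restricting the average to masked entries keeps untouched observed values, which trivially have zero error, from diluting the metric.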
MicroRNA regulation of immune events at conception.
Robertson, Sarah A; Zhang, Bihong; Chan, Honyueng; Sharkey, David J; Barry, Simon C; Fullston, Tod; Schjenken, John E
2017-09-01
The reproductive tract environment at conception programs the developmental trajectory of the embryo, sets the course of pregnancy, and impacts offspring phenotype and health. Despite the fundamental importance of this stage of reproduction, the rate-limiting regulatory mechanisms operating locally to control fertility and fecundity are incompletely understood. Emerging studies highlight roles for microRNAs (miRNAs) in regulating reproductive and developmental processes and in modulating the quality and strength of the female immune response. Since endometrial receptivity and robust placentation require specific adaptation of the immune response, we hypothesize that miRNAs participate in establishing pregnancy through effects on key gene networks in immune cells. Our recent studies investigated miRNAs that are induced in the peri-conception environment, focusing on miRNAs that have immune-regulatory roles-particularly miR-223, miR-155, and miR-146a. Genetic mouse models deficient in individual miRNAs are proving informative in defining roles for these miRNAs in the generation and stabilization of regulatory T cells (Treg cells) that confer adaptive immune tolerance. Overlapping and redundant functions between miRNAs that target multiple genes, combined with multiple miRNAs targeting individual genes, indicate complex and sensitive regulatory networks. Although to date most data on miRNA regulation of reproductive events are from mice, conserved functions of miRNAs across species imply similar biological pathways operate in all mammals. Understanding the regulation and roles of miRNAs in the peri-conception immune response will advance our knowledge of how environmental determinants act at conception, and could have practical applications for animal breeding as well as human fertility. © 2017 Wiley Periodicals, Inc.
Di Martino, Maria T.; Leone, Emanuela; Amodio, Nicola; Foresta, Umberto; Lionetti, Marta; Pitari, Maria R.; Gallo Cantafio, Maria E.; Gullà, Annamaria; Conforti, Francesco; Morelli, Eugenio; Tomaino, Vera; Rossi, Marco; Negrini, Massimo; Ferrarini, Manlio; Caraglia, Michele; Shammas, Masood A.; Munshi, Nikhil C.; Anderson, Kenneth C.; Neri, Antonino; Tagliaferri, Pierosandro; Tassone, Pierfrancesco
2015-01-01
Purpose Deregulated expression of microRNAs (miRNAs) has been demonstrated in multiple myeloma (MM). A promising strategy to achieve a therapeutic effect by targeting the miRNA regulatory network is to enforce the expression of miRNAs that act as tumor suppressor genes, such as miR-34a. Experimental Design Here, we investigated the therapeutic potential of synthetic miR-34a against human MM cells in vitro and in vivo. Results Either transient expression of miR-34a synthetic mimics or lentivirus-based miR-34a-stable enforced expression triggered growth inhibition and apoptosis in MM cells in vitro. Synthetic miR-34a downregulated canonic targets BCL2, CDK6 and NOTCH1 at both the mRNA and protein level. Lentiviral vector-transduced MM xenografts with constitutive miR-34a expression showed high growth inhibition in SCID mice. The anti-MM activity of lipidic-formulated miR-34a was further demonstrated in vivo in two different experimental settings: i) SCID mice bearing non-transduced MM xenografts; and ii) SCID-synth-hu mice implanted with synthetic 3D scaffolds reconstituted with human bone marrow stromal cells and then engrafted with human MM cells. Relevant tumor growth inhibition and survival improvement were observed in mice bearing TP53-mutated MM xenografts treated with miR-34a mimics in the absence of systemic toxicity. Conclusions Our findings provide a proof-of-principle that formulated synthetic miR-34a has therapeutic activity in preclinical models and support a framework for development of miR-34a-based treatment strategies in MM patients. PMID:23035210
Di Martino, Maria T; Leone, Emanuela; Amodio, Nicola; Foresta, Umberto; Lionetti, Marta; Pitari, Maria R; Cantafio, Maria E Gallo; Gullà, Annamaria; Conforti, Francesco; Morelli, Eugenio; Tomaino, Vera; Rossi, Marco; Negrini, Massimo; Ferrarini, Manlio; Caraglia, Michele; Shammas, Masood A; Munshi, Nikhil C; Anderson, Kenneth C; Neri, Antonino; Tagliaferri, Pierosandro; Tassone, Pierfrancesco
2012-11-15
Deregulated expression of miRNAs has been shown in multiple myeloma (MM). A promising strategy to achieve a therapeutic effect by targeting the miRNA regulatory network is to enforce the expression of miRNAs that act as tumor suppressor genes, such as miR-34a. Here, we investigated the therapeutic potential of synthetic miR-34a against human MM cells in vitro and in vivo. Either transient expression of miR-34a synthetic mimics or lentivirus-based miR-34a-stable enforced expression triggered growth inhibition and apoptosis in MM cells in vitro. Synthetic miR-34a downregulated canonic targets BCL2, CDK6, and NOTCH1 at both the mRNA and protein level. Lentiviral vector-transduced MM xenografts with constitutive miR-34a expression showed high growth inhibition in severe combined immunodeficient (SCID) mice. The anti-MM activity of lipidic-formulated miR-34a was further shown in vivo in two different experimental settings: (i) SCID mice bearing nontransduced MM xenografts; and (ii) SCID-synth-hu mice implanted with synthetic 3-dimensional scaffolds reconstituted with human bone marrow stromal cells and then engrafted with human MM cells. Relevant tumor growth inhibition and survival improvement were observed in mice bearing TP53-mutated MM xenografts treated with miR-34a mimics in the absence of systemic toxicity. Our findings provide a proof-of-principle that formulated synthetic miR-34a has therapeutic activity in preclinical models and support a framework for development of miR-34a-based treatment strategies in MM patients. ©2012 AACR.
Singh, Jagmohan; Boopathi, Ettickan; Addya, Sankar; Phillips, Benjamin; Rigoutsos, Isidore; Penn, Raymond B.
2016-01-01
A comprehensive genomic and proteomic, computational, and physiological approach was employed to examine the (previously unexplored) role of microRNAs (miRNAs) as regulators of internal anal sphincter (IAS) smooth muscle contractile phenotype and basal tone. miRNA profiling, genome-wide expression, validation, and network analyses were employed to assess changes in mRNA and miRNA expression in IAS smooth muscles from young vs. aging rats. Multiple miRNAs, including rno-miR-1, rno-miR-340-5p, rno-miR-185, rno-miR-199a-3p, rno-miR-200c, rno-miR-200b, rno-miR-31, rno-miR-133a, and rno-miR-206, were found to be upregulated in aging IAS. qPCR confirmed the upregulated expression of these miRNAs and downregulation of multiple, predicted targets (Eln, Col3a1, Col1a1, Zeb2, Myocd, Srf, Smad1, Smad2, Rhoa/Rock2, Fn1, Tagln v2, Klf4, and Acta2) involved in regulation of smooth muscle contractility. Subsequent studies demonstrated an aging-associated increase in the expression of miR-133a, corresponding decreases in RhoA, ROCK2, MYOCD, SRF, and SM22α protein expression, RhoA-signaling, and a decrease in basal and agonist [U-46619 (thromboxane A2 analog)]-induced increase in the IAS tone. Moreover, in vitro transfection of miR-133a caused a dose-dependent increase of IAS tone in strips, which was reversed by anti-miR-133a. Last, in vivo perianal injection of anti-miR-133a reversed the loss of IAS tone associated with age. This work establishes the important regulatory effect of miRNA-133a on basal and agonist-stimulated IAS tone. Moreover, reversal of age-associated loss of tone via anti-miR delivery strongly implicates miR dysregulation as a causal factor in the aging-associated decrease in IAS tone and suggests that miR-133a is a feasible therapeutic target in aging-associated rectoanal incontinence. PMID:27634012
Wu, Ping; Tu, Yunqiu; Qian, Yingdan; Zhang, Hui; Cai, Chenxin
2014-01-28
We report a new strategy for evaluating multiple miRNA expressions in cancer cells based on DNA strand-displacement-induced fluorescence enhancement. This assay has the ability to discriminate the target from even single-base mismatched sequences or other miRNAs.
Genotype imputation in a coalescent model with infinitely-many-sites mutation
Huang, Lucy; Buzbas, Erkan O.; Rosenberg, Noah A.
2012-01-01
Empirical studies have identified population-genetic factors as important determinants of the properties of genotype-imputation accuracy in imputation-based disease association studies. Here, we develop a simple coalescent model of three sequences that we use to explore the theoretical basis for the influence of these factors on genotype-imputation accuracy, under the assumption of infinitely-many-sites mutation. Employing a demographic model in which two populations diverged at a given time in the past, we derive the approximate expectation and variance of imputation accuracy in a study sequence sampled from one of the two populations, choosing between two reference sequences, one sampled from the same population as the study sequence and the other sampled from the other population. We show that under this model, imputation accuracy—as measured by the proportion of polymorphic sites that are imputed correctly in the study sequence—increases in expectation with the mutation rate, the proportion of the markers in a chromosomal region that are genotyped, and the time to divergence between the study and reference populations. Each of these effects derives largely from an increase in information available for determining the reference sequence that is genetically most similar to the sequence targeted for imputation. We analyze as a function of divergence time the expected gain in imputation accuracy in the target using a reference sequence from the same population as the target rather than from the other population. Together with a growing body of empirical investigations of genotype imputation in diverse human populations, our modeling framework lays a foundation for extending imputation techniques to novel populations that have not yet been extensively examined. PMID:23079542
Evaluation and application of summary statistic imputation to discover new height-associated loci.
Rüeger, Sina; McDaid, Aaron; Kutalik, Zoltán
2018-05-01
As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and its practical utility have not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error and better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. 
Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.
Evaluation and application of summary statistic imputation to discover new height-associated loci
2018-01-01
As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and its practical utility have not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error and better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. 
Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression. PMID:29782485
NASA Astrophysics Data System (ADS)
Poyatos, Rafael; Sus, Oliver; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
2018-05-01
The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km²). We simulated gaps at different missingness levels (10-80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. 
MICE informed by relevant ecological variables allowed us to fill the gaps in the IEFC incomplete dataset (5495 plots) and quantify imputation uncertainty. Resulting spatial patterns of the studied traits in Catalan forests were broadly similar when using species means, regression kriging or the best-performing MICE application, but some important discrepancies were observed at the local level. Our results highlight the need to assess imputation quality beyond just imputation accuracy and show that including environmental information in statistical imputation approaches yields more plausible imputations in spatially explicit plant trait datasets.
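The gap-simulation and mean-imputation steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names (`simulate_gaps`, `species_mean_impute`, `rmse`), the use of `None` for a missing value, and the fallback to the overall mean for a species with no observed values are all assumptions made for illustration. Only the two simplest methods (overall mean and species mean) are covered; kNN, kriging and MICE are left out.

```python
import math
import random
from collections import defaultdict


def simulate_gaps(values, missing_rate, seed=0):
    """Mask a fixed share of a complete trait vector (None marks a gap)."""
    rng = random.Random(seed)
    n_missing = round(missing_rate * len(values))
    masked_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in masked_idx else v for i, v in enumerate(values)]


def species_mean_impute(species, values):
    """Fill each gap with the mean of the observed values of the same species.

    Assumption: a species with no observed value falls back to the overall mean.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    overall = sum(observed) / len(observed)
    groups = defaultdict(list)
    for sp, v in zip(species, values):
        if v is not None:
            groups[sp].append(v)
    means = {sp: sum(vs) / len(vs) for sp, vs in groups.items()}
    return [v if v is not None else means.get(sp, overall)
            for sp, v in zip(species, values)]


def rmse(truth, imputed, masked):
    """Root-mean-square error over the masked positions only."""
    errors = [(t - p) ** 2 for t, p, m in zip(truth, imputed, masked) if m is None]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0
```

Accuracy alone would be scored with `rmse(truth, species_mean_impute(species, masked), masked)`. As the abstract stresses, such a score says nothing about whether trait distributions or correlations are preserved, so it should not be the only criterion.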
High-density marker imputation accuracy in sixteen French cattle breeds.
Hozé, Chris; Fouilloux, Marie-Noëlle; Venot, Eric; Guillaume, François; Dassonneville, Romain; Fritz, Sébastien; Ducrocq, Vincent; Phocas, Florence; Boichard, Didier; Croiseau, Pascal
2013-09-03
Genotyping with the medium-density Bovine SNP50 BeadChip® (50K) is now standard in cattle. The high-density BovineHD BeadChip®, which contains 777,609 single nucleotide polymorphisms (SNPs), was developed in 2010. Increasing marker density increases the level of linkage disequilibrium between quantitative trait loci (QTL) and SNPs and the accuracy of QTL localization and genomic selection. However, re-genotyping all animals with the high-density chip is not economically feasible. An alternative strategy is to genotype part of the animals with the high-density chip and to impute high-density genotypes for animals already genotyped with the 50K chip. Thus, it is necessary to investigate the error rate when imputing from the 50K to the high-density chip. Five thousand one hundred and fifty-three animals from 16 breeds (89 to 788 per breed) were genotyped with the high-density chip. Imputation error rates from the 50K to the high-density chip were computed for each breed with a validation set that included the 20% youngest animals. Marker genotypes were masked for animals in the validation population in order to mimic 50K genotypes. Imputation was carried out using the Beagle 3.3.0 software. Mean allele imputation error rates ranged from 0.31% to 2.41% depending on the breed. In total, 1980 SNPs had high imputation error rates in several breeds, which is probably due to genome assembly errors, and we recommend discarding these in future studies. Differences in imputation accuracy between breeds were related to the high-density-genotyped sample size and to the genetic relationship between reference and validation populations, whereas differences in effective population size and level of linkage disequilibrium showed limited effects. Accordingly, imputation accuracy was higher in breeds with large populations and in dairy breeds than in beef breeds. More than 99% of the alleles were correctly imputed if more than 300 animals were genotyped at high-density. 
No improvement was observed when multi-breed imputation was performed. In all breeds, imputation accuracy was higher than 97%, which indicates that imputation to the high-density chip was accurate. Imputation accuracy depends mainly on the size of the reference population and the relationship between reference and target populations.
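The validation scheme in this abstract (hold out the youngest 20% of animals, mask their high-density markers, impute, then count wrongly imputed alleles) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: genotypes are assumed to be coded as allele counts 0, 1 or 2, the error rate is assumed to be the share of wrong alleles out of two per marker, and the function names and the birth-year dictionary are hypothetical. The imputation step itself (Beagle in the study) is outside the sketch.

```python
def youngest_fraction(birth_years, fraction=0.2):
    """Return the IDs of the youngest `fraction` of animals (validation set).

    `birth_years` maps a hypothetical animal ID to its birth year.
    """
    ranked = sorted(birth_years, key=birth_years.get, reverse=True)
    n_valid = max(1, round(fraction * len(ranked)))
    return ranked[:n_valid]


def allele_error_rate(true_genotypes, imputed_genotypes):
    """Share of alleles imputed incorrectly across the masked markers.

    Genotypes are allele counts (0, 1 or 2). A 0-vs-2 mismatch counts as two
    wrong alleles and a 0-vs-1 mismatch as one (an assumed definition).
    """
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype vectors must have equal length")
    if not true_genotypes:
        raise ValueError("no markers to compare")
    wrong = sum(abs(t - i) for t, i in zip(true_genotypes, imputed_genotypes))
    return wrong / (2 * len(true_genotypes))
```

Imputation accuracy in the sense used above would then be one minus this error rate, averaged over the validation animals of a breed.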
High-density marker imputation accuracy in sixteen French cattle breeds
2013-01-01
Background Genotyping with the medium-density Bovine SNP50 BeadChip® (50K) is now standard in cattle. The high-density BovineHD BeadChip®, which contains 777 609 single nucleotide polymorphisms (SNPs), was developed in 2010. Increasing marker density increases the level of linkage disequilibrium between quantitative trait loci (QTL) and SNPs and the accuracy of QTL localization and genomic selection. However, re-genotyping all animals with the high-density chip is not economically feasible. An alternative strategy is to genotype part of the animals with the high-density chip and to impute high-density genotypes for animals already genotyped with the 50K chip. Thus, it is necessary to investigate the error rate when imputing from the 50K to the high-density chip. Methods Five thousand one hundred and fifty-three animals from 16 breeds (89 to 788 per breed) were genotyped with the high-density chip. Imputation error rates from the 50K to the high-density chip were computed for each breed with a validation set that included the 20% youngest animals. Marker genotypes were masked for animals in the validation population in order to mimic 50K genotypes. Imputation was carried out using the Beagle 3.3.0 software. Results Mean allele imputation error rates ranged from 0.31% to 2.41% depending on the breed. In total, 1980 SNPs had high imputation error rates in several breeds, which is probably due to genome assembly errors, and we recommend discarding these in future studies. Differences in imputation accuracy between breeds were related to the high-density-genotyped sample size and to the genetic relationship between reference and validation populations, whereas differences in effective population size and level of linkage disequilibrium showed limited effects. Accordingly, imputation accuracy was higher in breeds with large populations and in dairy breeds than in beef breeds. 
More than 99% of the alleles were correctly imputed if more than 300 animals were genotyped at high-density. No improvement was observed when multi-breed imputation was performed. Conclusion In all breeds, imputation accuracy was higher than 97%, which indicates that imputation to the high-density chip was accurate. Imputation accuracy depends mainly on the size of the reference population and the relationship between reference and target populations. PMID:24004563
Batistatou, Evridiki; McNamee, Roseanne
2012-12-10
It is known that measurement error leads to bias in assessing exposure effects, which can, however, be corrected if independent replicates are available. For expensive replicates, two-stage (2S) studies that produce data 'missing by design' may be preferred over a single-stage (1S) study, because in the second stage, measurement of replicates is restricted to a sample of first-stage subjects. Motivated by an occupational study on the acute effect of carbon black exposure on respiratory morbidity, we compare the performance of several bias-correction methods for both designs in a simulation study: an instrumental variable method (EVROS IV) based on grouping strategies, which had been recommended especially when measurement error is large, the regression calibration and the simulation extrapolation methods. For the 2S design, either the problem of 'missing' data was ignored or the 'missing' data were imputed using multiple imputations. Both in 1S and 2S designs, in the case of small or moderate measurement error, regression calibration was shown to be the preferred approach in terms of root mean square error. For 2S designs, regression calibration as implemented in Stata software is not recommended, in contrast to our implementation of this method, although the 'problematic' implementation is substantially improved with the use of multiple imputations. The EVROS IV method, under a good/fairly good grouping, outperforms the regression calibration approach in both design scenarios when exposure mismeasurement is severe. Both in 1S and 2S designs with moderate or large measurement error, simulation extrapolation severely failed to correct for bias. Copyright © 2012 John Wiley & Sons, Ltd.
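How independent replicates allow a bias correction can be shown with a minimal sketch of the textbook regression calibration idea for a single exposure in a linear model. It is not the implementation evaluated in the abstract. It assumes the classical additive error model (each replicate equals the true exposure plus independent noise), the same number of replicates for every subject, and a naive slope that was fitted on the per-subject replicate means; the function names are hypothetical.

```python
from statistics import mean, variance


def reliability_ratio(replicates):
    """Estimate the attenuation factor of a slope fitted on replicate means.

    `replicates` holds one list of k >= 2 repeated measurements per subject.
    The within-subject variance estimates the error variance; the variance of
    the subject means estimates true-exposure variance plus error variance / k.
    """
    k = len(replicates[0])
    if k < 2 or any(len(r) != k for r in replicates):
        raise ValueError("need the same number (>= 2) of replicates per subject")
    subject_means = [mean(r) for r in replicates]
    error_var = mean([variance(r) for r in replicates])
    total_var = variance(subject_means)
    return (total_var - error_var / k) / total_var


def corrected_slope(naive_slope, replicates):
    """Undo the attenuation of a naive slope estimated from replicate means."""
    return naive_slope / reliability_ratio(replicates)
```

With large measurement error the estimated ratio can approach zero or turn negative, which makes the correction unstable. That is consistent with the abstract's finding that regression calibration is preferable mainly when the error is small or moderate.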
Dysregulation of the mitogen granulin in human cancer through the miR-15/107 microRNA gene group
Wang, Wang-Xia; Kyprianou, Natasha; Wang, Xiaowei; Nelson, Peter T.
2010-01-01
Granulin (GRN) is a potent mitogen and growth factor implicated in many human cancers, but its regulation is poorly understood. Recent findings indicate that GRN is regulated strongly by the microRNA miR-107, which functionally overlaps with miR-15, miR-16, and miR-195 due to a common 5' sequence critical for target specificity. In this study, we queried whether miR-107 and paralogs regulated GRN in human cancers. In cultured cells, anti-Argonaute RIP-ChIP experiments indicate that GRN mRNA is directly targeted by numerous miR-15/107 miRNAs. We further tested this association in human tumors. MiR-15 and miR-16 are known to be downregulated in chronic lymphocytic leukemia (CLL). Using pre-existing microarray datasets, we found that GRN expression is higher in CLL relative to non-neoplastic lymphocytes (P < 0.00001). By contrast, other prospective miR-15/miR-16 targets in the dataset (BCL-2 and cyclin D1) were not up-regulated in CLL. Unlike in CLL, GRN was not up-regulated in chronic myelogenous leukemia (CML) where miR-107 paralogs are not known to be dysregulated. Prior studies have shown that GRN is also up-regulated, and miR-107 down-regulated, in prostate carcinoma. Our results indicate that multiple members of the miR-107 gene group indeed repress GRN protein levels when transfected into prostate cancer cells. At least a dozen distinct types of cancer have the pattern of increased GRN and decreased miR-107 expression. These findings indicate for the first time that the mitogen and growth factor GRN is dysregulated via the miR-15/107 gene group in multiple human cancers, which may provide a potential common therapeutic target. PMID:20884628
miR-132, an experience-dependent microRNA, is essential for visual cortex plasticity
Mellios, Nikolaos; Sugihara, Hiroki; Castro, Jorge; Banerjee, Abhishek; Le, Chuong; Kumar, Arooshi; Crawford, Benjamin; Strathmann, Julia; Tropea, Daniela; Levine, Stuart S.; Edbauer, Dieter; Sur, Mriganka
2011-01-01
Using multiple quantitative analyses, we discovered microRNAs (miRNAs) abundantly expressed in visual cortex that respond to dark-rearing (DR) and/or monocular deprivation (MD). The most significantly altered miRNA, miR-132, was rapidly upregulated after eye-opening and delayed by DR. In vivo inhibition of miR-132 prevented ocular dominance plasticity in identified neurons following MD, and affected maturation of dendritic spines, demonstrating its critical role in the plasticity of visual cortex circuits. PMID:21892155
Fernandez, Serena; Risolino, Maurizio; Verde, Pasquale
2015-01-01
Oncosuppressor miRNAs inhibit cancer cell proliferation by targeting key components of the cell cycle machinery. In our recent report we showed that miR-340 is a novel tumor suppressor in non-small cell lung cancer. miR-340 inhibits neoplastic cell proliferation and induces p27KIP1 by targeting multiple translational and post-translational regulators of this cyclin-dependent kinase inhibitor. PMID:27308439
Partitioning error components for accuracy-assessment of near-neighbor methods of imputation
Albert R. Stage; Nicholas L. Crookston
2007-01-01
Imputation is applied for two quite different purposes: to supply missing data to complete a data set for subsequent modeling analyses or to estimate subpopulation totals. Error properties of the imputed values have different effects in these two contexts. We partition errors of imputation derived from similar observation units as arising from three sources:...
Loong, Bronwyn; Zaslavsky, Alan M; He, Yulei; Harrington, David P
2013-10-30
Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents' identities and sensitive attributes by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by the Cancer Care Outcomes Research and Surveillance (CanCORS) project, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the USA. We review inferential methods for partially synthetic data and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data and discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model, to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are suited best for preliminary data analyses purposes. These methods address the requirement to share data in clinical research without compromising confidentiality. Copyright © 2013 John Wiley & Sons, Ltd.
Ren, Qian; Huang, Xin; Cui, Yalei; Sun, Jiejie; Wang, Wen
2017-01-01
ABSTRACT In eukaryotes, microRNAs (miRNAs) serve as regulators of many biological processes, including virus infection. An miRNA can generally target diverse genes during virus-host interactions. However, the regulation of gene expression by multiple miRNAs has not yet been extensively explored during virus infection. This study found that the Spaztle (Spz)-Toll-Dorsal-antilipopolysaccharide factor (ALF) signaling pathway plays a very important role in antiviral immunity against invasion of white spot syndrome virus (WSSV) in shrimp (Marsupenaeus japonicus). Dorsal, the central gene in the Toll pathway, was targeted by two viral miRNAs (WSSV-miR-N13 and WSSV-miR-N23) during WSSV infection. The regulation of Dorsal expression by viral miRNAs suppressed the Spz-Toll-Dorsal-ALF signaling pathway in shrimp in vivo, leading to virus infection. Our study contributes novel insights into the viral miRNA-mediated Toll signaling pathway during the virus-host interaction. IMPORTANCE An miRNA can target diverse genes during virus-host interactions. However, the regulation of gene expression by multiple miRNAs during virus infection has not yet been extensively explored. The results of this study indicated that the shrimp Dorsal gene, the central gene in the Toll pathway, was targeted by two viral miRNAs during infection with white spot syndrome virus. Regulation of Dorsal expression by viral miRNAs suppressed the Spz-Toll-Dorsal-ALF signaling pathway in shrimp in vivo, leading to virus infection. Our study provides new insight into the viral miRNA-mediated Toll signaling pathway in virus-host interactions. PMID:28179524
Ren, Qian; Huang, Xin; Cui, Yalei; Sun, Jiejie; Wang, Wen; Zhang, Xiaobo
2017-04-15
In eukaryotes, microRNAs (miRNAs) serve as regulators of many biological processes, including virus infection. An miRNA can generally target diverse genes during virus-host interactions. However, the regulation of gene expression by multiple miRNAs has not yet been extensively explored during virus infection. This study found that the Spaztle (Spz)-Toll-Dorsal-antilipopolysaccharide factor (ALF) signaling pathway plays a very important role in antiviral immunity against invasion of white spot syndrome virus (WSSV) in shrimp (Marsupenaeus japonicus). Dorsal, the central gene in the Toll pathway, was targeted by two viral miRNAs (WSSV-miR-N13 and WSSV-miR-N23) during WSSV infection. The regulation of Dorsal expression by viral miRNAs suppressed the Spz-Toll-Dorsal-ALF signaling pathway in shrimp in vivo, leading to virus infection. Our study contributes novel insights into the viral miRNA-mediated Toll signaling pathway during the virus-host interaction. IMPORTANCE An miRNA can target diverse genes during virus-host interactions. However, the regulation of gene expression by multiple miRNAs during virus infection has not yet been extensively explored. The results of this study indicated that the shrimp Dorsal gene, the central gene in the Toll pathway, was targeted by two viral miRNAs during infection with white spot syndrome virus. Regulation of Dorsal expression by viral miRNAs suppressed the Spz-Toll-Dorsal-ALF signaling pathway in shrimp in vivo, leading to virus infection. Our study provides new insight into the viral miRNA-mediated Toll signaling pathway in virus-host interactions. Copyright © 2017 American Society for Microbiology.
Integration of Multiple Genomic and Phenotype Data to Infer Novel miRNA-Disease Associations
Zhou, Meng; Cheng, Liang; Yang, Haixiu; Wang, Jing; Sun, Jie; Wang, Zhenzhen
2016-01-01
MicroRNAs (miRNAs) play an important role in the development and progression of human diseases. The identification of disease-associated miRNAs will be helpful for understanding the molecular mechanisms of diseases at the post-transcriptional level. Based on different types of genomic data sources, computational methods for miRNA-disease association prediction have been proposed. However, individual sources of genomic data tend to be incomplete and noisy; therefore, the integration of various types of genomic data for inferring reliable miRNA-disease associations is urgently needed. In this study, we present a computational framework, CHNmiRD, for identifying miRNA-disease associations by integrating multiple genomic and phenotype data, including protein-protein interaction data, gene ontology data, experimentally verified miRNA-target relationships, disease phenotype information and known miRNA-disease connections. The performance of CHNmiRD was evaluated on experimentally verified miRNA-disease associations, achieving an area under the ROC curve (AUC) of 0.834 in 5-fold cross-validation. In particular, CHNmiRD displayed excellent performance for diseases without any known related miRNAs. The results of case studies for three human diseases (glioblastoma, myocardial infarction and type 1 diabetes) showed that all of the top 10 ranked miRNAs having no known associations with these three diseases in existing miRNA-disease databases were directly or indirectly confirmed by our latest literature mining. All these results demonstrated the reliability and efficiency of CHNmiRD, and it is anticipated that CHNmiRD will serve as a powerful bioinformatics method for mining novel disease-related miRNAs and providing a new perspective into molecular mechanisms underlying human diseases at the post-transcriptional level. CHNmiRD is freely available at http://www.bio-bigdata.com/CHNmiRD. PMID:26849207
Integration of Multiple Genomic and Phenotype Data to Infer Novel miRNA-Disease Associations.
Shi, Hongbo; Zhang, Guangde; Zhou, Meng; Cheng, Liang; Yang, Haixiu; Wang, Jing; Sun, Jie; Wang, Zhenzhen
2016-01-01
MicroRNAs (miRNAs) play an important role in the development and progression of human diseases. The identification of disease-associated miRNAs will be helpful for understanding the molecular mechanisms of diseases at the post-transcriptional level. Based on different types of genomic data sources, computational methods for miRNA-disease association prediction have been proposed. However, individual sources of genomic data tend to be incomplete and noisy; therefore, the integration of various types of genomic data for inferring reliable miRNA-disease associations is urgently needed. In this study, we present a computational framework, CHNmiRD, for identifying miRNA-disease associations by integrating multiple genomic and phenotype data, including protein-protein interaction data, gene ontology data, experimentally verified miRNA-target relationships, disease phenotype information and known miRNA-disease connections. The performance of CHNmiRD was evaluated on experimentally verified miRNA-disease associations, achieving an area under the ROC curve (AUC) of 0.834 in 5-fold cross-validation. In particular, CHNmiRD displayed excellent performance for diseases without any known related miRNAs. The results of case studies for three human diseases (glioblastoma, myocardial infarction and type 1 diabetes) showed that all of the top 10 ranked miRNAs having no known associations with these three diseases in existing miRNA-disease databases were directly or indirectly confirmed by our latest literature mining. All these results demonstrated the reliability and efficiency of CHNmiRD, and it is anticipated that CHNmiRD will serve as a powerful bioinformatics method for mining novel disease-related miRNAs and providing a new perspective into molecular mechanisms underlying human diseases at the post-transcriptional level. CHNmiRD is freely available at http://www.bio-bigdata.com/CHNmiRD.
Zitzer, Nina C; Snyder, Katiri; Meng, Xiamoei; Taylor, Patricia A; Efebera, Yvonne A; Devine, Steven M; Blazar, Bruce R; Garzon, Ramiro; Ranganathan, Parvathi
2018-06-15
MicroRNA-155 (miR-155) is a small noncoding RNA critical for the regulation of inflammation as well as innate and adaptive immune responses. MiR-155 has been shown to be dysregulated in both donor and recipient immune cells during acute graft-versus-host disease (aGVHD). We previously reported that miR-155 is upregulated in donor T cells of mice and humans with aGVHD and that mice receiving miR-155-deficient (miR155-/-) splenocytes had markedly reduced aGVHD. However, molecular mechanisms by which miR-155 modulates T cell function in aGVHD have not been fully investigated. We identify that miR-155 expression in both donor CD8+ T cells and conventional CD4+CD25- T cells is pivotal for aGVHD pathogenesis. Using murine aGVHD transplant experiments, we show that miR-155 strongly impacts alloreactive T cell expansion through multiple distinct mechanisms, modulating proliferation in CD8+ donor T cells and promoting exhaustion in donor CD4+ T cells in both the spleen and colon. Additionally, miR-155 drives a proinflammatory Th1 phenotype in donor T cells in these two sites, and miR-155-/- donor T cells are polarized toward an IL-4-producing Th2 phenotype. We further demonstrate that miR-155 expression in donor T cells regulates CCR5 and CXCR4 chemokine-dependent migration. Notably, we show that miR-155 expression is crucial for donor T cell infiltration into multiple target organs. These findings provide further understanding of the role of miR-155 in modulating aGVHD through T cell expansion, effector cytokine production, and migration. Copyright © 2018 by The American Association of Immunologists, Inc.
ERIC Educational Resources Information Center
Azid, Nurulwahida Hj; Yaacob, Aizan; Shaik-Abdullah, Sarimah
2016-01-01
Purpose: Howard Gardner's concept of multiple intelligence (MI) offers an alternative perspective on intelligence which highlights the importance of acknowledging learner diversity, individual talents and the development of human potentials. MI has been used as a basis for the construction of modular enrichment activities to facilitate the…
ERIC Educational Resources Information Center
Peifer, Nancy
2012-01-01
The purpose of this study was to contribute to the academic discussion regarding the validity of Multiple Intelligences (MI) theory through focusing on the validity of an important construct embedded in the theory, that of congruence between instructional style and preferred MI style for optimal learning. Currently there is insufficient empirical…
Baker, Jannah; White, Nicole; Mengersen, Kerrie
2014-11-20
Spatial analysis is increasingly important for identifying modifiable geographic risk factors for disease. However, spatial health data from surveys are often incomplete, ranging from missing data for only a few variables, to missing data for many variables. For spatial analyses of health outcomes, selection of an appropriate imputation method is critical in order to produce the most accurate inferences. We present a cross-validation approach to select between three imputation methods for health survey data with correlated lifestyle covariates, using as a case study, type II diabetes mellitus (DM II) risk across 71 Queensland Local Government Areas (LGAs). We compare the accuracy of mean imputation to imputation using multivariate normal and conditional autoregressive prior distributions. Choice of imputation method depends upon the application and is not necessarily the most complex method. Mean imputation was selected as the most accurate method in this application. Selecting an appropriate imputation method for health survey data, after accounting for spatial correlation and correlation between covariates, allows more complete analysis of geographic risk factors for disease with more confidence in the results to inform public policy decision-making.
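The cross-validation idea in the abstract above (hide values that were actually observed, impute them with each candidate method, and keep the method with the smallest error) can be sketched as follows. This is a minimal illustration only, not the authors' procedure: the two candidates here are mean imputation and a simple linear regression on one covariate, which stands in for the multivariate normal and conditional autoregressive models named in the abstract. All function names are hypothetical.

```python
import math
import random

def mean_imputer(train):
    """Mean imputation: predict the training mean of y regardless of x."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def regression_imputer(train):
    """Simple linear regression of y on x (assumed stand-in for a model-based imputer)."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)

def cv_rmse(observed, fit, k=5, seed=0):
    """K-fold CV: hide each fold's y values, impute them, score against the truth."""
    idx = list(range(len(observed)))
    random.Random(seed).shuffle(idx)
    sq_err = 0.0
    for f in range(k):
        held = set(idx[f::k])
        train = [observed[i] for i in idx if i not in held]
        impute = fit(train)
        sq_err += sum((impute(observed[i][0]) - observed[i][1]) ** 2 for i in held)
    return math.sqrt(sq_err / len(observed))

def select_imputer(observed, candidates, k=5):
    """Return the name of the candidate with the lowest CV error, plus all scores."""
    scores = {name: cv_rmse(observed, fit, k) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores
```

Which method wins depends on the data: when the covariate carries little information about the missing variable, the simpler mean imputation can score as well as or better than the model-based candidate, which is consistent with the abstract's remark that the most complex method is not necessarily the best choice.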
An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.
Liu, Yuzhe; Gopalakrishnan, Vanathi
2017-03-01
Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
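Of the four methods named in the abstract above, k-nearest-neighbors imputation is perhaps the easiest to show in a few lines. The sketch below is an assumption-laden illustration, not the implementation used in the study: missing entries are marked with `None`, distance is a root-mean-square difference over the columns both rows have observed, and the function name `knn_impute` is hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest rows
    that have the column observed (distance uses shared observed columns only)."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue  # no common columns, so no distance can be computed
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the gap if no usable neighbor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

Neighbors are always looked up in the original `rows`, so an imputed value never feeds into another imputation; that is a design choice of this sketch rather than a property of every kNN imputer.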
Motivational interviewing: helping patients move toward change.
Richardson, Luann
2012-01-01
Motivational Interviewing (MI) is a valuable tool for nurses to help patients address behavior change. MI has been found effective for helping patients with multiple chronic conditions, adherence issues, and lifestyle issues change their health behaviors. For Christian nurses, MI is consistent with biblical principles and can be seen as a form of ministry. This article overviews the process of MI, stages of change, and offers direction for further learning.
Identifying microRNAs that Regulate Neuroblastoma Cell Differentiation
2015-10-01
Award Number: W81XWH-13-1-0241 TITLE: Identifying microRNAs that Regulate Neuroblastoma Cell Differentiation PRINCIPAL INVESTIGATOR: Dr. Liqin Du...inducing miRNA, miR-449a. We examined the differentiation-inducing function of miR-449a in multiple neuroblastoma cell lines. We have demonstrated that...miR-449a functions as an inducer of cell differentiation in neuroblastoma cell lines with distinct genetic backgrounds, including the MYCN
Dynamic evolution and biogenesis of small RNAs during sex reversal.
Liu, Jie; Luo, Majing; Sheng, Yue; Hong, Qiang; Cheng, Hanhua; Zhou, Rongjia
2015-05-06
Understanding the origin, evolution and functions of small RNA (sRNA) genes has been a great challenge in the past decade. Molecular mechanisms underlying sexual reversal in vertebrates, particularly sRNAs involved in this process, are largely unknown. By deep-sequencing of small RNA transcriptomes in combination with genomic analysis, we identified a large number of piRNAs and miRNAs including over 1,000 novel miRNAs, which were differentially expressed during gonad reversal from ovary to testis via ovotestis. Biogenesis and expression of miRNAs changed dynamically during the reversal. Notably, phylogenetic analysis revealed dynamic expansions of miRNAs in vertebrates and an evolutionary trajectory of the conserved miR-17-92 cluster in the Eukarya. We showed that the miR-17-92 cluster in vertebrates was generated through multiple duplications from ancestor miR-92 in invertebrates Tetranychus urticae and Daphnia pulex from the Chelicerata around 580 Mya. Moreover, we identified the sexual regulator Dmrt1 as a direct target of the members miR-19a and -19b in the cluster. These data suggested dynamic biogenesis and expression of small RNAs during sex reversal and revealed multiple expansions and an evolutionary trajectory of miRNAs from invertebrates to vertebrates, which implicate small RNAs in sexual reversal and provide new insight into evolutionary and molecular mechanisms underlying sexual reversal.
Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.
Ernst, Jason; Kellis, Manolis
2015-04-01
With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.
NASA Astrophysics Data System (ADS)
Kong, Jing
This thesis includes 4 pieces of work. In Chapter 1, we present a method for examining mortality as it is seen to run in families, and lifestyle factors that are also seen to run in families, in a subpopulation of the Beaver Dam Eye Study that had died by 2011. We find significant distance correlations between death ages, lifestyle factors, and family relationships. Considering only sib pairs compared to unrelated persons, distance correlation between siblings and mortality is, not surprisingly, stronger than that between more distantly related family members and mortality. Chapter 2 introduces a feature screening procedure with the use of distance correlation and covariance. We demonstrate a property for distance covariance, which is incorporated in a novel feature screening procedure based on distance correlation as a stopping criterion. The approach is further applied to two real examples, namely the famous small round blue cell tumors data and the Cancer Genome Atlas ovarian cancer data. Chapter 3 turns to right-censored human longevity data and the estimation of lifetime expectancy. We propose a general framework of backward multiple imputation for estimating the conditional lifetime expectancy function and the variance of the estimator in the right-censoring setting and prove the properties of the estimator. In addition, we apply the method to the Beaver Dam Eye Study data to study human longevity, where the expected human lifetime is modeled with smoothing spline ANOVA based on covariates including baseline age, gender, lifestyle factors and disease variables. Chapter 4 compares two imputation methods for right-censored data, namely the famous Buckley-James estimator and the backward imputation method proposed in Chapter 3, and shows that the backward imputation method is less biased and more robust to heterogeneity.
Navarro, Alfons; Díaz, Tania; Tovar, Natalia; Pedrosa, Fabiola; Tejero, Rut; Cibeira, María Teresa; Magnano, Laura; Rosiñol, Laura; Monzó, Mariano; Bladé, Joan; de Larrea, Carlos Fernández
2015-01-01
We have examined serum microRNA expression in multiple myeloma (MM) patients at diagnosis and at complete response (CR) after autologous stem-cell transplantation (ASCT), in patients with stable monoclonal gammopathy of undetermined significance, and in healthy controls. MicroRNAs were first profiled using TaqMan Human MicroRNA Arrays. Differentially expressed microRNAs were then validated by individual TaqMan MicroRNA assays and correlated with CR and progression-free survival (PFS) after ASCT. Supervised analysis identified a differentially expressed 14-microRNA signature. The differential expression of miR-16 (P = 0.028), miR-17 (P = 0.016), miR-19b (P = 0.009), miR-20a (P = 0.017) and miR-660 (P = 0.048) at diagnosis and CR was then confirmed by individual assays. In addition, high levels of miR-25 were related to the presence of oligoclonal bands (P = 0.002). Longer PFS after ASCT was observed in patients with high levels of miR-19b (6 vs. 1.8 years; P < 0.001) or miR-331 (8.6 vs. 2.9 years; P = 0.001). Low expression of both miR-19b and miR-331 in combination was a marker of shorter PFS (HR 5.3; P = 0.033). We have identified a serum microRNA signature with potential as a diagnostic and prognostic tool in MM. PMID:25593199
Botta, C; Cucè, M; Pitari, M R; Caracciolo, D; Gullà, A; Morelli, E; Riillo, C; Biamonte, L; Gallo Cantafio, M E; Prabhala, R; Mignogna, C; Di Vito, A; Altomare, E; Amodio, N; Di Martino, M T; Correale, P; Rossi, M; Giordano, A; Munshi, N C; Tagliaferri, P; Tassone, P
2018-01-01
Dendritic cells (DCs) have a key role in regulating tumor immunity, tumor cell growth and drug resistance. We hypothesized that multiple myeloma (MM) cells might recruit and reprogram DCs to a tumor-permissive phenotype by changes within their microRNA (miRNA) network. By analyzing six different miRNA-profiling data sets, miR-29b was identified as the only miRNA upregulated in normal mature DCs and significantly downregulated in tumor-associated DCs. This finding was validated in primary DCs co-cultured in vitro with MM cell lines and in primary bone marrow DCs from MM patients. In DCs co-cultured with MM cells, enforced expression of miR-29b counteracted pro-inflammatory pathways, including signal transducer and activator of transcription 3 and nuclear factor-κB, and cytokine/chemokine signaling networks, which correlated with patients’ adverse prognosis and development of bone disease. Moreover, miR-29b downregulated interleukin-23 in vitro and in the SCID-synth-hu in vivo model, and antagonized a Th17 inflammatory response. All together, these effects translated into strong anti-proliferative activity and reduction of genomic instability of MM cells. Our study demonstrates that MM reprograms the DCs functional phenotype by downregulating miR-29b whose reconstitution impairs DCs ability to sustain MM cell growth and survival. These results underscore miR-29b as an innovative and attractive candidate for miRNA-based immune therapy of MM. PMID:29158557
Using Bayesian Imputation to Assess Racial and Ethnic Disparities in Pediatric Performance Measures.
Brown, David P; Knapp, Caprice; Baker, Kimberly; Kaufmann, Meggen
2016-06-01
To analyze health care disparities in pediatric quality of care measures and determine the impact of data imputation. Five HEDIS measures are calculated based on 2012 administrative data for 145,652 children in two public insurance programs in Florida. The Bayesian Improved Surname and Geocoding (BISG) imputation method is used to impute missing race and ethnicity data for 42 percent of the sample (61,954 children). Models are estimated with and without the imputed race and ethnicity data. Dropping individuals with missing race and ethnicity data biases quality of care measures for minorities downward relative to nonminority children for several measures. These results provide further support for the importance of appropriately accounting for missing race and ethnicity data through imputation methods. © Health Research and Educational Trust.
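The abstract above does not spell out how BISG combines its two inputs. In its usual formulation it is a Bayes update: a surname-based prior P(race | surname) is multiplied by a geography likelihood P(location | race) and renormalized. The sketch below shows only that arithmetic; the category labels and probabilities are made up for illustration, and `bisg_posterior` is a hypothetical name, not part of any published BISG software.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Bayes update: posterior(r) is proportional to P(r | surname) * P(geo | r)."""
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    return {race: value / total for race, value in unnormalized.items()}

# Hypothetical inputs for one child: the surname alone is uninformative,
# but the residential area is three times as likely under group "B".
prior = {"A": 0.5, "B": 0.5}
likelihood = {"A": 0.2, "B": 0.6}
posterior = bisg_posterior(prior, likelihood)
```

With these illustrative inputs the posterior shifts to 0.25 for "A" and 0.75 for "B". Analyses that use such probabilities typically either assign the most probable category or carry the full probability vector forward as weights.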
A Review On Missing Value Estimation Using Imputation Algorithm
NASA Astrophysics Data System (ADS)
Armina, Roslan; Zain, Azlan Mohd; Azizah Ali, Nor; Sallehuddin, Roselina
2017-09-01
The presence of missing values in a data set has always been a major problem for precise prediction. A method for imputing missing values needs to minimize the effect of incomplete data sets on the prediction model. Many algorithms have been proposed to counter the missing value problem. In this review, we provide a comprehensive analysis of existing imputation algorithms, focusing on the techniques used and on the use of global or local information from the data sets for missing value estimation. In addition, validation methods for imputation results and ways to measure the performance of imputation algorithms are described. The objective of this review is to highlight possible improvements to existing methods, and it is hoped that this review gives readers a better understanding of trends in imputation methods.
Circulating microRNAs as Potential Biomarkers of Infectious Disease
Correia, Carolina N.; Nalpas, Nicolas C.; McLoughlin, Kirsten E.; Browne, John A.; Gordon, Stephen V.; MacHugh, David E.; Shaughnessy, Ronan G.
2017-01-01
microRNAs (miRNAs) are a class of small non-coding endogenous RNA molecules that regulate a wide range of biological processes by post-transcriptionally regulating gene expression. Thousands of these molecules have been discovered to date, and multiple miRNAs have been shown to coordinately fine-tune cellular processes key to organismal development, homeostasis, neurobiology, immunobiology, and control of infection. The fundamental regulatory role of miRNAs in a variety of biological processes suggests that differential expression of these transcripts may be exploited as a novel source of molecular biomarkers for many different disease pathologies or abnormalities. This has been emphasized by the recent discovery of remarkably stable miRNAs in mammalian biofluids, which may originate from intracellular processes elsewhere in the body. The potential of circulating miRNAs as biomarkers of disease has mainly been demonstrated for various types of cancer. More recently, however, attention has focused on the use of circulating miRNAs as diagnostic/prognostic biomarkers of infectious disease; for example, human tuberculosis caused by infection with Mycobacterium tuberculosis, sepsis caused by multiple infectious agents, and viral hepatitis. Here, we review these developments and discuss prospects and challenges for translating circulating miRNA into novel diagnostics for infectious disease. PMID:28261201
Yamaguchi, Takeshi; Kataoka, Kensuke; Watanabe, Kenji; Orii, Hidefumi
2014-02-01
DEADSouth mRNA encoding the RNA helicase DDX25 is a component of the germ plasm in Xenopus laevis. We investigated the mechanisms underlying its specific mRNA expression in primordial germ cells (PGCs). Based on our previous findings of several microRNA miR-427 recognition elements (MREs) in the 3' untranslated region of the mRNA, we first examined whether DEADSouth mRNA was degraded by miR-427 targeting in somatic cells. Injection of antisense miR-427 oligomer and reporter mRNA for mutated MREs revealed that DEADSouth mRNA was potentially degraded in somatic cells via miR-427 targeting, but not in PGCs after the mid-blastula transition (MBT). The expression level of miR-427 was very low in PGCs, which probably resulted in the lack of miR-427-mediated degradation. In addition, the DEADSouth gene was expressed zygotically after MBT. Thus, the predominant expression of DEADSouth mRNA in the PGCs is ensured by multiple mechanisms including zygotic expression and prohibition from miR-427-mediated degradation. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
The use of multiple imputation in the Southern Annual Forest Inventory System
Gregory A. Reams; Joseph M. McCollum
2000-01-01
The Southern Research Station is currently implementing an annual forest survey in 7 of the 13 States that it is responsible for surveying. The Southern Annual Forest Inventory System (SAFIS) sampling design is a systematic sample of five interpenetrating grids, whereby an equal number of plots are measured each year. The area-representative and time-series...
ERIC Educational Resources Information Center
Monahan, Kathryn C.; Lee, Joanna M.; Steinberg, Laurence
2011-01-01
The impact of part-time employment on adolescent functioning remains unclear because most studies fail to adequately control for differential selection into the workplace. The present study reanalyzes data from L. Steinberg, S. Fegley, and S. M. Dornbusch (1993) using multiple imputation, which minimizes bias in effect size estimation, and 2 types…
ERIC Educational Resources Information Center
Finch, Holmes
2011-01-01
Methods of uniform differential item functioning (DIF) detection have been extensively studied in the complete data case. However, less work has been done examining the performance of these methods when missing item responses are present. Research that has been done in this regard appears to indicate that treating missing item responses as…
Mercer, Theresa G; Frostick, Lynne E; Walmsley, Anthony D
2011-10-15
This paper presents a statistical technique that can be applied to environmental chemistry data where missing values and limit of detection levels prevent the application of statistics. A working example is taken from an environmental leaching study that was set up to determine if there were significant differences in levels of leached arsenic (As), chromium (Cr) and copper (Cu) between lysimeters containing preservative treated wood waste and those containing untreated wood. Fourteen lysimeters were set up and left in natural conditions for 21 weeks. The resultant leachate was analysed by ICP-OES to determine the As, Cr and Cu concentrations. However, due to the variation inherent in each lysimeter combined with the limits of detection offered by ICP-OES, the collected quantitative data was somewhat incomplete. Initial data analysis was hampered by the number of 'missing values' in the data. To recover the dataset, the statistical tool of Statistical Multiple Imputation (SMI) was applied, and the data was re-analysed successfully. It was demonstrated that using SMI did not affect the variance in the data, but facilitated analysis of the complete dataset. Copyright © 2011 Elsevier B.V. All rights reserved.
Simultaneous and multiplexed detection of exosome microRNAs using molecular beacons.
Lee, Ji Hye; Kim, Jeong Ah; Jeong, Seunga; Rhee, Won Jong
2016-12-15
Simultaneous and multiplexed detection of microRNAs (miRNAs) in a whole exosome is developed, which can be utilized as a PCR-free efficient diagnosis method for various diseases. Exosomes are small extracellular vesicles that contain biomarker miRNAs from parental cells. Because they circulate throughout bodily fluids, exosomal biomarkers offer great advantages for diagnosis in many aspects. In general, PCR-based methods can be used for exosomal miRNA detection but they are laborious, expensive, and time-consuming, which make them unsuitable for high-throughput diagnosis of diseases. Previously, we reported that single miRNA in the exosomes can be detected specifically using an oligonucleotide probe or molecular beacon. Herein, we demonstrate for the first time that multiple miRNAs can be detected simultaneously in exosomes using miRNA-targeting molecular beacons. Exosomes from a breast cancer cell line, MCF-7, were used for the production of exosomes because MCF-7 has a high level of miR-21, miR-375, and miR-27a as target miRNAs. Molecular beacons successfully hybridized with multiple miRNAs in the cancer cell-derived exosomes even in the presence of high human serum concentration. In addition, it is noteworthy that the choice of fluorophores for multiplexing biomarkers in an exosome is crucial because of its small size. The proposed method described in this article is beneficial to high-throughput analysis for disease diagnosis, prognosis, and response to treatment because it is a time-, labor-, and cost-saving technique. Copyright © 2016 Elsevier B.V. All rights reserved.
Alternate approaches to repress endogenous microRNA activity in Arabidopsis thaliana
Wang, Ming-Bo
2011-01-01
MicroRNAs (miRNAs) are an endogenous class of regulatory small RNA (sRNA). In plants, miRNAs are processed from short non-protein-coding messenger RNAs (mRNAs) transcribed from small miRNA genes (MIR genes). Traditionally in the model plant Arabidopsis thaliana (Arabidopsis), the functional analysis of a gene product has relied on the identification of a corresponding T-DNA insertion knockout mutant from a large, randomly-mutagenized population. However, because of the small size of MIR genes and presence of multiple, highly conserved members in most plant miRNA families, it has been extremely laborious and time consuming to obtain a corresponding single or multiple, null mutant plant line. Our recent study published in Molecular Plant [1] outlines an alternate method for the functional characterization of miRNA action in Arabidopsis, termed anti-miRNA technology. Using this approach we demonstrated that the expression of individual miRNAs or entire miRNA families, can be readily and efficiently knocked-down. Our approach is in addition to two previously reported methodologies that also allow for the targeted suppression of either individual miRNAs, or all members of a MIR gene family; these include miRNA target mimicry [2,3] and transcriptional gene silencing (TGS) of MIR gene promoters [4]. All three methodologies rely on endogenous gene regulatory machinery and in this article we provide an overview of these technologies and discuss their strengths and weaknesses in inhibiting the activity of their targeted miRNA(s). PMID:21358288
Cleveland, M A; Hickey, J M
2013-08-01
Genomic selection can be implemented in pig breeding at a reduced cost using genotype imputation. Accuracy of imputation and the impact on resulting genomic breeding values (gEBV) was investigated. High-density genotype data was available for 4,763 animals from a single pig line. Three low-density genotype panels were constructed with SNP densities of 450 (L450), 3,071 (L3k) and 5,963 (L6k). Accuracy of imputation was determined using 184 test individuals with no genotyped descendants in the data but with parents and grandparents genotyped using the Illumina PorcineSNP60 Beadchip. Alternative genotyping scenarios were created in which parents, grandparents, and individuals that were not direct ancestors of test animals (Other) were genotyped at high density (S1), grandparents were not genotyped (S2), dams and granddams were not genotyped (S3), and dams and granddams were genotyped at low density (S4). Four additional scenarios were created by excluding Other animal genotypes. Test individuals were always genotyped at low density. Imputation was performed with AlphaImpute. Genomic breeding values were calculated using the single-step genomic evaluation. Test animals were evaluated for the information retained in the gEBV, calculated as the correlation between gEBV using imputed genotypes and gEBV using true genotypes. Accuracy of imputation was high for all scenarios but decreased with fewer SNP on the low-density panel (0.995 to 0.965 for S1) and with reduced genotyping of ancestors, where the largest changes were for L450 (0.965 in S1 to 0.914 in S3). Exclusion of genotypes for Other animals resulted in only small accuracy decreases. Imputation accuracy was not consistent across the genome. Information retained in the gEBV was related to genotyping scenario and thus to imputation accuracy. Reducing the number of SNP on the low-density panel reduced the information retained in the gEBV, with the largest decrease observed from L3k to L450. 
Excluding Other animal genotypes had little impact on imputation accuracy but caused large decreases in the information retained in the gEBV. These results indicate that accuracy of gEBV from imputed genotypes depends on the level of genotyping in close relatives and the size of the genotyped dataset. Fewer high-density genotyped individuals are needed to obtain accurate imputation than are needed to obtain accurate gEBV. Strategies to optimize development of low-density panels can improve both imputation and gEBV accuracy.
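The "information retained in the gEBV" measure described in the abstract above is a correlation between two sets of breeding values for the same test animals, one computed from imputed genotypes and one from true genotypes. A minimal sketch of that calculation follows; it assumes a Pearson correlation (the abstract says only "correlation"), and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def information_retained(gebv_true, gebv_imputed):
    """Correlation between gEBV from true genotypes and gEBV from imputed genotypes."""
    return pearson(gebv_true, gebv_imputed)
```

A value near 1 means ranking animals on imputed-genotype gEBV would give nearly the same ordering as ranking on true-genotype gEBV. Because the measure is computed on breeding values rather than on genotypes, it can fall even when genotype-level imputation accuracy stays high, which matches the pattern the abstract reports when genotypes of Other animals are excluded.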
Wang, Qi; Hu, Weina; Lei, Mingming; Wang, Yong; Yan, Bing; Liu, Jun; Zhang, Ren; Jin, Yuanzhe
2013-01-01
To investigate if microRNAs (miRNAs) play a role in regulating h-ERG trafficking in the setting of chronic oxidative stress as a common deleterious factor for many cardiac disorders. We treated neonatal rat ventricular myocytes and HEK293 cells with stable expression of h-ERG with H2O2 for 12 h and 48 h. Expression of miR-17-5p seed miRNAs was quantified by real-time RT-PCR. Protein levels of chaperones and h-ERG trafficking were measured by Western blot analysis. Luciferase reporter gene assay was used to study miRNA and target interactions. Whole-cell patch-clamp techniques were employed to record h-ERG K(+) current. H-ERG trafficking was impaired by H2O2 after 48 h treatment, accompanied by reciprocal changes of expression between miR-17-5p seed miRNAs and several chaperones (Hsp70, Hsc70, CANX, and Golga2), with the former upregulated and the latter downregulated. We established these chaperones as targets for miR-17-5p. Application of miR-17-5p inhibitor rescued H2O2-induced impairment of h-ERG trafficking. Either upregulation of endogenous miR-17-5p by H2O2 or forced miR-17-5p expression reduced h-ERG current. Sequestration of AP1 by its decoy molecule eliminated the upregulation of miR-17-5p, and ameliorated impairment of h-ERG trafficking. Collectively, deregulation of the miR-17-5p seed family miRNAs can cause severe impairment of h-ERG trafficking through targeting multiple ER stress-related chaperones, and activation of AP1 likely accounts for the deleterious upregulation of these miRNAs, in the setting of prolonged duration of oxidative stress. These findings revealed the role of miRNAs in h-ERG trafficking, which may contribute to the cardiac electrical disturbances associated with oxidative stress.
Liu, Yanwei; Yan, Wei; Zhang, Wei; Chen, Lingchao; You, Gan; Bao, Zhaoshi; Wang, Yongzhi; Wang, Hongjun; Kang, Chunsheng; Jiang, Tao
2012-09-01
The invasive behavior of glioblastoma multiforme (GBM) cells is one of the most important reasons for the poor prognosis of this cancer. For invasion, tumor cells must acquire an ability to digest the extracellular matrix and infiltrate the normal tissue bordering the tumor. Preventing this by altering effector molecules can significantly improve a patient's prognosis. Accumulating evidence suggests that miRNAs are involved in multiple biological functions, including cell invasion, by altering the expression of multiple target genes. The expression levels of miR-218 correlate with the invasive potential of GBM cells. In this study, we found that miR-218 expression was low in glioma tissues, especially in GBM. The data showed an inverse correlation in 60 GBM tissues between the levels of miR-218 and MMP mRNAs (MMP-2, -7 and -9). Additionally, ectopic expression of miR-218 suppressed the invasion of GBM cells whereas inhibition of miR-218 expression enhanced the invasive ability. Numerous members of the MMP family are downstream effectors of the Wnt/LEF1 pathway. Target prediction databases and luciferase data showed that LEF1 is a new direct target of miR-218. Importantly, western blot assays demonstrated that miR-218 can reduce protein levels of LEF1 and MMP-9. We, therefore, hypothesize that miR-218 directly targets LEF1, resulting in reduced synthesis of MMP-9. Results suggest that miR-218 is involved in the invasive behavior of GBM cells and by targeting LEF1 and blocking the invasive axis, miR-218-LEF1-MMPs, it may be useful for developing potential clinical strategies.
NASA Astrophysics Data System (ADS)
Pandey, Rajesh; Bhattacharya, Aniket; Bhardwaj, Vivek; Jha, Vineet; Mandal, Amit K.; Mukerji, Mitali
2016-09-01
Primate-specific Alus harbor different regulatory features, including miRNA targets. In this study, we provide evidence for miRNA-mediated modulation of transcript isoform levels during heat-shock response through exaptation of Alu-miRNA sites in mature mRNA. We performed genome-wide expression profiling coupled with functional validation of miRNA target sites within exonized Alus, and analyzed conservation of these targets across primates. We observed that two miRNAs (miR-15a-3p and miR-302d-3p) elevated in stress response, target RAD1, GTSE1, NR2C1, FKBP9 and UBE2I exclusively within Alu. These genes map onto the p53 regulatory network. Ectopic overexpression of miR-15a-3p downregulates GTSE1 and RAD1 at the protein level and enhances cell survival. This Alu-mediated fine-tuning seems to be unique to humans as evident from the absence of orthologous sites in other primate lineages. We further analyzed signatures of selection on Alu-miRNA targets in the genome, using 1000 Genomes Phase-I data. We found that 198 out of 3177 Alu-exonized genes exhibit signatures of selection within Alu-miRNA sites, with 60 of them containing SNPs supported by multiple evidences (global-FST > 0.3, pair-wise-FST > 0.5, Fay-Wu’s H < -20, iHS > 2.0, high ΔDAF) and implicated in p53 network. We propose that by affecting multiple genes, Alu-miRNA interactions have the potential to facilitate population-level adaptations in response to environmental challenges.
Badke, Yvonne M; Bates, Ronald O; Ernst, Catherine W; Fix, Justin; Steibel, Juan P
2014-04-16
Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65-0.68). Using genotypes imputed from a large reference panel (accuracy: R(2) = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R(2) = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy. 
On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for selection results in lower accuracy of genomic evaluation.
Thelen, Joanie; Bruce, Amanda; Catley, Delwyn; Lynch, Sharon; Goggin, Kathy; Bradley-Ewing, Andrea; Glusman, Morgan; Norouzinia, Abigail; Strober, Lauren; Bruce, Jared
2018-04-01
Patients with multiple sclerosis (MS) are often nonadherent to their disease modifying therapy (DMT). While recent studies demonstrate enhanced DMT adherence following intervention grounded in motivational interviewing (MI), little is known about how to address DMT reinitiation among MS patients who have prematurely discontinued DMT against medical advice and do not intend to reinitiate. We examined baseline predictors of DMT reinitiation among patients with MS who discontinued medications against medical advice following a telephone-based MI and Cognitive Behavioral Therapy (MI-CBT) intervention. Following MI-CBT intervention, 66 patients reported whether or not they opted to reinitiate DMT. Rate of disease progression (β = 0.295) and perceived personal control (β = - 0.131) emerged as unique significant predictors of DMT reinitiation following intervention. Clinical characteristics and health-related beliefs may be used to prospectively identify patients most likely to reinitiate DMT following MI-CBT intervention, furthering the goal of preserving brain health and preventing neurologic decline in MS via appropriate DMT utilization. Further study is warranted to delineate potential mediators and moderators of DMT reinitiation outcomes.
A meta-data based method for DNA microarray imputation.
Jörnsten, Rebecka; Ouyang, Ming; Wang, Hui-Yu
2007-03-29
DNA microarray experiments are conducted in logical sets, such as time course profiling after a treatment is applied to the samples, or comparisons of the samples under two or more conditions. Due to cost and design constraints of spotted cDNA microarray experiments, each logical set commonly includes only a small number of replicates per condition. Despite the vast improvement of the microarray technology in recent years, missing values are prevalent. Intuitively, imputation of missing values is best done using many replicates within the same logical set. In practice, there are few replicates and thus reliable imputation within logical sets is difficult. However, it is in the case of few replicates that the presence of missing values, and how they are imputed, can have the most profound impact on the outcome of downstream analyses (e.g. significance analysis and clustering). This study explores the feasibility of imputation across logical sets, using the vast amount of publicly available microarray data to improve imputation reliability in the small sample size setting. We download all cDNA microarray data of Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans from the Stanford Microarray Database. Through cross-validation and simulation, we find that, for all three species, our proposed imputation using data from public databases is far superior to imputation within a logical set, sometimes to an astonishing degree. Furthermore, the imputation root mean square error for significant genes is generally a lot less than that of non-significant ones. Since downstream analysis of significant genes, such as clustering and network analysis, can be very sensitive to small perturbations of estimated gene effects, it is highly recommended that researchers apply reliable data imputation prior to further analysis. Our method can also be applied to cDNA microarray experiments from other species, provided good reference data are available.
Use of partial least squares regression to impute SNP genotypes in Italian cattle breeds.
Dimauro, Corrado; Cellesi, Massimo; Gaspa, Giustino; Ajmone-Marsan, Paolo; Steri, Roberto; Marras, Gabriele; Macciotta, Nicolò P P
2013-06-05
The objective of the present study was to test the ability of the partial least squares regression technique to impute genotypes from low density single nucleotide polymorphism (SNP) panels (i.e. 3K or 7K) to a high density panel with 50K SNP. No pedigree information was used. Data consisted of 2093 Holstein, 749 Brown Swiss and 479 Simmental bulls genotyped with the Illumina 50K Beadchip. First, a single-breed approach was applied by using only data from Holstein animals. Then, to enlarge the training population, data from the three breeds were combined and a multi-breed analysis was performed. Accuracies of genotypes imputed using the partial least squares regression method were compared with those obtained by using the Beagle software. The impact of genotype imputation on breeding value prediction was evaluated for milk yield, fat content and protein content. In the single-breed approach, the accuracy of imputation using partial least squares regression was around 90% and 94% for the 3K and 7K platforms, respectively; corresponding accuracies obtained with Beagle were around 85% and 90%. Moreover, computing time required by the partial least squares regression method was on average around 10 times lower than computing time required by Beagle. Using the partial least squares regression method in the multi-breed approach resulted in lower imputation accuracies than using single-breed data. The impact of the SNP-genotype imputation on the accuracy of direct genomic breeding values was small. The correlation between estimates of genetic merit obtained by using imputed versus actual genotypes was around 0.96 for the 7K chip. Results of the present work suggest that the partial least squares regression imputation method could be useful to impute SNP genotypes when pedigree information is not available.
Hieke, Stefanie; Benner, Axel; Schlenl, Richard F; Schumacher, Martin; Bullinger, Lars; Binder, Harald
2016-08-30
High-throughput technology allows for genome-wide measurements at different molecular levels for the same patient, e.g. single nucleotide polymorphisms (SNPs) and gene expression. Correspondingly, it might be beneficial to also integrate complementary information from different molecular levels when building multivariable risk prediction models for a clinical endpoint, such as treatment response or survival. Unfortunately, such a high-dimensional modeling task will often be complicated by a limited overlap of molecular measurements at different levels between patients, i.e. measurements from all molecular levels are available only for a smaller proportion of patients. We propose a sequential strategy for building clinical risk prediction models that integrate genome-wide measurements from two molecular levels in a complementary way. To deal with partial overlap, we develop an imputation approach that allows us to use all available data. This approach is investigated in two acute myeloid leukemia applications combining gene expression with either SNP or DNA methylation data. After obtaining a sparse risk prediction signature e.g. from SNP data, an automatically selected set of prognostic SNPs, by componentwise likelihood-based boosting, imputation is performed for the corresponding linear predictor by a linking model that incorporates e.g. gene expression measurements. The imputed linear predictor is then used for adjustment when building a prognostic signature from the gene expression data. For evaluation, we consider stability, as quantified by inclusion frequencies across resampling data sets. Despite an extremely small overlap in the application example with gene expression and SNPs, several genes are seen to be more stably identified when taking the (imputed) linear predictor from the SNP data into account. 
In the application with gene expression and DNA methylation, prediction performance with respect to survival also indicates that the proposed approach might work well. We consider imputation of linear predictor values to be a feasible and sensible approach for dealing with partial overlap in complementary integrative analysis of molecular measurements at different levels. More generally, these results indicate that a complementary strategy for integrating different molecular levels can result in more stable risk prediction signatures, potentially providing a more reliable insight into the underlying biology.
Nelson, Sarah C.; Stilp, Adrienne M.; Papanicolaou, George J.; Taylor, Kent D.; Rotter, Jerome I.; Thornton, Timothy A.; Laurie, Cathy C.
2016-01-01
Imputation is commonly used in genome-wide association studies to expand the set of genetic variants available for analysis. Larger and more diverse reference panels, such as the final Phase 3 of the 1000 Genomes Project, hold promise for improving imputation accuracy in genetically diverse populations such as Hispanics/Latinos in the USA. Here, we sought to empirically evaluate imputation accuracy when imputing to a 1000 Genomes Phase 3 versus a Phase 1 reference, using participants from the Hispanic Community Health Study/Study of Latinos. Our assessments included calculating the correlation between imputed and observed allelic dosage in a subset of samples genotyped on a supplemental array. We observed that the Phase 3 reference yielded higher accuracy at rare variants, but that the two reference panels were comparable at common variants. At a sample level, the Phase 3 reference improved imputation accuracy in Hispanic/Latino samples from the Caribbean more than for Mainland samples, which we attribute primarily to the additional reference panel samples available in Phase 3. We conclude that a 1000 Genomes Project Phase 3 reference panel can yield improved imputation accuracy compared with Phase 1, particularly for rare variants and for samples of certain genetic ancestry compositions. Our findings can inform imputation design for other genome-wide association studies of participants with diverse ancestries, especially as larger and more diverse reference panels continue to become available. PMID:27346520
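The accuracy measure described in the abstract above, the correlation between imputed and observed allelic dosage, can be illustrated with a minimal Python sketch. The function name dosage_r2 and the toy inputs are assumptions made for illustration only; the abstract does not say how the authors computed the statistic or which software they used.

```python
def dosage_r2(imputed, observed):
    """Squared Pearson correlation between imputed dosages (0.0 to 2.0)
    and observed genotypes coded as allele counts (0, 1, 2).

    Hypothetical helper for illustration; not taken from the cited study.
    """
    n = len(imputed)
    if n != len(observed) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 samples")
    mean_i = sum(imputed) / n
    mean_o = sum(observed) / n
    cov = sum((a - mean_i) * (b - mean_o) for a, b in zip(imputed, observed))
    var_i = sum((a - mean_i) ** 2 for a in imputed)
    var_o = sum((b - mean_o) ** 2 for b in observed)
    if var_i == 0 or var_o == 0:
        # A monomorphic variant has no variance, so the correlation is undefined.
        return float("nan")
    return (cov * cov) / (var_i * var_o)


# Toy example: one variant, four samples genotyped on a second array.
print(dosage_r2([0.1, 0.9, 1.8, 0.2], [0, 1, 2, 0]))
```

In practice such a statistic would be computed per variant and then summarised by minor allele frequency bin, which is how a difference between rare and common variants would become visible.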
LinkImpute: Fast and Accurate Genotype Imputation for Nonmodel Organisms
Money, Daniel; Gardner, Kyle; Migicovsky, Zoë; Schwaninger, Heidi; Zhong, Gan-Yuan; Myles, Sean
2015-01-01
Obtaining genome-wide genotype data from a set of individuals is the first step in many genomic studies, including genome-wide association and genomic selection. All genotyping methods suffer from some level of missing data, and genotype imputation can be used to fill in the missing data and improve the power of downstream analyses. Model organisms like human and cattle benefit from high-quality reference genomes and panels of reference genotypes that aid in imputation accuracy. In nonmodel organisms, however, genetic and physical maps often are either of poor quality or are completely absent, and there are no panels of reference genotypes available. There is therefore a need for imputation methods designed specifically for nonmodel organisms in which genomic resources are poorly developed and marker order is unreliable or unknown. Here we introduce LinkImpute, a software package based on a k-nearest neighbor genotype imputation method, LD-kNNi, which is designed for unordered markers. No physical or genetic maps are required, and it is designed to work on unphased genotype data from heterozygous species. It exploits the fact that markers useful for imputation often are not physically close to the missing genotype but rather distributed throughout the genome. Using genotyping-by-sequencing data from diverse and heterozygous accessions of apples, grapes, and maize, we compare LD-kNNi with several genotype imputation methods and show that LD-kNNi is fast, comparable in accuracy to the best-existing methods, and exhibits the least bias in allele frequency estimates. PMID:26377960
Shin, Kayeong; Choi, Jaeyeong; Kim, Yeoju; Lee, Yoonjeong; Kim, Joohoon; Lee, Seungho; Chung, Hoeil
2018-06-29
We propose a new analytical scheme in which field-flow fractionation (FFF)-based separation of target-specific polystyrene (PS) particle probes of different sizes is combined with amplified surface-enhanced Raman scattering (SERS) tagging for the simultaneous and sensitive detection of multiple microRNAs (miRNAs). For multiplexed detection, PS particles of three different diameters (15, 10, 5 μm) were used for size-coding, and a probe single-stranded DNA (ssDNA) complementary to a target miRNA was conjugated to the intended PS particle. After binding of a target miRNA to the PS probe, a polyadenylation reaction was carried out to generate a long adenine (A) tail serving as a binding site for thymine (T)-conjugated Au nanoparticles (T-AuNPs) to increase SERS intensity. The three size-coded PS probes bound with T-AuNPs were then separated in an FFF channel. Extinction-based fractograms clearly confirmed separation of the three size-coded PS probes, thereby enabling simultaneous measurement of three miRNAs. Raman intensities of FFF fractions collected at the peak maxima of the 15, 10 and 5 μm PS probes varied fairly quantitatively with miRNA concentration, and the reproducibility of measurement was acceptable. The proposed method is potentially useful for simultaneous detection of multiple miRNAs with high sensitivity. Copyright © 2018 Elsevier B.V. All rights reserved.
Ahmad, Meraj; Sinha, Anubhav; Ghosh, Sreya; Kumar, Vikrant; Davila, Sonia; Yajnik, Chittaranjan S; Chandak, Giriraj R
2017-07-27
Imputation is a computational method based on the principle of haplotype sharing that allows enrichment of genome-wide association study datasets. It depends on the haplotype structure of the population and the density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels which have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared the imputation accuracy using the 1000 Genomes phase 3 reference panel and a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using the genome-wide data from 1880 individuals, we demonstrate that WIP works better than the 1000 Genomes phase 3 panel and, when merged with it, significantly improves the imputation accuracy throughout the minor allele frequency range. We also show that imputation using only the South Asian component of the 1000 Genomes phase 3 panel performs as well as the merged panel, making it a computationally less intensive task. Thus, our study stresses that imputation accuracy using the 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.
Multilayer checkpoints for microRNA authenticity during RISC assembly.
Kawamata, Tomoko; Yoda, Mayuko; Tomari, Yukihide
2011-09-01
MicroRNAs (miRNAs) function through the RNA-induced silencing complex (RISC), which contains an Argonaute (Ago) protein at the core. RISC assembly follows a two-step pathway: miRNA/miRNA* duplex loading into Ago, and separation of the two strands within Ago. Here we show that the 5' phosphate of the miRNA strand is essential for duplex loading into Ago, whereas the preferred 5' nucleotide of the miRNA strand and the base-pairing status in the seed region and the middle of the 3' region function as additive anchors to Ago. Consequently, the miRNA authenticity is inspected at multiple steps during RISC assembly.
Sun, Ye-Ying; Qin, Shan-Shan; Cheng, Yun-Hui; Wang, Chao-Yun; Liu, Xiao-Jun; Liu, Ying; Zhang, Xiu-Li; Zhang, Wendy; Zhan, Jia-Xin; Shao, Shuai; Bian, Wei-Hua; Luo, Bi-Hui; Lu, Dong-Feng; Yang, Jian; Wang, Chun-Hua; Zhang, Chun-Xiang
2018-05-01
Contact inhibition and its disruption in vascular smooth muscle cells (VSMCs) are important cellular events in vascular diseases, but the underlying molecular mechanisms are unclear. In this study we investigated the roles of microRNAs (miRNAs) in the contact inhibition and its disruption of VSMCs and the molecular mechanisms involved. Rat VSMCs were seeded at 30% or 90% confluence. MiRNA expression profiles in contact-inhibited confluent VSMCs (90% confluence) and non-contact-inhibited low-density VSMCs (30% confluence) were determined. We found that multiple miRNAs were differentially expressed between the two groups. Among them, miR-145 was significantly increased in contact-inhibited VSMCs. Serum could disrupt the contact inhibition, as shown by the elicited proliferation of confluent VSMCs. The contact inhibition disruption was accompanied by a downregulation of miR-145. Serum-induced contact inhibition disruption of VSMCs was blocked by overexpression of miR-145. Moreover, downregulation of miR-145 was sufficient to disrupt the contact inhibition of VSMCs. The downregulation of miR-145 in serum-induced contact inhibition disruption was related to the activation of the PI3-kinase/Akt pathway, and was blocked by the PI3-kinase inhibitor LY294002. KLF5, a target gene of miR-145, was identified to be involved in the miR-145-mediated effect on VSMC contact inhibition disruption, as the disruption could be inhibited by knockdown of KLF5. In summary, our results show that multiple miRNAs are differentially expressed in contact-inhibited VSMCs and in non-contact-inhibited VSMCs. Among them, miR-145 is a critical regulator of contact inhibition and its disruption in VSMCs. PI3-kinase/Akt/miR-145/KLF5 is a critical signaling pathway in serum-induced contact inhibition disruption. Targeting of miRNAs related to the contact inhibition of VSMCs may represent a novel therapeutic approach for vascular diseases.
Global population-specific variation in miRNA associated with cancer risk and clinical biomarkers.
Rawlings-Goss, Renata A; Campbell, Michael C; Tishkoff, Sarah A
2014-08-28
MiRNA expression profiling is being actively investigated as a clinical biomarker and diagnostic tool to detect multiple cancer types and stages as well as other complex diseases. Initial investigations, however, have not comprehensively taken into account genetic variability affecting miRNA expression and/or function in populations of different ethnic backgrounds. Therefore, more complete surveys of miRNA genetic variability are needed to assess global patterns of miRNA variation within and between diverse human populations and their effect on clinically relevant miRNA genes. Genetic variation in 1524 miRNA genes was examined using whole genome sequencing (60x coverage) in a panel of 69 unrelated individuals from 14 global populations, including European, Asian and African populations. We identified 33 previously undescribed miRNA variants, and 31 miRNA containing variants that are globally population-differentiated in frequency between African and non-African populations (PD-miRNA). The top 1% of PD-miRNA were significantly enriched for regulation of genes involved in glucose/insulin metabolism and cell division (p < 10(-7)), most significantly the mitosis pathway, which is strongly linked to cancer onset. Overall, we identify 7 PD-miRNAs that are currently implicated as cancer biomarkers or diagnostics: hsa-mir-202, hsa-mir-423, hsa-mir-196a-2, hsa-mir-520h, hsa-mir-647, hsa-mir-943, and hsa-mir-1908. Notably, hsa-mir-202, a potential breast cancer biomarker, was found to show significantly high allele frequency differentiation at SNP rs12355840, which is known to affect miRNA expression levels in vivo and subsequently breast cancer mortality. MiRNA expression profiles represent a promising new category of disease biomarkers. However, population specific genetic variation can affect the prevalence and baseline expression of these miRNAs in diverse populations. 
Consequently, miRNA genetic and expression level variation among ethnic groups may be contributing in part to health disparities observed in multiple forms of cancer, specifically breast cancer, and will be an essential consideration when assessing the utility of miRNA biomarkers for the clinic.
Fucci, Carlo; Faggiano, Pompilio; Nardi, Matilde; D'Aloia, Antonio; Coletti, Giuseppe; De Cicco, Giuseppe; Latini, Leonardo; Vizzardi, Enrico; Lorusso, Roberto
2013-09-10
Barlow disease represents a surgical challenge for mitral valve repair (MR) in the presence of mitral insufficiency (MI) with multiple regurgitant jets. We hereby present our mid-term experience using a modified edge-to-edge technique to address this peculiar MI. From March 2003 till December 2010, 25 consecutive patients (mean age 54 ± 7 years, 14 males) affected by severe Barlow disease with multiple regurgitant jets were submitted to MR. Preoperative transesophageal echo (TEE) in all the cases showed at least 2 regurgitant jets, involving one or both leaflets in more than one segment. In all the patients, a triple orifice valve (TOV) repair with annuloplasty was performed. Intra-operative TEE and postoperative transthoracic echocardiography (TTE) were carried out to evaluate results of the TOV repair. There was no in-hospital death and one late death (non-cardiac related). At intra-operative TEE, the three orifices showed a mean total valve area of 2.9 ± 0.1cm(2) (range 2.5-3.3 cm(2)) with no residual regurgitation (2 cases of trivial MI) and no sign of valve stenosis (mean transvalvular gradient 4.6 ± 1.5 mmHg). At follow up (mean 38 ± 22 months), TTE showed favourable MR and no recurrence of significant MI (6 cases of trivial and 1 of mild MI). Stress TTE was performed in 5 cases showing persistent effective valve function (2 cases of trivial MI at peak exercise). All the patients showed significant NYHA functional class improvement. This report indicates that the TOV technique is effective in correcting complex Barlow mitral valves with multiple jets. Further studies are required to confirm long-term applicability and durability in more numerous clinical cases. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Variable Selection in the Presence of Missing Data: Imputation-based Methods.
Zhao, Yize; Long, Qi
2017-01-01
Variable selection plays an essential role in regression analysis as it identifies important variables that are associated with outcomes and is known to improve predictive accuracy of resulting models. Variable selection methods have been widely investigated for fully observed data. However, in the presence of missing data, methods for variable selection need to be carefully designed to account for missing data mechanisms and statistical techniques used for handling missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, valid when used under the assumptions of missing at random (MAR) and missing completely at random (MCAR), largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines variable selection results across all imputed datasets. The second strategy applies existing variable selection methods to stacked imputed datasets. The third variable selection strategy combines resampling techniques such as bootstrap with imputation. Despite recent advances, this area remains under-developed and offers fertile ground for further research.
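The first strategy named in the abstract above, selecting variables separately in each imputed dataset and then combining the results, can be sketched in Python. The abstract does not state a combination rule, so the inclusion-frequency threshold used here is an assumption chosen for illustration, and the function name combine_selections and the variable names are hypothetical.

```python
from collections import Counter


def combine_selections(selected_sets, threshold=0.5):
    """Keep variables chosen in at least `threshold` of the imputed datasets.

    selected_sets: one collection of selected variable names per imputed
    dataset, produced by any variable selection method run on that dataset.
    The threshold rule is an illustrative assumption, not a method taken
    from the reviewed literature.
    """
    m = len(selected_sets)
    if m == 0:
        raise ValueError("at least one imputed dataset is required")
    # set() guards against a variable being listed twice for one dataset.
    counts = Counter(v for chosen in selected_sets for v in set(chosen))
    return sorted(v for v, c in counts.items() if c / m >= threshold)


# Three imputed datasets, each with its own selected variables.
per_dataset = [{"age", "bmi"}, {"age"}, {"age", "crp"}]
print(combine_selections(per_dataset, threshold=0.5))
```

A threshold of 1.0 keeps only variables selected in every imputed dataset, while a threshold close to 0 keeps any variable selected at least once; the choice trades sparsity against sensitivity.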
genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools.
Lemieux Perreault, Louis-Philippe; Legault, Marc-André; Asselin, Géraldine; Dubé, Marie-Pierre
2016-12-01
Genotype imputation is now commonly performed following genome-wide genotyping experiments. Imputation increases the density of analyzed genotypes in the dataset, enabling fine-mapping across the genome. However, the process of imputation using the most recent publicly available reference datasets can require considerable computation power and the management of hundreds of large intermediate files. We have developed genipe, a complete genome-wide imputation pipeline which includes automatic reporting, imputed data indexing and management, and a suite of statistical tests for imputed data commonly used in genetic epidemiology (Sequence Kernel Association Test, Cox proportional hazards for survival analysis, and linear mixed models for repeated measurements in longitudinal studies). The genipe package is open source Python software and is freely available for non-commercial use (CC BY-NC 4.0) at https://github.com/pgxcentre/genipe. Documentation and tutorials are available at http://pgxcentre.github.io/genipe. CONTACT: louis-philippe.lemieux.perreault@statgen.org or marie-pierre.dube@statgen.org. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
NASA Astrophysics Data System (ADS)
Nishina, Kazuya; Ito, Akihiko; Hanasaki, Naota; Hayashi, Seiji
2017-02-01
Currently, available historical global N fertilizer maps for use as input data to global biogeochemical models are still limited, and existing maps do not consider NH4+ and NO3- in the fertilizer application rates. This paper provides a method for constructing a new historical global nitrogen fertilizer application map (0.5° × 0.5° resolution) for the period 1961-2010 based on country-specific information from Food and Agriculture Organization statistics (FAOSTAT) and various global datasets. This new map incorporates the fraction of NH4+ (and NO3-) in N fertilizer inputs by utilizing fertilizer species information in FAOSTAT, in which species can be categorized as NH4+- and/or NO3--forming N fertilizers. During data processing, we applied a statistical data imputation method for the missing data (19 % of national N fertilizer consumption) in FAOSTAT. The multiple imputation method enabled us to fill gaps in the time-series data with plausible values based on covariate information (year, population, GDP, and crop area). After the imputation, we downscaled the national consumption data to a gridded cropland map. We also applied the multiple imputation method to the available chemical fertilizer species consumption, allowing for the estimation of the NH4+ / NO3- ratio in national fertilizer consumption. In this study, the synthetic N fertilizer inputs in 2000 showed a general consistency with the existing N fertilizer map (Potter et al., 2010) in relation to the ranges of N fertilizer inputs. Globally, the estimated N fertilizer inputs based on the sum of filled data increased from 15 to 110 Tg-N during 1961-2010. On the other hand, the global NO3- input started to decline after the late 1980s and the fraction of NO3- in global N fertilizer decreased consistently from 35 to 13 % over a 50-year period. NH4+-forming fertilizers are dominant in most countries; however, the NH4+ / NO3- ratio in N fertilizer inputs shows clear differences temporally and geographically. 
This new map can be utilized as input data to global model studies and bring new insights for the assessment of historical terrestrial N cycling changes. Datasets available at doi:10.1594/PANGAEA.861203.
MicroRNA regulation of F-box proteins and its role in cancer.
Wu, Zhao-Hui; Pfeffer, Lawrence M
2016-02-01
MicroRNAs (miRNAs) are small endogenous non-coding RNAs, which play critical roles in cancer development by suppressing gene expression at the post-transcriptional level. In general, oncogenic miRNAs are upregulated in cancer, while miRNAs that act as tumor suppressors are downregulated, leading to decreased expression of tumor suppressors and upregulated oncogene expression, respectively. F-box proteins function as the substrate-recognition components of the SKP1-CUL1-F-box (SCF)-ubiquitin ligase complex for the degradation of their protein targets by the ubiquitin-proteasome system. Therefore, F-box proteins and miRNAs both negatively regulate target gene expression post-transcriptionally. Since each miRNA is capable of fine-tuning the expression of multiple target genes, multiple F-box proteins may be suppressed by the same miRNA. Meanwhile, one F-box protein could be regulated by several miRNAs in different cancer types. In this review, we will focus on miRNA-mediated downregulation of various F-box proteins, the resulting stabilization of F-box protein substrates and the impact of these processes on human malignancies. We provide insight into how the miRNA: F-box protein axis may regulate cancer progression and metastasis. We also consider the broader role of F-box proteins in the regulation of pathways that are independent of the ubiquitin ligase complex and how that impacts on oncogenesis. The area of miRNAs and the F-box proteins that they regulate in cancer is an emerging field and will inform new strategies in cancer treatment. Copyright © 2015 Elsevier Ltd. All rights reserved.
Xie, Ying; Wehrkamp, Cody J; Li, Jing; Wang, Yan; Wang, Yazhe; Mott, Justin L; Oupický, David
2016-03-07
Cholangiocarcinoma is the second most common primary liver malignancy with extremely poor prognosis due to early invasion and widespread metastasis. The invasion and metastasis are regulated by multiple factors including CXCR4 chemokine receptor and multiple microRNAs. The goal of this study was to test the hypothesis that inhibition of CXCR4 combined with the action of miR-200c mimic will cooperatively enhance the inhibition of the invasion of human cholangiocarcinoma cells. The results show that CXCR4-inhibition polycation PCX can effectively deliver miR-200c mimic and that the combination treatment consisting of PCX and miR-200c results in cooperative antimigration activity, most likely by coupling the CXCR4 axis blockade with epithelial-to-mesenchymal transition inhibition in the cholangiocarcinoma cells. The ability of the combined PCX/miR-200c treatment to obstruct two migratory pathways represents a promising antimetastatic strategy in cholangiocarcinoma.
Fernandez, Serena; Risolino, Maurizio; Mandia, Nadia; Talotta, Francesco; Soini, Ylermi; Incoronato, Mariarosaria; Condorelli, Gerolama; Banfi, Sandro; Verde, Pasquale
2014-01-01
MicroRNAs (miRNAs) control cell cycle progression by targeting the transcripts encoding for cyclins, CDKs and CDK inhibitors, such as p27KIP1 (p27). p27 expression is controlled by multiple transcriptional and posttranscriptional mechanisms, including translational inhibition by miR-221/222 and posttranslational regulation by the SCFSKP2 complex. The oncosuppressor activity of miR-340 has been recently characterized in breast, colorectal and osteosarcoma tumor cells. However, the mechanisms underlying miR-340-induced cell growth arrest have not been elucidated. Here we describe miR-340 as a novel tumor suppressor in non-small cell lung cancer (NSCLC). Starting from the observation that the growth-inhibitory and proapoptotic effects of miR-340 correlate with the accumulation of p27 in lung adenocarcinoma and glioblastoma cells, we have analyzed the functional relationship between miR-340 and p27 expression. miR-340 targets three key negative regulators of p27. The miR-340-mediated inhibition of both Pumilio-family RNA-binding proteins (PUM1 and PUM2), required for the miR-221/222 interaction with the p27 3′UTR, antagonizes the miRNA-dependent downregulation of p27. At the same time, miR-340 induces the stabilization of p27 by targeting SKP2, the key posttranslational regulator of p27. Therefore, miR-340 controls p27 at both translational and posttranslational levels. Accordingly, the inhibition of either PUM1 or SKP2 partially recapitulates the miR-340 effect on cell proliferation and apoptosis. In addition to the effect on tumor cell proliferation, miR-340 also inhibits intercellular adhesion and motility in lung cancer cells. These changes correlate with the miR-340-mediated inhibition of previously validated (MET and ROCK1) and potentially novel (RHOA and CDH1) miR-340 target transcripts. 
Finally, we show that in a small cohort of NSCLC patients (n=23), representative of all four stages of lung cancer, miR-340 expression inversely correlates with clinical staging, thus suggesting that miR-340 downregulation contributes to the disease progression. PMID:25151966
Saminathan, Thangasamy; Bodunrin, Abiodun; Singh, Nripendra V; Devarajan, Ramajayam; Nimmakayala, Padma; Jeff, Moersfelder; Aradhya, Mallikarjuna; Reddy, Umesh K
2016-05-26
MicroRNAs (miRNAs), a class of small non-coding endogenous RNAs that regulate gene expression post-transcriptionally, play multiple key roles in plant growth and development and in biotic and abiotic stress response. Knowledge and roles of miRNAs in pomegranate fruit development have not been explored. Pomegranate, which accumulates a large amount of anthocyanins in skin and arils, is valuable to human health, mainly because of its antioxidant properties. In this study, we developed a small RNA library from pooled RNA samples from young seedlings to mature fruits and identified both conserved and pomegranate-specific miRNA from 29,948,480 high-quality reads. For the pool of 15- to 30-nt small RNAs, ~50 % were 24 nt. The miR157 family was the most abundant, followed by miR156, miR166, and miR168, with variants within each family. The base bias at the first position from the 5' end had a strong preference for U for most 18- to 26-nt sRNAs but a preference for A for 18-nt sRNAs. In addition, for all 24-nt sRNAs, the nucleotide U was preferred (97 %) in the first position. Stem-loop RT-qPCR was used to validate the expression of the predominant miRNAs and novel miRNAs in leaves, male and female flowers, and multiple fruit developmental stages; miR156, miR156a, miR159a, miR159b, and miR319b were upregulated during the later stages of fruit development. Higher expression of miR156 in later fruit development may positively regulate anthocyanin biosynthesis by reducing SPL transcription factor. Novel miRNAs showed variation in expression among different tissues. These novel miRNAs targeted different transcription factors and hormone related regulators. Gene ontology and KEGG pathway analyses revealed predominant metabolic processes and catalytic activities, important for fruit development.
In addition, KEGG pathway analyses revealed the involvement of miRNAs in ascorbate and linolenic acid, starch and sucrose metabolism; RNA transport; plant hormone signaling pathways; and circadian clock. Our first and preliminary report of miRNAs will provide information on the synthesis of biochemical compounds of pomegranate for future research. The functions of the targets of the novel miRNAs need further investigation.
MI as a Predictor of Students' Performance in Reading Competency
ERIC Educational Resources Information Center
Hajhashemi, Karim
2012-01-01
The purpose of this study was to examine whether performance in MI could predict the performance in reading competency. The other objectives were to identify the components of MI which are correlated with the reading test scores, and to determine the relationship between the multiple intelligences and reading proficiency. A descriptive and ex post…
Sheikh, Mashhood Ahmed; Abelsen, Birgit; Olsen, Jan Abel
2017-11-01
Previous methods for assessing mediation assume no multiplicative interactions. The inverse odds weighting (IOW) approach has been presented as a method that can be used even when interactions exist. The substantive aim of this study was to assess the indirect effect of education on health and well-being via four indicators of adult socioeconomic status (SES): income, management position, occupational hierarchy position and subjective social status. 8516 men and women from the Tromsø Study (Norway) were followed for 17 years. Education was measured at age 25-74 years, while SES and health and well-being were measured at age 42-91 years. Natural direct and indirect effects (NIE) were estimated using weighted Poisson regression models with IOW. Stata code is provided that makes it easy to assess mediation in any multiple imputed dataset with multiple mediators and interactions. Low education was associated with lower SES. Consequently, low SES was associated with being unhealthy and having a low level of well-being. The effect (NIE) of education on health and well-being is mediated by income, management position, occupational hierarchy position and subjective social status. This study contributes to the literature on mediation analysis, as well as the literature on the importance of education for health-related quality of life and subjective well-being. The influence of education on health and well-being had different pathways in this Norwegian sample. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Murray, Megan Y.; Rushworth, Stuart A.; Zaitseva, Lyubov; Bowles, Kristian M.; MacEwan, David J.
2013-01-01
Dexamethasone is a key front-line chemotherapeutic for B-cell malignant multiple myeloma (MM). Dexamethasone modulates MM cell survival signaling but fails to induce marked cytotoxicity when used as a monotherapy. We demonstrate here the mechanism behind this insufficient responsiveness of MM cells toward dexamethasone, revealing in MM a dramatic anti-apoptotic role for microRNA (miRNA)-125b in the insensitivity toward dexamethasone-induced apoptosis. MM cells responding to dexamethasone exhibited enhanced expression of oncogenic miR-125b. Dexamethasone also induced expression of miR-34a, which acts to suppress SIRT1 deacetylase, and thus allows maintained acetylation and inactivation of p53. p53 mRNA is also suppressed by miR-125b targeting. Reporter assays showed that both these dexamethasone-induced miRNAs act downstream of their target genes to prevent p53 tumor suppressor actions and, ultimately, resist cytotoxic responses in MM. Use of antisense miR-125b transcripts enhanced expression of pro-apoptotic p53, repressed expression of anti-apoptotic SIRT1 and, importantly, significantly enhanced dexamethasone-induced cell death responses in MM. Pharmacological manipulations showed that the key regulation enabling complete dexamethasone sensitivity in MM cells lies with miR-125b. In summary, dexamethasone-induced miR-125b induces cell death resistance mechanisms in MM cells via the p53/miR-34a/SIRT1 signaling network and provides these cells with an enhanced level of resistance to cytotoxic chemotherapeutics. Clearly, such anti-apoptotic mechanisms will need to be overcome to more effectively treat nascent, refractory and relapsed MM patients. These mechanisms provide insight into the role of miRNA regulation of apoptosis and their promotion of MM cell proliferative mechanisms. PMID:23759586
B. Tyler Wilson; Andrew J. Lister; Rachel I. Riemann
2012-01-01
The paper describes an efficient approach for mapping multiple individual tree species over large spatial domains. The method integrates vegetation phenology derived from MODIS imagery and raster data describing relevant environmental parameters with extensive field plot data of tree species basal area to create maps of tree species abundance and distribution at a 250-...
ERIC Educational Resources Information Center
Smith, Nichole Danielle
2017-01-01
According to the few quasi-experimental studies examining course outcomes for community college (Xu & Jaggars, 2011a, 2011b, 2013, 2014) and for-profit students (Bettinger, Fox, Loeb, & Taylor, 2014), there is a significant penalty for online students. No comparable research has been conducted on public four-year university students to…
ERIC Educational Resources Information Center
Hajhashemi, Karim; Ghombavani, Fatemeh Parasteh; Amirkhiz, Seyed Yasin Yazdi
2011-01-01
According to the theory of multiple intelligences (MI) propounded by Gardner (1983, 1999a, 1999b), each individual has a multitude of intelligences that are quite independent of each other and each individual has a unique cognitive profile. Having access to the MI profiles and learning strategies of learners could help the teachers in planning…
ERIC Educational Resources Information Center
Kaya, Osman Nafiz; Dogan, Alev; Gokcek, Nur; Kilic, Ziya; Kilic, Esma
2007-01-01
The purpose of this study was to investigate the effects of multiple intelligences (MI) teaching approach on 8th Grade students' achievement in and attitudes toward science. This study used a pretest-posttest control group experimental design. While the experimental group (n=30) was taught a unit on acids and bases using MI teaching approach, the…
ERIC Educational Resources Information Center
Sistani, Mahsa; Hashemian, Mahmood
2016-01-01
This study, first, examined whether there was any relationship between Iranian L2 learners' vocabulary learning strategies (VLSs), on the one hand, and their multiple intelligences (MI) types, on the other hand. In so doing, it explored the extent to which MI would predict L2 learners' VLSs. To these ends, 40 L2 learners from Isfahan University of…
ERIC Educational Resources Information Center
Boonma, Malai; Phaiboonnugulkij, Malinee
2014-01-01
This article argues for the need to set out the theoretical framework of Multiple Intelligences (MI) theory and to provide a suitable answer to doubts regarding its part in foreign language teaching. The article addresses the application of MI theory following various sources from Howard Gardner and the authors who revised this theory for use in the…
Genotype imputation in the domestic dog
Meurs, K. M.
2016-01-01
Application of imputation methods to accurately predict a dense array of SNP genotypes in the dog could provide an important supplement to current analyses of array-based genotyping data. Here, we developed a reference panel of 4,885,283 SNPs in 83 dogs across 15 breeds using whole genome sequencing. We used this panel to predict the genotypes of 268 dogs across three breeds with 84,193 SNP array-derived genotypes as inputs. We then (1) performed breed clustering of the actual and imputed data; (2) evaluated several reference panel breed combinations to determine an optimal reference panel composition; and (3) compared the accuracy of two commonly used software algorithms (Beagle and IMPUTE2). Breed clustering was well preserved in the imputation process across eigenvalues representing 75 % of the variation in the imputed data. Using Beagle with a target panel from a single breed, genotype concordance was highest using a multi-breed reference panel (92.4 %) compared to a breed-specific reference panel (87.0 %) or a reference panel containing no breeds overlapping with the target panel (74.9 %). This finding was confirmed using target panels derived from two other breeds. Additionally, using the multi-breed reference panel, genotype concordance was slightly higher with IMPUTE2 (94.1 %) compared to Beagle; Pearson correlation coefficients were slightly higher for both software packages (0.946 for Beagle, 0.961 for IMPUTE2). Our findings demonstrate that genotype imputation from SNP array-derived data to whole genome-level genotypes is both feasible and accurate in the dog with appropriate breed overlap between the target and reference panels. PMID:27129452
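The genotype concordance figures quoted in the abstract above can be read as the share of genotype calls on which the imputed and the actual data agree. The following is a minimal sketch of that calculation, assuming genotypes coded as 0/1/2 alternate-allele counts with `None` for missing calls; the function name and the coding are illustrative assumptions and are not taken from the study or from Beagle or IMPUTE2.

```python
def genotype_concordance(truth, imputed):
    """Fraction of positions where imputed and true genotypes agree.

    Assumption: genotypes are coded 0/1/2 (alternate-allele counts) and
    None marks a missing call; positions missing in either list are skipped.
    """
    pairs = [(t, i) for t, i in zip(truth, imputed)
             if t is not None and i is not None]
    if not pairs:
        raise ValueError("no positions with both genotypes observed")
    matches = sum(1 for t, i in pairs if t == i)
    return matches / len(pairs)


# Hypothetical example: four comparable positions, three of which agree.
truth = [0, 1, 2, 2, None]
imputed = [0, 1, 1, 2, 2]
print(genotype_concordance(truth, imputed))
```

Concordance of this kind treats every mismatch equally; a correlation between true and imputed allele counts, as also reported above, additionally reflects how far off a wrong call is.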
Blue, Elizabeth Marchani; Sun, Lei; Tintle, Nathan L.; Wijsman, Ellen M.
2014-01-01
When analyzing family data, we dream of perfectly informative data, even whole genome sequences (WGS) for all family members. Reality intervenes, and we find next-generation sequence (NGS) data have error, and are often too expensive or impossible to collect on everyone. Genetic Analysis Workshop 18 groups “Quality Control” and “Dropping WGS through families using GWAS framework” focused on finding, correcting, and using errors within the available sequence and family data, developing methods to infer and analyze missing sequence data among relatives, and testing for linkage and association with simulated blood pressure. We found that single nucleotide polymorphisms, NGS, and imputed data are generally concordant, but that errors are particularly likely at rare variants, homozygous genotypes, within regions with repeated sequences or structural variants, and within sequence data imputed from unrelateds. Admixture complicated identification of cryptic relatedness, but information from Mendelian transmission improved error detection and provided an estimate of the de novo mutation rate. Both genotype and pedigree errors had an adverse effect on subsequent analyses. Computationally fast rules-based imputation was accurate, but could not cover as many loci or subjects as more computationally demanding probability-based methods. Incorporating population-level data into pedigree-based imputation methods improved results. Observed data outperformed imputed data in association testing, but imputed data were also useful. We discuss the strengths and weaknesses of existing methods, and suggest possible future directions. Topics include improving communication between those performing data collection and analysis, establishing thresholds for and improving imputation quality, and incorporating error into imputation and analytical models. PMID:25112184
LinkImpute: Fast and Accurate Genotype Imputation for Nonmodel Organisms.
Money, Daniel; Gardner, Kyle; Migicovsky, Zoë; Schwaninger, Heidi; Zhong, Gan-Yuan; Myles, Sean
2015-09-15
Obtaining genome-wide genotype data from a set of individuals is the first step in many genomic studies, including genome-wide association and genomic selection. All genotyping methods suffer from some level of missing data, and genotype imputation can be used to fill in the missing data and improve the power of downstream analyses. Model organisms like human and cattle benefit from high-quality reference genomes and panels of reference genotypes that aid in imputation accuracy. In nonmodel organisms, however, genetic and physical maps often are either of poor quality or are completely absent, and there are no panels of reference genotypes available. There is therefore a need for imputation methods designed specifically for nonmodel organisms in which genomic resources are poorly developed and marker order is unreliable or unknown. Here we introduce LinkImpute, a software package based on a k-nearest neighbor genotype imputation method, LD-kNNi, which is designed for unordered markers. No physical or genetic maps are required, and it is designed to work on unphased genotype data from heterozygous species. It exploits the fact that markers useful for imputation often are not physically close to the missing genotype but rather distributed throughout the genome. Using genotyping-by-sequencing data from diverse and heterozygous accessions of apples, grapes, and maize, we compare LD-kNNi with several genotype imputation methods and show that LD-kNNi is fast, comparable in accuracy to the best-existing methods, and exhibits the least bias in allele frequency estimates. Copyright © 2015 Money et al.
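The general idea of nearest-neighbor genotype imputation described above can be sketched in a few lines. This is a plain k-nearest-neighbor vote and not the LD-kNNi method itself, which, according to the abstract, is designed for unordered markers and relies on markers spread throughout the genome; the distance measure, the value of `k`, and the 0/1/2 genotype coding are all assumptions made for illustration.

```python
from collections import Counter


def distance(a, b):
    """Mean absolute genotype difference over markers observed in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sum(abs(x - y) for x, y in shared) / len(shared)


def knn_impute(genotypes, k=3):
    """Fill each missing genotype with the most common genotype among the
    k most similar samples that have a call at that marker.

    genotypes: list of samples, each a list of 0/1/2 calls or None (missing).
    Returns a new matrix; the input is left unchanged.
    """
    imputed = [row[:] for row in genotypes]
    for s, row in enumerate(genotypes):
        for m, call in enumerate(row):
            if call is not None:
                continue
            # Candidate donors: other samples with an observed call at marker m.
            donors = [(distance(row, other), other[m])
                      for t, other in enumerate(genotypes)
                      if t != s and other[m] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [donor_call for _, donor_call in donors[:k]]
            if nearest:
                imputed[s][m] = Counter(nearest).most_common(1)[0][0]
    return imputed


# Hypothetical data: sample 0 resembles samples 1 and 2, not samples 3 and 4.
data = [
    [0, 0, 1, None],
    [0, 0, 1, 2],
    [0, 0, 1, 2],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
]
filled = knn_impute(data, k=2)
print(filled[0])
```

Because similarity is computed from whichever markers two samples share, no marker order or physical map enters the calculation, which is the property the abstract emphasizes for nonmodel organisms.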
Smith, Jessica L.; Jeng, Sophia; McWeeney, Shannon K.
2017-01-01
ABSTRACT The impact of mosquito-borne flavivirus infections worldwide is significant, and many critical aspects of these viruses' biology, including virus-host interactions, host cell requirements for replication, and how virus-host interactions impact pathology, remain to be fully understood. The recent reemergence and spread of flaviviruses, including dengue virus (DENV), West Nile virus (WNV), and Zika virus (ZIKV), highlight the importance of performing basic research on this important group of pathogens. MicroRNAs (miRNAs) are small, noncoding RNAs that modulate gene expression posttranscriptionally and have been demonstrated to regulate a broad range of cellular processes. Our research is focused on identifying pro- and antiflaviviral miRNAs as a means of characterizing cellular pathways that support or limit viral replication. We have screened a library of known human miRNA mimics for their effect on the replication of three flaviviruses, DENV, WNV, and Japanese encephalitis virus (JEV), using a high-content immunofluorescence screen. Several families of miRNAs were identified as inhibiting multiple flaviviruses, including the miRNA miR-34, miR-15, and miR-517 families. Members of the miR-34 family, which have been extensively characterized for their ability to repress Wnt/β-catenin signaling, demonstrated strong antiflaviviral effects, and this inhibitory activity extended to other viruses, including ZIKV, alphaviruses, and herpesviruses. Previous research suggested a possible link between the Wnt and type I interferon (IFN) signaling pathways. Therefore, we investigated the role of type I IFN induction in the antiviral effects of the miR-34 family and confirmed that these miRNAs potentiate interferon regulatory factor 3 (IRF3) phosphorylation and translocation to the nucleus, the induction of IFN-responsive genes, and the release of type I IFN from transfected cells. 
We further demonstrate that the intersection between the Wnt and IFN signaling pathways occurs at the point of glycogen synthase kinase 3β (GSK3β)–TANK-binding kinase 1 (TBK1) binding, inducing TBK1 to phosphorylate IRF3 and initiate downstream IFN signaling. In this way, we have identified a novel cellular signaling network with a critical role in regulating the replication of multiple virus families. These findings highlight the opportunities for using miRNAs as tools to discover and characterize unique cellular factors involved in supporting or limiting virus replication, opening up new avenues for antiviral research. IMPORTANCE MicroRNAs are a class of small regulatory RNAs that modulate cellular processes through the posttranscriptional repression of multiple transcripts. We hypothesized that individual miRNAs may be capable of inhibiting viral replication through their effects on host proteins or pathways. To test this, we performed a high-content screen for miRNAs that inhibit the replication of three medically relevant members of the flavivirus family: West Nile virus, Japanese encephalitis virus, and dengue virus 2. The results of this screen identify multiple miRNAs that inhibit one or more of these viruses. Extensive follow-up on members of the miR-34 family of miRNAs, which are active against all three viruses as well as the closely related Zika virus, demonstrated that miR-34 functions through increasing the infected cell's ability to respond to infection through the interferon-based innate immune pathway. Our results not only add to the knowledge of how viruses interact with cellular pathways but also provide a basis for more extensive data mining by providing a comprehensive list of miRNAs capable of inhibiting flavivirus replication. Finally, the miRNAs themselves or cellular pathways identified as modulating virus infection may prove to be novel candidates for the development of therapeutic interventions. PMID:28148804
Smith, Jessica L; Jeng, Sophia; McWeeney, Shannon K; Hirsch, Alec J
2017-04-15
The impact of mosquito-borne flavivirus infections worldwide is significant, and many critical aspects of these viruses' biology, including virus-host interactions, host cell requirements for replication, and how virus-host interactions impact pathology, remain to be fully understood. The recent reemergence and spread of flaviviruses, including dengue virus (DENV), West Nile virus (WNV), and Zika virus (ZIKV), highlight the importance of performing basic research on this important group of pathogens. MicroRNAs (miRNAs) are small, noncoding RNAs that modulate gene expression posttranscriptionally and have been demonstrated to regulate a broad range of cellular processes. Our research is focused on identifying pro- and antiflaviviral miRNAs as a means of characterizing cellular pathways that support or limit viral replication. We have screened a library of known human miRNA mimics for their effect on the replication of three flaviviruses, DENV, WNV, and Japanese encephalitis virus (JEV), using a high-content immunofluorescence screen. Several families of miRNAs were identified as inhibiting multiple flaviviruses, including the miRNA miR-34, miR-15, and miR-517 families. Members of the miR-34 family, which have been extensively characterized for their ability to repress Wnt/β-catenin signaling, demonstrated strong antiflaviviral effects, and this inhibitory activity extended to other viruses, including ZIKV, alphaviruses, and herpesviruses. Previous research suggested a possible link between the Wnt and type I interferon (IFN) signaling pathways. Therefore, we investigated the role of type I IFN induction in the antiviral effects of the miR-34 family and confirmed that these miRNAs potentiate interferon regulatory factor 3 (IRF3) phosphorylation and translocation to the nucleus, the induction of IFN-responsive genes, and the release of type I IFN from transfected cells. 
We further demonstrate that the intersection between the Wnt and IFN signaling pathways occurs at the point of glycogen synthase kinase 3β (GSK3β)-TANK-binding kinase 1 (TBK1) binding, inducing TBK1 to phosphorylate IRF3 and initiate downstream IFN signaling. In this way, we have identified a novel cellular signaling network with a critical role in regulating the replication of multiple virus families. These findings highlight the opportunities for using miRNAs as tools to discover and characterize unique cellular factors involved in supporting or limiting virus replication, opening up new avenues for antiviral research. IMPORTANCE MicroRNAs are a class of small regulatory RNAs that modulate cellular processes through the posttranscriptional repression of multiple transcripts. We hypothesized that individual miRNAs may be capable of inhibiting viral replication through their effects on host proteins or pathways. To test this, we performed a high-content screen for miRNAs that inhibit the replication of three medically relevant members of the flavivirus family: West Nile virus, Japanese encephalitis virus, and dengue virus 2. The results of this screen identify multiple miRNAs that inhibit one or more of these viruses. Extensive follow-up on members of the miR-34 family of miRNAs, which are active against all three viruses as well as the closely related Zika virus, demonstrated that miR-34 functions through increasing the infected cell's ability to respond to infection through the interferon-based innate immune pathway. Our results not only add to the knowledge of how viruses interact with cellular pathways but also provide a basis for more extensive data mining by providing a comprehensive list of miRNAs capable of inhibiting flavivirus replication. Finally, the miRNAs themselves or cellular pathways identified as modulating virus infection may prove to be novel candidates for the development of therapeutic interventions. 
Copyright © 2017 American Society for Microbiology.
Lecca, Davide; Marangon, Davide; Coppolino, Giusy T.; Méndez, Aida Menéndez; Finardi, Annamaria; Costa, Gloria Dalla; Martinelli, Vittorio; Furlan, Roberto; Abbracchio, Maria P.
2016-01-01
In the mature central nervous system (CNS), oligodendrocytes provide support and insulation to axons thanks to the production of a myelin sheath. During their maturation to myelinating cells, oligodendroglial precursors (OPCs) follow a very precise differentiation program, which is finely orchestrated by transcription factors, epigenetic factors and microRNAs (miRNAs), a class of small non-coding RNAs involved in post-transcriptional regulation. Any alterations in this program can potentially contribute to dysregulated myelination, impaired remyelination and neurodegenerative conditions, as happens in multiple sclerosis (MS). Here, we identify miR-125a-3p, a developmentally regulated miRNA, as a new actor in oligodendroglial maturation that, in the mammalian CNS, regulates the expression of myelin genes by simultaneously acting on several of its already validated targets. In cultured OPCs, over-expression of miR-125a-3p by mimic treatment impairs oligodendroglial maturation, while its inhibition with an antago-miR stimulates it. Moreover, we show that miR-125a-3p levels are abnormally high in the cerebrospinal fluid of MS patients bearing active demyelinating lesions, suggesting that its pathological upregulation may contribute to MS development, at least in part by blockade of OPC differentiation leading to impaired repair of demyelinated lesions. PMID:27698367
Dai, Xianping; Li, Mengshun; Geng, Feng
2017-07-01
Dexamethasone is widely used in multiple myeloma (MM) for its cytotoxic effects on lymphoid cells. However, many MM patients are resistant to dexamethasone, although some can benefit from dexamethasone treatment. In this study, we noted that ω-3 polyunsaturated fatty acids (PUFAs) enhanced the dexamethasone sensitivity of MM cells by inducing cell apoptosis. q-PCR analysis revealed that miR-34a could be significantly induced by PUFAs in U266 and primary MM cells. Transfection with miR-34a antagonist or miR-34a agomir could restore or suppress the dexamethasone sensitivity in U266 cells. Both luciferase reporter assay and Western blot showed that Bcl-2 is the direct target of miR-34a in MM cells. In addition, we observed that PUFAs induced p53 protein expression in MM cells under dexamethasone administration. Furthermore, suppressing p53 by its inhibitor, Pifithrin-α, regulated the miR-34a expression and modulated the sensitivity to dexamethasone in U266 cells. In summary, these results suggest that PUFAs enhance dexamethasone sensitivity to MM cells through the p53/miR-34a axis with a likely contribution of Bcl-2 suppression.
LinkImputeR: user-guided genotype calling and imputation for non-model organisms.
Money, Daniel; Migicovsky, Zoë; Gardner, Kyle; Myles, Sean
2017-07-10
Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms. Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored, and makes use of all available DNA sequence information for the purposes of genotype calling and imputation. It is specifically designed for non-model organisms since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of genotypes generated from LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude, can improve genotyping accuracy by several percent and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis. 
By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from NGS technologies. It enables the user to quickly and easily examine the effects of varying thresholds and filters on the number and quality of the resulting genotype calls. In this manner, users can decide on thresholds that are most suitable for their purposes. We show that LinkImputeR can significantly augment the value and utility of NGS data sets, especially in non-model organisms with poor genomic resources.
Slattery, Martha L; Herrick, Jennifer S; Stevens, John R; Wolff, Roger K; Mullany, Lila E
2017-01-01
Determination of functional pathways regulated by microRNAs (miRNAs), while an essential step in developing therapeutics, is challenging. Some miRNAs have been studied extensively; others have limited information. In this study, we focus on 254 miRNAs previously identified as being associated with colorectal cancer and their database-identified validated target genes. We use RNA-Seq data to evaluate messenger RNA (mRNA) expression for 157 subjects who also had miRNA expression data. In the replication phase of the study, we replicated associations between 254 miRNAs associated with colorectal cancer and mRNA expression of database-identified target genes in normal colonic mucosa. In the discovery phase of the study, we evaluated expression of 18 miRNAs (those with 20 or fewer database-identified target genes along with miR-21-5p, miR-215-5p, and miR-124-3p which have more than 500 database-identified target genes) with expression of 17 434 mRNAs to identify new targets in colon tissue. Seed region matches between miRNA and newly identified targeted mRNA were used to help determine direct miRNA-mRNA associations. From the replication of the 121 miRNAs that had at least 1 database-identified target gene using mRNA expression methods, 97.9% were expressed in normal colonic mucosa. Of the 8622 target miRNA-mRNA associations identified in the database, 2658 (30.2%) were associated with gene expression in normal colonic mucosa after adjusting for multiple comparisons. Of the 133 miRNAs with database-identified target genes by non-mRNA expression methods, 97.2% were expressed in normal colonic mucosa. After adjustment for multiple comparisons, 2416 miRNA-mRNA associations remained significant (19.8%). Results from the discovery phase based on detailed examination of 18 miRNAs identified more than 80 000 miRNA-mRNA associations that had not previously been linked to the miRNA.
Of these miRNA-mRNA associations, 15.6% and 14.8% had seed matches for GRCh38 and GRCh37, respectively. Our data suggest that miRNA target gene databases are incomplete; pathways derived from these databases have similar deficiencies. Although we know a lot about several miRNAs, little is known about other miRNAs in terms of their targeted genes. We encourage others to use their data to continue to further identify and validate miRNA-targeted genes.
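The seed-match step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a 7-mer seed at miRNA nucleotides 2 to 8 and perfect Watson-Crick complementarity; the abstract does not state the matching rules actually used, and the function names and toy sequences below are hypothetical.

```python
# Minimal sketch of a seed-region match; all names and sequences are illustrative.
# Assumption: the seed is miRNA nucleotides 2-8 and the target site on the mRNA
# is its exact reverse complement (RNA alphabet A, C, G, U).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_site(mirna, start=2, end=8):
    """Return the mRNA motif pairing with miRNA nucleotides start..end (1-based, inclusive)."""
    seed = mirna[start - 1:end]
    return reverse_complement(seed)


def find_seed_matches(mirna, mrna):
    """Return the 0-based start positions of every seed-complementary motif in mrna."""
    site = seed_site(mirna)
    positions = []
    idx = mrna.find(site)
    while idx != -1:
        positions.append(idx)
        idx = mrna.find(site, idx + 1)  # allow overlapping hits
    return positions


# Toy sequences, made up for illustration only.
toy_mirna = "UAGCUUAUCAGACUGAUGUUGA"
toy_utr = "GGCAUAAGCUACCGAUAAGCUUU"
matches = find_seed_matches(toy_mirna, toy_utr)
```

A real analysis would typically distinguish site types (for example 6-mer, 7-mer and 8-mer matches) and restrict the search to annotated 3' UTRs; this sketch does neither.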
Impact of missing data imputation methods on gene expression clustering and classification.
de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G
2015-02-26
Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
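The "simple methods" referred to above, replacing missing values by the mean or median, can be sketched as follows. This is an illustration only: it assumes missing values are encoded as `None` and that imputation is done separately for each gene across samples; the function name `impute_column` is hypothetical and not taken from the study.

```python
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries.

    Hypothetical helper: `values` holds one gene's expression across samples,
    with None marking a missing measurement.
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Toy expression values, made up for illustration.
mean_filled = impute_column([1.0, None, 3.0], strategy="mean")
median_filled = impute_column([1.0, 2.0, None, 10.0], strategy="median")
```

The median variant is less sensitive to outlying expression values, which is the usual reason for preferring it over the mean.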
Browning, Brian L.; Browning, Sharon R.
2009-01-01
We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R2, and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package. PMID:19200528
Strategies for genotype imputation in composite beef cattle.
Chud, Tatiane C S; Ventura, Ricardo V; Schenkel, Flavio S; Carvalheiro, Roberto; Buzanskas, Marcos E; Rosa, Jaqueline O; Mudadu, Maurício de Alvarenga; da Silva, Marcos Vinicius G B; Mokry, Fabiana B; Marcondes, Cintia R; Regitano, Luciana C A; Munari, Danísio P
2015-08-07
Genotype imputation has been used to increase genomic information, allow more animals in genome-wide analyses, and reduce genotyping costs. In Brazilian beef cattle production, many animals result from crossbreeding, which may alter linkage disequilibrium patterns. Thus, the challenge is to obtain accurately imputed genotypes in crossbred animals. The objective of this study was to evaluate the best fitting and most accurate imputation strategy on the MA genetic group (the progeny of a Charolais sire mated with crossbred Canchim X Zebu cows) and Canchim cattle. The data set contained 400 animals (born between 1999 and 2005) genotyped with the Illumina BovineHD panel. Imputation accuracy of genotypes from the Illumina-Bovine3K (3K), Illumina-BovineLD (6K), GeneSeek-Genomic-Profiler (GGP) BeefLD (GGP9K), GGP-IndicusLD (GGP20Ki), Illumina-BovineSNP50 (50K), GGP-IndicusHD (GGP75Ki), and GGP-BeefHD (GGP80K) to Illumina-BovineHD (HD) SNP panels was investigated. Seven scenarios for reference and target populations were tested; the animals were grouped according to birth year (S1), genetic groups (S2 and S3), genetic groups and birth year (S4 and S5), gender (S6), and gender and birth year (S7). Analyses were performed using FImpute and BEAGLE software and computation run-time was recorded. Genotype imputation accuracy was measured by concordance rate (CR) and allelic R square (R²). The highest imputation accuracy scenario consisted of a reference population with males and females and a target population with young females. Among the SNP panels in the tested scenarios, the 50K, GGP75Ki and GGP80K were the most adequate to impute to HD in Canchim cattle. FImpute reduced the computation run-time needed to impute genotypes 20- to 100-fold compared with BEAGLE. Genotyping panels with at least 50 thousand markers are suitable for genotype imputation to HD with acceptable accuracy. 
The FImpute algorithm demonstrated a higher efficiency of imputed markers, especially in lower density panels. These considerations may help to increase genotypic information, reduce genotyping costs, and aid genomic selection evaluations in crossbred animals.
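The two accuracy measures named above can be sketched as follows. This is a hedged illustration, assuming genotypes are coded as allele counts (0, 1 or 2), that concordance rate is the fraction of genotypes imputed identically to the true genotype, and that allelic R² is the squared Pearson correlation between true and imputed allele counts. The function names are hypothetical and the exact definitions used in the study may differ.

```python
def concordance_rate(true_genotypes, imputed_genotypes):
    """Fraction of genotypes whose imputed value equals the true value."""
    if len(true_genotypes) != len(imputed_genotypes) or not true_genotypes:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


def allelic_r2(true_dosage, imputed_dosage):
    """Squared Pearson correlation between true and imputed allele counts."""
    if len(true_dosage) != len(imputed_dosage) or not true_dosage:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_dosage)
    mean_t = sum(true_dosage) / n
    mean_i = sum(imputed_dosage) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(true_dosage, imputed_dosage))
    var_t = sum((t - mean_t) ** 2 for t in true_dosage)
    var_i = sum((i - mean_i) ** 2 for i in imputed_dosage)
    if var_t == 0 or var_i == 0:
        return float("nan")  # undefined for a monomorphic marker
    return cov * cov / (var_t * var_i)


# Toy genotypes for one SNP across six animals, made up for illustration.
true_g = [0, 1, 2, 1, 0, 2]
imputed_g = [0, 1, 2, 2, 0, 2]
cr = concordance_rate(true_g, imputed_g)
r2 = allelic_r2(true_g, imputed_g)
```

Unlike the concordance rate, the correlation-based measure is not inflated for rare variants, where always guessing the common homozygote already gives high concordance.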
The utility of low-density genotyping for imputation in the Thoroughbred horse
2014-01-01
Background: Despite the dramatic reduction in the cost of high-density genotyping that has occurred over the last decade, it remains one of the limiting factors for obtaining the large datasets required for genomic studies of disease in the horse. In this study, we investigated the potential for low-density genotyping and subsequent imputation to address this problem. Results: Using the haplotype phasing and imputation program, BEAGLE, it is possible to impute genotypes from low- to high-density (50K) in the Thoroughbred horse with reasonable to high accuracy. Analysis of the sources of variation in imputation accuracy revealed dependence both on the minor allele frequency of the single nucleotide polymorphisms (SNPs) being imputed and on the underlying linkage disequilibrium structure. Whereas equidistant spacing of the SNPs on the low-density panel worked well, optimising SNP selection to increase their minor allele frequency was advantageous, even when the panel was subsequently used in a population of different geographical origin. Replacing base pair position with linkage disequilibrium map distance reduced the variation in imputation accuracy across SNPs. Whereas a 1K SNP panel was generally sufficient to ensure that more than 80% of genotypes were correctly imputed, other studies suggest that a 2K to 3K panel is more efficient to minimize the subsequent loss of accuracy in genomic prediction analyses. The relationship between accuracy and genotyping costs for the different low-density panels suggests that a 2K SNP panel would represent good value for money. Conclusions: Low-density genotyping with a 2K SNP panel followed by imputation provides a compromise between cost and accuracy that could promote more widespread genotyping, and hence the use of genomic information in horses. 
In addition to offering a low cost alternative to high-density genotyping, imputation provides a means to combine datasets from different genotyping platforms, which is becoming necessary since researchers are starting to use the recently developed equine 70K SNP chip. However, more work is needed to evaluate the impact of between-breed differences on imputation accuracy. PMID:24495673
Fast and accurate imputation of summary statistics enhances evidence of functional enrichment
Pasaniuc, Bogdan; Zaitlen, Noah; Shi, Huwenbo; Bhatia, Gaurav; Gusev, Alexander; Pickrell, Joseph; Hirschhorn, Joel; Strachan, David P.; Patterson, Nick; Price, Alkes L.
2014-01-01
Motivation: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. Results: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1–5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case–control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ2 association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. 
We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses. Availability and implementation: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/. Contact: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu Supplementary information: Supplementary materials are available at Bioinformatics online. PMID:24990607
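The Gaussian imputation idea described above can be sketched with the textbook conditional mean of a multivariate normal distribution. This is only an illustration: the symbols z_t (observed association z-scores at typed SNPs), z_u (z-scores at untyped SNPs) and Σ (the linkage disequilibrium correlation matrix estimated from a reference panel) are assumptions rather than notation taken from the abstract, and the ridge term λI is one hypothetical way of allowing for the limited size of the reference panel, not necessarily the authors' estimator.

```latex
\begin{aligned}
\begin{pmatrix} z_t \\ z_u \end{pmatrix}
  &\sim \mathcal{N}\!\left(0,\;
  \begin{pmatrix} \Sigma_{t,t} & \Sigma_{t,u} \\ \Sigma_{u,t} & \Sigma_{u,u} \end{pmatrix}\right)
  && \text{(joint distribution under the null)} \\
\mathbb{E}[z_u \mid z_t] &= \Sigma_{u,t}\,\Sigma_{t,t}^{-1}\,z_t
  && \text{(imputed z-scores)} \\
\operatorname{Var}[z_u \mid z_t] &= \Sigma_{u,u} - \Sigma_{u,t}\,\Sigma_{t,t}^{-1}\,\Sigma_{t,u}
  && \text{(remaining uncertainty)} \\
\hat{z}_u^{(\lambda)} &= \Sigma_{u,t}\,(\Sigma_{t,t} + \lambda I)^{-1}\,z_t
  && \text{(regularised variant, } \lambda > 0\text{)}
\end{aligned}
```

Because only z-scores and a reference correlation matrix enter these expressions, no individual-level genotypes from the study sample are needed, which is the property the abstract emphasises.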
Pouladi, N; Achour, I; Li, H; Berghout, J; Kenost, C; Gonzalez-Garay, M L; Lussier, Y A
2016-11-10
Disease comorbidity is a pervasive phenomenon impacting patients' health outcomes, disease management, and clinical decisions. This review presents past, current and future research directions leveraging both phenotypic and molecular information to uncover disease similarity underpinning the biology and etiology of disease comorbidity. We retrieved ~130 publications and retained 59, ranging from 2006 to 2015, that comprise a minimum number of five diseases and at least one type of biomolecule. We surveyed their methods, disease similarity metrics, and calculation of comorbidities in the electronic health records, if present. Among the surveyed studies, 44% generated or validated disease similarity metrics in context of comorbidity, with 60% being published in the last two years. As inputs, 87% of studies utilized intragenic loci and proteins while 13% employed RNA (mRNA, LncRNA or miRNA). Network modeling was predominantly used (35%) followed by statistics (28%) to impute similarity between these biomolecules and diseases. Studies with large numbers of biomolecules and diseases used network models or naïve overlap of disease-molecule associations, while machine learning, statistics, and information retrieval were utilized in smaller and moderate sized studies. Multiscale computations comprising shared function, network topology, and phenotypes were performed exclusively on proteins. This review highlighted the growing methods for identifying the molecular mechanisms underpinning comorbidities that leverage multiscale molecular information and patterns from electronic health records. The survey unveiled that intergenic polymorphisms have been overlooked for similarity imputation compared to their intragenic counterparts, offering new opportunities to bridge the mechanistic and similarity gaps of comorbidity.
van der Kwast, Reginald V C T; van Ingen, Eva; Parma, Laura; Peters, Hendrika A B; Quax, Paul H A; Nossent, A Yaël
2018-02-02
Adenosine-to-inosine editing of microRNAs has the potential to cause a shift in target site selection. 2'-O-ribose-methylation of adenosine residues, however, has been shown to inhibit adenosine-to-inosine editing. We investigated whether angiomiR miR487b is subject to adenosine-to-inosine editing or 2'-O-ribose-methylation during neovascularization. Complementary DNA was prepared from C57BL/6-mice subjected to hindlimb ischemia. Using Sanger sequencing and endonuclease digestion, we identified and validated adenosine-to-inosine editing of the miR487b seed sequence. In the gastrocnemius muscle, pri-miR487b editing increased from 6.7±0.4% before to 11.7±1.6% (P=0.02) 1 day after ischemia. Edited pri-miR487b is processed into a novel microRNA, edited miR487b, which is also upregulated after ischemia. We confirmed editing of miR487b in multiple human primary vascular cell types. Short interfering RNA-mediated knockdown demonstrated that editing is adenosine deaminase acting on RNA 1 and 2 dependent. Using reverse-transcription at low dNTP concentrations followed by quantitative-PCR, we found that the same adenosine residue is methylated in mice and human primary cells. In the murine gastrocnemius, the estimated methylation fraction increased from 32.8±14% before to 53.6±12% 1 day after ischemia. Short interfering RNA knockdown confirmed that methylation is fibrillarin dependent. Although we could not confirm that methylation directly inhibits editing, we do show that adenosine deaminase acting on RNA 1 and 2 and fibrillarin negatively influence each other's expression. Using multiple luciferase reporter gene assays, we could demonstrate that editing results in a complete switch of target site selection. In human primary cells, we confirmed the shift in miR487b targeting after editing, resulting in an edited miR487b targetome that is enriched for multiple proangiogenic pathways. 
Furthermore, overexpression of edited miR487b, but not wild-type miR487b, stimulates angiogenesis in both in vitro and ex vivo assays. MiR487b is edited in the seed sequence in mice and humans, resulting in a novel, proangiogenic microRNA with a unique targetome. The rate of miR487b editing, as well as 2'-O-ribose-methylation, is increased in murine muscle tissue during postischemic neovascularization. Our findings suggest miR487b editing plays an intricate role in postischemic neovascularization. © 2017 American Heart Association, Inc.
Combining item response theory with multiple imputation to equate health assessment questionnaires.
Gu, Chenyang; Gutman, Roee
2017-09-01
The assessment of patients' functional status across the continuum of care requires a common patient assessment tool. However, assessment tools that are used in various health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected if they choose home health care provided by home health agencies. All three instruments or questionnaires include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary between the different settings. We consider equating different health assessment questionnaires as a missing data problem, and propose a variant of predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared our proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered, and in most of the experimental conditions that were examined, the proposed approach provides valid inferences, and generally has better coverages, relatively smaller biases, and shorter interval estimates. The proposed method is further illustrated using a real data set. © 2016, The International Biometric Society.
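Predictive mean matching, on which the proposed method builds, can be sketched in a deliberately simplified form. The sketch below is not the authors' IRT-based variant: it swaps in a single-predictor least-squares line as the prediction model, omits the posterior draw of model parameters that a proper multiple-imputation procedure would include, and uses hypothetical names (`fit_line`, `pmm_impute`) and made-up data.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, k=3, seed=0):
    """Fill None entries of ys by predictive mean matching on xs.

    For each missing y, the k observed cases whose predicted values are
    closest to its own prediction act as donors, and one donor's observed
    y is drawn at random. Imputed values are therefore always real,
    previously observed responses.
    """
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])
    donors = [(intercept + slope * x, y) for x, y in observed]
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        prediction = intercept + slope * x
        nearest = sorted(donors, key=lambda d: abs(d[0] - prediction))[:k]
        completed.append(rng.choice(nearest)[1])
    return completed


# Toy data, made up for illustration: two responses are missing.
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [1.1, 2.0, None, 4.2, 4.9, None]
completed = pmm_impute(toy_x, toy_y, k=3, seed=1)
```

Because donors supply observed values, the imputations stay on the original rating scale, which is one reason matching-based approaches are attractive for questionnaire items.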
miR-92a family and their target genes in tumorigenesis and metastasis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Molin, E-mail: molin_li@hotmail.com; Institute of Cancer Stem Cell, Dalian Medical University Cancer Center, Dalian 116044; Guan, Xingfang
2014-04-15
The miR-92a family, including miR-25, miR-92a-1, miR-92a-2 and miR-363, arises from three different paralog clusters, miR-17-92, miR-106a-363, and miR-106b-25, that are highly conserved in evolution, and it was regarded as a group of microRNAs (miRNAs) correlated with endothelial cells. Aberrant expression of the miR-92a family has been detected in multiple cancers, and disturbance of the miR-92a family is related to tumorigenesis and tumor development. In this review, progress on the relationship between the miR-92a family, their target genes, and malignant tumors is summarized. - Highlights: • Aberrant expression of miR-92a, miR-25 and miR-363 can be observed in many kinds of malignant tumors. • The expression of the miR-92a family is regulated by LOH, epigenetic alteration, transcriptional factors such as SP1, MYC, E2F, wild-type p53 etc. • Roles of the miR-92a family in tumorigenesis and development: promoting cell proliferation, invasion and metastasis, inhibiting cell apoptosis.
Multilayer checkpoints for microRNA authenticity during RISC assembly
Kawamata, Tomoko; Yoda, Mayuko; Tomari, Yukihide
2011-01-01
MicroRNAs (miRNAs) function through the RNA-induced silencing complex (RISC), which contains an Argonaute (Ago) protein at the core. RISC assembly follows a two-step pathway: miRNA/miRNA* duplex loading into Ago, and separation of the two strands within Ago. Here we show that the 5′ phosphate of the miRNA strand is essential for duplex loading into Ago, whereas the preferred 5′ nucleotide of the miRNA strand and the base-pairing status in the seed region and the middle of the 3′ region function as additive anchors to Ago. Consequently, the miRNA authenticity is inspected at multiple steps during RISC assembly. PMID:21738221
Webb-Robertson, Bobbie-Jo M.; Wiberg, Holli K.; Matzke, Melissa M.; ...
2015-04-09
In this review, we apply selected imputation strategies to label-free liquid chromatography–mass spectrometry (LC–MS) proteomics datasets to evaluate the accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches for individual merits and discuss the caveats of each approach with respect to the example LC–MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performances with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, performing classification without imputation yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. In summary, on the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.
Traveling with MI Education in a Turbulent Sea: Stories of South Korea
ERIC Educational Resources Information Center
Kim, Myung-Hee; Cha, Kyung-Hee
2008-01-01
The purpose of this paper is to explore ways in which multiple intelligences (MI) theory has been disseminated in search of its meaning, effectiveness and possibilities over the last decade in Korea. There have been a great number of Korean practitioners who have properly applied an ideal of MI theory in their local context. Western readers will…
Liguori, Maria; Nuzziello, Nicoletta; Licciulli, Flavio; Consiglio, Arianna; Simone, Marta; Viterbo, Rosa Gemma; Creanza, Teresa Maria; Ancona, Nicola; Tortorella, Carla; Margari, Lucia; Grillo, Giorgio; Giordano, Paola; Liuni, Sabino; Trojano, Maria
2018-01-01
Multiple sclerosis (MS) is a complex disease of the CNS that usually affects young adults, although 3-5% of cases are diagnosed in childhood and adolescence (hence called pediatric MS, PedMS). Genetic predisposition, among other factors, seems to contribute to the risk of the onset, in pediatric as in adult ages, but few studies have investigated the genetic 'environmentally naïve' load of PedMS. The main goal of this study was to identify circulating markers (miRNAs), target genes (mRNAs) and functional pathways associated with PedMS; we also verified the impact of miRNAs on clinical features, i.e. disability and cognitive performances. The investigation was performed in 19 PedMS and 20 pediatric controls (PCs) using a High-Throughput Next-generation Sequencing (HT-NGS) approach followed by an integrated bioinformatics/biostatistics analysis. Twelve miRNAs were significantly upregulated (let-7a-5p, let-7b-5p, miR-25-3p, miR-125a-5p, miR-942-5p, miR-221-3p, miR-652-3p, miR-182-5p, miR-185-5p, miR-181a-5p, miR-320a, miR-99b-5p) and 1 miRNA was downregulated (miR-148b-3p) in PedMS compared with PCs. The interactions between the significant miRNAs and their targets uncovered predicted genes (i.e. TNFSF13B, TLR2, BACH2, KLF4) related to immunological functions, as well as genes involved in autophagy-related processes (i.e. ATG16L1, SORT1, LAMP2) and ATPase activity (i.e. ABCA1, GPX3). No significant molecular profiles were associated with any PedMS demographic/clinical features. Both miRNA and mRNA expression predicted the phenotypes (PedMS-PC) with an accuracy of 92% and 91%, respectively. In our view, this original strategy of simultaneous miRNA/mRNA analysis may help to shed light on the genetic background of the disease, suggesting further molecular investigations into novel pathogenic mechanisms. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
ERIC Educational Resources Information Center
Mbuva, James
This paper focuses on the implementation of the multiple intelligences (MI) theory in 21st century teaching and learning environment, suggesting that it offers a new tool for effective teaching and learning at all levels. The eight current MI include: verbal/linguistic, logical/mathematical, visual/spatial, bodily/kinesthetic, musical/rhythmic,…