Shrinkage regression-based methods for microarray missing value imputation.
Wang, Hsiuying; Chiu, Chia-Chun; Wu, Yi-Ching; Wu, Wei-Sheng
2013-01-01
Missing values commonly occur in microarray data: datasets often contain more than 5% missing values, with up to 90% of genes affected. Inaccurate missing value estimation reduces the power of downstream microarray data analyses. Many types of methods have been developed to estimate missing values. Among them, regression-based methods are very popular and have been shown to outperform other types of methods on many testing microarray datasets. To further improve the performance of regression-based methods, we propose shrinkage regression-based methods. Our methods take advantage of the correlation structure in microarray data and select genes similar to the target gene by Pearson correlation coefficient. In addition, our methods incorporate the least squares principle, use a shrinkage estimation approach to adjust the coefficients of the regression model, and then use the adjusted coefficients to estimate missing values. Simulation results show that the proposed methods estimate missing values in six testing microarray datasets more accurately than the existing regression-based methods do. Imputation of missing values is a very important aspect of microarray data analysis because most downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. Since our proposed shrinkage regression-based methods provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.
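The regression step this abstract describes — pick the genes most correlated with the target gene, fit by least squares, shrink the coefficients, then predict the missing entry — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the neighbour count `k` and the scalar `shrink` factor are hypothetical stand-ins for the paper's actual shrinkage estimator.

```python
import numpy as np

def impute_shrinkage(X, target, miss_col, k=5, shrink=0.9):
    """Estimate X[target, miss_col] from the k genes (rows) most
    correlated with the target gene, via least squares with the fitted
    coefficients shrunk toward zero (illustrative scalar shrinkage)."""
    obs = [j for j in range(X.shape[1]) if j != miss_col]
    y = X[target, obs]
    # Pearson correlation of every other gene with the target gene
    others = [g for g in range(X.shape[0]) if g != target]
    corr = [abs(np.corrcoef(X[g, obs], y)[0, 1]) for g in others]
    nbrs = [others[i] for i in np.argsort(corr)[::-1][:k]]
    A = X[nbrs][:, obs].T                      # observed columns of neighbours
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(X[nbrs, miss_col] @ (shrink * beta))
```

With `shrink=1.0` this reduces to plain local least squares regression; values below 1 pull the regression coefficients toward zero, trading a little bias for lower variance.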
Autoregressive-model-based missing value estimation for DNA microarray time series data.
Choong, Miew Keen; Charbit, Maurice; Yan, Hong
2009-01-01
Missing value estimation is important in DNA microarray data analysis. A number of algorithms have been developed to solve this problem, but they have several limitations. Most existing algorithms cannot deal with the situation where a particular time point (column) of the data is missing entirely. In this paper, we present an autoregressive-model-based missing value estimation method (ARLSimpute) that takes into account the dynamic property of microarray temporal data and the local similarity structures in the data. ARLSimpute is especially effective when a particular time point contains many missing values or is missing entirely. Experimental results suggest that our proposed algorithm is an accurate missing value estimator in comparison with other imputation methods on simulated as well as real microarray time series datasets.
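The autoregressive idea behind a method like this can be illustrated with a least-squares AR(p) fit: regress each time point on its p predecessors, then use the fitted coefficients to predict an entirely missing time point. A minimal sketch, not ARLSimpute itself (which also exploits local similarity across genes):

```python
import numpy as np

def fit_ar(series, p=2):
    """Least-squares fit of AR(p) coefficients to a complete series:
    each row of X holds p consecutive lagged values, y the next value."""
    n = len(series)
    X = np.column_stack([series[i:n - p + i] for i in range(p)])
    y = series[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_next(series, coef):
    """Predict the time point following the series from its last p values."""
    p = len(coef)
    return float(series[-p:] @ coef)
```

For a column that is missing entirely, every gene's series can be extended this way, which is what makes AR-based imputation workable where neighbour-averaging schemes have no data in that column at all.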
Sehgal, Muhammad Shoaib B; Gondal, Iqbal; Dooley, Laurence S
2005-05-15
Microarray data are used in a range of application areas in biology, but they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate them as accurately as possible before applying these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be undertaken accurately. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented, which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least squares regression and linear programming methods. The new CMVE algorithm has been compared with existing estimation techniques, including Bayesian principal component analysis imputation (BPCA), least squares imputation (LSImpute) and K-nearest neighbour (KNN) imputation. All these methods were rigorously tested on three separate non-time-series (ovarian cancer) datasets and one time-series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure over a wide range of randomly introduced missing value probabilities, from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performs better not only for randomly occurring but also for real distributions of missing values. The results confirmed that CMVE consistently demonstrated superior and more robust estimation of missing values than the other methods for both types of data, at the same order of computational complexity.
A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm. The CMVE software is available upon request from the authors.
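The NRMS error measure used in this comparison (and in several of the other studies listed here) is easy to state concretely. One common convention normalizes the root-mean-square imputation error by the spread of the true values; the paper's exact normalization may differ, so treat this as illustrative:

```python
import numpy as np

def nrms_error(true_vals, estimates):
    """Normalized root mean square error: RMS deviation between the
    imputed and true values, scaled by the standard deviation of the
    true values (one common convention for imputation benchmarks)."""
    true_vals = np.asarray(true_vals, float)
    estimates = np.asarray(estimates, float)
    rms = np.sqrt(np.mean((true_vals - estimates) ** 2))
    return float(rms / np.std(true_vals))
```

A value of 0 means perfect recovery; a value near 1 means the imputer does no better than predicting the mean of the hidden values.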
40 CFR 98.35 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. Whenever a quality-assured value of a required parameter is... substitute data value for the missing parameter shall be used in the calculations. (a) For all units subject...
40 CFR 98.285 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. For the petroleum coke input procedure in § 98.283(b), a complete record of all...) For each missing value of the monthly carbon content of petroleum coke, the substitute data value...
Integrative missing value estimation for microarray data.
Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine
2006-10-12
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking the reference datasets into consideration. To determine whether the given reference datasets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art local least squares (LLS) imputation algorithm, by up to 15% in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially well suited to imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
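The submatrix-imputation idea — hide entries whose true values are known, impute them, and measure the error to decide whether a scheme (or a reference dataset) is worth using — generalizes well beyond iMISS. A minimal self-test harness, with the imputer passed in as a function (the row-mean imputer in the usage below is purely illustrative):

```python
import numpy as np

def evaluate_imputer(X, impute_fn, n_hide=10, seed=0):
    """Hide known entries of X at random, impute them with impute_fn(Xm,
    row, col), and return the NRMS error of the recovered values -- the
    submatrix-style self-test used to rank imputation schemes."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, X.shape[0], n_hide)
    cols = rng.integers(0, X.shape[1], n_hide)
    hidden = X[rows, cols].copy()          # ground truth, saved before masking
    Xm = X.copy()
    Xm[rows, cols] = np.nan
    est = np.array([impute_fn(Xm, r, c) for r, c in zip(rows, cols)])
    return float(np.sqrt(np.mean((est - hidden) ** 2)) / np.std(hidden))
```

Because the hidden values are known, the comparison needs no external gold standard — which is exactly why this style of benchmark recurs across the imputation literature.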
Missing-value estimation using linear and non-linear regression with Bayesian gene selection.
Zhou, Xiaobo; Wang, Xiaodong; Dougherty, Edward R
2003-11-22
Data from microarray experiments are usually in the form of large matrices of expression levels of genes under different experimental conditions. For various reasons, values are frequently missing. Estimating these missing values is important because they affect downstream analyses, such as clustering, classification and network design. Several methods of missing-value estimation are in use. The problem has two parts: (1) selection of genes for estimation and (2) design of an estimation rule. We propose Bayesian variable selection to obtain the genes to be used for estimation, and employ both linear and nonlinear regression for the estimation rule itself. Fast implementation issues for these methods are discussed, including the use of QR decomposition for parameter estimation. The proposed methods are tested on datasets arising from hereditary breast cancer and small round blue-cell tumors. The results compare very favorably with currently used methods based on the normalized root-mean-square error. The appendix is available from http://gspsnap.tamu.edu/gspweb/zxb/missing_zxb/ (user: gspweb; passwd: gsplab).
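The QR-based parameter estimation mentioned here is a standard trick: factor the design matrix as A = QR, then solve the triangular system R b = Qᵀy instead of forming the normal equations AᵀA b = Aᵀy, which squares the condition number. A compact sketch:

```python
import numpy as np

def lstsq_qr(A, y):
    """Solve min ||A b - y||_2 via reduced QR: A = QR with Q orthonormal
    columns and R upper triangular, so the solution is R^{-1} Q^T y.
    Avoids forming A^T A, which doubles the loss of precision."""
    Q, R = np.linalg.qr(A)        # reduced QR: Q is (m, n), R is (n, n)
    return np.linalg.solve(R, Q.T @ y)
```

For repeated estimation over many genes with the same design matrix, the factorization can be computed once and reused, which is the "fast implementation" angle the abstract alludes to.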
Ito, Tetsuya; Fukawa, Kazuo; Kamikawa, Mai; Nikaidou, Satoshi; Taniguchi, Masaaki; Arakawa, Aisaku; Tanaka, Genki; Mikawa, Satoshi; Furukawa, Tsutomu; Hirose, Kensuke
2018-01-01
Daily feed intake (DFI) is an important consideration for improving feed efficiency, but measurements from electronic feeder systems contain many missing and incorrect values. We therefore evaluated three methods for correcting missing DFI data (quadratic, orthogonal polynomial, and locally weighted (Loess) regression equations) and assessed the effects of these missing values on the genetic parameters and the estimated breeding values (EBV) for feeding traits. DFI records were obtained from 1622 Duroc pigs, comprising 902 individuals without missing DFI and 720 individuals with missing DFI. Of the three equations, the Loess equation was the most suitable method for correcting missing DFI values in datasets with 5-50% of records randomly deleted. Neither the variance components nor the heritability for average DFI (ADFI) changed with the proportion of missing DFI or with the Loess correction. In terms of rank correlation and information criteria, the Loess correction improved the accuracy of EBV for ADFI compared with the randomly deleted cases. These findings indicate that the Loess equation is useful for correcting missing DFI values for individual pigs and that correcting missing DFI values could be effective for the estimation of breeding values and genetic improvement using EBV for feeding traits. © 2017 The Authors. Animal Science Journal published by John Wiley & Sons Australia, Ltd on behalf of Japanese Society of Animal Science.
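A Loess correction of the kind evaluated here fits a weighted straight line through the days nearest the missing one, with tricube weights that fade with distance. The sketch below evaluates the fit at a single point; the window fraction `frac` is a hypothetical choice, and the study's actual smoothing parameters are not given in the abstract:

```python
import numpy as np

def loess_point(x, y, x0, frac=0.5):
    """Locally weighted linear regression (Loess) at one point x0: fit a
    tricube-weighted straight line to the nearest frac of the (x, y)
    data and evaluate it at x0 -- e.g. to fill a missing day's intake."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    k = max(2, int(np.ceil(frac * len(x))))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                   # the k nearest observed days
    h = d[idx].max() or 1.0                   # local bandwidth
    w = (1 - (d[idx] / h) ** 3) ** 3          # tricube weights
    A = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y[idx])
    return float(beta[0] + beta[1] * x0)
```

Unlike a single global quadratic, the local fit adapts to week-to-week changes in intake, which is presumably why it handled 5-50% deletion best in the study.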
40 CFR 98.265 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. (a) For each missing value of the inorganic carbon content of phosphate rock or... immediately preceding and immediately following the missing data incident. You must document and keep records...
Ertefaie, Ashkan; Flory, James H; Hennessy, Sean; Small, Dylan S
2017-06-15
Instrumental variable (IV) methods provide unbiased treatment effect estimation in the presence of unmeasured confounders under certain assumptions. To provide valid estimates of treatment effect, treatment effect confounders that are associated with the IV (IV-confounders) must be included in the analysis, and excluding observations with missing values may lead to bias. Missing covariate data are particularly problematic when the probability that a value is missing is related to the value itself, which is known as nonignorable missingness. In such cases, imputation-based methods are biased. Using health-care provider preference as the IV, we propose a 2-step procedure to estimate a valid treatment effect in the presence of baseline variables with nonignorable missing values. First, the provider preference IV value is estimated by performing a complete-case analysis using a random-effects model that includes IV-confounders. Second, the treatment effect is estimated using a 2-stage least squares IV approach that excludes IV-confounders with missing values. Simulation results are presented, and the method is applied to an analysis comparing the effects of sulfonylureas versus metformin on body mass index, where the variables baseline body mass index and glycosylated hemoglobin have missing values. Our result supports the association of sulfonylureas with weight gain. © The Author 2017. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
40 CFR 98.275 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... for estimating missing data. A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, according to the...
40 CFR 98.365 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, according to the...
40 CFR 98.175 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.345 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... for estimating missing data. A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, according to the...
40 CFR 98.465 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, in accordance with...
40 CFR 98.355 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... for estimating missing data. A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter must be used in the calculations, according to the following...
40 CFR 98.215 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... for estimating missing data. (a) A complete record of all measured parameters used in the GHG... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.55 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations as specified in paragraphs...
40 CFR 98.155 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. (a) A complete record of all measured parameters used in the GHG...), a substitute data value for the missing parameter shall be used in the calculations, according to...
40 CFR 98.125 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... unavailable, a substitute data value for the missing parameter must be used in the calculations as specified...
40 CFR 98.115 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.325 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, in accordance with...
40 CFR 98.65 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations, according to the following...
40 CFR 98.225 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations as specified in paragraphs...
40 CFR 98.175 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.115 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.125 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... unavailable, a substitute data value for the missing parameter must be used in the calculations as specified...
40 CFR 98.355 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... for estimating missing data. A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter must be used in the calculations, according to the following...
40 CFR 98.465 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, in accordance with...
40 CFR 98.325 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, in accordance with...
40 CFR 98.365 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, according to the...
40 CFR 98.465 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, in accordance with...
40 CFR 98.225 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations as specified in paragraphs...
40 CFR 98.345 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... for estimating missing data. A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, according to the...
40 CFR 98.65 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations, according to the following...
40 CFR 98.125 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... unavailable, a substitute data value for the missing parameter must be used in the calculations as specified...
40 CFR 98.55 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations as specified in paragraphs...
40 CFR 98.155 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. (a) A complete record of all measured parameters used in the GHG...), a substitute data value for the missing parameter shall be used in the calculations, according to...
40 CFR 98.55 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations as specified in paragraphs...
40 CFR 98.65 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations, according to the following...
40 CFR 98.265 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter must be used in the calculations as specified in paragraphs...
40 CFR 98.355 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... for estimating missing data. A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter must be used in the calculations, according to the following...
40 CFR 98.345 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... for estimating missing data. A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, according to the...
40 CFR 98.215 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... for estimating missing data. (a) A complete record of all measured parameters used in the GHG... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.325 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, in accordance with...
40 CFR 98.465 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, in accordance with...
40 CFR 98.175 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.225 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations as specified in paragraphs...
40 CFR 98.155 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. (a) A complete record of all measured parameters used in the GHG...), a substitute data value for the missing parameter shall be used in the calculations, according to...
40 CFR 98.65 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations, according to the following...
40 CFR 98.225 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... substitute data value for the missing parameter shall be used in the calculations as specified in paragraphs...
40 CFR 98.365 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emissions... substitute data value for the missing parameter shall be used in the calculations, according to the...
40 CFR 98.205 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emission... substitute data value for the missing parameter will be used in the calculations as specified in paragraph (b...
40 CFR 98.255 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... during unit operation or if a required fuel sample is not taken), a substitute data value for the missing...
40 CFR 98.415 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. (a) A complete record of all measured parameters used in the GHG... unavailable (e.g., if a meter malfunctions), a substitute data value for the missing parameter shall be used...
40 CFR 98.315 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. For the petroleum coke input procedure in § 98.313(b), a complete record of all... substitute data value for the missing parameter shall be used in the calculations as specified in the...
40 CFR 98.425 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. (a) Whenever the quality assurance procedures in § 98.424(a)(1) of this subpart... following missing data procedures shall be followed: (1) A quarterly CO2 mass flow or volumetric flow value...
40 CFR 98.415 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. (a) A complete record of all measured parameters used in the GHG... unavailable (e.g., if a meter malfunctions), a substitute data value for the missing parameter shall be used...
40 CFR 98.295 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. For the emission calculation methodologies in § 98.293(b)(2) and (b)(3), a complete... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.425 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. (a) Whenever the quality assurance procedures in § 98.424(a) of this subpart cannot... following missing data procedures shall be followed: (1) A quarterly CO2 mass flow or volumetric flow value...
40 CFR 98.205 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emission... substitute data value for the missing parameter will be used in the calculations as specified in paragraph (b...
40 CFR 98.255 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... during unit operation or if a required fuel sample is not taken), a substitute data value for the missing...
40 CFR 98.205 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emission... substitute data value for the missing parameter will be used in the calculations as specified in paragraph (b...
40 CFR 98.255 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... during unit operation or if a required fuel sample is not taken), a substitute data value for the missing...
40 CFR 98.315 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. For the petroleum coke input procedure in § 98.313(b), a complete record of all... substitute data value for the missing parameter shall be used in the calculations as specified in the...
40 CFR 98.315 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. For the petroleum coke input procedure in § 98.313(b), a complete record of all... substitute data value for the missing parameter shall be used in the calculations as specified in the...
40 CFR 98.205 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. (a) A complete record of all measured parameters used in the GHG emission... substitute data value for the missing parameter will be used in the calculations as specified in paragraph (b...
40 CFR 98.315 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. For the petroleum coke input procedure in § 98.313(b), a complete record of all... substitute data value for the missing parameter shall be used in the calculations as specified in the...
40 CFR 98.295 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. For the emission calculation methodologies in § 98.293(b)(2) and (b)(3), a complete... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.425 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. (a) Whenever the quality assurance procedures in § 98.424(a)(1) of this subpart... following missing data procedures shall be followed: (1) A quarterly CO2 mass flow or volumetric flow value...
40 CFR 98.255 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... during unit operation or if a required fuel sample is not taken), a substitute data value for the missing...
40 CFR 98.195 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. For the procedure in § 98.193(b)(1), a complete record of all measured parameters... all available process data or data used for accounting purposes. (b) For missing values related to the...
40 CFR 98.295 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. For the emission calculation methodologies in § 98.293(b)(2) and (b)(3), a complete... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.415 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. (a) A complete record of all measured parameters used in the GHG... unavailable (e.g., if a meter malfunctions), a substitute data value for the missing parameter shall be used...
40 CFR 98.415 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. (a) A complete record of all measured parameters used in the GHG... unavailable (e.g., if a meter malfunctions), a substitute data value for the missing parameter shall be used...
40 CFR 98.425 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. (a) Whenever the quality assurance procedures in § 98.424(a)(1) of this subpart... following missing data procedures shall be followed: (1) A quarterly CO2 mass flow or volumetric flow value...
40 CFR 98.295 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. For the emission calculation methodologies in § 98.293(b)(2) and (b)(3), a complete... unavailable, a substitute data value for the missing parameter shall be used in the calculations as specified...
40 CFR 98.255 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations... during unit operation or if a required fuel sample is not taken), a substitute data value for the missing...
40 CFR 98.425 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. (a) Whenever the quality assurance procedures in § 98.424(a)(1) of this subpart... following missing data procedures shall be followed: (1) A quarterly CO2 mass flow or volumetric flow value...
40 CFR 98.415 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. (a) A complete record of all measured parameters used in the GHG... unavailable (e.g., if a meter malfunctions), a substitute data value for the missing parameter shall be used...
40 CFR 98.315 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... measured parameters used in the GHG emissions calculations is required (e.g., carbon content values, etc... such estimates. (a) For each missing value of the monthly carbon content of calcined petroleum coke the substitute data value shall be the arithmetic average of the quality-assured values of carbon contents for...
Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets.
Huang, Min-Wei; Lin, Wei-Chao; Tsai, Chih-Fong
2018-01-01
Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem by providing estimates for the missing values through a reasoning process based on the (complete) observed data. However, if the observed data contain noisy information or outliers, the estimates of the missing values may not be reliable, or may even differ substantially from the real values. The aim of this paper is to examine whether combining instance selection on the observed data with missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are combined in order to find the best pairing. The experimental results show that performing instance selection can have a positive impact on missing value imputation for numerical medical datasets, and that specific combinations of instance selection and imputation methods can improve the imputation results for mixed-type medical datasets. However, instance selection does not have a consistently positive impact on the imputation results for categorical medical datasets.
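As an illustration of the KNNI scheme named above, here is a minimal sketch of k-nearest-neighbour imputation. The toy data, the Euclidean distance over commonly observed features, and k = 2 are illustrative assumptions, not the paper's exact configuration.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries using the mean of the k nearest complete rows.

    Distance is Euclidean over the features observed in both rows (a
    common KNNI convention); rows with no missing values serve as the
    candidate neighbour pool. Assumes at least k complete rows exist.
    """
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        def dist(c):
            # Compare only on the features that row r actually observed.
            pairs = [(a, b) for a, b in zip(r, c) if a is not None]
            return math.sqrt(sum((a - b) ** 2 for a, b in pairs))
        neighbours = sorted(complete, key=dist)[:k]
        filled.append([
            v if v is not None else sum(n[i] for n in neighbours) / k
            for i, v in enumerate(r)
        ])
    return filled

data = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [1.05, None]]
print(knn_impute(data, k=2))
```

The missing entry in the last row is filled from its two closest complete rows, ignoring the distant outlier row; combining this with instance selection, as the paper proposes, would prune unreliable rows from the neighbour pool first.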
Two-pass imputation algorithm for missing value estimation in gene expression time series.
Tsiporkova, Elena; Boeva, Veselka
2007-10-01
Gene expression microarray experiments frequently generate datasets with multiple missing values. However, most analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. The accurate estimation of missing values in such datasets has therefore been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance to measure the similarity between time expression profiles, and subsequently selects, for each gene expression profile with missing values, a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These were initially prototyped in Perl, and their accuracy was evaluated on yeast expression time series data using several different parameter settings. The experiments showed that the two-pass algorithm consistently outperforms the neighborhood-wise and position-wise algorithms, in particular for datasets with a higher level of missing entries. The performance of the two-pass DTWimpute algorithm was further benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former proved superior to the latter. Motivated by these findings, which clearly indicate the added value of DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides a choice between three different initial rough imputation methods.
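The DTW distance at the core of DTWimpute can be sketched with the classic dynamic-programming recurrence; the toy profiles below are illustrative, not taken from the paper.

```python
def dtw_distance(s, t):
    """Classic dynamic-programming DTW distance between two sequences,
    with absolute difference as the local cost."""
    n, m = len(s), len(t)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# A time-shifted copy of a profile is closer under DTW than a flat one,
# which is why DTW suits temporally misaligned expression profiles.
a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0]   # same shape, shifted by one step
c = [1.0, 1.0, 1.0, 1.0, 1.0]   # flat profile
print(dtw_distance(a, b), dtw_distance(a, c))
```

Candidate profiles for imputation would then be ranked by this distance rather than by pointwise Euclidean distance.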
40 CFR 98.195 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. For the procedure in § 98.193(b)(1), a complete record of all measured parameters... available process data or data used for accounting purposes. (b) For missing values related to the CaO and...
40 CFR 98.195 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. For the procedure in § 98.193(b)(1), a complete record of all measured parameters... available process data or data used for accounting purposes. (b) For missing values related to the CaO and...
40 CFR 98.195 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. For the procedure in § 98.193(b)(2), a complete record of all measured parameters... process data or data used for accounting purposes. (b) For missing values related to the CaO and MgO...
40 CFR 98.195 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. For the procedure in § 98.193(b)(1), a complete record of all measured parameters... available process data or data used for accounting purposes. (b) For missing values related to the CaO and...
A Review On Missing Value Estimation Using Imputation Algorithm
NASA Astrophysics Data System (ADS)
Armina, Roslan; Zain, Azlan Mohd; Azizah Ali, Nor; Sallehuddin, Roselina
2017-09-01
The presence of missing values in a data set has always been a major obstacle to precise prediction. A method for imputing missing values needs to minimize the effect of incomplete data sets on the prediction model, and many algorithms have been proposed to counter the missing value problem. In this review, we provide a comprehensive analysis of existing imputation algorithms, focusing on the techniques used and on whether global or local information in the data set is exploited for missing value estimation. In addition, validation methods for imputation results and ways to measure the performance of imputation algorithms are described. The objective of this review is to highlight possible improvements to existing methods, and it is hoped that it gives the reader a better understanding of trends in imputation methods.
40 CFR 98.295 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... value shall be the best available estimate(s) of the parameter(s), based on all available process data or data used for accounting purposes. (c) For each missing value collected during the performance test (hourly CO2 concentration, stack gas volumetric flow rate, or average process vent flow from mine...
Estimating monthly streamflow values by cokriging
Solow, A.R.; Gorelick, S.M.
1986-01-01
Cokriging is applied to the estimation of missing monthly streamflow values in three records from gaging stations in west central Virginia. Missing values are estimated from optimal consideration of the pattern of auto- and cross-correlation among standardized residual log-flow records. Investigation of the sensitivity of estimation to data configuration showed that when observations are available within two months of a missing value, estimation is improved by accounting for correlation. Concurrent and lag-one observations tend to screen the influence of other available observations. Three models of covariance structure in residual log-flow records are compared using cross-validation. The models differ in how much monthly variation they allow in covariance. Precision of estimation, reflected in mean squared error (MSE), proved to be insensitive to this choice. Cross-validation is suggested as a tool for choosing an inverse transformation when an initial nonlinear transformation is applied to flow values. © 1986 Plenum Publishing Corporation.
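A toy illustration of the idea behind this approach: estimating a gap from a cross-correlated record, here reduced to ordinary least-squares regression on a concurrent observation at a nearby gauge. Cokriging proper also weights observations by the full auto- and cross-covariance structure; the gauge data below are invented.

```python
def regress_estimate(xs, ys, x_new):
    """Fit y = b0 + b1*x by ordinary least squares on paired
    observations, then evaluate at x_new. A one-covariate stand-in
    for the cross-correlation weighting that cokriging formalizes."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0 + b1 * x_new

# Concurrent log-flows at a nearby gauge (xs) paired with the gauge
# that has the gap (ys); the missing month is predicted from the
# nearby gauge's value for that month.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.1, 8.0]
print(round(regress_estimate(xs, ys, 2.5), 3))
```

The abstract's finding that concurrent and lag-one observations "screen" other data corresponds to adding those lagged terms as further regressors, which is what the cokriging system does optimally.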
Depth inpainting by tensor voting.
Kulkarni, Mandar; Rajagopalan, Ambasamudram N
2013-06-01
Depth maps captured by range scanning devices or by using optical cameras often suffer from missing regions due to occlusions, reflectivity, limited scanning area, sensor imperfections, etc. In this paper, we propose a fast and reliable algorithm for depth map inpainting using the tensor voting (TV) framework. For less complex missing regions, local edge and depth information is utilized for synthesizing missing values. The depth variations are modeled by local planes using 3D TV, and missing values are estimated using plane equations. For large and complex missing regions, we collect and evaluate depth estimates from self-similar (training) datasets. We align the depth maps of the training set with the target (defective) depth map and evaluate the goodness of depth estimates among candidate values using 3D TV. We demonstrate the effectiveness of the proposed approaches on real as well as synthetic data.
Graffelman, Jan; Sánchez, Milagros; Cook, Samantha; Moreno, Victor
2013-01-01
In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missings are typically lowered with respect to inbreeding coefficients estimated by discarding the missings. Accounting for missings by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions.
40 CFR 98.85 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations in § 98... substitute data value for the missing parameter shall be used in the calculations. The owner or operator must...
40 CFR 98.85 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations in § 98... substitute data value for the missing parameter shall be used in the calculations. The owner or operator must...
40 CFR 98.85 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations in § 98... substitute data value for the missing parameter shall be used in the calculations. The owner or operator must...
40 CFR 98.185 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations in § 98... substitute data value for the missing parameter shall be used in the calculations as specified in the...
40 CFR 98.85 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations in § 98... substitute data value for the missing parameter shall be used in the calculations. The owner or operator must...
40 CFR 98.185 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations in § 98... substitute data value for the missing parameter shall be used in the calculations as specified in the...
40 CFR 98.185 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations in § 98... substitute data value for the missing parameter shall be used in the calculations as specified in the...
40 CFR 98.185 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations in § 98... substitute data value for the missing parameter shall be used in the calculations as specified in the...
Improving cluster-based missing value estimation of DNA microarray data.
Brás, Lígia P; Menezes, José C
2007-06-01
We present a modification of the weighted K-nearest neighbours imputation method (KNNimpute) for missing value (MV) estimation in microarray data based on the reuse of estimated data. The method was called iterative KNN imputation (IKNNimpute) as the estimation is performed iteratively using the recently estimated values. The estimation efficiency of IKNNimpute was assessed under different conditions (data type, fraction and structure of missing data) by the normalized root mean squared error (NRMSE) and the correlation coefficients between estimated and true values, and compared with those of other cluster-based estimation methods (KNNimpute and sequential KNN). We further investigated the influence of imputation on the detection of differentially expressed genes using SAM by examining the differentially expressed genes that are lost after MV estimation. The performance measures give consistent results, indicating that the iterative procedure of IKNNimpute can enhance the prediction ability of cluster-based methods in the presence of high missing rates, in non-time series experiments and in data sets comprising both time series and non-time series data, because the information of the genes having MVs is used more efficiently and the iterative procedure allows the MV estimates to be refined. More importantly, IKNNimpute has a smaller detrimental effect on the detection of differentially expressed genes.
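The iterative reuse of freshly imputed values that distinguishes IKNNimpute from plain KNNimpute can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Euclidean row distances, inverse-distance weighting, and a fixed number of passes.

```python
import numpy as np

def iknn_impute(data, k=3, n_iter=5):
    """Iterative KNN imputation sketch: start from row-mean estimates,
    then repeatedly re-estimate each missing entry as the weighted
    average of the k most similar rows, reusing values imputed in
    earlier passes (the key idea behind IKNNimpute)."""
    X = np.array(data, dtype=float)
    miss = np.isnan(X)
    # initial guess: fill each missing entry with its row (gene) mean
    row_means = np.nanmean(X, axis=1)
    X[miss] = np.take(row_means, np.where(miss)[0])
    for _ in range(n_iter):
        for i, j in zip(*np.where(miss)):
            d = np.linalg.norm(X - X[i], axis=1)  # distances to all rows
            d[i] = np.inf                          # exclude the row itself
            nbrs = np.argsort(d)[:k]               # k nearest neighbours
            w = 1.0 / (d[nbrs] + 1e-12)            # inverse-distance weights
            X[i, j] = np.sum(w * X[nbrs, j]) / np.sum(w)
    return X
```

Because each pass recomputes distances on the partially imputed matrix, later estimates benefit from earlier ones, which is the mechanism the abstract credits for the gain at high missing rates.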
Gaussian mixture clustering and imputation of microarray data.
Ouyang, Ming; Welsh, William J; Georgopoulos, Panos
2004-04-12
In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
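The mixture-model imputation evaluated here can be illustrated in its simplest form, a single multivariate Gaussian fitted by EM, where the E-step fills each missing block with its conditional mean given the row's observed entries. This is a hedged sketch of that one-component special case (with an assumed small ridge term for numerical stability), not the paper's full model-averaged mixture.

```python
import numpy as np

def gaussian_em_impute(data, n_iter=20):
    """EM imputation under a single multivariate Gaussian: alternate
    between re-estimating the mean/covariance (M-step) and replacing
    missing coordinates with their conditional means given the observed
    coordinates (E-step)."""
    X = np.array(data, dtype=float)
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])  # crude initial fill
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # ridge for stability
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # conditional mean: mu_m + S_mo S_oo^{-1} (x_o - mu_o)
            s_oo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
            X[i, m] = mu[m] + sigma[np.ix_(m, o)] @ s_oo_inv @ (X[i, o] - mu[o])
    return X
```

Extending this to a K-component mixture replaces the single conditional mean with a responsibility-weighted average over components, which is what lets the full method adapt to clustered expression profiles.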
Tian, Ting; McLachlan, Geoffrey J.; Dieters, Mark J.; Basford, Kaye E.
2015-01-01
It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways: multiple agglomerative hierarchical clustering, a normal distribution model, a normal regression model, and predictive mean matching. The latter three models used both Bayesian and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing, and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis had slightly higher estimation accuracy than those using non-Bayesian analysis, but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the overall best performance. PMID:26689369
Analysis of Longitudinal Outcome Data with Missing Values in Total Knee Arthroplasty.
Kang, Yeon Gwi; Lee, Jang Taek; Kang, Jong Yeal; Kim, Ga Hye; Kim, Tae Kyun
2016-01-01
We sought to determine the influence of missing data on the statistical results, and to determine which statistical method is most appropriate for the analysis of longitudinal outcome data of TKA with missing values among repeated measures ANOVA, generalized estimating equations (GEE) and the mixed effects model repeated measures (MMRM). Data sets with missing values were generated with different proportions of missing data, sample sizes and missing-data generation mechanisms. Each data set was analyzed with the three statistical methods. The influence of missing data was greater with a higher proportion of missing data and a smaller sample size. MMRM tended to show the least change in the statistics. When missing values were generated by a 'missing not at random' mechanism, no statistical method could fully avoid deviations in the results. Copyright © 2016 Elsevier Inc. All rights reserved.
Missing value imputation in DNA microarrays based on conjugate gradient method.
Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh
2012-02-01
Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm based on the conjugate gradient (CG) method is proposed to estimate missing values. The k-nearest neighbors of the missing entry are first selected based on the absolute values of their Pearson correlation coefficients. Then a subset of genes among the k-nearest neighbors is labeled as the most similar ones. The CG algorithm with this subset as its input is then used to estimate the missing values. Our proposed CG-based algorithm (CGimpute) is evaluated on different data sets. The results are compared with the sequential local least squares (SLLSimpute), Bayesian principal component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute) and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The average normalized root mean squared error (NRMSE) and relative NRMSE in different data sets with various missing rates show that CGimpute outperforms the other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
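Two reusable pieces of this pipeline are easy to sketch: the neighbour-selection step based on absolute Pearson correlation, and the NRMSE metric used for evaluation. These are minimal illustrations under the usual definitions, not CGimpute's exact code (the conjugate gradient solve itself is omitted).

```python
import numpy as np

def select_neighbours(target, candidates, k):
    """Indices of the k candidate rows most similar to the target row,
    ranked by absolute Pearson correlation coefficient."""
    corr = np.array([abs(np.corrcoef(target, c)[0, 1]) for c in candidates])
    return np.argsort(corr)[::-1][:k]

def nrmse(true, imputed):
    """Normalized root mean squared error over the imputed positions."""
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return np.sqrt(np.mean((true - imputed) ** 2)) / np.std(true)
```

Taking the absolute value of the correlation is what lets strongly anti-correlated genes also serve as predictors, as the abstract notes.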
Accounting for undetected compounds in statistical analyses of mass spectrometry 'omic studies.
Taylor, Sandra L; Leiserowitz, Gary S; Kim, Kyoungmi
2013-12-01
Mass spectrometry is an important high-throughput technique for profiling small molecular compounds in biological samples and is widely used to identify potential diagnostic and prognostic compounds associated with disease. Commonly, the data generated by mass spectrometry contain many missing values, which result when a compound is absent from a sample or is present at a concentration below the detection limit. Several strategies are available for statistically analyzing data with missing values. The accelerated failure time (AFT) model assumes all missing values result from censoring below a detection limit. Under a mixture model, missing values can result from a combination of censoring and the absence of a compound. We compare the power and estimation of a mixture model to those of an AFT model. Based on simulated data, we found the AFT model to have greater power to detect differences in means and point mass proportions between groups. However, the AFT model yielded biased estimates, with the bias increasing as the proportion of observations in the point mass increased, while estimates were unbiased with the mixture model unless all missing observations came from censoring. These findings suggest using the AFT model for hypothesis testing and the mixture model for estimation. We demonstrated this approach through application to glycomics data from serum samples of women with ovarian cancer and matched controls.
40 CFR 98.225 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... data shall be the best available estimate based on all available process data or data used for accounting purposes (such as sales records). (b) For missing values related to the performance test...
Treatment of Missing Data in Workforce Education Research
ERIC Educational Resources Information Center
Gemici, Sinan; Rojewski, Jay W.; Lee, In Heok
2012-01-01
Most quantitative analyses in workforce education are affected by missing data. Traditional approaches to remedy missing data problems often result in reduced statistical power and biased parameter estimates due to systematic differences between missing and observed values. This article examines the treatment of missing data in pertinent…
Kalman Filtering for Genetic Regulatory Networks with Missing Values
Liu, Qiuhua; Lai, Tianyue; Wang, Wu
2017-01-01
The filtering problem with missing values for genetic regulatory networks (GRNs) is addressed, in which noise exists in both the state dynamics and the measurement equations; furthermore, the correlation between process noise and measurement noise is also taken into consideration. To deal with the filtering problem, a class of discrete-time GRNs with missing values, noise correlation, and time delays is established. A new observation model is then proposed to reduce the adverse effect caused by the missing values and to decouple the correlation between process noise and measurement noise in theory. Finally, Kalman filtering is used to estimate the states of the GRNs. A typical example is provided to verify the effectiveness of the proposed method, showing that the concentrations of mRNA and protein can be estimated accurately. PMID:28814967
40 CFR 98.55 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... substitute data shall be the best available estimate based on all available process data or data used for accounting purposes (such as sales records). (b) For missing values related to the performance test...
ERIC Educational Resources Information Center
Acock, Alan C.
2005-01-01
Less than optimum strategies for missing values can produce biased estimates, distorted statistical power, and invalid conclusions. After reviewing traditional approaches (listwise, pairwise, and mean substitution), selected alternatives are covered including single imputation, multiple imputation, and full information maximum likelihood…
What You Don't Know Can Hurt You: Missing Data and Partial Credit Model Estimates
Thomas, Sarah L.; Schmidt, Karen M.; Erbacher, Monica K.; Bergeman, Cindy S.
2017-01-01
The authors investigated the effect of Missing Completely at Random (MCAR) item responses on partial credit model (PCM) parameter estimates in a longitudinal study of Positive Affect. Participants were 307 adults from the older cohort of the Notre Dame Study of Health and Well-Being (Bergeman and Deboeck, 2014) who completed questionnaires including Positive Affect items for 56 days. Additional missing responses were introduced to the data, randomly replacing 20%, 50%, and 70% of the responses on each item and each day with missing values, in addition to the existing missing data. Results indicated that item locations and person trait level measures diverged from the original estimates as the level of degradation from induced missing data increased. In addition, standard errors of these estimates increased with the level of degradation. Thus, MCAR data does damage the quality and precision of PCM estimates. PMID:26784376
NASA Astrophysics Data System (ADS)
Zainudin, Mohd Lutfi; Saaban, Azizan; Bakar, Mohd Nazari Abu
2015-12-01
Solar radiation values are recorded by automatic weather stations using a device called a pyranometer, which logs the dispersed radiation; these data are very useful for experimental work and for the development of solar devices. In addition, modeling and designing solar radiation system applications requires complete observational data. Unfortunately, incomplete solar radiation records occur frequently due to several technical problems, mainly in the monitoring device. To counter this, missing values are estimated so that absent values can be substituted with imputed data. This paper evaluates several piecewise interpolation techniques, namely linear, spline, cubic, and nearest-neighbour interpolation, for dealing with missing values in hourly solar radiation data. It then proposes an extension investigating the potential use of the cubic Bezier technique and the cubic Said-Ball method as estimators. The results show that the cubic Bezier and Said-Ball methods perform best compared with the other piecewise imputation techniques.
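Of the techniques compared above, piecewise-linear interpolation is the simplest to reproduce. A minimal numpy sketch of filling NaN gaps in an hourly series (assuming observations exist on both sides of each gap) might look like:

```python
import numpy as np

def interpolate_gaps(values):
    """Fill NaN gaps in a regularly sampled series by piecewise-linear
    interpolation between the nearest observed neighbours."""
    y = np.array(values, dtype=float)
    miss = np.isnan(y)
    x = np.arange(len(y))  # hourly index
    y[miss] = np.interp(x[miss], x[~miss], y[~miss])
    return y
```

Spline, cubic, and nearest-neighbour variants follow the same pattern with a different interpolant (e.g. `scipy.interpolate.CubicSpline`); the Bezier and Said-Ball estimators studied in the paper likewise replace only the interpolant, not the gap-filling logic.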
Missing data and multiple imputation in clinical epidemiological research.
Pedersen, Alma B; Mikkelsen, Ellen M; Cronin-Fenton, Deirdre; Kristensen, Nickolaj R; Pham, Tra My; Pedersen, Lars; Petersen, Irene
2017-01-01
Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can constitute considerable challenges in the analyses and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analyses, missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method to deal with missing data, which accounts for the uncertainty associated with missing data. Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data.
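The pooling step that gives multiple imputation its honest uncertainty accounting is Rubin's rules: analyze each imputed dataset separately, then combine. A minimal sketch of the standard formulas (not tied to any particular statistical software):

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Rubin's rules: the pooled estimate is the mean across the m
    imputed datasets; the total variance adds the average
    within-imputation variance to the between-imputation variance
    inflated by (1 + 1/m)."""
    q = np.asarray(estimates, dtype=float)   # one estimate per imputed dataset
    u = np.asarray(variances, dtype=float)   # its squared standard error
    m = len(q)
    q_bar = q.mean()
    within = u.mean()
    between = q.var(ddof=1)
    total_var = within + (1.0 + 1.0 / m) * between
    return q_bar, total_var
```

The between-imputation term is what single imputation discards, which is why single imputation tends to understate standard errors.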
Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.
Ritz, Cecilia; Edén, Patrik
2008-01-19
For 2-dye microarray platforms, some missing values may arise from an un-measurably low RNA expression in one channel only. Information of such "one-channel depletion" is so far not included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates were reduced up to 51%, depending on dataset. By including more information in the imputation step, we more accurately estimate missing expression values.
Tran, V H Huynh; Gilbert, H; David, I
2017-01-01
With the development of automatic self-feeders, repeated measurements of feed intake are becoming easier in an increasing number of species. However, the corresponding BW are not always recorded, and these missing values complicate the longitudinal analysis of the feed conversion ratio (FCR). Our aim was to evaluate the impact of missing BW data on estimations of the genetic parameters of FCR and ways to improve the estimations. On the basis of the missing BW profile in French Large White pigs (male pigs weighed weekly, females and castrated males weighed monthly), we compared 2 different ways of predicting missing BW, 1 using a Gompertz model and 1 using a linear interpolation. For the first part of the study, we used 17,398 weekly records of BW and feed intake recorded over 16 consecutive weeks in 1,222 growing male pigs. We performed a simulation study on this data set to mimic missing BW values according to the pattern of weekly proportions of incomplete BW data in females and castrated males. The FCR was then computed for each week using observed data (obser_FCR), data with missing BW (miss_FCR), data with BW predicted using a Gompertz model (Gomp_FCR), and data with BW predicted by linear interpolation (interp_FCR). Heritability (h) was estimated, and the EBV was predicted for each repeated FCR using a random regression model. In the second part of the study, the full data set (males with their complete BW records, castrated males and females with missing BW) was analyzed using the same methods (miss_FCR, Gomp_FCR, and interp_FCR). Results of the simulation study showed that h were overestimated in the case of missing BW and that predicting BW using a linear interpolation provided a more accurate estimation of h and of EBV than a Gompertz model. Over 100 simulations, the correlation between obser_EBV and interp_EBV, Gomp_EBV, and miss_EBV was 0.93 ± 0.02, 0.91 ± 0.01, and 0.79 ± 0.04, respectively. 
The heritabilities obtained with the full data set were quite similar for miss_FCR, Gomp_FCR, and interp_FCR. In conclusion, when the proportion of missing BW is high, genetic parameters of FCR are not well estimated. In French Large White pigs, in the growing period extending from d 65 to 168, prediction of missing BW using a Gompertz growth model slightly improved the estimations, but the linear interpolation improved the estimation to a greater extent. This result is due to the linear rather than sigmoidal increase in BW over the study period.
A bias-corrected estimator in multiple imputation for missing data.
Tomita, Hiroaki; Fujisawa, Hironori; Henmi, Masayuki
2018-05-29
Multiple imputation (MI) is one of the most popular methods to deal with missing data, and its use has been rapidly increasing in medical studies. Although MI is rather appealing in practice since it is possible to use ordinary statistical methods for a complete data set once the missing values are fully imputed, the method of imputation is still problematic. If the missing values are imputed from some parametric model, the validity of imputation is not necessarily ensured, and the final estimate for a parameter of interest can be biased unless the parametric model is correctly specified. Nonparametric methods have been also proposed for MI, but it is not so straightforward as to produce imputation values from nonparametrically estimated distributions. In this paper, we propose a new method for MI to obtain a consistent (or asymptotically unbiased) final estimate even if the imputation model is misspecified. The key idea is to use an imputation model from which the imputation values are easily produced and to make a proper correction in the likelihood function after the imputation by using the density ratio between the imputation model and the true conditional density function for the missing variable as a weight. Although the conditional density must be nonparametrically estimated, it is not used for the imputation. The performance of our method is evaluated by both theory and simulation studies. A real data analysis is also conducted to illustrate our method by using the Duke Cardiac Catheterization Coronary Artery Disease Diagnostic Dataset. Copyright © 2018 John Wiley & Sons, Ltd.
A nonparametric multiple imputation approach for missing categorical data.
Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu-Hsieh
2017-06-06
Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. 
We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.
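The donor-selection mechanics described above reduce to two small steps: combine the two working-model predictions into one score, then take the observed cases with the closest scores as donors. A hedged sketch (the weighting scheme and distance here are illustrative choices, not the paper's exact formulas):

```python
import numpy as np

def predictive_score(p_outcome, p_missing, w=0.5):
    """Combine fitted outcome-model and missingness-model probabilities
    into a single predictive score; w balances the two working models."""
    return w * np.asarray(p_outcome, dtype=float) + (1.0 - w) * np.asarray(p_missing, dtype=float)

def donor_set(score_missing, scores_observed, n_donors=5):
    """Indices of observed cases whose scores lie closest to the case
    being imputed; an imputed value is then drawn at random from them."""
    d = np.abs(np.asarray(scores_observed, dtype=float) - score_missing)
    return np.argsort(d)[:n_donors]
```

Repeating the random draw from the donor set across imputed datasets is what turns this into proper multiple imputation rather than a single hot-deck fill.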
Jiang, Z; Dou, Z; Song, W L; Xu, J; Wu, Z Y
2017-11-10
Objective: To compare the results of different methods for handling HIV viral load (VL) data under different missing value mechanisms. Methods: We used SPSS 17.0 to simulate complete and missing data with different missing value mechanisms from HIV viral load data collected from MSM in 16 cities in China in 2013. Maximum likelihood using the expectation-maximization algorithm (EM), a regression method, mean imputation, deletion, and Markov chain Monte Carlo (MCMC) were each used to fill in the missing data. The results of the different methods were compared in terms of distribution characteristics, accuracy, and precision. Results: The HIV VL data could not be transformed to a normal distribution. All methods performed well on data missing under the missing completely at random (MCAR) mechanism. For the other types of missing data, the regression and MCMC methods preserved the main characteristics of the original data. The means of the imputed databases were all close to the original one. EM, the regression method, mean imputation, and deletion under-estimated VL, while MCMC overestimated it. Conclusion: MCMC can be used as the main imputation method for missing HIV viral load data, and the imputed data can serve as a reference for estimating the mean HIV VL in the investigated population.
A regressive methodology for estimating missing data in rainfall daily time series
NASA Astrophysics Data System (ADS)
Barca, E.; Passarella, G.
2009-04-01
The presence of gaps in environmental data time series is a very common but critical problem, since it can produce biased results (Rubin, 1976). Missing data plague almost all surveys; the problem is how to deal with them once it has been deemed impossible to recover the actual missing values. Apart from the amount of missing data, another issue that plays an important role in the choice of a recovery approach is the evaluation of the missingness mechanism. When missingness is conditioned by some other variable observed in the data set (Schafer, 1997), the mechanism is called MAR (Missing At Random). When the missingness mechanism depends on the actual value of the missing data, it is called NMAR (Not Missing At Random); this is the most difficult condition to model. In the last decade, interest has arisen in estimating missing data by regression (single imputation). More recently, multiple imputation has also become available, which returns a distribution of estimated values (Scheffer, 2002). In this paper an automatic methodology for estimating missing data is presented. In practice, given a gauging station affected by missing data (the target station), the methodology checks the randomness of the missing data and ranks the "similarity" between the target station and the other gauging stations spread over the study area. Among the methods available for defining the degree of similarity, whose effectiveness strongly depends on the data distribution, the Spearman correlation coefficient was chosen. Once the similarity matrix is defined, a suitable nonparametric, univariate regression method, the Theil method (Theil, 1950), is applied to estimate the missing data at the target station. Even though the methodology proved rather reliable, the estimation of missing data can be further improved by generalization.
A first possible improvement consists in extending the univariate technique to a multivariate approach. Another follows the paradigm of "multiple imputation" (Rubin, 1987; Rubin, 1988), which consists in using a set of similar stations instead of only the most similar one. In this way, a sort of estimation range can be determined, allowing the introduction of uncertainty. Finally, time series can be grouped on the basis of monthly rainfall rates into classes of wetness (i.e., dry, moderately rainy and rainy), so that the estimation uses homogeneous data subsets. We expect that integrating the methodology with these enhancements will improve its reliability. The methodology was applied to the daily rainfall time series registered in the Candelaro River Basin (Apulia, South Italy) from 1970 to 2001. REFERENCES D.B. Rubin, 1976. Inference and Missing Data. Biometrika 63, 581-592. D.B. Rubin, 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc. D.B. Rubin, 1988. An Overview of Multiple Imputation. In Survey Research Section, pp. 79-84, American Statistical Association. J.L. Schafer, 1997. Analysis of Incomplete Multivariate Data. Chapman & Hall. J. Scheffer, 2002. Dealing with Missing Data. Res. Lett. Inf. Math. Sci. 3, 153-160. Available online at http://www.massey.ac.nz/~wwiims/research/letters/ H. Theil, 1950. A Rank-Invariant Method of Linear and Polynomial Regression Analysis. Indagationes Mathematicae, 12, pp. 85-91.
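The Theil (1950) method at the core of the methodology fits a line whose slope is the median of all pairwise slopes, which makes it rank-invariant and robust to outliers. A minimal sketch; the function names and the single-donor-station usage are illustrative assumptions.

```python
from itertools import combinations
from statistics import median

def theil_fit(x, y):
    # Theil slope: median of the slopes over all point pairs;
    # intercept: median of the residuals y_i - b * x_i
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
              if x2 != x1]
    b = median(slopes)
    a = median(yi - b * xi for xi, yi in zip(x, y))
    return a, b

def fill_gap(a, b, donor_value):
    # estimate the target station's missing value from the most
    # similar station's observation on the same day
    return a + b * donor_value
```

On exactly linear data the fit recovers the line, and a gap is filled by plugging the similar station's value into it.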
Estimating missing daily temperature extremes in Jaffna, Sri Lanka
NASA Astrophysics Data System (ADS)
Thevakaran, A.; Sonnadara, D. U. J.
2018-04-01
The accuracy of reconstructing missing daily temperature extremes at the Jaffna climatological station, situated in the northern part of the dry zone of Sri Lanka, is presented. The adopted method uses standard departures of daily maximum and minimum temperature values at four neighbouring stations, Mannar, Anuradhapura, Puttalam and Trincomalee, to estimate the standard departures of daily maximum and minimum temperatures at the target station, Jaffna. Daily maximum and minimum temperatures from 1966 to 1980 (15 years) were used to test the validity of the method. The accuracy of the estimation is higher for daily maximum temperature than for daily minimum temperature: about 95% of the estimated daily maximum temperatures are within ±1.5 °C of the observed values, while for daily minimum temperature the percentage is about 92. By calculating the standard deviation of the difference between estimated and observed values, we have shown that the errors in estimating the daily maximum and minimum temperatures are ±0.7 and ±0.9 °C, respectively. To obtain the best accuracy when estimating the missing daily temperature extremes, it is important to include Mannar, the station nearest to the target station, Jaffna. We conclude from the analysis that the method can be applied successfully to reconstruct the missing daily temperature extremes in Jaffna, where no data are available due to frequent disruptions caused by civil unrest and hostilities in the region during the period 1984 to 2000.
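The standard-departure approach can be sketched as follows: convert each neighbour's observation to a z-score against its own climatology, average the z-scores, and rescale by the target station's mean and standard deviation. The function names and the equal weighting of neighbours are assumptions.

```python
from statistics import mean

def departure(value, stn_mean, stn_sd):
    # standard departure (z-score) relative to a station's climatology
    return (value - stn_mean) / stn_sd

def estimate_extreme(neighbour_departures, target_mean, target_sd):
    # average the neighbours' standard departures, then rescale to the
    # target station's own mean and standard deviation
    return target_mean + mean(neighbour_departures) * target_sd
```

For example, if two neighbours both sit one standard deviation above their own means, the target's estimate is one target-station standard deviation above the target mean.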
Estimating Missing Unit Process Data in Life Cycle Assessment Using a Similarity-Based Approach.
Hou, Ping; Cai, Jiarui; Qu, Shen; Xu, Ming
2018-05-01
In life cycle assessment (LCA), collecting unit process data from empirical sources (i.e., meter readings, operation logs/journals) is often costly and time-consuming. We propose a new computational approach to estimate missing unit process data, relying solely on limited known data, based on a similarity-based link prediction method. The intuition is that similar processes in a unit process network tend to have similar material/energy inputs and waste/emission outputs. We use the ecoinvent 3.1 unit process data sets to test our method in four steps: (1) dividing the data sets into a training set and a test set; (2) randomly removing certain numbers of data points in the test set, marked as missing; (3) using similarity-weighted means of various numbers of the most similar processes in the training set to estimate the missing data in the test set; and (4) comparing estimated data with the original values to determine the performance of the estimation. The results show that missing data can be accurately estimated when less than 5% of the data are missing in one process. The estimation performance decreases as the percentage of missing data increases. This study provides a new approach to compiling unit process data and demonstrates the promising potential of computational approaches for LCA data compilation.
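A minimal sketch of step (3), the similarity-weighted mean, with cosine similarity between processes' known flow vectors standing in for whichever similarity index the link-prediction method actually uses (an assumption):

```python
import math

def cosine_similarity(u, v):
    # similarity between two processes' known input/output flow vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_weighted_estimate(similarities, flows):
    # missing flow = similarity-weighted mean of the same flow in the
    # k most similar training processes
    return sum(s * f for s, f in zip(similarities, flows)) / sum(similarities)
```

With equal similarities the estimate reduces to the plain mean of the donor flows, so the weighting only matters when the donors differ in similarity.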
Haji-Maghsoudi, Saiedeh; Haghdoost, Ali-akbar; Rastegari, Azam; Baneshi, Mohammad Reza
2013-01-01
Background: Policy makers need models that can detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets, and the presence of missing data challenges model development. Several studies have suggested that the performance of imputation methods is acceptable when the missing rate is moderate. One issue that has received less attention, addressed here, is the role of the pattern of missing data. Methods: We used information on 2720 prisoners. Results derived from fitting a regression model to the whole data served as the gold standard. Missing data were then generated so that 10%, 20% and 50% of data were lost. In scenario 1, we generated missing values, at the above rates, in one variable that was significant in the gold-standard model (age). In scenario 2, a small proportion of each independent variable was dropped. Four imputation methods, under different events-per-variable (EPV) values, were compared in terms of selection of important variables and parameter estimation. Results: In scenario 2, bias in estimates was low and the performances of all methods for handling missing data were similar. All methods at all missing rates were able to detect the significance of age. In scenario 1, biases in estimates increased, in particular at the 50% missing rate. Here, at EPVs of 10 and 5, imputation methods failed to capture the effect of age. Conclusion: In scenario 2, all imputation methods at all missing rates were able to detect age as significant. This was not the case in scenario 1. Our results showed that the performance of imputation methods depends on the pattern of missing data. PMID:24596839
Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data*
Cai, T. Tony; Zhang, Anru
2016-01-01
Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are proposed and their theoretical and numerical properties are investigated. Minimax rates of convergence are established under the spectral norm loss and the proposed estimators are shown to be rate-optimal under mild regularity conditions. Simulation studies demonstrate that the estimators perform well numerically. The methods are also illustrated through an application to data from four ovarian cancer studies. The key technical tools developed in this paper are of independent interest and potentially useful for a range of related problems in high-dimensional statistical inference with missing data. PMID:27777471
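Under MCAR, each entry of the covariance matrix can be estimated from the rows in which both coordinates are observed; the paper's bandable and sparse estimators then regularize such a matrix. A minimal sketch of the pairwise-complete building block only, without the thresholding or banding step:

```python
def pairwise_cov(rows, j, k):
    # sample covariance of columns j and k over the rows where both
    # entries are observed (None marks a missing entry); under MCAR
    # this is a consistent estimate of sigma_jk
    pairs = [(r[j], r[k]) for r in rows
             if r[j] is not None and r[k] is not None]
    n = len(pairs)
    mj = sum(a for a, _ in pairs) / n
    mk = sum(b for _, b in pairs) / n
    return sum((a - mj) * (b - mk) for a, b in pairs) / (n - 1)
```

Rows with a missing coordinate simply drop out of that entry's computation, so different covariance entries may be averaged over different subsets of rows.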
Missing Value Monitoring Enhances the Robustness in Proteomics Quantitation.
Matafora, Vittoria; Corno, Andrea; Ciliberto, Andrea; Bachi, Angela
2017-04-07
In global proteomic analysis, it is estimated that protein abundances span from millions to fewer than 100 copies per cell. The challenge of protein quantitation by classic shotgun proteomic techniques lies in the missing values for peptides belonging to low-abundance proteins, which lower intra-run reproducibility and affect downstream statistical analysis. Here, we present a new analytical workflow, MvM (missing value monitoring), able to recover quantitation of missing values generated by shotgun analysis. In particular, we used confident data-dependent acquisition (DDA) quantitation only for proteins measured in all the runs, while we filled the missing values with data-independent acquisition analysis using the library previously generated in DDA. We analyzed cell cycle regulated proteins, as they are low-abundance proteins with highly dynamic expression levels. Indeed, we found that cell cycle related proteins are the major components of the missing-value-rich proteome. Using the MvM workflow, we doubled the number of robustly quantified cell cycle related proteins, and we reduced the number of missing values, achieving robust quantitation for proteins over ∼50 molecules per cell. MvM allows lower quantification variance among replicates for low-abundance proteins with respect to DDA analysis, which demonstrates the potential of this novel workflow to measure low-abundance, dynamically regulated proteins.
On piecewise interpolation techniques for estimating solar radiation missing values in Kedah
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saaban, Azizan; Zainudin, Lutfi; Bakar, Mohd Nazari Abu
2014-12-04
This paper discusses the use of a piecewise interpolation method, based on cubic Ball and Bézier curve representations, to estimate missing values of solar radiation in Kedah. An hourly solar radiation dataset was collected at the Alor Setar Meteorology Station, obtained from the Malaysian Meteorological Department. The piecewise cubic Ball and Bézier functions that interpolate the data points are defined on each hourly interval of solar radiation measurement and are obtained by prescribing first-order derivatives at the start and end of each interval. We compare the performance of our proposed method with existing methods using Root Mean Squared Error (RMSE) and Coefficient of Determination (CoD), based on simulated missing-value datasets. The results show that our method outperforms the previous methods.
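The interpolation data, values and first derivatives prescribed at the two ends of each hourly interval, determine a unique cubic on that interval. A sketch using the equivalent Hermite basis form (the cubic Ball and Bézier representations reparameterize the same interpolant with different control points):

```python
def cubic_hermite(p0, p1, d0, d1, t):
    # cubic on [0, 1] matching values p0, p1 and first derivatives
    # d0, d1 at the interval ends
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * p0 + h10 * d0 + h01 * p1 + h11 * d1
```

A missing reading inside an hourly interval is then estimated by evaluating the piece at the corresponding normalized time t in [0, 1].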
Estimation of Return Values of Wave Height: Consequences of Missing Observations
ERIC Educational Resources Information Center
Ryden, Jesper
2008-01-01
Extreme-value statistics is often used to estimate so-called return values (actually related to quantiles) for environmental quantities like wind speed or wave height. A basic method for estimation is the method of block maxima which consists in partitioning observations in blocks, where maxima from each block could be considered independent.…
ERIC Educational Resources Information Center
Enders, Craig K.
2008-01-01
Recent missing data studies have argued in favor of an "inclusive analytic strategy" that incorporates auxiliary variables into the estimation routine, and Graham (2003) outlined methods for incorporating auxiliary variables into structural equation analyses. In practice, the auxiliary variables often have missing values, so it is reasonable to…
Relying on Your Own Best Judgment: Imputing Values to Missing Information in Decision Making.
ERIC Educational Resources Information Center
Johnson, Richard D.; And Others
Processes involved in making estimates of the value of missing information that could help in a decision making process were studied. Hypothetical purchases of ground beef were selected for the study as such purchases have the desirable property of quantifying both the price and quality. A total of 150 students at the University of Iowa rated the…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pichara, Karim; Protopapas, Pavlos
We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks, a probabilistic graphical model that allows us to perform inference to predict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilizes sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model, we use three catalogs with missing data (SAGE, Two Micron All Sky Survey, and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches, and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent and by 15% for quasar detection, while keeping the computational cost the same.
Inverse-Probability-Weighted Estimation for Monotone and Nonmonotone Missing Data.
Sun, BaoLuo; Perkins, Neil J; Cole, Stephen R; Harel, Ofer; Mitchell, Emily M; Schisterman, Enrique F; Tchetgen Tchetgen, Eric J
2018-03-01
Missing data is a common occurrence in epidemiologic research. In this paper, 3 data sets with induced missing values from the Collaborative Perinatal Project, a multisite US study conducted from 1959 to 1974, are provided as examples of prototypical epidemiologic studies with missing data. Our goal was to estimate the association of maternal smoking behavior with spontaneous abortion while adjusting for numerous confounders. At the same time, we did not necessarily wish to evaluate the joint distribution among potentially unobserved covariates, which is seldom the subject of substantive scientific interest. The inverse probability weighting (IPW) approach preserves the semiparametric structure of the underlying model of substantive interest and clearly separates the model of substantive interest from the model used to account for the missing data. However, IPW often will not result in valid inference if the missing-data pattern is nonmonotone, even if the data are missing at random. We describe a recently proposed approach to modeling nonmonotone missing-data mechanisms under missingness at random to use in constructing the weights in IPW complete-case estimation, and we illustrate the approach using 3 data sets described in a companion article (Am J Epidemiol. 2018;187(3):568-575).
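A minimal sketch of IPW complete-case estimation for a simple mean: each complete case is weighted by the inverse of its estimated probability of being observed. In practice those probabilities come from a fitted missingness model (monotone or nonmonotone); here they are supplied directly, which is an assumption of the sketch.

```python
def ipw_mean(outcomes, obs_probs):
    # outcomes: values from complete cases only
    # obs_probs: estimated P(observed) for those same cases
    num = sum(y / p for y, p in zip(outcomes, obs_probs))
    den = sum(1.0 / p for p in obs_probs)
    return num / den
```

Cases that were unlikely to be observed count for more, which corrects the complete-case sample back toward the full population under missingness at random.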
Reverse engineering gene regulatory networks from measurement with missing values.
Ogundijo, Oyetunji E; Elmas, Abdulkadir; Wang, Xiaodong
2016-12-01
Gene expression time series data are usually in the form of high-dimensional arrays. Unfortunately, the data may sometimes contain missing values: either the expression values of some genes at some time points, or the entire expression values of a single time point or of sets of consecutive time points. This significantly affects the performance of many algorithms for gene expression analysis that take as input the complete matrix of gene expression measurements. For instance, previous works have shown that gene regulatory interactions can be estimated from the complete matrix of gene expression measurements. Yet, to date, few algorithms have been proposed for the inference of gene regulatory networks from gene expression data with missing values. We describe a nonlinear dynamic stochastic model for the evolution of gene expression. The model captures the structural, dynamical and nonlinear natures of the underlying biomolecular systems. We present point-based Gaussian approximation (PBGA) filters for joint state and parameter estimation of the system with one-step or two-step missing measurements. The PBGA filters use Gaussian approximation and various quadrature rules, such as the unscented transform (UT), the third-degree cubature rule and the central difference rule, for computing the related posteriors. The proposed algorithm is evaluated with satisfactory results for synthetic networks, in silico networks released as part of the DREAM project, and a real biological network, the in vivo reverse engineering and modeling assessment (IRMA) network of the yeast Saccharomyces cerevisiae. PBGA filters are proposed to elucidate the underlying gene regulatory network (GRN) from time series gene expression data that contain missing values. In our state-space model, we propose a measurement model that incorporates the effect of the missing data points into the sequential algorithm.
This approach produces a better inference of the model parameters and hence, more accurate prediction of the underlying GRN compared to when using the conventional Gaussian approximation (GA) filters ignoring the missing data points.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Khachatryan, Vardan
2015-02-12
The performance of missing transverse energy reconstruction algorithms is presented using √s = 8 TeV proton-proton (pp) data collected with the CMS detector. Events with anomalous missing transverse energy are studied, and the performance of algorithms used to identify and remove these events is presented. The scale and resolution for missing transverse energy, including the effects of multiple pp interactions (pileup), are measured using events with an identified Z boson or isolated photon, and are found to be well described by the simulation. Novel missing transverse energy reconstruction algorithms developed specifically to mitigate the effects of large numbers of pileup interactions on the missing transverse energy resolution are presented. These algorithms significantly reduce the dependence of the missing transverse energy resolution on pileup interactions. Furthermore, an algorithm that provides an estimate of the significance of the missing transverse energy is presented, which is used to estimate the compatibility of the reconstructed missing transverse energy with a zero nominal value.
Fiero, Mallorie H; Hsu, Chiu-Hsieh; Bell, Melanie L
2017-11-20
We extend the pattern-mixture approach to handle missing continuous outcome data in longitudinal cluster randomized trials, which randomize groups of individuals to treatment arms, rather than the individuals themselves. Individuals who drop out at the same time point are grouped into the same dropout pattern. We approach extrapolation of the pattern-mixture model by applying multilevel multiple imputation, which imputes missing values while appropriately accounting for the hierarchical data structure found in cluster randomized trials. To assess parameters of interest under various missing data assumptions, imputed values are multiplied by a sensitivity parameter, k, which increases or decreases imputed values. Using simulated data, we show that estimates of parameters of interest can vary widely under differing missing data assumptions. We conduct a sensitivity analysis using real data from a cluster randomized trial by increasing k until the treatment effect inference changes. By performing a sensitivity analysis for missing data, researchers can assess whether certain missing data assumptions are reasonable for their cluster randomized trial. Copyright © 2017 John Wiley & Sons, Ltd.
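The sensitivity analysis can be sketched as a tipping-point search: scale the imputed values by k and find where the inference changes. In this sketch a mean-below-threshold criterion stands in for the actual treatment-effect test (an assumption), and the multilevel multiple imputation step is not reproduced.

```python
def adjust(imputed, k):
    # pattern-mixture adjustment: scale imputed values by the sensitivity
    # parameter k (k = 1 reproduces the MAR-based imputation)
    return [k * v for v in imputed]

def tipping_point(observed, imputed, threshold, k_grid):
    # first k on the grid at which the overall mean drops below the
    # threshold, i.e. where the inference would change
    for k in k_grid:
        vals = observed + adjust(imputed, k)
        if sum(vals) / len(vals) < threshold:
            return k
    return None
```

If the inference only flips at an implausibly extreme k, the MAR-based conclusion can be considered robust to the missing data assumption.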
Vedadi, Farhang; Shirani, Shahram
2014-01-01
A new method of image resolution up-conversion (image interpolation) based on maximum a posteriori sequence estimation is proposed. Instead of making a hard decision about the value of each missing pixel, we estimate the missing pixels in groups. At each missing pixel of the high resolution (HR) image, we consider an ensemble of candidate interpolation methods (interpolation functions). The interpolation functions are interpreted as states of a Markov model. In other words, the proposed method undergoes state transitions from one missing pixel position to the next. Accordingly, the interpolation problem is translated to the problem of estimating the optimal sequence of interpolation functions corresponding to the sequence of missing HR pixel positions. We derive a parameter-free probabilistic model for this to-be-estimated sequence of interpolation functions. Then, we solve the estimation problem using a trellis representation and the Viterbi algorithm. Using directional interpolation functions and sequence estimation techniques, we classify the new algorithm as an adaptive directional interpolation using soft-decision estimation techniques. Experimental results show that the proposed algorithm yields images with higher or comparable peak signal-to-noise ratios compared with some benchmark interpolation methods in the literature while being efficient in terms of implementation and complexity considerations.
New Insights into Handling Missing Values in Environmental Epidemiological Studies
Roda, Célina; Nicolis, Ioannis; Momas, Isabelle; Guihenneuc, Chantal
2014-01-01
Missing data are unavoidable in environmental epidemiologic surveys. The aim of this study was to compare methods for handling large amounts of missing values: omission of missing values, single and multiple imputation (through linear regression or partial least squares regression), and a fully Bayesian approach. These methods were applied to the PARIS birth cohort, where indoor domestic pollutant measurements were performed in a random sample of babies' dwellings. A simulation study was conducted to assess the performance of the different approaches with a high proportion of missing values (from 50% to 95%). Different simulation scenarios were carried out, controlling the true value of the association (odds ratio of 1.0, 1.2, and 1.4) and varying the health outcome prevalence. When a large amount of data is missing, omitting the missing data reduced statistical power and inflated standard errors, which affected the significance of the association. Single imputation underestimated the variability and considerably increased the risk of type I error. All approaches were conservative, except the Bayesian joint model. In the case of a common health outcome, the fully Bayesian approach is the most efficient (low root mean square error, reasonable type I error, and high statistical power). Nevertheless, for a less prevalent event, the type I error is increased and the statistical power reduced. The estimated posterior distribution of the OR is useful to refine the conclusion. Among the methods handling missing values, no approach is absolutely the best, but when usual approaches (e.g., single imputation) are not sufficient, jointly modelling the missingness process and the health association is more efficient when large amounts of data are missing. PMID:25226278
Andrew D. Richardson; David Y. Hollinger
2007-01-01
Missing values in any data set create problems for researchers. The process by which missing values are replaced, and the data set is made complete, is generally referred to as imputation. Within the eddy flux community, the term "gap filling" is more commonly applied. A major challenge is that random errors in measured data result in uncertainty in the gap-...
Microarray missing data imputation based on a set theoretic framework and biological knowledge.
Gan, Xiangchao; Liew, Alan Wee-Chung; Yan, Hong
2006-01-01
Gene expression measured using microarrays usually suffers from the missing value problem. However, many data analysis methods require a complete data matrix. Although existing missing value imputation algorithms have shown good performance, they also have limitations: for example, some algorithms perform well only when strong local correlation exists in the data, while others provide the best estimates when the data are dominated by global structure. In addition, these algorithms do not take any biological constraint into account in their imputation. In this paper, we propose a set theoretic framework based on projection onto convex sets (POCS) for missing data imputation. POCS allows us to incorporate different types of a priori knowledge about missing values into the estimation process. The main idea of POCS is to formulate every piece of prior knowledge into a corresponding convex set and then use a convergence-guaranteed iterative procedure to obtain a solution in the intersection of all these sets. In this work, we design several convex sets that take the biological characteristics of the data into consideration: the first set mainly exploits the local correlation structure among genes in microarray data, while the second captures the global correlation structure among arrays. The third set (actually a series of sets) exploits the biological phenomenon of synchronization loss in microarray experiments. In cyclic systems, synchronization loss is a common phenomenon, and we construct a series of sets based on it for our POCS imputation algorithm. Experiments show that our algorithm achieves a significant reduction in error compared with the KNNimpute, SVDimpute and LSimpute methods.
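The POCS iteration can be illustrated with two toy convex sets: agreement with the observed entries (a local constraint) and a fixed total (a stand-in global constraint). The sets used in the paper are different and biologically motivated; this sketch only shows the alternating-projection mechanics that drive any POCS scheme.

```python
def pocs_impute(obs, total, n_iter=60):
    # set A: vectors agreeing with the observed entries (None = missing)
    # set B: vectors whose entries sum to `total` (a toy global prior)
    # alternating projections converge to a point in the intersection
    x = [0.0 if v is None else v for v in obs]
    n = len(x)
    for _ in range(n_iter):
        shift = (total - sum(x)) / n                 # project onto B
        x = [xi + shift for xi in x]
        x = [xi if o is None else o                  # project onto A
             for xi, o in zip(x, obs)]
    return x
```

Each projection moves to the nearest point of one set, so the missing entries are gradually pulled toward values consistent with both constraints while the observed entries stay fixed.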
The Estimation of Gestational Age at Birth in Database Studies.
Eberg, Maria; Platt, Robert W; Filion, Kristian B
2017-11-01
Studies on the safety of prenatal medication use require valid estimation of the pregnancy duration. However, gestational age is often incompletely recorded in administrative and clinical databases. Our objective was to compare different approaches to estimating the pregnancy duration. Using data from the Clinical Practice Research Datalink and Hospital Episode Statistics, we examined the following four approaches to estimating missing gestational age: (1) generalized estimating equations for longitudinal data; (2) multiple imputation; (3) estimation based on fetal birth weight and sex; and (4) conventional approaches that assigned a fixed value (39 weeks for all or 39 weeks for full term and 35 weeks for preterm). The gestational age recorded in Hospital Episode Statistics was considered the gold standard. We conducted a simulation study comparing the described approaches in terms of estimated bias and mean square error. A total of 25,929 infants from 22,774 mothers were included in our "gold standard" cohort. The smallest average absolute bias was observed for the generalized estimating equation that included birth weight, while the largest absolute bias occurred when assigning 39-week gestation to all those with missing values. The smallest mean square errors were detected with generalized estimating equations while multiple imputation had the highest mean square errors. The use of generalized estimating equations resulted in the most accurate estimation of missing gestational age when birth weight information was available. In the absence of birth weight, assignment of fixed gestational age based on term/preterm status may be the optimal approach.
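The simplest of the four approaches, fixed-value assignment by term/preterm status, can be sketched directly; encoding a missing record as `None` is an assumption of the sketch.

```python
def fill_gestational_age(recorded_weeks, preterm):
    # conventional fixed-value approach: keep the recorded value;
    # otherwise assign 39 weeks (term) or 35 weeks (preterm)
    if recorded_weeks is not None:
        return recorded_weeks
    return 35 if preterm else 39
```

The study found this convention more biased than the generalized estimating equation approaches, especially when the same 39-week value is assigned regardless of term status.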
Multiple Imputation for Incomplete Data in Epidemiologic Studies
Harel, Ofer; Mitchell, Emily M; Perkins, Neil J; Cole, Stephen R; Tchetgen Tchetgen, Eric J; Sun, BaoLuo; Schisterman, Enrique F
2018-01-01
Epidemiologic studies are frequently susceptible to missing information. Omitting observations with missing variables remains a common strategy in epidemiologic studies, yet this simple approach can often severely bias parameter estimates of interest if the values are not missing completely at random. Even when missingness is completely random, complete-case analysis can reduce the efficiency of estimated parameters, because large amounts of available data are simply tossed out with the incomplete observations. Alternative methods for mitigating the influence of missing information, such as multiple imputation, are becoming an increasingly popular strategy in order to retain all available information, reduce potential bias, and improve efficiency in parameter estimation. In this paper, we describe the theoretical underpinnings of multiple imputation, and we illustrate application of this method as part of a collaborative challenge to assess the performance of various techniques for dealing with missing data (Am J Epidemiol. 2018;187(3):568–575). We detail the steps necessary to perform multiple imputation on a subset of data from the Collaborative Perinatal Project (1959–1974), where the goal is to estimate the odds of spontaneous abortion associated with smoking during pregnancy. PMID:29165547
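The pooling step of multiple imputation (Rubin's rules) can be sketched in a few lines: the m analyses of the imputed datasets are combined into one estimate whose variance reflects both within- and between-imputation uncertainty. The numbers below are illustrative, not from the Collaborative Perinatal Project analysis.

```python
# Rubin's rules for pooling m multiply-imputed analyses (toy numbers).

def pool(estimates, variances):
    m = len(estimates)
    qbar = sum(estimates) / m                # pooled point estimate
    ubar = sum(variances) / m                # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total = ubar + (1 + 1 / m) * b           # Rubin's total variance
    return qbar, total

# Three imputed-data analyses of the same coefficient (illustrative values).
est, total_var = pool([1.2, 1.0, 1.1], [0.04, 0.05, 0.045])
```

The `(1 + 1/m)` factor inflates the between-imputation component to account for using a finite number of imputations.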
Establishing a threshold for the number of missing days using 7 d pedometer data.
Kang, Minsoo; Hart, Peter D; Kim, Youngdeok
2012-11-01
The purpose of this study was to examine the threshold for the number of missing days recoverable using the individual information (II)-centered approach. Data for this study came from 86 participants, aged 17 to 79 years, who had 7 consecutive days of complete pedometer (Yamax SW 200) wear. Missing datasets (1 d through 5 d missing) were created by a SAS random process 10,000 times each. All missing values were replaced using the II-centered approach. A 7 d average was calculated for each dataset, including the complete dataset. Repeated measures ANOVA was used to determine the differences between the 1 d through 5 d missing datasets and the complete dataset. Mean absolute percentage error (MAPE) was also computed. Mean (SD) daily step count for the complete 7 d dataset was 7979 (3084). Mean (SD) values for the 1 d through 5 d missing datasets were 8072 (3218), 8066 (3109), 7968 (3273), 7741 (3050) and 8314 (3529), respectively (p > 0.05). Lower MAPEs were estimated for 1 d missing (5.2%, 95% confidence interval (CI) 4.4-6.0) and 2 d missing (8.4%, 95% CI 7.0-9.8), while all others were greater than 10%. The results of this study show that the 1 d through 5 d missing datasets, with replaced values, were not significantly different from the complete dataset. Based on the MAPE results, it is not recommended to replace more than two days of missing step counts.
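A hedged sketch of the evaluation above: replace missing days with the participant's own mean over observed days (one simple individual-information-style replacement; the study's II-centered approach may differ) and compute the MAPE of the resulting 7 d average against the complete week. The step counts are illustrative.

```python
# Impute missing pedometer days from a person's own observed-day mean,
# then score the imputed weekly average with MAPE (toy data).

def mape(true_vals, est_vals):
    # Mean absolute percentage error, in percent.
    return 100.0 * sum(abs(t - e) / t for t, e in zip(true_vals, est_vals)) / len(true_vals)

def impute_week(steps, missing_days):
    observed = [s for i, s in enumerate(steps) if i not in missing_days]
    fill = sum(observed) / len(observed)     # person-specific mean
    return [fill if i in missing_days else s for i, s in enumerate(steps)]

week = [8000, 7500, 9000, 6500, 8200, 10000, 7800]   # complete 7 d record
imputed = impute_week(week, missing_days={2, 5})     # simulate 2 d missing
err = mape([sum(week) / 7], [sum(imputed) / 7])      # error of the 7 d average
```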
Mavridis, Dimitris; White, Ian R; Higgins, Julian P T; Cipriani, Andrea; Salanti, Georgia
2015-02-28
Missing outcome data are commonly encountered in randomized controlled trials and hence may need to be addressed in a meta-analysis of multiple trials. A common and simple approach to deal with missing data is to restrict analysis to individuals for whom the outcome was obtained (complete case analysis). However, estimated treatment effects from complete case analyses are potentially biased if informative missing data are ignored. We develop methods for estimating meta-analytic summary treatment effects for continuous outcomes in the presence of missing data for some of the individuals within the trials. We build on a method previously developed for binary outcomes, which quantifies the degree of departure from a missing at random assumption via the informative missingness odds ratio. Our new model quantifies the degree of departure from missing at random using either an informative missingness difference of means or an informative missingness ratio of means, both of which relate the mean value of the missing outcome data to that of the observed data. We propose estimating the treatment effects, adjusted for informative missingness, and their standard errors by a Taylor series approximation and by a Monte Carlo method. We apply the methodology to examples of both pairwise and network meta-analysis with multi-arm trials. © 2014 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
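A toy numeric sketch of the informative missingness difference of means (IMDoM) idea from the abstract above: the mean of the missing outcomes is assumed to differ from the observed mean by delta, giving a missingness-weighted overall arm mean. The standard-error machinery (Taylor series or Monte Carlo, as in the paper) is omitted.

```python
# IMDoM-style adjusted arm mean (illustrative numbers): delta is the
# assumed difference between the mean of missing and observed outcomes.

def adjusted_mean(mean_obs, n_obs, n_miss, delta):
    # Mixture of the observed mean and the assumed mean of the missing outcomes.
    p_miss = n_miss / (n_obs + n_miss)
    return (1 - p_miss) * mean_obs + p_miss * (mean_obs + delta)

# Informative missingness: missing patients assumed 2 units worse on average.
m_mnar = adjusted_mean(mean_obs=10.0, n_obs=80, n_miss=20, delta=-2.0)
# delta = 0 recovers the missing-at-random (complete case) estimate.
m_mar = adjusted_mean(mean_obs=10.0, n_obs=80, n_miss=20, delta=0.0)
```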
NASA Astrophysics Data System (ADS)
Paredes, P.; Fontes, J. C.; Azevedo, E. B.; Pereira, L. S.
2017-11-01
Reference crop evapotranspiration (ETo) estimation using the FAO Penman-Monteith equation (PM-ETo) requires a set of weather data including maximum and minimum air temperatures (Tmax, Tmin), actual vapor pressure (ea), solar radiation (Rs), and wind speed (u2). However, those data are often not available, or data sets are incomplete due to missing values. A set of procedures was proposed in FAO56 (Allen et al. 1998) to overcome these limitations, and their accuracy for estimating daily ETo in the humid climate of the Azores islands is assessed in this study. Results show that after locally and seasonally calibrating the temperature adjustment factor (ad) used to compute dew point temperature (Tdew) from mean temperature, ETo estimates showed small bias and small RMSE, ranging from 0.15 to 0.53 mm day^-1. When Rs data are missing, estimating them from the temperature difference (Tmax - Tmin) using a locally and seasonally calibrated radiation adjustment coefficient (kRs) yielded highly accurate ETo estimates, with RMSE averaging 0.41 mm day^-1 and ranging from 0.33 to 0.58 mm day^-1. If wind speed observations are missing, using the default u2 = 2 m s^-1, or 3 m s^-1 for weather measurements over clipped grass at airports, proved appropriate even for windy locations (u2 > 4 m s^-1), with RMSE < 0.36 mm day^-1. The appropriateness of the procedures for estimating the missing values of ea, Rs, and u2 was confirmed.
Toward Best Practices in Analyzing Datasets with Missing Data: Comparisons and Recommendations
ERIC Educational Resources Information Center
Johnson, David R.; Young, Rebekah
2011-01-01
Although several methods have been developed to allow for the analysis of data in the presence of missing values, no clear guide exists to help family researchers in choosing among the many options and procedures available. We delineate these options and examine the sensitivity of the findings in a regression model estimated in three random…
Zhang, Zhiyong; Yuan, Ke-Hai
2016-06-01
Cronbach's coefficient alpha is a widely used reliability measure in social, behavioral, and education sciences. It is reported in nearly every study that involves measuring a construct through multiple items. With non-tau-equivalent items, McDonald's omega has been used as a popular alternative to alpha in the literature. Traditional estimation methods for alpha and omega often implicitly assume that data are complete and normally distributed. This study proposes robust procedures to estimate both alpha and omega as well as corresponding standard errors and confidence intervals from samples that may contain potential outlying observations and missing values. The influence of outlying observations and missing data on the estimates of alpha and omega is investigated through two simulation studies. Results show that the newly developed robust method yields substantially improved alpha and omega estimates as well as better coverage rates of confidence intervals than the conventional nonrobust method. An R package coefficientalpha is developed and demonstrated to obtain robust estimates of alpha and omega.
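For reference, the complete-data, non-robust version of Cronbach's alpha that the robust procedures above generalize can be computed directly from the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The item scores below are illustrative.

```python
# Cronbach's alpha for complete, outlier-free data (toy item scores).

def variance(xs):
    # Sample variance with n-1 denominator.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    # items: list of k item-score lists, each of length n (respondents).
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]   # per-respondent total score
    return k / (k - 1) * (1 - sum(variance(it) for it in items) / variance(totals))

# Three items scored by four respondents (illustrative values).
alpha = cronbach_alpha([[2, 4, 4, 3], [3, 5, 4, 2], [2, 5, 5, 3]])
```

The robust and missing-data-aware estimators in the paper (and its `coefficientalpha` R package) replace the plain sample variances here with estimates that down-weight outliers and accommodate incomplete rows.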
Hansson, Lisbeth; Khamis, Harry J
2008-12-01
Simulated data sets are used to evaluate conditional and unconditional maximum likelihood estimation in an individual case-control design with continuous covariates when there are different rates of excluded cases and different levels of other design parameters. The effectiveness of the estimation procedures is measured by method bias, variance of the estimators, root mean square error (RMSE) for logistic regression and the percentage of explained variation. Conditional estimation leads to higher RMSE than unconditional estimation in the presence of missing observations, especially for 1:1 matching. The RMSE is higher for the smaller stratum size, especially for the 1:1 matching. The percentage of explained variation appears to be insensitive to missing data, but is generally higher for the conditional estimation than for the unconditional estimation. It is particularly good for the 1:2 matching design. For minimizing RMSE, a high matching ratio is recommended; in this case, conditional and unconditional logistic regression models yield comparable levels of effectiveness. For maximizing the percentage of explained variation, the 1:2 matching design with the conditional logistic regression model is recommended.
Nuclear Forensics Analysis with Missing and Uncertain Data
Langan, Roisin T.; Archibald, Richard K.; Lamberti, Vincent
2015-10-05
We have applied a new imputation-based method for analyzing incomplete data, called Monte Carlo Bayesian Database Generation (MCBDG), to the Spent Fuel Isotopic Composition (SFCOMPO) database. About 60% of the entries are absent for SFCOMPO. The method estimates missing values of a property from a probability distribution created from the existing data for the property, and then generates multiple instances of the completed database for training a machine learning algorithm. Uncertainty in the data is represented by an empirical or an assumed error distribution. The method makes few assumptions about the underlying data, and compares favorably against results obtained by replacing missing information with constant values.
Lo Presti, Rossella; Barca, Emanuele; Passarella, Giuseppe
2010-01-01
Environmental time series are often affected by the presence of missing data, but when dealing with data statistically, the need to fill in the gaps by estimating the missing values must be considered. At present, a large number of statistical techniques are available to achieve this objective; they range from very simple methods, such as using the sample mean, to very sophisticated ones, such as multiple imputation. A new methodology for missing data estimation is proposed, which tries to merge the obvious advantages of the simplest techniques (e.g. their vocation to be easily implemented) with the strength of the newest techniques. The proposed method consists of the application of two consecutive stages: once it has been ascertained that a specific monitoring station is affected by missing data, the "most similar" monitoring stations are identified among neighbouring stations on the basis of a suitable similarity coefficient; in the second stage, a regressive method is applied in order to estimate the missing data. In this paper, four different regressive methods are applied and compared, in order to determine which is the most reliable for filling in the gaps, using rainfall data series measured in the Candelaro River Basin located in Southern Italy.
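The two-stage procedure described above can be sketched under simplifying assumptions: Pearson correlation as the similarity coefficient and ordinary least squares as the (single) regressive method, with toy station series standing in for rainfall records.

```python
# Stage 1: pick the neighbouring station most correlated with the target
# on jointly observed points. Stage 2: fill gaps by OLS regression of
# the target on that station. Toy data; not the Candelaro series.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ols(xs, ys):
    # Slope and intercept of the least-squares line y = slope*x + icept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def fill_gaps(target, neighbours):
    obs = [i for i, v in enumerate(target) if v is not None]
    # Stage 1: most similar station (absolute Pearson r).
    best = max(neighbours,
               key=lambda s: abs(pearson([target[i] for i in obs],
                                         [s[i] for i in obs])))
    # Stage 2: regress target on the best neighbour, predict the gaps.
    slope, icept = ols([best[i] for i in obs], [target[i] for i in obs])
    return [slope * best[i] + icept if v is None else v
            for i, v in enumerate(target)]

target = [3.0, 5.0, None, 9.0]
filled = fill_gaps(target, [[1.0, 2.0, 3.0, 4.0], [10.0, 1.0, 5.0, 2.0]])
```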
Statistical theory and methodology for remote sensing data analysis with special emphasis on LACIE
NASA Technical Reports Server (NTRS)
Odell, P. L.
1975-01-01
Crop proportion estimators for determining crop acreage through the use of remote sensing were evaluated. Several studies of these estimators were conducted, including an empirical comparison of the different estimators (using actual data) and an empirical study of the sensitivity (robustness) of the class of mixture estimators. The effect of missing data upon crop classification procedures is discussed in detail including a simulation of the missing data effect. The final problem addressed is that of taking yield data (bushels per acre) gathered at several yield stations and extrapolating these values over some specified large region. Computer programs developed in support of some of these activities are described.
Cox regression analysis with missing covariates via nonparametric multiple imputation.
Hsu, Chiu-Hsieh; Yu, Mandi
2018-01-01
We consider the situation of estimating Cox regression in which some covariates are subject to missingness, and there exists additional information (including observed event time, censoring indicator and fully observed covariates) which may be predictive of the missing covariates. We propose to use two working regression models: one for predicting the missing covariates and the other for predicting the missing probabilities. For each missing covariate observation, these two working models are used to define a nearest neighbor imputing set. This set is then used to non-parametrically impute covariate values for the missing observation. Upon the completion of imputation, Cox regression is performed on the multiply imputed datasets to estimate the regression coefficients. In a simulation study, we compare the nonparametric multiple imputation approach with the augmented inverse probability weighted (AIPW) method, which directly incorporates the two working models into estimation of Cox regression, and the predictive mean matching imputation (PMM) method. We show that all approaches can reduce bias due to a non-ignorable missing mechanism. The proposed nonparametric imputation method is robust to misspecification of either one of the two working models and robust to misspecification of the link function of the two working models. In contrast, the PMM method is sensitive to misspecification of the covariates included in imputation. The AIPW method is sensitive to the selection probability. We apply the approaches to a breast cancer dataset from the Surveillance, Epidemiology and End Results (SEER) Program.
NASA Astrophysics Data System (ADS)
Miller, Lindsay; Xu, Xiaohong; Wheeler, Amanda; Zhang, Tianchu; Hamadani, Mariam; Ejaz, Unam
2018-05-01
High density air monitoring campaigns provide spatial patterns of pollutant concentrations which are integral in exposure assessment. Such analysis can assist with the determination of links between air quality and health outcomes; however, missing data can threaten to compromise these studies. This research evaluates four methods: mean value imputation, inverse distance weighting (IDW), inter-species ratios, and regression, to address missing spatial concentration data ranging from one missing data point up to 50% missing data. BTEX (benzene, toluene, ethylbenzene, and xylenes) concentrations were measured in Windsor and Sarnia, Ontario in the fall of 2005. Concentrations and inter-species ratios were generally similar between the two cities. Benzene (B) was observed to be higher in Sarnia, whereas toluene (T) and the T/B ratios were higher in Windsor. Using these urban, industrialized cities as case studies, this research demonstrates that using inter-species ratios or regression of the data for which there is complete information, along with one measured concentration (i.e. benzene) to predict missing concentrations (i.e. TEX), results in good agreement between predicted and measured values. In both cities, the general trend remains that the best agreement is observed for the leave-one-out scenario, followed by 10% and 25% missing, with the least agreement for the 50% missing cases. In the absence of any known concentrations, IDW can provide reasonable agreement between observed and estimated concentrations for the BTEX species, and was superior to mean value imputation, which was not able to preserve the spatial trend. The proposed methods can be used to fill in missing data while preserving the general characteristics and rank order of the data, which are sufficient for epidemiologic studies.
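A minimal sketch of one of the four methods evaluated above, inverse distance weighting: a missing concentration at a site is estimated as a distance-weighted average of neighbouring sites' measurements. The coordinates, concentrations, and power parameter below are illustrative, not the Windsor/Sarnia data.

```python
# IDW estimate at a site with a missing measurement, from neighbours
# given as ((x, y), concentration) pairs. p is the usual IDW power.

def idw(target_xy, sites, p=2):
    num = den = 0.0
    for (x, y), conc in sites:
        d = ((x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2) ** 0.5
        if d == 0:
            return conc          # exact hit: return the measured value
        w = 1.0 / d ** p         # closer sites get larger weights
        num += w * conc
        den += w
    return num / den

# Two neighbouring monitors at distances 1 and 2 from the target site.
est = idw((0.0, 0.0), [((1.0, 0.0), 2.0), ((0.0, 2.0), 4.0)])
```

Because IDW is a convex combination of the neighbours, the estimate always stays within the observed range, which is consistent with the abstract's finding that it preserves spatial trends better than a flat mean fill.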
NASA Astrophysics Data System (ADS)
Hasan, Haliza; Ahmad, Sanizah; Osman, Balkish Mohd; Sapri, Shamsiah; Othman, Nadirah
2017-08-01
In regression analysis, missing covariate data is a common problem. Many researchers use ad hoc methods to overcome it due to their ease of implementation. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation-maximization (EM) algorithm and Multiple Imputation (MI) are more promising when dealing with difficulties caused by missing data. Then again, inappropriate methods of missing value imputation can lead to serious bias that severely affects the parameter estimates. The main objective of this study is to provide a better understanding of missing data concepts that can assist the researcher in selecting the appropriate missing data imputation method. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated using an underlying multivariate normal distribution, and the dependent variable was generated as a combination of explanatory variables. Missing values in the covariates were simulated using a mechanism called missing at random (MAR). Four levels of missingness (10%, 20%, 30% and 40%) were imposed. The ML and MI techniques available within SAS software were investigated. A linear regression model was fitted, and the model performance measures, MSE and R-squared, were obtained. Results of the analysis showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE when the percentage of missingness is less than 30%. Both methods are unable to handle levels of missingness larger than 30%.
Dealing with gene expression missing data.
Brás, L P; Menezes, J C
2006-05-01
A comparative evaluation of different methods for estimating missing values in microarray data is presented: weighted K-nearest neighbours imputation (KNNimpute), regression-based methods such as local least squares imputation (LLSimpute) and partial least squares imputation (PLSimpute), and Bayesian principal component analysis (BPCA). The influence on prediction accuracy of several factors, such as the methods' parameters, the type of data relationships used in the estimation process (i.e. row-wise, column-wise or both), the missing rate and pattern, and the type of experiment [time series (TS), non-time series (NTS) or mixed (MIX) experiments], is elucidated. Improvements based on the iterative use of data (iterative LLS and PLS imputation: ILLSimpute and IPLSimpute), the need to perform initial imputations (modified PLS and Helland PLS imputation: MPLSimpute and HPLSimpute) and the type of relationships employed (KNNarray, LLSarray, HPLSarray and alternating PLS: APLSimpute) are proposed. Overall, it is shown that dataset properties (type of experiment, missing rate and pattern) affect the data similarity structure, therefore influencing the methods' performance. LLSimpute and ILLSimpute are preferable in the presence of data with a stronger similarity structure (TS and MIX experiments), whereas PLS-based methods (MPLSimpute, IPLSimpute and APLSimpute) are preferable when estimating NTS missing data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zainudin, Mohd Lutfi; Saaban, Azizan
Solar radiation values are recorded by an automatic weather station using a device called a pyranometer. The device records the dispersed radiation values, and these data are very useful for experimental work and the development of solar devices. In addition, modeling and designing solar radiation system applications requires complete observational data. Unfortunately, incomplete solar radiation records frequently occur due to several technical problems, mainly contributed by the monitoring device. To address this issue, missing values are estimated so that absent values can be substituted with imputed data. This paper evaluates several piecewise interpolation techniques, such as linear, spline, cubic, and nearest-neighbour interpolation, for dealing with missing values in hourly solar radiation data. It then proposes extended work investigating the potential use of the cubic Bezier technique and the cubic Said-Ball method as estimation tools. The results show that the cubic Bezier and Said-Ball methods perform best compared with the other piecewise imputation techniques.
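Two of the piecewise fill-in techniques named above, linear and nearest-neighbour interpolation, can be sketched for a single gap in an hourly series. The values below are illustrative, not Azores measurements, and the Bezier/Said-Ball estimators from the paper are not reproduced.

```python
# Fill one missing hourly value (None) from its nearest observed
# neighbours, by linear interpolation and by nearest-neighbour copy.

def linear_fill(series, i):
    lo = max(j for j in range(i) if series[j] is not None)
    hi = min(j for j in range(i + 1, len(series)) if series[j] is not None)
    t = (i - lo) / (hi - lo)                     # position within the gap
    return series[lo] + t * (series[hi] - series[lo])

def nearest_fill(series, i):
    lo = max(j for j in range(i) if series[j] is not None)
    hi = min(j for j in range(i + 1, len(series)) if series[j] is not None)
    return series[lo] if i - lo <= hi - i else series[hi]

hourly = [100.0, 220.0, None, 400.0]             # toy radiation series
lin = linear_fill(hourly, 2)
near = nearest_fill(hourly, 2)
```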
Peterson, Josh F.; Eden, Svetlana K.; Moons, Karel G.; Ikizler, T. Alp; Matheny, Michael E.
2013-01-01
Background and objectives: Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply an assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m2 (eGFR 75). Design, setting, participants, & measurements: From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with the highest likelihood of missing BCr among the 13,003 patients with known BCr to simulate a "missing" data scenario while preserving actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI was compared with that of eGFR 75. Results: All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; P<0.001). Improvements in misclassification were greater in patients with impaired kidney function (full multiple imputation + serum creatinine) (15.3%) versus eGFR 75 (40.5%; P<0.001). Multiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreasing sensitivity relative to eGFR 75. Conclusions: Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods. PMID:23037980
Simulation-based sensitivity analysis for non-ignorably missing data.
Yin, Peng; Shi, Jian Q
2017-01-01
Sensitivity analysis is popular in dealing with missing data problems, particularly for non-ignorable missingness, where the full-likelihood method cannot be adopted. It analyses how sensitively the conclusions (output) may depend on assumptions or parameters (input) about the missing data, i.e. the missing data mechanism. We refer to models subject to this uncertainty as sensitivity models. To make conventional sensitivity analysis more useful in practice, we need to define some simple and interpretable statistical quantities to assess the sensitivity models and make evidence-based analysis. We propose a novel approach in this paper to investigate the plausibility of each missing data mechanism model assumption, by comparing the simulated datasets from various MNAR models with the observed data non-parametrically, using the K-nearest-neighbour distances. Some asymptotic theory has also been provided. A key step of this method is to plug in a plausibility evaluation system for each sensitivity parameter, to select plausible values and reject unlikely values, instead of considering all proposed values of sensitivity parameters as in the conventional sensitivity analysis method. The method is generic and has been applied successfully to several specific models in this paper, including a meta-analysis model with publication bias, analysis of incomplete longitudinal data and mean estimation with non-ignorable missing data.
Examining solutions to missing data in longitudinal nursing research.
Roberts, Mary B; Sullivan, Mary C; Winchester, Suzy B
2017-04-01
Longitudinal studies are highly valuable in pediatrics because they provide useful data about developmental patterns of child health and behavior over time. When data are missing, the value of the research is impacted. The study's purpose was to (1) introduce a three-step approach to assess and address missing data and (2) illustrate this approach using categorical and continuous-level variables from a longitudinal study of premature infants. A three-step approach with simulations was followed to assess the amount and pattern of missing data and to determine the most appropriate imputation method for the missing data. Patterns of missingness were Missing Completely at Random, Missing at Random, and Not Missing at Random. Missing continuous-level data were imputed using mean replacement, stochastic regression, multiple imputation, and fully conditional specification (FCS). Missing categorical-level data were imputed using last value carried forward, hot-decking, stochastic regression, and FCS. Simulations were used to evaluate these imputation methods under different patterns of missingness at different levels of missing data. The rate of missingness was 16-23% for continuous variables and 1-28% for categorical variables. FCS imputation provided the least difference in mean and standard deviation estimates for continuous measures. FCS imputation was acceptable for categorical measures. Results obtained through simulation reinforced and confirmed these findings. Significant investments are made in the collection of longitudinal data. The prudent handling of missing data can protect these investments and potentially improve the scientific information contained in pediatric longitudinal studies. © 2017 Wiley Periodicals, Inc.
An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.
Liu, Yuzhe; Gopalakrishnan, Vanathi
2017-03-01
Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
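One of the four methods applied above, k-nearest-neighbour imputation, can be sketched as follows: a missing feature is replaced by the mean of that feature over the k complete rows closest in the observed features. The toy data and the plain Euclidean distance are assumptions, not the cardiomyopathy dataset or the exact implementation compared in the paper.

```python
# kNN imputation of one missing cell: find the k complete rows nearest
# to the incomplete row (distance over its observed features only) and
# average their values in the missing column.

def knn_impute(rows, row_idx, col, k=2):
    target = rows[row_idx]
    complete = [r for r in rows if None not in r]

    def dist(r):
        # Euclidean distance over features that are observed in the target.
        return sum((a - b) ** 2 for j, (a, b) in enumerate(zip(target, r))
                   if j != col and target[j] is not None) ** 0.5

    neighbours = sorted(complete, key=dist)[:k]
    return sum(r[col] for r in neighbours) / k

data = [[1.0, 2.0], [1.1, 2.2], [5.0, 9.0], [1.05, None]]
val = knn_impute(data, 3, 1, k=2)   # fill the None from the 2 nearest rows
```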
Missing value imputation for microarray data: a comprehensive comparison study and a web tool.
Chiu, Chia-Chun; Chan, Shih-Yao; Wang, Chung-Ching; Wu, Wei-Sheng
2013-01-01
Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of dataset rather than on the species the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets. In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers can easily choose the optimal algorithm for their datasets.
Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.
Analyzing time-ordered event data with missed observations.
Dokter, Adriaan M; van Loon, E Emiel; Fokkema, Wimke; Lameris, Thomas K; Nolet, Bart A; van der Jeugd, Henk P
2017-09-01
A common problem with observational datasets is that not all events of interest may be detected. For example, observing animals in the wild can be difficult when animals move, hide, or cannot be closely approached. We consider time series of events recorded in conditions where events are occasionally missed by observers or observational devices. These time series are not restricted to behavioral protocols, but can be any cyclic or recurring process where discrete outcomes are observed. Undetected events cause biased inferences on the process of interest, and statistical analyses are needed that can identify and correct the compromised detection processes. Missed observations in time series lead to observed time intervals between events at multiples of the true inter-event time, which conveys information on their detection probability. We derive the theoretical probability density function for observed intervals between events that includes a probability of missed detection. Methodology and software tools are provided for analysis of event data with potential observation bias and its removal. The methodology was applied to simulation data and a case study of defecation rate estimation in geese, which is commonly used to estimate their digestive throughput and energetic uptake, or to calculate goose usage of a feeding site from dropping density. Simulations indicate that at a moderate chance of missing arrival events (p = 0.3), uncorrected arrival intervals were biased upward by up to a factor of 3, while parameter values corrected for missed observations were within 1% of their true simulated value. A field case study shows that not accounting for missed observations leads to substantial underestimates of the true defecation rate in geese, and spurious rate differences between sites, which are introduced by differences in observational conditions.
These results show that the derived methodology can be used to effectively remove observational biases in time-ordered event data.
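The interval-multiples idea above can be illustrated with a short simulation. This is a sketch only: it assumes a roughly known nominal interval in order to recover the multiples, rather than using the authors' derived probability density function.

```python
import numpy as np

rng = np.random.default_rng(0)

# True process: events at regular ~4-unit intervals (e.g. goose droppings).
true_interval = 4.0
n_events = 20000
times = np.cumsum(rng.normal(true_interval, 0.2, n_events))

# Each event is independently missed with probability p = 0.3.
p_miss = 0.3
observed = times[rng.random(n_events) >= p_miss]

obs_intervals = np.diff(observed)

# Naive mean interval is biased upward by roughly 1/(1 - p).
naive_mean = obs_intervals.mean()

# An interval spanning k true intervals is ~k * true_interval; round to
# recover k (needs a rough guess of the scale), then correct:
k = np.rint(obs_intervals / true_interval)
corrected_mean = obs_intervals.sum() / k.sum()

print(round(naive_mean, 2), round(corrected_mean, 2))
```

With p = 0.3 the naive mean interval is inflated by roughly 1/(1 - p) ≈ 1.43, while the multiple-corrected mean recovers the true value.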
Comparing multiple imputation methods for systematically missing subject-level data.
Kline, David; Andridge, Rebecca; Kaizar, Eloise
2017-06-01
When conducting research synthesis, the studies to be combined often do not measure the same set of variables, which creates missing data. When the studies to combine are longitudinal, missing data can occur at the observation level (time-varying) or the subject level (non-time-varying). Traditionally, the focus of missing data methods for longitudinal data has been on missing observation-level variables. In this paper, we focus on missing subject-level variables and compare two multiple imputation approaches: a joint modeling approach and a sequential conditional modeling approach. We find the joint modeling approach to be preferable to the sequential conditional approach, except when the covariance structure of the repeated outcome for each individual has homogeneous variance and exchangeable correlation. Specifically, the regression coefficient estimates from an analysis incorporating imputed values based on the sequential conditional method are attenuated and less efficient than those from the joint method. Remarkably, the estimates from the sequential conditional method are often less efficient than a complete case analysis, which, in the context of research synthesis, implies that we lose efficiency by combining studies. Copyright © 2015 John Wiley & Sons, Ltd.
40 CFR 98.385 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. You must follow the procedures for estimating missing data in § 98... estimating missing data for petroleum products in § 98.395 also applies to coal-to-liquid products. ...
40 CFR 98.385 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. You must follow the procedures for estimating missing data in § 98... estimating missing data for petroleum products in § 98.395 also applies to coal-to-liquid products. ...
40 CFR 98.385 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. You must follow the procedures for estimating missing data in § 98... estimating missing data for petroleum products in § 98.395 also applies to coal-to-liquid products. ...
40 CFR 98.385 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. You must follow the procedures for estimating missing data in § 98... estimating missing data for petroleum products in § 98.395 also applies to coal-to-liquid products. ...
40 CFR 98.385 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. You must follow the procedures for estimating missing data in § 98... estimating missing data for petroleum products in § 98.395 also applies to coal-to-liquid products. ...
Examining Solutions to Missing Data in Longitudinal Nursing Research
Roberts, Mary B.; Sullivan, Mary C.; Winchester, Suzy B.
2017-01-01
Purpose Longitudinal studies are highly valuable in pediatrics because they provide useful data about developmental patterns of child health and behavior over time. When data are missing, the value of the research is impacted. The study’s purpose was to: (1) introduce a 3-step approach to assess and address missing data; (2) illustrate this approach using categorical and continuous level variables from a longitudinal study of premature infants. Methods A three-step approach with simulations was followed to assess the amount and pattern of missing data and to determine the most appropriate imputation method for the missing data. Patterns of missingness were Missing Completely at Random, Missing at Random, and Not Missing at Random. Missing continuous-level data were imputed using mean replacement, stochastic regression, multiple imputation, and fully conditional specification. Missing categorical-level data were imputed using last value carried forward, hot-decking, stochastic regression, and fully conditional specification. Simulations were used to evaluate these imputation methods under different patterns of missingness at different levels of missing data. Results The rate of missingness was 16–23% for continuous variables and 1–28% for categorical variables. Fully conditional specification imputation provided the least difference in mean and standard deviation estimates for continuous measures. Fully conditional specification imputation was acceptable for categorical measures. Results obtained through simulation reinforced and confirmed these findings. Practice Implications Significant investments are made in the collection of longitudinal data. The prudent handling of missing data can protect these investments and potentially improve the scientific information contained in pediatric longitudinal studies. PMID:28425202
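The contrast between mean replacement and stochastic regression imputation named in the abstract above can be sketched on synthetic MCAR data. This is a minimal illustration under assumed data-generating parameters, not the study's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Correlated predictor x and outcome y; 20% of y is MCAR-missing.
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 0.20
y_obs = np.where(missing, np.nan, y)

# Mean replacement: preserves the mean but shrinks the variance.
y_mean_imp = np.where(missing, np.nanmean(y_obs), y_obs)

# Stochastic regression: fit y ~ x on complete cases, then impute
# with the prediction plus a random residual draw.
cc = ~missing
beta1, beta0 = np.polyfit(x[cc], y_obs[cc], 1)
resid_sd = np.std(y_obs[cc] - (beta0 + beta1 * x[cc]))
draws = beta0 + beta1 * x + rng.normal(0, resid_sd, n)
y_sreg_imp = np.where(missing, draws, y_obs)

print(y.std(), y_mean_imp.std(), y_sreg_imp.std())
```

The mean-imputed series visibly understates the true standard deviation, while stochastic regression approximately preserves it, which is one reason simple mean replacement distorts downstream analyses.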
40 CFR 98.245 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. For missing feedstock and product flow rates, use the same procedures as for missing... contents and missing molecular weights for fuels as specified in § 98.35(b)(1). For missing flare data...
NASA Astrophysics Data System (ADS)
Jimenez-Pizarro, R.; Rojas, A. M.; Pulido-Guio, A. D.
2012-12-01
The development of environmentally, socially and financially suitable greenhouse gas (GHG) mitigation portfolios requires detailed disaggregation of emissions by activity sector, preferably at the regional level. Bottom-up (BU) emission inventories are intrinsically disaggregated, but, although detailed, they are frequently incomplete. Missing and erroneous activity data are rather common in emission inventories of GHG, criteria and toxic pollutants, even in developed countries. The fraction of missing and erroneous data can be rather large in developing country inventories. In addition, the cost and time for obtaining or correcting this information can be prohibitive or can delay inventory development. This is particularly true for regional BU inventories in the developing world. Moreover, a rather common practice is to disregard missing data or to arbitrarily impute low default activity or emission values, which typically leads to significant underestimation of the total emissions. Our investigation focuses on GHG emissions from fossil fuel combustion in industry in the Bogota Region, composed of Bogota and its adjacent, semi-rural area of influence, the Province of Cundinamarca. We found that the BU inventories for this sub-category substantially underestimate emissions when compared to top-down (TD) estimations based on sub-sector specific national fuel consumption data and regional energy intensities. Although both BU inventories have a substantial number of missing and evidently erroneous entries, i.e. information on fuel consumption per combustion unit per company, the validated energy use and emission data display clear and smooth frequency distributions, which can be adequately fitted to bimodal log-normal distributions. This is not unexpected, as industrial plant sizes are typically log-normally distributed.
Moreover, our statistical tests suggest that industrial sub-sectors, as classified by the International Standard Industrial Classification (ISIC), are also well represented by log-normal distributions. Using the validated data, we tested several missing data estimation procedures, including Monte Carlo sampling of the real and fitted distributions, and a per-ISIC estimation based on bootstrap-calculated mean values. These results will be presented and discussed in detail. Our results suggest that the accuracy of sub-sector BU emission inventories, particularly in developing regions, could be significantly improved if they are designed and carried out to be representative sub-samples (surveys) of the actual universe of emitters. A large fraction of the missing data could then be estimated by robust statistical procedures, provided that most of the emitters were accounted for by number and ISIC.
Empirical likelihood method for non-ignorable missing data problems.
Guan, Zhong; Qin, Jing
2017-01-01
The missing response problem is ubiquitous in survey sampling, medical, social science and epidemiology studies. It is well known that non-ignorable missingness is the most difficult missing data problem, in which the missingness of a response depends on its own value. In the statistical literature, unlike for the ignorable missing data problem, few papers on non-ignorable missing data are available apart from fully parametric model-based approaches. In this paper we study a semiparametric model for non-ignorable missing data in which the missing probability is known up to some parameters, but the underlying distributions are not specified. By employing Owen's (1988) empirical likelihood method, we obtain constrained maximum empirical likelihood estimators of the parameters in the missing probability and of the mean response, which are shown to be asymptotically normal. Moreover, the likelihood ratio statistic can be used to test whether the missingness of the responses is non-ignorable or completely at random. The theoretical results are confirmed by a simulation study. As an illustration, the analysis of real AIDS trial data shows that the missingness of CD4 counts at around two years is non-ignorable and that the sample mean based on observed data only is biased.
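The bias that the abstract above addresses can be illustrated with a simpler stand-in: when the missingness model is fully known, inverse-probability weighting removes the bias of the observed-only mean. This sketch is illustrative only; it is not the paper's empirical likelihood estimator, and all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Outcome y; the chance y is *observed* decreases with y itself
# (non-ignorable missingness with known logistic parameters a, b).
y = rng.normal(0, 1, n)
a, b = 0.5, -1.0
p_obs = 1.0 / (1.0 + np.exp(-(a + b * y)))
seen = rng.random(n) < p_obs

# Observed-only mean is biased; inverse-probability weighting
# with the (known) observation probabilities removes the bias.
naive_mean = y[seen].mean()
ipw_mean = np.sum(y[seen] / p_obs[seen]) / np.sum(1.0 / p_obs[seen])

print(round(naive_mean, 3), round(ipw_mean, 3))
```

The naive mean is pulled well below zero because large responses are preferentially missing, while the weighted estimate is close to the true mean of zero.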
NASA Astrophysics Data System (ADS)
Friedel, M. J.; Daughney, C.
2016-12-01
The development of a successful surface-groundwater management strategy depends on the quality of data provided for analysis. This study evaluates the statistical robustness when using a modified self-organizing map (MSOM) technique to estimate missing values for three hypersurface models: synoptic groundwater-surface water hydrochemistry, time-series of groundwater-surface water hydrochemistry, and mixed-survey (combination of groundwater-surface water hydrochemistry and lithologies) hydrostratigraphic unit data. These models of increasing complexity are developed and validated based on observations from the Southland region of New Zealand. In each case, the estimation method is sufficiently robust to cope with groundwater-surface water hydrochemistry vagaries due to sample size and extreme data insufficiency, even when >80% of the data are missing. The estimation of surface water hydrochemistry time series values enabled the evaluation of seasonal variation, and the imputation of lithologies facilitated the evaluation of hydrostratigraphic controls on groundwater-surface water interaction. The robust statistical results for groundwater-surface water models of increasing data complexity provide justification to apply the MSOM technique in other regions of New Zealand and abroad.
Reuse of imputed data in microarray analysis increases imputation efficiency
Kim, Ki-Yeol; Kim, Byoung-Jin; Yi, Gwan-Su
2004-01-01
Background The imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analyses require a complete data set. A few imputation methods for DNA microarray data have been introduced, but their efficiency was low and the validity of the imputed values had not been fully checked. Results We developed a new cluster-based imputation method called the sequential K-nearest neighbor (SKNN) method. It imputes the missing values sequentially, starting from the gene having the fewest missing values, and uses the imputed values for later imputations. Although it reuses imputed values, this new method greatly improves on the conventional KNN-based method and on methods based on maximum likelihood estimation in both accuracy and computational complexity. The performance of SKNN was in particular higher than that of other imputation methods for data with high missing rates and a large number of experiments. Applying Expectation Maximization (EM) to the SKNN method improved the accuracy, but increased computational time in proportion to the number of iterations. The Multiple Imputation (MI) method, which is well known but had not previously been applied to microarray data, showed accuracy similarly high to the SKNN method, with a slightly higher dependency on the type of data set. Conclusions Sequential reuse of imputed data in KNN-based imputation greatly increases the efficiency of imputation. The SKNN method should be practically useful for salvaging microarray experiments that have high numbers of missing entries. The SKNN method generates reliable imputed values which can be used for further cluster-based analysis of microarray data. PMID:15504240
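The sequential-reuse idea can be sketched in a few lines of numpy: process genes from fewest to most missing values, and let each imputed gene join the neighbour pool for the genes that follow. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def sknn_impute(X, k=5):
    """Sequential KNN imputation sketch: rows (genes) are processed from
    fewest to most missing values; once imputed, a row joins the pool of
    complete rows usable as neighbours for later rows."""
    X = X.astype(float).copy()
    order = np.argsort(np.isnan(X).sum(axis=1))
    complete = [i for i in order if not np.isnan(X[i]).any()]
    for i in order:
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        pool = np.array(complete)
        # Distance computed on the columns this row has observed.
        d = np.sqrt(((X[pool][:, ~miss] - X[i, ~miss]) ** 2).mean(axis=1))
        nearest = pool[np.argsort(d)[:k]]
        X[i, miss] = X[nearest][:, miss].mean(axis=0)
        complete.append(i)   # reuse the imputed row downstream
    return X

rng = np.random.default_rng(3)
true = rng.normal(0, 1, (60, 8)).round(2)
data = true.copy()
mask = rng.random(true.shape) < 0.1
mask[:20] = False            # keep some rows fully observed
data[mask] = np.nan

filled = sknn_impute(data)
print(np.isnan(filled).sum())  # → 0
```

Observed entries are left untouched; only the NaN positions are filled from the k nearest complete (or already-imputed) rows.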
Assessment of BSRN radiation records for the computation of monthly means
NASA Astrophysics Data System (ADS)
Roesch, A.; Wild, M.; Ohmura, A.; Dutton, E. G.; Long, C. N.; Zhang, T.
2011-02-01
The integrity of the Baseline Surface Radiation Network (BSRN) radiation monthly averages is assessed by investigating the impact on monthly means of the frequency of data gaps caused by missing or discarded high-time-resolution data. The monthly statistics, especially means, are considered to be important and useful values for climate research, model performance evaluations and for assessing the quality of satellite (time- and space-averaged) data products. The study investigates the spread among different algorithms that have been applied for the computation of monthly means from 1-min values. The paper reveals that the computation of monthly means from 1-min observations distinctly depends on the method used to account for the missing data. The inter-method difference generally increases with an increasing fraction of missing data. We found that a substantial fraction of the radiation fluxes observed at BSRN sites is either missing or flagged as questionable. The percentage of missing data is 4.4%, 13.0%, and 6.5% for global radiation, direct shortwave radiation, and downwelling longwave radiation, respectively. Most flagged data in the shortwave are due to nighttime instrumental noise and can reasonably be set to zero after correcting for thermal offsets in the daytime data. The study demonstrates that the handling of flagged data clearly affects monthly mean estimates obtained with different methods. We showed that the spread of monthly shortwave fluxes is generally much higher than that for downwelling longwave radiation. Overall, BSRN observations provide sufficient accuracy and completeness for reliable estimates of monthly mean values. However, the value of future data could be further increased by reducing the frequency of data gaps and the number of outliers. It is shown that two independent methods for accounting for the diurnal and seasonal variations in the missing data permit consistent monthly means to within less than 1 W m-2 in most cases.
The authors suggest using a standardized method for the computation of monthly means which addresses diurnal variations in the missing data in order to avoid a mismatch of future published monthly mean radiation fluxes from BSRN. The application of robust statistics would probably lead to less biased results for data records with frequent gaps and/or flagged data and outliers. The currently applied empirical methods should, therefore, be completed by the development of robust methods.
Missing value imputation for microarray data: a comprehensive comparison study and a web tool
2013-01-01
Background Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there is much debate on the selection of the optimal algorithm. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. Results In this paper, we performed a comprehensive comparison using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of dataset but not on the species the samples come from. In addition to the statistical measure, two other measures with biological meaning are useful for reflecting the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets. Conclusions In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation.
Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses. PMID:24565220
Yue Xu, Selene; Nelson, Sandahl; Kerr, Jacqueline; Godbole, Suneeta; Patterson, Ruth; Merchant, Gina; Abramson, Ian; Staudenmayer, John; Natarajan, Loki
2018-04-01
Physical inactivity is a recognized risk factor for many chronic diseases. Accelerometers are increasingly used as an objective means to measure daily physical activity. One challenge in using these devices is missing data due to device nonwear. We used a well-characterized cohort of 333 overweight postmenopausal breast cancer survivors to examine missing data patterns of accelerometer outputs over the day. Based on these observed missingness patterns, we created pseudo-simulated datasets with realistic missing data patterns. We developed statistical methods to design imputation and variance-weighting algorithms that account for missing data effects when fitting regression models. The bias and precision of each method were evaluated and compared. Our results indicated that not accounting for missing data in the analysis yielded unstable estimates in the regression analysis. Incorporating variance weights and/or subject-level imputation improved precision by >50% compared to ignoring missing data. We recommend that these simple, easy-to-implement statistical tools be used to improve the analysis of accelerometer data.
Impact of missing data imputation methods on gene expression clustering and classification.
de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G
2015-02-26
Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saaban, Azizan; Zainudin, Lutfi; Bakar, Mohd Nazari Abu
This paper intends to reveal the ability of the linear interpolation method to predict missing values in solar radiation time series. A reliable dataset requires a complete time series of observations: the absence of radiation data alters the long-term variation of solar radiation measurement values, and such gaps increase the chance of biased output in modelling and validation. The completeness of the observed dataset is therefore important for data analysis. Gaps in continual, reliable solar radiation time series are widespread and have become a major problem, yet only a limited number of studies have given full attention to estimating missing values in solar radiation datasets.
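The gap-filling idea can be sketched on a synthetic radiation-like series; the shape and noise level below are illustrative assumptions, not the paper's dataset.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hourly solar-radiation-like series with a smooth diurnal shape.
t = np.arange(240)
series = np.clip(800 * np.sin(2 * np.pi * t / 24), 0, None) + rng.normal(0, 10, 240)

gappy = series.copy()
gaps = rng.random(240) < 0.15
gappy[gaps] = np.nan

# np.interp fills NaN positions linearly from neighbouring observations.
ok = ~np.isnan(gappy)
filled = gappy.copy()
filled[~ok] = np.interp(t[~ok], t[ok], gappy[ok])

rmse = np.sqrt(np.mean((filled[gaps] - series[gaps]) ** 2))
print(round(rmse, 1))
```

Linear interpolation performs well on short gaps in a smooth diurnal cycle; its error grows with gap length and with curvature of the underlying signal, which is why longer outages usually need model-based estimates.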
DOE Office of Scientific and Technical Information (OSTI.GOV)
Langan, Roisin T.; Archibald, Richard K.; Lamberti, Vincent
We have applied a new imputation-based method for analyzing incomplete data, called Monte Carlo Bayesian Database Generation (MCBDG), to the Spent Fuel Isotopic Composition (SFCOMPO) database. About 60% of the entries are absent from SFCOMPO. The method estimates missing values of a property from a probability distribution created from the existing data for the property, and then generates multiple instances of the completed database for training a machine learning algorithm. Uncertainty in the data is represented by an empirical or an assumed error distribution. The method makes few assumptions about the underlying data, and compares favorably against results obtained by replacing missing information with constant values.
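The generate-multiple-completed-databases idea can be sketched as follows. Sampling each property's missing entries from the empirical distribution of its observed values is a deliberate simplification of MCBDG, which is Bayesian and represents uncertainty with an error distribution; the sparsity level mirrors the ~60% quoted above.

```python
import numpy as np

rng = np.random.default_rng(5)

# A small toy "database": one column per property, ~60% entries absent.
n_rows, n_cols = 500, 4
data = rng.lognormal(mean=np.arange(n_cols), sigma=0.3, size=(n_rows, n_cols))
mask = rng.random(data.shape) < 0.6
sparse = np.where(mask, np.nan, data)

def complete_instance(X, rng):
    """Fill each column by sampling from the empirical distribution of
    its observed values (a simplified stand-in for MCBDG)."""
    out = X.copy()
    for j in range(X.shape[1]):
        obs = X[~np.isnan(X[:, j]), j]
        miss = np.isnan(X[:, j])
        out[miss, j] = rng.choice(obs, size=miss.sum(), replace=True)
    return out

# Generate multiple completed instances, e.g. to train a learner on each.
instances = [complete_instance(sparse, rng) for _ in range(5)]
print(all(not np.isnan(inst).any() for inst in instances))  # → True
```

Training on several completed instances, rather than one, propagates the imputation uncertainty into the downstream model instead of hiding it behind a single filled-in table.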
Data estimation and prediction for natural resources public data
Hans T. Schreuder; Robin M. Reich
1998-01-01
A key product of both Forest Inventory and Analysis (FIA) of the USDA Forest Service and the Natural Resources Inventory (NRI) of the Natural Resources Conservation Service is a scientific data base that should be defensible in court. Multiple imputation procedures (MIPs) have been proposed both for missing value estimation and prediction of non-remeasured cells in...
40 CFR 98.435 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... Gases Contained in Pre-Charged Equipment or Closed-Cell Foams § 98.435 Procedures for estimating missing data. Procedures for estimating missing data are not provided for importers and exporters of...
40 CFR 98.435 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... Gases Contained in Pre-Charged Equipment or Closed-Cell Foams § 98.435 Procedures for estimating missing data. Procedures for estimating missing data are not provided for importers and exporters of...
40 CFR 98.435 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... Gases Contained in Pre-Charged Equipment or Closed-Cell Foams § 98.435 Procedures for estimating missing data. Procedures for estimating missing data are not provided for importers and exporters of...
40 CFR 98.435 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... Gases Contained in Pre-Charged Equipment or Closed-Cell Foams § 98.435 Procedures for estimating missing data. Procedures for estimating missing data are not provided for importers and exporters of...
NASA Astrophysics Data System (ADS)
Lyu, Baolei; Hu, Yongtao; Chang, Howard; Russell, Armistead; Bai, Yuqi
2017-04-01
The satellite-borne Moderate Resolution Imaging Spectroradiometer (MODIS) aerosol optical depth (AOD) is often used to predict ground-level fine particulate matter (PM2.5) concentrations. The associated estimation accuracy is always reduced by AOD missing values and by insufficiently accounting for the spatio-temporal PM2.5 variations. This study aims to estimate PM2.5 concentrations at a high resolution with enhanced accuracy by fusing MODIS AOD and ground observations in the polluted and populated Beijing-Tianjin-Hebei (BTH) area of China in 2014 and 2015. A Bayesian-based statistical downscaler was employed to model the spatio-temporally varying AOD-PM2.5 relationships. We resampled a 3 km MODIS AOD product to a 4 km resolution in a Lambert conformal conic projection to assist comparison and fusion with CMAQ predictions. A two-step method was used to fill the missing AOD values and obtain a full AOD dataset with complete spatial coverage. The downscaler performed relatively well in the fitting procedure (R2 = 0.75) and in cross validation (with two evaluation methods: R2 = 0.58 by the random method and R2 = 0.47 by the city-specific method). Missing AOD values were frequent and were associated with elevated PM2.5 concentrations. The gap-filled AOD values corresponded well with our understanding of PM2.5 pollution conditions in BTH. The prediction accuracy of PM2.5 concentrations was improved in terms of the annual and seasonal means. As a result of its fine spatio-temporal resolution and complete spatial coverage, the daily PM2.5 estimation dataset could provide extensive and insightful benefits to related studies in the BTH area, including understanding the formation processes of regional PM2.5 pollution episodes, evaluating daily human exposure, and establishing pollution control measures.
40 CFR 98.445 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. A complete record of all measured parameters used in the GHG... following missing data procedures: (a) A quarterly flow rate of CO2 received that is missing must be...
40 CFR 98.126 - Data reporting requirements.
Code of Federal Regulations, 2011 CFR
2011-07-01
... fluorinated GHG emitted from equipment leaks (metric tons). (d) Reporting for missing data. Where missing data have been estimated pursuant to § 98.125, you must report the reason the data were missing, the length of time the data were missing, the method used to estimate the missing data, and the estimates of...
40 CFR 98.245 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. For missing feedstock flow rates, product flow rates, and carbon contents, use the same procedures as for missing flow rates and carbon contents for fuels as specified in § 98.35. ...
40 CFR 98.245 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. For missing feedstock flow rates, product flow rates, and carbon contents, use the same procedures as for missing flow rates and carbon contents for fuels as specified in § 98.35. ...
40 CFR 98.245 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. For missing feedstock flow rates, product flow rates, and carbon contents, use the same procedures as for missing flow rates and carbon contents for fuels as specified in § 98.35. ...
40 CFR 98.126 - Data reporting requirements.
Code of Federal Regulations, 2012 CFR
2012-07-01
... fluorinated GHG emitted from equipment leaks (metric tons). (d) Reporting for missing data. Where missing data have been estimated pursuant to § 98.125, you must report the reason the data were missing, the length of time the data were missing, the method used to estimate the missing data, and the estimates of...
40 CFR 98.245 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... estimating missing data. For missing feedstock flow rates, product flow rates, and carbon contents, use the same procedures as for missing flow rates and carbon contents for fuels as specified in § 98.35. ...
Ambler, Gareth; Omar, Rumana Z; Royston, Patrick
2007-06-01
Risk models that aim to predict the future course and outcome of disease processes are increasingly used in health research, and it is important that they are accurate and reliable. Most of these risk models are fitted using routinely collected data in hospitals or general practices. Clinical outcomes such as short-term mortality will be near-complete, but many of the predictors may have missing values. A common approach to dealing with this is to perform a complete-case analysis. However, this may lead to overfitted models and biased estimates if entire patient subgroups are excluded. The aim of this paper is to investigate a number of methods for imputing missing data to evaluate their effect on risk model estimation and the reliability of the predictions. Multiple imputation methods, including hotdecking and multiple imputation by chained equations (MICE), were investigated along with several single imputation methods. A large national cardiac surgery database was used to create simulated yet realistic datasets. The results suggest that complete case analysis may produce unreliable risk predictions and should be avoided. Conditional mean imputation performed well in our scenario, but may not be appropriate if using variable selection methods. MICE was amongst the best performing multiple imputation methods with regards to the quality of the predictions. Additionally, it produced the least biased estimates, with good coverage, and hence is recommended for use in practice.
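The chained-equations idea behind MICE can be sketched in numpy. This is a single-chain illustration with Gaussian linear models on synthetic data; real MICE additionally draws regression parameters from their posterior and produces multiple imputed datasets rather than one.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000

# Three correlated predictors; x1 and x2 each lose 25% of values (MCAR).
z = rng.normal(0, 1, n)
x1 = z + rng.normal(0, 0.5, n)
x2 = -z + rng.normal(0, 0.5, n)
x3 = z + rng.normal(0, 0.5, n)
X = np.column_stack([x1, x2, x3])
miss = rng.random((n, 2)) < 0.25
X[miss[:, 0], 0] = np.nan
X[miss[:, 1], 1] = np.nan

def chained_impute(X, n_iter=10, rng=rng):
    """Chained-equations sketch: start from column means, then repeatedly
    regress each incomplete column on the others and redraw its missing
    entries as prediction + residual noise."""
    X = X.copy()
    nan_mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[nan_mask[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            m = nan_mask[:, j]
            if not m.any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[~m], X[~m, j], rcond=None)
            resid_sd = np.std(X[~m, j] - A[~m] @ beta)
            X[m, j] = A[m] @ beta + rng.normal(0, resid_sd, m.sum())
    return X

imp = chained_impute(X)
print(np.isnan(imp).sum())  # → 0
```

Because each incomplete variable is imputed conditionally on the others, correlations between predictors survive the imputation, which is what makes the resulting risk-model coefficients less biased than with complete-case analysis.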
40 CFR 98.235 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... Procedures for estimating missing data. A complete record of all estimated and/or measured parameters used in... sources as soon as possible, including in the subsequent calendar year if missing data are not discovered...
Dealing with missing data in remote sensing images within land and crop classification
NASA Astrophysics Data System (ADS)
Skakun, Sergii; Kussul, Nataliia; Basarab, Ruslan
Optical remote sensing images from space provide valuable data for environmental monitoring, disaster management [1], agriculture mapping [2], and so forth. In many cases, a time series of satellite images is used to discriminate or estimate particular land parameters. One of the factors that influence the efficiency of satellite imagery is the presence of clouds. This leads to the occurrence of missing data that need to be addressed. Numerous approaches have been proposed to fill in missing data (or gaps) and can be categorized into inpainting-based, multispectral-based, and multitemporal-based. In [3], ancillary MODIS data are utilized for filling gaps and predicting Landsat data. In this paper we propose to use self-organizing Kohonen maps (SOMs) for missing data restoration in time series of satellite imagery. Such an approach was previously used for MODIS data [4], but applying it to finer spatial resolution data such as Sentinel-2 and Landsat-8 represents a challenge. Moreover, the data for training the SOMs are selected manually in [4], which complicates the use of the method in an automatic mode. A SOM is a type of artificial neural network that is trained using unsupervised learning to produce a discretised representation of the input space of the training samples, called a map. The map seeks to preserve the topological properties of the input space. The reconstruction of satellite images is performed for each spectral band separately, i.e. a separate SOM is trained for each spectral band. Pixels that have no missing values in the time series are selected for training. Selecting the number of training pixels represents a trade-off: increasing the number of training samples increases SOM training time while improving the quality of restoration. Also, training data sets should be selected automatically. As such, we propose to select training samples on a regular grid of pixels.
Therefore, the SOM seeks to project a large number of non-missing data to the subspace vectors in the map. Restoration of the missing values is performed in the following way. The multi-temporal pixel values (with gaps) are presented to the neural network. A neuron-winner (or best matching unit, BMU) in the SOM is selected based on a distance metric (for example, Euclidean). It should be noted that missing values are omitted from the metric computation when selecting the BMU. Once the BMU is selected, missing values are substituted by the corresponding components of the BMU's weight vector. The efficiency of the proposed approach was tested on a time series of Landsat-8 images over the JECAM test site in Ukraine and Sich-2 images over Crimea (Sich-2 is a Ukrainian remote sensing satellite acquiring images at 8 m spatial resolution). Landsat-8 images were first converted to TOA reflectance and then atmospherically corrected, so each pixel value represents a surface reflectance in the range from 0 to 1. The error of reconstruction (quantization error) on training data was: band-2: 0.015; band-3: 0.020; band-4: 0.026; band-5: 0.070; band-6: 0.060; band-7: 0.055. The reconstructed images were also used for crop classification using a multi-layer perceptron (MLP). Overall accuracy was 85.98% and Cohen's kappa was 0.83. References. 1. Skakun, S., Kussul, N., Shelestov, A. and Kussul, O. "Flood Hazard and Flood Risk Assessment Using a Time Series of Satellite Images: A Case Study in Namibia," Risk Analysis, 2013, doi: 10.1111/risa.12156. 2. Gallego, F.J., Kussul, N., Skakun, S., Kravchenko, O., Shelestov, A., Kussul, O. "Efficiency assessment of using satellite data for crop area estimation in Ukraine," International Journal of Applied Earth Observation and Geoinformation, vol. 29, pp. 22-30, 2014. 3.
Roy D.P., Ju, J., Lewis, P., Schaaf, C., Gao, F., Hansen, M., and Lindquist, E., “Multi-temporal MODIS-Landsat data fusion for relative radiometric normalization, gap filling, and prediction of Landsat data,” Remote Sensing of Environment, 112(6), pp. 3112-3130, 2008. 4. Latif, B.A., and Mercier, G., “Self-Organizing maps for processing of data with missing values and outliers: application to remote sensing images,” Self-Organizing Maps. InTech, pp. 189-210, 2010.
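The restoration step described above — select the BMU using only the non-missing components, then copy the BMU's weights into the gaps — can be sketched as follows. The tiny 1-D SOM here uses plain competitive learning without a neighborhood function, a deliberate simplification of the Kohonen map in the paper; the toy "seasonal curve" data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, n_units=16, n_iter=2000, lr=0.5):
    """Very small 1-D SOM trained by plain competitive learning
    (a simplification: no neighborhood function)."""
    codebook = data[rng.integers(0, len(data), n_units)].copy()
    for t in range(n_iter):
        x = data[rng.integers(0, len(data))]
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
        codebook[bmu] += lr * (1 - t / n_iter) * (x - codebook[bmu])
    return codebook

def restore(pixel, codebook):
    """Select the BMU using only the non-missing components, then
    substitute missing components with the BMU's weight values."""
    miss = np.isnan(pixel)
    d = ((codebook[:, ~miss] - pixel[~miss]) ** 2).sum(axis=1)
    bmu = codebook[np.argmin(d)]
    out = pixel.copy()
    out[miss] = bmu[miss]
    return out

# Toy "time series of one spectral band": 6 acquisitions per pixel,
# values following a smooth seasonal curve with per-pixel amplitude.
t = np.linspace(0, np.pi, 6)
complete = np.sin(t) * rng.uniform(0.5, 1.0, (500, 1)) + rng.normal(0, 0.02, (500, 6))
som = train_som(complete)

cloudy = complete[0].copy()
cloudy[2] = np.nan            # a cloud-contaminated acquisition
print(restore(cloudy, som))   # gap filled from the best matching unit
```

Because the BMU is matched on the five clear acquisitions, the filled value inherits the temporal shape learned from cloud-free pixels.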
SAS program for quantitative stratigraphic correlation by principal components
Hohn, M.E.
1985-01-01
A SAS program is presented which constructs a composite section of stratigraphic events through principal components analysis. The variables in the analysis are stratigraphic sections and the observational units are range limits of taxa. The program standardizes data in each section, extracts eigenvectors, estimates missing range limits, and computes the composite section from scores of events on the first principal component. An option for several types of diagnostic plots is provided; these help one to determine conservative range limits or unrealistic estimates of missing values. Inspection of the graphs and eigenvalues allows one to evaluate goodness of fit between the composite and measured data. The program is extended easily to the creation of a rank-order composite. © 1985.
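Estimating missing range limits from the principal-components structure can be approximated with an iterative SVD (EM-PCA) scheme: impute, take a low-rank approximation, re-impute. The rank-1 setting, the iteration count, and the skipped per-section standardization are illustrative assumptions, not the SAS program's exact algorithm:

```python
import numpy as np

def empca_impute(X, rank=1, n_iter=200):
    """Fill missing entries (np.nan) by iterating: impute -> low-rank
    SVD approximation -> re-impute, an EM-style analogue of estimating
    missing values from the leading principal components."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)  # column-mean start
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled[miss] = approx[miss]
    return filled

# Events (rows) x stratigraphic sections (columns); one unobserved limit.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, np.nan],
              [4.0, 8.0, 12.0]])
print(empca_impute(X))  # the gap converges toward 9, the rank-1 value
```

On this exactly rank-1 toy matrix the iteration recovers the unique consistent completion; on real data the first component captures the shared event ordering across sections.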
Bounthavong, Mark; Watanabe, Jonathan H; Sullivan, Kevin M
2015-04-01
The complete capture of all values for each variable of interest in pharmacy research studies remains aspirational. The absence of these possibly influential values is a common problem for pharmacist investigators. Failure to account for missing data may translate to biased study findings and conclusions. Our goal in this analysis was to apply validated statistical methods for missing data to a previously analyzed data set and compare the results with those of standard analytics that ignore missing data effects. Using data from a retrospective cohort study, the statistical method of multiple imputation was used to provide regression-based estimates of the missing values, improving the data available for outcomes measurement. These findings were then contrasted with a complete-case analysis that restricted estimation to subjects in the cohort with no missing values. Odds ratios were compared to assess differences in the findings of the analyses. A nonadjusted regression analysis ("crude analysis") was also performed as a reference for potential bias. The setting was a Veterans Integrated Systems Network that includes VA facilities in the Southern California and Nevada regions; the cohort comprised new statin users between November 30, 2006, and December 2, 2007, with a diagnosis of dyslipidemia. We compared the odds ratios (ORs) and 95% confidence intervals (CIs) for the crude, complete-case, and multiple imputation analyses for the end points of a 25% or greater reduction in atherogenic lipids. Data were missing for 21.5% of identified patients (1665 of 7739 subjects). Regression model results were similar for the crude, complete-case, and multiple imputation analyses, with overlap of the 95% confidence limits at each end point. The crude, complete-case, and multiple imputation ORs for a 25% or greater reduction in low-density lipoprotein cholesterol were 3.5 (95% CI 3.1-3.9), 4.3 (95% CI 3.8-4.9), and 4.1 (95% CI 3.7-4.6), respectively.
The crude, complete-case, and multiple imputation ORs (95% CIs) for a 25% or greater reduction in non-high-density lipoprotein cholesterol were 3.5 (95% CI 3.1-3.9), 4.5 (95% CI 4.0-5.2), and 4.4 (95% CI 3.9-4.9), respectively. The crude, complete-case, and multiple imputation ORs (95% CIs) for 25% or greater reduction in TGs were 3.1 (95% CI 2.8-3.6), 4.0 (95% CI 3.5-4.6), and 4.1 (95% CI 3.6-4.6), respectively. The use of the multiple imputation method to account for missing data did not alter conclusions based on a complete-case analysis. Given the frequency of missing data in research using electronic health records and pharmacy claims data, multiple imputation may play an important role in the validation of study findings. © 2015 Pharmacotherapy Publications, Inc.
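The multiple-imputation workflow the study applied — regression-based draws for the missing covariate, one analysis per imputed dataset, and pooling of the estimates — can be sketched with simulated data. For brevity this uses a linear outcome model and pools only the point estimates (the simplest part of Rubin's rules), rather than the study's logistic models; all data and names are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated cohort: outcome y depends on covariate x; z is always observed.
n = 2000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(scale=0.6, size=n)
y = 1.5 * x + rng.normal(scale=1.0, size=n)
x_obs = x.copy()
x_obs[rng.random(n) < 0.3 * (z > 0)] = np.nan  # MAR: missingness driven by z

def impute_once(x_obs, z, y, rng):
    """Draw missing x from its regression on z and y, with residual noise.
    The outcome y is included in the imputation model, as recommended."""
    obs = ~np.isnan(x_obs)
    A = np.column_stack([np.ones(obs.sum()), z[obs], y[obs]])
    coef, res, *_ = np.linalg.lstsq(A, x_obs[obs], rcond=None)
    sigma = np.sqrt(res[0] / (obs.sum() - 3))
    m = ~obs
    filled = x_obs.copy()
    filled[m] = (np.column_stack([np.ones(m.sum()), z[m], y[m]]) @ coef
                 + rng.normal(scale=sigma, size=m.sum()))
    return filled

# Analyze each imputed dataset, then pool the slope estimates.
estimates = []
for _ in range(20):
    xi = impute_once(x_obs, z, y, rng)
    A = np.column_stack([np.ones(n), xi])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    estimates.append(beta[1])
pooled = np.mean(estimates)
print(round(pooled, 2))  # close to the true slope of 1.5
```

Full Rubin's rules would also combine within- and between-imputation variances to widen the pooled confidence interval, which is how MI avoids the overconfidence of single imputation.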
Weir, Christopher J; Butcher, Isabella; Assi, Valentina; Lewis, Stephanie C; Murray, Gordon D; Langhorne, Peter; Brady, Marian C
2018-03-07
Rigorous, informative meta-analyses rely on availability of appropriate summary statistics or individual participant data. For continuous outcomes, especially those with naturally skewed distributions, summary information on the mean or variability often goes unreported. While full reporting of original trial data is the ideal, we sought to identify methods for handling unreported mean or variability summary statistics in meta-analysis. We undertook two systematic literature reviews to identify methodological approaches used to deal with missing mean or variability summary statistics. Five electronic databases were searched, in addition to the Cochrane Colloquium abstract books and the Cochrane Statistics Methods Group mailing list archive. We also conducted cited reference searching and emailed topic experts to identify recent methodological developments. Details recorded included the description of the method, the information required to implement the method, any underlying assumptions and whether the method could be readily applied in standard statistical software. We provided a summary description of the methods identified, illustrating selected methods in example meta-analysis scenarios. For missing standard deviations (SDs), following screening of 503 articles, fifteen methods were identified in addition to those reported in a previous review. These included Bayesian hierarchical modelling at the meta-analysis level; summary statistic level imputation based on observed SD values from other trials in the meta-analysis; a practical approximation based on the range; and algebraic estimation of the SD based on other summary statistics. Following screening of 1124 articles for methods estimating the mean, one approximate Bayesian computation approach and three papers based on alternative summary statistics were identified. 
Illustrative meta-analyses showed that when replacing a missing SD the approximation using the range minimised loss of precision and generally performed better than omitting trials. When estimating missing means, a formula using the median, lower quartile and upper quartile performed best in preserving the precision of the meta-analysis findings, although in some scenarios, omitting trials gave superior results. Methods based on summary statistics (minimum, maximum, lower quartile, upper quartile, median) reported in the literature facilitate more comprehensive inclusion of randomised controlled trials with missing mean or variability summary statistics within meta-analyses.
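Two widely used approximations of the kind the review describes are the range-based SD estimate and the quartile-based mean estimate. The specific forms below (range/4, and the average of the quartiles and median) are common rules of thumb assumed for illustration; the exact formulas evaluated in the paper may differ:

```python
import numpy as np

def sd_from_range(minimum, maximum):
    """Rough SD approximation from the range; range/4 is a common
    rule of thumb for moderate sample sizes (around n = 70)."""
    return (maximum - minimum) / 4.0

def mean_from_quartiles(q1, median, q3):
    """Approximate the mean from the median and quartiles
    (useful for skewed data where only these are reported)."""
    return (q1 + median + q3) / 3.0

# Check both against a sample where the truth is known.
rng = np.random.default_rng(2)
sample = rng.normal(loc=10.0, scale=2.0, size=70)
q1, med, q3 = np.percentile(sample, [25, 50, 75])
print(mean_from_quartiles(q1, med, q3))           # roughly the true mean of 10
print(sd_from_range(sample.min(), sample.max()))  # same order as the true SD of 2
```

As the review notes, such approximations trade some precision for the ability to include a trial at all, which usually beats omitting it.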
Impact of Missing Data for Body Mass Index in an Epidemiologic Study.
Razzaghi, Hilda; Tinker, Sarah C; Herring, Amy H; Howards, Penelope P; Waller, D Kim; Johnson, Candice Y
2016-07-01
Objective: To assess the potential impact of missing body mass index (BMI) data on the association between prepregnancy obesity and specific birth defects. Methods: Data from the National Birth Defects Prevention Study (NBDPS) were analyzed. We assessed the factors associated with missing BMI data among mothers of infants without birth defects. Four analytic methods were then used to assess the impact of missing BMI data on the association between maternal prepregnancy obesity and three birth defects: spina bifida, gastroschisis, and cleft lip with/without cleft palate. The analytic methods were: (1) complete case analysis; (2) assignment of missing values to either obese or normal BMI; (3) multiple imputation; and (4) probabilistic sensitivity analysis. Logistic regression was used to estimate crude and adjusted odds ratios (aOR) and 95 % confidence intervals (CI). Results: Of NBDPS control mothers, 4.6 % were missing BMI data, and most of the missing values were attributable to missing height (~90 %). Missing BMI data were associated with birth outside of the US (aOR 8.6; 95 % CI 5.5, 13.4), interview in Spanish (aOR 2.4; 95 % CI 1.8, 3.2), Hispanic ethnicity (aOR 2.0; 95 % CI 1.2, 3.4), and <12 years of education (aOR 2.3; 95 % CI 1.7, 3.1). Overall, the results of the multiple imputation and probabilistic sensitivity analyses were similar to the complete case analysis. Conclusions: Although in some scenarios missing BMI data can bias the magnitude of association, it does not appear likely to have impacted conclusions from a traditional complete case analysis of these data.
Missing Data and Multiple Imputation: An Unbiased Approach
NASA Technical Reports Server (NTRS)
Foy, M.; VanBaalen, M.; Wear, M.; Mendez, C.; Mason, S.; Meyers, V.; Alexander, D.; Law, J.
2014-01-01
The default method of dealing with missing data in statistical analyses is to only use the complete observations (complete case analysis), which can lead to unexpected bias when data do not meet the assumption of missing completely at random (MCAR). For the assumption of MCAR to be met, missingness cannot be related to either the observed or unobserved variables. A less stringent assumption, missing at random (MAR), requires that missingness not be associated with the value of the missing variable itself, but can be associated with the other observed variables. When data are truly MAR as opposed to MCAR, the default complete case analysis method can lead to biased results. There are statistical options available to adjust for data that are MAR, including multiple imputation (MI) which is consistent and efficient at estimating effects. Multiple imputation uses informing variables to determine statistical distributions for each piece of missing data. Then multiple datasets are created by randomly drawing on the distributions for each piece of missing data. Since MI is efficient, only a limited number, usually less than 20, of imputed datasets are required to get stable estimates. Each imputed dataset is analyzed using standard statistical techniques, and then results are combined to get overall estimates of effect. A simulation study will be demonstrated to show the results of using the default complete case analysis, and MI in a linear regression of MCAR and MAR simulated data. Further, MI was successfully applied to the association study of CO2 levels and headaches when initial analysis showed there may be an underlying association between missing CO2 levels and reported headaches. Through MI, we were able to show that there is a strong association between average CO2 levels and the risk of headaches. Each unit increase in CO2 (mmHg) resulted in a doubling in the odds of reported headaches.
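The bias described above is easy to demonstrate: under MCAR the complete-case mean is unbiased, while under MAR (missingness driven by another observed variable) it drifts. The simulation below is an illustrative stand-in for the abstract's simulation study, with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
z = rng.normal(size=n)                   # always observed (an informing variable)
y = 2.0 + 1.0 * z + rng.normal(size=n)   # the variable that will go missing

# MCAR: missingness unrelated to anything.
mcar = rng.random(n) < 0.4
# MAR: higher z makes y more likely to be missing (y itself plays no role).
mar = rng.random(n) < np.where(z > 0, 0.7, 0.1)

print(round(y.mean(), 2))          # true mean, about 2.0
print(round(y[~mcar].mean(), 2))   # MCAR complete cases: still about 2.0
print(round(y[~mar].mean(), 2))    # MAR complete cases: biased low
```

Because the MAR complete cases over-represent low-z observations, the naive mean underestimates the truth; multiple imputation using z as an informing variable recovers it.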
Replacing missing values using trustworthy data values from web data sources
NASA Astrophysics Data System (ADS)
Izham Jaya, M.; Sidi, Fatimah; Mat Yusof, Sharmila; Suriani Affendey, Lilly; Ishak, Iskandar; Jabar, Marzanah A.
2017-09-01
In practice, collected data are usually incomplete and contain missing values. Existing approaches to managing missing values overlook the importance of trustworthy data values in replacing missing values. Given that trusted complete data are very important in data analysis, we propose a framework for missing value replacement using trustworthy data values from web data sources. The proposed framework adopts an ontology to map data values from web data sources to the incomplete dataset. As data from different web sources may conflict with each other, we propose a trust score measurement based on data accuracy and data reliability. The trust score is then used to select trustworthy data values from web data sources for missing value replacement. We implemented the proposed framework using a financial dataset and present the findings in this paper. Our experiment shows that replacing missing values with trustworthy data values is important, especially in cases of conflicting data.
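The selection step — score each conflicting web-sourced candidate by its source's accuracy and reliability, then keep the highest-scoring value — can be sketched as below. The weighted-sum form of the trust score and all numbers are hypothetical; the paper does not publish its exact measurement in this abstract:

```python
def trust_score(accuracy, reliability, w_acc=0.5, w_rel=0.5):
    """Hypothetical trust score: a weighted combination of a source's
    data accuracy and data reliability, both scaled to [0, 1]."""
    return w_acc * accuracy + w_rel * reliability

def pick_replacement(candidates):
    """Among conflicting web-sourced values for one missing cell,
    keep the value whose source has the highest trust score."""
    return max(candidates,
               key=lambda c: trust_score(c["accuracy"], c["reliability"]))["value"]

# Conflicting candidate values for one missing cell, from three web sources.
candidates = [
    {"value": 10.2, "accuracy": 0.9, "reliability": 0.8},
    {"value": 11.7, "accuracy": 0.6, "reliability": 0.9},
    {"value": 9.8,  "accuracy": 0.7, "reliability": 0.5},
]
print(pick_replacement(candidates))  # 10.2, from the most trusted source
```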
Bayesian Sensitivity Analysis of Statistical Models with Missing Data
ZHU, HONGTU; IBRAHIM, JOSEPH G.; TANG, NIANSHENG
2013-01-01
Methods for handling missing data depend strongly on the mechanism that generated the missing values, such as missing completely at random (MCAR) or missing at random (MAR), as well as other distributional and modeling assumptions at various stages. It is well known that the resulting estimates and tests may be sensitive to these assumptions as well as to outlying observations. In this paper, we introduce various perturbations to modeling assumptions and individual observations, and then develop a formal sensitivity analysis to assess these perturbations in the Bayesian analysis of statistical models with missing data. We develop a geometric framework, called the Bayesian perturbation manifold, to characterize the intrinsic structure of these perturbations. We propose several intrinsic influence measures to perform sensitivity analysis and quantify the effect of various perturbations to statistical models. We use the proposed sensitivity analysis procedure to systematically investigate the tenability of the non-ignorable missing at random (NMAR) assumption. Simulation studies are conducted to evaluate our methods, and a dataset is analyzed to illustrate the use of our diagnostic measures. PMID:24753718
40 CFR 98.455 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... § 98.455 Procedures for estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations is required. Replace missing data, if needed, based on data from...
40 CFR 98.305 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... Use § 98.305 Procedures for estimating missing data. A complete record of all measured parameters used in the GHG emissions calculations is required. Replace missing data, if needed, based on data from...
NASA Astrophysics Data System (ADS)
Dumedah, Gift; Walker, Jeffrey P.; Chik, Li
2014-07-01
Soil moisture information is critically important for water management operations including flood forecasting, drought monitoring, and groundwater recharge estimation. While an accurate and continuous record of soil moisture is required for these applications, the available soil moisture data, in practice, are typically fraught with missing values. A wide range of methods is available for infilling hydrologic variables, but a thorough inter-comparison between statistical methods and artificial neural networks has not been made. This study examines five statistical methods: monthly averages, the weighted Pearson correlation coefficient, a method based on the temporal stability of soil moisture, a weighted merging of the three methods, and a method based on the concept of rough sets. Additionally, nine artificial neural networks are examined, broadly categorized into feedforward, dynamic, and radial basis networks. These 14 infilling methods were used to estimate missing soil moisture records and subsequently validated against known values for 13 soil moisture monitoring stations at three different soil layer depths in the Yanco region in southeast Australia. The evaluation results show that the three highest performing methods are the nonlinear autoregressive neural network, the rough sets method, and monthly replacement. A high estimation accuracy (root mean square error (RMSE) of about 0.03 m/m) was found for the nonlinear autoregressive network, due to its regression-based dynamic architecture which allows feedback connections through discrete-time estimation. An equally high accuracy (0.05 m/m RMSE) in the rough sets procedure illustrates the important role of the temporal persistence of soil moisture, with the capability to account for different soil moisture conditions.
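The simplest of the baselines above, monthly replacement, fills a gap with the station's average for the same calendar month. A minimal sketch with an invented toy record:

```python
import numpy as np

def monthly_average_infill(values, months):
    """Replace np.nan soil moisture readings with the station's mean
    for the same calendar month (the 'monthly replacement' baseline)."""
    values = np.asarray(values, dtype=float)
    months = np.asarray(months)
    filled = values.copy()
    for m in np.unique(months):
        in_month = months == m
        mean_m = np.nanmean(values[in_month])   # mean over observed readings
        gaps = in_month & np.isnan(values)
        filled[gaps] = mean_m
    return filled

# Toy record: three January readings (one missing) and one July reading.
vals = np.array([0.30, 0.32, 0.20, np.nan])
mons = np.array([1, 1, 7, 1])
print(monthly_average_infill(vals, mons))  # gap filled with the January mean 0.31
```

Its strong showing in the study reflects the seasonal persistence of soil moisture, the same property the temporal-stability and rough-sets methods exploit more explicitly.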
Coquet, Julia Becaria; Tumas, Natalia; Osella, Alberto Ruben; Tanzi, Matteo; Franco, Isabella; Diaz, Maria Del Pilar
2016-01-01
A number of studies have demonstrated the effect of modifiable lifestyle factors such as diet, breastfeeding and nutritional status on breast cancer risk. However, none have addressed the missing data problem in nutritional epidemiologic research in South America. Missing data is a frequent problem in breast cancer studies and in epidemiological settings in general. Estimates of effect obtained from these studies may be biased if no appropriate method for handling missing data is applied. We performed multiple imputation for missing values on covariates in a breast cancer case-control study from Córdoba (Argentina) to optimize risk estimates. Data were obtained from a breast cancer case-control study conducted from 2008 to 2015 (318 cases, 526 controls). Complete case analysis and multiple imputation using chained equations were the methods applied to estimate the effects of a Traditional dietary pattern and other recognized factors associated with breast cancer. Physical activity and socioeconomic status were imputed. Logistic regression models were performed. When complete case analysis was performed, only 31% of women were considered. Although a positive association of the Traditional dietary pattern with breast cancer was observed with both approaches (complete case analysis OR=1.3, 95%CI=1.0-1.7; multiple imputation OR=1.4, 95%CI=1.2-1.7), effects of other covariates, like BMI and breastfeeding, were only identified when multiple imputation was considered. A Traditional dietary pattern, BMI and breastfeeding are associated with the occurrence of breast cancer in this Argentinean population when multiple imputation is appropriately performed. Multiple imputation is suggested for epidemiologic studies in Latin America to optimize effect estimates in the future. PMID:27892664
Missing data imputation of solar radiation data under different atmospheric conditions.
Turrado, Concepción Crespo; López, María Del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; Juez, Francisco Javier de Cos
2014-10-29
Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, using either all or just a group of the measurements from the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field, such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE of the predictions was 13.37% for the MICE algorithm, compared with 28.19% for MLR and 31.68% for IDW.
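Of the two reference methods the MICE results are compared against, Inverse Distance Weighting is the simplest to sketch: a missing reading is the distance-weighted mean of simultaneous readings at the other stations. The power parameter p=2 is the usual default, assumed here, and the coordinates and readings are invented:

```python
import numpy as np

def idw_estimate(target_xy, station_xy, station_values, p=2):
    """Estimate a missing sensor reading as the inverse-distance-weighted
    mean of simultaneous readings at the other stations."""
    d = np.linalg.norm(station_xy - target_xy, axis=1)
    w = 1.0 / d ** p
    return float(np.sum(w * station_values) / np.sum(w))

# Three neighbouring pyranometers (coordinates in km), readings in W/m^2.
stations = np.array([[0.0, 1.0], [1.0, 0.0], [3.0, 4.0]])
readings = np.array([520.0, 500.0, 430.0])
print(idw_estimate(np.array([0.0, 0.0]), stations, readings))
```

Because IDW uses geometry alone and ignores the statistical relationships between sensors that MICE models, its larger RMSE in the study is unsurprising.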
Large Scale Crop Classification in Ukraine using Multi-temporal Landsat-8 Images with Missing Data
NASA Astrophysics Data System (ADS)
Kussul, N.; Skakun, S.; Shelestov, A.; Lavreniuk, M. S.
2014-12-01
At present, there are no globally available Earth observation (EO) derived products on crop maps. This issue is being addressed within the Sentinel-2 for Agriculture initiative, where a number of test sites (including from JECAM) participate to provide coherent protocols and best practices for various global agriculture systems, and subsequently crop maps from Sentinel-2. One of the problems in dealing with optical images for large territories (more than 10,000 sq. km) is the presence of clouds and shadows that result in missing values in the data sets. In this abstract, a new approach to the classification of multi-temporal optical satellite imagery with missing data due to clouds and shadows is proposed. First, self-organizing Kohonen maps (SOMs) are used to restore missing pixel values in a time series of satellite imagery. SOMs are trained for each spectral band separately using non-missing values. Missing values are restored through a special procedure that substitutes an input sample's missing components with the corresponding weight coefficients of the neuron-winner. After missing data restoration, a supervised classification is performed on the multi-temporal satellite images. For this, an ensemble of neural networks, in particular multilayer perceptrons (MLPs), is proposed. Ensembling of neural networks is done by the technique of average committee, i.e., the average class probability is calculated over the classifiers and the class with the highest average posterior probability is selected for the given input sample. The proposed approach is applied to large scale crop classification using multi-temporal Landsat-8 images for the JECAM test site in Ukraine [1-2]. It is shown that an ensemble of MLPs provides better performance than a single neural network in terms of overall classification accuracy and kappa coefficient. The obtained classification map is also validated through estimated crop and forest areas and comparison to official statistics. 1. A.Yu.
Shelestov et al., "Geospatial information system for agricultural monitoring," Cybernetics Syst. Anal., vol. 49, no. 1, pp. 124-132, 2013. 2. J. Gallego et al., "Efficiency Assessment of Different Approaches to Crop Classification Based on Satellite and Ground Observations," J. Autom. Inform. Scie., vol. 44, no. 5, pp. 67-80, 2012.
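The average-committee rule described above is straightforward: average the per-classifier class probabilities and take the argmax. A sketch with made-up softmax outputs standing in for the trained MLPs:

```python
import numpy as np

def average_committee(prob_list):
    """Ensemble by average committee: mean of the class-probability
    vectors over classifiers, then the class with the highest mean."""
    mean_probs = np.mean(prob_list, axis=0)
    return int(np.argmax(mean_probs)), mean_probs

# Three MLPs, four land-cover classes, one input sample.
p1 = np.array([0.6, 0.2, 0.1, 0.1])
p2 = np.array([0.3, 0.4, 0.2, 0.1])
p3 = np.array([0.5, 0.1, 0.3, 0.1])
label, probs = average_committee([p1, p2, p3])
print(label)  # class 0, whose average posterior is highest
```

Averaging probabilities rather than hard votes lets a confident classifier outweigh two weakly opposed ones, which is part of why the ensemble beats a single MLP.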
40 CFR 98.95 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... estimating missing data. (a) Except as provided in paragraph (b) of this section, a complete record of all... required. (b) If you use fluorinated heat transfer fluids at your facility and are missing data for one or...
40 CFR 98.95 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... estimating missing data. (a) Except as provided in paragraph (b) of this section, a complete record of all... required. (b) If you use fluorinated heat transfer fluids at your facility and are missing data for one or...
40 CFR Appendix C to Part 75 - Missing Data Estimation Procedures
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 17 2013-07-01 2013-07-01 false Missing Data Estimation Procedures C... (CONTINUED) CONTINUOUS EMISSION MONITORING Pt. 75, App. C Appendix C to Part 75—Missing Data Estimation Procedures 1. Parametric Monitoring Procedure for Missing SO2 Concentration or NOX Emission Rate Data 1...
40 CFR Appendix C to Part 75 - Missing Data Estimation Procedures
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 17 2014-07-01 2014-07-01 false Missing Data Estimation Procedures C... (CONTINUED) CONTINUOUS EMISSION MONITORING Pt. 75, App. C Appendix C to Part 75—Missing Data Estimation Procedures 1. Parametric Monitoring Procedure for Missing SO2 Concentration or NOX Emission Rate Data 1...
40 CFR Appendix C to Part 75 - Missing Data Estimation Procedures
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 17 2012-07-01 2012-07-01 false Missing Data Estimation Procedures C... (CONTINUED) CONTINUOUS EMISSION MONITORING Pt. 75, App. C Appendix C to Part 75—Missing Data Estimation Procedures 1. Parametric Monitoring Procedure for Missing SO2 Concentration or NOX Emission Rate Data 1...
40 CFR 98.95 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... estimating missing data. (a) Except as provided in paragraph (b) of this section, a complete record of all... required. (b) If you use fluorinated heat transfer fluids at your facility and are missing data for one or...
40 CFR Appendix C to Part 75 - Missing Data Estimation Procedures
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 16 2011-07-01 2011-07-01 false Missing Data Estimation Procedures C... (CONTINUED) CONTINUOUS EMISSION MONITORING Pt. 75, App. C Appendix C to Part 75—Missing Data Estimation Procedures 1. Parametric Monitoring Procedure for Missing SO2 Concentration or NOX Emission Rate Data 1...
Missing value imputation: with application to handwriting data
NASA Astrophysics Data System (ADS)
Xu, Zhen; Srihari, Sargur N.
2015-01-01
Missing values make pattern analysis difficult, particularly with limited available data. In longitudinal research, missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal data with missing values in handwriting analysis. In the task of studying the development of individuality in handwriting, we encountered the fact that feature values are missing for several individuals at several time instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation, and three methods based on Bayesian networks (static Bayesian network, parameter EM, and structural EM), are compared on children's handwriting data. We evaluate the accuracy and robustness of the algorithms under different ratios of missing data, and useful conclusions are given. Specifically, a static Bayesian network is used for our data, which contain around 5% missing values, as it provides adequate accuracy at low computational cost.
Dong, Jianghu J; Wang, Liangliang; Gill, Jagbir; Cao, Jiguo
2017-01-01
This article is motivated by some longitudinal clinical data of kidney transplant recipients, where kidney function progression is recorded as the estimated glomerular filtration rates at multiple time points post kidney transplantation. We propose to use the functional principal component analysis method to explore the major source of variations of glomerular filtration rate curves. We find that the estimated functional principal component scores can be used to cluster glomerular filtration rate curves. Ordering functional principal component scores can detect abnormal glomerular filtration rate curves. Finally, functional principal component analysis can effectively estimate missing glomerular filtration rate values and predict future glomerular filtration rate values.
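The curve-completion idea behind the functional principal component analysis described above can be approximated with a simple iterative low-rank sketch. This is not a full FPCA implementation (no smoothing or covariance-function estimation), just the underlying principle on synthetic curves; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy longitudinal curves: 100 subjects, 8 visits, 2 latent components.
t = np.linspace(0, 1, 8)
scores = rng.normal(size=(100, 2))
curves = 60 + scores[:, [0]] * np.sin(np.pi * t) + scores[:, [1]] * t
Y = curves.copy()
Y[rng.random(Y.shape) < 0.15] = np.nan  # missed visits

def pca_impute(Y, k=2, n_iter=50):
    """Fill gaps with column means, then refine with a rank-k
    reconstruction until the imputed entries stabilize."""
    miss = np.isnan(Y)
    Z = np.where(miss, np.nanmean(Y, axis=0), Y)
    for _ in range(n_iter):
        mu = Z.mean(axis=0)
        U, s, Vt = np.linalg.svd(Z - mu, full_matrices=False)
        recon = mu + (U[:, :k] * s[:k]) @ Vt[:k]
        Z[miss] = recon[miss]                 # observed entries stay fixed
    return Z

Y_hat = pca_impute(Y)
```

Because the synthetic curves have an exact two-component structure, the rank-2 reconstruction recovers the missing visits far better than column-mean filling.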
Auerbach, Benjamin M
2011-05-01
One of the greatest limitations to the application of the revised Fully anatomical stature estimation method is the inability to measure some of the skeletal elements required in its calculation. These element dimensions cannot be obtained due to taphonomic factors, incomplete excavation, or disease processes, and result in missing data. This study examines methods of imputing these missing dimensions using observable Fully measurements from the skeleton and the accuracy of incorporating these missing element estimations into anatomical stature reconstruction. These are further assessed against stature estimations obtained from mathematical regression formulae for the lower limb bones (femur and tibia). Two thousand seven hundred and seventeen North and South American indigenous skeletons were measured, and subsets of these with observable Fully dimensions were used to simulate missing elements and create estimation methods and equations. Comparisons were made directly between anatomically reconstructed statures and mathematically derived statures, as well as with anatomically derived statures with imputed missing dimensions. These analyses demonstrate that, while mathematical stature estimations are more accurate, anatomical statures incorporating missing dimensions are not appreciably less accurate and are more precise. The anatomical stature estimation method using imputed missing dimensions is supported. Missing element estimation, however, is limited to the vertebral column (only when lumbar vertebrae are present) and to talocalcaneal height (only when femora and tibiae are present). Crania, entire vertebral columns, and femoral or tibial lengths cannot be reliably estimated. The applicability of these methods is discussed further. Copyright © 2011 Wiley-Liss, Inc.
40 CFR 98.95 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... estimating missing data. (a) Except as provided in paragraph (b) of this section, a complete record of all... required. (b) If you use heat transfer fluids at your facility and are missing data for one or more of the...
de Bock, Élodie; Hardouin, Jean-Benoit; Blanchin, Myriam; Le Neel, Tanguy; Kubis, Gildas; Sébille, Véronique
2015-01-01
The purpose of this study was to identify the most adequate strategy for group comparison of longitudinal patient-reported outcomes in the presence of possibly informative intermittent missing data. Models coming from classical test theory (CTT) and item response theory (IRT) were compared. Two groups of patients' responses to dichotomous items with three times of assessment were simulated. Different cases were considered: presence or absence of a group effect and/or a time effect, a total of 100 or 200 patients, 4 or 7 items, and two different values for the correlation coefficient of the latent trait between two consecutive times (0.4 or 0.9). Cases including informative and non-informative intermittent missing data were compared at different rates (15 and 30%). These simulated data were analyzed with CTT using scores and a mixed model (SM) and with IRT using a longitudinal Rasch mixed model (LRM). The type I error, the power, and the bias of the group effect estimations were compared between the two methods. This study showed that LRM performs better than SM. When the rate of missing data rose to 30%, estimations were biased with SM, mainly for informative missing data. Otherwise, the LRM and SM methods were comparable concerning biases. However, regardless of the rate of intermittent missing data, the power of LRM was higher than that of SM. In conclusion, LRM should be favored when the rate of missing data is higher than 15%. For other cases, SM and LRM provide similar results.
Jones, Rachael M; Stayner, Leslie T; Demirtas, Hakan
2014-10-01
Drinking water may contain pollutants that harm human health. The frequency of pollutant monitoring may be quarterly, annual, or less frequent, depending upon the pollutant, the pollutant concentration, and the community water system. However, birth and other health outcomes are associated with narrow time-windows of exposure. Infrequent monitoring impedes linkage between water quality and health outcomes for epidemiological analyses. Our objective was to evaluate the performance of multiple imputation in filling in water quality values between measurements in community water systems (CWSs). The multiple imputation method was implemented in a simulated setting using data from the Atrazine Monitoring Program (AMP, 2006-2009 in five Midwestern states). Values were deleted from the AMP data to leave one measurement per month. Four patterns reflecting drinking water monitoring regulations were used to delete months of data in each CWS: three patterns were missing at random and one pattern was missing not at random. Synthetic health outcome data were created using a linear and a Poisson exposure-response relationship, respectively, with five levels of hypothesized association. The multiple imputation method was evaluated by comparing the exposure-response relationships estimated from the multiply imputed data with the hypothesized association. The four patterns deleted 65-92% of the months of atrazine observations in the AMP data. Even with these high rates of missing information, our procedure was able to recover most of the missing information when the synthetic health outcome was included, for missing at random patterns and for missing not at random patterns with low-to-moderate exposure-response relationships. Multiple imputation appears to be an effective method for filling in water quality values between measurements. Copyright © 2014 Elsevier Inc. All rights reserved.
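Multiple-imputation analyses like the one above conventionally end by pooling the per-imputation estimates with Rubin's rules. A minimal generic sketch (not code from the study):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their within-imputation variances
    with Rubin's rules, returning the pooled estimate and total variance."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()                    # pooled point estimate
    ubar = u.mean()                    # average within-imputation variance
    b = q.var(ddof=1)                  # between-imputation variance
    t = ubar + (1 + 1 / m) * b         # total variance
    return qbar, t

# Toy usage: three imputed-data estimates of the same coefficient.
est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term inflates the variance to reflect uncertainty about the missing values themselves.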
40 CFR 98.145 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations is... in § 98.144 cannot be followed and data is missing, you must use the most appropriate of the missing...
Impact of Missing Data on Person-Model Fit and Person Trait Estimation
ERIC Educational Resources Information Center
Zhang, Bo; Walker, Cindy M.
2008-01-01
The purpose of this research was to examine the effects of missing data on person-model fit and person trait estimation in tests with dichotomous items. Under the missing-completely-at-random framework, four missing data treatment techniques were investigated including pairwise deletion, coding missing responses as incorrect, hotdeck imputation,…
40 CFR 98.145 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations is... in § 98.144 cannot be followed and data is missing, you must use the most appropriate of the missing...
40 CFR 98.145 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... missing data. A complete record of all measured parameters used in the GHG emissions calculations is... in § 98.144 cannot be followed and data is missing, you must use the most appropriate of the missing...
Zhu, Hong; Xu, Xiaohan; Ahn, Chul
2017-01-01
Paired experimental design is widely used in clinical and health behavioral studies, where each study unit contributes a pair of observations. Investigators often encounter incomplete observations of paired outcomes in the data collected. Some study units contribute complete pairs of observations, while the others contribute either pre- or post-intervention observations. Statistical inference for paired experimental design with incomplete observations of continuous outcomes has been extensively studied in the literature. However, sample size methods for such study designs are sparsely available. We derive a closed-form sample size formula based on the generalized estimating equation approach by treating the incomplete observations as missing data in a linear model. The proposed method properly accounts for the impact of the mixed structure of the observed data: a combination of paired and unpaired outcomes. The sample size formula is flexible enough to accommodate different missing patterns, magnitudes of missingness, and correlation parameter values. We demonstrate that under complete observations, the proposed generalized estimating equation sample size estimate is the same as that based on the paired t-test. In the presence of missing data, the proposed method would lead to a more accurate sample size estimate compared with the crude adjustment. Simulation studies are conducted to evaluate the finite-sample performance of the generalized estimating equation sample size formula. A real application example is presented for illustration.
Rendall, Michael S.; Ghosh-Dastidar, Bonnie; Weden, Margaret M.; Baker, Elizabeth H.; Nazarov, Zafar
2013-01-01
Within-survey multiple imputation (MI) methods are adapted to pooled-survey regression estimation where one survey has more regressors, but typically fewer observations, than the other. This adaptation is achieved through: (1) larger numbers of imputations to compensate for the higher fraction of missing values; (2) model-fit statistics to check the assumption that the two surveys sample from a common universe; and (3) specifying the analysis model completely from variables present in the survey with the larger set of regressors, thereby excluding variables never jointly observed. In contrast to the typical within-survey MI context, cross-survey missingness is monotonic and easily satisfies the Missing At Random (MAR) assumption needed for unbiased MI. Large efficiency gains and substantial reduction in omitted variable bias are demonstrated in an application to sociodemographic differences in the risk of child obesity estimated from two nationally-representative cohort surveys. PMID:24223447
Martín-Merino, Elisa; Calderón-Larrañaga, Amaia; Hawley, Samuel; Poblador-Plou, Beatriz; Llorente-García, Ana; Petersen, Irene; Prieto-Alhambra, Daniel
2018-01-01
Background Missing data are often an issue in electronic medical records (EMRs) research. However, there are many ways that people deal with missing data in drug safety studies. Aim To compare the risk estimates resulting from different strategies for the handling of missing data in the study of venous thromboembolism (VTE) risk associated with antiosteoporotic medications (AOM). Methods New users of AOM (alendronic acid, other bisphosphonates, strontium ranelate, selective estrogen receptor modulators, teriparatide, or denosumab) aged ≥50 years during 1998–2014 were identified in two Spanish (the Base de datos para la Investigación Farmacoepidemiológica en Atención Primaria [BIFAP] and EpiChron cohort) and one UK (Clinical Practice Research Datalink [CPRD]) EMR. Hazard ratios (HRs) according to AOM (with alendronic acid as reference) were calculated adjusting for VTE risk factors, body mass index (that was missing in 61% of patients included in the three databases), and smoking (that was missing in 23% of patients) in the year of AOM therapy initiation. HRs and standard errors obtained using cross-sectional multiple imputation (MI) (reference method) were compared to complete case (CC) analysis – using only patients with complete data – and longitudinal MI – adding to the cross-sectional MI model the body mass index/smoking values as recorded in the year before and after therapy initiation. Results Overall, 422/95,057 (0.4%), 19/12,688 (0.1%), and 2,051/161,202 (1.3%) VTE cases/participants were seen in BIFAP, EpiChron, and CPRD, respectively. HRs moved from 100.00% underestimation to 40.31% overestimation in CC compared with cross-sectional MI, while longitudinal MI methods provided similar risk estimates compared with cross-sectional MI. Precision for HR improved in cross-sectional MI versus CC by up to 160.28%, while longitudinal MI improved precision (compared with cross-sectional) only minimally (up to 0.80%). 
Conclusion CC may substantially affect relative risk estimation in EMR-based drug safety studies, since missing data are not often completely at random. Little improvement was seen in these data in terms of power with the inclusion of longitudinal MI compared with cross-sectional MI. The strategy for handling missing data in drug safety studies can have a large impact on both risk estimates and precision.
Liu, Danping; Yeung, Edwina H; McLain, Alexander C; Xie, Yunlong; Buck Louis, Germaine M; Sundaram, Rajeshwari
2017-09-01
Imperfect follow-up in longitudinal studies commonly leads to missing outcome data that can potentially bias the inference when the missingness is nonignorable; that is, the propensity of missingness depends on missing values in the data. In the Upstate KIDS Study, we seek to determine if the missingness of child development outcomes is nonignorable, and how a simple model assuming ignorable missingness would compare with more complicated models for a nonignorable mechanism. To correct for nonignorable missingness, the shared random effects model (SREM) jointly models the outcome and the missing mechanism. However, the computational complexity and lack of software packages has limited its practical applications. This paper proposes a novel two-step approach to handle nonignorable missing outcomes in generalized linear mixed models. We first analyse the missing mechanism with a generalized linear mixed model and predict values of the random effects; then, the outcome model is fitted adjusting for the predicted random effects to account for heterogeneity in the missingness propensity. Extensive simulation studies suggest that the proposed method is a reliable approximation to SREM, with a much faster computation. The nonignorability of missing data in the Upstate KIDS Study is estimated to be mild to moderate, and the analyses using the two-step approach or SREM are similar to the model assuming ignorable missingness. The two-step approach is a computationally straightforward method that can be conducted as sensitivity analyses in longitudinal studies to examine violations to the ignorable missingness assumption and the implications relative to health outcomes. © 2017 John Wiley & Sons Ltd.
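The two-step idea above can be caricatured on synthetic data. This is not the authors' estimator: the paper predicts random effects from a fitted generalized linear mixed model, which is replaced here by a crude moment-based stand-in (the centered per-subject missing rate on the logit scale); all variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n_subj, n_visits = 300, 8
b = rng.normal(size=n_subj)                          # subject random effect
y = 2.0 + b[:, None] + rng.normal(0.0, 0.5, (n_subj, n_visits))
# Nonignorable mechanism: subjects with higher b (hence higher y) miss more.
p_miss = 1.0 / (1.0 + np.exp(-(b[:, None] - 0.5)))
miss = rng.random((n_subj, n_visits)) < p_miss

# Step 1: predict each subject's random effect from the missingness process
# (crude stand-in for the GLMM prediction used in the paper).
rate = np.clip(miss.mean(axis=1), 0.05, 0.95)
b_hat = np.log(rate / (1.0 - rate))
b_hat -= b_hat.mean()                                # center over all subjects

# Step 2: outcome model adjusted for the predicted random effect.
obs = ~miss
naive_mean = y[obs].mean()                           # ignores the mechanism
re = np.broadcast_to(b_hat[:, None], miss.shape)[obs]
A = np.column_stack([np.ones(re.size), re])
coef, *_ = np.linalg.lstsq(A, y[obs], rcond=None)
# Because b_hat is centered over ALL subjects, the intercept estimates the
# marginal mean at the population-average random effect.
adj_mean = coef[0]
```

The naive observed-data mean is pulled below the true marginal mean (2.0) because high-outcome subjects drop out more; adjusting for the predicted random effect removes much of that bias.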
Wahl, Simone; Boulesteix, Anne-Laure; Zierer, Astrid; Thorand, Barbara; van de Wiel, Mark A
2016-10-26
Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation. In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI, internal validation followed by MI on the training and test parts separately; MI-Val, MI on the full data set followed by internal validation; and MI(-y)-Val, MI on the full data set omitting the outcome, followed by internal validation. Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adapt a strategy for confidence interval construction to incomplete data. Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, number of covariates and decreasing sample size.
In Val-MI, accuracy of the estimate is more strongly improved by increasing the number of bootstrap draws rather than the number of imputations. With a simple integrated approach, valid confidence intervals for performance estimates can be obtained. When prognostic models are developed on incomplete data, Val-MI represents a valid strategy to obtain estimates of predictive performance measures.
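The Val-MI ordering (validate first, then impute within each part) can be sketched with scikit-learn. IterativeImputer here is a stand-in for a full MI procedure, and per-imputation AUCs are simply averaged rather than pooled as in the paper; the data are synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 400) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan          # introduce missing values

# Val-MI: split FIRST, so imputation never sees the held-out rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

aucs = []
for m in range(5):                             # m stochastic imputations
    imp = IterativeImputer(random_state=m, sample_posterior=True)
    clf = LogisticRegression().fit(imp.fit_transform(X_tr), y_tr)
    # test rows are imputed with the model fitted on training rows only
    aucs.append(roc_auc_score(y_te, clf.predict_proba(imp.transform(X_te))[:, 1]))
auc = float(np.mean(aucs))
```

Keeping the imputer inside the validation split is what prevents the optimistic leakage that MI-Val suffers from.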
ERIC Educational Resources Information Center
Savalei, Victoria; Rhemtulla, Mijke
2012-01-01
Fraction of missing information [lambda][subscript j] is a useful measure of the impact of missing data on the quality of estimation of a particular parameter. This measure can be computed for all parameters in the model, and it communicates the relative loss of efficiency in the estimation of a particular parameter due to missing data. It has…
Missing observations in multiyear rotation sampling designs
NASA Technical Reports Server (NTRS)
Gbur, E. E.; Sielken, R. L., Jr. (Principal Investigator)
1982-01-01
Because multiyear estimation of at-harvest stratum crop proportions is more efficient than single-year estimation, the behavior of multiyear estimators in the presence of missing acquisitions was studied. Only the (worst) case when a segment proportion cannot be estimated for the entire year is considered. The effect of these missing segments on the variance of the at-harvest stratum crop proportion estimator is considered when missing segments are not replaced, and when missing segments are replaced by segments not sampled in previous years. The principal recommendations are to replace missing segments according to some specified strategy, and to use a sequential procedure for selecting a sampling design; i.e., choose an optimal two-year design and then, based on the observed two-year design after segment losses have been taken into account, choose the best possible three-year design having the observed two-year parent design.
Taking the Missing Propensity Into Account When Estimating Competence Scores
Pohl, Steffi; Carstensen, Claus H.
2014-01-01
When competence tests are administered, subjects frequently omit items. These missing responses pose a threat to correctly estimating the proficiency level. Newer model-based approaches aim to take nonignorable missing data processes into account by incorporating a latent missing propensity into the measurement model. Two assumptions are typically made when using these models: (1) The missing propensity is unidimensional and (2) the missing propensity and the ability are bivariate normally distributed. These assumptions may, however, be violated in real data sets and could, thus, pose a threat to the validity of this approach. The present study focuses on modeling competencies in various domains, using data from a school sample (N = 15,396) and an adult sample (N = 7,256) from the National Educational Panel Study. Our interest was to investigate whether violations of unidimensionality and the normal distribution assumption severely affect the performance of the model-based approach in terms of differences in ability estimates. We propose a model with a competence dimension, a unidimensional missing propensity and a distributional assumption more flexible than a multivariate normal. Using this model for ability estimation results in different ability estimates compared with a model ignoring missing responses. Implications for ability estimation in large-scale assessments are discussed. PMID:29795844
Leistra, Minze; Wolters, André; van den Berg, Frederik
2008-06-01
Volatilisation of pesticides from crop canopies can be an important emission pathway. In addition to pesticide properties, competing processes in the canopy and environmental conditions play a part. A computation model is being developed to simulate the processes, but only some of the input data can be obtained directly from the literature. Three well-defined experiments on the volatilisation of radiolabelled parathion-methyl (as example compound) from plants in a wind tunnel system were simulated with the computation model. Missing parameter values were estimated by calibration against the experimental results. The resulting thickness of the air boundary layer, rate of plant penetration and rate of phototransformation were compared with a diversity of literature data. The sequence of importance of the canopy processes was: volatilisation > plant penetration > phototransformation. Computer simulation of wind tunnel experiments, with radiolabelled pesticide sprayed on plants, yields values for the rate coefficients of processes at the plant surface. As some input data for simulations are not required in the framework of registration procedures, attempts to estimate missing parameter values on the basis of divergent experimental results have to be continued. Copyright (c) 2008 Society of Chemical Industry.
Ogawa, Takahiro; Haseyama, Miki
2013-03-01
A missing texture reconstruction method based on an error reduction (ER) algorithm, including a novel estimation scheme for Fourier transform magnitudes, is presented in this brief. In our method, the Fourier transform magnitude is estimated for a target patch including missing areas, and the missing intensities are estimated by retrieving its phase based on the ER algorithm. Specifically, by monitoring errors converged in the ER algorithm, known patches whose Fourier transform magnitudes are similar to that of the target patch are selected from the target image. Then, the Fourier transform magnitude of the target patch is estimated from those of the selected known patches and their corresponding errors. Consequently, by using the ER algorithm, we can estimate both the Fourier transform magnitudes and phases to reconstruct the missing areas.
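The error-reduction iteration at the core of this method alternates a magnitude constraint in the Fourier domain with a data constraint on the known pixels. A toy 1-D sketch, assuming the true magnitude is known exactly (the paper instead estimates it from similar known patches):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 1-D "texture" patch with a missing segment.
true = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * rng.normal(size=64)
known = np.ones(64, dtype=bool)
known[24:40] = False                       # the missing area
mag = np.abs(np.fft.fft(true))             # assumed-known Fourier magnitude

# ER iterations: enforce the magnitude in the Fourier domain,
# then re-impose the known intensities in the signal domain.
x = np.where(known, true, 0.0)
for _ in range(200):
    F = np.fft.fft(x)
    F = mag * np.exp(1j * np.angle(F))     # magnitude constraint, keep phase
    x = np.real(np.fft.ifft(F))
    x[known] = true[known]                 # data constraint on known pixels

err = np.sqrt(np.mean((x[~known] - true[~known]) ** 2))
```

With the magnitude fixed to its true value, the retrieved phase fills the gap far better than leaving it at zero.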
NASA Astrophysics Data System (ADS)
Al-Mudhafar, W. J.
2013-12-01
Precise prediction of rock facies leads to adequate reservoir characterization by improving the porosity-permeability relationships used to estimate the properties in non-cored intervals. It also helps to accurately identify the spatial facies distribution in order to build an accurate reservoir model for optimal future reservoir performance. In this paper, facies estimation has been done through multinomial logistic regression (MLR) with respect to the well logs and core data in a well in the upper sandstone formation of the South Rumaila oil field. The independent variables are gamma ray, formation density, water saturation, shale volume, log porosity, core porosity, and core permeability. Firstly, a robust sequential imputation algorithm has been used to impute the missing data. This algorithm starts from a complete subset of the dataset and sequentially estimates the missing values in an incomplete observation by minimizing the determinant of the covariance of the augmented data matrix. The observation is then added to the complete data matrix, and the algorithm continues with the next observation with missing values. MLR has been chosen to estimate the maximum likelihood and minimize the standard error for the nonlinear relationships between facies and the core and log data. MLR predicts the probabilities of the different possible facies given each independent variable by constructing a linear predictor function with a set of weights that are linearly combined with the independent variables using a dot product. A beta distribution of facies has been considered as prior knowledge, and the resulting predicted probability (posterior) has been estimated from MLR based on Bayes' theorem, which relates the predicted probability (posterior) to the conditional probability and the prior knowledge.
To assess the statistical accuracy of the model, the bootstrap is carried out to estimate the extra-sample prediction error by randomly drawing datasets with replacement from the training data. Each sample has the same size as the original training set, and the procedure can be repeated N times to produce N bootstrap datasets and re-fit the model accordingly, decreasing the squared difference between the estimated and observed categorical variables (facies) and thereby reducing the degree of uncertainty.
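A multinomial logistic regression of the kind described can be sketched with scikit-learn. The real study uses well-log and core measurements; the features and three-class structure below are purely illustrative synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 600
# Hypothetical standardized predictors (stand-ins for e.g. gamma ray,
# formation density, porosity).
X = rng.normal(size=(n, 3))
# Three synthetic facies classes sampled from a softmax over linear predictors
# (adding Gumbel noise to logits and taking the argmax is equivalent).
logits = np.stack([X[:, 0], X[:, 1], -(X[:, 0] + X[:, 1])], axis=1)
facies = np.argmax(logits + rng.gumbel(size=(n, 3)), axis=1)

clf = LogisticRegression(max_iter=1000).fit(X, facies)
proba = clf.predict_proba(X)   # posterior class probabilities per sample
acc = clf.score(X, facies)
```

The fitted model returns a full posterior probability vector per sample, which is what the facies-probability interpretation in the abstract refers to.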
2013-01-01
Background Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry). Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker density, but result in some genotype errors and a large number of missing genotype values. Imputation can reduce the number of missing values and can correct genotyping errors, but current methods of imputation require a reference genome and thus are not an option for most species. Results Genotyping by Sequencing (GBS) was used to produce highly saturated maps for a R. idaeus pseudo-testcross progeny. While low coverage and high variance in sequencing resulted in a large number of missing values for some individuals, a novel method of imputation based on maximum likelihood marker ordering from initial marker segregation overcame the challenge of missing values, and made map construction computationally tractable. The two resulting parental maps contained 4521 and 2391 molecular markers spanning 462.7 and 376.6 cM respectively over seven linkage groups. Detection of precise genomic regions with segregation distortion was possible because of map saturation. Microsatellites (SSRs) linked these results to published maps for cross-validation and map comparison. Conclusions GBS together with genome-independent imputation provides a rapid method for genetic map construction in any pseudo-testcross progeny. Our method of imputation estimates the correct genotype call of missing values and corrects genotyping errors that lead to inflated map size and reduced precision in marker placement. Comparison of SSRs to published R. idaeus maps showed that the linkage maps constructed with GBS and our method of imputation were robust, and marker positioning reliable. The high marker density allowed identification of genomic regions with segregation distortion in R. 
idaeus, which may help to identify deleterious alleles that are the basis of inbreeding depression in the species. PMID:23324311
Adjusting HIV prevalence estimates for non-participation: an application to demographic surveillance
McGovern, Mark E.; Marra, Giampiero; Radice, Rosalba; Canning, David; Newell, Marie-Louise; Bärnighausen, Till
2015-01-01
Introduction HIV testing is a cornerstone of efforts to combat the HIV epidemic, and testing conducted as part of surveillance provides invaluable data on the spread of infection and the effectiveness of campaigns to reduce the transmission of HIV. However, participation in HIV testing can be low, and if respondents systematically select not to be tested because they know or suspect they are HIV positive (and fear disclosure), standard approaches to deal with missing data will fail to remove selection bias. We implemented Heckman-type selection models, which can be used to adjust for missing data that are not missing at random, and established the extent of selection bias in a population-based HIV survey in an HIV hyperendemic community in rural South Africa. Methods We used data from a population-based HIV survey carried out in 2009 in rural KwaZulu-Natal, South Africa. In this survey, 5565 women (35%) and 2567 men (27%) provided blood for an HIV test. We accounted for missing data using interviewer identity as a selection variable which predicted consent to HIV testing but was unlikely to be independently associated with HIV status. Our approach involved using this selection variable to examine the HIV status of residents who would ordinarily refuse to test, except that they were allocated a persuasive interviewer. Our copula model allows for flexibility when modelling the dependence structure between HIV survey participation and HIV status. Results For women, our selection model generated an HIV prevalence estimate of 33% (95% CI 27–40) for all people eligible to consent to HIV testing in the survey. This estimate is higher than the estimate of 24% generated when only information from respondents who participated in testing is used in the analysis, and the estimate of 27% when imputation analysis is used to predict missing data on HIV status. 
For men, we found an HIV prevalence of 25% (95% CI 15–35) using the selection model, compared to 16% among those who participated in testing, and 18% estimated with imputation. We provide new confidence intervals that correct for the fact that the relationship between testing and HIV status is unknown and requires estimation. Conclusions We confirm the feasibility and value of adopting selection models to account for missing data in population-based HIV surveys and surveillance systems. Elements of survey design, such as interviewer identity, present the opportunity to adopt this approach in routine applications. Where non-participation is high, true confidence intervals are much wider than those generated by standard approaches to dealing with missing data suggest. PMID:26613900
Missing value imputation strategies for metabolomics data.
Armitage, Emily Grace; Godzien, Joanna; Alonso-Herranz, Vanesa; López-Gonzálvez, Ángeles; Barbas, Coral
2015-12-01
Missing values can arise for different reasons, and depending on their origin they should be considered and dealt with in different ways. In this research, four methods of imputation have been compared with respect to their effects on the normality and variance of data, on statistical significance, and on the approximation of a suitable threshold for accepting missing data as truly missing. Additionally, the effects of different strategies for controlling the familywise error rate or false discovery rate, and how they interact with the different strategies for missing value imputation, have been evaluated. Missing values were found to affect the normality and variance of the data, and k-means nearest neighbour imputation was the best method tested for restoring them. Bonferroni correction was the best method for maximizing true positives and minimizing false positives, and it was observed that as little as 40% missing data could be truly missing. The range between 40 and 70% missing values was defined as a "gray area", and therefore a strategy has been proposed that balances the optimal imputation strategy, k-means nearest neighbour, against the best approximation of positioning real zeros. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
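Nearest-neighbour imputation of the kind tested here can be sketched as follows; the Euclidean distance metric and the toy matrix are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def knn_impute(data, k=3):
    """Fill NaN entries with the mean of the k most similar donor rows.

    Similarity is Euclidean distance over the features both rows observe.
    A minimal sketch of nearest-neighbour imputation.
    """
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    for i, row in enumerate(data):
        miss = np.isnan(row)
        if not miss.any():
            continue
        cands = []
        for j, other in enumerate(data):
            shared = ~np.isnan(row) & ~np.isnan(other)
            # Skip self and rows that cannot donate all needed values.
            if j == i or np.isnan(other[miss]).any() or not shared.any():
                continue
            cands.append((np.linalg.norm(row[shared] - other[shared]), j))
        neighbours = [j for _, j in sorted(cands)[:k]]
        if neighbours:
            filled[i, miss] = data[neighbours][:, miss].mean(axis=0)
    return filled

# Toy matrix: row 1 is missing its middle value.
X = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.2],
              [0.9, 2.1, 2.9],
              [5.0, 6.0, 7.0]])
imputed = knn_impute(X, k=2)  # fills X[1, 1] from the two closest rows
```

Here the two nearest donors to row 1 are rows 0 and 2, so the gap is filled with the mean of their middle values, 2.05.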
Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data.
Wei, Runmin; Wang, Jingye; Su, Mingming; Jia, Erik; Chen, Shaoqiu; Chen, Tianlu; Ni, Yan
2018-01-12
Missing values exist widely in mass spectrometry (MS)-based metabolomics data. Various methods have been applied for handling missing values, but the choice of method can significantly affect subsequent data analyses. Typically, there are three types of missing values: missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis were used to evaluate the overall sample distribution. Student's t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed the best for MCAR/MAR and QRILC was the favored one for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a publicly accessible web tool for the application of missing value imputation in metabolomics (https://metabolomics.cc.hawaii.edu/software/MetImp/).
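The NRMSE criterion used to score imputation accuracy can be sketched as follows; normalizing by the standard deviation of the true values is one common convention, and the paper's exact normalization may differ:

```python
import numpy as np

def nrmse(true_vals, imputed_vals):
    """Normalized root mean squared error between true and imputed values.

    Normalized here by the standard deviation of the true values (one
    common convention; definitions of NRMSE vary across papers).
    """
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    rmse = np.sqrt(np.mean((true_vals - imputed_vals) ** 2))
    return rmse / np.std(true_vals)

score = nrmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Lower NRMSE means the imputed values track the (artificially deleted) true values more closely, which is what the ranking-based SOR summary then aggregates across methods.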
The effects of missing data on global ozone estimates
NASA Technical Reports Server (NTRS)
Drewry, J. W.; Robbins, J. L.
1981-01-01
The effects of missing data and model truncation on estimates of the global mean, zonal distribution, and global distribution of ozone are considered. It is shown that missing data can introduce biased estimates with errors that are not accounted for in the accuracy calculations of empirical modeling techniques. Data-fill techniques are introduced and used for evaluating error bounds and constraining the estimate in areas of sparse and missing data. It is found that the accuracy of the global mean estimate is more dependent on data distribution than model size. Zonal features can be accurately described by 7th order models over regions of adequate data distribution. Data variance accounted for by higher order models appears to represent climatological features of columnar ozone rather than pure error. Data-fill techniques can prevent artificial feature generation in regions of sparse or missing data without degrading high order estimates over dense data regions.
Voillet, Valentin; Besse, Philippe; Liaubet, Laurence; San Cristobal, Magali; González, Ignacio
2016-10-03
In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose a multiple imputation (MI) approach in a multivariate framework. In this study, we focus on multiple factor analysis (MFA) as a tool to compare and integrate multiple layers of information. MI involves filling the missing rows with plausible values, resulting in M completed datasets. MFA is then applied to each completed dataset to produce M different configurations (the matrices of coordinates of individuals). Finally, the M configurations are combined to yield a single consensus solution. We assessed the performance of our method, named MI-MFA, on two real omics datasets. Incomplete artificial datasets with different patterns of missingness were created from these data. The MI-MFA results were compared with two other approaches i.e., regularized iterative MFA (RI-MFA) and mean variable imputation (MVI-MFA). For each configuration resulting from these three strategies, the suitability of the solution was determined against the true MFA configuration obtained from the original data and a comprehensive graphical comparison showing how the MI-, RI- or MVI-MFA configurations diverge from the true configuration was produced. Two approaches i.e., confidence ellipses and convex hulls, to visualize and assess the uncertainty due to missing values were also described. We showed how the areas of ellipses and convex hulls increased with the number of missing individuals. A free and easy-to-use code was proposed to implement the MI-MFA method in the R statistical environment. We believe that MI-MFA provides a useful and attractive method for estimating the coordinates of individuals on the first MFA components despite missing rows. 
MI-MFA configurations were close to the true configuration even when many individuals were missing in several data tables. This method takes into account the uncertainty of MI-MFA configurations induced by the missing rows, thereby allowing the reliability of the results to be evaluated.
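The general multiple-imputation recipe underlying MI-MFA, namely fill the gaps M times, analyze each completed dataset, and combine the M results, can be sketched for a scalar statistic. This is a generic illustration with a naive draw-from-observed filler, not the MFA-specific algorithm:

```python
import random
import statistics

def mi_mean(values, m=5, seed=0):
    """Multiple imputation for the mean of a list with None gaps.

    Each of the m completed datasets fills every gap with a random draw
    from the observed values; the m per-dataset means are then pooled by
    averaging (Rubin's rules for the point estimate).
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(m):
        completed = [v if v is not None else rng.choice(observed)
                     for v in values]
        estimates.append(statistics.mean(completed))
    return statistics.mean(estimates)

pooled = mi_mean([1.0, 2.0, None, 4.0, None], m=20)
```

In MI-MFA the per-dataset "estimate" is an entire configuration of individual coordinates rather than a scalar, and the spread of the M configurations (the ellipses and convex hulls above) plays the role that between-imputation variance plays here.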
Crown, William; Chang, Jessica; Olson, Melvin; Kahler, Kristijan; Swindle, Jason; Buzinec, Paul; Shah, Nilay; Borah, Bijan
2015-09-01
Missing data, particularly missing variables, can create serious analytic challenges in observational comparative effectiveness research studies. Statistical linkage of datasets is a potential method for incorporating missing variables. Prior studies have focused upon the bias introduced by imperfect linkage. This analysis uses a case study of hepatitis C patients to estimate the net effect of statistical linkage on bias, also accounting for the potential reduction in missing variable bias. The results show that statistical linkage can reduce bias while also enabling parameter estimates to be obtained for the formerly missing variables. The usefulness of statistical linkage will vary depending upon the strength of the correlations of the missing variables with the treatment variable, as well as the outcome variable of interest.
The Effects of Methods of Imputation for Missing Values on the Validity and Reliability of Scales
ERIC Educational Resources Information Center
Cokluk, Omay; Kayri, Murat
2011-01-01
The main aim of this study is the comparative examination of the factor structures, corrected item-total correlations, and Cronbach-alpha internal consistency coefficients obtained by different methods used in imputation for missing values in conditions of not having missing values, and having missing values of different rates in terms of testing…
Newgard, Craig; Malveau, Susan; Staudenmayer, Kristan; Wang, N. Ewen; Hsia, Renee Y.; Mann, N. Clay; Holmes, James F.; Kuppermann, Nathan; Haukoos, Jason S.; Bulger, Eileen M.; Dai, Mengtao; Cook, Lawrence J.
2012-01-01
Objectives The objective was to evaluate the process of using existing data sources, probabilistic linkage, and multiple imputation to create large population-based injury databases matched to outcomes. Methods This was a retrospective cohort study of injured children and adults transported by 94 emergency medical systems (EMS) agencies to 122 hospitals in seven regions of the western United States over a 36-month period (2006 to 2008). All injured patients evaluated by EMS personnel within specific geographic catchment areas were included, regardless of field disposition or outcome. The authors performed probabilistic linkage of EMS records to four hospital and postdischarge data sources (emergency department [ED] data, patient discharge data, trauma registries, and vital statistics files) and then handled missing values using multiple imputation. The authors compare and evaluate matched records, match rates (proportion of matches among eligible patients), and injury outcomes within and across sites. Results There were 381,719 injured patients evaluated by EMS personnel in the seven regions. Among transported patients, match rates ranged from 14.9% to 87.5% and were directly affected by the availability of hospital data sources and proportion of missing values for key linkage variables. For vital statistics records (1-year mortality), estimated match rates ranged from 88.0% to 98.7%. Use of multiple imputation (compared to complete case analysis) reduced bias for injury outcomes, although sample size, percentage missing, type of variable, and combined-site versus single-site imputation models all affected the resulting estimates and variance. Conclusions This project demonstrates the feasibility and describes the process of constructing population-based injury databases across multiple phases of care using existing data sources and commonly available analytic methods. 
Attention to key linkage variables and decisions for handling missing values can be used to increase match rates between data sources, minimize bias, and preserve sampling design. PMID:22506952
40 CFR 98.416 - Data reporting requirements.
Code of Federal Regulations, 2010 CFR
2010-07-01
.... (16) Where missing data have been estimated pursuant to § 98.415, the reason the data were missing, the length of time the data were missing, the method used to estimate the missing data, and the... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Data reporting requirements. 98.416...
Handling missing values in the MDS-UPDRS.
Goetz, Christopher G; Luo, Sheng; Wang, Lu; Tilley, Barbara C; LaPelle, Nancy R; Stebbins, Glenn T
2015-10-01
This study was undertaken to define the number of missing values permissible to render valid total scores for each Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS) part. To handle missing values, imputation strategies serve as guidelines to reject an incomplete rating or create a surrogate score. We tested a rigorous, scale-specific, data-based approach to handling missing values for the MDS-UPDRS. From two large MDS-UPDRS datasets, we sequentially deleted item scores, either consistently (same items) or randomly (different items) across all subjects. Lin's Concordance Correlation Coefficient (CCC) compared scores calculated without missing values with prorated scores based on sequentially increasing missing values. The maximal number of missing values retaining a CCC greater than 0.95 determined the threshold for rendering a valid prorated score. A second confirmatory sample was selected from the MDS-UPDRS international translation program. To provide valid part scores applicable across all Hoehn and Yahr (H&Y) stages when the same items are consistently missing, one missing item from Part I, one from Part II, three from Part III, but none from Part IV can be allowed. To provide valid part scores applicable across all H&Y stages when random item entries are missing, one missing item from Part I, two from Part II, seven from Part III, but none from Part IV can be allowed. All cutoff values were confirmed in the validation sample. These analyses are useful for constructing valid surrogate part scores for MDS-UPDRS when missing items fall within the identified threshold and give scientific justification for rejecting partially completed ratings that fall below the threshold. © 2015 International Parkinson and Movement Disorder Society.
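A prorated surrogate score of the kind validated here can be computed by scaling the sum of answered items up to the full item count. This sketch assumes simple proration; the per-part item counts and missing-item thresholds in the example are taken from the study's description (e.g. at most one missing item for Part I):

```python
def prorated_score(items, total_items, max_missing):
    """Prorate a part score when some items (None) are missing.

    Scales the sum of answered items up to the full item count, and
    rejects the rating (returns None) when more items are missing than
    the validated threshold allows.
    """
    answered = [v for v in items if v is not None]
    n_missing = total_items - len(answered)
    if n_missing > max_missing:
        return None  # too many missing items: rating is invalid
    return sum(answered) * total_items / len(answered)

# MDS-UPDRS Part I (13 items) with one missing item: valid surrogate score.
part1_score = prorated_score([2] * 12 + [None], total_items=13, max_missing=1)
```

With twelve items scored 2 and one missing, the surrogate score is 24 × 13 / 12 = 26.0; a rating exceeding the threshold is rejected rather than prorated.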
Rosinska, Magdalena; Pantazis, Nikos; Janiec, Janusz; Pharris, Anastasia; Amato-Gauci, Andrew J; Quinten, Chantal; ECDC HIV/AIDS Surveillance Network
2018-06-01
Accurate case-based surveillance data remain the key data source for estimating HIV burden and monitoring prevention efforts in Europe. We carried out a literature review and exploratory analysis of surveillance data regarding two crucial issues affecting European surveillance for HIV: missing data and reporting delay. Initial screening showed substantial variability of these data issues, both in time and across countries. In terms of missing data, the CD4+ cell count is the most problematic variable because of the high proportion of missing values. In 20 of 31 countries of the European Union/European Economic Area (EU/EEA), CD4+ counts are systematically missing for all or some years. One of the key challenges related to reporting delays is that countries undertake specific one-off actions in effort to capture previously unreported cases, and that these cases are subsequently reported with excessive delays. Slightly different underlying assumptions and effectively different models may be required for individual countries to adjust for missing data and reporting delays. However, using a similar methodology is recommended to foster harmonisation and to improve the accuracy and usability of HIV surveillance data at national and EU/EEA levels.
Bayesian Statistical Inference in Ion-Channel Models with Exact Missed Event Correction.
Epstein, Michael; Calderhead, Ben; Girolami, Mark A; Sivilotti, Lucia G
2016-07-26
The stochastic behavior of single ion channels is most often described as an aggregated continuous-time Markov process with discrete states. For ligand-gated channels each state can represent a different conformation of the channel protein or a different number of bound ligands. Single-channel recordings show only whether the channel is open or shut: states of equal conductance are aggregated, so transitions between them have to be inferred indirectly. The requirement to filter noise from the raw signal further complicates the modeling process, as it limits the time resolution of the data. The consequence of the reduced bandwidth is that openings or shuttings that are shorter than the resolution cannot be observed; these are known as missed events. Postulated models fitted using filtered data must therefore explicitly account for missed events to avoid bias in the estimation of rate parameters and therefore assess parameter identifiability accurately. In this article, we present the first, to our knowledge, Bayesian modeling of ion-channels with exact missed events correction. Bayesian analysis represents uncertain knowledge of the true value of model parameters by considering these parameters as random variables. This allows us to gain a full appreciation of parameter identifiability and uncertainty when estimating values for model parameters. However, Bayesian inference is particularly challenging in this context as the correction for missed events increases the computational complexity of the model likelihood. Nonetheless, we successfully implemented a two-step Markov chain Monte Carlo method that we called "BICME", which performs Bayesian inference in models of realistic complexity. The method is demonstrated on synthetic and real single-channel data from muscle nicotinic acetylcholine channels. We show that parameter uncertainty can be characterized more accurately than with maximum-likelihood methods. 
Our code for performing inference in these ion channel models is publicly available. Copyright © 2016 Biophysical Society. Published by Elsevier Inc. All rights reserved.
Standard and Robust Methods in Regression Imputation
ERIC Educational Resources Information Center
Moraveji, Behjat; Jafarian, Koorosh
2014-01-01
The aim of this paper is to provide an introduction of new imputation algorithms for estimating missing values from official statistics in larger data sets of data pre-processing, or outliers. The goal is to propose a new algorithm called IRMI (iterative robust model-based imputation). This algorithm is able to deal with all challenges like…
ERIC Educational Resources Information Center
Köhler, Carmen; Pohl, Steffi; Carstensen, Claus H.
2015-01-01
When competence tests are administered, subjects frequently omit items. These missing responses pose a threat to correctly estimating the proficiency level. Newer model-based approaches aim to take nonignorable missing data processes into account by incorporating a latent missing propensity into the measurement model. Two assumptions are typically…
40 CFR 98.335 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data... missing data. For the carbon input procedure in § 98.333(b), a complete record of all measured parameters... average carbon contents of inputs according to the procedures in § 98.335(b) if data are missing. (b) For...
40 CFR 98.335 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... 40 Protection of Environment 22 2013-07-01 2013-07-01 false Procedures for estimating missing data... missing data. For the carbon input procedure in § 98.333(b), a complete record of all measured parameters... average carbon contents of inputs according to the procedures in § 98.335(b) if data are missing. (b) For...
40 CFR 98.335 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Procedures for estimating missing data... missing data. For the carbon input procedure in § 98.333(b), a complete record of all measured parameters... average carbon contents of inputs according to the procedures in § 98.335(b) if data are missing. (b) For...
40 CFR 98.335 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data... missing data. For the carbon input procedure in § 98.333(b), a complete record of all measured parameters... average carbon contents of inputs according to the procedures in § 98.335(b) if data are missing. (b) For...
40 CFR 98.335 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 20 2010-07-01 2010-07-01 false Procedures for estimating missing data... missing data. For the carbon input procedure in § 98.333(b), a complete record of all measured parameters... average carbon contents of inputs according to the procedures in § 98.335(b) if data are missing. (b) For...
Bannon, William
2015-04-01
Missing data typically refer to the absence of one or more values within a study variable(s) contained in a dataset. They often result from a study participant choosing not to provide a response to a survey item. In general, a greater number of missing values within a dataset reflects a greater challenge to the data analyst. However, if researchers are armed with just a few basic tools, they can quite effectively diagnose how serious the issue of missing data is within a dataset, as well as prescribe the most appropriate solution. Specifically, the keys to effectively assessing and treating missing data values within a dataset involve specifying how missing data will be defined in a study, assessing the amount of missing data, identifying the pattern of the missing data, and selecting the best way to treat the missing data values. I will touch on each of these processes and provide a brief illustration of how the validity of study findings is at great risk if missing data values are not treated effectively. ©2015 American Association of Nurse Practitioners.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Spriggs, G D
In a previous paper, the composite exposure rate conversion factor (ECF) for nuclear fallout was calculated using a simple theoretical photon-transport model. The theoretical model was used to fill in the gaps in the FGR-12 table generated by ORNL. The FGR-12 table contains the individual conversion factors for approximately 1000 radionuclides. However, in order to calculate the exposure rate during the first 30 minutes following a nuclear detonation, the conversion factors for approximately 2000 radionuclides are needed. From a human-effects standpoint, it is also necessary to have the dose rate conversion factors (DCFs) for all 2000 radionuclides. The DCFs are used to predict the whole-body dose rates that would occur if a human were standing in a radiation field of known exposure rate. As calculated by ORNL, the whole-body dose rate (rem/hr) is approximately 70% of the exposure rate (R/hr) at one meter above the surface. Hence, the individual DCFs could be estimated by multiplying the individual ECFs by 0.7. Although this is a handy rule-of-thumb, a more consistent (and perhaps more accurate) method of estimating the individual DCFs for the missing radionuclides in the FGR-12 table is to use the linear relationship between DCF and total gamma energy released per decay. This relationship is shown in Figure 1. The DCFs for individual organs in the body can also be estimated from the estimated whole-body DCF. Using the DCFs given in FGR-12, the ratios of the organ-specific DCFs to the whole-body DCF were plotted as a function of the whole-body DCF. From these plots, the asymptotic ratios were obtained (see Table 1). Using these asymptotic ratios, the organ-specific DCFs can be estimated using the estimated whole-body DCF for each of the missing radionuclides in the FGR-12 table.
Although this procedure for estimating the organ-specific DCFs may over-estimate the value for some low gamma-energy emitters, having a finite value for the organ-specific DCFs in the table is probably better than having no value at all. A summary of the complete ECF and DCF values is given in Table 2.
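The 70% rule of thumb quoted above can be expressed directly; the 0.7 factor comes from the ORNL calculation cited in the text, and the input value below is purely illustrative:

```python
def estimate_dcf(ecf_r_per_hr):
    """Rule-of-thumb dose rate conversion factor.

    Whole-body dose rate (rem/hr) is approximately 70% of the exposure
    rate (R/hr) at one meter above the surface, per the ORNL calculation.
    """
    return 0.7 * ecf_r_per_hr

dcf = estimate_dcf(10.0)  # 10 R/hr exposure -> about 7 rem/hr whole-body
```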
Multivariate longitudinal data analysis with censored and intermittent missing responses.
Lin, Tsung-I; Lachos, Victor H; Wang, Wan-Lun
2018-05-08
The multivariate linear mixed model (MLMM) has emerged as an important analytical tool for longitudinal data with multiple outcomes. However, the analysis of multivariate longitudinal data could be complicated by the presence of censored measurements because of a detection limit of the assay in combination with unavoidable missing values arising when subjects miss some of their scheduled visits intermittently. This paper presents a generalization of the MLMM approach, called the MLMM-CM, for a joint analysis of the multivariate longitudinal data with censored and intermittent missing responses. A computationally feasible expectation maximization-based procedure is developed to carry out maximum likelihood estimation within the MLMM-CM framework. Moreover, the asymptotic standard errors of fixed effects are explicitly obtained via the information-based method. We illustrate our methodology by using simulated data and a case study from an AIDS clinical trial. Experimental results reveal that the proposed method is able to provide more satisfactory performance as compared with the traditional MLMM approach. Copyright © 2018 John Wiley & Sons, Ltd.
Missing Data in Alcohol Clinical Trials with Binary Outcomes
Hallgren, Kevin A.; Witkiewitz, Katie; Kranzler, Henry R.; Falk, Daniel E.; Litten, Raye Z.; O’Malley, Stephanie S.; Anton, Raymond F.
2017-01-01
Background Missing data are common in alcohol clinical trials for both continuous and binary endpoints. Approaches to handle missing data have been explored for continuous outcomes, yet no studies have compared missing data approaches for binary outcomes (e.g., abstinence, no heavy drinking days). The present study compares approaches to modeling binary outcomes with missing data in the COMBINE study. Method We included participants in the COMBINE Study who had complete drinking data during treatment and who were assigned to active medication or placebo conditions (N=1146). Using simulation methods, missing data were introduced under common scenarios with varying sample sizes and amounts of missing data. Logistic regression was used to estimate the effect of naltrexone (vs. placebo) in predicting any drinking and any heavy drinking outcomes at the end of treatment using four analytic approaches: complete case analysis (CCA), last observation carried forward (LOCF), the worst-case scenario of missing equals any drinking or heavy drinking (WCS), and multiple imputation (MI). In separate analyses, these approaches were compared when drinking data were manually deleted for those participants who discontinued treatment but continued to provide drinking data. Results WCS produced the greatest amount of bias in treatment effect estimates. MI usually yielded less biased estimates than WCS and CCA in the simulated data, and performed considerably better than LOCF when estimating treatment effects among individuals who discontinued treatment. Conclusions Missing data can introduce bias in treatment effect estimates in alcohol clinical trials. Researchers should utilize modern missing data methods, including MI, and avoid WCS and CCA when analyzing binary alcohol clinical trial outcomes. PMID:27254113
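Of the four analytic approaches compared, last observation carried forward is the simplest to state. A minimal sketch, where None marks a missed assessment in a sequence of drinking reports:

```python
def locf(series):
    """Last observation carried forward.

    Fill each gap (None) with the most recent observed value; leading
    gaps stay missing because nothing precedes them.
    """
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical weekly drink counts with two missed assessments.
filled = locf([3, None, None, 0, None])
```

LOCF's weakness, highlighted by these results, is that it freezes a participant's last report forward in time, which is exactly why it performed worse than multiple imputation for participants who discontinued treatment.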
Multiple imputation for handling missing outcome data when estimating the relative risk.
Sullivan, Thomas R; Lee, Katherine J; Ryan, Philip; Salter, Amy B
2017-09-06
Multiple imputation is a popular approach to handling missing data in medical research, yet little is known about its applicability for estimating the relative risk. Standard methods for imputing incomplete binary outcomes involve logistic regression or an assumption of multivariate normality, whereas relative risks are typically estimated using log binomial models. It is unclear whether misspecification of the imputation model in this setting could lead to biased parameter estimates. Using simulated data, we evaluated the performance of multiple imputation for handling missing data prior to estimating adjusted relative risks from a correctly specified multivariable log binomial model. We considered an arbitrary pattern of missing data in both outcome and exposure variables, with missing data induced under missing at random mechanisms. Focusing on standard model-based methods of multiple imputation, missing data were imputed using multivariate normal imputation or fully conditional specification with a logistic imputation model for the outcome. Multivariate normal imputation performed poorly in the simulation study, consistently producing estimates of the relative risk that were biased towards the null. Despite outperforming multivariate normal imputation, fully conditional specification also produced somewhat biased estimates, with greater bias observed for higher outcome prevalences and larger relative risks. Deleting imputed outcomes from analysis datasets did not improve the performance of fully conditional specification. Both multivariate normal imputation and fully conditional specification produced biased estimates of the relative risk, presumably since both use a misspecified imputation model. Based on simulation results, we recommend researchers use fully conditional specification rather than multivariate normal imputation and retain imputed outcomes in the analysis when estimating relative risks. 
However, fully conditional specification is not without its shortcomings, so further research is needed to identify optimal approaches for relative risk estimation within the multiple imputation framework.
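As a concrete illustration of how per-imputation results like those above are combined into one estimate, here is a minimal sketch of Rubin's pooling rules in Python; the log relative risk estimates and variances below are invented for illustration, not taken from the study:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m completed-data results with Rubin's rules.

    estimates: per-imputation point estimates (e.g. log relative risks)
    variances: their squared standard errors
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w_bar + (1.0 + 1.0 / m) * b     # total variance
    return q_bar, t

# Five invented log-RR estimates from five completed data sets
q, t = pool_rubin([0.42, 0.38, 0.45, 0.40, 0.44],
                  [0.010, 0.011, 0.009, 0.010, 0.012])
```

The between-imputation term b is what distinguishes a pooled multiple-imputation variance from a naive average of the per-imputation variances.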
A Youth Compendium of Physical Activities: Activity Codes and Metabolic Intensities
BUTTE, NANCY F.; WATSON, KATHLEEN B.; RIDLEY, KATE; ZAKERI, ISSA F.; MCMURRAY, ROBERT G.; PFEIFFER, KARIN A.; CROUTER, SCOTT E.; HERRMANN, STEPHEN D.; BASSETT, DAVID R.; LONG, ALEXANDER; BERHANE, ZEKARIAS; TROST, STEWART G.; AINSWORTH, BARBARA E.; BERRIGAN, DAVID; FULTON, JANET E.
2018-01-01
ABSTRACT Purpose A Youth Compendium of Physical Activities (Youth Compendium) was developed to estimate the energy costs of physical activities using data on youth only. Methods On the basis of a literature search and pooled data of energy expenditure measurements in youth, the energy costs of 196 activities were compiled in 16 activity categories to form a Youth Compendium of Physical Activities. To estimate the intensity of each activity, measured oxygen consumption (VO2) was divided by basal metabolic rate (Schofield age-, sex-, and mass-specific equations) to produce a youth MET (METy). A mixed linear model was developed for each activity category to impute missing values for age ranges with no observations for a specific activity. Results This Youth Compendium consists of METy values for 196 specific activities classified into 16 major categories for four age-groups, 6–9, 10–12, 13–15, and 16–18 yr. METy values in this Youth Compendium were measured (51%) or imputed (49%) from youth data. Conclusion This Youth Compendium of Physical Activities uses pediatric data exclusively, addresses the age dependency of METy, and imputes missing METy values, and thus represents an advance in physical activity research and practice. This Youth Compendium will be a valuable resource for stakeholders interested in evaluating interventions, programs, and policies designed to assess and encourage physical activity in youth. PMID:28938248
A Spatiotemporal Prediction Framework for Air Pollution Based on Deep RNN
NASA Astrophysics Data System (ADS)
Fan, J.; Li, Q.; Hou, J.; Feng, X.; Karimian, H.; Lin, S.
2017-10-01
Time series data in practical applications always contain missing values due to sensor malfunction, network failure, outliers, etc. In order to handle missing values in time series, as well as the lack of consideration of temporal properties in machine learning models, we propose a spatiotemporal prediction framework based on missing value processing algorithms and a deep recurrent neural network (DRNN). By using a missing tag and missing interval to represent time series patterns, we implement three different missing value fixing algorithms, which are further incorporated into a deep neural network that consists of LSTM (Long Short-term Memory) layers and fully connected layers. Real-world air quality and meteorological datasets (Jingjinji area, China) are used for model training and testing. Deep feed-forward neural networks (DFNN) and gradient boosting decision trees (GBDT) are trained as baseline models against the proposed DRNN. The performances of the three missing value fixing algorithms, as well as of the different machine learning models, are evaluated and analysed. Experiments show that the proposed DRNN framework outperforms both DFNN and GBDT, thereby validating the capacity of the proposed framework. Our results also provide useful insights for better understanding of different strategies that handle missing values.
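The abstract does not spell out the exact encoding of the missing tag and missing interval; one plausible minimal reading, together with a simple fixing strategy (last observation carried forward), might look like this in Python (all names and data invented):

```python
import numpy as np

def missing_features(x):
    """Derive a missing tag and missing interval for a 1-D series.

    x: series with np.nan marking gaps (e.g. hourly PM2.5 readings).
    Returns (tag, interval, filled): tag[t] = 1 if x[t] was missing,
    interval[t] = steps since the last observed value, and a copy of
    the series fixed by carrying the last observation forward.
    """
    x = np.asarray(x, dtype=float)
    tag = np.isnan(x).astype(int)
    interval = np.zeros(len(x), dtype=int)
    filled = x.copy()
    last = np.nan      # stays NaN if the series starts with a gap
    steps = 0
    for t in range(len(x)):
        if tag[t]:
            steps += 1
            filled[t] = last   # LOCF fix; other strategies possible
        else:
            steps = 0
            last = x[t]
        interval[t] = steps
    return tag, interval, filled

tag, interval, filled = missing_features([3.0, np.nan, np.nan, 5.0, np.nan])
```

The tag and interval arrays can then be fed to the model alongside the fixed values, so the network can learn to discount long-imputed stretches.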
Carré, N; Uhry, Z; Velten, M; Trétarre, B; Schvartz, C; Molinié, F; Maarouf, N; Langlois, C; Grosclaude, P; Colonna, M
2006-09-01
Cancer registries maintain a complete record of new cancer cases occurring among residents of a specific geographic area. In France, they cover only 13% of the population. For thyroid cancer, whose incidence rate, unlike its mortality rate, varies widely across districts, national incidence estimates are not accurate. A nationwide database, such as the hospital discharge system, could improve these estimates, but its positive predictive value and sensitivity need to be evaluated. The positive predictive value and the sensitivity for thyroid cancer case ascertainment (ICD-10) of the national hospital discharge system in 1999 and 2000 were estimated using the cancer registries database of 10 French districts as the gold standard. The linkage of the two databases required transmission of nominative information from the health facilities of the study. From the registries database, a logistic regression analysis was carried out to identify factors related to being missed by the hospital discharge system. Among the 973 standardized discharge charts selected from the hospital discharge system, 866 were considered true positive cases and 107 false positive. Forty-five of the latter group were prevalent cases. The positive predictive value was 89% (95% confidence interval (CI): 87-91%) and did not differ according to the district (p=0.80). According to the cancer registries, 322 thyroid cancer cases diagnosed in 1999 or 2000 were missed by the hospital discharge system. Thus, the sensitivity of this latter system was 73% (70-76%) and varied significantly from 62% to 85% across districts (p<0.001) and according to the type of health facility (p<0.01). The positive predictive value of the French hospital discharge system for ascertainment of thyroid cancer cases is high and stable across districts.
Sensitivity is lower and varies significantly according to the type of health facility and across districts, which limits the usefulness of this database for a national estimate of the thyroid cancer incidence rate.
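The reported PPV and sensitivity follow directly from the counts given in the abstract; a small Python check (function name ours):

```python
def ppv_sensitivity(true_pos, false_pos, missed):
    """Positive predictive value and sensitivity of a case-finding system.

    true_pos: cases flagged by the discharge system and confirmed by
    the registry; false_pos: flagged but not confirmed; missed:
    registry cases the discharge system failed to flag.
    """
    ppv = true_pos / (true_pos + false_pos)
    sens = true_pos / (true_pos + missed)
    return ppv, sens

# Counts from the abstract: 866 true positives, 107 false positives,
# 322 registry cases missed by the discharge system.
ppv, sens = ppv_sensitivity(866, 107, 322)
```

This reproduces the stated 89% PPV (866/973) and 73% sensitivity (866/1188).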
Cassell, B G; Adamec, V; Pearson, R E
2003-09-01
A method to measure completeness of pedigree information is applied to populations of Holstein (registered and grade) and Jersey (largely registered) cows. Inbreeding coefficients in which missing ancestors make no contribution were compared to a method using average relationships for missing ancestors. Inbreeding depression was estimated from an animal model that simultaneously adjusted for breeding values. Inbreeding and its standard deviation increased with more information, from 0.04 +/- 0.84 to 1.65 +/- 2.05 and 2.06 +/- 2.22 for grade Holsteins with <31%, 31 to 70%, and 71 to 100% complete five-generation pedigrees. Inbreeding from the method of average relationships for missing ancestors was 2.75 +/- 1.06, 3.10 +/- 2.21, and 2.89 +/- 2.37 for the same groups. Pedigrees of registered Holsteins and Jerseys were over 97% and over 89% complete, respectively. Inbreeding depression in days to first service and summit milk yield was estimated by both methods. Inbreeding depression for days to first service was not consistently significant for grade Holsteins and ranged from -0.37 d per 1% increase in inbreeding (grade Holstein pedigrees <31% complete) to 0.15 d for grade Holstein pedigrees >70% complete. Estimates were similar for both methods. Inbreeding depression for registered Holsteins and Jerseys was positive (undesirable) but not significant for days to first service. Inbreeding depressed summit milk yield significantly in all groups by both methods. Summit milk yield declined by 0.06 to 0.12 kg/d per 1% increase in inbreeding in Holsteins and by 0.08 kg/d per 1% increase in inbreeding in Jerseys. Pedigrees of grade animals are frequently incomplete and can yield misleading estimates of inbreeding depression. This problem is not overcome by inserting average relationships for missing ancestors in the calculation of inbreeding coefficients.
Partitioning error components for accuracy-assessment of near-neighbor methods of imputation
Albert R. Stage; Nicholas L. Crookston
2007-01-01
Imputation is applied for two quite different purposes: to supply missing data to complete a data set for subsequent modeling analyses or to estimate subpopulation totals. Error properties of the imputed values have different effects in these two contexts. We partition errors of imputation derived from similar observation units as arising from three sources:...
ERIC Educational Resources Information Center
Enders, Craig K.; Peugh, James L.
2004-01-01
Two methods, direct maximum likelihood (ML) and the expectation maximization (EM) algorithm, can be used to obtain ML parameter estimates for structural equation models with missing data (MD). Although the 2 methods frequently produce identical parameter estimates, it may be easier to satisfy missing at random assumptions using EM. However, no…
Hopke, P K; Liu, C; Rubin, D B
2001-03-01
Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.
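The abstract does not detail the three imputation models; as a simplified stand-in for the idea, one can multiply impute below-detection values by drawing from a distribution fitted to the observed data and truncated at the detection limit. A sketch in Python, with invented concentrations and a lognormal assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_below_dl(observed, n_censored, dl, m=3):
    """Multiply impute concentrations censored below a detection limit.

    Fits a lognormal to the observed (above-DL) values and, for each
    of the m imputations, draws the censored values from that
    lognormal truncated to (0, dl) by rejection sampling. This is an
    illustration of the idea, not the paper's Bayesian models.
    """
    logs = np.log(observed)
    mu, sigma = logs.mean(), logs.std(ddof=1)
    imputations = []
    for _ in range(m):
        draws = []
        while len(draws) < n_censored:
            x = rng.lognormal(mu, sigma)
            if x < dl:              # keep only draws below the limit
                draws.append(x)
        imputations.append(np.array(draws))
    return imputations

obs = np.array([1.1, 1.6, 2.5, 3.1, 4.0, 5.5])   # observed, above DL
imps = impute_below_dl(obs, n_censored=4, dl=1.0)
```

Drawing several imputations rather than one (e.g. substituting DL/2) lets downstream analyses propagate the uncertainty about the censored values.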
Wang, Guoshen; Pan, Yi; Seth, Puja; Song, Ruiguang; Belcher, Lisa
2017-01-01
Missing data create challenges for determining progress made in linking HIV-positive persons to HIV medical care. Statistical methods have generally not been used to address missing program data on linkage. In 2014, 61 health department jurisdictions were funded by the Centers for Disease Control and Prevention (CDC) and submitted data on HIV testing, newly diagnosed HIV-positive persons, and linkage to HIV medical care. Missing or unusable data existed in our data set. A new approach using multiple imputation to address missing linkage data was proposed, and the results were compared to the current approach that uses only records with complete information. There were 12,472 newly diagnosed HIV-positive persons from CDC-funded HIV testing events in 2014. Using multiple imputation, 94.1% (95% confidence interval (CI): [93.7%, 94.6%]) of newly diagnosed persons were referred to HIV medical care, 88.6% (95% CI: [88.0%, 89.1%]) were linked to care within any time frame, and 83.6% (95% CI: [83.0%, 84.3%]) were linked to care within 90 days. Multiple imputation is recommended for addressing missing linkage data in future analyses when the missing percentage is high. The use of multiple imputation for missing values can result in a better understanding of how programs are performing on key HIV testing and HIV service delivery indicators.
Fu, Yong-Bi
2014-01-01
Genotyping by sequencing (GBS) has recently emerged as a promising genomic approach for assessing genetic diversity on a genome-wide scale. However, concerns remain about the uniquely large imbalance in GBS genotype data. Although genotype imputation has been proposed to infer missing observations, little is known about the reliability of a genetic diversity analysis of GBS data with up to 90% of observations missing. Here we performed an empirical assessment of accuracy in genetic diversity analysis of highly incomplete single nucleotide polymorphism genotypes with imputations. Three large single nucleotide polymorphism genotype data sets for corn, wheat, and rice were acquired, and missing data with up to 90% of missing observations were randomly generated and then imputed with three map-independent imputation methods. Estimating heterozygosity and the inbreeding coefficient from original, missing, and imputed data revealed variable patterns of bias across the assessed levels of missingness and genotype imputation, but the estimation biases were smaller for missing data without genotype imputation. The estimates of genetic differentiation were rather robust up to 90% of missing observations but became substantially biased when missing genotypes were imputed. The estimates of topology accuracy for four representative samples of interested groups generally decreased with increased levels of missing genotypes. Probabilistic principal component analysis based imputation performed better in terms of topology accuracy than analyses of missing data without genotype imputation. These findings are not only significant for understanding the reliability of genetic diversity analysis with respect to large missing data and genotype imputation but also instructive for performing a proper genetic diversity analysis of highly incomplete GBS or other genotype data. PMID:24626289
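The robustness of diversity estimates computed from non-missing calls only can be illustrated in miniature: estimate observed heterozygosity before and after randomly masking 90% of a toy genotype matrix (all data synthetic; this sketches the "missing data without imputation" comparison, not the paper's pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

def observed_heterozygosity(geno):
    """Mean per-locus fraction of heterozygous calls, ignoring missing.

    geno: (individuals x loci) array of 0/1/2 allele counts, with -1
    marking a missing genotype call.
    """
    present = geno >= 0
    het = (geno == 1) & present
    return (het.sum(axis=0) / present.sum(axis=0)).mean()

# Toy SNP matrix: 200 individuals x 50 loci, genotypes drawn uniformly
geno = rng.integers(0, 3, size=(200, 50))
full = observed_heterozygosity(geno)

# Knock out 90% of calls at random and re-estimate without imputation
masked = geno.copy()
masked[rng.random(geno.shape) < 0.9] = -1
sparse = observed_heterozygosity(masked)
```

With missingness generated completely at random, the sparse estimate stays close to the full-data value; imputation that pulls genotypes toward the major allele would instead bias it downward.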
Clustering with Missing Values: No Imputation Required
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri
2004-01-01
Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.
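For contrast with KSC's soft constraints, the marginalization baseline mentioned above can be sketched as a k-means variant that simply ignores missing features when computing distances and centroids (a simplified illustration, not the paper's algorithm; data invented):

```python
import numpy as np

def kmeans_marginal(X, k, n_iter=20, seed=0):
    """K-means that marginalizes missing entries (np.nan): distances
    and centroid updates use only each item's observed features."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    col_mean = np.nanmean(X, axis=0)
    centers = np.where(np.isnan(centers), col_mean, centers)
    for _ in range(n_iter):
        # nansum treats a missing feature as contributing zero distance
        d = np.nansum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            n = (~np.isnan(pts)).sum(axis=0)       # observed counts
            s = np.nansum(pts, axis=0)             # observed sums
            centers[j] = np.where(n > 0, s / np.maximum(n, 1), centers[j])
    return labels, centers

# Two well-separated groups with scattered missing values
X = np.array([[0.0, 0.1], [0.2, np.nan], [np.nan, 0.0],
              [5.0, 5.1], [4.8, np.nan], [np.nan, 5.2]])
labels, centers = kmeans_marginal(X, k=2)
```

Marginalization avoids inventing values, but gives every partially observed item full weight in the assignment step; KSC's soft constraints instead down-weight information from partially observed features.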
28 CFR 19.4 - Cost and percentage estimates.
Code of Federal Regulations, 2010 CFR
2010-07-01
... RECOVERY OF MISSING CHILDREN § 19.4 Cost and percentage estimates. It is estimated that this program will... administrative costs. It is DOJ's objective that 50 percent of DOJ penalty mail contain missing children...
Noble gas Records of Early Evolution of the Earth
NASA Astrophysics Data System (ADS)
Ozima, M.; Podosek, F. A.
2001-12-01
Comparison between atmospheric noble gases (except for He) and solar (or meteoritic) noble gases clearly suggests that the Earth should have much more Xe than is present in air, and thus that up to about 90 percent of terrestrial Xe is missing from the Earth (1). In this report, we discuss implications of these observations for I-Pu chronology of the Earth and for the origin of terrestrial He3. Wetherill (2) first noted that an estimated I129/I127 ratio (3×10^-6) in the proto-Earth was about two orders of magnitude smaller than values commonly observed in meteorites (10^-4), and pointed out the possibility that Earth formation postdated meteorites by about 100 Ma. Ozima and Podosek (1999) came to a similar conclusion on the basis of I129/I127-Pu244/U238 systematics (1). In this report, we reexamine I-Pu systematics with new data for crustal I content (295 ppb for a bulk crust (3)). With imposition of an estimated value of 86 percent missing Xe as a constraint on the terrestrial Xe inventory, we conclude that the best estimate for a formation age of the Earth is about 28 Ma after the initial condensation of the solar nebula (at 4.57 Ga). The formation age thus estimated is significantly later than the generally assumed age of meteorites. We also argue from the I-Pu systematics that the loss of the missing Xe took place about 120 Ma after Earth formation. Assuming that the Earth is mostly degassed, the I-Pu formation age of the Earth can be reasonably assumed to represent a whole-Earth event. Therefore, we interpret the I-Pu age of the Earth as representing the time when the Earth started to retain noble gases. More specifically, this may correspond to the time when the proto-Earth attained a sufficient size to exert the necessary gravitational force. A giant impact could be another possibility, but it remains to be seen whether or not a giant impact could quantitatively remove heavier noble gases from the Earth.
It is interesting to speculate that the missing Xe was sequestered in the core during core formation. Core formation time would then be related to the time of the missing Xe event. The missing Xe age estimated above is close to the core formation age suggested from Nb-Zr systematics (4) and from U-Pb systematics (5), but considerably later than that suggested from Hf-W systematics (6). From a comparison of the relative elemental abundances of noble gases between the Earth and the solar composition, we show that terrestrial He3 may be totally unrelated to the heavier noble gases, which requires an origin of terrestrial He3 independent of the heavy noble gases. 1. Ozima M. and Podosek F.A. (1999) JGR, 104(BII), 25493. 2. Wetherill G.W. (1975) Ann. Rev. Nuclear Science, 25, 283. 3. Muramatsu Y. and Wedepohl K.H. (1998) Chemical Geology, 147, 201. 4. Jacobsen S.B. and Yin Q.Z. (2001) Lunar Planetary Science, XXXII, 1961.pdf (abstract). 5. Galer S.J.G. and Goldstein S.L. (1995) in Geophysical Monograph 95, 75-98, AGU. 6. Halliday A.N., Lee D.-C. and Jacobsen S.B. (2000) in Origin of the Earth and Moon, 45-62, Univ. Arizona Press.
The Fifth Cell: Correlation Bias in U.S. Census Adjustment.
ERIC Educational Resources Information Center
Wachter, Kenneth W.; Freedman, David A.
2000-01-01
Presents a method for estimating the total national number of doubly missing people (missing from Census counts and adjusted counts as well) and their distribution by race and sex. Application to the 1990 U.S. Census yields an estimate of three million doubly-missing people. (SLD)
The Impact of Missing Background Data on Subpopulation Estimation
ERIC Educational Resources Information Center
Rutkowski, Leslie
2011-01-01
Although population modeling methods are well established, a paucity of literature appears to exist regarding the effect of missing background data on subpopulation achievement estimates. Using simulated data that follows typical large-scale assessment designs with known parameters and a number of missing conditions, this paper examines the extent…
Wallace, Meredith L; Anderson, Stewart J; Mazumdar, Sati
2010-12-20
Missing covariate data present a challenge to tree-structured methodology due to the fact that a single tree model, as opposed to an estimated parameter value, may be desired for use in a clinical setting. To address this problem, we suggest a multiple imputation algorithm that adds draws of stochastic error to a tree-based single imputation method presented by Conversano and Siciliano (Technical Report, University of Naples, 2003). Unlike previously proposed techniques for accommodating missing covariate data in tree-structured analyses, our methodology allows the modeling of complex and nonlinear covariate structures while still resulting in a single tree model. We perform a simulation study to evaluate our stochastic multiple imputation algorithm when covariate data are missing at random and compare it to other currently used methods. Our algorithm is advantageous for identifying the true underlying covariate structure when complex data and larger percentages of missing covariate observations are present. It is competitive with other current methods with respect to prediction accuracy. To illustrate our algorithm, we create a tree-structured survival model for predicting time to treatment response in older, depressed adults. Copyright © 2010 John Wiley & Sons, Ltd.
Estimation of covariate-specific time-dependent ROC curves in the presence of missing biomarkers.
Li, Shanshan; Ning, Yang
2015-09-01
Covariate-specific time-dependent ROC curves are often used to evaluate the diagnostic accuracy of a biomarker with time-to-event outcomes, when certain covariates have an impact on the test accuracy. In many medical studies, measurements of biomarkers are subject to missingness due to high cost or limitation of technology. This article considers estimation of covariate-specific time-dependent ROC curves in the presence of missing biomarkers. To incorporate the covariate effect, we assume a proportional hazards model for the failure time given the biomarker and the covariates, and a semiparametric location model for the biomarker given the covariates. In the presence of missing biomarkers, we propose a simple weighted estimator for the ROC curves where the weights are inversely proportional to the selection probability. We also propose an augmented weighted estimator which utilizes information from the subjects with missing biomarkers. The augmented weighted estimator enjoys the double-robustness property in the sense that the estimator remains consistent if either the missing data process or the conditional distribution of the missing data given the observed data is correctly specified. We derive the large sample properties of the proposed estimators and evaluate their finite sample performance using numerical studies. The proposed approaches are illustrated using the US Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. © 2015, The International Biometric Society.
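The inverse-probability weighting idea behind the proposed estimator can be illustrated on a simpler target than a ROC curve: estimating a biomarker mean when missingness depends on an observed covariate (MAR). All data below are simulated, and the weighting is the general principle, not the paper's full augmented estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

z = rng.binomial(1, 0.5, n)                  # fully observed covariate
y = 2.0 + 1.5 * z + rng.normal(0.0, 1.0, n)  # biomarker; true mean 2.75
pi = np.where(z == 1, 0.9, 0.3)              # P(biomarker observed | z)
obs = rng.random(n) < pi                     # MAR missingness indicator

# Complete-case mean over-represents z = 1 and is biased upward ...
cc_mean = y[obs].mean()

# ... while weighting each observed value by 1/pi removes the bias.
ipw_mean = np.sum(y[obs] / pi[obs]) / np.sum(1.0 / pi[obs])
```

The augmented version adds a term built from a model for y given z, which keeps the estimator consistent if either the missingness model or the outcome model is correct (the double-robustness property described above).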
Two methods for parameter estimation using multiple-trait models and beef cattle field data.
Bertrand, J K; Kriese, L A
1990-08-01
Two methods are presented for estimating variances and covariances from beef cattle field data using multiple-trait sire models. Both methods require that the first trait have no missing records and that the contemporary groups for the second trait be subsets of the contemporary groups for the first trait; however, the second trait may have missing records. One method uses pseudo expectations involving quadratics composed of the solutions and the right-hand sides of the mixed model equations. The other method is an extension of Henderson's Simple Method to the multiple trait case. Neither of these methods requires any inversions of large matrices in the computation of the parameters; therefore, both methods can handle very large sets of data. Four simulated data sets were generated to evaluate the methods. In general, both methods estimated genetic correlations and heritabilities that were close to the Restricted Maximum Likelihood estimates and the true data set values, even when selection within contemporary groups was practiced. The estimates of residual correlations by both methods, however, were biased by selection. These two methods can be useful in estimating variances and covariances from multiple-trait models in large populations that have undergone a minimal amount of selection within contemporary groups.
Missing data exploration: highlighting graphical presentation of missing pattern.
Zhang, Zhongheng
2015-12-01
Functions shipped with base R can fulfill many missing data handling tasks. However, because the data volume of an electronic medical record (EMR) system is usually very large, more sophisticated methods may be helpful in data management. This article focuses on missing data handling using advanced techniques. There are three types of missing data: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). This classification depends on how the missing values are generated. Two R packages, Multivariate Imputation by Chained Equations (MICE) and Visualization and Imputation of Missing Values (VIM), provide sophisticated functions to explore the missing data pattern. In particular, the VIM package is especially helpful for visual inspection of missing data. Finally, correlation analysis provides information on the dependence of missing data on other variables. Such information is useful in subsequent imputations.
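A Python/pandas analogue of the missing-pattern tabulation these R packages provide (in the spirit of mice::md.pattern; function name and data are ours):

```python
import numpy as np
import pandas as pd

def missing_pattern(df):
    """Tabulate missing-data patterns across records.

    Each output row is one observed pattern of missingness
    (1 = observed, 0 = missing) with a count of how many records
    share that pattern.
    """
    pattern = df.notna().astype(int)
    counts = (pattern.groupby(list(df.columns)).size()
                     .rename("n_records").reset_index())
    return counts.sort_values("n_records", ascending=False,
                              ignore_index=True)

df = pd.DataFrame({
    "age":  [34, 51, np.nan, 47, np.nan, 29],
    "bp":   [120, np.nan, 110, 135, np.nan, 118],
    "chol": [200, 190, np.nan, np.nan, 210, 185],
})
pat = missing_pattern(df)
```

Inspecting which patterns dominate, and which variables tend to be missing together, is the first step before choosing an imputation model.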
Packaging and distributing ecological data from multisite studies
NASA Technical Reports Server (NTRS)
Olson, R. J.; Voorhees, L. D.; Field, J. M.; Gentry, M. J.
1996-01-01
Studies of global change and other regional issues depend on ecological data collected at multiple study areas or sites. An information system model is proposed for compiling diverse data from dispersed sources so that the data are consistent, complete, and readily available. The model includes investigators who collect and analyze field measurements, science teams that synthesize data, a project information system that collates data, a data archive center that distributes data to secondary users, and a master data directory that provides broader searching opportunities. Special attention to format consistency is required, such as units of measure, spatial coordinates, dates, and notation for missing values. Often data may need to be enhanced by estimating missing values, aggregating to common temporal units, or adding other related data such as climatic and soils data. Full documentation, an efficient data distribution mechanism, and an equitable way to acknowledge the original source of data are also required.
Royle, J. Andrew; Sutherland, Christopher S.; Fuller, Angela K.; Sun, Catherine C.
2015-01-01
We develop a likelihood analysis framework for fitting spatial capture-recapture (SCR) models to data collected on class structured or stratified populations. Our interest is motivated by the necessity of accommodating the problem of missing observations of individual class membership. This is particularly problematic in SCR data arising from DNA analysis of scat, hair or other material, which frequently yields individual identity but fails to identify the sex. Moreover, this can represent a large fraction of the data and, given the typically small sample sizes of many capture-recapture studies based on DNA information, utilization of the data with missing sex information is necessary. We develop the class structured likelihood for the case of missing covariate values, and then we address the scaling of the likelihood so that models with and without class structured parameters can be formally compared regardless of missing values. We apply our class structured model to black bear data collected in New York in which sex could be determined for only 62 of 169 uniquely identified individuals. The models containing sex-specificity of both the intercept of the SCR encounter probability model and the distance coefficient, and including a behavioral response are strongly favored by log-likelihood. Estimated population sex ratio is strongly influenced by sex structure in model parameters illustrating the importance of rigorous modeling of sex differences in capture-recapture models.
Hipp, John R.; Wang, Cheng; Butts, Carter T.; Jose, Rupa; Lakon, Cynthia M.
2015-01-01
Although stochastic actor based models (e.g., as implemented in the SIENA software program) are growing in popularity as a technique for estimating longitudinal network data, a relatively understudied issue is the consequence of missing network data for longitudinal analysis. We explore this issue in our research note by utilizing data from four schools in an existing dataset (the AddHealth dataset) over three time points, assessing the substantive consequences of using four different strategies for addressing missing network data. The results indicate that whereas some measures in such models are estimated relatively robustly regardless of the strategy chosen for addressing missing network data, some of the substantive conclusions will differ based on the missing data strategy chosen. These results have important implications for this burgeoning applied research area, implying that researchers should more carefully consider how they address missing data when estimating such models. PMID:25745276
Huan, Tao; Li, Liang
2015-01-20
Metabolomics requires quantitative comparison of the individual metabolites present in an entire sample set. Unfortunately, missing intensity values in one or more samples are very common. Because missing values can have a profound influence on metabolomic results, the extent of missing values found in a metabolomic data set should be treated as an important parameter for measuring the analytical performance of a technique. In this work, we report a study on the scope of missing values and a robust method of filling in the missing values in a chemical isotope labeling (CIL) LC-MS metabolomics platform. Unlike conventional LC-MS, CIL LC-MS quantifies the concentration differences of individual metabolites in two comparative samples based on the mass spectral peak intensity ratio of a peak pair from a mixture of differentially labeled samples. We show that this peak-pair feature can be explored as a unique means of extracting metabolite intensity information from raw mass spectra. In our approach, a peak-pair picking algorithm, IsoMS, is initially used to process the LC-MS data set to generate a CSV file or table that contains metabolite ID and peak ratio information (i.e., a metabolite-intensity table). A zero-fill program, freely available from MyCompoundID.org, is developed to automatically find a missing value in the CSV file, go back to the raw LC-MS data to find the peak pair, and then calculate the intensity ratio and enter the ratio value into the table. Most of the missing values are found to be low-abundance peak pairs. We demonstrate the performance of this method in analyzing an experimental and technical replicate data set of the human urine metabolome. Furthermore, we propose a standardized approach of counting missing values in a replicate data set as a way of gauging the extent of missing values in a metabolomics platform.
Finally, we illustrate that applying the zero-fill program, in conjunction with dansylation CIL LC-MS, can lead to a marked improvement in finding significant metabolites that differentiate bladder cancer patients and their controls in a metabolomics study of 109 subjects.
Arnold, Alice M.; Newman, Anne B.; Dermond, Norma; Haan, Mary; Fitzpatrick, Annette
2009-01-01
Aim To estimate an equivalent to the Modified Mini-Mental State Exam (3MSE), and to compare changes in the 3MSE with and without the estimated scores. Methods Comparability study on a subset of 405 participants, aged ≥70 years, from the Cardiovascular Health Study (CHS), a longitudinal study in 4 United States communities. The 3MSE, the Telephone Interview for Cognitive Status (TICS) and the Informant Questionnaire on Cognitive Decline in the Elderly (IQCODE) were administered within 30 days of one another. Regression models were developed to predict the 3MSE score from the TICS and/or IQCODE, and the predicted values were used to estimate missing 3MSE scores in longitudinal follow-up of 4,274 CHS participants. Results The TICS explained 67% of the variability in 3MSE scores, with a correlation of 0.82 between predicted and observed scores. The IQCODE alone was not a good estimate of 3MSE score, but improved the model fit when added to the TICS model. Using estimated 3MSE scores classified more participants with low cognition, and rates of decline were greater than when only the observed 3MSE scores were considered. Conclusions 3MSE scores can be reliably estimated from the TICS, with or without the IQCODE. Incorporating these estimates captured more cognitive decline in older adults. PMID:19407461
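The calibration idea, fitting a prediction model on people who completed both instruments and then using it to fill in missing 3MSE scores, can be sketched with a toy linear model (coefficients and data invented, not the CHS model; real 3MSE scores are bounded at 100, which the toy data ignores):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical calibration sample with both tests observed
tics = rng.uniform(20, 41, 300)                  # TICS scores
mmse3 = 2.1 * tics + 8 + rng.normal(0, 4, 300)   # paired 3MSE scores

# Fit the prediction model on the calibration sample ...
slope, intercept = np.polyfit(tics, mmse3, 1)

# ... then estimate 3MSE at follow-ups where only TICS was given
tics_only = np.array([25.0, 33.0, 40.0])
mmse3_hat = intercept + slope * tics_only
```

As in the study, the usefulness of such estimates hinges on how much of the 3MSE variability the predictor explains in the calibration sample.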
Purposeful Variable Selection and Stratification to Impute Missing FAST Data in Trauma Research
Fuchs, Paul A.; del Junco, Deborah J.; Fox, Erin E.; Holcomb, John B.; Rahbar, Mohammad H.; Wade, Charles A.; Alarcon, Louis H.; Brasel, Karen J.; Bulger, Eileen M.; Cohen, Mitchell J.; Myers, John G.; Muskat, Peter; Phelan, Herb A.; Schreiber, Martin A.; Cotton, Bryan A.
2013-01-01
Background The Focused Assessment with Sonography for Trauma (FAST) exam is an important variable in many retrospective trauma studies. The purpose of this study was to devise an imputation method to overcome missing data for the FAST exam. Due to variability in patients’ injuries and trauma care, these data are unlikely to be missing completely at random (MCAR), raising concern for validity when analyses exclude patients with missing values. Methods Imputation was conducted under a less restrictive, more plausible missing at random (MAR) assumption. Patients with missing FAST exams had available data on alternate, clinically relevant elements that were strongly associated with FAST results in complete cases, especially when considered jointly. Subjects with missing data (32.7%) were divided into eight mutually exclusive groups based on selected variables that both described the injury and were associated with missing FAST values. Additional variables were selected within each group to classify missing FAST values as positive or negative, and correct FAST exam classification based on these variables was determined for patients with non-missing FAST values. Results Severe head/neck injury (odds ratio, OR=2.04), severe extremity injury (OR=4.03), severe abdominal injury (OR=1.94), no injury (OR=1.94), other abdominal injury (OR=0.47), other head/neck injury (OR=0.57) and other extremity injury (OR=0.45) groups had significant ORs for missing data; the other group odds ratio was not significant (OR=0.84). All 407 missing FAST values were imputed, with 109 classified as positive. Correct classification of non-missing FAST results using the alternate variables was 87.2%. Conclusions Purposeful imputation for missing FAST exams based on interactions among selected variables assessed by simple stratification may be a useful adjunct to sensitivity analysis in the evaluation of imputation strategies under different missing data mechanisms. 
This approach has the potential for widespread application in clinical and translational research and validation is warranted. Level of Evidence Level II Prognostic or Epidemiological PMID:23778515
40 CFR 98.96 - Data reporting requirements.
Code of Federal Regulations, 2013 CFR
2013-07-01
..., Equation I-16 of this subpart, for each fluorinated heat transfer fluid used. (s) Where missing data... § 98.95(b), the number of times missing data procedures were followed in the reporting year, the method used to estimate the missing data, and the estimates of those data. (t) A brief description of each...
40 CFR 98.96 - Data reporting requirements.
Code of Federal Regulations, 2012 CFR
2012-07-01
..., Equation I-16 of this subpart, for each fluorinated heat transfer fluid used. (s) Where missing data... § 98.95(b), the number of times missing data procedures were followed in the reporting year, the method used to estimate the missing data, and the estimates of those data. (t) A brief description of each...
Pragmatic criteria of the definition of neonatal near miss: a comparative study.
Kale, Pauline Lorena; Jorge, Maria Helena Prado de Mello; Laurenti, Ruy; Fonseca, Sandra Costa; Silva, Kátia Silveira da
2017-12-04
The objective of this study was to test the validity of the pragmatic criteria of the definitions of neonatal near miss, extending them throughout the infant period, and to estimate the indicators of perinatal care in public maternity hospitals. A cohort of live births from six maternity hospitals in the municipalities of São Paulo, Niterói, and Rio de Janeiro, Brazil, was carried out in 2011. We carried out interviews and checked prenatal cards and medical records. We compared the pragmatic criteria (birth weight, gestational age, and 5' Apgar score) of the definitions of near miss of Pileggi et al., Pileggi-Castro et al., Souza et al., and Silva et al. We calculated sensitivity, specificity (gold standard: infant mortality), percentage of deaths among newborns with life-threatening conditions, and rates of near miss, mortality, and severe outcomes per 1,000 live births. A total of 7,315 newborns were analyzed (completeness of information > 99%). The sensitivity of the definition of Pileggi-Castro et al. was the highest, resulting in a higher number of cases of near miss; Souza et al. presented the lowest value, and Pileggi et al. and Silva et al. presented intermediate values. Sensitivity increases when the period goes from 0-6 to 0-27 days, and decreases when it goes to 0-364 days. Specificities were high (≥ 97%) and above sensitivities (54% to 77%). One maternity hospital in São Paulo and one in Niterói presented, respectively, the lowest and highest rates of infant mortality, near miss, and frequency of births with life-threatening conditions, regardless of the definition. The definitions of near miss based exclusively on pragmatic criteria are valid and can be used for monitoring purposes. Based on the perinatal literature, the cutoff points adopted by Silva et al. were more appropriate.
Periodic studies could apply a more complete definition, incorporating clinical, laboratory, and management criteria, including congenital anomalies predictive of infant mortality.
Tensor completion for estimating missing values in visual data.
Liu, Ji; Musialski, Przemyslaw; Wonka, Peter; Ye, Jieping
2013-01-01
In this paper, we propose an algorithm to estimate missing values in tensors of visual data. The values can be missing due to problems in the acquisition process or because the user manually identified unwanted outliers. Our algorithm works even with a small amount of samples and it can propagate structure to fill larger missing regions. Our methodology is built on recent studies about matrix completion using the matrix trace norm. The contribution of our paper is to extend the matrix case to the tensor case by proposing the first definition of the trace norm for tensors and then by building a working algorithm. First, we propose a definition for the tensor trace norm that generalizes the established definition of the matrix trace norm. Second, similarly to matrix completion, the tensor completion is formulated as a convex optimization problem. Unfortunately, the straightforward problem extension is significantly harder to solve than the matrix case because of the dependency among multiple constraints. To tackle this problem, we developed three algorithms: simple low rank tensor completion (SiLRTC), fast low rank tensor completion (FaLRTC), and high accuracy low rank tensor completion (HaLRTC). The SiLRTC algorithm is simple to implement and employs a relaxation technique to separate the dependent relationships and uses the block coordinate descent (BCD) method to achieve a globally optimal solution; the FaLRTC algorithm utilizes a smoothing scheme to transform the original nonsmooth problem into a smooth one and can be used to solve a general tensor trace norm minimization problem; the HaLRTC algorithm applies the alternating direction method of multipliers (ADMMs) to our problem. Our experiments show potential applications of our algorithms and the quantitative evaluation indicates that our methods are more accurate and robust than heuristic approaches. 
The efficiency comparison indicates that FaLRTC and HaLRTC are more efficient than SiLRTC; between FaLRTC and HaLRTC, the former is more efficient for obtaining a low-accuracy solution and the latter is preferred if a high-accuracy solution is desired.
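The tensor methods above generalize trace-norm matrix completion. As a simplified illustration of the matrix case only (hard rank truncation rather than the paper's trace-norm relaxation, and a toy rank-1 matrix rather than visual data), the core alternating idea looks like this:

```python
import numpy as np

def complete_rank1(M, observed, n_iter=100):
    """Alternate between a best rank-1 SVD fit and restoring the known entries."""
    X = np.where(observed, M, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = s[0] * np.outer(U[:, 0], Vt[0])   # best rank-1 approximation
        X[observed] = M[observed]             # keep known entries fixed
    return X

M = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # rank-1 ground truth
mask = np.ones(M.shape, dtype=bool)
mask[1, 2] = False                               # hide one entry (true value 6)
X = complete_rank1(M, mask)
print(round(float(X[1, 2]), 2))  # 6.0, the hidden entry is recovered
```

SiLRTC/FaLRTC/HaLRTC instead minimize a convex tensor trace norm across all mode unfoldings, which avoids having to fix the rank in advance.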
Missing data exploration: highlighting graphical presentation of missing pattern
2015-01-01
Functions shipped with R base can fulfill many tasks of missing data handling. However, because the data volume of electronic medical record (EMR) system is always very large, more sophisticated methods may be helpful in data management. The article focuses on missing data handling by using advanced techniques. There are three types of missing data, that is, missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). This classification system depends on how missing values are generated. Two packages, Multivariate Imputation by Chained Equations (MICE) and Visualization and Imputation of Missing Values (VIM), provide sophisticated functions to explore missing data pattern. In particular, the VIM package is especially helpful in visual inspection of missing data. Finally, correlation analysis provides information on the dependence of missing data on other variables. Such information is useful in subsequent imputations. PMID:26807411
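The missing-pattern tabulation that packages such as VIM and MICE report can be sketched in a few lines. This toy example (invented records, not the R packages themselves) counts rows by their observed/missing pattern:

```python
from collections import Counter

rows = [
    {"age": 34, "bmi": 22.1, "glucose": None},
    {"age": 51, "bmi": None, "glucose": None},
    {"age": 47, "bmi": 27.3, "glucose": 5.4},
    {"age": 29, "bmi": 24.0, "glucose": None},
]
cols = ["age", "bmi", "glucose"]

# Each pattern is a tuple of observed (1) / missing (0) flags per column.
patterns = Counter(
    tuple(0 if r[c] is None else 1 for c in cols) for r in rows
)
for pattern, count in sorted(patterns.items(), reverse=True):
    print(pattern, count)
```

Inspecting which patterns dominate, and whether missingness correlates with other variables, is what guides the choice between MCAR, MAR, and NMAR assumptions.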
Belger, Mark; Haro, Josep Maria; Reed, Catherine; Happich, Michael; Kahle-Wrobleski, Kristin; Argimon, Josep Maria; Bruno, Giuseppe; Dodel, Richard; Jones, Roy W; Vellas, Bruno; Wimo, Anders
2016-07-18
Missing data are a common problem in prospective studies with a long follow-up, and the volume, pattern and reasons for missing data may be relevant when estimating the cost of illness. We aimed to evaluate the effects of different methods for dealing with missing longitudinal cost data and for costing caregiver time on total societal costs in Alzheimer's disease (AD). GERAS is an 18-month observational study of costs associated with AD. Total societal costs included patient health and social care costs, and caregiver health and informal care costs. Missing data were classified as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). Simulation datasets were generated from baseline data with 10-40 % missing total cost data for each missing data mechanism. Datasets were also simulated to reflect the missing cost data pattern at 18 months using MAR and MNAR assumptions. Naïve and multiple imputation (MI) methods were applied to each dataset and results compared with complete GERAS 18-month cost data. Opportunity and replacement cost approaches were used for caregiver time, which was costed with and without supervision included and with time for working caregivers only being costed. Total costs were available for 99.4 % of 1497 patients at baseline. For MCAR datasets, naïve methods performed as well as MI methods. For MAR, MI methods performed better than naïve methods. All imputation approaches were poor for MNAR data. For all approaches, percentage bias increased with missing data volume. For datasets reflecting 18-month patterns, a combination of imputation methods provided more accurate cost estimates (e.g. bias: -1 % vs -6 % for single MI method), although different approaches to costing caregiver time had a greater impact on estimated costs (29-43 % increase over base case estimate). 
Methods used to impute missing cost data in AD will impact the accuracy of cost estimates, although varying approaches to costing informal caregiver time have the greatest impact on total costs. Tailoring imputation methods to the reason for missing data will further our understanding of the best analytical approach for studies involving cost outcomes.
Imputation for multisource data with comparison and assessment techniques
Casleton, Emily Michele; Osthus, David Allen; Van Buren, Kendra Lu
2017-12-27
Missing data are a prevalent issue in analyses involving data collection. The problem of missing data is exacerbated for multisource analysis, where data from multiple sensors are combined to arrive at a single conclusion. In this scenario, missing data are more likely to occur and can lead to discarding a large amount of the data collected; however, the information from observed sensors can be leveraged to estimate those values not observed. We propose two methods for imputation of multisource data, both of which take advantage of potential correlation between data from different sensors, through ridge regression and a state-space model. These methods, as well as the common median imputation, are applied to data collected from a variety of sensors monitoring an experimental facility. Performance of imputation methods is compared with the mean absolute deviation; however, rather than using this metric to solely rank the methods, we also propose an approach to identify significant differences. Imputation techniques are also assessed by their ability to produce appropriate confidence intervals, through coverage and length, around the imputed values. Finally, performance of imputed datasets is compared with a marginalized dataset through a weighted k-means clustering. In general, we found that imputation through a dynamic linear model tended to be the most accurate and to produce the most precise confidence intervals, and that imputing the missing values and down-weighting them with respect to observed values in the analysis led to the most accurate performance.
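The ridge-regression idea above can be sketched as follows. This is an illustrative toy, not the authors' procedure: the sensor with missing values is regressed on the other sensors over complete rows, and the ridge fit is used to predict the gaps.

```python
import numpy as np

def ridge_impute(X, target_col, lam=1.0):
    """Impute NaNs in one column from the other columns via ridge regression."""
    X = X.astype(float).copy()
    missing = np.isnan(X[:, target_col])
    others = [j for j in range(X.shape[1]) if j != target_col]
    A = X[~missing][:, others]
    y = X[~missing, target_col]
    # Ridge solution: (A'A + lam*I)^{-1} A'y
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    X[missing, target_col] = X[missing][:, others] @ w
    return X

# Toy data: sensor 2 is the average of sensors 0 and 1.
X = np.array([[1.0, 3.0, 2.0],
              [2.0, 4.0, 3.0],
              [3.0, 5.0, 4.0],
              [4.0, 6.0, np.nan]])
print(np.round(ridge_impute(X, 2)[3, 2], 1))  # 4.8 (true value 5; the penalty shrinks slightly)
```

The paper's state-space alternative additionally exploits temporal correlation, which this static sketch ignores.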
Order-restricted inference for means with missing values.
Wang, Heng; Zhong, Ping-Shou
2017-09-01
Missing values appear very often in many applications, but the problem of missing values has not received much attention in testing order-restricted alternatives. Under the missing at random (MAR) assumption, we impute the missing values nonparametrically using kernel regression. For data with imputation, the classical likelihood ratio test designed for testing the order-restricted means is no longer applicable since the likelihood does not exist. This article proposes a novel method for constructing test statistics for assessing means with an increasing order or a decreasing order based on jackknife empirical likelihood (JEL) ratio. It is shown that the JEL ratio statistic evaluated under the null hypothesis converges to a chi-bar-square distribution, whose weights depend on missing probabilities and nonparametric imputation. Simulation study shows that the proposed test performs well under various missing scenarios and is robust for normally and nonnormally distributed data. The proposed method is applied to an Alzheimer's disease neuroimaging initiative data set for finding a biomarker for the diagnosis of the Alzheimer's disease. © 2017, The International Biometric Society.
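The nonparametric kernel-regression imputation step can be sketched as below. A Nadaraya-Watson smoother with a Gaussian kernel is assumed here for illustration; the paper's exact estimator and bandwidth choice may differ. A missing response is replaced by a kernel-weighted average of observed responses at nearby covariate values.

```python
import math

def nw_impute(x_obs, y_obs, x_new, h=1.0):
    """Gaussian-kernel weighted mean of y_obs at the point x_new (bandwidth h)."""
    weights = [math.exp(-((x - x_new) / h) ** 2 / 2) for x in x_obs]
    return sum(w * y for w, y in zip(weights, y_obs)) / sum(weights)

x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.0, 4.0, 6.0, 8.0]      # roughly y = 2x
print(round(nw_impute(x_obs, y_obs, 2.5), 2))  # 5.0
```

After imputation, the article builds jackknife empirical likelihood ratios on the completed data, which is beyond this sketch.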
Feng, Jianyuan; Turksoy, Kamuran; Samadi, Sediqeh; Hajizadeh, Iman; Littlejohn, Elizabeth; Cinar, Ali
2017-12-01
Supervision and control systems rely on signals from sensors to receive information to monitor the operation of a system and adjust manipulated variables to achieve the control objective. However, sensor performance is often limited by working conditions, and sensors may also be subject to interference from other devices. Many different types of sensor errors such as outliers, missing values, drifts and corruption with noise may occur during process operation. A hybrid online sensor error detection and functional redundancy system is developed to detect errors in online signals, and replace erroneous or missing values detected with model-based estimates. The proposed hybrid system relies on two techniques, an outlier-robust Kalman filter (ORKF) and a locally-weighted partial least squares (LW-PLS) regression model, which leverage the advantages of automatic measurement error elimination with ORKF and data-driven prediction with LW-PLS. The system includes a nominal angle analysis (NAA) method to distinguish between signal faults and large changes in sensor values caused by real dynamic changes in process operation. The performance of the system is illustrated with clinical data from continuous glucose monitoring (CGM) sensors from people with type 1 diabetes. More than 50,000 CGM sensor errors were added to original CGM signals from 25 clinical experiments, and the performance of the error detection and functional redundancy algorithms was analyzed. The results indicate that the proposed system can successfully detect most of the erroneous signals and substitute them with reasonable estimated values computed by the functional redundancy system.
Meng, Yu; Li, Gang; Gao, Yaozong; Lin, Weili; Shen, Dinggang
2016-11-01
Longitudinal neuroimaging analysis of the dynamic brain development in infants has received increasing attention recently. Many studies expect a complete longitudinal dataset in order to accurately chart the brain developmental trajectories. However, in practice, a large portion of subjects in longitudinal studies often have missing data at certain time points, due to various reasons such as the absence of scan or poor image quality. To make better use of these incomplete longitudinal data, in this paper, we propose a novel machine learning-based method to estimate the subject-specific, vertex-wise cortical morphological attributes at the missing time points in longitudinal infant studies. Specifically, we develop a customized regression forest, named dynamically assembled regression forest (DARF), as the core regression tool. DARF ensures the spatial smoothness of the estimated maps for vertex-wise cortical morphological attributes and also greatly reduces the computational cost. By employing a pairwise estimation followed by a joint refinement, our method is able to fully exploit the available information from both subjects with complete scans and subjects with missing scans for estimation of the missing cortical attribute maps. The proposed method has been applied to estimating the dynamic cortical thickness maps at missing time points in an incomplete longitudinal infant dataset, which includes 31 healthy infant subjects, each having up to five time points in the first postnatal year. The experimental results indicate that our proposed framework can accurately estimate the subject-specific vertex-wise cortical thickness maps at missing time points, with the average error less than 0.23 mm. Hum Brain Mapp 37:4129-4147, 2016. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
Estimated Environmental Exposures for MISSE-3 and MISSE-4
NASA Technical Reports Server (NTRS)
Finckenor, Miria M.; Pippin, Gary; Kinard, William H.
2008-01-01
Describes the estimated environmental exposure for MISSE-3 and MISSE-4. These test beds, attached to the outside of the International Space Station, were planned for 3 years of exposure. This was changed to 1 year after MISSE-1 and -2 were in space for 4 years. MISSE-3 and -4 operate in a low Earth orbit space environment, which exposes them to a variety of assaults including atomic oxygen, ultraviolet radiation, particulate radiation, thermal cycling, and meteoroid/space debris impact, as well as contamination associated with proximity to an active space station. Measurements and determinations of atomic oxygen fluences, solar UV exposure levels, molecular contamination levels, and particulate radiation are included.
Jia, Erik; Chen, Tianlu
2018-01-01
Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered missing not at random (MNAR). Improper data processing procedures for missing values will cause adverse impacts on subsequent statistical analyses. However, few imputation methods have been developed and applied to the situation of MNAR in the field of metabolomics. Thus, a practical left-censored missing value imputation method is urgently needed. We developed an iterative Gibbs-sampler-based left-censored missing value imputation approach (GSimp). We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulation dataset using our imputation evaluation pipeline. The results show that GSimp outperforms other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for dealing with large scale metabolomics datasets. The R code for GSimp, evaluation pipeline, tutorial, real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp. PMID:29385130
NASA Astrophysics Data System (ADS)
Nanda, Trushnamayee; Sahoo, Bhabagrahi; Chatterjee, Chandranath
2017-06-01
The Kohonen Self-Organizing Map (KSOM) estimator is prescribed as a useful tool for infilling missing data in hydrometeorology. However, in this study, when the performance of the KSOM estimator is tested for gap-filling in streamflow, rainfall, evapotranspiration (ET), and temperature time series data collected from 30 gauging stations in India under missing data situations, we find that the KSOM modeling performance could be further improved. Consequently, this study tries to answer the research questions of whether the length of record of the historical data and its variability have any effect on the performance of the KSOM, and whether inclusion of the temporal distribution of time series data and the nature of outliers in the KSOM framework enhances its performance further. Subsequently, it is established that the KSOM framework should include the coefficient of variation of the datasets for determination of the number of map units, without considering it as a single-value function of the sample data size. This could help to upscale and generalize the applicability of KSOM for varied hydrometeorological data types.
Lum, Kirsten J; Newcomb, Craig W; Roy, Jason A; Carbonari, Dena M; Saine, M Elle; Cardillo, Serena; Bhullar, Harshvinder; Gallagher, Arlene M; Lo Re, Vincent
2017-01-01
The extent to which days' supply data are missing in pharmacoepidemiologic databases and effective methods for estimation is unknown. We determined the percentage of missing days' supply on prescription and patient levels for oral anti-diabetic drugs (OADs) and evaluated three methods for estimating days' supply within the Clinical Practice Research Datalink (CPRD) and The Health Improvement Network (THIN). We estimated the percentage of OAD prescriptions and patients with missing days' supply in each database from 2009 to 2013. Within a random sample of prescriptions with known days' supply, we measured the accuracy of three methods to estimate missing days' supply by imputing the following: (1) 28 days' supply, (2) mode number of tablets/day by drug strength and number of tablets/prescription, and (3) number of tablets/day via a machine learning algorithm. We determined incidence rates (IRs) of acute myocardial infarction (AMI) using each method to evaluate the impact on ascertainment of exposure time and outcomes. Days' supply was missing for 24 % of OAD prescriptions in CPRD and 33 % in THIN (affecting 48 and 57 % of patients, respectively). Methods 2 and 3 were very accurate in estimating days' supply for OADs prescribed at a consistent number of tablets/day. Method 3 was more accurate for OADs prescribed at varying number of tablets/day. IRs of AMI were similar across methods for most OADs. Missing days' supply is a substantial problem in both databases. Method 2 is easy and very accurate for most OADs and results in IRs comparable to those from method 3.
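Method 2 above, imputing the mode number of tablets/day within strata defined by drug strength and tablets per prescription, can be sketched on invented prescription records (illustrative only; the real analysis runs on CPRD/THIN data):

```python
from collections import Counter, defaultdict

prescriptions = [
    {"strength": "500mg", "tablets": 56, "days": 28},
    {"strength": "500mg", "tablets": 56, "days": 28},
    {"strength": "500mg", "tablets": 56, "days": 56},
    {"strength": "500mg", "tablets": 56, "days": None},   # missing days' supply
]

# Tally the known days' supply values within each stratum.
by_stratum = defaultdict(Counter)
for p in prescriptions:
    if p["days"] is not None:
        by_stratum[(p["strength"], p["tablets"])][p["days"]] += 1

# Impute each missing value with its stratum's mode.
for p in prescriptions:
    if p["days"] is None:
        counts = by_stratum[(p["strength"], p["tablets"])]
        p["days"] = counts.most_common(1)[0][0]

print(prescriptions[3]["days"])  # 28, the stratum's most common value
```

Method 3 in the abstract replaces this simple mode lookup with a machine learning prediction of tablets/day, which helps when dosing varies within a stratum.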
ERIC Educational Resources Information Center
Han, Kyung T.; Guo, Fanmin
2014-01-01
The full-information maximum likelihood (FIML) method makes it possible to estimate and analyze structural equation models (SEM) even when data are partially missing, enabling incomplete data to contribute to model estimation. The cornerstone of FIML is the missing-at-random (MAR) assumption. In (unidimensional) computerized adaptive testing…
Estimation of Item Response Theory Parameters in the Presence of Missing Data
ERIC Educational Resources Information Center
Finch, Holmes
2008-01-01
Missing data are a common problem in a variety of measurement settings, including responses to items on both cognitive and affective assessments. Researchers have shown that such missing data may create problems in the estimation of item difficulty parameters in the Item Response Theory (IRT) context, particularly if they are ignored. At the same…
Worm, Signe W; Friis-Møller, Nina; Bruyand, Mathias; D'Arminio Monforte, Antonella; Rickenbach, Martin; Reiss, Peter; El-Sadr, Wafaa; Phillips, Andrew; Lundgren, Jens; Sabin, Caroline
2010-01-28
This study describes the characteristics of the metabolic syndrome in HIV-positive patients in the Data Collection on Adverse Events of Anti-HIV Drugs study and discusses the impact of different methodological approaches on estimates of the prevalence of metabolic syndrome over time. We described the prevalence of the metabolic syndrome in patients under follow-up at the end of six calendar periods from 2000 to 2007. The definition that was used for the metabolic syndrome was modified to take account of the use of lipid-lowering and antihypertensive medication, measurement variability and missing values, and assessed the impact of these modifications on the estimated prevalence. For all definitions considered, there was an increasing prevalence of the metabolic syndrome over time, although the prevalence estimates themselves varied widely. Using our primary definition, we found an increase in prevalence from 19.4% in 2000/2001 to 41.6% in 2006/2007. Modification of the definition to incorporate antihypertensive and lipid-lowering medication had relatively little impact on the prevalence estimates, as did modification to allow for missing data. In contrast, modification to allow the metabolic syndrome to be reversible and to allow for measurement variability lowered prevalence estimates substantially. The prevalence of the metabolic syndrome in cohort studies is largely based on the use of nonstandardized measurements as they are captured in daily clinical care. As a result, bias is easily introduced, particularly when measurements are both highly variable and may be missing. We suggest that the prevalence of the metabolic syndrome in cohort studies should be based on two consecutive measurements of the laboratory components in the syndrome definition.
Energy use, entropy and extra-terrestrial civilizations
NASA Astrophysics Data System (ADS)
Hetesi, Zsolt
2010-03-01
The possible number of extra-terrestrial civilizations is estimated by the Drake equation. Many articles have pointed out that there are missing factors and over-estimations in the original equation. In this article we point out that, assuming some axioms, there might be several limits for a technical civilization. The key role of energy use and the problem of the centres and the periphery strongly influence the value of the lifetime L of a civilization. Our analysis has several implications for investigations of the growth of an alien civilization.
VARIABLE SELECTION FOR REGRESSION MODELS WITH MISSING DATA
Garcia, Ramon I.; Ibrahim, Joseph G.; Zhu, Hongtu
2009-01-01
We consider the variable selection problem for a class of statistical models with missing data, including missing covariate and/or response data. We investigate the smoothly clipped absolute deviation penalty (SCAD) and adaptive LASSO and propose a unified model selection and estimation procedure for use in the presence of missing data. We develop a computationally attractive algorithm for simultaneously optimizing the penalized likelihood function and estimating the penalty parameters. Particularly, we propose to use a model selection criterion, called the ICQ statistic, for selecting the penalty parameters. We show that the variable selection procedure based on ICQ automatically and consistently selects the important covariates and leads to efficient estimates with oracle properties. The methodology is very general and can be applied to numerous situations involving missing data, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Simulations are given to demonstrate the methodology and examine the finite sample performance of the variable selection procedures. Melanoma data from a cancer clinical trial is presented to illustrate the proposed methodology. PMID:20336190
Building a kinetic Monte Carlo model with a chosen accuracy.
Bhute, Vijesh J; Chatterjee, Abhijit
2013-06-28
The kinetic Monte Carlo (KMC) method is a popular modeling approach for reaching large materials length and time scales. The KMC dynamics is erroneous when atomic processes that are relevant to the dynamics are missing from the KMC model. Recently, we had developed for the first time an error measure for KMC in Bhute and Chatterjee [J. Chem. Phys. 138, 084103 (2013)]. The error measure, which is given in terms of the probability that a missing process will be selected in the correct dynamics, requires estimation of the missing rate. In this work, we present an improved procedure for estimating the missing rate. The estimate found using the new procedure is within an order of magnitude of the correct missing rate, unlike our previous approach where the estimate was larger by orders of magnitude. This enables one to find the error in the KMC model more accurately. In addition, we find the time for which the KMC model can be used before a maximum error in the dynamics has been reached.
Bell, Melanie L; Horton, Nicholas J; Dhillon, Haryana M; Bray, Victoria J; Vardy, Janette
2018-05-26
Patient reported outcomes (PROs) are important in oncology research; however, missing data can pose a threat to the validity of results. Psycho-oncology researchers should be aware of the statistical options for handling missing data robustly. One rarely used set of methods, which includes extensions for handling missing data, is generalized estimating equations (GEEs). Our objective was to demonstrate use of GEEs to analyze PROs with missing data in randomized trials with assessments at fixed time points. We introduce GEEs and show, with a worked example, how to use GEEs that account for missing data: inverse probability weighted GEEs and multiple imputation with GEE. We use data from an RCT evaluating a web-based brain training for cancer survivors reporting cognitive symptoms after chemotherapy treatment. The primary outcome for this demonstration is the binary outcome of cognitive impairment. Several methods are used, and results are compared. We demonstrate that estimates can vary depending on the choice of analytical approach, with odds ratios for no cognitive impairment ranging from 2.04 to 5.74. While most of these estimates were statistically significant (P < 0.05), a few were not. Researchers using PROs should use statistical methods that handle missing data in a way as to result in unbiased estimates. GEE extensions are analytic options for handling dropouts in longitudinal RCTs, particularly if the outcome is not continuous. Copyright © 2018 John Wiley & Sons, Ltd.
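The inverse-probability-weighting idea behind weighted GEE can be illustrated on a toy binary outcome with observation probabilities estimated within groups. This sketch computes a simple IPW mean, not a full GEE fit; the data and groups are invented.

```python
from collections import defaultdict

def ipw_mean(records):
    """IPW estimate of the mean outcome; records are (group, y), y is None if missing."""
    tot = defaultdict(int)
    obs = defaultdict(int)
    for g, y in records:
        tot[g] += 1
        if y is not None:
            obs[g] += 1
    p = {g: obs[g] / tot[g] for g in tot}   # estimated P(observed | group)
    # Complete cases are up-weighted by the inverse of their observation probability.
    num = sum(y / p[g] for g, y in records if y is not None)
    den = sum(1 / p[g] for g, y in records if y is not None)
    return num / den

data = [("A", 1), ("A", 1), ("A", None), ("A", 0),
        ("B", 0), ("B", None), ("B", None), ("B", 1)]
print(round(ipw_mean(data), 3))   # complete-case mean would be 0.6
```

Because group B has more dropouts, its observed outcomes get larger weights, correcting the bias a complete-case analysis would incur under MAR-given-group. In the trial setting, the weights would come from a logistic model of dropout on baseline covariates.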
De Silva, Anurika Priyanjali; Moreno-Betancur, Margarita; De Livera, Alysha Madhu; Lee, Katherine Jane; Simpson, Julie Anne
2017-07-25
Missing data is a common problem in epidemiological studies, and is particularly prominent in longitudinal data, which involve multiple waves of data collection. Traditional multiple imputation (MI) methods (fully conditional specification (FCS) and multivariate normal imputation (MVNI)) treat repeated measurements of the same time-dependent variable as just another 'distinct' variable for imputation and therefore do not make the most of the longitudinal structure of the data. Only a few studies have explored extensions to the standard approaches to account for the temporal structure of longitudinal data. One suggestion is the two-fold fully conditional specification (two-fold FCS) algorithm, which restricts the imputation of a time-dependent variable to time blocks where the imputation model includes measurements taken at the specified and adjacent times. To date, no study has investigated the performance of two-fold FCS and standard MI methods for handling missing data in a time-varying covariate with a non-linear trajectory over time, a commonly encountered scenario in epidemiological studies. We simulated 1000 datasets of 5000 individuals based on the Longitudinal Study of Australian Children (LSAC). Three missing data mechanisms, missing completely at random (MCAR) and both a weak and a strong missing at random (MAR) scenario, were used to impose missingness on body mass index (BMI)-for-age z-scores, a continuous time-varying exposure variable with a non-linear trajectory over time. We evaluated the performance of FCS, MVNI, and two-fold FCS for handling up to 50% of missing data when assessing the association between childhood obesity and sleep problems. The standard two-fold FCS produced slightly more biased and less precise estimates than FCS and MVNI. We observed slight improvements in bias and precision when using a time window width of two for the two-fold FCS algorithm compared to the standard width of one.
We recommend the use of FCS or MVNI in similar longitudinal settings; when convergence issues arise due to a large number of time points or variables with missing values, the two-fold FCS algorithm with a suitably chosen time window width is a practical alternative.
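The time-block restriction that defines two-fold FCS can be pictured as a simple predictor-selection rule; the sketch below (names and window convention are illustrative, not the algorithm's full machinery) returns the waves whose measurements enter the imputation model for a variable measured at wave t:

```python
def two_fold_fcs_predictors(times, t, window=1):
    # Two-fold FCS-style restriction: the imputation model for a variable
    # measured at wave t uses only measurements from waves within the
    # chosen time window. Width 1 is the standard form; width 2 is the
    # wider variant compared in the study.
    return [s for s in times if s != t and abs(s - t) <= window]
```

For five waves and a target at wave 3, a width of one admits waves 2 and 4 only, whereas standard FCS or MVNI would condition on all other waves at once.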
Moy, Ernest; Barrett, Marguerite; Coffey, Rosanna; Hines, Anika L; Newman-Toker, David E
2015-02-01
An estimated 1.2 million people in the US have an acute myocardial infarction (AMI) each year. An estimated 7% of AMI hospitalizations result in death. Most patients experiencing acute coronary symptoms, such as unstable angina, visit an emergency department (ED). Some patients hospitalized with AMI after a treat-and-release ED visit likely represent missed opportunities for correct diagnosis and treatment. The purpose of the present study is to estimate the frequency of missed AMI or its precursors in the ED by examining use of EDs prior to hospitalization for AMI. We estimated the rate of probable missed diagnoses in EDs in the week before hospitalization for AMI and examined associated factors. We used Healthcare Cost and Utilization Project State Inpatient Databases and State Emergency Department Databases for 2007 to evaluate missed diagnoses in 111,973 admitted patients aged 18 years and older. We identified missed diagnoses in the ED for 993 of 112,000 patients (0.9% of all AMI admissions). These patients had visited an ED with chest pain or cardiac conditions, were released, and were subsequently admitted for AMI within 7 days. Higher odds of having missed diagnoses were associated with being younger and of Black race. Hospital teaching status, availability of cardiac catheterization, high ED admission rates, high inpatient occupancy rates, and urban location were associated with lower odds of missed diagnoses. Administrative data provide robust information that may help EDs identify populations at risk of experiencing a missed diagnosis, address disparities, and reduce diagnostic errors.
Barnard, Neal D; Levin, Susan M; Yokoyama, Yoko
2015-06-01
In observational studies, vegetarians generally have lower body weights compared with omnivores. However, weight changes that occur when vegetarian diets are prescribed have not been well quantified. We estimated the effect on body weight when vegetarian diets are prescribed. We searched PubMed, EMBASE, and the Cochrane Central Register of Controlled Trials for articles through December 31, 2013. Additional articles were identified from reference lists. We included intervention trials in which participants were adults, interventions included vegetarian diets of ≥4 weeks' duration without energy intake limitations, and effects on body weight were reported. Two investigators independently extracted data using predetermined fields. Estimates of body weight change, comparing intervention groups to untreated control groups, were derived using a random effects model to estimate the weighted mean difference. To quantify effects on body weight of baseline weight, sex, age, study duration, study goals, type of diet, and study authorship, additional analyses examined within-group changes for all studies reporting variance data. We identified 15 trials (17 intervention groups), of which 4 included untreated controls. Prescription of vegetarian diets was associated with a mean weight change of -3.4 kg (95% CI -4.4 to -2.4; P<0.001) in an intention-to-treat analysis and -4.6 kg (95% CI -5.4 to -3.8; P<0.001) in a completer analysis (omitting missing post-intervention values). Greater weight loss was reported in studies with higher baseline weights, smaller proportions of female participants, older participants, or longer durations, and in studies in which weight loss was a goal. Using baseline data for missing values, I² equaled 52.3 (P=0.10), indicating moderate heterogeneity. When missing data were omitted, I² equaled 0 (P=0.65), indicating low heterogeneity. Studies are relatively few, with variable quality.
The prescription of vegetarian diets reduces mean body weight, suggesting potential value for prevention and management of weight-related conditions. Copyright © 2015 Academy of Nutrition and Dietetics. Published by Elsevier Inc. All rights reserved.
Lee, Minjung; Dignam, James J.; Han, Junhee
2014-01-01
We propose a nonparametric approach for cumulative incidence estimation when causes of failure are unknown or missing for some subjects. Under the missing at random assumption, we estimate the cumulative incidence function using multiple imputation methods. We develop asymptotic theory for the cumulative incidence estimators obtained from multiple imputation methods. We also discuss how to construct confidence intervals for the cumulative incidence function and perform a test for comparing the cumulative incidence functions in two samples with missing cause of failure. Through simulation studies, we show that the proposed methods perform well. The methods are illustrated with data from a randomized clinical trial in early stage breast cancer. PMID:25043107
Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Su, Edwin P; Grauer, Jonathan N
2018-03-01
Despite the advantages of large, national datasets, one continuing concern is missing data values. Complete case analysis, where only cases with complete data are analyzed, is commonly used rather than more statistically rigorous approaches such as multiple imputation. This study characterizes the potential selection bias introduced using complete case analysis and compares the results of common regressions using both techniques following unicompartmental knee arthroplasty. Patients undergoing unicompartmental knee arthroplasty were extracted from the 2005 to 2015 National Surgical Quality Improvement Program. As examples, the demographics of patients with and without missing preoperative albumin and hematocrit values were compared. Missing data were then treated with both complete case analysis and multiple imputation (an approach that reproduces the variation and associations that would have been present in a full dataset) and the conclusions of common regressions for adverse outcomes were compared. A total of 6117 patients were included, of which 56.7% were missing at least one value. Younger, female, and healthier patients were more likely to have missing preoperative albumin and hematocrit values. The use of complete case analysis removed 3467 patients from the study in comparison with multiple imputation which included all 6117 patients. The 2 methods of handling missing values led to differing associations of low preoperative laboratory values with commonly studied adverse outcomes. The use of complete case analysis can introduce selection bias and may lead to different conclusions in comparison with the statistically rigorous multiple imputation approach. Joint surgeons should consider the methods of handling missing values when interpreting arthroplasty research. Copyright © 2017 Elsevier Inc. All rights reserved.
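The selection bias mechanism described above can be seen in a toy simulation: if younger patients are more likely to be missing a lab value, the complete-case cohort is systematically older than the full sample. Everything below is illustrative, not NSQIP data; the probabilities and distributions are invented for the demonstration:

```python
import random

random.seed(0)

# Simulated cohort: age is the analysis variable; the albumin lab value
# is more often missing for younger patients (a MAR-style mechanism).
n = 5000
rows = []
for _ in range(n):
    age = random.gauss(65, 10)
    missing = random.random() < (0.7 if age < 60 else 0.3)
    albumin = None if missing else random.gauss(4.0, 0.4)
    rows.append((age, albumin))

full_mean_age = sum(a for a, _ in rows) / n

# Complete case analysis: drop every row with a missing lab value.
cc = [(a, alb) for a, alb in rows if alb is not None]
cc_mean_age = sum(a for a, _ in cc) / len(cc)

print(round(full_mean_age, 1), round(cc_mean_age, 1), len(cc))
```

The retained cohort is noticeably older than the full sample, which is exactly the kind of compositional shift that multiple imputation avoids by keeping all rows and modeling the missing values instead.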
Larsen, Lawrence C; Shah, Mena
2016-01-01
Although networks of environmental monitors are constantly improving through advances in technology and management, instances of missing data still occur. Many methods of imputing values for missing data are available, but they are often difficult to use or produce unsatisfactory results. I-Bot (short for "Imputation Robot") is a context-intensive approach to the imputation of missing data in data sets from networks of environmental monitors. I-Bot is easy to use and routinely produces imputed values that are highly reliable. I-Bot is described and demonstrated using more than 10 years of California data for daily maximum 8-hr ozone, 24-hr PM2.5 (particulate matter with an aerodynamic diameter <2.5 μm), mid-day average surface temperature, and mid-day average wind speed. I-Bot performance is evaluated by imputing values for observed data as if they were missing, and then comparing the imputed values with the observed values. In many cases, I-Bot is able to impute values for long periods with missing data, such as a week, a month, a year, or even longer. Qualitative visual methods and standard quantitative metrics demonstrate the effectiveness of the I-Bot methodology. Many resources are expended every year to analyze and interpret data sets from networks of environmental monitors. A large fraction of those resources is used to cope with difficulties due to the presence of missing data. The I-Bot method of imputing values for such missing data may help convert incomplete data sets into virtually complete data sets that facilitate the analysis and reliable interpretation of vital environmental data.
Predicting missing values in a home care database using an adaptive uncertainty rule method.
Konias, S; Gogou, G; Bamidis, P D; Vlahavas, I; Maglaveras, N
2005-01-01
Contemporary literature illustrates an abundance of adaptive algorithms for mining association rules. However, most literature is unable to deal with the peculiarities, such as missing values and dynamic data creation, that are frequently encountered in fields like medicine. This paper proposes an uncertainty rule method that uses an adaptive threshold for filling missing values in newly added records; the approach is particularly suitable for dynamic databases, like the ones used in home care systems. In this study, a new data mining method named FiMV (Filling Missing Values) is illustrated based on the mined uncertainty rules. Uncertainty rules have quite a similar structure to association rules and are extracted by an algorithm proposed in previous work, namely AURG (Adaptive Uncertainty Rule Generation). The main goal was to implement an appropriate method for recovering missing values in a dynamic database, where new records are continuously added, without needing to specify any thresholds beforehand. The method was applied to a home care monitoring system database. Randomly, multiple missing values for each record's attributes (rate 5-20% by 5% increments) were introduced in the initial dataset. FiMV demonstrated 100% completion rates with over 90% success in each case, while usual approaches, where all records with missing values are ignored or thresholds are required, experienced significantly reduced completion and success rates. It is concluded that the proposed method is appropriate for the data-cleaning step of the Knowledge Discovery in Databases process, a step of great significance for the output efficiency of any data mining technique, as it can improve the quality of the mined information.
Recurrent Neural Networks for Multivariate Time Series with Missing Values.
Che, Zhengping; Purushotham, Sanjay; Cho, Kyunghyun; Sontag, David; Liu, Yan
2018-04-17
Multivariate time series data in practical applications, such as health care, geoscience, and biology, are characterized by a variety of missing values. In time series prediction and other related tasks, it has been noted that missing values and their missing patterns are often correlated with the target labels, a.k.a., informative missingness. There is very limited work on exploiting the missing patterns for effective imputation and improving prediction performance. In this paper, we develop novel deep learning models, namely GRU-D, as one of the early attempts. GRU-D is based on Gated Recurrent Unit (GRU), a state-of-the-art recurrent neural network. It takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates them into a deep model architecture so that it not only captures the long-term temporal dependencies in time series, but also utilizes the missing patterns to achieve better prediction results. Experiments of time series classification tasks on real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets demonstrate that our models achieve state-of-the-art performance and provide useful insights for better understanding and utilization of missing values in time series analysis.
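The two missing-pattern representations GRU-D consumes, a binary mask and a time interval since the variable was last observed, can be sketched for a single variable as follows. This is a simplified reading of the paper's definitions, not the model itself:

```python
def masking_and_delta(series, times):
    # Build GRU-D-style inputs for one variable: a binary mask
    # (1 = observed, 0 = missing) and a time interval giving elapsed
    # time since the variable was last observed, accumulating across
    # consecutive missing steps.
    mask, delta = [], []
    last_obs_time = None
    for x, t in zip(series, times):
        m = 0 if x is None else 1
        d = 0.0 if last_obs_time is None else t - last_obs_time
        mask.append(m)
        delta.append(d)
        if m == 1:
            last_obs_time = t
    return mask, delta
```

For the series [4.0, None, None, 5.0] at times 0..3 this yields mask [1, 0, 0, 1] and intervals [0, 1, 2, 3]; the growing interval is what lets the model decay its reliance on a stale observation.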
Prediction of regulatory gene pairs using dynamic time warping and gene ontology.
Yang, Andy C; Hsu, Hui-Huang; Lu, Ming-Da; Tseng, Vincent S; Shih, Timothy K
2014-01-01
Selecting informative genes is the most important task for data analysis on microarray gene expression data. In this work, we aim at identifying regulatory gene pairs from microarray gene expression data. However, microarray data often contain multiple missing expression values. Missing value imputation is thus needed before further processing for regulatory gene pairs becomes possible. We develop a novel approach to first impute missing values in microarray time series data by combining k-Nearest Neighbour (KNN), Dynamic Time Warping (DTW) and Gene Ontology (GO). After missing values are imputed, we then perform gene regulation prediction based on our proposed DTW-GO distance measurement of gene pairs. Experimental results show that our approach is more accurate when compared with existing missing value imputation methods on real microarray data sets. Furthermore, our approach can also discover more regulatory gene pairs that are known in the literature than other methods.
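The DTW component of the approach can be sketched with the classic dynamic-programming recurrence; this is the generic textbook distance with an absolute-difference local cost, not the authors' combined DTW-GO measure:

```python
def dtw_distance(a, b):
    # Dynamic time warping distance between two sequences, with
    # |x - y| as the local cost and the usual three-way recurrence.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Unlike Euclidean distance, DTW aligns sequences of different lengths and tolerates local time shifts, which is why it suits expression profiles of genes that respond with a delay.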
Improving record linkage performance in the presence of missing linkage data.
Ong, Toan C; Mannino, Michael V; Schilling, Lisa M; Kahn, Michael G
2014-12-01
Existing record linkage methods do not handle missing linking field values in an efficient and effective manner. The objective of this study is to investigate three novel methods for improving the accuracy and efficiency of record linkage when record linkage fields have missing values. By extending the Fellegi-Sunter scoring implementations available in the open-source Fine-grained Record Linkage (FRIL) software system we developed three novel methods to solve the missing data problem in record linkage, which we refer to as: Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data from the set of quasi-identifiers and redistributes the weight from the missing attribute based on relative proportions across the remaining available linkage fields. Distance Imputation imputes the distance between the missing data fields rather than imputing the missing data value. Linkage Expansion adds previously considered non-linkage fields to the linkage field set to compensate for the missing information in a linkage field. We tested the linkage methods using simulated data sets with varying field value corruption rates. The methods developed had sensitivity ranging from .895 to .992 and positive predictive values (PPV) ranging from .865 to 1 in data sets with low corruption rates. Increased corruption rates lead to decreased sensitivity for all methods. These new record linkage algorithms show promise in terms of accuracy and efficiency and may be valuable for combining large data sets at the patient level to support biomedical and clinical research. Copyright © 2014 Elsevier Inc. All rights reserved.
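Weight Redistribution, as described, can be sketched in a few lines: the missing field is dropped and its weight is spread across the remaining linkage fields in proportion to their own weights, so the total discriminating power of the comparison is preserved. Field names and weight values below are hypothetical:

```python
def redistribute_weights(weights, missing_fields):
    # Weight Redistribution: remove fields with missing values from the
    # quasi-identifier set and rescale the remaining field weights
    # proportionally so their sum equals the original total.
    total = sum(weights.values())
    avail = {f: w for f, w in weights.items() if f not in missing_fields}
    scale = total / sum(avail.values())
    return {f: w * scale for f, w in avail.items()}
```

With weights {name: 4, dob: 3, ssn: 3} and ssn missing, name and dob are scaled by 10/7, keeping the comparison score on the same footing as a record pair with all fields present.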
NASA Astrophysics Data System (ADS)
Ghiglieri, Jacopo
We report on a search for new physics in a final state with two same-sign leptons, missing transverse energy, and significant hadronic activity at a center-of-mass energy sqrt(s) = 7 TeV. The data were collected with the CMS detector at the CERN LHC and correspond to an integrated luminosity of 0.98 inverse femtobarns. Data-driven methods are developed to estimate the dominant Standard Model backgrounds. No evidence for new physics is observed. The dominant background to the analysis comes from failures of lepton identification in Standard Model ttbar events. The ttbar production cross section in the dilepton final state is measured using 3.1 inverse picobarns of data. The cross section is measured to be 194 +/- 72 (stat) +/- 24 (syst) +/- 21 (lumi) pb. An algorithm is developed that uses tracking information to improve the reconstruction of missing transverse energy. The reconstruction of missing transverse energy is commissioned using the first collisions recorded at 0.9, 2.36, and 7 TeV. Events with abnormally large values of missing transverse energy are identified as arising from anomalous signals in the calorimeters. Tools are developed to identify and remove these anomalous signals.
40 CFR 98.445 - Procedures for estimating missing data.
Code of Federal Regulations, 2013 CFR
2013-07-01
... quantities calculations is required. Whenever the monitoring procedures cannot be followed, you must use the...) A quarterly mass or volume of contents in containers received that is missing must be estimated as...
40 CFR 98.445 - Procedures for estimating missing data.
Code of Federal Regulations, 2014 CFR
2014-07-01
... quantities calculations is required. Whenever the monitoring procedures cannot be followed, you must use the...) A quarterly mass or volume of contents in containers received that is missing must be estimated as...
40 CFR 98.445 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... quantities calculations is required. Whenever the monitoring procedures cannot be followed, you must use the...) A quarterly mass or volume of contents in containers received that is missing must be estimated as...
Missing portion sizes in FFQ--alternatives to use of standard portions.
Køster-Rasmussen, Rasmus; Siersma, Volkert; Halldorsson, Thorhallur I; de Fine Olivarius, Niels; Henriksen, Jan E; Heitmann, Berit L
2015-08-01
Standard portions or substitution of missing portion sizes with medians may generate bias when quantifying the dietary intake from FFQ. The present study compared four different methods to include portion sizes in FFQ. We evaluated three stochastic methods for imputation of portion sizes based on information about anthropometry, sex, physical activity and age. Energy intakes computed with standard portion sizes, defined as sex-specific medians (median), or with portion sizes estimated with multinomial logistic regression (MLR), 'comparable categories' (Coca) or k-nearest neighbours (KNN) were compared with a reference based on self-reported portion sizes (quantified by a photographic food atlas embedded in the FFQ). The Danish Health Examination Survey 2007-2008. The study included 3728 adults with complete portion size data. Compared with the reference, the root-mean-square errors of the mean daily total energy intake (in kJ) computed with portion sizes estimated by the four methods were (men; women): median (1118; 1061), MLR (1060; 1051), Coca (1230; 1146), KNN (1281; 1181). The equivalent biases (mean error) were (in kJ): median (579; 469), MLR (248; 178), Coca (234; 188), KNN (-340; 218). The methods MLR and Coca provided the best agreement with the reference. The stochastic methods allowed for estimation of meaningful portion sizes by conditioning on information about physiology, and they were suitable for multiple imputation. We propose to use MLR or Coca to substitute missing portion size values or when portion sizes need to be included in FFQ without portion size data.
Missing value imputation for gene expression data by tailored nearest neighbors.
Faisal, Shahla; Tutz, Gerhard
2017-04-25
High dimensional data like gene expression and RNA-sequences often contain missing values. The subsequent analysis and results based on these incomplete data can suffer strongly from the presence of these missing values. Several approaches to imputation of missing values in gene expression data have been developed, but the task is difficult due to the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of using nearest neighbors defined by a distance that includes all genes, the distance is computed for genes that are apt to contribute to the accuracy of imputed values. The method aims at avoiding the curse of dimensionality, which typically occurs if local methods such as nearest neighbors are applied in high dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques like mean imputation, KNNimpute and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values for high dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
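The weighted nearest-neighbour idea can be sketched as below. This simplified version measures distance on the observed coordinates of the target row only and weights donors by inverse distance; the paper's tailored selection of informative genes is more refined than this:

```python
def knn_impute(rows, k=2):
    # Impute each missing entry (None) from the k nearest fully observed
    # rows, weighting each donor by the inverse of its distance.
    eps = 1e-6
    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        obs = [j for j, v in enumerate(r) if v is not None]

        def dist(c):
            # Euclidean distance restricted to the observed coordinates.
            return sum((r[j] - c[j]) ** 2 for j in obs) ** 0.5

        neigh = sorted(complete, key=dist)[:k]
        filled = list(r)
        for j, v in enumerate(r):
            if v is None:
                w = [1.0 / (dist(c) + eps) for c in neigh]
                filled[j] = sum(wi * c[j] for wi, c in zip(w, neigh)) / sum(w)
        out.append(filled)
    return out
```

Restricting the distance to a well-chosen subset of coordinates is the key move against the curse of dimensionality: with thousands of genes, an all-gene distance makes every neighbour look equally far away.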
Accounting for dropout bias using mixed-effects models.
Mallinckrodt, C H; Clark, W S; David, S R
2001-01-01
Treatment effects are often evaluated by comparing change over time in outcome measures. However, valid analyses of longitudinal data can be problematic when subjects discontinue (dropout) prior to completing the study. This study assessed the merits of likelihood-based repeated measures analyses (MMRM) compared with fixed-effects analysis of variance where missing values were imputed using the last observation carried forward approach (LOCF) in accounting for dropout bias. Comparisons were made in simulated data and in data from a randomized clinical trial. Subject dropout was introduced in the simulated data to generate ignorable and nonignorable missingness. Estimates of treatment group differences in mean change from baseline to endpoint from MMRM were, on average, markedly closer to the true value than estimates from LOCF in every scenario simulated. Standard errors and confidence intervals from MMRM accurately reflected the uncertainty of the estimates, whereas standard errors and confidence intervals from LOCF underestimated uncertainty.
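For reference, the LOCF imputation that MMRM is compared against is simple to state: carry the last observed value forward over subsequent missing visits. A minimal sketch (leading missing values stay missing in this version):

```python
def locf(values):
    # Last observation carried forward: each missing entry (None) is
    # replaced by the most recent observed value. Entries before the
    # first observation remain None in this sketch.
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out
```

The mechanism makes the bias easy to see: a subject who drops out while still improving (or worsening) is frozen at their last value through endpoint, whereas MMRM uses the observed trajectory under its likelihood model.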
NASA Astrophysics Data System (ADS)
El Sharif, H.; Teegavarapu, R. S.
2012-12-01
Spatial interpolation methods used for estimation of missing precipitation data at a site seldom check for their ability to preserve site and regional statistics. Such statistics are primarily defined by spatial correlations and other site-to-site statistics in a region. Preservation of site and regional statistics represents a means of assessing the validity of missing precipitation estimates at a site. This study evaluates the efficacy of a fuzzy-logic methodology for infilling missing historical daily precipitation data in preserving site and regional statistics. Rain gauge sites in the state of Kentucky, USA, are used as a case study for evaluation of this newly proposed method in comparison to traditional data infilling techniques. Several error and performance measures will be used to evaluate the methods and trade-offs in accuracy of estimation and preservation of site and regional statistics.
Estimation of Missing Water-Level Data for the Everglades Depth Estimation Network (EDEN)
Conrads, Paul; Petkewich, Matthew D.
2009-01-01
The Everglades Depth Estimation Network (EDEN) is an integrated network of real-time water-level gaging stations, ground-elevation models, and water-surface elevation models designed to provide scientists, engineers, and water-resource managers with current (2000-2009) water-depth information for the entire freshwater portion of the greater Everglades. The U.S. Geological Survey Greater Everglades Priority Ecosystems Science provides support for EDEN and their goal of providing quality-assured monitoring data for the U.S. Army Corps of Engineers Comprehensive Everglades Restoration Plan. To increase the accuracy of the daily water-surface elevation model, water-level estimation equations were developed to fill missing data. To minimize the occurrences of no estimation of data due to missing data for an input station, a minimum of three linear regression equations were developed for each station using different input stations. Of the 726 water-level estimation equations developed to fill missing data at 239 stations, more than 60 percent of the equations have coefficients of determination greater than 0.90, and 92 percent have a coefficient of determination greater than 0.70.
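The infilling strategy above can be sketched as ordinary least squares plus a fallback chain over input stations: each equation predicts the target station's level from one input station, and the first equation whose input reported a value that day is used. Station names and coefficients below are hypothetical:

```python
def fit_line(x, y):
    # Ordinary least squares for y = a + b * x (closed form).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def estimate(level_at_inputs, equations):
    # Try each regression equation in turn and use the first one whose
    # input station reported a value; keeping several equations per
    # station is what minimizes days with no estimate at all.
    for station, (a, b) in equations:
        v = level_at_inputs.get(station)
        if v is not None:
            return a + b * v
    return None
```

If every input station is also missing, the function returns None, which corresponds to the (rare) days the network cannot fill.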
Baccini, Michela; Carreras, Giulia
2014-10-01
This paper describes the methods used to investigate variations in total alcoholic beverage consumption as related to selected control intervention policies and other socioeconomic factors (unplanned factors) within 12 European countries involved in the AMPHORA project. The analysis presented several critical issues: presence of missing values, strong correlation among the unplanned factors, and long-term waves or trends in both the time series of alcohol consumption and the time series of the main explanatory variables. These difficulties were addressed by implementing a multiple imputation procedure for filling in missing values, then specifying for each country a multiple regression model which accounted for time trend, policy measures and a limited set of unplanned factors, selected in advance on the basis of sociological and statistical considerations. This approach allowed estimating the "net" effect of the selected control policies on alcohol consumption, but not the association between each unplanned factor and the outcome.
Sensitivity Analysis of Multiple Informant Models When Data are Not Missing at Random
Blozis, Shelley A.; Ge, Xiaojia; Xu, Shu; Natsuaki, Misaki N.; Shaw, Daniel S.; Neiderhiser, Jenae; Scaramella, Laura; Leve, Leslie; Reiss, David
2014-01-01
Missing data are common in studies that rely on multiple informant data to evaluate relationships among variables for distinguishable individuals clustered within groups. Estimation of structural equation models using raw data allows for incomplete data, and so all groups may be retained even if only one member of a group contributes data. Statistical inference is based on the assumption that data are missing completely at random or missing at random. Importantly, whether or not data are missing is assumed to be independent of the missing data. A saturated correlates model that incorporates correlates of the missingness or the missing data into an analysis and multiple imputation that may also use such correlates offer advantages over the standard implementation of SEM when data are not missing at random because these approaches may result in a data analysis problem for which the missingness is ignorable. This paper considers these approaches in an analysis of family data to assess the sensitivity of parameter estimates to assumptions about missing data, a strategy that may be easily implemented using SEM software. PMID:25221420
Meaning of Missing Values in Eyewitness Recall and Accident Records
Uttl, Bob; Kisinger, Kelly
2010-01-01
Background Eyewitness recalls and accident records frequently do not mention the conditions and behaviors of interest to researchers and lead to missing values and to uncertainty about the prevalence of these conditions and behaviors surrounding accidents. Missing values may occur because eyewitnesses report the presence but not the absence of obvious clues/accident features. We examined this possibility. Methodology/Principal Findings Participants watched car accident videos and were asked to recall as much information as they could remember about each accident. The results showed that eyewitnesses were far more likely to report the presence of present obvious clues than the absence of absent obvious clues even though they were aware of their absence. Conclusions One of the principal mechanisms causing missing values may be eyewitnesses' tendency to not report the absence of obvious features. We discuss the implications of our findings for both retrospective and prospective analyses of accident records, and illustrate the consequences of adopting inappropriate assumptions about the meaning of missing values using the Avaluator Avalanche Accident Prevention Card. PMID:20824054
The Contribution of Missed Clinic Visits to Disparities in HIV Viral Load Outcomes
Westfall, Andrew O.; Gardner, Lytt I.; Giordano, Thomas P.; Wilson, Tracey E.; Drainoni, Mari-Lynn; Keruly, Jeanne C.; Rodriguez, Allan E.; Malitz, Faye; Batey, D. Scott; Mugavero, Michael J.
2015-01-01
Objectives. We explored the contribution of missed primary HIV care visits (“no-show”) to observed disparities in virological failure (VF) among Black persons and persons with injection drug use (IDU) history. Methods. We used patient-level data from 6 academic clinics, before the Centers for Disease Control and Prevention and Health Resources and Services Administration Retention in Care intervention. We employed staged multivariable logistic regression and multivariable models stratified by no-show visit frequency to evaluate the association of sociodemographic factors with VF. We used multiple imputations to assign missing viral load values. Results. Among 10 053 patients (mean age = 46 years; 35% female; 64% Black; 15% with IDU history), 31% experienced VF. Although Black patients and patients with IDU history were significantly more likely to experience VF in initial analyses, race and IDU parameter estimates were attenuated after sequential addition of no-show frequency. In stratified models, race and IDU were not statistically significantly associated with VF at any no-show level. Conclusions. Because missed clinic visits contributed to observed differences in viral load outcomes among Black and IDU patients, achieving an improved understanding of differential visit attendance is imperative to reducing disparities in HIV. PMID:26270301
Konias, Sokratis; Chouvarda, Ioanna; Vlahavas, Ioannis; Maglaveras, Nicos
2005-09-01
Current approaches for mining association rules usually assume that the mining is performed on a static database, where the problem of missing attribute values practically does not exist. However, these assumptions do not hold in some medical databases, such as a home care system. In this paper, a novel uncertainty rule algorithm is presented, namely URG-2 (Uncertainty Rule Generator), which addresses the problem of mining dynamic databases containing missing values. This algorithm requires only one pass over the initial dataset in order to generate the itemsets, while new metrics corresponding to the notions of Support and Confidence are used. URG-2 was evaluated on two medical databases, randomly introducing missing values for each record's attribute (rate: 5-20% in 5% increments) in the initial dataset. Compared with the classical approach (records with missing values are ignored), the proposed algorithm was more robust in mining rules from datasets containing missing values. In all cases, the difference in preserving the initial rules ranged between 30% and 60% in favour of URG-2. Moreover, due to its incremental nature, URG-2 saved over 90% of the time required for thorough re-mining. Thus, the proposed algorithm can offer a preferable solution for mining in dynamic relational databases.
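The notion of uncertainty-aware Support can be illustrated with a small sketch. The URG-2 metrics themselves are not given in the abstract, so the lower/upper bounds below (a record *definitely* supports an itemset when all attributes are observed and match, and *possibly* supports it when the non-matching attributes are merely missing) are a hypothetical interpretation with invented names, not the algorithm's actual definitions:

```python
def support_bounds(records, itemset):
    """Lower/upper bounds on the support of an itemset when attribute
    values may be missing (None): a record definitely contains the
    itemset when every (attribute, value) pair is observed and matches,
    and possibly contains it when unmatched pairs are merely missing."""
    definite = possible = 0
    for rec in records:
        if all(rec.get(a) == v for a, v in itemset.items()):
            definite += 1
        elif all(rec.get(a) in (v, None) for a, v in itemset.items()):
            possible += 1
    n = len(records)
    return definite / n, (definite + possible) / n

data = [
    {"bp": "high", "hr": "high"},
    {"bp": "high", "hr": None},      # could match -> raises upper bound
    {"bp": "low",  "hr": "high"},
    {"bp": "high", "hr": "high"},
]
lo, hi = support_bounds(data, {"bp": "high", "hr": "high"})
```

Confidence bounds for a rule could be derived analogously as ratios of such supports; a record with a missing value is neither counted as a match nor discarded, which is what makes the approach more robust than ignoring incomplete records.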
Selection-Fusion Approach for Classification of Datasets with Missing Values
Ghannad-Rezaie, Mostafa; Soltanian-Zadeh, Hamid; Ying, Hao; Dong, Ming
2010-01-01
This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. Then, it combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade-off, a numerical criterion is proposed for the prediction of the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (total of eight datasets). Experimental results show that classification accuracy of the proposed method is superior to those of the widely used multiple imputation method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values. PMID:20212921
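The core subset-selection idea, grouping samples by which features they have observed so that a classifier can be trained on each mostly-complete subset, can be sketched as follows. The function name and the simple exact-pattern grouping are illustrative; the paper's actual clustering-based formulation is more elaborate:

```python
from collections import defaultdict

def group_by_missing_pattern(X):
    """Group sample indices by their pattern of observed features
    (True = observed, False = missing). Each mostly-complete group
    can then train its own classifier, whose outputs are combined."""
    groups = defaultdict(list)
    for i, row in enumerate(X):
        groups[tuple(v is not None for v in row)].append(i)
    return dict(groups)

X = [[1, 2, 3], [4, None, 6], [7, 8, 9], [1, None, 2]]
groups = group_by_missing_pattern(X)
```

In practice one would merge near-identical patterns (the clustering step) to control the number of subsets, which is exactly the complexity/accuracy trade-off the paper formalizes.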
Second-Order Two-Sided Estimates in Nonlinear Elliptic Problems
NASA Astrophysics Data System (ADS)
Cianchi, Andrea; Maz'ya, Vladimir G.
2018-05-01
Best possible second-order regularity is established for solutions to p-Laplacian type equations with p ∈ (1, ∞) and a square-integrable right-hand side. Our results provide a nonlinear counterpart of the classical L²-coercivity theory for linear problems, which is missing in the existing literature. Both local and global estimates are obtained. The latter apply to solutions to either Dirichlet or Neumann boundary value problems. Minimal regularity on the boundary of the domain is required, although our conclusions are new even for smooth domains. If the domain is convex, no regularity of its boundary is needed at all.
Wu, Wei-Sheng; Jhou, Meng-Jhun
2017-01-13
Missing value imputation is important for microarray data analyses because microarray data with missing values would significantly degrade the performance of the downstream analyses. Although many microarray missing value imputation algorithms have been developed, an objective and comprehensive performance comparison framework is still lacking. To solve this problem, we previously proposed a framework which can perform a comprehensive performance comparison of different existing algorithms; the performance of a new algorithm can also be evaluated within this framework. However, constructing our framework is not an easy task for interested researchers. To save researchers' time and effort, here we present an easy-to-use web tool named MVIAeval (Missing Value Imputation Algorithm evaluator) which implements our performance comparison framework. MVIAeval provides a user-friendly interface allowing users to upload the R code of their new algorithm and select (i) the test datasets among 20 benchmark microarray (time series and non-time series) datasets, (ii) the compared algorithms among 12 existing algorithms, (iii) the performance indices among three existing ones, (iv) the comprehensive performance scores between two possible choices, and (v) the number of simulation runs. The comprehensive performance comparison results are then generated and shown as both figures and tables. MVIAeval is a useful tool for researchers to easily conduct a comprehensive and objective performance evaluation of their newly developed missing value imputation algorithm for microarray data, or for any data which can be represented in matrix form (e.g. NGS or proteomics data). Thus, MVIAeval will greatly expedite progress in the research of missing value imputation algorithms.
NASA Astrophysics Data System (ADS)
Sunilkumar, K.; Narayana Rao, T.; Saikranthi, K.; Purnachandra Rao, M.
2015-09-01
This study presents a comprehensive evaluation of five widely used multisatellite precipitation estimates (MPEs) against a 1° × 1° gridded rain gauge data set as ground truth over India. Observations spanning one decade are used to assess the performance of various MPEs (Climate Prediction Center (CPC)-South Asia data set, CPC Morphing Technique (CMORPH), Precipitation Estimation From Remotely Sensed Information Using Artificial Neural Networks, Tropical Rainfall Measuring Mission's Multisatellite Precipitation Analysis (TMPA-3B42), and Global Precipitation Climatology Project). All MPEs have high detection skills of rain with larger probability of detection (POD) and smaller "missing" values. However, the detection sensitivity differs from one product (and also one region) to another. While CMORPH has the lowest sensitivity of detecting rain, CPC shows the highest sensitivity and often overdetects rain, as evidenced by large POD and false alarm ratio and small missing values. All MPEs show higher rain sensitivity over eastern India than western India. These differential sensitivities are found to alter the biases in rain amount differently. All MPEs show similar spatial patterns of seasonal rain bias and root-mean-square error, but their spatial variability across India is complex and pronounced. The MPEs overestimate the rainfall over the dry regions (northwest and southeast India) and severely underestimate over mountainous regions (west coast and northeast India), whereas the bias is relatively small over the core monsoon zone. The higher occurrence of virga rain due to subcloud evaporation and the possible failure of gauges to capture small-scale convective events over the dry regions are the main reasons for the observed overestimation of rain by MPEs. The decomposed components of total bias show that the major part of overestimation is due to false precipitation.
The severe underestimation of rain along the west coast is attributed to the predominant occurrence of shallow rain and underestimation of moderate to heavy rain by MPEs. The decomposed components suggest that the missed precipitation and hit bias are the leading error sources for the total bias along the west coast. All evaluation metrics are found to be nearly equal in two contrasting monsoon seasons (southwest and northeast), indicating that the performance of MPEs does not change with the season, at least over southeast India. Among various MPEs, the performance of TMPA is found to be better than others, as it reproduced most of the spatial variability exhibited by the reference.
Siddique, Juned; Harel, Ofer; Crespi, Catherine M.; Hedeker, Donald
2014-01-01
The true missing data mechanism is never known in practice. We present a method for generating multiple imputations for binary variables that formally incorporates missing data mechanism uncertainty. Imputations are generated from a distribution of imputation models rather than a single model, with the distribution reflecting subjective notions of missing data mechanism uncertainty. Parameter estimates and standard errors are obtained using rules for nested multiple imputation. Using simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal smoking cessation trial where nonignorably missing data were a concern. Our method provides a simple approach for formalizing subjective notions regarding nonresponse and can be implemented using existing imputation software. PMID:24634315
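Drawing imputation models from a distribution that encodes missing-data-mechanism uncertainty can be sketched for a binary variable. In this illustrative reading (not the authors' exact specification), each imputation model shifts the log-odds of the observed success rate by a delta sampled from a user-specified prior set of candidate shifts, with delta = 0 corresponding to MAR; the function name and discrete prior are invented:

```python
import math
import random

def impute_binary_mnar(p_obs, n_missing, delta_prior, m, rng):
    """Generate m imputation sets for a binary variable. Each set is
    drawn under a different MNAR log-odds shift (delta) sampled from
    a prior list of candidate shifts; delta = 0 reduces to MAR."""
    logit = math.log(p_obs / (1 - p_obs))
    imputations = []
    for _ in range(m):
        delta = rng.choice(delta_prior)           # mechanism uncertainty
        p = 1 / (1 + math.exp(-(logit + delta)))  # shifted success prob.
        imputations.append([int(rng.random() < p) for _ in range(n_missing)])
    return imputations

# 3 imputation sets for 10 missing values, observed rate 0.3,
# prior: either MAR (0.0) or doubled odds of success (log 2)
imps = impute_binary_mnar(0.3, 10, [0.0, math.log(2)], m=3,
                          rng=random.Random(1))
```

Analyzing each set and pooling with nested multiple imputation rules then propagates both the usual imputation variance and the between-mechanism variance into the final standard errors.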
Nguyen, Cattram D; Strazdins, Lyndall; Nicholson, Jan M; Cooklin, Amanda R
2018-07-01
Understanding the long-term health effects of employment - a major social determinant - on population health is best understood via longitudinal cohort studies, yet missing data (attrition, item non-response) remain a ubiquitous challenge. Additionally, and unique to the work-family context, is the intermittent participation of parents, particularly mothers, in employment, yielding 'incomplete' data. Missing data are patterned by gender and social circumstances, and the extent and nature of resulting biases are unknown. This study investigates how estimates of the association between work-family conflict and mental health depend on the use of four different approaches to missing data treatment, each of which allows for progressive inclusion of more cases in the analyses. We used 5 waves of data from 4983 mothers participating in the Longitudinal Study of Australian Children. Only 23% had completely observed work-family conflict data across all waves. Participants with and without missing data differed such that complete cases were the most advantaged group. Comparison of the missing data treatments indicates the expected narrowing of confidence intervals as more of the sample was included. However, the impact on the estimated strength of association varied by level of exposure: at lower levels of work-family conflict, estimates strengthened (were larger); at higher levels, they weakened (were smaller). Our results suggest that inadequate handling of missing data in extant longitudinal studies of work-family conflict and mental health may have misestimated the adverse effects of work-family conflict, particularly for mothers. Considerable caution should be exercised in interpreting analyses that fail to explore and account for biases arising from missing data. Copyright © 2018. Published by Elsevier Ltd.
Nearest neighbor imputation using spatial–temporal correlations in wireless sensor networks
Li, YuanYuan; Parker, Lynne E.
2016-01-01
Missing data is common in Wireless Sensor Networks (WSNs), especially with multi-hop communications. There are many reasons for this phenomenon, such as unstable wireless communications, synchronization issues, and unreliable sensors. Unfortunately, missing data creates a number of problems for WSNs. First, since most sensor nodes in the network are battery-powered, it is too expensive to have the nodes retransmit missing data across the network. Data re-transmission may also cause time delays when detecting abnormal changes in an environment. Furthermore, localized reasoning techniques on sensor nodes (such as machine learning algorithms to classify states of the environment) are generally not robust enough to handle missing data. Since sensor data collected by a WSN is generally correlated in time and space, we illustrate how replacing missing sensor values with spatially and temporally correlated sensor values can significantly improve the network’s performance. However, our studies show that it is important to determine which nodes are spatially and temporally correlated with each other. Simple techniques based on Euclidean distance are not sufficient for complex environmental deployments. Thus, we have developed a novel Nearest Neighbor (NN) imputation method that estimates missing data in WSNs by learning spatial and temporal correlations between sensor nodes. To improve the search time, we utilize a kd-tree data structure, which is a non-parametric, data-driven binary search tree. Instead of using traditional mean and variance of each dimension for kd-tree construction, and Euclidean distance for kd-tree search, we use weighted variances and weighted Euclidean distances based on measured percentages of missing data. We have evaluated this approach through experiments on sensor data from a volcano dataset collected by a network of Crossbow motes, as well as experiments using sensor data from a highway traffic monitoring application. 
Our experimental results show that our proposed k-NN imputation method has competitive accuracy with state-of-the-art Expectation–Maximization (EM) techniques, while using much simpler computational techniques, thus making it suitable for use in resource-constrained WSNs. PMID:28435414
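The weighted Euclidean distance described above, downweighting each dimension by its measured fraction of missing data and skipping dimensions missing in either vector, might look like the following sketch (the exact weighting scheme in the paper may differ):

```python
import math

def weighted_distance(a, b, miss_frac):
    """Distance between two nodes' readings: each dimension is
    weighted by its observed fraction (1 - missing fraction), and
    dimensions missing (None) in either vector are skipped."""
    total = wsum = 0.0
    for x, y, f in zip(a, b, miss_frac):
        if x is None or y is None:
            continue
        w = 1.0 - f                 # downweight unreliable dimensions
        total += w * (x - y) ** 2
        wsum += w
    return math.sqrt(total / wsum) if wsum else float("inf")

# dim 2 is skipped (missing); dims 0 and 1 carry weights 0.9 and 0.5
d = weighted_distance([1, 2, None], [1, 4, 5], [0.1, 0.5, 0.9])
```

Normalizing by the total weight keeps distances comparable across pairs with different numbers of observed dimensions, which matters when a kd-tree search must rank candidate neighbors fairly.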
Pragmatic criteria of the definition of neonatal near miss: a comparative study
Kale, Pauline Lorena; Jorge, Maria Helena Prado de Mello; Laurenti, Ruy; Fonseca, Sandra Costa; da Silva, Kátia Silveira
2017-01-01
ABSTRACT OBJECTIVE The objective of this study was to test the validity of the pragmatic criteria of the definitions of neonatal near miss, extending them throughout the infant period, and to estimate the indicators of perinatal care in public maternity hospitals. METHODS A cohort of live births from six maternity hospitals in the municipalities of São Paulo, Niterói, and Rio de Janeiro, Brazil, was carried out in 2011. We carried out interviews and checked prenatal cards and medical records. We compared the pragmatic criteria (birth weight, gestational age, and 5’ Apgar score) of the definitions of near miss of Pileggi et al., Pileggi-Castro et al., Souza et al., and Silva et al. We calculated sensitivity, specificity (gold standard: infant mortality), percentage of deaths among newborns with life-threatening conditions, and rates of near miss, mortality, and severe outcomes per 1,000 live births. RESULTS A total of 7,315 newborns were analyzed (completeness of information > 99%). The sensitivity of the definition of Pileggi-Castro et al. was higher, resulting in a higher number of cases of near miss; Souza et al. presented lower values, and Pileggi et al. and Silva et al. presented intermediate values. There is an increase in sensitivity when the period goes from 0–6 to 0–27 days, and there is a decrease when it goes to 0–364 days. Specificities were high (≥ 97%) and above sensitivities (54% to 77%). One maternity hospital in São Paulo and one in Niterói presented, respectively, the lowest and highest rates of infant mortality, near miss, and frequency of births with life-threatening conditions, regardless of the definition. CONCLUSIONS The definitions of near miss based exclusively on pragmatic criteria are valid and can be used for monitoring purposes. Based on the perinatal literature, the cutoff points adopted by Silva et al. were more appropriate.
Periodic studies could apply a more complete definition, incorporating clinical, laboratory, and management criteria, including congenital anomalies predictive of infant mortality. PMID:29211204
Missing doses in the life span study of Japanese atomic bomb survivors.
Richardson, David B; Wing, Steve; Cole, Stephen R
2013-03-15
The Life Span Study of atomic bomb survivors is an important source of risk estimates used to inform radiation protection and compensation. Interviews with survivors in the 1950s and 1960s provided information needed to estimate radiation doses for survivors proximal to ground zero. Because of a lack of interview or the complexity of shielding, doses are missing for 7,058 of the 68,119 proximal survivors. Recent analyses excluded people with missing doses, and despite the protracted collection of interview information necessary to estimate some survivors' doses, defined start of follow-up as October 1, 1950, for everyone. We describe the prevalence of missing doses and its association with mortality, distance from hypocenter, city, age, and sex. Missing doses were more common among Nagasaki residents than among Hiroshima residents (prevalence ratio = 2.05; 95% confidence interval: 1.96, 2.14), among people who were closer to ground zero than among those who were far from it, among people who were younger at enrollment than among those who were older, and among males than among females (prevalence ratio = 1.22; 95% confidence interval: 1.17, 1.28). Missing dose was associated with all-cancer and leukemia mortality, particularly during the first years of follow-up (all-cancer rate ratio = 2.16, 95% confidence interval: 1.51, 3.08; and leukemia rate ratio = 4.28, 95% confidence interval: 1.72, 10.67). Accounting for missing dose and late entry should reduce bias in estimated dose-mortality associations.
Mukaka, Mavuto; White, Sarah A; Terlouw, Dianne J; Mwapasa, Victor; Kalilani-Phiri, Linda; Faragher, E Brian
2016-07-22
Missing outcomes can seriously impair the ability to make correct inferences from randomized controlled trials (RCTs). Complete case (CC) analysis is commonly used, but it reduces sample size and is perceived to lead to reduced statistical efficiency of estimates while increasing the potential for bias. As multiple imputation (MI) methods preserve sample size, they are generally viewed as the preferred analytical approach. We examined this assumption, comparing the performance of CC and MI methods to determine risk difference (RD) estimates in the presence of missing binary outcomes. We conducted simulation studies of 5000 simulated data sets with 50 imputations of RCTs with one primary follow-up endpoint at different underlying levels of RD (3-25 %) and missing outcomes (5-30 %). For missing at random (MAR) or missing completely at random (MCAR) outcomes, CC method estimates generally remained unbiased and achieved precision similar to or better than MI methods, and high statistical coverage. Missing not at random (MNAR) scenarios yielded invalid inferences with both methods. Effect size estimate bias was reduced in MI methods by always including group membership even if this was unrelated to missingness. Surprisingly, under MAR and MCAR conditions in the assessed scenarios, MI offered no statistical advantage over CC methods. While MI must inherently accompany CC methods for intention-to-treat analyses, these findings endorse CC methods for per protocol risk difference analyses in these conditions. These findings provide an argument for the use of the CC approach to always complement MI analyses, with the usual caveat that the validity of the mechanism for missingness be thoroughly discussed. More importantly, researchers should strive to collect as much data as possible.
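A minimal version of the simulation design, one two-arm trial with binary outcomes deleted completely at random followed by a complete-case risk difference, can be sketched as below. The sample sizes and event rates are arbitrary illustrations, and the full study's 5000-replicate, 50-imputation design is not reproduced:

```python
import random

def cc_risk_difference(n, p_ctrl, p_trt, p_miss, rng):
    """One simulated two-arm trial: binary outcomes per arm, outcomes
    deleted completely at random (MCAR), then the complete-case risk
    difference (treatment minus control) is computed."""
    def observed_arm(p):
        outcomes = [int(rng.random() < p) for _ in range(n)]
        return [y for y in outcomes if rng.random() >= p_miss]
    trt = observed_arm(p_trt)
    ctrl = observed_arm(p_ctrl)
    return sum(trt) / len(trt) - sum(ctrl) / len(ctrl)

rng = random.Random(42)
# true RD = 0.15; under MCAR the complete-case estimate is unbiased
ests = [cc_risk_difference(500, 0.30, 0.45, 0.20, rng)
        for _ in range(200)]
mean_rd = sum(ests) / len(ests)
```

Averaging over replicates, the complete-case estimate centers on the true risk difference under MCAR, consistent with the paper's finding that CC analysis can match MI in these conditions; under MNAR, neither approach would recover it.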
Falcaro, Milena; Carpenter, James R
2017-06-01
Population-based net survival by tumour stage at diagnosis is a key measure in cancer surveillance. Unfortunately, data on tumour stage are often missing for a non-negligible proportion of patients and the mechanism giving rise to the missingness is usually anything but completely at random. In this setting, restricting analysis to the subset of complete records gives typically biased results. Multiple imputation is a promising practical approach to the issues raised by the missing data, but its use in conjunction with the Pohar-Perme method for estimating net survival has not been formally evaluated. We performed a resampling study using colorectal cancer population-based registry data to evaluate the ability of multiple imputation, used along with the Pohar-Perme method, to deliver unbiased estimates of stage-specific net survival and recover missing stage information. We created 1000 independent data sets, each containing 5000 patients. Stage data were then made missing at random under two scenarios (30% and 50% missingness). Complete records analysis showed substantial bias and poor confidence interval coverage. Across both scenarios our multiple imputation strategy virtually eliminated the bias and greatly improved confidence interval coverage. In the presence of missing stage data complete records analysis often gives severely biased results. We showed that combining multiple imputation with the Pohar-Perme estimator provides a valid practical approach for the estimation of stage-specific colorectal cancer net survival. As usual, when the percentage of missing data is high the results should be interpreted cautiously and sensitivity analyses are recommended. Copyright © 2017 Elsevier Ltd. All rights reserved.
Empirical Likelihood in Nonignorable Covariate-Missing Data Problems.
Xie, Yanmei; Zhang, Biao
2017-04-20
Missing covariate data occurs often in regression analysis, which frequently arises in the health and social sciences as well as in survey sampling. We study methods for the analysis of a nonignorable covariate-missing data problem in an assumed conditional mean function when some covariates are completely observed but other covariates are missing for some subjects. We adopt the semiparametric perspective of Bartlett et al. (Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics 2014;15:719-30) on regression analyses with nonignorable missing covariates, in which they have introduced the use of two working models, the working probability model of missingness and the working conditional score model. In this paper, we study an empirical likelihood approach to nonignorable covariate-missing data problems with the objective of effectively utilizing the two working models in the analysis of covariate-missing data. We propose a unified approach to constructing a system of unbiased estimating equations, where there are more equations than unknown parameters of interest. One useful feature of these unbiased estimating equations is that they naturally incorporate the incomplete data into the data analysis, making it possible to seek efficient estimation of the parameter of interest even when the working regression function is not specified to be the optimal regression function. We apply the general methodology of empirical likelihood to optimally combine these unbiased estimating equations. We propose three maximum empirical likelihood estimators of the underlying regression parameters and compare their efficiencies with other existing competitors. We present a simulation study to compare the finite-sample performance of various methods with respect to bias, efficiency, and robustness to model misspecification. 
The proposed empirical likelihood method is also illustrated by an analysis of a data set from the US National Health and Nutrition Examination Survey (NHANES).
Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data.
Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha
2015-12-01
Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead the measurement of a variable such as blood glucose may depend on its prior values as well as that of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships as well as multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and the Fourier transform. This enables imputation of missing values even when all data at a time point is missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring) the proposed method has the highest imputation accuracy. This was true for up to half the data being missing and when consecutive missing values are a significant fraction of the overall time series length. Copyright © 2015 Elsevier Inc. All rights reserved.
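The lagged k-NN half of FLk-NN can be sketched for a single series: a missing point is filled from the k time points whose preceding lag-window best matches the window before the gap. This is a simplified, hypothetical reading of the method (the Fourier component and cross-variable lags are omitted, and the function name is invented):

```python
def lagged_knn_impute(series, lag, k):
    """Fill missing values (None) in a series: for each gap, find the
    k other time points whose preceding lag-window is closest (squared
    distance) to the window before the gap, and average their values."""
    s = list(series)
    for t in range(lag, len(s)):
        if s[t] is not None:
            continue
        window = s[t - lag:t]
        if any(v is None for v in window):
            continue                       # no usable query window
        cands = []
        for u in range(lag, len(s)):
            if u == t or s[u] is None:
                continue
            w = s[u - lag:u]
            if any(v is None for v in w):
                continue
            dist = sum((a - b) ** 2 for a, b in zip(window, w))
            cands.append((dist, s[u]))
        if cands:
            cands.sort()
            s[t] = sum(v for _, v in cands[:k]) / len(cands[:k])
    return s

# period-4 series with one gap; the best-matching window precedes a 1
r = lagged_knn_impute([1, 2, 3, 2, 1, 2, 3, 2, None, 2, 3, 2],
                      lag=2, k=1)
```

In the full method a Fourier-based estimate handles the case where an entire time point is missing across variables, so the two components cover complementary missingness types.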
Federal Register 2010, 2011, 2012, 2013, 2014
2012-10-02
... quarter substitution test. ``Collocated'' indicates that the collocated data was substituted for missing... 24-hour standard design value is greater than the level of the standard. EPA addresses missing data... substituted for the missing data. In the maximum quarter test, maximum recorded values are substituted for the...
Paul, David R; Kramer, Matthew; Stote, Kim S; Spears, Karen E; Moshfegh, Alanna J; Baer, David J; Rumpler, William V
2008-06-09
Activity monitors (AM) are small, electronic devices used to quantify the amount and intensity of physical activity (PA). Unfortunately, it has been demonstrated that the data loss that occurs when AMs are not worn by subjects (removals during sleeping and waking hours) tends to result in biased estimates of PA and total energy expenditure (TEE). No study has reported the degree of data loss in a large study of adults, and/or the degree to which the estimates of PA and TEE are affected. Also, no study in adults has proposed a methodology to minimize the effects of AM removals. Adherence estimates were generated from a pool of 524 women and men who wore AMs for 13 - 15 consecutive days. To simulate the effect of data loss due to AM removal, a reference dataset was first compiled from a subset consisting of 35 highly adherent subjects (24 HR; minimum of 20 hrs/day for seven consecutive days). AM removals were then simulated during sleep and between one and ten waking hours using this 24 HR dataset. Differences in the mean values for PA and TEE between the 24 HR reference dataset and the different simulations were compared using paired t-tests and/or coefficients of variation. The estimated average adherence of the pool of 524 subjects was 15.8 +/- 3.4 hrs/day for approximately 11.7 +/- 2.0 days. Simulated data loss due to AM removals during sleeping hours in the 24 HR database (n = 35) resulted in biased estimates of PA (p < 0.05), but not TEE. Losing as little as one hour of data from the 24 HR dataset during waking hours results in significant biases (p < 0.0001) and variability (coefficients of variation between 7 and 21%) in the estimates of PA. Inserting a constant value for sleep and imputing estimates for missing data during waking hours significantly improved the estimates of PA. Although estimated adherence was good, measurements of PA can be improved by relatively simple imputation of missing AM data.
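The simple repair the authors found effective, a constant for sleep hours plus imputed estimates for missing waking hours, can be sketched as below. Filling waking gaps with the mean of the observed waking hours is an illustrative choice, not necessarily the study's exact estimator, and the names are invented:

```python
def fill_am_counts(hourly, sleep_hours, sleep_constant):
    """Repair one day of hourly activity-monitor counts (None =
    monitor removed): sleep hours get a constant value, and missing
    waking hours get the mean of the observed waking hours."""
    waking = [c for h, c in enumerate(hourly)
              if h not in sleep_hours and c is not None]
    waking_mean = sum(waking) / len(waking)
    return [sleep_constant if h in sleep_hours
            else (waking_mean if c is None else c)
            for h, c in enumerate(hourly)]

# hours 0-1 are sleep; hour 4 is a waking-hour removal
day = fill_am_counts([None, None, 100, 200, None, 300],
                     sleep_hours={0, 1}, sleep_constant=10)
```

Summing the repaired day then yields a daily PA total that is not deflated by removal gaps, which is the bias the simulations quantified.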
Do missing data influence the accuracy of divergence-time estimation with BEAST?
Zheng, Yuchi; Wiens, John J
2015-04-01
Time-calibrated phylogenies have become essential to evolutionary biology. A recurrent and unresolved question for dating analyses is whether genes with missing data cells should be included or excluded. This issue is particularly unclear for the most widely used dating method, the uncorrelated lognormal approach implemented in BEAST. Here, we test the robustness of this method to missing data. We compare divergence-time estimates from a nearly complete dataset (20 nuclear genes for 32 species of squamate reptiles) to those from subsampled matrices, including those with 5 or 2 complete loci only and those with 5 or 8 incomplete loci added. In general, missing data had little impact on estimated dates (mean error of ∼5 Myr per node or less, given an overall age of ∼220 Myr in squamates), even when 80% of sampled genes had 75% missing data. Mean errors were somewhat higher when all genes were 75% incomplete (∼17 Myr). However, errors increased dramatically when only 2 of 9 fossil calibration points were included (∼40 Myr), regardless of missing data. Overall, missing data (and even numbers of genes sampled) may have only minor impacts on the accuracy of divergence dating with BEAST, relative to the dramatic effects of fossil calibrations. Copyright © 2015 Elsevier Inc. All rights reserved.
MacNeil Vroomen, Janet; Eekhout, Iris; Dijkgraaf, Marcel G; van Hout, Hein; de Rooij, Sophia E; Heymans, Martijn W; Bosmans, Judith E
2016-11-01
Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing cost-effectiveness data in a randomized controlled trial. Three incomplete data sets were generated from a complete reference data set with 17, 35 and 50 % missing data in effects and costs. The strategies evaluated included complete case analysis (CCA), multiple imputation with predictive mean matching (MI-PMM), MI-PMM on log-transformed costs (log MI-PMM), and a two-step MI. Mean cost and effect estimates, standard errors and incremental net benefits were compared with the results of the analyses on the complete reference data set. The CCA, MI-PMM, and the two-step MI strategy diverged from the results for the reference data set when the amount of missing data increased. In contrast, the estimates of the Log MI-PMM strategy remained stable irrespective of the amount of missing data. MI provided better estimates than CCA in all scenarios. With low amounts of missing data the MI strategies appeared equivalent but we recommend using the log MI-PMM with missing data greater than 35 %.
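The log MI-PMM idea, imputing on the log-cost scale with predictive mean matching so that back-transformed values are always observed donor costs, can be sketched in a deliberately minimal single-predictor, single-imputation form. Real MI-PMM draws multiple imputations from a Bayesian posterior and samples among several close donors; here one nearest donor is used and all names are illustrative:

```python
import math

def log_pmm_impute(cost, effect):
    """Predictive mean matching on log-transformed costs with one
    predictor: fit log(cost) ~ effect on complete cases, then give
    each missing cost the observed cost of the donor whose predicted
    value is nearest (single nearest donor, for simplicity)."""
    obs = [(e, math.log(c)) for c, e in zip(cost, effect) if c is not None]
    n = len(obs)
    mx = sum(e for e, _ in obs) / n
    my = sum(y for _, y in obs) / n
    b = (sum((e - mx) * (y - my) for e, y in obs)
         / sum((e - mx) ** 2 for e, _ in obs))
    a = my - b * mx
    out = list(cost)
    for i, (c, e) in enumerate(zip(cost, effect)):
        if c is None:
            pred = a + b * e
            donor = min(obs, key=lambda p: abs(a + b * p[0] - pred))
            out[i] = math.exp(donor[1])   # back-transform donor's log-cost
    return out

out = log_pmm_impute([100.0, 200.0, None, 400.0], [1, 2, 3, 4])
```

Because every imputed value is an observed cost, the skewed, non-negative shape of cost data is preserved, which is a plausible reason the log MI-PMM strategy stayed stable at high missingness while other strategies diverged.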
Results of Database Studies in Spine Surgery Can Be Influenced by Missing Data.
Basques, Bryce A; McLynn, Ryan P; Fice, Michael P; Samuel, Andre M; Lukasiewicz, Adam M; Bohl, Daniel D; Ahn, Junyoung; Singh, Kern; Grauer, Jonathan N
2017-12-01
National databases are increasingly being used for research in spine surgery; however, one limitation of such databases that has received sparse mention is the frequency of missing data. Studies using these databases often do not emphasize the percentage of missing data for each variable used and do not specify how patients with missing data are incorporated into analyses. This study uses the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) database to examine whether different treatments of missing data can influence the results of spine studies. (1) What is the frequency of missing data fields for demographics, medical comorbidities, preoperative laboratory values, operating room times, and length of stay recorded in ACS-NSQIP? (2) Using three common approaches to handling missing data, how frequently do those approaches agree in terms of finding particular variables to be associated with adverse events? (3) Do different approaches to handling missing data influence the outcomes and effect sizes of an analysis testing for an association with these variables with occurrence of adverse events? Patients who underwent spine surgery between 2005 and 2013 were identified from the ACS-NSQIP database. A total of 88,471 patients undergoing spine surgery were identified. The most common procedures were anterior cervical discectomy and fusion, lumbar decompression, and lumbar fusion. Demographics, comorbidities, and perioperative laboratory values were tabulated for each patient, and the percent of missing data was noted for each variable. These variables were tested for an association with "any adverse event" using three separate multivariate regressions that used the most common treatments for missing data. In the first regression, patients with any missing data were excluded. 
In the second regression, missing data were treated as a negative or "reference" value; for continuous variables, the mean of each variable's reference range was computed and imputed. In the third regression, any variables with > 10% rate of missing data were removed from the regression; among variables with ≤ 10% missing data, individual cases with missing values were excluded. The results of these regressions were compared to determine how the different treatments of missing data could affect the results of spine studies using the ACS-NSQIP database. Of the 88,471 patients, as many as 4441 (5%) had missing elements among demographic data, 69,184 (72%) among comorbidities, 70,892 (80%) among preoperative laboratory values, and 56,551 (64%) among operating room times. Considering the three different treatments of missing data, we found different risk factors for adverse events. Of 44 risk factors found to be associated with adverse events in any analysis, only 15 (34%) of these risk factors were common among the three regressions. The second treatment of missing data (assuming "normal" value) found the most risk factors (40) to be associated with any adverse event, whereas the first treatment (deleting patients with missing data) found the fewest associations at 20. Among the risk factors associated with any adverse event, the 10 with the greatest effect size (odds ratio) by each regression were ranked. Of the 15 variables in the top 10 for any regression, six of these were common among all three lists. Differing treatments of missing data can influence the results of spine studies using the ACS-NSQIP. The current study highlights the importance of considering how such missing data are handled. Until there are better guidelines on the best approaches to handle missing data, investigators should report how missing data were handled to increase the quality and transparency of orthopaedic database research. 
Readers of large database studies should note whether handling of missing data was addressed and consider potential bias with high rates or unspecified or weak methods for handling missing data.
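The three treatments of missing data compared in this study can be sketched generically. A minimal pure-Python illustration over rows of dicts with None for missing cells (the function and strategy names are ours, not part of ACS-NSQIP):

```python
def treat_missing(rows, strategy, col_thresh=0.10, reference=0.0):
    """Apply one of three common treatments for missing cells (None):
    drop incomplete rows, fill with a reference value, or drop
    columns with too much missingness and then drop incomplete rows."""
    cols = sorted({c for r in rows for c in r})
    if strategy == "complete_case":
        return [r for r in rows if all(r.get(c) is not None for c in cols)]
    if strategy == "reference_fill":
        return [{c: (r.get(c) if r.get(c) is not None else reference) for c in cols}
                for r in rows]
    if strategy == "drop_sparse_columns":
        keep = [c for c in cols
                if sum(r.get(c) is None for r in rows) / len(rows) <= col_thresh]
        trimmed = [{c: r.get(c) for c in keep} for r in rows]
        return [r for r in trimmed if all(v is not None for v in r.values())]
    raise ValueError(strategy)
```

As the study found, the three strategies can leave very different analysis sets behind, which is why reporting the chosen treatment matters.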
A Probability Based Framework for Testing the Missing Data Mechanism
ERIC Educational Resources Information Center
Lin, Johnny Cheng-Han
2013-01-01
Many methods exist for imputing missing data but fewer methods have been proposed to test the missing data mechanism. Little (1988) introduced a multivariate chi-square test for the missing completely at random data mechanism (MCAR) that compares observed means for each pattern with expectation-maximization (EM) estimated means. As an alternative,…
NASA Astrophysics Data System (ADS)
Chatzidakis, S.; Choi, C. K.; Tsoukalas, L. H.
2016-08-01
The potential for non-proliferation monitoring of spent nuclear fuel sealed in dry casks, which interact continuously with naturally generated cosmic-ray muons, is investigated. Treatments of the muon RMS scattering angle by Moliere, Rossi-Greisen, Highland, and Lynch-Dahl were analyzed and compared with simplified Monte Carlo simulations. The Lynch-Dahl expression has the lowest error and appears to be appropriate when performing conceptual calculations for high-Z, thick targets such as dry casks. The GEANT4 Monte Carlo code was used to simulate dry casks with various fuel loadings, and scattering variance estimates for each case were obtained. The scattering variance estimation was shown to be unbiased and, using Chebyshev's inequality, it was found that 10^6 muons will provide estimates of the scattering variances that are within 1% of the true value at a 99% confidence level. These estimates were used as reference values to calculate scattering distributions and evaluate the asymptotic behavior for small variations in fuel loading. It is shown that the scattering distributions of a fully loaded dry cask and of one with a fuel assembly missing initially overlap significantly, but the distance between them increases with an increasing number of muons. With 100,000 muons, one missing fuel assembly can be distinguished from a fully loaded cask with only a small overlap between the distributions. This indicates that the removal of a standard fuel assembly can be identified using muons, provided that enough muons are collected. A Bayesian algorithm was developed to classify dry casks and provide a decision rule that minimizes the risk of making an incorrect decision. The algorithm's performance was evaluated and the lower detection limit was determined.
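The Chebyshev-style sample-size argument can be reproduced in a few lines. This sketch assumes the normal-theory variance Var(s^2) = 2σ^4/(n−1); for muon scattering the exact constant depends on the distribution's kurtosis, so the count below is illustrative rather than a reproduction of the paper's 10^6 figure.

```python
import math

def chebyshev_sample_size(rel_err, alpha):
    """Smallest n guaranteeing, via Chebyshev's inequality, that the
    sample variance s^2 is within rel_err * sigma^2 of sigma^2 with
    probability at least 1 - alpha, assuming the normal-theory value
    Var(s^2) = 2 * sigma^4 / (n - 1).  The sigma^4 factors cancel:
    P(|s^2 - sigma^2| >= rel_err * sigma^2) <= 2 / ((n - 1) * rel_err**2)."""
    return math.ceil(2.0 / (rel_err ** 2 * alpha)) + 1
```

With rel_err = 0.01 and alpha = 0.01 this bound asks for roughly 2 × 10^6 samples; a heavier-tailed scattering distribution would change the numerator.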
Federal Register 2010, 2011, 2012, 2013, 2014
2013-08-14
... requirement for one or more quarters during 2010-2012 monitoring period. EPA has addressed missing data from... recorded values are substituted for the missing data, and the resulting 24-hour design value is compared to... missing data from the Greensburg monitor by performing a statistical analysis of the data, in which a...
Missing Data and Multiple Imputation in the Context of Multivariate Analysis of Variance
ERIC Educational Resources Information Center
Finch, W. Holmes
2016-01-01
Multivariate analysis of variance (MANOVA) is widely used in educational research to compare means on multiple dependent variables across groups. Researchers faced with the problem of missing data often use multiple imputation of values in place of the missing observations. This study compares the performance of 2 methods for combining p values in…
A Two-Stage Approach to Missing Data: Theory and Application to Auxiliary Variables
ERIC Educational Resources Information Center
Savalei, Victoria; Bentler, Peter M.
2009-01-01
A well-known ad-hoc approach to conducting structural equation modeling with missing data is to obtain a saturated maximum likelihood (ML) estimate of the population covariance matrix and then to use this estimate in the complete data ML fitting function to obtain parameter estimates. This 2-stage (TS) approach is appealing because it minimizes a…
ERIC Educational Resources Information Center
Estabrook, Ryne; Neale, Michael
2013-01-01
Factor score estimation is a controversial topic in psychometrics, and the estimation of factor scores from exploratory factor models has historically received a great deal of attention. However, both confirmatory factor models and the existence of missing data have generally been ignored in this debate. This article presents a simulation study…
40 CFR 98.75 - Procedures for estimating missing data.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Procedures for estimating missing data. 98.75 Section 98.75 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) AIR PROGRAMS (CONTINUED) MANDATORY GREENHOUSE GAS REPORTING Ammonia Manufacturing § 98.75 Procedures for...
40 CFR 98.45 - Procedures for estimating missing data.
Code of Federal Regulations, 2012 CFR
2012-07-01
... 40 Protection of Environment 22 2012-07-01 2012-07-01 false Procedures for estimating missing data. 98.45 Section 98.45 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) AIR PROGRAMS (CONTINUED) MANDATORY GREENHOUSE GAS REPORTING Electricity Generation § 98.45 Procedures for...
Exact Bayesian p-values for a test of independence in a 2 × 2 contingency table with missing data.
Lin, Yan; Lipsitz, Stuart R; Sinha, Debajyoti; Fitzmaurice, Garrett; Lipshultz, Steven
2017-01-01
Altham (Altham PME. Exact Bayesian analysis of a 2 × 2 contingency table, and Fisher's "exact" significance test. J R Stat Soc B 1969; 31: 261-269) showed that a one-sided p-value from Fisher's exact test of independence in a 2 × 2 contingency table is equal to the posterior probability of negative association in the 2 × 2 contingency table under a Bayesian analysis using an improper prior. We derive an extension of Fisher's exact test p-value in the presence of missing data, assuming the missing data mechanism is ignorable (i.e., missing at random or completely at random). Further, we propose Bayesian p-values for a test of independence in a 2 × 2 contingency table with missing data using alternative priors; we also present results from a simulation study exploring the Type I error rate and power of the proposed exact test p-values. An example, using data on the association between blood pressure and a cardiac enzyme, is presented to illustrate the methods.
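Altham's correspondence is stated for the complete-data one-sided Fisher p-value, which is a short hypergeometric sum. A stdlib sketch of that classical quantity (the paper's missing-data extension is not reproduced here):

```python
from math import comb

def fisher_one_sided_p(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    the hypergeometric probability, with margins fixed, of a table whose
    (1,1) cell is at least as large as observed.  math.comb returns 0
    for impossible tables, so no extra bounds checks are needed."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    return sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, min(row1, col1) + 1)) / comb(n, col1)
```

Under Altham's result with the improper prior, this p-value equals the posterior probability of negative association in the table.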
Lee, Geunho; Lee, Hyun Beom; Jung, Byung Hwa; Nam, Hojung
2017-07-01
Mass spectrometry (MS) data are used to analyze biological phenomena based on chemical species. However, these data often contain unexpected duplicate records and missing values due to technical or biological factors. These 'dirty data' problems increase the difficulty of performing MS analyses because they lead to performance degradation when statistical or machine-learning tests are applied to the data. Thus, we have developed missing values preprocessor (mvp), an open-source software tool for preprocessing data that might include duplicate records and missing values. mvp exploits the property of MS data that identical chemical species present the same or similar values for key identifiers, such as the mass-to-charge ratio and intensity signal, and forms cliques via graph theory to process dirty data. We evaluated the validity of the mvp process via quantitative and qualitative analyses and compared the results of a statistical test applied to the original and the mvp-processed data. This analysis showed that using mvp reduces problems associated with duplicate records and missing values. We also examined the effects of using unprocessed data in statistical tests and the improved statistical test results obtained with data preprocessed using mvp.
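The grouping step can be illustrated with a small stdlib sketch. Note this is a simplification of mvp's approach under our own assumptions: it merges connected components of values agreeing within a tolerance, whereas a clique additionally requires every pair in a group to agree.

```python
def merge_duplicates(mz_values, tol=0.01):
    """Group values that agree within tol (union-find connected
    components, a simplification of clique formation) and merge each
    group by averaging; returns the sorted merged values."""
    n = len(mz_values)
    parent = list(range(n))

    def find(i):
        # path-halving union-find lookup
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(mz_values[i] - mz_values[j]) <= tol:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(mz_values[i])
    return sorted(sum(g) / len(g) for g in groups.values())
```

A real pipeline would compare several identifiers (m/z, retention time, intensity) rather than a single column.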
Parameter estimation in Cox models with missing failure indicators and the OPPERA study.
Brownstein, Naomi C; Cai, Jianwen; Slade, Gary D; Bair, Eric
2015-12-30
In a prospective cohort study, examining all participants for incidence of the condition of interest may be prohibitively expensive. For example, the "gold standard" for diagnosing temporomandibular disorder (TMD) is a physical examination by a trained clinician. In large studies, examining all participants in this manner is infeasible. Instead, it is common to use questionnaires to screen for incidence of TMD and perform the "gold standard" examination only on participants who screen positively. Unfortunately, some participants may leave the study before receiving the "gold standard" examination. Within the framework of survival analysis, this results in missing failure indicators. Motivated by the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, a large cohort study of TMD, we propose a method for parameter estimation in survival models with missing failure indicators. We estimate the probability of being an incident case for those lacking a "gold standard" examination using logistic regression. These estimated probabilities are used to generate multiple imputations of case status for each missing examination that are combined with observed data in appropriate regression models. The variance introduced by the procedure is estimated using multiple imputation. The method can be used to estimate both regression coefficients in Cox proportional hazard models as well as incidence rates using Poisson regression. We simulate data with missing failure indicators and show that our method performs as well as or better than competing methods. Finally, we apply the proposed method to data from the OPPERA study. Copyright © 2015 John Wiley & Sons, Ltd.
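The imputation step described above can be sketched with stdlib Python only. Here the per-subject case probabilities are taken as given (in the paper they come from a logistic regression), case statuses are drawn repeatedly, and the incidence estimates are pooled with Rubin's rules; all names are ours.

```python
import random
import statistics

def mi_incidence(observed_cases, n_observed, miss_probs, m=20, seed=1):
    """Multiple imputation of missing binary case indicators: each
    subject lacking the gold-standard exam gets case status drawn from
    an estimated probability (in the paper, from a logistic model);
    the m incidence estimates are pooled with Rubin's rules."""
    rng = random.Random(seed)
    n_total = n_observed + len(miss_probs)
    estimates, within = [], []
    for _ in range(m):
        cases = observed_cases + sum(rng.random() < p for p in miss_probs)
        p_hat = cases / n_total
        estimates.append(p_hat)
        within.append(p_hat * (1 - p_hat) / n_total)  # binomial variance of p_hat
    q_bar = statistics.mean(estimates)
    b = statistics.variance(estimates)  # between-imputation variance
    total_var = statistics.mean(within) + (1 + 1 / m) * b
    return q_bar, total_var
```

The (1 + 1/m) inflation of the between-imputation variance is what propagates the uncertainty introduced by imputing the missing indicators.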
Neeser, Rudolph; Ackermann, Rebecca Rogers; Gain, James
2009-09-01
Various methodological approaches have been used for reconstructing fossil hominin remains in order to increase sample sizes and to better understand morphological variation. Among these, morphometric quantitative techniques for reconstruction are increasingly common. Here we compare the accuracy of three approaches--mean substitution, thin plate splines, and multiple linear regression--for estimating missing landmarks of damaged fossil specimens. Comparisons are made varying the number of missing landmarks, sample sizes, and the reference species of the population used to perform the estimation. The testing is performed on landmark data from individuals of Homo sapiens, Pan troglodytes and Gorilla gorilla, and nine hominin fossil specimens. Results suggest that when a small, same-species fossil reference sample is available to guide reconstructions, thin plate spline approaches perform best. However, if no such sample is available (or if the species of the damaged individual is uncertain), estimates of missing morphology based on a single individual (or even a small sample) of close taxonomic affinity are less accurate than those based on a large sample of individuals drawn from more distantly related extant populations using a technique (such as a regression method) able to leverage the information (e.g., variation/covariation patterning) contained in this large sample. Thin plate splines also show an unexpectedly large amount of error in estimating landmarks, especially over large areas. Recommendations are made for estimating missing landmarks under various scenarios. Copyright 2009 Wiley-Liss, Inc.
Handling Missing Data in Educational Research Using SPSS
ERIC Educational Resources Information Center
Cheema, Jehanzeb
2012-01-01
This study looked at the effect of a number of factors such as the choice of analytical method, the handling method for missing data, sample size, and proportion of missing data, in order to evaluate the effect of missing data treatment on accuracy of estimation. In order to accomplish this a methodological approach involving simulated data was…
Code of Federal Regulations, 2010 CFR
2010-07-01
...) of the definition of “missing participant annuity assumptions” in § 4050.2, the present value as of... Plan B's deemed distribution date (and using the missing participant annuity assumptions), the present value per dollar of annual benefit (payable monthly as a joint and 50 percent survivor annuity...
Jäntschi, Lorentz; Sestraş, Radu E.; Bolboacă, Sorana D.
2013-01-01
The health benefit of drinking wine, expressed as the capacity to defend the human organism from the action of free radicals and thus reduce oxidative stress, has already been demonstrated, and the results have been published in the scientific literature. The aim of our study was to develop and assess a model able to estimate the antioxidant capacity (AC) of several samples of Romanian wines and to evaluate the dependency of AC on vintage (defined as the year in which the wine was produced) and grape variety in the presence of censored data. A contingency of two grape varieties from two different vineyards in Romania and five production years, with some missing experimental data, was used to conduct the analysis. The analysis showed that the antioxidant capacity of the investigated wines is linearly dependent on vintage. Furthermore, an iterative algorithm was developed and applied to obtain the coefficients of the model and to estimate the missing experimental value. The contribution of the wine source to the antioxidant capacity proved equal to 11%. PMID:24260039
Multi-test cervical cancer diagnosis with missing data estimation
NASA Astrophysics Data System (ADS)
Xu, Tao; Huang, Xiaolei; Kim, Edward; Long, L. Rodney; Antani, Sameer
2015-03-01
Cervical cancer is one of the most common types of cancer among women worldwide. Existing screening programs for cervical cancer suffer from low sensitivity. Using images of the cervix (cervigrams) as an aid in detecting pre-cancerous changes has good potential to improve sensitivity and help reduce the number of cervical cancer cases. In this paper, we present a method that utilizes multi-modality information extracted from the multiple tests of a patient's visit to classify the visit as either low-risk or high-risk. Our algorithm integrates image features and text features to make a diagnosis. We also present two strategies to estimate missing values in the text features: Image Classifier Supervised Mean Imputation (ICSMI) and Image Classifier Supervised Linear Interpolation (ICSLI). We evaluate our method on a large medical dataset and compare it with several alternative approaches. The results show that the proposed method with the ICSLI strategy achieves the best result, with 83.03% specificity and 76.36% sensitivity. When higher specificity is desired, our method can achieve 90% specificity with 62.12% sensitivity.
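The interpolation component of a strategy like ICSLI can be sketched in a few lines; the paper's supervision by the image classifier (which guides the choice of donor cases) is omitted here, so this is only the unsupervised core.

```python
def interpolate_missing(xs):
    """Fill None entries by linear interpolation between the nearest
    observed neighbors; entries before the first or after the last
    observed value fall back to that nearest observed value.
    Assumes at least one entry is observed."""
    out = list(xs)
    obs = [i for i, v in enumerate(xs) if v is not None]
    for i, v in enumerate(xs):
        if v is None:
            left = max((j for j in obs if j < i), default=None)
            right = min((j for j in obs if j > i), default=None)
            if left is None:
                out[i] = xs[right]
            elif right is None:
                out[i] = xs[left]
            else:
                w = (i - left) / (right - left)
                out[i] = xs[left] * (1 - w) + xs[right] * w
    return out
```

Mean imputation (the ICSMI analogue) would instead replace every None with the mean of the observed entries, ignoring position.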
Estimating a Missing Examination Score
ERIC Educational Resources Information Center
Loui, Michael C.; Lin, Athena
2017-01-01
In science and engineering courses, instructors administer multiple examinations as major assessments of students' learning. When a student is unable to take an exam, the instructor might estimate the missing exam score to calculate the student's course grade. Using exam score data from multiple offerings of two large courses at a public…
Applied Missing Data Analysis. Methodology in the Social Sciences Series
ERIC Educational Resources Information Center
Enders, Craig K.
2010-01-01
Walking readers step by step through complex concepts, this book translates missing data techniques into something that applied researchers and graduate students can understand and utilize in their own research. Enders explains the rationale and procedural details for maximum likelihood estimation, Bayesian estimation, multiple imputation, and…
40 CFR 98.85 - Procedures for estimating missing data.
Code of Federal Regulations, 2010 CFR
2010-07-01
... to determine combined process and combustion CO2 emissions, the missing data procedures in § 98.35 apply. (b) For CO2 process emissions from cement manufacturing facilities calculated according to § 98... best available estimate of the monthly clinker production based on information used for accounting...
NASA Astrophysics Data System (ADS)
Anwar, Faizan; Bárdossy, András; Seidel, Jochen
2017-04-01
Estimating missing values in the time series of a hydrological variable is an everyday task for a hydrologist. Existing methods such as inverse distance weighting, multivariate regression, and kriging, though simple to apply, provide no indication of the quality of the estimated value and depend mainly on the values of neighboring stations at a given step in the time series. Copulas have the advantage of representing the pure dependence structure between two or more variables (given that the relationship between them is monotonic). They obviate questions such as whether to transform the data before use or which functions to use to model the relationship between the considered variables. A copula-based approach is suggested to infill discharge, precipitation, and temperature data. As a first step, the normal copula is used; subsequently, the necessity of using non-normal/non-symmetric dependence is investigated. Discharge and temperature are treated as regular continuous variables and can be used without processing for infilling and quality checking. Due to the mixed distribution of precipitation values, precipitation has to be treated differently: a discrete probability is assigned to the zeros and the rest is treated as a continuous distribution. Building on the work of others, the normal copula is also utilized, along with infilling, to identify values in a time series that might be erroneous. This is done by treating the available value as missing, infilling it using the normal copula, and checking whether it lies within a confidence band (5 to 95% in our case) of the obtained conditional distribution. Hydrological data from two catchments, the Upper Neckar River (Germany) and the Santa River (Peru), are used to demonstrate the application for datasets of different data quality. The Python code used here is also made available on GitHub. The required input is the time series of a given variable at different stations.
Moderation analysis with missing data in the predictors.
Zhang, Qian; Wang, Lijuan
2017-12-01
The most widely used statistical model for conducting moderation analysis is the moderated multiple regression (MMR) model. In MMR modeling, missing data can pose a challenge, mainly because the interaction term is a product of two or more variables and thus is a nonlinear function of the involved variables. In this study, we consider a simple MMR model, where the effect of the focal predictor X on the outcome Y is moderated by a moderator U. The primary interest is to find ways of estimating and testing the moderation effect in the presence of missing data in X. We mainly focus on cases in which X is missing completely at random (MCAR) or missing at random (MAR). Three methods are compared: (a) normal-distribution-based maximum likelihood estimation (NML); (b) normal-distribution-based multiple imputation (NMI); and (c) Bayesian estimation (BE). Via simulations, we found that NML and NMI can lead to biased estimates of moderation effects under the MAR missingness mechanism. The BE method outperformed NMI and NML for MMR modeling with missing data in the focal predictor, missingness depending on the moderator and/or auxiliary variables, and correctly specified distributions for the focal predictor. In addition, BE methods that are more robust to mis-specification of the focal predictor's distribution are needed. An empirical example was used to illustrate the applications of the methods with a simple sensitivity analysis. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
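The nonlinearity problem the abstract describes can be shown with a toy sketch. This is our own illustration of why naive plug-in imputation of X before forming X*U is delicate; it is not one of the three estimators (NML, NMI, BE) compared in the paper.

```python
import statistics

def interaction_after_mean_imputation(x, u):
    """Mean-impute the focal predictor X, then form the product term
    X*U; because the product is a nonlinear function of X, this
    plug-in step generally distorts the interaction term's moments."""
    x_bar = statistics.mean([xi for xi in x if xi is not None])
    x_filled = [xi if xi is not None else x_bar for xi in x]
    return statistics.mean([xi * ui for xi, ui in zip(x_filled, u)])
```

If the missingness of X depends on U (the MAR case studied above), the imputed products systematically miss the X-U covariation, which is the source of the bias in the moderation effect.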
Bozio, Catherine H; Flanders, W Dana; Finelli, Lyn; Bramley, Anna M; Reed, Carrie; Gandhi, Neel R; Vidal, Jorge E; Erdman, Dean; Levine, Min Z; Lindstrom, Stephen; Ampofo, Krow; Arnold, Sandra R; Self, Wesley H; Williams, Derek J; Grijalva, Carlos G; Anderson, Evan J; McCullers, Jonathan A; Edwards, Kathryn M; Pavia, Andrew T; Wunderink, Richard G; Jain, Seema
2018-04-01
Real-time polymerase chain reaction (PCR) on respiratory specimens and serology on paired blood specimens are used to determine the etiology of respiratory illnesses in research studies. However, convalescent serology is often not collected. We used multiple imputation to assign values for missing serology results in order to estimate virus-specific prevalence among pediatric and adult community-acquired pneumonia hospitalizations, using data from an active population-based surveillance study. Presence of adenoviruses, human metapneumovirus, influenza viruses, parainfluenza virus types 1-3, and respiratory syncytial virus was defined by positive PCR on nasopharyngeal/oropharyngeal specimens or a 4-fold rise in paired serology. We performed multiple imputation by developing a multivariable regression model for each virus using data from patients with available serology results. We calculated absolute and relative differences in the proportion of each virus detected, comparing the imputed with the observed (nonimputed) results. Among 2222 children and 2259 adults, 98.8% and 99.5% had nasopharyngeal/oropharyngeal specimens and 43.2% and 37.5% had paired serum specimens, respectively. Imputed results increased viral etiology assignments by an absolute difference of 1.6%-4.4% and 0.8%-2.8% in children and adults, respectively; relative differences were 1.1-3.0 times higher. Multiple imputation can be used when serology results are missing to refine virus-specific prevalence estimates, and it will likely increase those estimates.
Should multiple imputation be the method of choice for handling missing data in randomized trials?
Sullivan, Thomas R; White, Ian R; Salter, Amy B; Ryan, Philip; Lee, Katherine J
2016-01-01
The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see multiple imputation used to handle missing data. However in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation is carried out separately by randomized group. PMID:28034175
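The recommendation to impute separately by randomized group can be sketched with a simple group-wise hot-deck draw. This stands in for the model-based imputation evaluated in the paper; the function name and the donor-sampling scheme are our assumptions.

```python
import random

def impute_by_group(groups, y, m=5, seed=0):
    """Impute missing outcomes separately within each randomized
    group by drawing from that group's observed outcomes (a hot-deck
    stand-in for model-based MI); returns m completed data sets."""
    rng = random.Random(seed)
    donors = {}
    for g, yi in zip(groups, y):
        if yi is not None:
            donors.setdefault(g, []).append(yi)
    return [[yi if yi is not None else rng.choice(donors[g])
             for g, yi in zip(groups, y)]
            for _ in range(m)]
```

Because donors are drawn only from the same arm, any group-by-covariate interaction present in the observed outcomes is preserved in the imputations, which is the point of the paper's recommendation.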
Analysis of longitudinal data from animals with missing values using SPSS.
Duricki, Denise A; Soleman, Sara; Moon, Lawrence D F
2016-06-01
Testing of therapies for disease or injury often involves the analysis of longitudinal data from animals. Modern analytical methods have advantages over conventional methods (particularly when some data are missing), yet they are not used widely by preclinical researchers. Here we provide an easy-to-use protocol for the analysis of longitudinal data from animals, and we present a click-by-click guide for performing suitable analyses using the statistical package IBM SPSS Statistics software (SPSS). We guide readers through the analysis of a real-life data set obtained when testing a therapy for brain injury (stroke) in elderly rats. If a few data points are missing, as in this example data set (for example, because of animal dropout), repeated-measures analysis of covariance may fail to detect a treatment effect. An alternative analysis method, such as the use of linear models (with various covariance structures), and analysis using restricted maximum likelihood estimation (to include all available data) can be used to better detect treatment effects. This protocol takes 2 h to carry out.
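The key advantage the protocol describes — likelihood-based analysis keeps every observed score, whereas classical repeated-measures analysis deletes any animal with a missing session — can be illustrated numerically. The toy data below (numbers and dropout pattern are assumptions, not the protocol's stroke data set) only counts the data each strategy retains; the protocol's actual analysis is a REML-fitted linear mixed model in SPSS.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy longitudinal data: 20 animals x 5 testing sessions; treated animals
# improve faster across sessions
n_animals, n_times = 20, 5
treated = np.repeat([0, 1], n_animals // 2)
scores = (rng.normal(50, 5, (n_animals, 1))
          + np.arange(n_times) * (1.0 + 2.0 * treated[:, None])
          + rng.normal(0, 2, (n_animals, n_times)))
scores[[2, 5, 11], 3:] = np.nan        # three animals drop out early

# listwise deletion (what classical repeated-measures ANOVA does):
n_complete_points = (~np.isnan(scores).any(axis=1)).sum() * n_times

# an "all available data" analysis (the mixed-model/REML route) uses
# every observed score; here we count the points retained and compute the
# final-session group difference over all animals still observed then
n_available_points = (~np.isnan(scores)).sum()
final_gap = (np.nanmean(scores[treated == 1, -1])
             - np.nanmean(scores[treated == 0, -1]))
```

With three of twenty animals missing their last two sessions, listwise deletion keeps 85 data points while the all-available-data analysis keeps 94, which is why the mixed-model route can detect effects the classical analysis misses.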
Economic values under inappropriate normal distribution assumptions.
Sadeghi-Sefidmazgi, A; Nejati-Javaremi, A; Moradi-Shahrbabak, M; Miraei-Ashtiani, S R; Amer, P R
2012-08-01
The objectives of this study were to quantify the errors in economic values (EVs) for traits affected by cost or price thresholds when skewed or kurtotic distributions of varying degree are assumed to be normal and when data with a normal distribution is subject to censoring. EVs were estimated for a continuous trait with dichotomous economic implications because of a price premium or penalty arising from a threshold ranging between -4 and 4 standard deviations from the mean. In order to evaluate the impacts of skewness, positive and negative excess kurtosis, standard skew normal, Pearson and the raised cosine distributions were used, respectively. For the various evaluable levels of skewness and kurtosis, the results showed that EVs can be underestimated or overestimated by more than 100% when price determining thresholds fall within a range from the mean that might be expected in practice. Estimates of EVs were very sensitive to censoring or missing data. In contrast to practical genetic evaluation, economic evaluation is very sensitive to lack of normality and missing data. Although in some special situations, the presence of multiple thresholds may attenuate the combined effect of errors at each threshold point, in practical situations there is a tendency for a few key thresholds to dominate the EV, and there are many situations where errors could be compounded across multiple thresholds. In the development of breeding objectives for non-normal continuous traits influenced by value thresholds, it is necessary to select a transformation that will resolve problems of non-normality or consider alternative methods that are less sensitive to non-normality.
Fatality estimator user’s guide
Huso, Manuela M.; Som, Nicholas; Ladd, Lew
2012-12-11
Only carcasses judged to have been killed after the previous search should be included in the fatality data set submitted to this estimator software. This estimator already corrects for carcasses missed in previous searches, so carcasses judged to have been missed at least once should be considered “incidental” and not included in the fatality data set used to estimate fatality. Note: When observed carcass count is <5 (including 0 for species known to be at risk, but not observed), USGS Data Series 881 (http://pubs.usgs.gov/ds/0881/) is recommended for fatality estimation.
Miao, Beibei; Dou, Chao; Jin, Xuebo
2016-01-01
The storage volume of an internet data center is a classic time series, and predicting it is of considerable business value. However, the storage-volume series from a data center is always “dirty”: it contains noise, missing data, and outliers, so its main trend must be extracted before any prediction processing. In this paper, we propose an irregular-sampling estimation method to extract the main trend of the time series, in which a Kalman filter is used to remove the “dirty” data; cubic spline interpolation and an averaging method are then used to reconstruct the main trend. The developed method is applied to the storage-volume series of an internet data center. Experimental results show that the method estimates the main trend of the storage-volume series accurately and contributes greatly to predicting future volume values. PMID:28090205
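The trend-extraction step can be sketched with a minimal local-level Kalman filter. This is an illustration of the idea, not the paper's exact model: "dirty" points (noise spikes, outliers, gaps) are flagged as NaN and the filter simply skips the measurement update there; the paper additionally reconstructs the trend with cubic-spline interpolation and an averaging method. The simulated series and all parameters are assumptions.

```python
import numpy as np

def kalman_trend(y, q=0.05, r=1.0):
    """Local-level (random-walk) Kalman filter that skips the measurement
    update wherever y is NaN, so outliers and missing data can both be
    handled by marking them as missing."""
    x, p = 0.0, 1e6                      # diffuse initial state
    trend = np.empty(len(y))
    for t, obs in enumerate(y):
        p = p + q                        # predict: random-walk state
        if not np.isnan(obs):
            k = p / (p + r)              # Kalman gain
            x = x + k * (obs - x)        # update with the observation
            p = (1.0 - k) * p
        trend[t] = x
    return trend

# toy storage-volume series: linear growth plus noise, with "dirty"
# points (an outlier day and two gaps) marked as NaN before filtering
rng = np.random.default_rng(2)
t = np.arange(200)
true = 100.0 + 0.5 * t
y = true + rng.normal(0, 1, t.size)
y[[50, 80, 81, 120]] = np.nan
trend = kalman_trend(y)
err = np.mean(np.abs(trend[20:] - true[20:]))
```

A random-walk state lags slightly behind a trending series; tuning the ratio q/r (or using a local-linear-trend state) trades smoothness against lag.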
Cadwell, Betsy L; Boyle, James P; Tierney, Edward F; Thompson, Theodore J
2007-09-01
Some states' death certificate form includes a diabetes yes/no check box that enables policy makers to investigate the change in heart disease mortality rates by diabetes status. Because the check boxes are sometimes unmarked, a method accounting for missing data is needed when estimating heart disease mortality rates by diabetes status. Using North Dakota's data (1992-2003), we generate the posterior distribution of diabetes status to estimate diabetes status among those with heart disease and an unmarked check box using Monte Carlo methods. Combining this estimate with the number of death certificates with known diabetes status provides a numerator for heart disease mortality rates. Denominators for rates were estimated from the North Dakota Behavioral Risk Factor Surveillance System. Accounting for missing data, age-adjusted heart disease mortality rates (per 1,000) among women with diabetes were 8.6 during 1992-1998 and 6.7 during 1999-2003. Among men with diabetes, rates were 13.0 during 1992-1998 and 10.0 during 1999-2003. The Bayesian approach accounted for the uncertainty due to missing diabetes status as well as the uncertainty in estimating the populations with diabetes.
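The Monte Carlo step described above — propagating a posterior for diabetes status through the unmarked check boxes into the mortality-rate numerator — can be sketched as follows. All counts are illustrative assumptions, not North Dakota's data, and the sketch assumes unmarked boxes are missing at random given heart disease.

```python
import numpy as np

rng = np.random.default_rng(3)

# illustrative counts: heart-disease deaths with the diabetes check box
# marked yes, marked no, or left unmarked
yes_marked, no_marked, unmarked = 300, 2700, 150
population_diabetes = 20_000   # denominator, e.g. from a BRFSS estimate

# posterior of Pr(diabetes | marked) under a uniform Beta(1, 1) prior
draws = 10_000
p = rng.beta(yes_marked + 1, no_marked + 1, size=draws)

# impute diabetes status for the unmarked certificates, draw by draw
extra = rng.binomial(unmarked, p)
numerator = yes_marked + extra
rate = 1000.0 * numerator / population_diabetes   # deaths per 1,000

rate_mean = rate.mean()
interval = np.percentile(rate, [2.5, 97.5])       # credible interval
```

The spread of `rate` reflects both the binomial uncertainty in the imputed counts and the posterior uncertainty in `p`, which is the "uncertainty due to missing diabetes status" the abstract refers to.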
Smith, Justin D.; Borckardt, Jeffrey J.; Nash, Michael R.
2013-01-01
The case-based time-series design is a viable methodology for treatment outcome research. However, the literature has not fully addressed the problem of missing observations with such autocorrelated data streams. Mainly, to what extent do missing observations compromise inference when observations are not independent? Do the available missing data replacement procedures preserve inferential integrity? Does the extent of autocorrelation matter? We use Monte Carlo simulation modeling of a single-subject intervention study to address these questions. We find power sensitivity to be within acceptable limits across four proportions of missing observations (10%, 20%, 30%, and 40%) when missing data are replaced using the Expectation-Maximization Algorithm, more commonly known as the EM Procedure (Dempster, Laird, & Rubin, 1977).This applies to data streams with lag-1 autocorrelation estimates under 0.80. As autocorrelation estimates approach 0.80, the replacement procedure yields an unacceptable power profile. The implications of these findings and directions for future research are discussed. PMID:22697454
Causal inference with missing exposure information: Methods and applications to an obstetric study.
Zhang, Zhiwei; Liu, Wei; Zhang, Bo; Tang, Li; Zhang, Jun
2016-10-01
Causal inference in observational studies is frequently challenged by the occurrence of missing data, in addition to confounding. Motivated by the Consortium on Safe Labor, a large observational study of obstetric labor practice and birth outcomes, this article focuses on the problem of missing exposure information in a causal analysis of observational data. This problem can be approached from different angles (i.e. missing covariates and causal inference), and useful methods can be obtained by drawing upon the available techniques and insights in both areas. In this article, we describe and compare a collection of methods based on different modeling assumptions, under standard assumptions for missing data (i.e. missing-at-random and positivity) and for causal inference with complete data (i.e. no unmeasured confounding and another positivity assumption). These methods involve three models: one for treatment assignment, one for the dependence of outcome on treatment and covariates, and one for the missing data mechanism. In general, consistent estimation of causal quantities requires correct specification of at least two of the three models, although there may be some flexibility as to which two models need to be correct. Such flexibility is afforded by doubly robust estimators adapted from the missing covariates literature and the literature on causal inference with complete data, and by a newly developed triply robust estimator that is consistent if any two of the three models are correct. The methods are applied to the Consortium on Safe Labor data and compared in a simulation study mimicking the Consortium on Safe Labor. © The Author(s) 2013.
Ju, Xiangqun; Jamieson, Lisa M; Mejia, Gloria C
2016-12-01
To estimate the effect of mothers' education on Indigenous Australian children's dental caries experience while controlling for the mediating effect of children's sweet food intake. The Longitudinal Study of Indigenous Children is a study of two representative cohorts of Indigenous Australian children, aged from 6 months to 2 years (baby cohort) and from 3.5 to 5 years (child cohort) at baseline. The children's primary caregiver undertook a face-to-face interview in 2008 and repeated annually for the next 4 years. Data included household demographics, child health (nutrition information and dental health), maternal conditions and highest qualification levels. Mother's educational level was classified into four categories: 0-9 years, 10 years, 11-12 years and >12 years. Children's mean sweet food intake was categorized as <20%, 20-30%, and >30%. After multiple imputation of missing values, a marginal structural model with stabilized inverse probability weights was used to estimate the direct effect of mothers' education level on children's dental decay experience. From 2008 to 2012, complete data on 1720 mother-child dyads were available. Dental caries experience for children was 42.3% over the 5-year period. The controlled direct effect estimates of mother's education on child dental caries were 1.21 (95% CI: 1.01-1.45), 1.03 (95% CI: 0.91-1.18) and 1.07 (95% CI: 0.93-1.22); after multiple imputation of missing values, the effects were 1.21 (95% CI: 1.05-1.39), 1.06 (95% CI: 0.94-1.19) and 1.06 (95% CI: 0.95-1.19), comparing '0-9', '10' and '11-12' years to > 12 years of education. Mothers' education level had a direct effect on children's dental decay experience that was not mediated by sweet food intake and other risk factors when estimated using a marginal structural model. © 2016 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Dental health state utility values associated with tooth loss in two contrasting cultures.
Nassani, M Z; Locker, D; Elmesallati, A A; Devlin, H; Mohammadi, T M; Hajizamani, A; Kay, E J
2009-08-01
The study aimed to assess the value placed on oral health states by measuring the utility of mouths in which teeth had been lost and to explore variations in utility values within and between two contrasting cultures, UK and Iran. One hundred and fifty-eight patients, 84 from the UK and 74 from Iran, were recruited from clinics at university-based faculties of dentistry. All had experienced tooth loss and had restored or unrestored dental spaces. They were presented with 19 different scenarios of mouths with missing teeth. Fourteen involved the loss of one tooth and five involved shortened dental arches (SDAs) with varying numbers of missing posterior teeth. Each written description was accompanied by a verbal explanation and digital pictures of mouth models. Participants were asked to indicate on a standardized Visual Analogue Scale how they would value the health of their mouth if they had lost the tooth/teeth described and the resulting space was left unrestored. With a utility value of 0.0 representing the worst possible health state for a mouth and 1.0 representing the best, the mouth with the upper central incisor missing attracted the lowest utility value in both samples (UK = 0.16; Iran = 0.06), while the one with a missing upper second molar attracted the highest (UK = 0.42; Iran = 0.39). In both countries the utility value increased as the tooth in the scenario moved from the anterior towards the posterior aspect of the mouth. There were significant differences in utility values between the UK and Iranian samples for four scenarios, all involving the loss of anterior teeth. These differences remained after controlling for gender, age and the state of the dentition. With respect to the SDA scenarios, a mouth with an SDA with only the second molar teeth missing in all quadrants attracted the highest utility values, while a mouth with an extreme SDA with both molar and premolar teeth missing in all quadrants attracted the lowest utility values.
The study provided further evidence of the validity of the scaling approach to utility measurement in mouths with missing teeth. Some cross-cultural variations in values were observed but these should be viewed with due caution because the magnitude of the differences was small.
Daily values flow comparison and estimates using program HYDCOMP, version 1.0
Sanders, Curtis L.
2002-01-01
A method used by the U.S. Geological Survey for quality control in computing daily value flow records is to compare hydrographs of computed flows at a station under review to hydrographs of computed flows at a selected index station. The hydrographs are placed on top of each other (as hydrograph overlays) on a light table, compared, and missing daily flow data estimated. This method, however, is subjective and can produce inconsistent results, because hydrographers can differ when calculating acceptable limits of deviation between observed and estimated flows. Selection of appropriate index stations also is judgemental, giving no consideration to the mathematical correlation between the review station and the index station(s). To address the limitations of the hydrograph overlay method, a set of software programs, written in the SAS macro language, was developed and designated Program HYDCOMP. The program automatically selects statistically comparable index stations by correlation and regression, and performs hydrographic comparisons and estimates of missing data by regressing daily mean flows at the review station against -8 to +8 lagged flows at one or two index stations and day-of-week. Another advantage that HYDCOMP has over the graphical method is that estimated flows, the criteria for determining the quality of the data, and the selection of index stations are determined statistically, and are reproducible from one user to another. HYDCOMP will load the most-correlated index stations into another file containing the “best index stations,” but will not overwrite stations already in the file. A knowledgeable user should delete unsuitable index stations from this file based on standard error of estimate, hydrologic similarity of candidate index stations to the review station, and knowledge of the individual station characteristics. Also, the user can add index stations not selected by HYDCOMP, if desired.
Once the file of best-index stations is created, a user may do hydrographic comparison and data estimates by entering the number of the review station, selecting an index station, and specifying the periods to be used for regression and plotting. For example, the user can restrict the regression to ice-free periods of the year to exclude flows estimated during iced conditions. However, the regression could still be used to estimate flow during iced conditions. HYDCOMP produces the standard error of estimate as a measure of the central scatter of the regression and R-square (coefficient of determination) for evaluating the accuracy of the regression. Output from HYDCOMP includes plots of percent residuals against (1) time within the regression and plot periods, (2) month and day of the year for evaluating seasonal bias in the regression, and (3) the magnitude of flow. For hydrographic comparisons, it plots 2-month segments of hydrographs over the selected plot period showing the observed flows, the regressed flows, the 95 percent confidence limit flows, flow measurements, and regression limits. If the observed flows at the review station remain outside the 95 percent confidence limits for a prolonged period, there may be some error in the flows at the review station or at the index station(s). In addition, daily minimum and maximum temperatures and daily rainfall are shown on the hydrographs, if available, to help indicate whether an apparent change in flow may result from rainfall or from changes in backwater from melting ice or freezing water. HYDCOMP statistically smooths estimated flows from non-missing flows at the edges of the gaps in data into regressed flows at the center of the gaps using the Kalman smoothing algorithm. Missing flows are automatically estimated by HYDCOMP, but the user also can specify that periods of erroneous, but nonmissing flows, be estimated by the program.
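The core of HYDCOMP's gap-filling — regressing the review station's daily flows on lagged flows at an index station — can be sketched in a pared-down form. This toy uses lags -1 to +1 and one index station (HYDCOMP itself uses -8 to +8 lags, up to two index stations, and day-of-week, plus Kalman smoothing at gap edges); the flow series are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# toy daily mean flows: a review station that tracks an index station
n = 365
day = np.arange(n)
index = 100.0 + 30.0 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 3, n)
review = 0.8 * index + 10.0 + rng.normal(0, 2, n)
review[200:210] = np.nan                  # a 10-day gap to estimate

lags = (-1, 0, 1)

def design(days):
    """Regression matrix: intercept plus index flows at each lag."""
    cols = [np.ones(len(days))] + [index[days + L] for L in lags]
    return np.column_stack(cols)

obs = np.where(~np.isnan(review))[0]
obs = obs[(obs >= 1) & (obs <= n - 2)]    # keep rows where all lags exist
beta, *_ = np.linalg.lstsq(design(obs), review[obs], rcond=None)

gap_days = np.arange(200, 210)
estimated = design(gap_days) @ beta       # regressed flows for the gap
worst_error = np.max(np.abs(estimated - (0.8 * index[gap_days] + 10.0)))
```

The residual standard error of this regression plays the role HYDCOMP assigns to the standard error of estimate: it quantifies how trustworthy the filled-in daily values are.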
Modeling Achievement Trajectories when Attrition Is Informative
ERIC Educational Resources Information Center
Feldman, Betsy J.; Rabe-Hesketh, Sophia
2012-01-01
In longitudinal education studies, assuming that dropout and missing data occur completely at random is often unrealistic. When the probability of dropout depends on covariates and observed responses (called "missing at random" [MAR]), or on values of responses that are missing (called "informative" or "not missing at random" [NMAR]),…
ERIC Educational Resources Information Center
Karl, Andrew T.; Yang, Yan; Lohr, Sharon L.
2013-01-01
Value-added models have been widely used to assess the contributions of individual teachers and schools to students' academic growth based on longitudinal student achievement outcomes. There is concern, however, that ignoring the presence of missing values, which are common in longitudinal studies, can bias teachers' value-added scores.…
Peel, D; Waples, R S; Macbeth, G M; Do, C; Ovenden, J R
2013-03-01
Theoretical models are often applied to population genetic data sets without fully considering the effect of missing data. Researchers can deal with missing data by removing individuals that have failed to yield genotypes and/or by removing loci that have failed to yield allelic determinations, but despite their best efforts, most data sets still contain some missing data. As a consequence, realized sample size differs among loci, and this poses a problem for unbiased methods that must explicitly account for random sampling error. One commonly used solution for the calculation of contemporary effective population size (Ne) is to calculate the effective sample size as an unweighted mean or harmonic mean across loci. This is not ideal because it fails to account for the fact that loci with different numbers of alleles have different information content. Here we consider this problem for genetic estimators of contemporary effective population size (Ne). To evaluate bias and precision of several statistical approaches for dealing with missing data, we simulated populations with known Ne and various degrees of missing data. Across all scenarios, one method of correcting for missing data (fixed-inverse variance-weighted harmonic mean) consistently performed the best for both single-sample and two-sample (temporal) methods of estimating Ne and outperformed some methods currently in widespread use. The approach adopted here may be a starting point to adjust other population genetics methods that include per-locus sample size components. © 2012 Blackwell Publishing Ltd.
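The competing effective-sample-size summaries can be compared on a toy set of per-locus sample sizes. The numbers, and the use of allele count minus one as an information-content weight, are illustrative assumptions for the sketch; the paper's best-performing estimator (fixed-inverse-variance weighting) has its own specific weight formula.

```python
import numpy as np

# per-locus realized sample sizes after missing genotypes are dropped
S = np.array([50, 48, 45, 30, 50, 42], dtype=float)

unweighted = S.mean()
harmonic = len(S) / np.sum(1.0 / S)   # dominated by the low-coverage loci

# a crude information-content weight: with k_j alleles at locus j, use
# k_j - 1 independent allele frequencies (a stand-in for the paper's
# fixed inverse-variance weights)
k = np.array([4, 6, 2, 8, 3, 5], dtype=float)
w = k - 1
weighted_harmonic = w.sum() / np.sum(w / S)
```

The harmonic mean is always pulled below the unweighted mean by loci with heavy missingness, and weighting shifts the summary further toward the loci that carry the most allelic information.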
Li, Siying; Koch, Gary G; Preisser, John S; Lam, Diana; Sanchez-Kam, Matilde
2017-01-01
Dichotomous endpoints in clinical trials have only two possible outcomes, either directly or via categorization of an ordinal or continuous observation. It is common to have missing data for one or more visits during a multi-visit study. This paper presents a closed form method for sensitivity analysis of a randomized multi-visit clinical trial that possibly has missing not at random (MNAR) dichotomous data. Counts of missing data are redistributed to the favorable and unfavorable outcomes mathematically to address possibly informative missing data. Adjusted proportion estimates and their closed form covariance matrix estimates are provided. Treatment comparisons over time are addressed with Mantel-Haenszel adjustment for a stratification factor and/or randomization-based adjustment for baseline covariables. The application of such sensitivity analyses is illustrated with an example. An appendix outlines an extension of the methodology to ordinal endpoints.
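The redistribution idea behind this sensitivity analysis can be shown in closed form. This is a sketch of the general principle only (illustrative counts; the paper additionally supplies covariance matrices, Mantel-Haenszel stratification, and covariable adjustment): a sensitivity parameter pi says what fraction of the missing outcomes are counted as favorable.

```python
def adjusted_proportion(favorable, unfavorable, missing, pi):
    """Adjusted favorable proportion after redistributing missing counts.
    pi is the assumed probability that a missing outcome was favorable;
    pi = 0 and pi = 1 bound the MNAR sensitivity range."""
    n = favorable + unfavorable + missing
    return (favorable + pi * missing) / n

# one treatment arm at one visit: 60 favorable, 30 unfavorable, 10 missing
worst = adjusted_proportion(60, 30, 10, pi=0.0)       # all missing unfavorable
best = adjusted_proportion(60, 30, 10, pi=1.0)        # all missing favorable
mar_like = adjusted_proportion(60, 30, 10, pi=60 / 90)  # missing mirror observed
```

Here `worst` is 0.60, `best` is 0.70, and the MAR-like choice falls between them; sweeping pi over [0, 1] in each arm traces out how far the treatment comparison could move under informative missingness.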
Marginalized zero-inflated Poisson models with missing covariates.
Benecha, Habtamu K; Preisser, John S; Divaris, Kimon; Herring, Amy H; Das, Kalyan
2018-05-11
Unlike zero-inflated Poisson regression, marginalized zero-inflated Poisson (MZIP) models for counts with excess zeros provide estimates with direct interpretations for the overall effects of covariates on the marginal mean. In the presence of missing covariates, MZIP and many other count data models are ordinarily fitted using complete case analysis methods due to lack of appropriate statistical methods and software. This article presents an estimation method for MZIP models with missing covariates. The method, which is applicable to other missing data problems, is illustrated and compared with complete case analysis by using simulations and dental data on the caries preventive effects of a school-based fluoride mouthrinse program. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
The multiple imputation method: a case study involving secondary data analysis.
Walani, Salimah R; Cleland, Charles M
2015-05-01
To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
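The chained-equation idea can be illustrated with a minimal two-variable loop. This is a sketch of the mechanism only, with simulated data: real chained-equation software (as the authors used) cycles over many variables, draws parameters from their posterior, and produces several imputed datasets rather than one.

```python
import numpy as np

rng = np.random.default_rng(4)

def chained_imputation(X, n_iter=10, rng=rng):
    """Minimal chained-equation (MICE-style) sketch: each variable with
    missing values is regressed on the others in turn, and its missing
    entries are replaced by the regression prediction plus residual
    noise, iterating until the imputations stabilize."""
    X = X.copy()
    miss = np.isnan(X)
    for j in range(X.shape[1]):                   # start from mean fill
        X[miss[:, j], j] = np.nanmean(X[:, j])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            A = np.column_stack([np.ones(len(X)), X[:, others]])
            obs = ~miss[:, j]
            beta, res, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            dof = max(obs.sum() - A.shape[1], 1)
            sigma = np.sqrt(res[0] / dof) if res.size else 0.0
            X[miss[:, j], j] = (A[miss[:, j]] @ beta
                                + rng.normal(0, sigma, miss[:, j].sum()))
    return X

# toy data: two correlated variables, 25% of the second missing
n = 500
x1 = rng.normal(0, 1, n)
x2 = 2.0 * x1 + rng.normal(0, 0.5, n)
data = np.column_stack([x1, x2])
data[rng.random(n) < 0.25, 1] = np.nan
completed = chained_imputation(data)
slope = np.polyfit(completed[:, 0], completed[:, 1], 1)[0]
```

Because imputation borrows the relationship between the variables (and adds residual noise), the slope estimated from the completed data stays close to the truth instead of being attenuated, which is the sampling-variability-preserving property the abstract emphasizes.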
A Comparison of Missing-Data Procedures for Arima Time-Series Analysis
ERIC Educational Resources Information Center
Velicer, Wayne F.; Colby, Suzanne M.
2005-01-01
Missing data are a common practical problem for longitudinal designs. Time-series analysis is a longitudinal method that involves a large number of observations on a single unit. Four different missing-data methods (deletion, mean substitution, mean of adjacent observations, and maximum likelihood estimation) were evaluated. Computer-generated…
They Remember the "Lost" People.
ERIC Educational Resources Information Center
Klages, Karen
Estimates of the number of children currently missing in the United States are only approximate because there is no effective central data bank to collect information on missing persons and unidentified bodies. However, the problem appears to have reached epidemic proportions. Some parents of missing persons have formed organizations in different…
Multimedia data from two probability-based exposure studies were investigated in terms of how missing data and measurement-error imprecision affected estimation of population parameters and associations. Missing data resulted mainly from individuals' refusing to participate in c...
The Empirical Nature and Statistical Treatment of Missing Data
ERIC Educational Resources Information Center
Tannenbaum, Christyn E.
2009-01-01
Introduction. Missing data is a common problem in research and can produce severely misleading analyses, including biased estimates of statistical parameters, and erroneous conclusions. In its 1999 report, the APA Task Force on Statistical Inference encouraged authors to report complications such as missing data and discouraged the use of…
40 CFR 98.96 - Data reporting requirements.
Code of Federal Regulations, 2011 CFR
2011-07-01
... of this subpart, for each fluorinated GHG used. (s) Where missing data procedures were used to... missing data procedures were followed in the reporting year, the method used to estimate the missing data... 40 Protection of Environment 21 2011-07-01 2011-07-01 false Data reporting requirements. 98.96...
40 CFR 98.456 - Data reporting requirements.
Code of Federal Regulations, 2014 CFR
2014-07-01
..., of Equation SS-6 of this subpart. (t) For any missing data, you must report the reason the data were missing, the parameters for which the data were missing, the substitute parameters used to estimate... 40 Protection of Environment 21 2014-07-01 2014-07-01 false Data reporting requirements. 98.456...
USDA-ARS?s Scientific Manuscript database
Missing meteorological data have to be estimated for agricultural and environmental modeling. The objective of this work was to develop a technique to reconstruct the missing daily precipitation data in the central part of the Chesapeake Bay Watershed using regression trees (RT) and artificial neura...
Apparatus And Method For Reconstructing Data Using Cross-Parity Stripes On Storage Media
Hughes, James Prescott
2003-06-17
An apparatus and method for reconstructing missing data using cross-parity stripes on a storage medium is provided. The apparatus and method may operate on data symbols having sizes greater than a data bit. The apparatus and method makes use of a plurality of parity stripes for reconstructing missing data stripes. The parity symbol values in the parity stripes are used as a basis for determining the value of the missing data symbol in a data stripe. A correction matrix is shifted along the data stripes, correcting missing data symbols as it is shifted. The correction is performed from the outside data stripes towards the inner data stripes to thereby use previously reconstructed data symbols to reconstruct other missing data symbols.
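A single-parity toy version of this reconstruction is easy to show: a missing data stripe equals the XOR of the parity stripe with the surviving stripes. The patent's cross-parity scheme generalizes this with several diagonal parity stripes (and symbols larger than a bit) so that multiple missing stripes can be recovered; the stripes below are illustrative.

```python
from functools import reduce

def parity(stripes):
    """Bytewise XOR parity across equal-length stripes."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*stripes))

stripes = [b"hello world!", b"storage data", b"parity demo!"]
p = parity(stripes)                      # parity stripe written to media

# pretend stripe 1 is lost: XOR the parity stripe with the survivors
rebuilt = parity([stripes[0], stripes[2], p])   # rebuilt == b"storage data"
```

The identity behind it is s0 ^ s2 ^ (s0 ^ s1 ^ s2) = s1; shifting such corrections across stripes, as the patent describes, lets already-reconstructed symbols feed the recovery of further missing symbols.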
Salim, Agus; Mackinnon, Andrew; Christensen, Helen; Griffiths, Kathleen
2008-09-30
The pre-test-post-test design (PPD) is predominant in trials of psychotherapeutic treatments. Missing data due to withdrawals present an even bigger challenge in assessing treatment effectiveness under the PPD than under designs with more observations since dropout implies an absence of information about response to treatment. When confronted with missing data, often it is reasonable to assume that the mechanism underlying missingness is related to observed but not to unobserved outcomes (missing at random, MAR). Previous simulation and theoretical studies have shown that, under MAR, modern techniques such as maximum-likelihood (ML) based methods and multiple imputation (MI) can be used to produce unbiased estimates of treatment effects. In practice, however, ad hoc methods such as last observation carried forward (LOCF) imputation and complete-case (CC) analysis continue to be used. In order to better understand the behaviour of these methods in the PPD, we compare the performance of traditional approaches (LOCF, CC) and theoretically sound techniques (MI, ML), under various MAR mechanisms. We show that the LOCF method is seriously biased and conclude that its use should be abandoned. Complete-case analysis produces unbiased estimates only when the dropout mechanism does not depend on pre-test values even when dropout is related to fixed covariates including treatment group (covariate-dependent: CD). However, CC analysis is generally biased under MAR. The magnitude of the bias is largest when the correlation of post- and pre-test is relatively low.
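The LOCF bias the authors report is easy to reproduce in a toy pre-test-post-test trial. The simulation below is an illustration under assumed numbers: dropout depends on the observed pre-test (MAR), LOCF carries the pre-test forward as the missing post-test, and a single deterministic regression imputation within each arm stands in for the principled ML/MI approaches (proper MI would add posterior draws and repeat).

```python
import numpy as np

rng = np.random.default_rng(5)

# toy PPD trial: true treatment effect on the post-test is 5 points
n = 2000
arm = rng.integers(0, 2, n)
pre = rng.normal(50, 10, n)
post = pre + 5.0 * arm + rng.normal(0, 3, n)
# MAR dropout: probability depends only on the observed pre-test score
drop = rng.random(n) < 1.0 / (1.0 + np.exp(-(pre - 50) / 5))

# LOCF: the carried-forward pre-test forces "no change" on every dropout
post_locf = np.where(drop, pre, post)
locf_effect = ((post_locf - pre)[arm == 1].mean()
               - (post_locf - pre)[arm == 0].mean())   # biased toward 0

def imputed_change(a):
    """Mean change in arm a after regression imputation within the arm."""
    obs = (arm == a) & ~drop
    b = np.polyfit(pre[obs], post[obs], 1)   # post ~ pre among completers
    filled = np.where((arm == a) & drop, np.polyval(b, pre), post)
    return (filled - pre)[arm == a].mean()

mi_effect = imputed_change(1) - imputed_change(0)      # near the truth
```

With roughly half the sample dropping out, LOCF dilutes the 5-point effect toward zero, while the model-based imputation recovers it, mirroring the paper's conclusion that LOCF should be abandoned.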
Liu, Benmei; Yu, Mandi; Graubard, Barry I; Troiano, Richard P; Schenker, Nathaniel
2016-01-01
The Physical Activity Monitor (PAM) component was introduced into the 2003-2004 National Health and Nutrition Examination Survey (NHANES) to collect objective information on physical activity including both movement intensity counts and ambulatory steps. Due to an error in the accelerometer device initialization process, the steps data were missing for all participants in several primary sampling units (PSUs), typically a single county or group of contiguous counties, who had intensity count data from their accelerometers. To avoid potential bias and loss in efficiency in estimation and inference involving the steps data, we considered methods to accurately impute the missing values for steps collected in the 2003-2004 NHANES. The objective was to come up with an efficient imputation method which minimized model-based assumptions. We adopted a multiple imputation approach based on Additive Regression, Bootstrapping and Predictive mean matching (ARBP) methods. This method fits alternative conditional expectation (ace) models, which use an automated procedure to estimate optimal transformations for both the predictor and response variables. This paper describes the approaches used in this imputation and evaluates the methods by comparing the distributions of the original and the imputed data. A simulation study using the observed data is also conducted as part of the model diagnostics. Finally some real data analyses are performed to compare the before and after imputation results. PMID:27488606
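One ingredient of the ARBP approach, predictive mean matching, can be sketched on its own. This is an illustration built from the abstract's description, with simulated accelerometer-like data: the full ARBP procedure also fits additive (ace-transformed) regressions and bootstraps to produce multiple imputations.

```python
import numpy as np

rng = np.random.default_rng(6)

def pmm_impute(x, y, k=5, rng=rng):
    """Predictive mean matching: regress y on x, then replace each
    missing y with the *observed* y of a donor whose predicted value is
    among the k closest predictions. Imputed values are therefore always
    real observed values, preserving the shape of the distribution."""
    obs = ~np.isnan(y)
    b = np.polyfit(x[obs], y[obs], 1)
    pred = np.polyval(b, x)
    y_obs, pred_obs = y[obs], pred[obs]
    y_out = y.copy()
    for i in np.where(~obs)[0]:
        donors = np.argsort(np.abs(pred_obs - pred[i]))[:k]
        y_out[i] = y_obs[donors[rng.integers(k)]]   # random donor of the k
    return y_out

# toy PAM-style data: daily steps roughly proportional to intensity counts,
# with 20% of the steps values missing
n = 300
counts = rng.uniform(1e5, 1e6, n)
steps = 0.01 * counts + rng.normal(0, 500, n)
steps[rng.random(n) < 0.2] = np.nan
completed = pmm_impute(counts, steps)
```

The donor-matching step is what makes PMM attractive for the NHANES steps data: it avoids imputing impossible values (e.g. negative step counts) because every fill-in is a value someone actually recorded.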
Policy Analysis: Valuation of Ecosystem Services in the Southern Appalachian Mountains.
Banzhaf, H Spencer; Burtraw, Dallas; Criscimagna, Susie Chung; Cosby, Bernard J; Evans, David A; Krupnick, Alan J; Siikamäki, Juha V
2016-03-15
This study estimates the economic value of an increase in ecosystem services attributable to the reduced acidification expected from more stringent air pollution policy. By integrating a detailed biogeochemical model that projects future ecological recovery with economic methods that measure preferences for specific ecological improvements, we estimate the economic value of ecological benefits from new air pollution policies in the Southern Appalachian ecosystem. Our results indicate that these policies generate aggregate benefits of about $3.7 billion, or about $16 per year per household in the region. The study provides currently missing information about the ecological benefits from air pollution policies that is needed to evaluate such policies comprehensively. More broadly, the study also illustrates how integrated biogeochemical and economic assessments of multidimensional ecosystems can evaluate the relative benefits of different policy options that vary by scale and across ecosystem attributes.
Reconstruction of missed critical frequency of F2-layer over Mexico using TEC
NASA Astrophysics Data System (ADS)
Sergeeva, M. A.; Maltseva, O. A.; Gonzalez-Esparza, A.; Romero Hernandez, E.; De la Luz, V.; Rodriguez-Martinez, M. R.
2016-12-01
The state of the Earth's ionosphere is one of the key issues in Space Weather monitoring; the importance of diagnosing its current state and forecasting Space Weather conditions is hard to overestimate. There are different methods of short-term prediction of changes in the ionospheric state. Real-time monitoring of the ionospheric Total Electron Content (TEC) provides the opportunity to choose an appropriate technique for a particular observation point on the Earth. Since September 2015, continuous monitoring of TEC variations over the territory of Mexico has been performed by the Mexican Space Weather Service (SCiESMEX). Regular patterns of diurnal and seasonal TEC variations were revealed on the basis of past statistics and real-time observations, and these can be used to test the prediction method. Some specific features of ionospheric behaviour are discussed. However, for all the merits of TEC as an ionospheric parameter, a full picture of the processes in the ionosphere and many practical applications require the behaviour of other principal ionospheric parameters provided by ionosondes. Currently, SCiESMEX is working on a project to install an ionosonde in Mexico. This study focused on reconstructing the critical frequency of the ionospheric F2 layer (foF2) when these data are missing. For this purpose, measurements of TEC and the median value of the equivalent slab thickness of the ionosphere were used. First, foF2 values were reconstructed for the case where ionosonde data are absent for some hours or days. Second, the possibility of foF2 reconstruction was estimated for the Mexican region, which has no ionosonde, using local TEC data and foF2 data obtained in regions close to Mexico. Calculations were performed for quiet and disturbed periods. The results of the reconstruction were compared to foF2 obtained from the International Reference Ionosphere model and to median foF2 values. Comparison with other low- and mid-latitude regions was made. It was shown that foF2 reconstructed using TEC agrees better with the experimental data. In view of the above, the reconstructed foF2 values are a great aid for estimating the ionospheric state over Mexico when foF2 information is missing.
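The reconstruction described above rests on the standard relation NmF2 ≈ 1.24×10^10 · foF2² (foF2 in MHz, NmF2 in el/m³) and the definition of equivalent slab thickness τ = TEC/NmF2. A minimal sketch, assuming TEC in TEC units and a median τ taken over hours with ionosonde data (function names are illustrative):

```python
import math

K = 1.24e10    # NmF2 [el/m^3] = K * foF2^2, with foF2 in MHz (standard relation)
TECU = 1.0e16  # 1 TEC unit = 1e16 electrons/m^2

def median_slab_thickness(tec_tecu, fof2_mhz):
    """Median equivalent slab thickness tau = TEC / NmF2 (metres),
    over hours where both TEC and ionosonde foF2 are available."""
    taus = sorted(t * TECU / (K * f * f) for t, f in zip(tec_tecu, fof2_mhz))
    n = len(taus)
    return taus[n // 2] if n % 2 else 0.5 * (taus[n // 2 - 1] + taus[n // 2])

def reconstruct_fof2(tec_tecu, tau_med):
    """Reconstruct foF2 (MHz) from TEC when ionosonde data are missing."""
    return math.sqrt(tec_tecu * TECU / (K * tau_med))
```

With a constant slab thickness of 300 km, a TEC of about 13.4 TECU corresponds to foF2 = 6 MHz, and the reconstruction recovers it exactly.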
Implicit Valuation of the Near-Miss is Dependent on Outcome Context.
Banks, Parker J; Tata, Matthew S; Bennett, Patrick J; Sekuler, Allison B; Gruber, Aaron J
2018-03-01
Gambling studies have described a "near-miss effect" wherein the experience of almost winning increases gambling persistence. The near-miss has been proposed to inflate the value of preceding actions through its perceptual similarity to wins. We demonstrate here, however, that it acts as a conditioned stimulus to positively or negatively influence valuation, dependent on reward expectation and cognitive engagement. When subjects are asked to choose between two simulated slot machines, near-misses increase valuation of machines with a low payout rate, whereas they decrease valuation of high payout machines. This contextual effect impairs decisions and persists regardless of manipulations to outcome feedback or financial incentive provided for good performance. It is consistent with proposals that near-misses cause frustration when wins are expected, and we propose that it increases choice stochasticity and overrides avoidance of low-valued options. Intriguingly, the near-miss effect disappears when subjects are required to explicitly value machines by placing bets, rather than choosing between them. We propose that this task increases cognitive engagement and recruits participation of brain regions involved in cognitive processing, causing inhibition of otherwise dominant systems of decision-making. Our results reveal that only implicit, rather than explicit strategies of decision-making are affected by near-misses, and that the brain can fluidly shift between these strategies according to task demands.
Normal Theory Two-Stage ML Estimator When Data Are Missing at the Item Level
Savalei, Victoria; Rhemtulla, Mijke
2017-01-01
In many modeling contexts, the variables in the model are linear composites of the raw items measured for each participant; for instance, regression and path analysis models rely on scale scores, and structural equation models often use parcels as indicators of latent constructs. Currently, no analytic estimation method exists to appropriately handle missing data at the item level. Item-level multiple imputation (MI), however, can handle such missing data straightforwardly. In this article, we develop an analytic approach for dealing with item-level missing data—that is, one that obtains a unique set of parameter estimates directly from the incomplete data set and does not require imputations. The proposed approach is a variant of the two-stage maximum likelihood (TSML) methodology, and it is the analytic equivalent of item-level MI. We compare the new TSML approach to three existing alternatives for handling item-level missing data: scale-level full information maximum likelihood, available-case maximum likelihood, and item-level MI. We find that the TSML approach is the best analytic approach, and its performance is similar to item-level MI. We recommend its implementation in popular software and its further study.
Should genes with missing data be excluded from phylogenetic analyses?
Jiang, Wei; Chen, Si-Yun; Wang, Hong; Li, De-Zhu; Wiens, John J
2014-11-01
Phylogeneticists often design their studies to maximize the number of genes included but minimize the overall amount of missing data. However, few studies have addressed the costs and benefits of adding characters with missing data, especially for likelihood analyses of multiple loci. In this paper, we address this topic using two empirical data sets (in yeast and plants) with well-resolved phylogenies. We introduce varying amounts of missing data into varying numbers of genes and test whether the benefits of excluding genes with missing data outweigh the costs of excluding the non-missing data that are associated with them. We also test if there is a proportion of missing data in the incomplete genes at which they cease to be beneficial or harmful, and whether missing data consistently bias branch length estimates. Our results indicate that adding incomplete genes generally increases the accuracy of phylogenetic analyses relative to excluding them, especially when there is a high proportion of incomplete genes in the overall dataset (and thus few complete genes). Detailed analyses suggest that adding incomplete genes is especially helpful for resolving poorly supported nodes. Given that we find that excluding genes with missing data often decreases accuracy relative to including these genes (and that decreases are generally of greater magnitude than increases), there is little basis for assuming that excluding these genes is necessarily the safer or more conservative approach. We also find no evidence that missing data consistently bias branch length estimates.
Improving data sharing in research with context-free encoded missing data.
Hoevenaar-Blom, Marieke P; Guillemont, Juliette; Ngandu, Tiia; Beishuizen, Cathrien R L; Coley, Nicola; Moll van Charante, Eric P; Andrieu, Sandrine; Kivipelto, Miia; Soininen, Hilkka; Brayne, Carol; Meiller, Yannick; Richard, Edo
2017-01-01
Lack of attention to missing data in research may result in biased results, loss of power and reduced generalizability. Registering reasons for missing values at the time of data collection, or, in the case of sharing existing data, before making data available to other teams, can save time and effort, improve scientific value and help to prevent erroneous assumptions and biased results. To ensure that encoding of missing data is sufficient to understand the reason why data are missing, it should ideally be context-free. Therefore, 11 context-free codes of missing data were carefully designed based on three completed randomized controlled clinical trials and tested in a new randomized controlled clinical trial by an international team consisting of clinical researchers and epidemiologists with extensive experience in designing and conducting trials and an Information System expert. These codes can be divided into missing due to participant and/or participation characteristics (n = 6), missing by design (n = 4), and missing due to a procedural error (n = 1). Broad implementation of context-free missing data encoding may enhance the possibilities of data sharing and pooling, thus allowing more powerful analyses using existing data.
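The 11-code grouping described above could be represented as a simple enumeration. The labels below are hypothetical illustrations of the three groups (6 participant-related, 4 by-design, 1 procedural), not the study's actual codes:

```python
from enum import Enum

class MissingCode(Enum):
    """Context-free reasons a value is missing (illustrative labels)."""
    # Missing due to participant / participation characteristics (n = 6)
    REFUSED = "P1"
    UNABLE_TO_ANSWER = "P2"
    NOT_APPLICABLE = "P3"
    DROPPED_OUT = "P4"
    NOT_REACHED = "P5"
    DONT_KNOW = "P6"
    # Missing by design (n = 4)
    NOT_IN_STUDY_ARM = "D1"
    VISIT_NOT_SCHEDULED = "D2"
    SKIP_PATTERN = "D3"
    NOT_YET_DUE = "D4"
    # Procedural error (n = 1)
    PROCEDURAL_ERROR = "E1"
```

Storing such a code alongside every empty cell (rather than a bare NA) lets a receiving team distinguish, say, design-driven gaps from refusals without any study-specific context.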
NASA Astrophysics Data System (ADS)
Song, Y.; Gurney, K. R.; Rayner, P. J.; Asefi-Najafabady, S.
2012-12-01
High resolution quantification of global fossil fuel CO2 emissions has become essential in research aimed at understanding the global carbon cycle and supporting the verification of international agreements on greenhouse gas emission reductions. The Fossil Fuel Data Assimilation System (FFDAS) was used to estimate global fossil fuel carbon emissions at 0.25 degree from 1992 to 2010. FFDAS quantifies CO2 emissions based on areal population density, per capita economic activity, energy intensity and carbon intensity. A critical constraint to this system is the estimation of national-scale fossil fuel CO2 emissions disaggregated into economic sectors. Furthermore, prior uncertainty estimation is an important aspect of the FFDAS. Objective techniques to quantify uncertainty for the national emissions are essential. There are several institutional datasets that quantify national carbon emissions, including British Petroleum (BP), the International Energy Agency (IEA), the Energy Information Administration (EIA), and the Carbon Dioxide Information and Analysis Center (CDIAC). These four datasets have been "harmonized" by Jordan Macknick for inter-comparison purposes (Macknick, Carbon Management, 2011). The harmonization attempted to generate consistency among the different institutional datasets via a variety of techniques such as reclassifying into consistent emitting categories, recalculating based on consistent emission factors, and converting into consistent units. These harmonized data form the basis of our uncertainty estimation. We summarized the maximum, minimum and mean national carbon emissions for all the datasets from 1992 to 2010. We calculated key statistics highlighting the remaining differences among the harmonized datasets. We combine the span (max - min) of datasets for each country and year with the standard deviation of the national spans over time. 
We utilize the economic sectoral definitions from IEA to disaggregate the national total emissions into the specific sectors required by FFDAS. Our results indicate that although the harmonization performed by Macknick generates better agreement among datasets, significant differences remain at the national total level. For example, the CO2 emission span for most countries ranges from 10% to 12%; BP is generally the highest of the four datasets, while IEA is typically the lowest. The US and China had the highest absolute span values but lower percentage span values compared to other countries; together, however, the US and China make up nearly one-half of the total global absolute span quantity. The absolute span value for the summation of national differences approaches 1 GtC/year in 2007, almost one-half of the biological "missing sink". The span value is used as a potential bias in a recalculation of global and regional carbon budgets to highlight the importance of fossil fuel CO2 emissions in calculating the missing sink. We conclude that if the harmonized span represents potential bias, calculations of the missing sink through forward budget or inverse approaches may be biased by nearly a factor of two.
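The span statistics described above (per-year max − min across datasets, plus their variability over time) can be sketched as follows, with illustrative toy numbers rather than the harmonized Macknick data:

```python
def span_stats(national_totals):
    """Per-year span (max - min) of a country's emission estimates
    across datasets, plus the mean span and its standard deviation
    over time. `national_totals` maps dataset name -> {year: value}."""
    years = sorted(next(iter(national_totals.values())))
    spans = {yr: max(s[yr] for s in national_totals.values())
                 - min(s[yr] for s in national_totals.values())
             for yr in years}
    mean = sum(spans.values()) / len(spans)
    sd = (sum((v - mean) ** 2 for v in spans.values()) / len(spans)) ** 0.5
    return spans, mean, sd
```

For example, with two datasets reporting 110/112 and 100/100 over two years, the spans are 10 and 12, giving a mean span of 11 with standard deviation 1.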
Gap-filling methods to impute eddy covariance flux data by preserving variance.
NASA Astrophysics Data System (ADS)
Kunwor, S.; Staudhammer, C. L.; Starr, G.; Loescher, H. W.
2015-12-01
To represent carbon dynamics, in terms of the exchange of CO2 between the terrestrial ecosystem and the atmosphere, eddy covariance (EC) data have been collected using eddy flux towers at various sites across the globe for more than two decades. However, EC measurements are missing for various reasons: precipitation, routine maintenance, or lack of vertical turbulence. In order to obtain estimates of net ecosystem exchange of carbon dioxide (NEE) with high precision and accuracy, robust gap-filling methods to impute missing data are required. While the methods used so far have provided robust estimates of the mean value of NEE, little attention has been paid to preserving the variance structures embodied by the flux data. Preserving the variance of these data will provide unbiased and precise estimates of NEE over time that mimic natural fluctuations. We used a non-linear regression approach with moving windows of different lengths (15, 30, and 60 days) to estimate non-linear regression parameters for one year of flux data from a longleaf pine site at the Joseph Jones Ecological Research Center. We used the Michaelis-Menten and Van't Hoff functions as our base models. We assessed the potential physiological drivers of these parameters with linear models using micrometeorological predictors. We then used a parameter prediction approach to refine the non-linear gap-filling equations based on micrometeorological conditions. This provides an opportunity to incorporate additional variables, such as vapor pressure deficit (VPD) and volumetric water content (VWC), into the equations. Our preliminary results indicate that improvements in gap-filling can be gained with a 30-day moving window and additional micrometeorological predictors (as indicated by a lower root mean square error (RMSE) of the predicted NEE values). Our next steps are to use these parameter predictions from moving windows to gap-fill the data with and without incorporating the potential driver variables of the traditionally used parameters. Comparisons of the predicted values from these methods and 'traditional' gap-filling methods (using 12 fixed monthly windows) will then be made to assess the extent to which variance is preserved. Further, this method will be applied to impute artificially created gaps, to analyze whether variance is preserved.
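A minimal sketch of Michaelis-Menten gap-filling of the kind outlined above, using a brute-force grid search in place of the study's non-linear regression (function names and parameter grids are illustrative):

```python
def michaelis_menten(par, amax, km):
    """Michaelis-Menten light response: uptake saturates with PAR."""
    return amax * par / (km + par)

def fit_mm(par, nee, amax_grid, km_grid):
    """Least-squares parameter fit by brute-force grid search (a simple
    stand-in for non-linear regression on a moving window)."""
    pairs = [(p, y) for p, y in zip(par, nee) if y is not None]
    best, best_sse = None, float("inf")
    for amax in amax_grid:
        for km in km_grid:
            sse = sum((michaelis_menten(p, amax, km) - y) ** 2
                      for p, y in pairs)
            if sse < best_sse:
                best, best_sse = (amax, km), sse
    return best

def gap_fill(par, nee, params):
    """Replace missing NEE observations with model predictions."""
    amax, km = params
    return [y if y is not None else michaelis_menten(p, amax, km)
            for p, y in zip(par, nee)]
```

In a moving-window scheme, `fit_mm` would be re-run on each window so that the fitted parameters, and hence the filled values, track seasonal changes in the light response.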
The Missing Data Assumptions of the NEAT Design and Their Implications for Test Equating
ERIC Educational Resources Information Center
Sinharay, Sandip; Holland, Paul W.
2010-01-01
The Non-Equivalent groups with Anchor Test (NEAT) design involves "missing data" that are "missing by design." Three nonlinear observed score equating methods used with a NEAT design are the "frequency estimation equipercentile equating" (FEEE), the "chain equipercentile equating" (CEE), and the "item-response-theory observed-score-equating" (IRT…
Effects of Missing Data Methods in Structural Equation Modeling with Nonnormal Longitudinal Data
ERIC Educational Resources Information Center
Shin, Tacksoo; Davison, Mark L.; Long, Jeffrey D.
2009-01-01
The purpose of this study is to investigate the effects of missing data techniques in longitudinal studies under diverse conditions. A Monte Carlo simulation examined the performance of 3 missing data methods in latent growth modeling: listwise deletion (LD), maximum likelihood estimation using the expectation and maximization algorithm with a…
ERIC Educational Resources Information Center
Lee, In Heok
2012-01-01
Researchers in career and technical education often ignore more effective ways of reporting and treating missing data and instead implement traditional, but ineffective, missing data methods (Gemici, Rojewski, & Lee, 2012). The recent methodological, and even the non-methodological, literature has increasingly emphasized the importance of…
Smith, Justin D; Borckardt, Jeffrey J; Nash, Michael R
2012-09-01
The case-based time-series design is a viable methodology for treatment outcome research. However, the literature has not fully addressed the problem of missing observations in such autocorrelated data streams. Specifically, to what extent do missing observations compromise inference when observations are not independent? Do the available missing data replacement procedures preserve inferential integrity? Does the extent of autocorrelation matter? We use Monte Carlo simulation modeling of a single-subject intervention study to address these questions. We find power sensitivity to be within acceptable limits across four proportions of missing observations (10%, 20%, 30%, and 40%) when missing data are replaced using the Expectation-Maximization Algorithm, more commonly known as the EM Procedure (Dempster, Laird, & Rubin, 1977). This applies to data streams with lag-1 autocorrelation estimates under 0.80. As autocorrelation estimates approach 0.80, the replacement procedure yields an unacceptable power profile. The implications of these findings and directions for future research are discussed.
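A much-simplified, deterministic illustration of the EM idea for an autocorrelated stream: alternate between predicting missing points from the current AR(1) fit and re-estimating the fit from the completed series. This is a sketch of the concept, not Dempster et al.'s full EM Procedure:

```python
def em_ar1_impute(y, n_iter=50):
    """Iteratively impute missing values in an AR(1)-like series.

    E-step: replace each missing value with its expectation under the
    current fit; M-step: re-estimate the mean and lag-1 coefficient
    from the completed series.
    """
    obs = [v for v in y if v is not None]
    mu = sum(obs) / len(obs)
    z = [v if v is not None else mu for v in y]   # crude mean start
    for _ in range(n_iter):
        m = sum(z) / len(z)
        num = sum((z[i] - m) * (z[i - 1] - m) for i in range(1, len(z)))
        den = sum((v - m) ** 2 for v in z)
        phi = num / den                           # lag-1 coefficient
        for i, v in enumerate(y):
            if v is None:                         # E-step at missing points
                prev = z[i - 1] if i > 0 else m
                z[i] = m + phi * (prev - m)
    return z, phi
```

The point the abstract makes is visible here: when `phi` is large, each imputed point leans heavily on its (possibly also imputed) neighbour, so errors propagate and inference degrades.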
A meta-data based method for DNA microarray imputation.
Jörnsten, Rebecka; Ouyang, Ming; Wang, Hui-Yu
2007-03-29
DNA microarray experiments are conducted in logical sets, such as time course profiling after a treatment is applied to the samples, or comparisons of the samples under two or more conditions. Due to cost and design constraints of spotted cDNA microarray experiments, each logical set commonly includes only a small number of replicates per condition. Despite the vast improvement of the microarray technology in recent years, missing values are prevalent. Intuitively, imputation of missing values is best done using many replicates within the same logical set. In practice, there are few replicates and thus reliable imputation within logical sets is difficult. However, it is in the case of few replicates that the presence of missing values, and how they are imputed, can have the most profound impact on the outcome of downstream analyses (e.g. significance analysis and clustering). This study explores the feasibility of imputation across logical sets, using the vast amount of publicly available microarray data to improve imputation reliability in the small sample size setting. We download all cDNA microarray data of Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans from the Stanford Microarray Database. Through cross-validation and simulation, we find that, for all three species, our proposed imputation using data from public databases is far superior to imputation within a logical set, sometimes to an astonishing degree. Furthermore, the imputation root mean square error for significant genes is generally much lower than that of non-significant ones. Since downstream analysis of significant genes, such as clustering and network analysis, can be very sensitive to small perturbations of estimated gene effects, it is highly recommended that researchers apply reliable data imputation prior to further analysis. Our method can also be applied to cDNA microarray experiments from other species, provided good reference data are available.
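Correlation-based neighbour imputation of the kind widely used for microarray data can be sketched as follows, filling a target gene's missing entries from the reference genes most correlated with it. This is a generic KNN-style sketch, not the paper's meta-data method:

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db)

def knn_impute(target, reference, k=2):
    """Fill missing entries of `target` with a |r|-weighted average of
    the k reference genes most correlated with it (r computed over the
    arrays where the target is observed)."""
    idx = [i for i, v in enumerate(target) if v is not None]
    t_obs = [target[i] for i in idx]
    corr = sorted(((pearson(t_obs, [g[i] for i in idx]), g)
                   for g in reference), key=lambda c: -abs(c[0]))[:k]
    out = list(target)
    for i, v in enumerate(target):
        if v is None:
            wsum = sum(abs(r) for r, _ in corr)
            out[i] = sum(abs(r) * g[i] for r, g in corr) / wsum
    return out
```

Imputing across logical sets, as the paper proposes, amounts to drawing `reference` from a large public compendium rather than from the few replicates of the experiment at hand.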
Code of Federal Regulations, 2010 CFR
2010-07-01
... missing participants. (b) Limitation on benefit value. The total actuarial present value of all benefits... Relating to Labor (Continued) PENSION BENEFIT GUARANTY CORPORATION PLAN TERMINATIONS MISSING PARTICIPANTS § 4050.11 Limitations. (a) Exclusive benefit. The benefits provided for under this part will be the only...
Miss-distance indicator for tank main gun systems
NASA Astrophysics Data System (ADS)
Bornstein, Jonathan A.; Hillis, David B.
1994-07-01
The initial development of a passive, automated system to track bullet trajectories near a target to determine the "miss distance," and the corresponding correction necessary to bring the following round "on target," is discussed. The system consists of a visible-wavelength CCD sensor, long focal length optics, and a separate IR sensor to detect the muzzle flash of the firing event; this is coupled to a PC-based image processing and automatic tracking system designed to follow the projectile trajectory by intelligently comparing frame-to-frame variation of the projectile tracer image. An error analysis indicates that the device is particularly sensitive to variation in the projectile time of flight to the target, and requires the development of algorithms to estimate this value from the 2D images employed by the sensor to monitor the projectile trajectory. Initial results obtained by using a brassboard prototype to track training ammunition are promising.
What are we missing? Scope 3 greenhouse gas emissions accounting in the metals and minerals industry
NASA Astrophysics Data System (ADS)
Greene, Suzanne E.
2018-05-01
Metal and mineral companies have significant greenhouse gas emissions in their upstream and downstream value chains due to outsourced extraction, beneficiation and transportation activities, depending on a firm's business model. While many companies are moving towards more transparent reporting of corporate greenhouse gas emissions, value chain emissions remain difficult to capture, particularly in the global supply chain. Incomplete reports make it difficult for companies to track emissions reduction goals or implement sustainable supply chain improvements, especially for commodity products that form the base of many other sectors' value chains. Using voluntarily reported CDP data, this paper sheds light on hotspots in value chain emissions for individual metal and mineral companies, and for the sector as a whole. The state of value chain emissions reporting for the industry is discussed in general, with a focus on where emissions could potentially be underestimated and how estimates could be improved.
A Review of Missing Data Handling Methods in Education Research
ERIC Educational Resources Information Center
Cheema, Jehanzeb R.
2014-01-01
Missing data are a common occurrence in survey-based research studies in education, and the way missing values are handled can significantly affect the results of analyses based on such data. Despite known problems with performance of some missing data handling methods, such as mean imputation, many researchers in education continue to use those…
Federal Register 2010, 2011, 2012, 2013, 2014
2012-01-23
... monitors with missing data. Maximum recorded values are substituted for the missing data. The resulting... which the incomplete site is missing data. The linear regression relationship is based on time periods... between the monitors is used to fill in missing data for the incomplete monitor, so that the normal data...
Near-Miss Effects on Response Latencies and Win Estimations of Slot Machine Players
ERIC Educational Resources Information Center
Dixon, Mark R.; Schreiber, James E.
2004-01-01
The present study examined the degree to which slot machine near-miss trials, or trials that displayed 2 of 3 winning symbols on the payoff line, affected response times and win estimations of 12 recreational slot machine players. Participants played a commercial slot machine in a casino-like laboratory for course extra-credit points. Videotaped…
The Impact of Missing Data on Species Tree Estimation.
Xi, Zhenxiang; Liu, Liang; Davis, Charles C
2016-03-01
Phylogeneticists are increasingly assembling genome-scale data sets that include hundreds of genes to resolve their focal clades. Although these data sets commonly include a moderate to high amount of missing data, there remains no consensus on their impact on species tree estimation. Here, using several simulated and empirical data sets, we assess the effects of missing data on species tree estimation under varying degrees of incomplete lineage sorting (ILS) and gene rate heterogeneity. We demonstrate that concatenation (RAxML), gene-tree-based coalescent (ASTRAL, MP-EST, and STAR), and supertree (matrix representation with parsimony [MRP]) methods perform reliably, so long as missing data are randomly distributed (by gene and/or by species) and a sufficiently large number of genes are sampled. When data sets are indecisive sensu Sanderson et al. (2010. Phylogenomics with incomplete taxon coverage: the limits to inference. BMC Evol Biol. 10:155) and/or ILS is high, however, high amounts of missing data that are randomly distributed require exhaustive levels of gene sampling, likely exceeding most empirical studies to date. Moreover, missing data become especially problematic when they are nonrandomly distributed. We demonstrate that STAR produces inconsistent results when the amount of nonrandom missing data is high, regardless of the degree of ILS and gene rate heterogeneity. Similarly, concatenation methods using maximum likelihood can be misled by nonrandom missing data in the presence of gene rate heterogeneity, which becomes further exacerbated when combined with high ILS. In contrast, ASTRAL, MP-EST, and MRP are more robust under all of these scenarios. These results underscore the importance of understanding the influence of missing data in the phylogenomics era.
Bozio, Catherine H; Flanders, W Dana; Finelli, Lyn; Bramley, Anna M; Reed, Carrie; Gandhi, Neel R; Vidal, Jorge E; Erdman, Dean; Levine, Min Z; Lindstrom, Stephen; Ampofo, Krow; Arnold, Sandra R; Self, Wesley H; Williams, Derek J; Grijalva, Carlos G; Anderson, Evan J; McCullers, Jonathan A; Edwards, Kathryn M; Pavia, Andrew T; Wunderink, Richard G; Jain, Seema
2018-01-01
Background: Real-time polymerase chain reaction (PCR) on respiratory specimens and serology on paired blood specimens are used to determine the etiology of respiratory illnesses for research studies. However, convalescent serology is often not collected. We used multiple imputation to assign values for missing serology results to estimate virus-specific prevalence among pediatric and adult community-acquired pneumonia hospitalizations using data from an active population-based surveillance study. Methods: Presence of adenoviruses, human metapneumovirus, influenza viruses, parainfluenza virus types 1–3, and respiratory syncytial virus was defined by positive PCR on nasopharyngeal/oropharyngeal specimens or a 4-fold rise in paired serology. We performed multiple imputation by developing a multivariable regression model for each virus using data from patients with available serology results. We calculated absolute and relative differences in the proportion of each virus detected, comparing the imputed to observed (nonimputed) results. Results: Among 2222 children and 2259 adults, 98.8% and 99.5% had nasopharyngeal/oropharyngeal specimens and 43.2% and 37.5% had paired serum specimens, respectively. Imputed results increased viral etiology assignments by an absolute difference of 1.6%–4.4% and 0.8%–2.8% in children and adults, respectively; relative differences were 1.1–3.0 times higher. Conclusions: Multiple imputation can be used when serology results are missing to refine virus-specific prevalence estimates, and these will likely increase estimates.
Cornish, Rosie P; Tilling, Kate; Boyd, Andy; Davies, Amy; Macleod, John
2015-06-01
Most epidemiological studies have missing information, leading to reduced power and potential bias. Estimates of exposure-outcome associations will generally be biased if the outcome variable is missing not at random (MNAR). Linkage to administrative data containing a proxy for the missing study outcome allows assessment of whether this outcome is MNAR and the evaluation of bias. We examined this in relation to the association between infant breastfeeding and IQ at 15 years, where a proxy for IQ was available through linkage to school attainment data. Subjects were those who enrolled in the Avon Longitudinal Study of Parents and Children in 1990-91 (n = 13 795), of whom 5023 had IQ measured at age 15. For those with missing IQ, 7030 (79%) had information on educational attainment at age 16 obtained through linkage to the National Pupil Database. The association between duration of breastfeeding and IQ was estimated using a complete case analysis, multiple imputation and inverse probability-of-missingness weighting; these estimates were then compared with those derived from analyses informed by the linkage. IQ at 15 was MNAR: individuals with higher attainment were less likely to have missing IQ data, even after adjusting for socio-demographic factors. All the approaches underestimated the association between breastfeeding and IQ compared with analyses informed by linkage. Linkage to administrative data containing a proxy for the outcome variable allows the MNAR assumption to be tested and more efficient analyses to be performed. Under certain circumstances, this may produce unbiased results.
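A minimal sketch of inverse probability-of-missingness weighting with a single fully observed stratifying variable (in practice a regression model for the probability of being observed, with several covariates, would be used):

```python
from collections import defaultdict

def ipw_mean(outcome, stratum):
    """Inverse probability-of-missingness weighted mean of an outcome.

    The probability that the outcome is observed is estimated within
    each stratum of a fully observed covariate; each complete case is
    then weighted by the inverse of that probability.
    """
    tot, seen = defaultdict(int), defaultdict(int)
    for y, s in zip(outcome, stratum):
        tot[s] += 1
        if y is not None:
            seen[s] += 1
    wsum = wy = 0.0
    for y, s in zip(outcome, stratum):
        if y is not None:
            w = tot[s] / seen[s]   # 1 / P(observed | stratum)
            wsum += w
            wy += w * y
    return wy / wsum
```

The weighting removes bias only if missingness depends on the stratifying variable alone; under MNAR, as in the IQ example above, a proxy outcome from linked data is needed to check that assumption.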
Hanigan, Ivan; Hall, Gillian; Dear, Keith B G
2006-09-13
To explain the possible effects of exposure to weather conditions on population health outcomes, weather data need to be calculated at a level in space and time that is appropriate for the health data. There are various ways of estimating exposure values from raw data collected at weather stations, but the rationale for using one technique rather than another, the significance of the differences in the values obtained, and the effect these have on a research question are often not explicitly considered. In this study we compare different techniques for allocating weather data observations to small geographical areas, and different options for weighting averages of these observations, when calculating estimates of daily precipitation and temperature for Australian Postal Areas. Options that weight observations based on distance from population centroids and on population size are more computationally intensive but give estimates that are conceptually more closely related to the experience of the population. Options based on values derived from sites internal to postal areas, or from nearest-neighbour sites (that is, proximity polygons around weather stations intersected with postal areas), tended to include fewer stations' observations in their estimates, and missing values were common. Options based on observations from stations within a 50-kilometre radius of centroids, with data weighted by distance from centroids, gave more complete estimates. Using the geographic centroid of the postal area gave estimates that differed only slightly from those using population-weighted centroids or the population-weighted average of sub-unit estimates. To calculate daily weather exposure values for analysis of health outcome data for small areas, the use of data from weather stations internal to the area only, or from neighbouring weather stations (allocated by the use of proximity polygons), is too limited.
The most appropriate method conceptually is to use weather data from sites within a 50-kilometre radius of the area, weighted to population centres; a simpler acceptable option is to weight to the geographic centroid.
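As an illustration of the centroid-weighting options described above, here is a minimal inverse-distance-weighted estimator over stations within 50 km of a centroid. The coordinates, the weighting power, and the function name are our assumptions for the sketch, not the paper's implementation:

```python
import math

def idw_estimate(stations, centroid, radius_km=50.0, power=1.0):
    """Inverse-distance-weighted estimate of a daily weather value at a
    population centroid, using only stations within `radius_km`.
    `stations` is a list of (x_km, y_km, value) tuples; `value` may be
    None for a missing observation, which is simply skipped."""
    num = den = 0.0
    cx, cy = centroid
    for x, y, value in stations:
        if value is None:
            continue  # missing observation at this station
        d = math.hypot(x - cx, y - cy)
        if d > radius_km:
            continue  # outside the search radius
        w = 1.0 / max(d, 1e-6) ** power  # nearer stations weigh more
        num += w * value
        den += w
    return num / den if den else None  # None if no usable station

# A centroid flanked by two in-range stations (the nearer one dominates)
# and one station beyond 50 km that is excluded.
obs = [(10.0, 0.0, 20.0), (40.0, 0.0, 28.0), (200.0, 0.0, 35.0)]
print(idw_estimate(obs, (0.0, 0.0)))  # ≈ 21.6
```

Skipping `None` values directly mirrors the completeness advantage the study reports for radius-based options over proximity-polygon allocation.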
Welch, Catherine A; Petersen, Irene; Bartlett, Jonathan W; White, Ian R; Marston, Louise; Morris, Richard W; Nazareth, Irwin; Walters, Kate; Carpenter, James
2014-01-01
Most implementations of multiple imputation (MI) of missing data are designed for simple rectangular data structures ignoring temporal ordering of data. Therefore, when applying MI to longitudinal data with intermittent patterns of missing data, some alternative strategies must be considered. One approach is to divide data into time blocks and implement MI independently at each block. An alternative approach is to include all time blocks in the same MI model. With increasing numbers of time blocks, this approach is likely to break down because of collinearity and over-fitting. The new two-fold fully conditional specification (FCS) MI algorithm addresses these issues by only conditioning on measurements that are local in time. We describe and report the results of a novel simulation study to critically evaluate the two-fold FCS algorithm and its suitability for imputation of longitudinal electronic health records. After generating a full data set, approximately 70% of selected continuous and categorical variables were made missing completely at random in each of ten time blocks. Subsequently, we applied a simple time-to-event model. We compared efficiency of estimated coefficients from a complete records analysis, MI of data in the baseline time block and the two-fold FCS algorithm. The results show that the two-fold FCS algorithm maximises the use of data available, with the gain relative to baseline MI depending on the strength of correlations within and between variables. Using this approach also increases plausibility of the missing at random assumption by using repeated measures over time of variables whose baseline values may be missing. PMID:24782349
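The "local in time" conditioning can be sketched very crudely: impute a block using only its adjacent blocks. This is a heavily simplified illustration with one variable, a single deterministic regression pass rather than proper MI draws, and hand-rolled least squares; the helper names are ours, not the published algorithm:

```python
def ols(xs, ys):
    """Simple least-squares fit, returning (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def impute_block(data, t):
    """Fill missing values (None) at time block t conditioning only on
    the adjacent blocks t-1 and t+1 -- the local-in-time idea of the
    two-fold FCS algorithm, reduced to a toy single imputation.
    `data` is a list of per-subject lists of measurements."""
    xs, ys = [], []
    for row in data:
        if None not in (row[t - 1], row[t], row[t + 1]):
            xs.append((row[t - 1] + row[t + 1]) / 2)  # local predictor
            ys.append(row[t])
    slope, intercept = ols(xs, ys)  # fit on temporally complete cases
    for row in data:
        if row[t] is None and None not in (row[t - 1], row[t + 1]):
            row[t] = intercept + slope * (row[t - 1] + row[t + 1]) / 2

rows = [[1, 2, 3], [2, 3, 4], [5, 6, 7], [3, None, 5]]
impute_block(rows, 1)
print(rows[3])  # → [3, 4.0, 5]
```

Conditioning only on neighbouring blocks is what keeps the number of predictors fixed as the number of time blocks grows, avoiding the collinearity and over-fitting the abstract describes.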
NONPARAMETRIC MANOVA APPROACHES FOR NON-NORMAL MULTIVARIATE OUTCOMES WITH MISSING VALUES
He, Fanyin; Mazumdar, Sati; Tang, Gong; Bhatia, Triptish; Anderson, Stewart J.; Dew, Mary Amanda; Krafty, Robert; Nimgaonkar, Vishwajit; Deshpande, Smita; Hall, Martica; Reynolds, Charles F.
2017-01-01
Between-group comparisons often entail many correlated response variables. The multivariate linear model, with its assumption of multivariate normality, is the accepted standard tool for these tests. When this assumption is violated, the nonparametric multivariate Kruskal-Wallis (MKW) test is frequently used. However, this test requires complete cases with no missing values in response variables. Deletion of cases with missing values likely leads to inefficient statistical inference. Here we extend the MKW test to retain information from partially-observed cases. Results of simulated studies and analysis of real data show that the proposed method provides adequate coverage and superior power to complete-case analyses. PMID:29416225
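The complete-case limitation is easy to see concretely. The sketch below shows listwise deletion plus the univariate Kruskal-Wallis H statistic (the building block the multivariate test generalises); it assumes no tied values and is not the authors' proposed extension:

```python
def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction); `groups` is a
    list of lists of distinct response values."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # assumes no ties
    n = len(pooled)
    h = 0.0
    for g in groups:
        rbar = sum(rank[v] for v in g) / len(g)  # mean rank in group
        h += len(g) * (rbar - (n + 1) / 2) ** 2
    return 12.0 / (n * (n + 1)) * h

def complete_cases(rows):
    """Listwise deletion: drop any case with a missing (None) response --
    the information loss the proposed extension is designed to avoid."""
    return [r for r in rows if None not in r]

# Bivariate responses per case; one case in group 1 is partially observed
# and is discarded entirely by the complete-case rule.
g1 = [(1.0, 2.0), (2.0, None), (3.0, 1.0)]
g2 = [(4.0, 5.0), (6.0, 7.0)]
cc1, cc2 = complete_cases(g1), complete_cases(g2)
print(len(cc1), len(g1))  # 2 of 3 cases survive deletion
print(kruskal_h([[r[0] for r in cc1], [r[0] for r in cc2]]))
```

Discarding the partially observed case throws away its fully observed first response as well, which is exactly the inefficiency motivating the extension.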
Bayesian Inference for Growth Mixture Models with Latent Class Dependent Missing Data
ERIC Educational Resources Information Center
Lu, Zhenqiu Laura; Zhang, Zhiyong; Lubke, Gitta
2011-01-01
"Growth mixture models" (GMMs) with nonignorable missing data have drawn increasing attention in research communities but have not been fully studied. The goal of this article is to propose and to evaluate a Bayesian method to estimate the GMMs with latent class dependent missing data. An extended GMM is first presented in which class…
The Impact of Different Missing Data Handling Methods on DINA Model
ERIC Educational Resources Information Center
Sünbül, Seçil Ömür
2018-01-01
In this study, it was aimed to investigate the impact of different missing data handling methods on DINA model parameter estimation and classification accuracy. In the study, simulated data were used and the data were generated by manipulating the number of items and sample size. In the generated data, two different missing data mechanisms…
Missing the Boat--Impact of Just Missing Identification as a High-Performing School
ERIC Educational Resources Information Center
Weiner, Jennie; Donaldson, Morgaen; Dougherty, Shaun M.
2017-01-01
This study capitalizes on the performance identification system under the No Child Left Behind waivers to estimate the school-level impact of just missing formal state recognition as a high-performing school. Using a fuzzy regression-discontinuity design and data from the early years of waiver implementation in Rhode Island, we find that, when…
Toprani, Amita; Madsen, Ann; Das, Tara; Gambatese, Melissa; Greene, Carolyn; Begier, Elizabeth
2014-01-01
New York City (NYC) mandates reporting of all abortion procedures. These reports enable tracking of abortion incidence and underpin programs, policy, and research. Since January 2011, the majority of abortion facilities must report electronically. We conducted an evaluation of NYC's abortion reporting system and its transition to electronic reporting. We summarize the evaluation methodology and results and draw lessons relevant to other vital statistics and public health reporting systems. The evaluation followed Centers for Disease Control and Prevention guidelines for evaluating public health surveillance systems. We interviewed key stakeholders and conducted a data provider survey. In addition, we compared the system's abortion counts with external estimates and calculated the proportion of missing and invalid values for each variable on the report form. Finally, we assessed the process for changing the report form and estimated system costs. NYC Health Department's Bureau of Vital Statistics. Usefulness, simplicity, flexibility, data quality, acceptability, sensitivity, timeliness, and stability of the abortion reporting system. Ninety-five percent of abortion data providers considered abortion reporting important; 52% requested training regarding the report form. Thirty percent reported problems with electronic biometric fingerprint certification, and 18% reported problems with the electronic system's stability. Estimated system sensitivity was 88%. Of 17 variables, education and ancestry had more than 5% missing values in 2010. Changing the electronic reporting module was costly and time-consuming. System operating costs were estimated at $80 136 to $89 057 annually. The NYC abortion reporting system is sensitive and provides high-quality data, but opportunities for improvement include facilitating biometric certification, increasing electronic platform stability, and conducting ongoing outreach and training for data providers. 
This evaluation will help data users determine the degree of confidence that should be placed on abortion data. In addition, the evaluation results are applicable to other vital statistics reporting and surveillance systems.
29 CFR 4050.8 - Automatic lump sum.
Code of Federal Regulations, 2010 CFR
2010-07-01
... present value (determined as of the deemed distribution date under the missing participant lump sum... § 4050.8 Automatic lump sum. This section applies to a missing participant whose designated benefit was...
Godin, Judith; Keefe, Janice; Andrew, Melissa K
2017-04-01
Missing values are commonly encountered on the Mini-Mental State Examination (MMSE), particularly when administered to frail older people. This presents challenges for MMSE scoring in research settings. We sought to describe missingness in MMSEs administered in long-term-care facilities (LTCF) and to compare and contrast approaches to dealing with missing items. As part of the Care and Construction project in Nova Scotia, Canada, LTCF residents completed an MMSE. Different methods of dealing with missing values (e.g., use of raw scores, raw scores/number of items attempted, scale-level multiple imputation [MI], and blended approaches) are compared to item-level MI. The MMSE was administered to 320 residents living in 23 LTCF. The sample was predominantly female (73%), and 38% of participants were aged >85 years. At least one item was missing from 122 (38.2%) of the MMSEs. Data were not Missing Completely at Random (MCAR), χ²(1110) = 1,351, p < 0.001. Using raw scores for those missing <6 items in combination with scale-level MI resulted in the regression coefficients and standard errors closest to item-level MI. Patterns of missing items often suggest systematic problems, such as trouble with manual dexterity, literacy, or visual impairment. While these observations may be relatively easy to take into account in clinical settings, non-random missingness presents challenges for research and must be considered in statistical analyses. We present suggestions for dealing with missing MMSE data based on the extent of missingness and the goal of analyses. Copyright © 2016 The Authors. Production and hosting by Elsevier B.V. All rights reserved.
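The "raw score divided by items attempted" option compared in the study can be sketched as follows; the function name and the (earned, possible) input format are illustrative assumptions, not the study's code:

```python
def prorated_mmse(items):
    """Prorate an MMSE total over the items actually attempted.
    `items` is a list of (earned, possible) point pairs, with earned set
    to None for an item that was not administered. Returns the raw score
    rescaled to the full 30 points, or None if nothing was attempted."""
    earned = sum(e for e, p in items if e is not None)
    attempted = sum(p for e, p in items if e is not None)
    if attempted == 0:
        return None  # no items attempted: score cannot be prorated
    return 30.0 * earned / attempted

# Four illustrative item groups totalling 30 points; one 5-point item
# was skipped, so 23 of 25 attempted points are rescaled to /30.
items = [(5, 5), (None, 5), (10, 10), (8, 10)]
print(prorated_mmse(items))  # 30 * 23/25 ≈ 27.6
```

A blended approach of the kind the authors favour would apply this (or raw scores) only below some missingness cutoff and fall back to scale-level MI beyond it.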
Influence function based variance estimation and missing data issues in case-cohort studies.
Mark, S D; Katki, H
2001-12-01
Recognizing that the efficiency in relative risk estimation for the Cox proportional hazards model is largely constrained by the total number of cases, Prentice (1986) proposed the case-cohort design in which covariates are measured on all cases and on a random sample of the cohort. Subsequent to Prentice, other methods of estimation and sampling have been proposed for these designs. We formalize an approach to variance estimation suggested by Barlow (1994), and derive a robust variance estimator based on the influence function. We consider the applicability of the variance estimator to all the proposed case-cohort estimators, and derive the influence function when known sampling probabilities in the estimators are replaced by observed sampling fractions. We discuss the modifications required when cases are missing covariate information. The missingness may occur by chance, and be completely at random; or may occur as part of the sampling design, and depend upon other observed covariates. We provide an adaptation of S-plus code that allows estimating influence function variances in the presence of such missing covariates. Using examples from our current case-cohort studies on esophageal and gastric cancer, we illustrate how our results are useful in solving design and analytic issues that arise in practice.
NASA Astrophysics Data System (ADS)
Sorba, Robert; Sawicki, Marcin
2018-05-01
We perform spatially resolved, pixel-by-pixel Spectral Energy Distribution (SED) fitting on galaxies up to z ~ 2.5 in the Hubble eXtreme Deep Field (XDF). Comparing stellar mass estimates from spatially resolved and spatially unresolved photometry we find that unresolved masses can be systematically underestimated by factors of up to 5. The ratio of the unresolved to resolved mass measurement depends on the galaxy's specific star formation rate (sSFR): at low sSFRs the bias is small, but above sSFR ~ 10^-9.5 yr^-1 the discrepancy increases rapidly, such that galaxies with sSFRs ~ 10^-8 yr^-1 have unresolved mass estimates of only one-half to one-fifth of the resolved value. This result indicates that stellar masses estimated from spatially unresolved data sets need to be systematically corrected, in some cases by large amounts, and we provide an analytic prescription for applying this correction. We show that correcting stellar mass measurements for this bias changes the normalization and slope of the star-forming main sequence and reduces its intrinsic width; most dramatically, correcting for the mass bias increases the stellar mass density of the Universe at high redshift and can resolve the long-standing discrepancy between the directly measured cosmic SFR density at z ≳ 1 and that inferred from stellar mass densities ('the missing mass problem').
Petkewich, Matthew D.; Conrads, Paul
2013-01-01
The Everglades Depth Estimation Network is an integrated network of real-time water-level gaging stations, a ground-elevation model, and a water-surface elevation model designed to provide scientists, engineers, and water-resource managers with water-level and water-depth information (1991-2013) for the entire freshwater portion of the Greater Everglades. The U.S. Geological Survey Greater Everglades Priority Ecosystems Science provides support for the Everglades Depth Estimation Network in order for the Network to provide quality-assured monitoring data for the U.S. Army Corps of Engineers Comprehensive Everglades Restoration Plan. In a previous study, water-level estimation equations were developed to fill in missing data to increase the accuracy of the daily water-surface elevation model. During this study, those equations were updated because of the addition and removal of water-level gaging stations, the consistent use of water-level data relative to the North American Vertical Datum of 1988, and availability of recent data (March 1, 2006, to September 30, 2011). Up to three linear regression equations were developed for each station by using three different input stations to minimize the occurrences of missing data for an input station. Of the 667 water-level estimation equations developed to fill missing data at 223 stations, more than 72 percent of the equations have coefficients of determination greater than 0.90, and 97 percent have coefficients of determination greater than 0.70.
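The fallback design described above (up to three regression equations per station, tried in preference order depending on which input station reported that day) can be sketched as below. The station series and helper names are hypothetical, not the study's code:

```python
def ols(xs, ys):
    """Least-squares slope and intercept for one input station."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def build_fillers(target, inputs):
    """Fit one simple linear regression of the target station on each
    candidate input station, pairing only days where both report a
    value (None = missing)."""
    models = []
    for series in inputs:
        pairs = [(x, y) for x, y in zip(series, target)
                 if x is not None and y is not None]
        xs, ys = zip(*pairs)
        models.append((series, *ols(list(xs), list(ys))))
    return models

def fill_missing(target, models):
    """Fill each missing day with the first equation whose input
    station reported a value that day."""
    filled = list(target)
    for i, y in enumerate(target):
        if y is None:
            for series, slope, intercept in models:
                if series[i] is not None:
                    filled[i] = intercept + slope * series[i]
                    break
    return filled

# Hypothetical daily water levels: the preferred input station is itself
# missing on day 2, so the second equation fills the gap.
target = [3.0, 5.0, None, 9.0]
first, second = [1.0, 2.0, None, 4.0], [2.0, 4.0, 6.0, 8.0]
print(fill_missing(target, build_fillers(target, [first, second])))
```

Using several input stations per equation set is what minimises the occurrences of unfillable days when one input station is also down.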
Fitting ordinary differential equations to short time course data.
Brewer, Daniel; Barenco, Martino; Callard, Robin; Hubank, Michael; Stark, Jaroslav
2008-02-28
Ordinary differential equations (ODEs) are widely used to model many systems in physics, chemistry, engineering and biology. Often one wants to compare such equations with observed time course data, and use this to estimate parameters. Surprisingly, practical algorithms for doing this are relatively poorly developed, particularly in comparison with the sophistication of numerical methods for solving both initial and boundary value problems for differential equations, and for locating and analysing bifurcations. A lack of good numerical fitting methods is particularly problematic in the context of systems biology where only a handful of time points may be available. In this paper, we present a survey of existing algorithms and describe the main approaches. We also introduce and evaluate a new efficient technique for estimating ODEs linear in parameters particularly suited to situations where noise levels are high and the number of data points is low. It employs a spline-based collocation scheme and alternates linear least squares minimization steps with repeated estimates of the noise-free values of the variables. This is reminiscent of expectation-maximization methods widely used for problems with nuisance parameters or missing data.
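For an ODE linear in its parameters, the fitting step the authors describe reduces to ordinary least squares once the derivative is estimated at the data points. The sketch below substitutes central finite differences for their spline-based collocation and uses a toy model dx/dt = a*x of our own choosing, so it illustrates only the linear-in-parameters structure, not the paper's algorithm:

```python
import math

def fit_rate(t, x):
    """Estimate a in dx/dt = a*x by least squares: approximate the
    derivative at interior points with central finite differences
    (a crude stand-in for spline collocation), then solve
    min_a sum_i (dx_i - a*x_i)^2 in closed form."""
    num = den = 0.0
    for i in range(1, len(t) - 1):
        dx = (x[i + 1] - x[i - 1]) / (t[i + 1] - t[i - 1])
        num += dx * x[i]
        den += x[i] * x[i]
    return num / den

# Noise-free check: sampling x(t) = e^{0.5 t} should recover a rate
# close to the true value 0.5.
ts = [0.1 * k for k in range(30)]
xs = [math.exp(0.5 * tk) for tk in ts]
print(fit_rate(ts, xs))  # close to 0.5
```

With noisy data one would iterate, as the paper proposes, between this least-squares step and re-estimating the noise-free trajectory; the sketch shows only a single pass.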
Error mitigation for CCSD compressed imager data
NASA Astrophysics Data System (ADS)
Gladkova, Irina; Grossberg, Michael; Gottipati, Srikanth; Shahriar, Fazlul; Bonev, George
2009-08-01
To efficiently use the limited bandwidth available on the downlink from satellite to ground station, imager data is usually compressed before transmission. Transmission introduces unavoidable errors, which are only partially removed by forward error correction and packetization. With the commonly used CCSD Rice-based compression, these errors appear as contiguous sequences of dummy values along scan lines in a band of the imager data. We have developed a method capable of using the image statistics to provide a principled estimate of the missing data. Our method outperforms interpolation yet can be performed fast enough to provide uninterrupted data flow. The estimation of the lost data provides significant value to end users who may use only part of the data, may not have statistical tools, or lack the expertise to mitigate the impact of the lost data. Since the locations of the lost data will be clearly marked as meta-data in the HDF or NetCDF header, experts who prefer to handle error mitigation themselves will be free to use or ignore our estimates as they see fit.
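For context, the interpolation baseline that the statistical estimator is reported to outperform can be sketched as a simple cross-track fill: replace a run of dummy values in a scan line with the mean of the vertically adjacent pixels. The DUMMY fill value and band layout are our assumptions for the sketch:

```python
DUMMY = -999  # decompressor's fill value for lost packets (illustrative)

def fill_scanline_gaps(band):
    """Replace dummy values in each scan line (row) with the mean of
    the valid pixels directly above and below -- the simple
    interpolation baseline, not the paper's statistical estimator.
    `band` is a list of rows of pixel values."""
    filled = [row[:] for row in band]
    for r, row in enumerate(band):
        for c, v in enumerate(row):
            if v != DUMMY:
                continue  # pixel arrived intact
            neighbours = [band[n][c] for n in (r - 1, r + 1)
                          if 0 <= n < len(band) and band[n][c] != DUMMY]
            if neighbours:
                filled[r][c] = sum(neighbours) / len(neighbours)
    return filled

# A lost packet wipes out the start of the middle scan line.
band = [[10, 10, 10], [DUMMY, DUMMY, 12], [14, 14, 14]]
print(fill_scanline_gaps(band)[1])  # → [12.0, 12.0, 12]
```

Because compression losses corrupt contiguous runs within one line, the neighbouring lines are usually intact, which is what makes even this baseline (and the stronger statistical estimate) feasible.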
Computational Aspects of N-Mixture Models
Dennis, Emily B; Morgan, Byron JT; Ridout, Martin S
2015-01-01
The N-mixture model is widely used to estimate the abundance of a population in the presence of unknown detection probability from only a set of counts subject to spatial and temporal replication (Royle, 2004, Biometrics 60, 105–115). We explain and exploit the equivalence of N-mixture and multivariate Poisson and negative-binomial models, which provides powerful new approaches for fitting these models. We show that particularly when detection probability and the number of sampling occasions are small, infinite estimates of abundance can arise. We propose a sample covariance as a diagnostic for this event, and demonstrate its good performance in the Poisson case. Infinite estimates may be missed in practice, due to numerical optimization procedures terminating at arbitrarily large values. It is shown that the use of a bound, K, for an infinite summation in the N-mixture likelihood can result in underestimation of abundance, so that default values of K in computer packages should be avoided. Instead we propose a simple automatic way to choose K. The methods are illustrated by analysis of data on Hermann's tortoise Testudo hermanni. PMID:25314629
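The role of the truncation bound K is easy to see from the likelihood itself. Below is a minimal single-site sketch under the Poisson mixing assumption (function names and the example counts are ours, not the authors' code): the latent abundance N is summed from the largest observed count up to K, so too small a K discards real probability mass:

```python
import math

def log_pois(n, lam):
    """log of the Poisson pmf via lgamma (avoids factorial overflow)."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def log_binom(c, n, p):
    """log of the Binomial(c; n, p) pmf."""
    return (math.lgamma(n + 1) - math.lgamma(c + 1)
            - math.lgamma(n - c + 1)
            + c * math.log(p) + (n - c) * math.log1p(-p))

def nmix_loglik(counts, lam, p, K):
    """N-mixture log-likelihood for one site: sum over latent abundance
    N from max(count) to the bound K of Poisson(N; lam) times the
    product over visits of Binomial(c_t; N, p)."""
    terms = [log_pois(n, lam) + sum(log_binom(c, n, p) for c in counts)
             for n in range(max(counts), K + 1)]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Truncating at a small K omits probability mass; the log-likelihood
# increases with K and then stabilises once K is large enough.
print(nmix_loglik([3, 2, 4], lam=10.0, p=0.3, K=15))
print(nmix_loglik([3, 2, 4], lam=10.0, p=0.3, K=200))
```

Maximising a likelihood truncated at a default K therefore biases the abundance estimate downward, which is the behaviour the paper warns about; their automatic rule chooses K where this sum has effectively converged.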
NASA Astrophysics Data System (ADS)
Carson, Richard T.; Mitchell, Robert Cameron
1993-07-01
This paper presents the findings of a study designed to determine the national benefits of freshwater pollution control. By using data from a national contingent valuation survey, we estimate the aggregate benefits of meeting the goals of the Clean Water Act. A valuation function is estimated which depicts willingness to pay as a function of water quality, income, and other variables. Several validation checks and tests for specific biases are performed, and the benefit estimates are corrected for missing and invalid responses. The two major policy implications from our work are that the benefits and costs of water pollution control efforts are roughly equal and that many of the new policy actions necessary to ensure that all water bodies reach at least a swimmable quality level will not have positive net benefits.