Variable Selection with Prior Information for Generalized Linear Models via the Prior LASSO Method.
Jiang, Yuan; He, Yunxiao; Zhang, Heping
LASSO is a popular statistical tool often used in conjunction with generalized linear models that can simultaneously select variables and estimate parameters. When there are many variables of interest, as in current biological and biomedical studies, the power of LASSO can be limited. Fortunately, a great deal of biological and biomedical data has been collected, and it may contain useful information about the importance of certain variables. This paper proposes an extension of LASSO, namely, prior LASSO (pLASSO), to incorporate that prior information into penalized generalized linear models. The goal is achieved by adding to the LASSO criterion function an additional measure of the discrepancy between the prior information and the model. For linear regression, the whole solution path of the pLASSO estimator can be found with a procedure similar to the Least Angle Regression (LARS). Asymptotic theories and simulation results show that pLASSO provides significant improvement over LASSO when the prior information is relatively accurate. When the prior information is less reliable, pLASSO shows great robustness to the misspecification. We illustrate the application of pLASSO using a real data set from a genome-wide association study.
NASA Astrophysics Data System (ADS)
Pandremmenou, K.; Tziortziotis, N.; Paluri, S.; Zhang, W.; Blekas, K.; Kondi, L. P.; Kumar, S.
2015-03-01
We propose the use of the Least Absolute Shrinkage and Selection Operator (LASSO) regression method in order to predict the Cumulative Mean Squared Error (CMSE) incurred by the loss of individual slices in video transmission. We extract a number of quality-relevant features from the H.264/AVC video sequences, which are given as input to the LASSO. This method not only keeps a subset of the features that have the strongest effects on video quality, but also produces accurate CMSE predictions. In particular, we study LASSO regression through two different architectures: the Global LASSO (G.LASSO) and the Local LASSO (L.LASSO). In G.LASSO, a single regression model is trained for all slice types together, while in L.LASSO, motivated by the fact that the values for some features are closely dependent on the considered slice type, each slice type has its own regression model, in an effort to improve LASSO's prediction capability. Based on the predicted CMSE values, we group the video slices into four priority classes. Additionally, we consider a video transmission scenario over a noisy channel, where Unequal Error Protection (UEP) is applied to all prioritized slices. The provided results demonstrate the efficiency of LASSO in estimating CMSE with high accuracy, using only a few features.
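The feature-selecting behaviour described above can be illustrated with a minimal sketch. It is not the authors' implementation: the solver (cyclic coordinate descent), the function names `soft_threshold` and `lasso_cd`, the penalty value, and the synthetic data standing in for slice features are all assumptions made for illustration.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n))*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of n rows (each a list of p floats). No intercept is fitted,
    so the columns of X and y are assumed to be roughly centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Average product of column j with the partial residual
            # (the residual with feature j's current contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Synthetic stand-in for quality-relevant features (hypothetical data):
# only features 0 and 3 influence the response.
rng = random.Random(0)
n, p = 200, 8
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0.0, 0.5) for row in X]

beta = lasso_cd(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected feature indices:", selected)
```

Under this sketch, a G.LASSO-style setup would call `lasso_cd` once on all slices, while an L.LASSO-style setup would call it once per slice type on that type's rows only.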
Avalos, Marta; Adroher, Nuria Duran; Lagarde, Emmanuel; Thiessard, Frantz; Grandvalet, Yves; Contrand, Benjamin; Orriols, Ludivine
2012-09-01
Large data sets with many variables provide particular challenges when constructing analytic models. Lasso-related methods provide a useful tool, although one that remains unfamiliar to most epidemiologists. We illustrate the application of lasso methods in an analysis of the impact of prescribed drugs on the risk of a road traffic crash, using a large French nationwide database (PLoS Med 2010;7:e1000366). In the original case-control study, the authors analyzed each exposure separately. We use the lasso method, which can simultaneously perform estimation and variable selection in a single model. We compare point estimates and confidence intervals using (1) a separate logistic regression model for each drug with a Bonferroni correction and (2) lasso shrinkage logistic regression analysis. Shrinkage regression had little effect on (bias corrected) point estimates, but led to less conservative results, noticeably for drugs with moderate levels of exposure. Carbamates, carboxamide derivative and fatty acid derivative antiepileptics, drugs used in opioid dependence, and mineral supplements of potassium showed stronger associations. Lasso is a relevant method in the analysis of databases with a large number of exposures and can be recommended as an alternative to conventional strategies.
Jovanovic, Milos; Radovanovic, Sandro; Vukicevic, Milan; Van Poucke, Sven; Delibasic, Boris
2016-09-01
Quantification and early identification of unplanned readmission risk have the potential to improve the quality of care during hospitalization and after discharge. However, high dimensionality, sparsity, and class imbalance of electronic health data and the complexity of risk quantification, challenge the development of accurate predictive models. Predictive models require a certain level of interpretability in order to be applicable in real settings and create actionable insights. This paper aims to develop accurate and interpretable predictive models for readmission in a general pediatric patient population, by integrating a data-driven model (sparse logistic regression) and domain knowledge based on the international classification of diseases 9th-revision clinical modification (ICD-9-CM) hierarchy of diseases. Additionally, we propose a way to quantify the interpretability of a model and inspect the stability of alternative solutions. The analysis was conducted on >66,000 pediatric hospital discharge records from California, State Inpatient Databases, Healthcare Cost and Utilization Project between 2009 and 2011. We incorporated domain knowledge based on the ICD-9-CM hierarchy in a data driven, Tree-Lasso regularized logistic regression model, providing the framework for model interpretation. This approach was compared with traditional Lasso logistic regression resulting in models that are easier to interpret by fewer high-level diagnoses, with comparable prediction accuracy. The results revealed that the use of a Tree-Lasso model was as competitive in terms of accuracy (measured by area under the receiver operating characteristic curve-AUC) as the traditional Lasso logistic regression, but integration with the ICD-9-CM hierarchy of diseases provided more interpretable models in terms of high-level diagnoses. Additionally, interpretations of models are in accordance with existing medical understanding of pediatric readmission. 
The best-performing models have similar performance, reaching AUC values of 0.783 and 0.779 for traditional Lasso and Tree-Lasso, respectively. However, the information loss of the Lasso models is 0.35 bits higher than that of the Tree-Lasso model. We propose a method for building predictive models applicable to the detection of readmission risk based on electronic health records. Integration of domain knowledge (in the form of the ICD-9-CM taxonomy) and a data-driven, sparse predictive algorithm (Tree-Lasso logistic regression) resulted in an increase in the interpretability of the resulting model. The models are interpreted for the readmission prediction problem in the general pediatric population in California, as well as several important subpopulations, and the interpretations of the models comply with existing medical understanding of pediatric readmission. Finally, a quantitative assessment of the interpretability of the models is given that goes beyond simple counts of selected low-level features. Copyright © 2016 Elsevier B.V. All rights reserved.
Bayesian Adaptive Lasso for Ordinal Regression with Latent Variables
ERIC Educational Resources Information Center
Feng, Xiang-Nan; Wu, Hao-Tian; Song, Xin-Yuan
2017-01-01
We consider an ordinal regression model with latent variables to investigate the effects of observable and latent explanatory variables on the ordinal responses of interest. Each latent variable is characterized by correlated observed variables through a confirmatory factor analysis model. We develop a Bayesian adaptive lasso procedure to conduct…
Prediction of siRNA potency using sparse logistic regression.
Hu, Wei; Hu, John
2014-06-01
RNA interference (RNAi) can modulate gene expression at post-transcriptional as well as transcriptional levels. Short interfering RNA (siRNA) serves as a trigger for the RNAi gene inhibition mechanism, and therefore is a crucial intermediate step in RNAi. There have been extensive studies to identify the sequence characteristics of potent siRNAs. One such study built a linear model using LASSO (Least Absolute Shrinkage and Selection Operator) to measure the contribution of each siRNA sequence feature. This model is simple and interpretable, but it requires a large number of nonzero weights. We have introduced a novel technique, sparse logistic regression, to build a linear model using single-position specific nucleotide compositions which has the same prediction accuracy as the linear model based on LASSO. The weights in our new model share the same general trend as those in the previous model, but have only 25 nonzero weights out of a total of 84 weights, a 54% reduction compared to the previous model. Contrary to the linear model based on LASSO, our model suggests that only a few positions are influential on the efficacy of the siRNA, which are the 5' and 3' ends and the seed region of siRNA sequences. We also employed sparse logistic regression to build a linear model using dual-position specific nucleotide compositions, a task LASSO is not able to accomplish well due to its high dimensional nature. Our results demonstrate the superiority of sparse logistic regression as a technique for both feature selection and regression over LASSO in the context of siRNA design.
Huang, Jian; Zhang, Cun-Hui
2013-01-01
The ℓ1-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. PMID:24348100
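For orientation, the weighted ℓ1-penalized estimator discussed above can be written in a generic form. The notation below (the loss $\ell_n$, weights $w_j$, initial estimate $\tilde\beta$, and exponent $\gamma$) is assumed for illustration and is not taken from the article; the adaptive Lasso weights follow the commonly used convention.

```latex
% Weighted l1-penalized estimator for a convex loss \ell_n (e.g. a GLM negative log-likelihood)
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^p}
  \Big\{ \ell_n(\beta) \;+\; \lambda \sum_{j=1}^{p} w_j \,\lvert \beta_j \rvert \Big\},
  \qquad w_j \ge 0 .

% Ordinary Lasso: w_j = 1 for all j.
% Adaptive Lasso (common convention): weights built from an initial estimate \tilde{\beta},
w_j \;=\; \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}}, \qquad \gamma > 0 .
```

Coefficients with a large initial estimate receive a small penalty weight, which is why reapplying the weighted fit recursively can mimic a concave penalty.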
Jang, Dae-Heung; Anderson-Cook, Christine Michaela
2016-11-22
With many predictors in regression, fitting the full model can induce multicollinearity problems. Least Absolute Shrinkage and Selection Operator (LASSO) is useful when the effects of many explanatory variables are sparse in a high-dimensional dataset. Influential points can have a disproportionate impact on the estimated values of model parameters. Here, this paper describes a new influence plot that can be used to increase understanding of the contributions of individual observations and the robustness of results. This can serve as a complement to other regression diagnostics techniques in the LASSO regression setting. Using this influence plot, we can find influential points and their impact on shrinkage of model parameters and model selection. Lastly, we provide two examples to illustrate the methods.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jang, Dae-Heung; Anderson-Cook, Christine Michaela
With many predictors in regression, fitting the full model can induce multicollinearity problems. Least Absolute Shrinkage and Selection Operator (LASSO) is useful when the effects of many explanatory variables are sparse in a high-dimensional dataset. Influential points can have a disproportionate impact on the estimated values of model parameters. Here, this paper describes a new influence plot that can be used to increase understanding of the contributions of individual observations and the robustness of results. This can serve as a complement to other regression diagnostics techniques in the LASSO regression setting. Using this influence plot, we can find influential points and their impact on shrinkage of model parameters and model selection. Lastly, we provide two examples to illustrate the methods.
Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso.
Kong, Shengchun; Nan, Bin
2014-01-01
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulties caused by lacking iid Lipschitz losses.
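To make the non-iid structure concrete, the lasso-penalized Cox objective can be sketched in its standard textbook form. The notation is assumed here rather than taken from the paper: $t_i$ is the observed time, $\delta_i$ the event indicator, and $x_i$ the covariate vector of subject $i$, with no tied event times.

```latex
% Negative log partial likelihood (scaled by 1/n)
\ell_n(\beta) \;=\; -\frac{1}{n} \sum_{i=1}^{n} \delta_i
  \Big\{ x_i^{\top}\beta \;-\; \log \sum_{k:\, t_k \ge t_i} \exp\!\big(x_k^{\top}\beta\big) \Big\}

% Lasso-penalized Cox estimator
\hat{\beta} \;=\; \arg\min_{\beta}\; \Big\{ \ell_n(\beta) + \lambda \lVert \beta \rVert_1 \Big\}
```

Each summand involves the whole risk set $\{k : t_k \ge t_i\}$, so the terms share observations and are not independent, which is the difficulty the abstract refers to.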
Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso
Kong, Shengchun; Nan, Bin
2013-01-01
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulties caused by lacking iid Lipschitz losses. PMID:24516328
Large unbalanced credit scoring using Lasso-logistic regression ensemble.
Wang, Hong; Xu, Qingsong; Zhou, Lifeng
2015-01-01
Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data.
Empirical Performance of Cross-Validation With Oracle Methods in a Genomics Context.
Martinez, Josue G; Carroll, Raymond J; Müller, Samuel; Sampson, Joshua N; Chatterjee, Nilanjan
2011-11-01
When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
Large Unbalanced Credit Scoring Using Lasso-Logistic Regression Ensemble
Wang, Hong; Xu, Qingsong; Zhou, Lifeng
2015-01-01
Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data. PMID:25706988
High dimensional linear regression models under long memory dependence and measurement error
NASA Astrophysics Data System (ADS)
Kaul, Abhishek
This dissertation consists of three chapters. The first chapter introduces the models under consideration and motivates problems of interest. A brief literature review is also provided in this chapter. The second chapter investigates the properties of Lasso under long range dependent model errors. Lasso is a computationally efficient approach to model selection and estimation, and its properties are well studied when the regression errors are independent and identically distributed. We study the case where the regression errors form a long memory moving average process. We establish a finite sample oracle inequality for the Lasso solution. We then show the asymptotic sign consistency in this setup. These results are established in the high dimensional setup (p > n) where p can be increasing exponentially with n. Finally, we show the consistency and n^(1/2-d)-consistency of Lasso, along with the oracle property of adaptive Lasso, in the case where p is fixed. Here d is the memory parameter of the stationary error sequence. The performance of Lasso is also analysed in the present setup with a simulation study. The third chapter proposes and investigates the properties of a penalized quantile based estimator for measurement error models. Standard formulations of prediction problems in high dimension regression models assume the availability of fully observed covariates and sub-Gaussian and homogeneous model errors. This makes these methods inapplicable to measurement error models where covariates are unobservable and observations are possibly non sub-Gaussian and heterogeneous. We propose weighted penalized corrected quantile estimators for the regression parameter vector in linear regression models with additive measurement errors, where unobservable covariates are nonrandom. The proposed estimators forgo the need for the above mentioned model assumptions.
We study these estimators in both the fixed dimension and high dimensional sparse setups; in the latter setup, the dimensionality can grow exponentially with the sample size. In the fixed dimensional setting we provide the oracle properties associated with the proposed estimators. In the high dimensional setting, we provide bounds for the statistical error associated with the estimation that hold with asymptotic probability 1, thereby providing the ℓ1-consistency of the proposed estimator. We also establish the model selection consistency in terms of the correctly estimated zero components of the parameter vector. A simulation study that investigates the finite sample accuracy of the proposed estimator is also included in this chapter.
Empirical Performance of Cross-Validation With Oracle Methods in a Genomics Context
Martinez, Josue G.; Carroll, Raymond J.; Müller, Samuel; Sampson, Joshua N.; Chatterjee, Nilanjan
2012-01-01
When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso. PMID:22347720
Statistical downscaling modeling with quantile regression using lasso to estimate extreme rainfall
NASA Astrophysics Data System (ADS)
Santri, Dewi; Wigena, Aji Hamim; Djuraidah, Anik
2016-02-01
Rainfall is a highly variable climatic element, and extreme rainfall in particular has many negative impacts. Methods are therefore required to minimize the damage that may occur. So far, global circulation models (GCMs) are the best method for forecasting global climate change, including extreme rainfall. Statistical downscaling (SD) is a technique for developing the relationship between GCM output, as global-scale independent variables, and rainfall, as a local-scale response variable. GCM output is difficult to assess against observations because it is high-dimensional and its variables are multicollinear. Common methods for handling this problem are principal components analysis (PCA) and partial least squares regression. A newer method that can be used is the lasso, which has the advantage of simultaneously controlling the variance of the fitted coefficients and performing automatic variable selection. Quantile regression is a method that can be used to detect extreme rainfall at both the dry and wet extremes. The objective of this study is to model SD using quantile regression with the lasso to predict extreme rainfall in Indramayu. The results show that extreme rainfall (extreme wet in January, February and December) in Indramayu could be predicted properly by the model at the 90th quantile.
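The combination of quantile regression and the lasso mentioned above is usually written as an ℓ1-penalized check-loss problem. The form below is the standard one and is given only as a sketch; the symbols ($y_i$ for local rainfall, $x_i$ for the GCM predictors, $\tau$ for the quantile level) are assumed notation, not the authors'.

```latex
% Check (pinball) loss at quantile level \tau \in (0,1)
\rho_{\tau}(u) \;=\; u \,\big( \tau - \mathbf{1}\{u < 0\} \big)

% Lasso-penalized quantile regression estimator
\hat{\beta}(\tau) \;=\; \arg\min_{\beta}\;
  \sum_{i=1}^{n} \rho_{\tau}\!\big( y_i - x_i^{\top}\beta \big)
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Extreme wet conditions correspond to a high quantile, e.g. \tau = 0.9.
```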
Integrative eQTL analysis of tumor and host omics data in individuals with bladder cancer.
Pineda, Silvia; Van Steen, Kristel; Malats, Núria
2017-09-01
Integrative analyses of several omics data are emerging. The data are usually generated from the same source material (i.e., tumor sample) representing one level of regulation. However, integrating different regulatory levels (i.e., blood) with those from tumor may also reveal important knowledge about the human genetic architecture. To model this multilevel structure, an integrative-expression quantitative trait loci (eQTL) analysis applying two-stage regression (2SR) was proposed. This approach first regressed tumor gene expression levels with tumor markers and the adjusted residuals from the previous model were then regressed with the germline genotypes measured in blood. Previously, we demonstrated that penalized regression methods in combination with a permutation-based MaxT method (Global-LASSO) is a promising tool to fix some of the challenges that high-throughput omics data analysis imposes. Here, we assessed whether Global-LASSO can also be applied when tumor and blood omics data are integrated. We further compared our strategy with two 2SR-approaches, one using multiple linear regression (2SR-MLR) and the other using LASSO (2SR-LASSO). We applied the three models to integrate genomic, epigenomic, and transcriptomic data from tumor tissue with blood germline genotypes from 181 individuals with bladder cancer included in the TCGA Consortium. Global-LASSO provided a larger list of eQTLs than the 2SR methods, identified a previously reported eQTL in prostate stem cell antigen (PSCA), and provided further clues on the complexity of APBEC3B loci, with a minimal false-positive rate not achieved by 2SR-MLR. It also represents an important contribution for omics integrative analysis because it is easy to apply and adaptable to any type of data. © 2017 WILEY PERIODICALS, INC.
A Permutation Approach for Selecting the Penalty Parameter in Penalized Model Selection
Sabourin, Jeremy A; Valdar, William; Nobel, Andrew B
2015-01-01
We describe a simple, computationally efficient, permutation-based procedure for selecting the penalty parameter in LASSO penalized regression. The procedure, permutation selection, is intended for applications where variable selection is the primary focus, and can be applied in a variety of structural settings, including that of generalized linear models. We briefly discuss connections between permutation selection and existing theory for the LASSO. In addition, we present a simulation study and an analysis of real biomedical data sets in which permutation selection is compared with selection based on the following: cross-validation (CV), the Bayesian information criterion (BIC), Scaled Sparse Linear Regression, and a selection method based on recently developed testing procedures for the LASSO. PMID:26243050
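The abstract does not spell out the procedure, so the following is only a sketch of one plausible reading for linear regression: permute the response to break its link with the predictors, record the smallest penalty at which the lasso would select no variables for that permuted data set, and take the median over permutations. The function names `lambda_max` and `permutation_lambda`, the use of the median, and the synthetic data are assumptions for illustration.

```python
import random
import statistics


def lambda_max(X, y):
    """Smallest penalty at which the lasso solution of
    (1/(2n))*||y - X b||^2 + lam*||b||_1 is entirely zero: max_j |x_j^T y| / n.
    Columns of X and y are assumed to be roughly centered."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))


def permutation_lambda(X, y, n_perm=100, seed=0):
    """Median over permutations of the 'select nothing' penalty (assumed summary)."""
    rng = random.Random(seed)
    lams = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # permuting y removes any real association with X
        lams.append(lambda_max(X, y_perm))
    return statistics.median(lams)


# Hypothetical data: one strong predictor among eight.
rng = random.Random(1)
n, p = 200, 8
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] + rng.gauss(0.0, 1.0) for row in X]

lam_perm = permutation_lambda(X, y)
lam_null = lambda_max(X, y)
print("permutation-selected penalty:", lam_perm)
print("penalty that removes every variable on the real data:", lam_null)
```

The design intent in this reading is that the chosen penalty is just large enough to suppress associations that arise by chance, while still lying below the penalty needed to remove a genuine signal.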
Detection of Differential Item Functioning Using the Lasso Approach
ERIC Educational Resources Information Center
Magis, David; Tuerlinckx, Francis; De Boeck, Paul
2015-01-01
This article proposes a novel approach to detect differential item functioning (DIF) among dichotomously scored items. Unlike standard DIF methods that perform an item-by-item analysis, we propose the "LR lasso DIF method": logistic regression (LR) model is formulated for all item responses. The model contains item-specific intercepts,…
NASA Astrophysics Data System (ADS)
Setiyorini, Anis; Suprijadi, Jadi; Handoko, Budhi
2017-03-01
Geographically Weighted Regression (GWR) is a regression model that takes into account the spatial heterogeneity effect. In the application of the GWR, inference on regression coefficients is often of interest, as is estimation and prediction of the response variable. Empirical research and studies have demonstrated that local correlation between explanatory variables can lead to estimated regression coefficients in GWR that are strongly correlated, a condition named multicollinearity. This in turn results in large standard errors for the estimated regression coefficients and is hence problematic for inference on relationships between variables. Geographically Weighted Lasso (GWL) is a method that is capable of dealing with spatial heterogeneity and local multicollinearity in spatial data sets. GWL is a further development of the GWR method, which adds a LASSO (Least Absolute Shrinkage and Selection Operator) constraint in parameter estimation. In this study, GWL is applied using a fixed exponential kernel weights matrix to build a poverty model for Java Island, Indonesia. The results of applying the GWL to poverty datasets show that this method stabilizes regression coefficients in the presence of multicollinearity and produces lower prediction and estimation error of the response variable than GWR does.
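As a rough sketch of how the LASSO constraint enters the local fit, GWL can be written as a kernel-weighted least squares problem at each location with an ℓ1 bound on the local coefficients. The notation is assumed for illustration: $d_{ij}$ is the distance between locations $i$ and $j$, $b$ is a bandwidth shared by all locations (the "fixed" kernel), and $s_i$ is the local shrinkage bound.

```latex
% Exponential kernel weight between locations i and j, with fixed bandwidth b
w_{ij} \;=\; \exp\!\left( -\frac{d_{ij}}{b} \right)

% Local lasso-constrained fit at location i
\hat{\beta}(i) \;=\; \arg\min_{\beta}\;
  \sum_{j=1}^{n} w_{ij}\, \big( y_j - x_j^{\top}\beta \big)^2
  \quad \text{subject to} \quad \sum_{k=1}^{p} \lvert \beta_k \rvert \;\le\; s_i
```

Setting $s_i$ large enough that the constraint is inactive recovers the ordinary GWR fit at location $i$.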
Xu, Chao; Fang, Jian; Shen, Hui; Wang, Yu-Ping; Deng, Hong-Wen
2018-01-25
Extreme phenotype sampling (EPS) is a broadly-used design to identify candidate genetic factors contributing to the variation of quantitative traits. By enriching the signals in extreme phenotypic samples, EPS can boost the association power compared to random sampling. Most existing statistical methods for EPS examine the genetic factors individually, even though many quantitative traits have multiple genetic factors underlying their variation. It is desirable to model the joint effects of genetic factors, which may increase the power and identify novel quantitative trait loci under EPS. The joint analysis of genetic data in high-dimensional situations requires specialized techniques, e.g., the least absolute shrinkage and selection operator (LASSO). Although there is extensive research and application related to LASSO, the statistical inference and testing for the sparse model under EPS remain unknown. We propose a novel sparse model (EPS-LASSO) with hypothesis test for high-dimensional regression under EPS based on a decorrelated score function. The comprehensive simulation shows EPS-LASSO outperforms existing methods with stable type I error and FDR control. EPS-LASSO can provide a consistent power for both low- and high-dimensional situations compared with the other methods dealing with high-dimensional situations. The power of EPS-LASSO is close to other low-dimensional methods when the causal effect sizes are small and is superior when the effects are large. Applying EPS-LASSO to a transcriptome-wide gene expression study for obesity reveals 10 significant body mass index associated genes. Our results indicate that EPS-LASSO is an effective method for EPS data analysis, which can account for correlated predictors. The source code is available at https://github.com/xu1912/EPSLASSO. Supplementary data are available at Bioinformatics online. © The Author (2018). Published by Oxford University Press. All rights reserved.
The Bayesian group lasso for confounded spatial data
Hefley, Trevor J.; Hooten, Mevin B.; Hanks, Ephraim M.; Russell, Robin E.; Walsh, Daniel P.
2017-01-01
Generalized linear mixed models for spatial processes are widely used in applied statistics. In many applications of the spatial generalized linear mixed model (SGLMM), the goal is to obtain inference about regression coefficients while achieving optimal predictive ability. When implementing the SGLMM, multicollinearity among covariates and the spatial random effects can make computation challenging and influence inference. We present a Bayesian group lasso prior with a single tuning parameter that can be chosen to optimize predictive ability of the SGLMM and jointly regularize the regression coefficients and spatial random effect. We implement the group lasso SGLMM using efficient Markov chain Monte Carlo (MCMC) algorithms and demonstrate how multicollinearity among covariates and the spatial random effect can be monitored as a derived quantity. To test our method, we compared several parameterizations of the SGLMM using simulated data and two examples from plant ecology and disease ecology. In all examples, problematic levels of multicollinearity occurred and influenced sampling efficiency and inference. We found that the group lasso prior resulted in roughly twice the effective sample size for MCMC samples of regression coefficients and can have higher and less variable predictive accuracy based on out-of-sample data when compared to the standard SGLMM.
Bayesian LASSO, scale space and decision making in association genetics.
Pasanen, Leena; Holmström, Lasse; Sillanpää, Mikko J
2015-01-01
LASSO is a penalized regression method that facilitates model fitting in situations where there are as many, or even more explanatory variables than observations, and only a few variables are relevant in explaining the data. We focus on the Bayesian version of LASSO and consider four problems that need special attention: (i) controlling false positives, (ii) multiple comparisons, (iii) collinearity among explanatory variables, and (iv) the choice of the tuning parameter that controls the amount of shrinkage and the sparsity of the estimates. The particular application considered is association genetics, where LASSO regression can be used to find links between chromosome locations and phenotypic traits in a biological organism. However, the proposed techniques are relevant also in other contexts where LASSO is used for variable selection. We separate the true associations from false positives using the posterior distribution of the effects (regression coefficients) provided by Bayesian LASSO. We propose to solve the multiple comparisons problem by using simultaneous inference based on the joint posterior distribution of the effects. Bayesian LASSO also tends to distribute an effect among collinear variables, making detection of an association difficult. We propose to solve this problem by considering not only individual effects but also their functionals (i.e. sums and differences). Finally, whereas in Bayesian LASSO the tuning parameter is often regarded as a random variable, we adopt a scale space view and consider a whole range of fixed tuning parameters, instead. The effect estimates and the associated inference are considered for all tuning parameters in the selected range and the results are visualized with color maps that provide useful insights into data and the association problem considered. The methods are illustrated using two sets of artificial data and one real data set, all representing typical settings in association genetics.
Pérez-Rodríguez, Paulino; Gianola, Daniel; González-Camacho, Juan Manuel; Crossa, José; Manès, Yann; Dreisigacker, Susanne
2012-01-01
In genome-enabled prediction, parametric, semi-parametric, and non-parametric regression models have been used. This study assessed the predictive ability of linear and non-linear models using dense molecular markers. The linear models were linear on marker effects and included the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B. The non-linear models (this refers to non-linearity on markers) were reproducing kernel Hilbert space (RKHS) regression, Bayesian regularized neural networks (BRNN), and radial basis function neural networks (RBFNN). These statistical models were compared using 306 elite wheat lines from CIMMYT genotyped with 1717 diversity array technology (DArT) markers and two traits, days to heading (DTH) and grain yield (GY), measured in each of 12 environments. It was found that the three non-linear models had better overall prediction accuracy than the linear regression specification. Results showed a consistent superiority of RKHS and RBFNN over the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B models. PMID:23275882
Lee, Tsair-Fwu; Liou, Ming-Hsiang; Huang, Yu-Jie; Chao, Pei-Ju; Ting, Hui-Min; Lee, Hsiao-Yi
2014-01-01
This study aimed to predict the incidence of moderate-to-severe patient-reported xerostomia among head and neck squamous cell carcinoma (HNSCC) and nasopharyngeal carcinoma (NPC) patients treated with intensity-modulated radiotherapy (IMRT). Multivariable normal tissue complication probability (NTCP) models were developed by using quality of life questionnaire datasets from 152 patients with HNSCC and 84 patients with NPC. The primary endpoint was defined as moderate-to-severe xerostomia after IMRT. The numbers of predictive factors for a multivariable logistic regression model were determined using the least absolute shrinkage and selection operator (LASSO) with bootstrapping technique. LASSO yielded four predictive models with the smallest number of factors that preserved predictive value with higher AUC performance. For all models, the dosimetric factors for the mean dose given to the contralateral and ipsilateral parotid gland were selected as the most significant predictors. These were followed by different clinical and socio-economic factors, namely age, financial status, T stage, and education, which were selected for different models. The predicted incidence of xerostomia for HNSCC and NPC patients can be improved by using multivariable logistic regression models with LASSO technique. The predictive model developed in HNSCC cannot be generalized to the NPC cohort treated with IMRT without validation, and vice versa. PMID:25163814
Ting, Hui-Min; Chang, Liyun; Huang, Yu-Jie; Wu, Jia-Ming; Wang, Hung-Yu; Horng, Mong-Fong; Chang, Chun-Ming; Lan, Jen-Hong; Huang, Ya-Yu; Fang, Fu-Min; Leung, Stephen Wan
2014-01-01
Purpose: The aim of this study was to develop a multivariate logistic regression model with least absolute shrinkage and selection operator (LASSO) to make valid predictions about the incidence of moderate-to-severe patient-rated xerostomia among head and neck cancer (HNC) patients treated with IMRT. Methods and Materials: Quality of life questionnaire datasets from 206 patients with HNC were analyzed. The European Organization for Research and Treatment of Cancer QLQ-H&N35 and QLQ-C30 questionnaires were used as the endpoint evaluation. The primary endpoint (grade 3+ xerostomia) was defined as moderate-to-severe xerostomia at 3 (XER3m) and 12 months (XER12m) after the completion of IMRT. Normal tissue complication probability (NTCP) models were developed. The optimal and suboptimal numbers of prognostic factors for a multivariate logistic regression model were determined using the LASSO with bootstrapping technique. Statistical analysis was performed using the scaled Brier score, Nagelkerke R², chi-squared test, Omnibus, Hosmer-Lemeshow test, and the AUC. Results: Eight prognostic factors were selected by LASSO for the 3-month time point: Dmean-c, Dmean-i, age, financial status, T stage, AJCC stage, smoking, and education. Nine prognostic factors were selected for the 12-month time point: Dmean-i, education, Dmean-c, smoking, T stage, baseline xerostomia, alcohol abuse, family history, and node classification. In the selection of the suboptimal number of prognostic factors by LASSO, three suboptimal prognostic factors were fine-tuned by Hosmer-Lemeshow test and AUC, i.e., Dmean-c, Dmean-i, and age for the 3-month time point. Five suboptimal prognostic factors were also selected for the 12-month time point, i.e., Dmean-i, education, Dmean-c, smoking, and T stage. The overall performance for both time points of the NTCP model in terms of scaled Brier score, Omnibus, and Nagelkerke R² was satisfactory and corresponded well with the expected values.
Conclusions: Multivariate NTCP models with LASSO can be used to predict patient-rated xerostomia after IMRT. PMID:24586971
Lee, Tsair-Fwu; Chao, Pei-Ju; Ting, Hui-Min; Chang, Liyun; Huang, Yu-Jie; Wu, Jia-Ming; Wang, Hung-Yu; Horng, Mong-Fong; Chang, Chun-Ming; Lan, Jen-Hong; Huang, Ya-Yu; Fang, Fu-Min; Leung, Stephen Wan
2014-01-01
The aim of this study was to develop a multivariate logistic regression model with least absolute shrinkage and selection operator (LASSO) to make valid predictions about the incidence of moderate-to-severe patient-rated xerostomia among head and neck cancer (HNC) patients treated with IMRT. Quality of life questionnaire datasets from 206 patients with HNC were analyzed. The European Organization for Research and Treatment of Cancer QLQ-H&N35 and QLQ-C30 questionnaires were used as the endpoint evaluation. The primary endpoint (grade 3(+) xerostomia) was defined as moderate-to-severe xerostomia at 3 (XER3m) and 12 months (XER12m) after the completion of IMRT. Normal tissue complication probability (NTCP) models were developed. The optimal and suboptimal numbers of prognostic factors for a multivariate logistic regression model were determined using the LASSO with bootstrapping technique. Statistical analysis was performed using the scaled Brier score, Nagelkerke R(2), chi-squared test, Omnibus, Hosmer-Lemeshow test, and the AUC. Eight prognostic factors were selected by LASSO for the 3-month time point: Dmean-c, Dmean-i, age, financial status, T stage, AJCC stage, smoking, and education. Nine prognostic factors were selected for the 12-month time point: Dmean-i, education, Dmean-c, smoking, T stage, baseline xerostomia, alcohol abuse, family history, and node classification. In the selection of the suboptimal number of prognostic factors by LASSO, three suboptimal prognostic factors were fine-tuned by Hosmer-Lemeshow test and AUC, i.e., Dmean-c, Dmean-i, and age for the 3-month time point. Five suboptimal prognostic factors were also selected for the 12-month time point, i.e., Dmean-i, education, Dmean-c, smoking, and T stage. The overall performance for both time points of the NTCP model in terms of scaled Brier score, Omnibus, and Nagelkerke R(2) was satisfactory and corresponded well with the expected values. 
Multivariate NTCP models with LASSO can be used to predict patient-rated xerostomia after IMRT.
Kim, Sun Mi; Kim, Yongdai; Jeong, Kuhwan; Jeong, Heeyeong; Kim, Jiyoung
2018-01-01
The aim of this study was to compare the performance of image analysis for predicting breast cancer using two distinct regression models and to evaluate the usefulness of incorporating clinical and demographic data (CDD) into the image analysis in order to improve the diagnosis of breast cancer. This study included 139 solid masses from 139 patients who underwent an ultrasonography-guided core biopsy and had available CDD between June 2009 and April 2010. Three breast radiologists retrospectively reviewed 139 breast masses and described each lesion using the Breast Imaging Reporting and Data System (BI-RADS) lexicon. We applied and compared two regression methods, stepwise logistic (SL) regression and logistic least absolute shrinkage and selection operator (LASSO) regression, in which the BI-RADS descriptors and CDD were used as covariates. We investigated the performances of these regression methods and the agreement of radiologists in terms of test misclassification error and the area under the curve (AUC) of the tests. Logistic LASSO regression was superior (P<0.05) to SL regression, regardless of whether CDD was included in the covariates, in terms of test misclassification errors (0.234 vs. 0.253, without CDD; 0.196 vs. 0.258, with CDD) and AUC (0.785 vs. 0.759, without CDD; 0.873 vs. 0.735, with CDD). However, it was inferior (P<0.05) to the agreement of three radiologists in terms of test misclassification errors (0.234 vs. 0.168, without CDD; 0.196 vs. 0.088, with CDD) and the AUC without CDD (0.785 vs. 0.844, P<0.001), but was comparable to the AUC with CDD (0.873 vs. 0.880, P=0.141). Logistic LASSO regression based on BI-RADS descriptors and CDD showed better performance than SL in predicting the presence of breast cancer. The use of CDD as a supplement to the BI-RADS descriptors significantly improved the prediction of breast cancer using logistic LASSO regression.
Ternès, Nils; Rotolo, Federico; Michiels, Stefan
2016-07-10
Correct selection of prognostic biomarkers among multiple candidates is becoming increasingly challenging as the dimensionality of biological data becomes higher. Therefore, minimizing the false discovery rate (FDR) is of primary importance, while a low false negative rate (FNR) is a complementary measure. The lasso is a popular selection method in Cox regression, but its results depend heavily on the penalty parameter λ. Usually, λ is chosen using maximum cross-validated log-likelihood (max-cvl). However, this method often has a very high FDR. We review methods for a more conservative choice of λ. We propose an empirical extension of the cvl by adding a penalization term, which trades off between the goodness-of-fit and the parsimony of the model, leading to the selection of fewer biomarkers and, as we show, to the reduction of the FDR without a large increase in FNR. We conducted a simulation study considering null and moderately sparse alternative scenarios and compared our approach with the standard lasso and 10 other competitors: Akaike information criterion (AIC), corrected AIC, Bayesian information criterion (BIC), extended BIC, Hannan and Quinn information criterion (HQIC), risk information criterion (RIC), one-standard-error rule, adaptive lasso, stability selection, and percentile lasso. Our extension achieved the best compromise across all the scenarios between a reduction of the FDR and a limited rise in the FNR, followed by the AIC, the RIC, and the adaptive lasso, which performed well in some settings. We illustrate the methods using gene expression data of 523 breast cancer patients. In conclusion, we propose to apply our extension to the lasso whenever a stringent FDR with a limited FNR is targeted. Copyright © 2016 John Wiley & Sons, Ltd.
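One of the competitors listed above, the one-standard-error rule, is simple enough to sketch. The following is a minimal illustration rather than the authors' code: it assumes the cross-validation summary is expressed as an error to be minimised (for a cross-validated log-likelihood, one would negate it first), and the function name and the numbers are hypothetical.

```python
def one_standard_error_lambda(lambdas, cv_errors, cv_ses):
    """Return the largest lambda whose CV error is within one SE of the best.

    lambdas   : candidate penalty values
    cv_errors : mean cross-validated error for each lambda (lower is better)
    cv_ses    : standard error of each mean
    """
    best = min(range(len(lambdas)), key=lambda i: cv_errors[i])
    threshold = cv_errors[best] + cv_ses[best]
    eligible = [lam for lam, err in zip(lambdas, cv_errors) if err <= threshold]
    return max(eligible)


# Hypothetical cross-validation summary, for illustration only.
lambdas = [0.01, 0.05, 0.10, 0.20, 0.40]
cv_errors = [1.30, 1.22, 1.25, 1.27, 1.60]
cv_ses = [0.06, 0.06, 0.06, 0.06, 0.07]
chosen = one_standard_error_lambda(lambdas, cv_errors, cv_ses)
print(chosen)
```

Picking a larger λ than the cross-validation optimum yields a sparser model, which is the sense in which such rules are more conservative than max-cvl.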
A SIGNIFICANCE TEST FOR THE LASSO
Lockhart, Richard; Taylor, Jonathan; Tibshirani, Ryan J.; Tibshirani, Robert
2014-01-01
In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a $\chi^2_1$ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than $\chi^2_1$ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the $\ell_1$ penalty.
Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties—adaptivity and shrinkage—and its null distribution is tractable and asymptotically Exp(1). PMID:25574062
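The claim that adaptive selection inflates the drop in RSS beyond a chi-squared distribution with one degree of freedom can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions that are not taken from the abstract: orthonormal predictors, unit noise variance, and the global null, so that each candidate's drop in RSS is the square of an independent standard normal draw. The function name is hypothetical.

```python
import random


def mean_rss_drop(p, n_sims=2000, seed=0):
    """Average drop in RSS when the best of p null predictors is added.

    Assumes orthonormal predictors and unit noise variance, so each
    inner product X_j^T y is an independent N(0, 1) draw under the
    global null, and adding predictor j lowers the RSS by (X_j^T y)^2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        drops = [rng.gauss(0.0, 1.0) ** 2 for _ in range(p)]
        total += max(drops)  # greedy choice: largest drop in RSS
    return total / n_sims


fixed = mean_rss_drop(p=1)      # pre-specified variable: chi-squared, 1 df
adaptive = mean_rss_drop(p=50)  # best of 50 candidate variables
print(f"fixed: {fixed:.2f}, adaptive: {adaptive:.2f}")
```

With one pre-specified variable the average drop is close to 1, the mean of a chi-squared variable with one degree of freedom; choosing the best of 50 candidates gives a markedly larger average, which is why the classical reference distribution is no longer appropriate.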
Goo, Yeung-Ja James; Chi, Der-Jang; Shen, Zong-De
2016-01-01
The purpose of this study is to establish rigorous and reliable going concern doubt (GCD) prediction models. This study first uses the least absolute shrinkage and selection operator (LASSO) to select variables and then applies data mining techniques to establish prediction models, such as neural network (NN), classification and regression tree (CART), and support vector machine (SVM). The samples of this study include 48 GCD listed companies and 124 NGCD (non-GCD) listed companies from 2002 to 2013 in the TEJ database. We conduct fivefold cross validation in order to identify the prediction accuracy. According to the empirical results, the prediction accuracy of the LASSO-NN model is 88.96 % (Type I error rate is 12.22 %; Type II error rate is 7.50 %), the prediction accuracy of the LASSO-CART model is 88.75 % (Type I error rate is 13.61 %; Type II error rate is 14.17 %), and the prediction accuracy of the LASSO-SVM model is 89.79 % (Type I error rate is 10.00 %; Type II error rate is 15.83 %).
Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression
NASA Astrophysics Data System (ADS)
Ndiaye, Eugene; Fercoq, Olivier; Gramfort, Alexandre; Leclère, Vincent; Salmon, Joseph
2017-10-01
In high dimensional settings, sparse structures are crucial for efficiency in terms of memory, computation, and performance. It is customary to consider an $\ell_1$ penalty to enforce sparsity in such scenarios. Sparsity enforcing methods, the Lasso being a canonical example, are popular candidates to address high dimension. For efficiency, they rely on tuning a parameter trading data fitting versus sparsity. For the Lasso theory to hold this tuning parameter should be proportional to the noise level, yet the latter is often unknown in practice. A possible remedy is to jointly optimize over the regression parameter as well as over the noise level. This has been considered under several names in the literature, for instance Scaled-Lasso, Square-root Lasso, and Concomitant Lasso estimation, and could be of interest for uncertainty quantification. In this work, after illustrating numerical difficulties for the Concomitant Lasso formulation, we propose a modification we coined Smoothed Concomitant Lasso, aimed at increasing numerical stability. We propose an efficient and accurate solver leading to a computational cost no more expensive than the one for the Lasso. We leverage standard ingredients behind the success of fast Lasso solvers: a coordinate descent algorithm, combined with safe screening rules to achieve speed efficiency, by eliminating irrelevant features early.
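The joint optimization over the regression parameter and the noise level can be written out explicitly. The formulation below is the form in which the Concomitant Lasso is commonly stated, not a quotation from the abstract; the symbols $n$ (number of observations), $\sigma_0$ (a lower bound on the noise level) and the scaling constants are assumptions of this sketch.

```latex
% Concomitant Lasso: joint estimation of coefficients and noise level
(\hat\beta, \hat\sigma) \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p,\ \sigma > 0}
  \frac{\lVert y - X\beta \rVert_2^2}{2 n \sigma} + \frac{\sigma}{2} + \lambda \lVert \beta \rVert_1

% For fixed \beta the minimiser in \sigma is \lVert y - X\beta \rVert_2 / \sqrt{n};
% substituting it back gives the Square-root Lasso objective
\frac{\lVert y - X\beta \rVert_2}{\sqrt{n}} + \lambda \lVert \beta \rVert_1

% Smoothed variant: the same objective with the constraint \sigma \ge \sigma_0 > 0
```

Bounding $\sigma$ away from zero keeps the first term well behaved when the residual becomes very small, which is one way to read the numerical stability the abstract refers to.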
Klimovskaia, Anna; Ganscha, Stefan; Claassen, Manfred
2016-12-01
Stochastic chemical reaction networks constitute a model class to quantitatively describe dynamics and cell-to-cell variability in biological systems. The topology of these networks typically is only partially characterized due to experimental limitations. Current approaches for refining network topology are based on the explicit enumeration of alternative topologies and are therefore restricted to small problem instances with almost complete knowledge. We propose the reactionet lasso, a computational procedure that derives a stepwise sparse regression approach on the basis of the Chemical Master Equation, enabling large-scale structure learning for reaction networks by implicitly accounting for billions of topology variants. We have assessed the structure learning capabilities of the reactionet lasso on synthetic data for the complete TRAIL induced apoptosis signaling cascade comprising 70 reactions. We find that the reactionet lasso is able to efficiently recover the structure of these reaction systems, ab initio, with high sensitivity and specificity. With < 1% false discoveries, the reactionet lasso is able to recover 45% of all true reactions ab initio among > 6000 possible reactions and over $10^{2000}$ network topologies. In conjunction with information rich single cell technologies such as single cell RNA sequencing or mass cytometry, the reactionet lasso will enable large-scale structure learning, particularly in areas with partial network structure knowledge, such as cancer biology, and thereby enable the detection of pathological alterations of reaction networks. We provide software to allow for wide applicability of the reactionet lasso.
A LEAST ABSOLUTE SHRINKAGE AND SELECTION OPERATOR (LASSO) FOR NONLINEAR SYSTEM IDENTIFICATION
NASA Technical Reports Server (NTRS)
Kukreja, Sunil L.; Lofberg, Johan; Brenner, Martin J.
2006-01-01
Identification of parametric nonlinear models involves estimating unknown parameters and detecting their underlying structure. Structure computation is concerned with selecting a subset of parameters to give a parsimonious description of the system which may afford greater insight into the functionality of the system or a simpler controller design. In this study, a least absolute shrinkage and selection operator (LASSO) technique is investigated for computing efficient model descriptions of nonlinear systems. The LASSO minimises the residual sum of squares with the addition of an $\ell_1$ penalty term on the parameter vector of the traditional $\ell_2$ minimisation problem. Its use for structure detection is a natural extension of this constrained minimisation approach to pseudolinear regression problems which produces some model parameters that are exactly zero and, therefore, yields a parsimonious system description. The performance of this LASSO structure detection method was evaluated by using it to estimate the structure of a nonlinear polynomial model. Applicability of the method to more complex systems such as those encountered in aerospace applications was shown by identifying a parsimonious system description of the F/A-18 Active Aeroelastic Wing using flight test data.
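How an ℓ1 penalty produces parameters that are exactly zero can be seen in a few lines of code. The sketch below uses cyclic coordinate descent with soft-thresholding, a standard way to solve the lasso that is swapped in here for illustration and is not stated to be the solver used in the study. The function names, the tiny design matrix of candidate terms, and the penalty value are all hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise 0.5 * RSS + lam * sum(|beta_j|) by cyclic coordinate descent.

    X is a list of rows, y a list of responses. No intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that ignores beta_j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta


# Hypothetical candidate terms (columns) and response, for illustration only.
X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
y = [3.1, 3.1, 2.9, 2.9]
beta = lasso_coordinate_descent(X, y, lam=1.0)
print(beta)
```

In this toy problem the first candidate term carries almost all of the signal, so its coefficient is shrunk but retained, while the two weak terms are set exactly to zero. That is the structure-detection behaviour the abstract describes.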
Application of Multi-task Lasso Regression in the Stellar Parametrization
NASA Astrophysics Data System (ADS)
Chang, L. N.; Zhang, P. A.
2015-01-01
Multi-task learning approaches have attracted increasing attention in the fields of machine learning, computer vision, and artificial intelligence. By utilizing the correlations among tasks, learning multiple related tasks simultaneously is better than learning each task independently. An efficient multi-task Lasso (Least Absolute Shrinkage and Selection Operator) regression algorithm is proposed in this paper to estimate the physical parameters of stellar spectra. It not only makes different physical parameters share common features, but also effectively preserves their own peculiar features. Experiments were carried out on the ELODIE data simulated with the stellar atmospheric simulation model, and on the SDSS data released by the Sloan Digital Sky Survey. The precision of the model is better than that of the methods in the related literature, especially for the gravitational acceleration (lg g) and the chemical abundance ([Fe/H]). In the experiments, we changed the resolution of the spectra and added noise at different signal-to-noise ratios (SNR), so as to illustrate the stability of the model. The results show that the model is influenced by both the resolution and the noise, but the influence of the noise is larger than that of the resolution. In general, the multi-task Lasso regression algorithm is easy to operate, is highly stable, and can improve the overall accuracy of the model.
Developing deterioration models for Wyoming bridges.
DOT National Transportation Integrated Search
2016-05-01
Deterioration models for the Wyoming Bridge Inventory were developed using both stochastic and deterministic models. The selection of explanatory variables is investigated and a new method using LASSO regression to eliminate human bias in explana...
Haem, Elham; Harling, Kajsa; Ayatollahi, Seyyed Mohammad Taghi; Zare, Najaf; Karlsson, Mats O
2017-02-01
One important aim in population pharmacokinetics (PK) and pharmacodynamics is identification and quantification of the relationships between the parameters and covariates. Lasso has been suggested as a technique for simultaneous estimation and covariate selection. In linear regression, it has been shown that Lasso does not possess the oracle properties, that is, it does not asymptotically perform as though the true underlying model were given in advance. Adaptive Lasso (ALasso) with appropriate initial weights is claimed to possess oracle properties; however, it can lead to poor predictive performance when there is multicollinearity between covariates. This simulation study implemented a new version of ALasso, called adjusted ALasso (AALasso), which uses the ratio of the standard error of the maximum likelihood (ML) estimator to the ML coefficient as the initial weight in ALasso to deal with multicollinearity in non-linear mixed-effect models. The performance of AALasso was compared with that of ALasso and Lasso. PK data were simulated in four set-ups from a one-compartment bolus input model. Covariates were created by sampling from a multivariate standard normal distribution with no, low (0.2), moderate (0.5) or high (0.7) correlation. The true covariates influenced only clearance at different magnitudes. AALasso, ALasso and Lasso were compared in terms of mean absolute prediction error and error of the estimated covariate coefficient. The results show that AALasso performed better in small data sets, even in those in which a high correlation existed between covariates. This makes AALasso a promising method for covariate selection in nonlinear mixed-effect models.
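The difference between the two sets of initial weights can be made concrete. The sketch below is an illustration, not the study's implementation: it assumes the classic adaptive-lasso weight is the reciprocal of the absolute ML coefficient (exponent 1), takes the AALasso weight to be the standard-error-to-coefficient ratio described in the abstract, and uses a hypothetical function name and hypothetical estimates.

```python
def adaptive_weights(beta_ml, se_ml=None, eps=1e-8):
    """Initial penalty weights for an adaptive lasso.

    With se_ml=None: classic adaptive-lasso weights 1 / |beta_ml|.
    With se_ml given: weights se_ml / |beta_ml|, the ratio described in
    the abstract for the adjusted variant (AALasso).
    eps guards against division by zero and is an illustrative choice.
    """
    if se_ml is None:
        se_ml = [1.0] * len(beta_ml)
    return [se / max(abs(b), eps) for b, se in zip(beta_ml, se_ml)]


# Hypothetical ML estimates and standard errors for three covariates.
beta_ml = [0.80, 0.05, -0.40]
se_ml = [0.10, 0.20, 0.05]
w_alasso = adaptive_weights(beta_ml)
w_aalasso = adaptive_weights(beta_ml, se_ml)
print(w_alasso, w_aalasso)
```

Each coefficient is then penalised in proportion to its weight, so a covariate with a small or imprecisely estimated ML coefficient is pushed harder toward zero. Under multicollinearity the standard errors grow, which is the information the adjusted weights bring in.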
Multi-omics facilitated variable selection in Cox-regression model for cancer prognosis prediction.
Liu, Cong; Wang, Xujun; Genchev, Georgi Z; Lu, Hui
2017-07-15
New developments in high-throughput genomic technologies have enabled the measurement of diverse types of omics biomarkers in a cost-efficient and clinically-feasible manner. Developing computational methods and tools for analysis and translation of such genomic data into clinically-relevant information is an ongoing and active area of investigation. For example, several studies have utilized an unsupervised learning framework to cluster patients by integrating omics data. Despite such recent advances, predicting cancer prognosis using integrated omics biomarkers remains a challenge. There is also a shortage of computational tools for predicting cancer prognosis by using supervised learning methods. The current standard approach is to fit a Cox regression model by concatenating the different types of omics data in a linear manner, while penalty could be added for feature selection. A more powerful approach, however, would be to incorporate data by considering relationships among omics datatypes. Here we developed two methods: a SKI-Cox method and a wLASSO-Cox method to incorporate the association among different types of omics data. Both methods fit the Cox proportional hazards model and predict a risk score based on mRNA expression profiles. SKI-Cox borrows the information generated by these additional types of omics data to guide variable selection, while wLASSO-Cox incorporates this information as a penalty factor during model fitting. We show that SKI-Cox and wLASSO-Cox models select more true variables than a LASSO-Cox model in simulation studies. We assess the performance of SKI-Cox and wLASSO-Cox using TCGA glioblastoma multiforme and lung adenocarcinoma data. In each case, mRNA expression, methylation, and copy number variation data are integrated to predict the overall survival time of cancer patients. Our methods achieve better performance in predicting patients' survival in glioblastoma and lung adenocarcinoma. Copyright © 2017. Published by Elsevier Inc.
Quantifying predictive capability of electronic health records for the most harmful breast cancer
NASA Astrophysics Data System (ADS)
Wu, Yirong; Fan, Jun; Peissig, Peggy; Berg, Richard; Tafti, Ahmad Pahlavan; Yin, Jie; Yuan, Ming; Page, David; Cox, Jennifer; Burnside, Elizabeth S.
2018-03-01
Improved prediction of the "most harmful" breast cancers that cause the most substantive morbidity and mortality would enable physicians to target more intense screening and preventive measures at those women who have the highest risk; however, such prediction models for the "most harmful" breast cancers have rarely been developed. Electronic health records (EHRs) represent an underused data source that has great research and clinical potential. Our goal was to quantify the value of EHR variables in the "most harmful" breast cancer risk prediction. We identified 794 subjects who had breast cancer with primary non-benign tumors with their earliest diagnosis on or after 1/1/2004 from an existing personalized medicine data repository, including 395 "most harmful" breast cancer cases and 399 "least harmful" breast cancer cases. For these subjects, we collected EHR data comprised of 6 components: demographics, diagnoses, symptoms, procedures, medications, and laboratory results. We developed two regularized prediction models, Ridge Logistic Regression (Ridge-LR) and Lasso Logistic Regression (Lasso-LR), to predict the "most harmful" breast cancer one year in advance. The area under the ROC curve (AUC) was used to assess model performance. We observed that the AUCs of Ridge-LR and Lasso-LR models were 0.818 and 0.839 respectively. For both the Ridge-LR and Lasso-LR models, the predictive performance of the whole EHR variables was significantly higher than that of each individual component (p<0.001). In conclusion, EHR variables can be used to predict the "most harmful" breast cancer, providing the possibility to personalize care for those women at the highest risk in clinical practice.
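The AUC used to compare the two models has a simple pairwise interpretation that a short sketch can make concrete. The code below is an illustration and not the study's evaluation pipeline: it computes the AUC as the fraction of (case, control) pairs ranked correctly, and the function name, scores, and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks and outcomes (1 = "most harmful").
scores = [0.9, 0.8, 0.4, 0.35, 0.2]
labels = [1, 0, 1, 0, 0]
value = auc(scores, labels)
print(round(value, 3))
```

Read this way, an AUC of 0.839 means that a randomly chosen "most harmful" case receives a higher predicted risk than a randomly chosen "least harmful" case in roughly 84% of pairs.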
Quantifying predictive capability of electronic health records for the most harmful breast cancer.
Wu, Yirong; Fan, Jun; Peissig, Peggy; Berg, Richard; Tafti, Ahmad Pahlavan; Yin, Jie; Yuan, Ming; Page, David; Cox, Jennifer; Burnside, Elizabeth S
2018-02-01
Improved prediction of the "most harmful" breast cancers that cause the most substantive morbidity and mortality would enable physicians to target more intense screening and preventive measures at those women who have the highest risk; however, such prediction models for the "most harmful" breast cancers have rarely been developed. Electronic health records (EHRs) represent an underused data source that has great research and clinical potential. Our goal was to quantify the value of EHR variables in the "most harmful" breast cancer risk prediction. We identified 794 subjects who had breast cancer with primary non-benign tumors with their earliest diagnosis on or after 1/1/2004 from an existing personalized medicine data repository, including 395 "most harmful" breast cancer cases and 399 "least harmful" breast cancer cases. For these subjects, we collected EHR data comprised of 6 components: demographics, diagnoses, symptoms, procedures, medications, and laboratory results. We developed two regularized prediction models, Ridge Logistic Regression (Ridge-LR) and Lasso Logistic Regression (Lasso-LR), to predict the "most harmful" breast cancer one year in advance. The area under the ROC curve (AUC) was used to assess model performance. We observed that the AUCs of Ridge-LR and Lasso-LR models were 0.818 and 0.839 respectively. For both the Ridge-LR and Lasso-LR models, the predictive performance of the whole EHR variables was significantly higher than that of each individual component (p<0.001). In conclusion, EHR variables can be used to predict the "most harmful" breast cancer, providing the possibility to personalize care for those women at the highest risk in clinical practice.
NASA Astrophysics Data System (ADS)
Dyar, M. D.; Carmosino, M. L.; Breves, E. A.; Ozanne, M. V.; Clegg, S. M.; Wiens, R. C.
2012-04-01
A remote laser-induced breakdown spectrometer (LIBS) designed to simulate the ChemCam instrument on the Mars Science Laboratory Rover Curiosity was used to probe 100 geologic samples at a 9-m standoff distance. ChemCam consists of an integrated remote LIBS instrument that will probe samples up to 7 m from the mast of the rover and a remote micro-imager (RMI) that will record context images. The elemental compositions of 100 igneous and highly-metamorphosed rocks are determined with LIBS using three variations of multivariate analysis, with a goal of improving the analytical accuracy. Two forms of partial least squares (PLS) regression are employed with finely-tuned parameters: PLS-1 regresses a single response variable (elemental concentration) against the observation variables (spectra, or intensity at each of 6144 spectrometer channels), while PLS-2 simultaneously regresses multiple response variables (concentrations of the ten major elements in rocks) against the observation predictor variables, taking advantage of natural correlations between elements. Those results are contrasted with those from the multivariate regression technique of the least absolute shrinkage and selection operator (lasso), which is a penalized shrunken regression method that selects the specific channels for each element that explain the most variance in the concentration of that element. To make this comparison, we use results of cross-validation and of held-out testing, and employ unscaled and uncentered spectral intensity data because all of the input variables are already in the same units. Results demonstrate that the lasso, PLS-1, and PLS-2 all yield comparable results in terms of accuracy for this dataset. However, the interpretability of these methods differs greatly in terms of fundamental understanding of LIBS emissions. 
PLS techniques generate principal components, linear combinations of intensities at any number of spectrometer channels, which explain as much variance in the response variables as possible while avoiding multicollinearity between principal components. When the selected number of principal components is projected back into the original feature space of the spectra, 6144 correlation coefficients are generated, a small fraction of which are mathematically significant to the regression. In contrast, the lasso models require only a small number (< 24) of non-zero correlation coefficients (β values) to determine the concentration of each of the ten major elements. Causality between the positively-correlated emission lines chosen by the lasso and the elemental concentration was examined. In general, the higher the lasso coefficient (β), the greater the likelihood that the selected line results from an emission of that element. Emission lines with negative β values should arise from elements that are anti-correlated with the element being predicted. For elements except Fe, Al, Ti, and P, the lasso-selected wavelength with the highest β value corresponds to the element being predicted, e.g. 559.8 nm for neutral Ca. However, the specific lines chosen by the lasso with positive β values are not always those from the element being predicted. Other wavelengths and the elements that most strongly correlate with them to predict concentration are obviously related to known geochemical correlations or close overlap of emission lines, while others must result from matrix effects. Use of the lasso technique thus directly informs our understanding of the underlying physical processes that give rise to LIBS emissions by determining which lines can best represent concentration, and which lines from other elements are causing matrix effects.
NASA Astrophysics Data System (ADS)
Takayama, T.; Iwasaki, A.
2016-06-01
Above-ground biomass prediction of tropical rain forest using remote sensing data is of paramount importance to continuous large-area forest monitoring. Hyperspectral data can provide rich spectral information for biomass prediction; however, the prediction accuracy is affected by the small-sample-size problem, which commonly appears as overfitting when high-dimensional data are used and the number of training samples is smaller than the dimensionality of the samples, owing to the limited time, cost, and human resources available for field surveys. A common approach to addressing this problem is to reduce the dimensionality of the dataset. In addition, acquired hyperspectral data usually have a low signal-to-noise ratio due to the narrow bandwidth, and show local or global shifts of peaks due to instrumental instability or small differences in practical measurement conditions. In this work, we propose a methodology based on fused lasso regression that selects optimal bands for the biomass prediction model by encouraging sparsity and grouping; the sparsity addresses the small-sample-size problem through dimensionality reduction, and the grouping addresses the noise and peak-shift problems. The prediction model provided higher accuracy, with a root-mean-square error (RMSE) of 66.16 t/ha in cross-validation, than the other methods: multiple linear analysis, partial least squares regression, and lasso regression. Furthermore, fusing spectral information with spatial information derived from a texture index increased the prediction accuracy, giving an RMSE of 62.62 t/ha. This analysis demonstrates the efficiency of fused lasso and image texture in biomass estimation of tropical forests.
Zhang, Xiaoshuai; Xue, Fuzhong; Liu, Hong; Zhu, Dianwen; Peng, Bin; Wiemels, Joseph L; Yang, Xiaowei
2014-12-10
Genome-wide Association Studies (GWAS) are typically designed to identify phenotype-associated single nucleotide polymorphisms (SNPs) individually using univariate analysis methods. Though providing valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally account for only a small proportion of the total heritability for complex diseases. To solve this "missing heritability" problem, we implemented a strategy called integrative Bayesian Variable Selection (iBVS), which is based on a hierarchical model that incorporates an informative prior by considering the gene interrelationship as a network. It was applied here to both simulated and real data sets. Simulation studies indicated that the iBVS method was advantageous in its performance with highest AUC in both variable selection and outcome prediction, when compared to Stepwise and LASSO based strategies. In an analysis of a leprosy case-control study, iBVS selected 94 SNPs as predictors, while LASSO selected 100 SNPs. The Stepwise regression yielded a more parsimonious model with only 3 SNPs. The prediction results demonstrated that the iBVS method had comparable performance with that of LASSO, but better than Stepwise strategies. The proposed iBVS strategy is a novel and valid method for Genome-wide Association Studies, with the additional advantage in that it produces more interpretable posterior probabilities for each variable unlike LASSO and other penalized regression methods.
Application of Multi-task Lasso Regression in the Parametrization of Stellar Spectra
NASA Astrophysics Data System (ADS)
Chang, Li-Na; Zhang, Pei-Ai
2015-07-01
Multi-task learning approaches have attracted increasing attention in the fields of machine learning, computer vision, and artificial intelligence. By exploiting the correlations among tasks, learning multiple related tasks simultaneously is better than learning each task independently. An efficient multi-task Lasso (Least Absolute Shrinkage and Selection Operator) regression algorithm is proposed in this paper to estimate the physical parameters of stellar spectra. It not only captures the features common to the different physical parameters, but also effectively preserves the features peculiar to each. Experiments were carried out on the ELODIE synthetic spectral data simulated with a stellar atmospheric model, and on data released by the Sloan Digital Sky Survey (SDSS). The estimation precision of our model is better than that of the methods in the related literature, especially for the estimates of the gravitational acceleration (lg g) and the chemical abundance ([Fe/H]). In the experiments we varied the spectral resolution and added noise at different signal-to-noise ratios (SNRs) to the spectral data, so as to illustrate the stability of the model. The results show that the model is influenced by both the resolution and the noise, but the influence of the noise is larger than that of the resolution. In general, the multi-task Lasso regression algorithm is easy to operate, is highly stable, and can also improve the overall prediction accuracy of the model.
ELASTIC NET FOR COX'S PROPORTIONAL HAZARDS MODEL WITH A SOLUTION PATH ALGORITHM.
Wu, Yichao
2012-01-01
For least squares regression, Efron et al. (2004) proposed an efficient solution path algorithm, the least angle regression (LAR). They showed that a slight modification of the LAR leads to the whole LASSO solution path. Both the LAR and LASSO solution paths are piecewise linear. Recently Wu (2011) extended the LAR to generalized linear models and the quasi-likelihood method. In this work we extend the LAR further to handle Cox's proportional hazards model. The goal is to develop a solution path algorithm for the elastic net penalty (Zou and Hastie (2005)) in Cox's proportional hazards model. This goal is achieved in two steps. First we extend the LAR to optimizing the log partial likelihood plus a fixed small ridge term. Then we define a path modification, which leads to the solution path of the elastic net regularized log partial likelihood. Our solution path is exact and piecewise determined by ordinary differential equation systems.
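For orientation, the objective described in the abstract above can be written out as follows. This is a sketch in one common parameterization, assuming no tied event times; the symbols (event indicator \delta_i, risk set R_i, tuning parameters \lambda_1 and \lambda_2) are notation introduced here and may differ from the paper's own.

```latex
% Log partial likelihood for Cox's proportional hazards model
% (x_i: covariates of subject i, \delta_i = 1 if the event is observed,
%  R_i: set of subjects still at risk at subject i's event time)
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \Bigl[ x_i^{\top}\beta - \log \sum_{j \in R_i} \exp\bigl(x_j^{\top}\beta\bigr) \Bigr]

% Elastic net regularized estimate: L1 (lasso) plus squared L2 (ridge) penalty
\hat{\beta}(\lambda_1, \lambda_2) = \arg\max_{\beta}
  \Bigl\{ \ell(\beta) - \lambda_1 \lVert \beta \rVert_1
          - \tfrac{\lambda_2}{2} \lVert \beta \rVert_2^2 \Bigr\}
```

Under this reading, the "fixed small ridge term" of the first step corresponds to holding \lambda_2 fixed while the path is traced in \lambda_1.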
NASA Astrophysics Data System (ADS)
Pandremmenou, K.; Shahid, M.; Kondi, L. P.; Lövström, B.
2015-03-01
In this work, we propose a No-Reference (NR) bitstream-based model for predicting the quality of H.264/AVC video sequences, affected by both compression artifacts and transmission impairments. The proposed model is based on a feature extraction procedure, where a large number of features are calculated from the packet-loss impaired bitstream. Many of the features are firstly proposed in this work, and the specific set of the features as a whole is applied for the first time for making NR video quality predictions. All feature observations are taken as input to the Least Absolute Shrinkage and Selection Operator (LASSO) regression method. LASSO indicates the most important features, and using only them, it is possible to estimate the Mean Opinion Score (MOS) with high accuracy. Indicatively, we point out that only 13 features are able to produce a Pearson Correlation Coefficient of 0.92 with the MOS. Interestingly, the performance statistics we computed in order to assess our method for predicting the Structural Similarity Index and the Video Quality Metric are equally good. Thus, the obtained experimental results verified the suitability of the features selected by LASSO as well as the ability of LASSO in making accurate predictions through sparse modeling.
ELASTIC NET FOR COX’S PROPORTIONAL HAZARDS MODEL WITH A SOLUTION PATH ALGORITHM
Wu, Yichao
2012-01-01
For least squares regression, Efron et al. (2004) proposed an efficient solution path algorithm, the least angle regression (LAR). They showed that a slight modification of the LAR leads to the whole LASSO solution path. Both the LAR and LASSO solution paths are piecewise linear. Recently Wu (2011) extended the LAR to generalized linear models and the quasi-likelihood method. In this work we extend the LAR further to handle Cox’s proportional hazards model. The goal is to develop a solution path algorithm for the elastic net penalty (Zou and Hastie (2005)) in Cox’s proportional hazards model. This goal is achieved in two steps. First we extend the LAR to optimizing the log partial likelihood plus a fixed small ridge term. Then we define a path modification, which leads to the solution path of the elastic net regularized log partial likelihood. Our solution path is exact and piecewise determined by ordinary differential equation systems. PMID:23226932
Kopprasch, Steffi; Dheban, Srirangan; Schuhmann, Kai; Xu, Aimin; Schulte, Klaus-Martin; Simeonovic, Charmaine J; Schwarz, Peter E H; Bornstein, Stefan R; Shevchenko, Andrej; Graessler, Juergen
2016-01-01
Glucolipotoxicity is a major pathophysiological mechanism in the development of insulin resistance and type 2 diabetes mellitus (T2D). We aimed to detect subtle changes in the circulating lipid profile by shotgun lipidomics analyses and to associate them with four different insulin sensitivity indices. The cross-sectional study comprised 90 men with a broad range of insulin sensitivity including normal glucose tolerance (NGT, n = 33), impaired glucose tolerance (IGT, n = 32) and newly detected T2D (n = 25). Prior to oral glucose challenge plasma was obtained and quantitatively analyzed for 198 lipid molecular species from 13 different lipid classes including triacylglycerols (TAGs), phosphatidylcholine plasmalogen/ether (PC O-s), sphingomyelins (SMs), and lysophosphatidylcholines (LPCs). To identify a lipidomic signature of individual insulin sensitivity we applied three data mining approaches, namely least absolute shrinkage and selection operator (LASSO), Support Vector Regression (SVR) and Random Forests (RF) for the following insulin sensitivity indices: homeostasis model of insulin resistance (HOMA-IR), glucose insulin sensitivity index (GSI), insulin sensitivity index (ISI), and disposition index (DI). The LASSO procedure offers a high prediction accuracy and an easier interpretability than SVR and RF. After LASSO selection, the plasma lipidome explained from 3% (DI) to a maximum of 53% (HOMA-IR) of the variability of the sensitivity indices. Among the lipid species with the highest positive LASSO regression coefficient were TAG 54:2 (HOMA-IR), PC O- 32:0 (GSI), and SM 40:3:1 (ISI). The highest negative regression coefficient was obtained for LPC 22:5 (HOMA-IR), TAG 51:1 (GSI), and TAG 58:6 (ISI). Although a substantial part of lipid molecular species showed a significant correlation with insulin sensitivity indices we were able to identify a limited number of lipid metabolites of particular importance based on the LASSO approach.
These few selected lipids with the closest connection to sensitivity indices may help to further improve disease risk prediction and disease and therapy monitoring.
Jiang, Xiaoyu; Fuchs, Mathias
2017-01-01
As modern biotechnologies advance, it has become increasingly frequent that different modalities of high-dimensional molecular data (termed “omics” data in this paper), such as gene expression, methylation, and copy number, are collected from the same patient cohort to predict the clinical outcome. While prediction based on omics data has been widely studied in the last fifteen years, little has been done in the statistical literature on the integration of multiple omics modalities to select a subset of variables for prediction, which is a critical task in personalized medicine. In this paper, we propose a simple penalized regression method to address this problem by assigning different penalty factors to different data modalities for feature selection and prediction. The penalty factors can be chosen in a fully data-driven fashion by cross-validation or by taking practical considerations into account. In simulation studies, we compare the prediction performance of our approach, called IPF-LASSO (Integrative LASSO with Penalty Factors) and implemented in the R package ipflasso, with the standard LASSO and sparse group LASSO. The use of IPF-LASSO is also illustrated through applications to two real-life cancer datasets. All data and codes are available on the companion website to ensure reproducibility. PMID:28546826
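The idea of assigning different penalty factors to different data modalities can be illustrated with a small coordinate-descent sketch. This is not the ipflasso implementation: the function names, the objective scaling (1/(2n)), the absence of an intercept, and the toy design are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used in lasso coordinate updates."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, lam, penalty_factors, n_iter=200):
    """Coordinate descent for
        (1/(2n)) * sum_i (y_i - x_i . beta)^2 + lam * sum_j w_j * |beta_j|
    X is a list of rows; penalty_factors holds one w_j per column; no intercept.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (beta_j added back).
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam * penalty_factors[j]) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy orthogonal design: column 0 stands for one modality, column 1 for another.
X_toy = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y_toy = [2.3, 1.7, -1.7, -2.3]  # equals 2.0 * column 0 + 0.3 * column 1

beta_equal = weighted_lasso(X_toy, y_toy, lam=0.2, penalty_factors=[1.0, 1.0])
beta_ipf = weighted_lasso(X_toy, y_toy, lam=0.2, penalty_factors=[1.0, 2.0])
```

With equal factors both coefficients survive (about 1.8 and 0.1); doubling the factor on the second column raises its threshold to 0.4 and removes the weak coefficient, which is the mechanism by which one modality can be penalized more heavily than another.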
Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins.
Wan, Shibiao; Mak, Man-Wai; Kung, Sun-Yuan
2016-02-24
Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have been focusing on predicting not only single-location proteins, but also multi-location proteins. Almost all of the high performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved. This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, namely multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale predictions of protein subcellular localization. Both algorithms can yield sparse and interpretable solutions. By using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 out of more than 8,000 GO terms, respectively, which play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that the GO terms of these categories also play vital roles in the prediction. With these essential GO terms, not only where a protein locates can be decided, but also why it resides there can be revealed. Experimental results show that the output of both mEN and mLASSO are interpretable and they perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. For readers' convenience, an online server called SpaPredictor for both mLASSO and mEN is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.
Vasquez, Monica M; Hu, Chengcheng; Roe, Denise J; Chen, Zhao; Halonen, Marilyn; Guerra, Stefano
2016-11-14
The study of circulating biomarkers and their association with disease outcomes has become progressively complex due to advances in the measurement of these biomarkers through multiplex technologies. The Least Absolute Shrinkage and Selection Operator (LASSO) is a data analysis method that may be utilized for biomarker selection in these high dimensional data. However, it is unclear which LASSO-type method is preferable when considering data scenarios that may be present in serum biomarker research, such as high correlation between biomarkers, weak associations with the outcome, and sparse number of true signals. The goal of this study was to compare the LASSO to five LASSO-type methods given these scenarios. A simulation study was performed to compare the LASSO, Adaptive LASSO, Elastic Net, Iterated LASSO, Bootstrap-Enhanced LASSO, and Weighted Fusion for the binary logistic regression model. The simulation study was designed to reflect the data structure of the population-based Tucson Epidemiological Study of Airway Obstructive Disease (TESAOD), specifically the sample size (N = 1000 for total population, 500 for sub-analyses), correlation of biomarkers (0.20, 0.50, 0.80), prevalence of overweight (40%) and obese (12%) outcomes, and the association of outcomes with standardized serum biomarker concentrations (log-odds ratio = 0.05-1.75). Each LASSO-type method was then applied to the TESAOD data of 306 overweight, 66 obese, and 463 normal-weight subjects with a panel of 86 serum biomarkers. Based on the simulation study, no method had an overall superior performance. The Weighted Fusion correctly identified more true signals, but incorrectly included more noise variables. The LASSO and Elastic Net correctly identified many true signals and excluded more noise variables. 
In the application study, biomarkers of overweight and obesity selected by all methods were Adiponectin, Apolipoprotein H, Calcitonin, CD14, Complement 3, C-reactive protein, Ferritin, Growth Hormone, Immunoglobulin M, Interleukin-18, Leptin, Monocyte Chemotactic Protein-1, Myoglobin, Sex Hormone Binding Globulin, Surfactant Protein D, and YKL-40. For the data scenarios examined, choice of optimal LASSO-type method was data structure dependent and should be guided by the research objective. The LASSO-type methods identified biomarkers that have known associations with obesity and obesity related conditions.
Model-assisted survey regression estimation with the lasso
Kelly S. McConville; F. Jay Breidt; Thomas C. M. Lee; Gretchen G. Moisen
2017-01-01
In the U.S. Forest Service's Forest Inventory and Analysis (FIA) program, as in other natural resource surveys, many auxiliary variables are available for use in model-assisted inference about finite population parameters. Some of this auxiliary information may be extraneous, and therefore model selection is appropriate to improve the efficiency of the survey...
Bricklemyer, Ross S; Brown, David J; Turk, Philip J; Clegg, Sam M
2013-10-01
Laser-induced breakdown spectroscopy (LIBS) provides a potential method for rapid, in situ soil C measurement. In previous research on the application of LIBS to intact soil cores, we hypothesized that ultraviolet (UV) spectrum LIBS (200-300 nm) might not provide sufficient elemental information to reliably discriminate between soil organic C (SOC) and inorganic C (IC). In this study, using a custom complete spectrum (245-925 nm) core-scanning LIBS instrument, we analyzed 60 intact soil cores from six wheat fields. Predictive multi-response partial least squares (PLS2) models using full and reduced spectrum LIBS were compared for directly determining soil total C (TC), IC, and SOC. Two regression shrinkage and variable selection approaches, the least absolute shrinkage and selection operator (LASSO) and sparse multivariate regression with covariance estimation (MRCE), were tested for soil C predictions and the identification of wavelengths important for soil C prediction. Using complete spectrum LIBS for PLS2 modeling reduced the calibration standard error of prediction (SEP) 15 and 19% for TC and IC, respectively, compared to UV spectrum LIBS. The LASSO and MRCE approaches provided significantly improved calibration accuracy and reduced SEP 32-55% over UV spectrum PLS2 models. We conclude that (1) complete spectrum LIBS is superior to UV spectrum LIBS for predicting soil C for intact soil cores without pretreatment; (2) LASSO and MRCE approaches provide improved calibration prediction accuracy over PLS2 but require additional testing with increased soil and target analyte diversity; and (3) measurement errors associated with analyzing intact cores (e.g., sample density and surface roughness) require further study and quantification.
Screen and clean: a tool for identifying interactions in genome-wide association studies.
Wu, Jing; Devlin, Bernie; Ringquist, Steven; Trucco, Massimo; Roeder, Kathryn
2010-04-01
Epistasis could be an important source of risk for disease. How interacting loci might be discovered is an open question for genome-wide association studies (GWAS). Most researchers limit their statistical analyses to testing individual pairwise interactions (i.e., marginal tests for association). A more effective means of identifying important predictors is to fit models that include many predictors simultaneously (i.e., higher-dimensional models). We explore a procedure called screen and clean (SC) for identifying liability loci, including interactions, by using the lasso procedure, which is a model selection tool for high-dimensional regression. We approach the problem by using a varying dictionary consisting of terms to include in the model. In the first step the lasso dictionary includes only main effects. The most promising single-nucleotide polymorphisms (SNPs) are identified using a screening procedure. Next the lasso dictionary is adjusted to include these main effects and the corresponding interaction terms. Again, promising terms are identified using lasso screening. Then significant terms are identified through the cleaning process. Implementation of SC for GWAS requires algorithms to explore the complex model space induced by the many SNPs genotyped and their interactions. We propose and explore a set of algorithms and find that SC successfully controls Type I error while yielding good power to identify risk loci and their interactions. When the method is applied to data obtained from the Wellcome Trust Case Control Consortium study of Type 1 Diabetes it uncovers evidence supporting interaction within the HLA class II region as well as within Chromosome 12q24.
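The dictionary adjustment between the two screening steps can be sketched as below. The abstract does not say exactly which interaction terms are added, so this sketch assumes all pairwise interactions among the screened SNPs; the function name and SNP labels are hypothetical.

```python
from itertools import combinations


def expand_dictionary(screened_snps):
    """Build the second-stage dictionary: main effects of the screened SNPs
    plus every pairwise interaction among them (an assumption of this sketch)."""
    main_effects = [(snp,) for snp in screened_snps]
    interactions = list(combinations(screened_snps, 2))
    return main_effects + interactions


# Hypothetical SNP labels surviving the first lasso screen.
terms = expand_dictionary(["snp_a", "snp_b", "snp_c"])
# k screened SNPs give k main effects and k * (k - 1) / 2 interaction terms.
```

Screening first keeps the second dictionary small: the number of interaction terms grows quadratically in the number of retained SNPs, not in the number of SNPs genotyped.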
Perceptual quality estimation of H.264/AVC videos using reduced-reference and no-reference models
NASA Astrophysics Data System (ADS)
Shahid, Muhammad; Pandremmenou, Katerina; Kondi, Lisimachos P.; Rossholm, Andreas; Lövström, Benny
2016-09-01
Reduced-reference (RR) and no-reference (NR) models for video quality estimation, using features that account for the impact of coding artifacts, spatio-temporal complexity, and packet losses, are proposed. The purpose of this study is to analyze a number of potentially quality-relevant features in order to select the most suitable set of features for building the desired models. The proposed sets of features have not been used in the literature and some of the features are used for the first time in this study. The features are employed by the least absolute shrinkage and selection operator (LASSO), which selects only the most influential of them toward perceptual quality. For comparison, we apply feature selection in the complete feature sets and ridge regression on the reduced sets. The models are validated using a database of H.264/AVC encoded videos that were subjectively assessed for quality in an ITU-T compliant laboratory. We infer that just two features selected by RR LASSO and two bitstream-based features selected by NR LASSO are able to estimate perceptual quality with high accuracy, higher than that of ridge, which uses more features. The comparisons with competing works and two full-reference metrics also verify the superiority of our models.
Reasoning about Independence in Probabilistic Models of Relational Data (Author’s Manuscript)
2014-01-06
for relational variables from A’s perspective, and this result is also applicable to one-to-many data.) To illustrate this fact more concretely ...separators. Technical Report R-254, UCLA Computer Science Department, February 1998. Robert Tibshirani. Regression shrinkage and selection via the lasso
2014-01-01
Background Genome-wide microarrays have been useful for predicting chemical-genetic interactions at the gene level. However, interpreting genome-wide microarray results can be overwhelming due to the vast output of gene expression data combined with off-target transcriptional responses many times induced by a drug treatment. This study demonstrates how experimental and computational methods can interact with each other, to arrive at more accurate predictions of drug-induced perturbations. We present a two-stage strategy that links microarray experimental testing and network training conditions to predict gene perturbations for a drug with a known mechanism of action in a well-studied organism. Results S. cerevisiae cells were treated with the antifungal, fluconazole, and expression profiling was conducted under different biological conditions using Affymetrix genome-wide microarrays. Transcripts were filtered with a formal network-based method, sparse simultaneous equation models and Lasso regression (SSEM-Lasso), under different network training conditions. Gene expression results were evaluated using both gene set and single gene target analyses, and the drug’s transcriptional effects were narrowed first by pathway and then by individual genes. Variables included: (i) Testing conditions – exposure time and concentration and (ii) Network training conditions – training compendium modifications. Two analyses of SSEM-Lasso output – gene set and single gene – were conducted to gain a better understanding of how SSEM-Lasso predicts perturbation targets. Conclusions This study demonstrates that genome-wide microarrays can be optimized using a two-stage strategy for a more in-depth understanding of how a cell manifests biological reactions to a drug treatment at the transcription level. Additionally, a more detailed understanding of how the statistical model, SSEM-Lasso, propagates perturbations through a network of gene regulatory interactions is achieved. PMID:24444313
Statistically Modeling I-V Characteristics of CNT-FET with LASSO
NASA Astrophysics Data System (ADS)
Ma, Dongsheng; Ye, Zuochang; Wang, Yan
2017-08-01
With the advent of the Internet of Things (IoT), the need to study new materials and devices for various applications is increasing. Traditionally, compact models for transistors are built on the basis of physics. But physical models are expensive and take a very long time to adjust for non-ideal effects. Because the application outlook for many novel devices is uncertain or their manufacturing process is not mature, deriving generalized, accurate physical models for such devices is very strenuous, whereas statistical modeling is becoming a viable alternative because of its data-oriented nature and fast implementation. In this paper, a classical statistical regression method, LASSO, is used to model the I-V characteristics of a CNT-FET, and a pseudo-PMOS inverter simulation based on the trained model is implemented in Cadence. The normalized relative mean square prediction error of the trained model against experimental sample data, together with the simulation results, shows that the model is acceptable for static simulation of digital circuits. This modeling methodology can be extended to general devices.
Description of the LASSO Alpha 1 Release
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gustafson, William I.; Vogelmann, Andrew M.; Cheng, Xiaoping
The Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Climate Research Facility began a pilot project in May 2015 to design a routine, high-resolution modeling capability to complement ARM’s extensive suite of measurements. This modeling capability has been named the Large-Eddy Simulation (LES) ARM Symbiotic Simulation and Observation (LASSO) project. The availability of LES simulations with concurrent observations will serve many purposes. LES helps bridge the scale gap between DOE ARM observations and models, and the use of routine LES adds value to observations. It provides a self-consistent representation of the atmosphere and a dynamical context for the observations. Further, it elucidates unobservable processes and properties. LASSO will generate a simulation library for researchers that enables statistical approaches beyond a single-case mentality. It will also provide tools necessary for modelers to reproduce the LES and conduct their own sensitivity experiments. Many different uses are envisioned for the combined LASSO LES and observational library. For an observationalist, LASSO can help inform instrument remote-sensing retrievals, conduct Observation System Simulation Experiments (OSSEs), and test implications of radar scan strategies or flight paths. For a theoretician, LASSO will help calculate estimates of fluxes and co-variability of values, and test relationships without having to run the model yourself. For a modeler, LASSO will help one know ahead of time which days have good forcing, have co-registered observations at high-resolution scales, and have simulation inputs and corresponding outputs to test parameterizations. Further details on the overall LASSO project are available at http://www.arm.gov/science/themes/lasso.
Frank, Laurence E; Heiser, Willem J
2008-05-01
A set of features is the basis for the network representation of proximity data achieved by feature network models (FNMs). Features are binary variables that characterize the objects in an experiment, with some measure of proximity as response variable. Sometimes features are provided by theory and play an important role in the construction of the experimental conditions. In some research settings, the features are not known a priori. This paper shows how to generate features in this situation and how to select an adequate subset of features that takes into account a good compromise between model fit and model complexity, using a new version of least angle regression that restricts coefficients to be non-negative, called the Positive Lasso. It will be shown that features can be generated efficiently with Gray codes that are naturally linked to the FNMs. The model selection strategy makes use of the fact that FNM can be considered as univariate multiple regression model. A simulation study shows that the proposed strategy leads to satisfactory results if the number of objects is less than or equal to 22. If the number of objects is larger than 22, the number of features selected by our method exceeds the true number of features in some conditions.
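The abstract states that candidate features can be generated efficiently with Gray codes but does not give the procedure. The following is only a minimal sketch of the standard binary-reflected Gray code, under the assumption (not confirmed by the abstract) that each generated tuple is read as one candidate binary feature over the objects; the names `gray_codes` and `hamming` are hypothetical.

```python
def gray_codes(n_bits):
    """Return all 2**n_bits binary tuples in binary-reflected Gray-code order."""
    codes = []
    for i in range(2 ** n_bits):
        g = i ^ (i >> 1)  # standard conversion from index to Gray code
        # Unpack the integer into bits, most significant bit first.
        codes.append(tuple((g >> b) & 1 for b in reversed(range(n_bits))))
    return codes


def hamming(a, b):
    """Number of positions in which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical example: candidate binary patterns over 3 objects.
codes = gray_codes(3)
```

Successive tuples differ in exactly one position, which is the property that makes stepping from one candidate feature to the next cheap.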
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gustafson, William I.; Vogelmann, Andrew M.; Cheng, Xiaoping
The Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Climate Research Facility began a pilot project in May 2015 to design a routine, high-resolution modeling capability to complement ARM’s extensive suite of measurements. This modeling capability has been named the Large-Eddy Simulation (LES) ARM Symbiotic Simulation and Observation (LASSO) project. The initial focus of LASSO is on shallow convection at the ARM Southern Great Plains (SGP) Climate Research Facility. The availability of LES simulations with concurrent observations will serve many purposes. LES helps bridge the scale gap between DOE ARM observations and models, and the use of routine LES adds value to observations. It provides a self-consistent representation of the atmosphere and a dynamical context for the observations. Further, it elucidates unobservable processes and properties. LASSO will generate a simulation library for researchers that enables statistical approaches beyond a single-case mentality. It will also provide tools necessary for modelers to reproduce the LES and conduct their own sensitivity experiments. Many different uses are envisioned for the combined LASSO LES and observational library. For an observationalist, LASSO can help inform instrument remote sensing retrievals, conduct Observation System Simulation Experiments (OSSEs), and test implications of radar scan strategies or flight paths. For a theoretician, LASSO will help calculate estimates of fluxes and co-variability of values, and test relationships without having to run the model yourself. For a modeler, LASSO will help one know ahead of time which days have good forcing, have co-registered observations at high-resolution scales, and have simulation inputs and corresponding outputs to test parameterizations. Further details on the overall LASSO project are available at https://www.arm.gov/capabilities/modeling/lasso.
Son, Yeongkwon; Osornio-Vargas, Álvaro R; O'Neill, Marie S; Hystad, Perry; Texcalac-Sangrador, José L; Ohman-Strickland, Pamela; Meng, Qingyu; Schwander, Stephan
2018-05-17
The Mexico City Metropolitan Area (MCMA) is one of the largest and most populated urban environments in the world and experiences high air pollution levels. We aimed to develop models that estimate pollutant concentrations at fine spatiotemporal scales and provide improved air pollution exposure assessments for health studies in Mexico City. We developed finer spatiotemporal land use regression (LUR) models for PM2.5, PM10, O3, NO2, CO and SO2 using mixed effect models with the Least Absolute Shrinkage and Selection Operator (LASSO). Hourly traffic density was included as a temporal variable besides meteorological and holiday variables. Models of hourly, daily, monthly, 6-monthly and annual averages were developed and evaluated using traditional and novel indices. The developed spatiotemporal LUR models yielded predicted concentrations with good spatial and temporal agreement with measured pollutant levels, except for the hourly PM2.5, PM10 and SO2 models. Most of the LUR models met performance goals based on the standardized indices. LUR models with temporal scales greater than one hour were successfully developed using mixed effect models with LASSO and showed superior model performance compared to earlier LUR models, especially for time scales of a day or longer. The newly developed LUR models will be further refined with ongoing Mexico City air pollution sampling campaigns to improve personal exposure assessments. Copyright © 2018. Published by Elsevier B.V.
Knüppel, Sven; Meidtner, Karina; Arregui, Maria; Holzhütter, Hermann-Georg; Boeing, Heiner
2015-07-01
Analyzing multiple single nucleotide polymorphisms (SNPs) is a promising approach to finding genetic effects beyond single-locus associations. We proposed the use of multilocus stepwise regression (MSR) to screen for allele combinations as a method to model joint effects, and compared the results with the often used genetic risk score (GRS), conventional stepwise selection, and the shrinkage method LASSO. In contrast to MSR, the GRS, conventional stepwise selection, and LASSO model each genotype by the risk allele doses. We reanalyzed 20 unlinked SNPs related to type 2 diabetes (T2D) in the EPIC-Potsdam case-cohort study (760 cases, 2193 noncases). No SNP-SNP interactions and no nonlinear effects were found. Two SNP combinations selected by MSR (Nagelkerke's R² = 0.050 and 0.048) included eight SNPs with mean allele combination frequency of 2%. GRS and stepwise selection selected nearly the same SNP combinations consisting of 12 and 13 SNPs (Nagelkerke's R² ranged from 0.020 to 0.029). LASSO showed similar results. The MSR method showed the best model fit measured by Nagelkerke's R², suggesting that further improvement may render this method a useful tool in genetic research. However, our comparison suggests that the GRS is a simple way to model genetic effects since it does not consider linkage, SNP-SNP interactions, or non-linear effects. © 2015 John Wiley & Sons Ltd/University College London.
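Model fit in this abstract is reported as Nagelkerke's R². As a reminder of what that number measures, here is a minimal sketch of the usual definition (Cox-Snell R² rescaled by its maximum attainable value), computed from the log-likelihoods of the null and fitted models. The function name and the example log-likelihood values are hypothetical and are not taken from the study.

```python
import math


def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's R^2 from null and fitted log-likelihoods for n observations."""
    # Cox-Snell R^2: 1 - (L_null / L_model) ** (2 / n), written with log-likelihoods.
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    # Upper bound of Cox-Snell R^2, reached when the fitted likelihood equals 1.
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_cox_snell


# Hypothetical log-likelihoods for an intercept-only and a fitted logistic model.
r2_small_gain = nagelkerke_r2(-100.0, -98.0, 200)
r2_large_gain = nagelkerke_r2(-100.0, -80.0, 200)
```

A model that does not improve on the null gives 0, and a larger gain in log-likelihood gives a larger value, up to 1 for a perfect fit.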
Assessment of Weighted Quantile Sum Regression for Modeling Chemical Mixtures and Cancer Risk
Czarnota, Jenna; Gennings, Chris; Wheeler, David C
2015-01-01
In evaluation of cancer risk related to environmental chemical exposures, the effect of many chemicals on disease is ultimately of interest. However, because of potentially strong correlations among chemicals that occur together, traditional regression methods suffer from collinearity effects, including regression coefficient sign reversal and variance inflation. In addition, penalized regression methods designed to remediate collinearity may have limitations in selecting the truly bad actors among many correlated components. The recently proposed method of weighted quantile sum (WQS) regression attempts to overcome these problems by estimating a body burden index, which identifies important chemicals in a mixture of correlated environmental chemicals. Our focus was on assessing through simulation studies the accuracy of WQS regression in detecting subsets of chemicals associated with health outcomes (binary and continuous) in site-specific analyses and in non-site-specific analyses. We also evaluated the performance of the penalized regression methods of lasso, adaptive lasso, and elastic net in correctly classifying chemicals as bad actors or unrelated to the outcome. We based the simulation study on data from the National Cancer Institute Surveillance Epidemiology and End Results Program (NCI-SEER) case–control study of non-Hodgkin lymphoma (NHL) to achieve realistic exposure situations. Our results showed that WQS regression had good sensitivity and specificity across a variety of conditions considered in this study. The shrinkage methods had a tendency to incorrectly identify a large number of components, especially in the case of strong association with the outcome. PMID:26005323
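The lasso and elastic net compared in this abstract share one building block, the soft-thresholding update. The sketch below is a generic coordinate-descent solver for the objective (1/(2n))·||y − Xb||² + lam·(alpha·||b||₁ + (1 − alpha)/2·||b||²), assuming centered data and no intercept; it is not the software used in the study, and the adaptive lasso is not covered. The toy design is deliberately orthogonal so the result can be checked by hand; all names and numbers are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(X, y, lam, alpha=1.0, n_iter=200):
    """Coordinate descent for the elastic net; alpha=1.0 gives the lasso."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


# Toy orthogonal design: one strong predictor and one weak predictor.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.2, -2.0, -0.2]
beta_lasso = elastic_net_cd(X, y, lam=0.25, alpha=1.0)
beta_enet = elastic_net_cd(X, y, lam=0.25, alpha=0.5)
```

On this toy problem both penalties set the weak coefficient exactly to zero, while the elastic net shrinks the strong coefficient slightly more than the lasso. With strongly correlated columns, as in the chemical mixtures described above, the two penalties can select different components.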
Silver, Matt; Montana, Giovanni
2012-01-01
Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within biological pathways, the incorporation of prior pathways information into a statistical model is expected to increase the power to detect true associations in a genetic association study. Most existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the dependence patterns among SNPs within pathways. We use a sparse regression model, with SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait. Notable features of our “pathways group lasso with adaptive weights” (P-GLAW) algorithm include the incorporation of all pathways in a single regression model, an adaptive pathway weighting procedure that accounts for factors biasing pathway selection, and the use of a bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes account of the presence of overlapping pathways and uses a novel combination of techniques to optimise model estimation, making it fast to run, even on whole genome datasets. In a comparison study with an alternative pathways method based on univariate SNP statistics, our method demonstrates high sensitivity and specificity for the detection of important pathways, showing the greatest relative gains in performance where marginal SNP effect sizes are small. PMID:22499682
Shi, Yuan; Liu, Xu; Kok, Suet-Yheng; Rajarethinam, Jayanthi; Liang, Shaohong; Yap, Grace; Chong, Chee-Seng; Lee, Kim-Sung; Tan, Sharon S Y; Chin, Christopher Kuan Yew; Lo, Andrew; Kong, Waiming; Ng, Lee Ching; Cook, Alex R
2016-09-01
With its tropical rainforest climate, rapid urbanization, and changing demography and ecology, Singapore experiences endemic dengue; the last large outbreak in 2013 culminated in 22,170 cases. In the absence of a vaccine on the market, vector control is the key approach for prevention. We sought to forecast the evolution of dengue epidemics in Singapore to provide early warning of outbreaks and to facilitate the public health response to moderate an impending outbreak. We developed a set of statistical models using least absolute shrinkage and selection operator (LASSO) methods to forecast the weekly incidence of dengue notifications over a 3-month time horizon. This forecasting tool used a variety of data streams and was updated weekly, including recent case data, meteorological data, vector surveillance data, and population-based national statistics. The forecasting methodology was compared with alternative approaches that have been proposed to model dengue case data (seasonal autoregressive integrated moving average and step-down linear regression) by fielding them on the 2013 dengue epidemic, the largest on record in Singapore. Operationally useful forecasts were obtained at a 3-month lag using the LASSO-derived models. Based on the mean average percentage error, the LASSO approach provided more accurate forecasts than the other methods we assessed. We demonstrate its utility in Singapore's dengue control program by providing a forecast of the 2013 outbreak for advance preparation of outbreak response. Statistical models built using machine learning methods such as LASSO have the potential to markedly improve forecasting techniques for recurrent infectious disease outbreaks such as dengue. Shi Y, Liu X, Kok SY, Rajarethinam J, Liang S, Yap G, Chong CS, Lee KS, Tan SS, Chin CK, Lo A, Kong W, Ng LC, Cook AR. 2016. Three-month real-time dengue forecast models: an early warning system for outbreak alerts and policy decision support in Singapore. 
Environ Health Perspect 124:1369-1375; http://dx.doi.org/10.1289/ehp.1509981.
Zhou, Hua; Li, Lexin
2014-01-01
Summary Modern technologies are producing a wealth of data with complex structures. For instance, in two-dimensional digital imaging, flow cytometry and electroencephalography, matrix-type covariates frequently arise when measurements are obtained for each combination of two underlying variables. To address scientific questions arising from those data, new regression methods that take matrices as covariates are needed, and sparsity or other forms of regularization are crucial owing to the ultrahigh dimensionality and complex structure of the matrix data. The popular lasso and related regularization methods hinge on the sparsity of the true signal in terms of the number of its non-zero coefficients. However, for the matrix data, the true signal is often of, or can be well approximated by, a low rank structure. As such, the sparsity is frequently in the form of low rank of the matrix parameters, which may seriously violate the assumption of the classical lasso. We propose a class of regularized matrix regression methods based on spectral regularization. A highly efficient and scalable estimation algorithm is developed, and a degrees-of-freedom formula is derived to facilitate model selection along the regularization path. Superior performance of the method proposed is demonstrated on both synthetic and real examples. PMID:24648830
Mehrtak, Mohammad; Yusefzadeh, Hasan; Jaafaripooyan, Ebrahim
2014-01-01
Background: Performance measurement is essential to the management of health care organizations, for which efficiency is per se a vital indicator. The present study accordingly aims to measure the efficiency of hospitals employing two distinct methods. Methods: Data Envelopment Analysis (DEA) and the Pabon Lasso model were jointly applied to calculate the efficiency of all general hospitals located in East Azerbaijan Province, Iran. Data were collected using the hospitals’ monthly performance forms and analyzed and displayed with MS Visio and DEAP software. Results: According to the Pabon Lasso model, 44.5% of the hospitals were entirely efficient, whilst DEA revealed 61% to be efficient. By the Pabon Lasso model, 39% of the hospitals were wholly inefficient; based on DEA, though, the relevant figure was only 22.2%. Finally, 16.5% of hospitals as calculated by Pabon Lasso and 16.7% by DEA were relatively efficient. DEA classified more hospitals as efficient than the Pabon Lasso model did. Conclusion: Simultaneous use of the two models yielded complementary and corroborative results, as both evidently reveal efficient hospitals. However, their results should be compared with prudence. Whilst the Pabon Lasso inefficient zone is fully clear, DEA does not provide such a crystal clear limit for inefficiency. PMID:24999147
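The abstract does not spell out how the Pabon Lasso model assigns hospitals to zones. As commonly described, it plots bed occupancy rate against bed turnover and splits the plane at the mean of each indicator. The sketch below assumes that convention and maps the quadrants onto the three labels used in the abstract; the function name, the tie-breaking rule (values equal to the mean count as high), and the hospital figures are all hypothetical.

```python
from statistics import mean


def classify_hospitals(hospitals):
    """Map hospital name -> label from (bed_occupancy_rate, bed_turnover) pairs."""
    # Assumed convention: cut each indicator at its mean across all hospitals.
    bor_cut = mean(v[0] for v in hospitals.values())
    btr_cut = mean(v[1] for v in hospitals.values())
    result = {}
    for name, (bor, btr) in hospitals.items():
        high_bor, high_btr = bor >= bor_cut, btr >= btr_cut
        if high_bor and high_btr:
            result[name] = "efficient"
        elif not high_bor and not high_btr:
            result[name] = "inefficient"
        else:
            # Exactly one indicator is above the mean.
            result[name] = "relatively efficient"
    return result


# Hypothetical hospitals: (bed occupancy rate in %, bed turnover in patients per bed).
example = {"A": (80.0, 90.0), "B": (40.0, 30.0), "C": (75.0, 35.0), "D": (45.0, 85.0)}
zones = classify_hospitals(example)
```

Because the cut-offs are sample means, a hospital's label depends on which other hospitals are in the comparison set, which is one reason the abstract advises comparing Pabon Lasso and DEA results with prudence.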
Recommendations for the Implementation of the LASSO Workflow
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gustafson, William I; Vogelmann, Andrew M; Cheng, Xiaoping
The U.S. Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Research Facility began a pilot project in May 2015 to design a routine, high-resolution modeling capability to complement ARM’s extensive suite of measurements. This modeling capability, envisioned in the ARM Decadal Vision (U.S. Department of Energy 2014), subsequently has been named the Large-Eddy Simulation (LES) ARM Symbiotic Simulation and Observation (LASSO) project, and it has an initial focus of shallow convection at the ARM Southern Great Plains (SGP) atmospheric observatory. This report documents the recommendations resulting from the pilot project to be considered by ARM for implementation into routine operations. During the pilot phase, LASSO has evolved from the initial vision outlined in the pilot project white paper (Gustafson and Vogelmann 2015) to what is recommended in this report. Further details on the overall LASSO project are available at https://www.arm.gov/capabilities/modeling/lasso. Feedback regarding LASSO and the recommendations in this report can be directed to William Gustafson, the project principal investigator (PI), and Andrew Vogelmann, the co-principal investigator (Co-PI), via lasso@arm.gov.
A Selective Review of Group Selection in High-Dimensional Models
Huang, Jian; Breheny, Patrick; Ma, Shuangge
2013-01-01
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study. PMID:24174707
Kim, Dongchul; Kang, Mingon; Biswas, Ashis; Liu, Chunyu; Gao, Jean
2016-08-10
Inferring gene regulatory networks is one of the most interesting research areas in systems biology. Many inference methods have been developed using a variety of computational models and approaches. However, there are two issues to solve. First, depending on the structural or computational model of the inference method, the results tend to be inconsistent because of the innately different advantages and limitations of the methods. Therefore, a combination of dissimilar approaches is needed as an alternative way to overcome the limitations of standalone methods through complementary integration. Second, sparse linear regression penalized by a regularization parameter (lasso) and bootstrapping-based sparse linear regression methods have been suggested in state-of-the-art methods for network inference, but they are not effective for data with a small sample size, and a true regulator could be missed if the target gene is strongly affected by an indirect regulator with high correlation or by another true regulator. We present two novel network inference methods based on the integration of three different criteria: (i) z-score to measure the variation of gene expression from knockout data, (ii) mutual information for the dependency between two genes, and (iii) linear regression-based feature selection. Based on these criteria, we propose a lasso-based random feature selection algorithm (LARF) to achieve better performance, overcoming the limitations of bootstrapping mentioned above. In this work, there are three main contributions. First, our z-score-based method to measure gene expression variations from knockout data is more effective than similar criteria in related works. Second, we confirmed that true regulator selection can be effectively improved by LARF. Lastly, we verified that an integrative approach can clearly outperform a single method when two different methods are effectively combined. 
In the experiments, our methods were validated by outperforming the state of the art methods on DREAM challenge data, and then LARF was applied to inferences of gene regulatory network associated with psychiatric disorders.
Content Coding of Psychotherapy Transcripts Using Labeled Topic Models.
Gaut, Garren; Steyvers, Mark; Imel, Zac E; Atkins, David C; Smyth, Padhraic
2017-03-01
Psychotherapy represents a broad class of medical interventions received by millions of patients each year. Unlike most medical treatments, its primary mechanisms are linguistic; i.e., the treatment relies directly on a conversation between a patient and provider. However, the evaluation of patient-provider conversation suffers from critical shortcomings, including intensive labor requirements, coder error, nonstandardized coding systems, and inability to scale up to larger data sets. To overcome these shortcomings, psychotherapy analysis needs a reliable and scalable method for summarizing the content of treatment encounters. We used a publicly available psychotherapy corpus from Alexander Street Press comprising a large collection of transcripts of patient-provider conversations to compare coding performance for two machine learning methods. We used the labeled latent Dirichlet allocation (L-LDA) model to learn associations between text and codes, to predict codes in psychotherapy sessions, and to localize specific passages of within-session text representative of a session code. We compared the L-LDA model to a baseline lasso regression model using predictive accuracy and model generalizability (measured by calculating the area under the curve (AUC) from the receiver operating characteristic curve). The L-LDA model outperforms the lasso logistic regression model at predicting session-level codes, with average AUC scores of 0.79 and 0.70, respectively. For fine-grained level coding, L-LDA and logistic regression are able to identify specific talk-turns representative of symptom codes. However, model performance for talk-turn identification is not yet as reliable as human coders. We conclude that the L-LDA model has the potential to be an objective, scalable method for accurate automated coding of psychotherapy sessions that performs better than comparable discriminative methods at session-level coding and can also predict fine-grained codes.
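The comparison above rests on the area under the ROC curve. A minimal sketch of how AUC can be computed from predicted scores follows, using the equivalent pairwise (Mann-Whitney) formulation: the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as one half. The function name and the toy labels and scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0  # positive scored above negative
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical session-level code labels (1 = code present) and model scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
```

Here five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based computation would be preferable on large corpora.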
Content Coding of Psychotherapy Transcripts Using Labeled Topic Models
Gaut, Garren; Steyvers, Mark; Imel, Zac E; Atkins, David C; Smyth, Padhraic
2016-01-01
Psychotherapy represents a broad class of medical interventions received by millions of patients each year. Unlike most medical treatments, its primary mechanisms are linguistic; i.e., the treatment relies directly on a conversation between a patient and provider. However, the evaluation of patient-provider conversation suffers from critical shortcomings, including intensive labor requirements, coder error, non-standardized coding systems, and inability to scale up to larger data sets. To overcome these shortcomings, psychotherapy analysis needs a reliable and scalable method for summarizing the content of treatment encounters. We used a publicly-available psychotherapy corpus from Alexander Street press comprising a large collection of transcripts of patient-provider conversations to compare coding performance for two machine learning methods. We used the Labeled Latent Dirichlet Allocation (L-LDA) model to learn associations between text and codes, to predict codes in psychotherapy sessions, and to localize specific passages of within-session text representative of a session code. We compared the L-LDA model to a baseline lasso regression model using predictive accuracy and model generalizability (measured by calculating the area under the curve (AUC) from the receiver operating characteristic (ROC) curve). The L-LDA model outperforms the lasso logistic regression model at predicting session-level codes with average AUC scores of .79, and .70, respectively. For fine-grained level coding, L-LDA and logistic regression are able to identify specific talk-turns representative of symptom codes. However, model performance for talk-turn identification is not yet as reliable as human coders. We conclude that the L-LDA model has the potential to be an objective, scaleable method for accurate automated coding of psychotherapy sessions that performs better than comparable discriminative methods at session-level coding and can also predict fine-grained codes. PMID:26625437
Wang, Jing-Jing; Wu, Hai-Feng; Sun, Tao; Li, Xia; Wang, Wei; Tao, Li-Xin; Huo, Da; Lv, Ping-Xin; He, Wen; Guo, Xiu-Hua
2013-01-01
Lung cancer, one of the leading causes of cancer-related deaths, usually appears as solitary pulmonary nodules (SPNs) which are hard to diagnose using the naked eye. In this paper, curvelet-based textural features and clinical parameters are used with three prediction models [a multilevel model, a least absolute shrinkage and selection operator (LASSO) regression method, and a support vector machine (SVM)] to improve the diagnosis of benign and malignant SPNs. Dimensionality reduction of the original curvelet-based textural features was achieved using principal component analysis. In addition, non-conditional logistical regression was used to find clinical predictors among demographic parameters and morphological features. The results showed that, combined with 11 clinical predictors, the accuracy rates using 12 principal components were higher than those using the original curvelet-based textural features. To evaluate the models, 10-fold cross validation and back substitution were applied. The results obtained, respectively, were 0.8549 and 0.9221 for the LASSO method, 0.9443 and 0.9831 for SVM, and 0.8722 and 0.9722 for the multilevel model. All in all, it was found that using curvelet-based textural features after dimensionality reduction and using clinical predictors, the highest accuracy rate was achieved with SVM. The method may be used as an auxiliary tool to differentiate between benign and malignant SPNs in CT images.
Penalized regression procedures for variable selection in the potential outcomes framework
Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L.
2015-01-01
A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple ‘impute, then select’ class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems, and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data and imputation are drawn. A difference LASSO algorithm is defined, along with its multiple imputation analogues. The procedures are illustrated using a well-known right heart catheterization dataset. PMID:25628185
A data-driven approach to modeling physical fatigue in the workplace using wearable sensors.
Sedighi Maman, Zahra; Alamdar Yazdi, Mohammad Ali; Cavuoto, Lora A; Megahed, Fadel M
2017-11-01
Wearable sensors are currently being used to manage fatigue in professional athletics, transportation and mining industries. In manufacturing, physical fatigue is a challenging ergonomic and safety issue since it lowers productivity and increases the incidence of accidents. Therefore, physical fatigue must be managed. There are two main goals for this study. First, we examine the use of wearable sensors to detect physical fatigue occurrence in simulated manufacturing tasks. The second goal is to estimate the physical fatigue level over time. In order to achieve these goals, sensory data were recorded for eight healthy participants. Penalized logistic and multiple linear regression models were used for physical fatigue detection and level estimation, respectively. Important features from the five sensor locations were selected using the Least Absolute Shrinkage and Selection Operator (LASSO), a popular variable selection methodology. The results show that the LASSO model performed well for both physical fatigue detection and modeling. The modeling approach is not participant and/or workload regime specific and thus can be adopted for other applications. Copyright © 2017 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Arumsari, Nurvita; Sutidjo, S. U.; Brodjol; Soedjono, Eddy S.
2014-03-01
Diarrhea is one of the main causes of morbidity and mortality among children around the world, especially in developing countries. Available data indicate that the adoption of sanitary and healthy lifestyles by inhabitants is still inadequate. Inadequate environmental conditions and limited availability of health services are suspected factors influencing diarrhea cases, accompanied by a rising percentage of diarrhea sufferers. This research aims to model diarrhea cases using the Geographically Weighted Lasso (GWL) method. Spatial heterogeneity, tested with the Breusch-Pagan test, showed that modeling with weighted regression, especially Geographically Weighted Regression (GWR) and GWL, can explain the variation at each location. However, because there was no multicollinearity among the predictor variables affecting diarrhea, the GWR and GWL models were not different, or identical, as shown by the resulting MSE values. The R² value, which was usually higher for the GWL model, indicated significant predictor variables based on greater parameter shrinkage.
Mose, Louise S; Pedersen, Susanne S; Debrabant, Birgit; Jensen, Rigmor H; Gram, Bibi
2018-05-25
Factors associated with development of medication-overuse headache (MOH) in migraine patients are not fully understood, but with respect to prevention, the ability to predict the onset of MOH is clinically important. The aims were to examine whether personality characteristics, disability and physical activity level are associated with the onset of MOH in a group of migraine patients and to explore to what extent these factors combined can predict the onset of MOH. The study was a single-center prospective observational study of migraine patients. At inclusion, all patients completed questionnaires evaluating 1) personality (NEO Five-Factor Inventory), 2) disability (Migraine Disability Assessment), and 3) physical activity level (Physical Activity Scale 2.1). Diagnostic codes from patients' electronic health records confirmed whether they had developed MOH during the study period of 20 months. Analyses of associations were performed, and to identify which of the variables predict the onset of MOH, a multivariable least absolute shrinkage and selection operator (LASSO) logistic regression model was fitted to predict presence or absence of MOH. Out of 131 participants, 12% (n=16) developed MOH. Migraine disability score (OR=1.02, 95% CI: 1.00 to 1.04), intensity of headache (OR=1.49, 95% CI: 1.03 to 2.15) and headache frequency (OR=1.02, 95% CI: 1.00 to 1.04) were associated with the onset of MOH, adjusting for age and gender. The predictive performance of the LASSO model (containing the predictors MIDAS score, MIDAS-intensity and -frequency, neuroticism score, time with moderate physical activity, educational level, hours of sleep daily and number of contacts to the headache clinic), evaluated in terms of area under the curve (AUC), was weak (apparent AUC=0.62, 95% CI: 0.41-0.82). 
Disability, headache intensity and frequency were associated with the onset of MOH, whereas personality and the level of physical activity were not. The multivariable LASSO model based on personality, disability and physical activity is applicable despite the moderate study size; however, it can be considered a weak classifier for discriminating between absence and presence of MOH.
Ridge, Lasso and Bayesian additive-dominance genomic models.
Azevedo, Camila Ferreira; de Resende, Marcos Deon Vilela; E Silva, Fabyano Fonseca; Viana, José Marcelo Soriano; Valente, Magno Sávio Ferreira; Resende, Márcio Fernando Ribeiro; Muñoz, Patricio
2015-08-25
A complete approach for genome-wide selection (GWS) involves reliable statistical genetics models and methods. Reports on this topic are common for additive genetic models but not for additive-dominance models. The objective of this paper was (i) to compare the performance of 10 additive-dominance predictive models (including current models and proposed modifications), fitted using Bayesian, Lasso and Ridge regression approaches; and (ii) to decompose genomic heritability and accuracy in terms of three quantitative genetic information sources, namely, linkage disequilibrium (LD), co-segregation (CS) and pedigree relationships or family structure (PR). The simulation study considered two broad sense heritability levels (0.30 and 0.50, associated with narrow sense heritabilities of 0.20 and 0.35, respectively) and two genetic architectures for traits (the first consisting of small gene effects and the second consisting of a mixed inheritance model with five major genes). G-REML/G-BLUP and a modified Bayesian/Lasso (called BayesA*B* or t-BLASSO) method performed best in the prediction of genomic breeding as well as the total genotypic values of individuals in all four scenarios (two heritabilities x two genetic architectures). The BayesA*B*-type method showed a better ability to recover the dominance variance/additive variance ratio. Decomposition of genomic heritability and accuracy revealed the following descending importance order of information: LD, CS and PR not captured by markers, the last two being very close. Amongst the 10 models/methods evaluated, the G-BLUP, BAYESA*B* (-2,8) and BAYESA*B* (4,6) methods presented the best results and were found to be adequate for accurately predicting genomic breeding and total genotypic values as well as for estimating additive and dominance in additive-dominance genomic models.
REGULARIZATION FOR COX’S PROPORTIONAL HAZARDS MODEL WITH NP-DIMENSIONALITY*
Fan, Jianqing; Jiang, Jiancheng
2011-01-01
High throughput genetic sequencing arrays with thousands of measurements per sample and a great amount of related censored clinical data have increased the demand for better measurement-specific model selection. In this paper we establish strong oracle properties of non-concave penalized methods for non-polynomial (NP) dimensional data with censoring in the framework of Cox’s proportional hazards model. A class of folded-concave penalties is employed and both LASSO and SCAD are discussed specifically. We address the question of under which dimensionality and correlation restrictions an oracle estimator can be constructed and grasped. It is demonstrated that non-concave penalties lead to significant reduction of the “irrepresentable condition” needed for LASSO model selection consistency. The large deviation result for martingales, bearing interest of its own, is developed for characterizing the strong oracle property. Moreover, the non-concave regularized estimator is shown to achieve asymptotically the information bound of the oracle estimator. A coordinate-wise algorithm is developed for finding the grid of solution paths for penalized hazard regression problems, and its performance is evaluated on simulated and gene association study examples. PMID:23066171
REGULARIZATION FOR COX'S PROPORTIONAL HAZARDS MODEL WITH NP-DIMENSIONALITY.
Bradic, Jelena; Fan, Jianqing; Jiang, Jiancheng
2011-01-01
High throughput genetic sequencing arrays with thousands of measurements per sample and a great amount of related censored clinical data have increased the demand for better measurement-specific model selection. In this paper we establish strong oracle properties of non-concave penalized methods for non-polynomial (NP) dimensional data with censoring in the framework of Cox's proportional hazards model. A class of folded-concave penalties is employed and both LASSO and SCAD are discussed specifically. We address the question of under which dimensionality and correlation restrictions an oracle estimator can be constructed and grasped. It is demonstrated that non-concave penalties lead to significant reduction of the "irrepresentable condition" needed for LASSO model selection consistency. The large deviation result for martingales, bearing interest of its own, is developed for characterizing the strong oracle property. Moreover, the non-concave regularized estimator is shown to achieve asymptotically the information bound of the oracle estimator. A coordinate-wise algorithm is developed for finding the grid of solution paths for penalized hazard regression problems, and its performance is evaluated on simulated and gene association study examples.
Shi, Yuan; Liu, Xu; Kok, Suet-Yheng; Rajarethinam, Jayanthi; Liang, Shaohong; Yap, Grace; Chong, Chee-Seng; Lee, Kim-Sung; Tan, Sharon S.Y.; Chin, Christopher Kuan Yew; Lo, Andrew; Kong, Waiming; Ng, Lee Ching; Cook, Alex R.
2015-01-01
Background: With its tropical rainforest climate, rapid urbanization, and changing demography and ecology, Singapore experiences endemic dengue; the last large outbreak in 2013 culminated in 22,170 cases. In the absence of a vaccine on the market, vector control is the key approach for prevention. Objectives: We sought to forecast the evolution of dengue epidemics in Singapore to provide early warning of outbreaks and to facilitate the public health response to moderate an impending outbreak. Methods: We developed a set of statistical models using least absolute shrinkage and selection operator (LASSO) methods to forecast the weekly incidence of dengue notifications over a 3-month time horizon. This forecasting tool used a variety of data streams and was updated weekly, including recent case data, meteorological data, vector surveillance data, and population-based national statistics. The forecasting methodology was compared with alternative approaches that have been proposed to model dengue case data (seasonal autoregressive integrated moving average and step-down linear regression) by fielding them on the 2013 dengue epidemic, the largest on record in Singapore. Results: Operationally useful forecasts were obtained at a 3-month lag using the LASSO-derived models. Based on the mean average percentage error, the LASSO approach provided more accurate forecasts than the other methods we assessed. We demonstrate its utility in Singapore’s dengue control program by providing a forecast of the 2013 outbreak for advance preparation of outbreak response. Conclusions: Statistical models built using machine learning methods such as LASSO have the potential to markedly improve forecasting techniques for recurrent infectious disease outbreaks such as dengue. Citation: Shi Y, Liu X, Kok SY, Rajarethinam J, Liang S, Yap G, Chong CS, Lee KS, Tan SS, Chin CK, Lo A, Kong W, Ng LC, Cook AR. 2016. Three-month real-time dengue forecast models: an early warning system for outbreak alerts and policy decision support in Singapore. Environ Health Perspect 124:1369–1375; http://dx.doi.org/10.1289/ehp.1509981 PMID:26662617
VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS
Huang, Jian; Horowitz, Joel L.; Wei, Fengrong
2010-01-01
We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is “small” relative to the sample size. The statistical problem is to determine which additive components are nonzero. The additive components are approximated by truncated series expansions with B-spline bases. With this approximation, the problem of component selection becomes that of selecting the groups of coefficients in the expansion. We apply the adaptive group Lasso to select nonzero components, using the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We give conditions under which the group Lasso selects a model whose number of components is comparable with the underlying model, and the adaptive group Lasso selects the nonzero components correctly with probability approaching one as the sample size increases and achieves the optimal rate of convergence. The results of Monte Carlo experiments show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the application of the proposed method. PMID:21127739
Guo, Pi; Zeng, Fangfang; Hu, Xiaomin; Zhang, Dingmei; Zhu, Shuming; Deng, Yu; Hao, Yuantao
2015-01-01
Objectives: In epidemiological studies, it is important to identify independent associations between collective exposures and a health outcome. The current stepwise selection technique ignores stochastic errors and suffers from a lack of stability. The alternative LASSO-penalized regression model can be applied to detect significant predictors from a pool of candidate variables. However, this technique is prone to false positives and tends to create excessive biases. It remains challenging to develop robust variable selection methods and enhance predictability. Material and methods: Two improved algorithms, denoted the two-stage hybrid and bootstrap ranking procedures, both using a LASSO-type penalty, were developed for epidemiological association analysis. The performance of the proposed procedures and other methods including conventional LASSO, Bolasso, stepwise and stability selection models were evaluated using intensive simulation. In addition, methods were compared by using an empirical analysis based on large-scale survey data of hepatitis B infection-relevant factors among Guangdong residents. Results: The proposed procedures produced comparable or less biased selection results when compared to conventional variable selection models. In total, the two newly proposed procedures were stable with respect to various scenarios of simulation, demonstrating a higher power and a lower false positive rate during variable selection than the compared methods. In empirical analysis, the proposed procedures yielding a sparse set of hepatitis B infection-relevant factors gave the best predictive performance and showed that the procedures were able to select a more stringent set of factors. The individual history of hepatitis B vaccination, family and individual history of hepatitis B infection were associated with hepatitis B infection in the studied residents according to the proposed procedures. Conclusions: The newly proposed procedures improve the identification of significant variables and enable us to derive a new insight into epidemiological association analysis. PMID:26214802
Predicting recreational water quality advisories: A comparison of statistical methods
Brooks, Wesley R.; Corsi, Steven R.; Fienen, Michael N.; Carvin, Rebecca B.
2016-01-01
Epidemiological studies indicate that fecal indicator bacteria (FIB) in beach water are associated with illnesses among people having contact with the water. In order to mitigate public health impacts, many beaches are posted with an advisory when the concentration of FIB exceeds a beach action value. The most commonly used method of measuring FIB concentration takes 18–24 h before returning a result. In order to avoid the 24 h lag, it has become common to “nowcast” the FIB concentration using statistical regressions on environmental surrogate variables. Most commonly, nowcast models are estimated using ordinary least squares regression, but other regression methods from the statistical and machine learning literature are sometimes used. This study compares 14 regression methods across 7 Wisconsin beaches to identify which consistently produces the most accurate predictions. A random forest model is identified as the most accurate, followed by multiple regression fit using the adaptive LASSO.
Liang, Zhaohui; Liu, Jun; Huang, Jimmy X; Zeng, Xing
2018-01-01
The genetic polymorphism of Cytochrome P450 (CYP 450) is considered one of the main causes of adverse drug reactions (ADRs). In order to explore the latent correlations between ADRs and potentially corresponding single-nucleotide polymorphisms (SNPs) in CYP450, three algorithms based on information theory are used as the main method to predict the possible relation. The study uses a retrospective case-control design to explore the potential relation of ADRs to specific genomic locations and SNPs. The genomic data collected from 53 healthy volunteers are used for the analysis; another group of genomic data collected from 30 healthy volunteers excluded from the study is used as the control group. The SNPs on five loci of CYP2D6*2,*10,*14 and CYP1A2*1C, *1F are detected by the Applied Biosystem 3130xl. The raw data are processed by ChromasPro to detect the specific alleles on the above loci from each sample. The secondary data are reorganized and processed by R combined with the reports of ADRs from clinical reports. Three information theory based algorithms are implemented for the screening task: JMI, CMIM, and mRMR. If a SNP is selected by more than two algorithms, we are confident to conclude that it is related to the corresponding ADR. The selection results are compared with the control decision tree + LASSO regression model. In the study group where ADRs occur, 10 SNPs are considered relevant to the occurrence of a specific ADR by the combined information theory model. In comparison, only 5 SNPs are considered relevant to a specific ADR by the decision tree + LASSO regression model. In addition, the new method detects more relevant pairs of SNP and ADR which are affected by both SNP and dosage. This implies that the new information theory based model is effective in discovering correlations of ADRs and CYP 450 SNPs and is helpful in predicting the potentially vulnerable genotype for some ADRs. The newly proposed information theory based model has superior performance in detecting the relation between SNP and ADR compared to the decision tree + LASSO regression model. The new model is more sensitive in detecting ADRs than the old method, while the old method is more reliable. Therefore, the criteria for selecting algorithms should depend on the pragmatic needs. Copyright© Bentham Science Publishers.
Predicting Quantitative Traits With Regression Models for Dense Molecular Markers and Pedigree
de los Campos, Gustavo; Naya, Hugo; Gianola, Daniel; Crossa, José; Legarra, Andrés; Manfredi, Eduardo; Weigel, Kent; Cotes, José Miguel
2009-01-01
The availability of genomewide dense markers brings opportunities and challenges to breeding programs. An important question concerns the ways in which dense markers and pedigrees, together with phenotypic records, should be used to arrive at predictions of genetic values for complex traits. If a large number of markers are included in a regression model, marker-specific shrinkage of regression coefficients may be needed. For this reason, the Bayesian least absolute shrinkage and selection operator (LASSO) (BL) appears to be an interesting approach for fitting marker effects in a regression model. This article adapts the BL to arrive at a regression model where markers, pedigrees, and covariates other than markers are considered jointly. Connections between BL and other marker-based regression models are discussed, and the sensitivity of BL with respect to the choice of prior distributions assigned to key parameters is evaluated using simulation. The proposed model was fitted to two data sets from wheat and mouse populations, and evaluated using cross-validation methods. Results indicate that inclusion of markers in the regression further improved the predictive ability of models. An R program that implements the proposed model is freely available. PMID:19293140
Xu, Cheng-Jian; van der Schaaf, Arjen; Schilstra, Cornelis; Langendijk, Johannes A; van't Veld, Aart A
2012-03-15
To study the impact of different statistical learning methods on the prediction performance of multivariate normal tissue complication probability (NTCP) models. In this study, three learning methods, stepwise selection, least absolute shrinkage and selection operator (LASSO), and Bayesian model averaging (BMA), were used to build NTCP models of xerostomia following radiotherapy treatment for head and neck cancer. Performance of each learning method was evaluated by a repeated cross-validation scheme in order to obtain a fair comparison among methods. It was found that the LASSO and BMA methods produced models with significantly better predictive power than that of the stepwise selection method. Furthermore, the LASSO method yields an easily interpretable model as the stepwise method does, in contrast to the less intuitive BMA method. The commonly used stepwise selection method, which is simple to execute, may be insufficient for NTCP modeling. The LASSO method is recommended. Copyright © 2012 Elsevier Inc. All rights reserved.
Tian, Xinyu; Wang, Xuefeng; Chen, Jun
2014-01-01
The classic multinomial logit model, commonly used in multiclass regression problems, is restricted to few predictors and does not take into account the relationships among variables. It has limited use for genomic data, where the number of genomic features far exceeds the sample size. Genomic features such as gene expressions are usually related by an underlying biological network. Efficient use of the network information is important to improve classification performance as well as the biological interpretability. We proposed a multinomial logit model that is capable of addressing both the high dimensionality of predictors and the underlying network information. Group lasso was used to induce model sparsity, and a network constraint was imposed to induce the smoothness of the coefficients with respect to the underlying network structure. To deal with the non-smoothness of the objective function in optimization, we developed a proximal gradient algorithm for efficient computation. The proposed model was compared to models with no prior structure information in both simulations and a problem of cancer subtype prediction with real gene expression data from The Cancer Genome Atlas (TCGA). The network-constrained model outperformed the traditional ones in both cases.
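The abstract above does not spell out the proximal step of its algorithm. As a hedged illustration only, the sketch below shows the standard proximal operator of a group lasso penalty (group soft-thresholding), which is the usual non-smooth step in proximal gradient methods of this kind. The function name `group_soft_threshold` and the toy numbers are assumptions made for this example and are not taken from the paper.

```python
import math


def group_soft_threshold(group, threshold):
    """Proximal operator of threshold * ||group||_2 (illustrative helper).

    The whole coefficient group is shrunk toward zero by a common factor,
    and is set exactly to zero when its Euclidean norm is at most threshold.
    """
    norm = math.sqrt(sum(c * c for c in group))
    if norm <= threshold:
        # The entire group is removed from the model at once.
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * c for c in group]


if __name__ == "__main__":
    # A strong group is shrunk but kept; a weak group is zeroed out.
    print(group_soft_threshold([3.0, 4.0], 1.0))
    print(group_soft_threshold([0.3, 0.4], 1.0))
```

In a proximal gradient iteration, a gradient step on the smooth part of the objective (here that would include the likelihood and, presumably, the network smoothness term) is followed by this operator applied to each coefficient group, with the threshold equal to the step size times the group penalty weight.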
Zhang, Zhongheng; Hong, Yucai
2017-07-25
There are several disease severity scores being used for the prediction of mortality in critically ill patients. However, none of them was developed and validated specifically for patients with severe sepsis. The present study aimed to develop a novel prediction score for severe sepsis. A total of 3206 patients with severe sepsis were enrolled, including 1054 non-survivors and 2152 survivors. The LASSO score showed the best discrimination (area under curve: 0.772; 95% confidence interval: 0.735-0.810) in the validation cohort as compared with other scores such as simplified acute physiology score II, acute physiological score III, Logistic organ dysfunction system, sequential organ failure assessment score, and Oxford Acute Severity of Illness Score. The calibration slope was 0.889 and Brier value was 0.173. The study employed a single-center database called the Medical Information Mart for Intensive Care-III (MIMIC-III) for analysis. Severe sepsis was defined as infection and acute organ dysfunction. Clinical and laboratory variables used in clinical routines were included for screening. Subjects without missing values were included, and the whole dataset was split into training and validation cohorts. The score was coined LASSO score because variable selection was performed using the least absolute shrinkage and selection operator (LASSO) technique. Finally, the LASSO score was evaluated for its discrimination and calibration in the validation cohort. The study developed the LASSO score for mortality prediction in patients with severe sepsis. Although the score had good discrimination and calibration in a randomly selected subsample, external validations are still required.
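The two headline metrics in the abstract above, area under the curve (discrimination) and the Brier value (overall accuracy of predicted probabilities), can be computed from their textbook definitions. The sketch below is a minimal illustration; the function names and the four toy predictions are assumptions for this example and have no connection to the study data.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def auc(probabilities, outcomes):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (event, non-event) pairs in which the event
    received the higher predicted probability; ties count as one half.
    """
    events = [p for p, y in zip(probabilities, outcomes) if y == 1]
    non_events = [p for p, y in zip(probabilities, outcomes) if y == 0]
    wins = 0.0
    for p in events:
        for q in non_events:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(events) * len(non_events))


if __name__ == "__main__":
    # Toy predictions (hypothetical): two deaths (1) and two survivors (0).
    predicted = [0.9, 0.8, 0.3, 0.1]
    observed = [1, 0, 1, 0]
    print(brier_score(predicted, observed))
    print(auc(predicted, observed))
```

A lower Brier value and a higher AUC are better; an AUC of 0.5 corresponds to chance-level discrimination.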
Rosswog, Carolina; Schmidt, Rene; Oberthuer, André; Juraeva, Dilafruz; Brors, Benedikt; Engesser, Anne; Kahlert, Yvonne; Volland, Ruth; Bartenhagen, Christoph; Simon, Thorsten; Berthold, Frank; Hero, Barbara; Faldum, Andreas; Fischer, Matthias
2017-12-01
Current risk stratification systems for neuroblastoma patients consider clinical, histopathological, and genetic variables, and additional prognostic markers have been proposed in recent years. We here sought to select highly informative covariates in a multistep strategy based on consecutive Cox regression models, resulting in a risk score that integrates hazard ratios of prognostic variables. A cohort of 695 neuroblastoma patients was divided into a discovery set (n=75) for multigene predictor generation, a training set (n=411) for risk score development, and a validation set (n=209). Relevant prognostic variables were identified by stepwise multivariable L1-penalized least absolute shrinkage and selection operator (LASSO) Cox regression, followed by backward selection in multivariable Cox regression, and then integrated into a novel risk score. The variables stage, age, MYCN status, and two multigene predictors, NB-th24 and NB-th44, were selected as independent prognostic markers by LASSO Cox regression analysis. Following backward selection, only the multigene predictors were retained in the final model. Integration of these classifiers in a risk scoring system distinguished three patient subgroups that differed substantially in their outcome. The scoring system discriminated patients with diverging outcome in the validation cohort (5-year event-free survival, 84.9±3.4 vs 63.6±14.5 vs 31.0±5.4; P<.001), and its prognostic value was validated by multivariable analysis. We here propose a translational strategy for developing risk assessment systems based on hazard ratios of relevant prognostic variables. Our final neuroblastoma risk score comprised two multigene predictors only, supporting the notion that molecular properties of the tumor cells strongly impact clinical courses of neuroblastoma patients. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Bratosin, S; Laub, O; Tal, J; Aloni, Y
1979-09-01
During an electron-microscopic survey with the aim of identifying the parvovirus MVM transcription template, we observed previously unidentified structures of MVM DNA in lysates of virus-infected cells. These included double-stranded "lasso"-like structures and relaxed circles. Both structures were of unit length MVM DNA, indicating that they were not intermediates formed during replication; they each represented about 5% of the total nuclear MVM DNA. The proportion of these structures was unchanged after digestion with sodium dodecyl sulfate/Pronase and RNase and after mild denaturation treatment. Cleavage of the "lasso" structures with EcoRI restriction endonuclease indicated that the "noose" part of the "lasso" structure is located on the 5' side of the genomic single-stranded MVM DNA. A model is presented for the molecular nature of the circularization process of MVM DNA in which the "lasso" structures are identified as intermediates during circle formation. This model proposes a mechanism for circularization of linear DNAs.
A Path Algorithm for Constrained Estimation
Zhou, Hua; Lange, Kenneth
2013-01-01
Many least-square problems involve affine equality and inequality constraints. Although there are a variety of methods for solving such problems, most statisticians find constrained estimation challenging. The current article proposes a new path-following algorithm for quadratic programming that replaces hard constraints by what are called exact penalties. Similar penalties arise in l1 regularization in model selection. In the regularization setting, penalties encapsulate prior knowledge, and penalized parameter estimates represent a trade-off between the observed data and the prior knowledge. Classical penalty methods of optimization, such as the quadratic penalty method, solve a sequence of unconstrained problems that put greater and greater stress on meeting the constraints. In the limit as the penalty constant tends to ∞, one recovers the constrained solution. In the exact penalty method, squared penalties are replaced by absolute value penalties, and the solution is recovered for a finite value of the penalty constant. The exact path-following method starts at the unconstrained solution and follows the solution path as the penalty constant increases. In the process, the solution path hits, slides along, and exits from the various constraints. Path following in Lasso penalized regression, in contrast, starts with a large value of the penalty constant and works its way downward. In both settings, inspection of the entire solution path is revealing. Just as with the Lasso and generalized Lasso, it is possible to plot the effective degrees of freedom along the solution path. For a strictly convex quadratic program, the exact penalty algorithm can be framed entirely in terms of the sweep operator of regression analysis. A few well-chosen examples illustrate the mechanics and potential of path following. This article has supplementary materials available online. PMID:24039382
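The contrast between quadratic and exact penalties can be seen on a one-dimensional toy problem chosen here purely for illustration (it is not from the article): minimise 0.5*(x - 2)^2 subject to x <= 1, whose constrained solution is x = 1. Working out the stationarity conditions by hand gives the closed forms coded below; the function names are hypothetical, and a brute-force grid search is included as an independent check.

```python
def quadratic_penalty_solution(rho):
    """Minimiser of 0.5*(x - 2)**2 + 0.5*rho*max(0, x - 1)**2.

    Setting (x - 2) + rho*(x - 1) = 0 gives x = (2 + rho) / (1 + rho),
    which stays above 1 for every finite rho.
    """
    return (2.0 + rho) / (1.0 + rho)


def exact_penalty_solution(rho):
    """Minimiser of 0.5*(x - 2)**2 + rho*max(0, x - 1).

    For rho < 1 the minimiser is 2 - rho; for rho >= 1 it sits exactly
    on the constraint boundary x = 1.
    """
    return max(1.0, 2.0 - rho)


def grid_argmin(objective, lo=0.0, hi=3.0, steps=3000):
    """Brute-force minimiser on an evenly spaced grid (sanity check only)."""
    best_x, best_value = lo, objective(lo)
    for s in range(1, steps + 1):
        x = lo + (hi - lo) * s / steps
        value = objective(x)
        if value < best_value:
            best_x, best_value = x, value
    return best_x


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0, 10.0, 1000.0):
        print(rho, quadratic_penalty_solution(rho), exact_penalty_solution(rho))
```

On this toy problem the exact-penalty path starts at the unconstrained solution x = 2 when the penalty constant is 0, moves linearly toward the boundary, hits it at a penalty constant of 1 and stays there, whereas the quadratic-penalty solution only approaches x = 1 in the limit.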
Oh, Ein; Yoo, Tae Keun; Park, Eun-Cheol
2013-09-13
Blindness due to diabetic retinopathy (DR) is the major disability in diabetic patients. Although early management has shown to prevent vision loss, diabetic patients have a low rate of routine ophthalmologic examination. Hence, we developed and validated sparse learning models with the aim of identifying the risk of DR in diabetic patients. Health records from the Korea National Health and Nutrition Examination Surveys (KNHANES) V-1 were used. The prediction models for DR were constructed using data from 327 diabetic patients, and were validated internally on 163 patients in the KNHANES V-1. External validation was performed using 562 diabetic patients in the KNHANES V-2. The learning models, including ridge, elastic net, and LASSO, were compared to the traditional indicators of DR. Considering the Bayesian information criterion, LASSO predicted DR most efficiently. In the internal and external validation, LASSO was significantly superior to the traditional indicators by calculating the area under the curve (AUC) of the receiver operating characteristic. LASSO showed an AUC of 0.81 and an accuracy of 73.6% in the internal validation, and an AUC of 0.82 and an accuracy of 75.2% in the external validation. The sparse learning model using LASSO was effective in analyzing the epidemiological underlying patterns of DR. This is the first study to develop a machine learning model to predict DR risk using health records. LASSO can be an excellent choice when both discriminative power and variable selection are important in the analysis of high-dimensional electronic health records.
Erdem, Cemal; Nagle, Alison M.; Casa, Angelo J.; Litzenburger, Beate C.; Wang, Yu-fen; Taylor, D. Lansing; Lee, Adrian V.; Lezon, Timothy R.
2016-01-01
Insulin and insulin-like growth factor I (IGF1) influence cancer risk and progression through poorly understood mechanisms. To better understand the roles of insulin and IGF1 signaling in breast cancer, we combined proteomic screening with computational network inference to uncover differences in IGF1 and insulin induced signaling. Using reverse phase protein array, we measured the levels of 134 proteins in 21 breast cancer cell lines stimulated with IGF1 or insulin for up to 48 h. We then constructed directed protein expression networks using three separate methods: (i) lasso regression, (ii) conventional matrix inversion, and (iii) entropy maximization. These networks, named here as the time translation models, were analyzed and the inferred interactions were ranked by differential magnitude to identify pathway differences. The two top candidates, chosen for experimental validation, were shown to regulate IGF1/insulin induced phosphorylation events. First, acetyl-CoA carboxylase (ACC) knock-down was shown to increase the level of mitogen-activated protein kinase (MAPK) phosphorylation. Second, stable knock-down of E-Cadherin increased the phospho-Akt protein levels. Both of the knock-down perturbations incurred phosphorylation responses stronger in IGF1 stimulated cells compared with insulin. Overall, the time-translation modeling coupled to wet-lab experiments has proven to be powerful in inferring differential interactions downstream of IGF1 and insulin signaling, in vitro. PMID:27364358
Aeroelastic Model Structure Computation for Envelope Expansion
NASA Technical Reports Server (NTRS)
Kukreja, Sunil L.
2007-01-01
Structure detection is a procedure for selecting a subset of candidate terms, from a full model description, that best describes the observed output. This is a necessary procedure to compute an efficient system description which may afford greater insight into the functionality of the system or a simpler controller design. Structure computation as a tool for black-box modelling may be of critical importance in the development of robust, parsimonious models for the flight-test community. Moreover, this approach may lead to efficient strategies for rapid envelope expansion which may save significant development time and costs. In this study, a least absolute shrinkage and selection operator (LASSO) technique is investigated for computing efficient model descriptions of nonlinear aeroelastic systems. The LASSO minimises the residual sum of squares by the addition of an l1 penalty term on the parameter vector of the traditional l2 minimisation problem. Its use for structure detection is a natural extension of this constrained minimisation approach to pseudolinear regression problems which produces some model parameters that are exactly zero and, therefore, yields a parsimonious system description. Applicability of this technique for model structure computation for the F/A-18 Active Aeroelastic Wing using flight test data is shown for several flight conditions (Mach numbers) by identifying a parsimonious system description with a high percent fit for cross-validated data.
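To make concrete how an l1 penalty on a least-squares problem yields parameters that are exactly zero, the sketch below solves a small LASSO problem by cyclic coordinate descent with soft-thresholding. This is a generic textbook solver offered as an assumption-laden illustration; the abstract does not say which solver the study used, and the function names, the objective scaling (0.5 times the residual sum of squares plus lam times the l1 norm) and the toy data are all choices made for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 if |value| <= threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise 0.5 * sum((y - X b)**2) + lam * sum(|b|) by cyclic updates.

    X is a list of rows, y a list of responses. Each coordinate update
    has a closed form given the other coefficients.
    """
    n_rows, n_cols = len(X), len(X[0])
    beta = [0.0] * n_cols
    for _ in range(n_sweeps):
        for j in range(n_cols):
            # Inner product of column j with the residual that omits term j.
            rho = 0.0
            for i in range(n_rows):
                partial = sum(X[i][k] * beta[k] for k in range(n_cols) if k != j)
                rho += X[i][j] * (y[i] - partial)
            col_norm_sq = sum(X[i][j] ** 2 for i in range(n_rows))
            beta[j] = soft_threshold(rho, lam) / col_norm_sq
    return beta


if __name__ == "__main__":
    # Toy design with two orthogonal candidate terms (hypothetical data).
    X = [[1, 0], [0, 1], [1, 0], [0, 1]]
    y = [3.0, 0.2, 3.0, 0.2]
    print(lasso_coordinate_descent(X, y, lam=0.0))  # ordinary least squares
    print(lasso_coordinate_descent(X, y, lam=1.0))  # weak term dropped
```

With the penalty switched off, both candidate terms receive nonzero coefficients; with a moderate penalty, the weakly supported term is set exactly to zero while the strong term is shrunk, which is the behaviour that makes the method usable for structure detection.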
Wu, Haifeng; Sun, Tao; Wang, Jingjing; Li, Xia; Wang, Wei; Huo, Da; Lv, Pingxin; He, Wen; Wang, Keyang; Guo, Xiuhua
2013-08-01
The objective of this study was to investigate the combination of radiological and textural features for the differentiation of malignant from benign solitary pulmonary nodules by computed tomography. Features including 13 gray level co-occurrence matrix textural features and 12 radiological features were extracted from 2,117 CT slices, which came from 202 (116 malignant and 86 benign) patients. Lasso-type regularization of a nonlinear regression model was applied to select predictive features and a BP artificial neural network was used to build the diagnostic model. Eight radiological and two textural features were obtained after the Lasso-type regularization procedure. Twelve radiological features alone could reach an area under the ROC curve (AUC) of 0.84 in differentiating between malignant and benign lesions. The 10 selected features improved the AUC to 0.91. The evaluation results showed that the method of selecting radiological and textural features appears to be more effective in distinguishing malignant from benign solitary pulmonary nodules by computed tomography.
Comparison of methods for the implementation of genome-assisted evaluation of Spanish dairy cattle.
Jiménez-Montero, J A; González-Recio, O; Alenda, R
2013-01-01
The aim of this study was to evaluate methods for genomic evaluation of the Spanish Holstein population as an initial step toward the implementation of routine genomic evaluations. This study provides a description of the population structure of progeny tested bulls in Spain at the genomic level and compares different genomic evaluation methods with regard to accuracy and bias. Two Bayesian linear regression models, Bayes-A and Bayesian-LASSO (B-LASSO), as well as a machine learning algorithm, Random-Boosting (R-Boost), and BLUP using a realized genomic relationship matrix (G-BLUP), were compared. Five traits that are currently under selection in the Spanish Holstein population were used: milk yield, fat yield, protein yield, fat percentage, and udder depth. In total, genotypes from 1859 progeny tested bulls were used. The training sets were composed of bulls born before 2005, including 1601 bulls for production and 1574 bulls for type, whereas the testing sets contained 258 and 235 bulls born in 2005 or later for production and type, respectively. Deregressed proofs (DRP) from the January 2009 Interbull (Uppsala, Sweden) evaluation were used as the dependent variables for bulls in the training sets, whereas DRP from the December 2011 Interbull evaluation were used to compare genomic predictions with progeny test results for bulls in the testing set. Genomic predictions were more accurate than traditional pedigree indices for predicting future progeny test results of young bulls. The gain in accuracy due to inclusion of genomic data varied by trait and ranged from 0.04 to 0.42 Pearson correlation units. Results averaged across traits showed that B-LASSO had the highest accuracy with an advantage of 0.01, 0.03 and 0.03 points in Pearson correlation compared with R-Boost, Bayes-A, and G-BLUP, respectively.
The B-LASSO predictions also showed the least bias (0.02, 0.03 and 0.10 SD units less than Bayes-A, R-Boost and G-BLUP, respectively) as measured by mean difference between genomic predictions and progeny test results. The R-Boosting algorithm provided genomic predictions with regression coefficients closer to unity, which is an alternative measure of bias, for 4 out of 5 traits and also resulted in mean squared error estimates that were 2%, 10%, and 12% smaller than B-LASSO, Bayes-A, and G-BLUP, respectively. The observed prediction accuracy obtained with these methods was within the range of values expected for a population of similar size, suggesting that the prediction method and reference population described herein are appropriate for implementation of routine genome-assisted evaluations in Spanish dairy cattle. R-Boost is a competitive marker regression methodology in terms of predictive ability that can accommodate large data sets. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
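The comparison criteria named in this abstract (Pearson correlation as accuracy, mean difference and regression slope as bias measures, and mean squared error) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `prediction_metrics`, the choice to regress observed values on predictions, and the toy numbers are all assumptions, and the mean difference is returned in raw units rather than the SD units reported above.

```python
import numpy as np

def prediction_metrics(predicted, observed):
    """Summarise agreement between predictions and later observed results.

    Hypothetical helper: names and conventions are assumed for illustration.
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Accuracy: Pearson correlation between predictions and observations.
    accuracy = np.corrcoef(predicted, observed)[0, 1]
    # Bias measure 1: mean difference (prediction minus observation), raw units.
    mean_diff = np.mean(predicted - observed)
    # Bias measure 2: slope of observed regressed on predicted (ideal value 1).
    slope = np.polyfit(predicted, observed, 1)[0]
    # Mean squared error of prediction.
    mse = np.mean((observed - predicted) ** 2)
    return {"accuracy": accuracy, "mean_diff": mean_diff, "slope": slope, "mse": mse}

# Toy data: predictions that are perfectly correlated but shifted by 0.5.
metrics = prediction_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

A slope near 1 with a nonzero mean difference, as in the toy data, indicates predictions that rank individuals correctly but are offset by a constant.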
Gustafson, William Jr; Vogelmann, Andrew; Endo, Satoshi; Toto, Tami; Xiao, Heng; Li, Zhijin; Cheng, Xiaoping; Kim, Jinwon; Krishna, Bhargavi
2015-08-31
The Alpha 2 release is the second release from the LASSO Pilot Phase that builds upon the Alpha 1 release. Alpha 2 contains additional diagnostics in the data bundles and focuses on cases from spring-summer 2016. A data bundle is a unified package consisting of LASSO LES input and output, observations, evaluation diagnostics, and model skill scores. LES input includes model configuration information and forcing data. LES output includes profile statistics and full domain fields of cloud and environmental variables. Model evaluation data consists of LES output and ARM observations co-registered on the same grid and sampling frequency. Model performance is quantified by skill scores and diagnostics in terms of cloud and environmental variables.
Regularization Paths for Conditional Logistic Regression: The clogitL1 Package.
Reid, Stephen; Tibshirani, Rob
2014-07-01
We apply the cyclic coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso (ℓ1) and elastic net penalties. The sequential strong rules of Tibshirani, Bien, Hastie, Friedman, Taylor, Simon, and Tibshirani (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularization paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by.
Regularization Paths for Conditional Logistic Regression: The clogitL1 Package
Reid, Stephen; Tibshirani, Rob
2014-01-01
We apply the cyclic coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso (ℓ1) and elastic net penalties. The sequential strong rules of Tibshirani, Bien, Hastie, Friedman, Taylor, Simon, and Tibshirani (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularization paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by. PMID:26257587
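The idea of cyclic coordinate descent with lasso and elastic net penalties can be sketched for the simpler case of squared-error loss. This is not the clogitL1 implementation and does not handle the conditional logistic likelihood or the strong rules; the function names, the objective scaling and the simulated data are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def elastic_net_cd(X, y, lam, alpha=1.0, n_sweeps=200, beta0=None):
    """Cyclic coordinate descent (illustrative sketch) for
        (1/2n) * ||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2).
    alpha = 1 gives the lasso; beta0 allows a warm start from a previous fit.
    """
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    col_sq = (X ** 2).sum(axis=0) / n      # per-column mean square
    resid = y - X @ beta                   # current full residual
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the partial residual (coordinate j removed).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            resid += X[:, j] * (beta[j] - new)   # keep the residual in sync
            beta[j] = new
    return beta

# Simulated data with two truly nonzero coefficients (positions 0 and 3).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = X @ np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0]) + 0.1 * rng.standard_normal(100)

# Regularisation path from strong to weak penalty, warm-starting each fit.
lam_max = np.max(np.abs(X.T @ y)) / len(y)
beta_hat = np.zeros(X.shape[1])
for lam in lam_max * np.array([2.0, 0.5, 0.1]):
    beta_hat = elastic_net_cd(X, y, lam, beta0=beta_hat)
```

Warm starts reuse the solution at the previous penalty value as the starting point for the next, which is the baseline that the abstract compares the strong rules against.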
NASA Astrophysics Data System (ADS)
Tang, Jie; Liu, Rong; Zhang, Yue-Li; Liu, Mou-Ze; Hu, Yong-Fang; Shao, Ming-Jie; Zhu, Li-Jun; Xin, Hua-Wen; Feng, Gui-Wen; Shang, Wen-Jun; Meng, Xiang-Guang; Zhang, Li-Rong; Ming, Ying-Zi; Zhang, Wei
2017-02-01
Tacrolimus has a narrow therapeutic window and considerable variability in clinical use. Our goal was to compare the performance of multiple linear regression (MLR) and eight machine learning techniques in pharmacogenetic algorithm-based prediction of tacrolimus stable dose (TSD) in a large Chinese cohort. A total of 1,045 renal transplant patients were recruited, 80% of whom were randomly selected as the “derivation cohort” to develop the dose-prediction algorithm, while the remaining 20% constituted the “validation cohort” to test the final selected algorithm. MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied and their performances were compared in this work. Among all the machine learning models, RT performed best in both derivation [0.71 (0.67-0.76)] and validation cohorts [0.73 (0.63-0.82)]. In addition, the ideal rate of RT was 4% higher than that of MLR. To our knowledge, this is the first study to use machine learning models to predict TSD, which will further facilitate personalized medicine in tacrolimus administration in the future.
Probability genotype imputation method and integrated weighted lasso for QTL identification.
Demetrashvili, Nino; Van den Heuvel, Edwin R; Wit, Ernst C
2013-12-30
Many QTL studies have two common features: (1) often there is missing marker information, (2) among many markers involved in the biological process only a few are causal. In statistics, the second issue falls under the headings "sparsity" and "causal inference". The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. Outcomes of the proposed imputation method are probabilities which serve as weights in the second step, namely the weighted lasso. The sparse phenotype inference is employed to select a set of predictive markers for the trait of interest. Simulation studies validate the proposed methodology under a wide range of realistic settings. Furthermore, the methodology outperforms alternative imputation and variable selection methods in such studies. The methodology was applied to an Arabidopsis experiment, containing 69 markers for 165 recombinant inbred lines of an F8 generation. The results confirm previously identified regions; however, several new markers are also found. On the basis of the inferred ROC behavior these markers show good potential for being real, especially for the germination trait Gmax. Our imputation method shows higher accuracy in terms of sensitivity and specificity compared to alternative imputation methods. Also, the proposed weighted lasso outperforms commonly practiced multiple regression as well as the traditional lasso and adaptive lasso with three weighting schemes. This means that under realistic missing data settings this methodology can be used for QTL identification.
LES ARM Symbiotic Simulation and Observation (LASSO) Implementation Strategy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gustafson Jr., WI; Vogelmann, AM
2015-09-01
This document illustrates the design of the Large-Eddy Simulation (LES) ARM Symbiotic Simulation and Observation (LASSO) workflow to provide a routine, high-resolution modeling capability to augment the U.S. Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Climate Research Facility’s high-density observations. LASSO will create a powerful new capability for furthering ARM’s mission to advance understanding of cloud, radiation, aerosol, and land-surface processes. The combined observational and modeling elements will enable a new level of scientific inquiry by connecting processes and context to observations and providing needed statistics for details that cannot be measured. The result will be improved process understanding that facilitates concomitant improvements in climate model parameterizations. The initial LASSO implementation will be for ARM’s Southern Great Plains site in Oklahoma and will focus on shallow convection, which is poorly simulated by climate models due in part to clouds’ typically small spatial scale compared to model grid spacing, and because the convection involves complicated interactions of microphysical and boundary layer processes.
Regularized rare variant enrichment analysis for case-control exome sequencing data.
Larson, Nicholas B; Schaid, Daniel J
2014-02-01
Rare variants have recently garnered an immense amount of attention in genetic association analysis. However, unlike methods traditionally used for single marker analysis in GWAS, rare variant analysis often requires some method of aggregation, since single marker approaches are poorly powered for typical sequencing study sample sizes. Advancements in sequencing technologies have rendered next-generation sequencing platforms a realistic alternative to traditional genotyping arrays. Exome sequencing in particular not only provides base-level resolution of genetic coding regions, but also a natural paradigm for aggregation via genes and exons. Here, we propose the use of penalized regression in combination with variant aggregation measures to identify rare variant enrichment in exome sequencing data. In contrast to marginal gene-level testing, we simultaneously evaluate the effects of rare variants in multiple genes, focusing on gene-based least absolute shrinkage and selection operator (LASSO) and exon-based sparse group LASSO models. By using gene membership as a grouping variable, the sparse group LASSO can be used as a gene-centric analysis of rare variants while also providing a penalized approach toward identifying specific regions of interest. We apply extensive simulations to evaluate the performance of these approaches with respect to specificity and sensitivity, comparing these results to multiple competing marginal testing methods. Finally, we discuss our findings and outline future research. © 2013 WILEY PERIODICALS, INC.
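For orientation, one common way of writing a sparse group LASSO criterion is given below. It is a sketch under assumed notation, and the exact parameterisation used by the authors may differ: ℓ(β) denotes the logistic log-likelihood for case-control status, β^(g) the coefficients of the aggregated variant measures belonging to gene g, p_g the size of that group, and λ and α tuning parameters.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  -\frac{1}{n}\,\ell(\beta)
  + (1-\alpha)\,\lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta^{(g)} \rVert_2
  + \alpha\,\lambda\,\lVert \beta \rVert_1 ,
\qquad 0 \le \alpha \le 1 .
```

With α = 1 the criterion reduces to the ordinary LASSO, and with α = 0 to the group LASSO; intermediate values allow whole genes to be dropped while still selecting individual regions within the genes that remain.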
Wen, Ye; Pi, Fu-Hua; Guo, Pi; Dong, Wen-Ya; Xie, Yu-Qing; Wang, Xiang-Yu; Xia, Fang-Fang; Pang, Shao-Jie; Wu, Yan-Chun; Wang, Yuan-Yuan; Zhang, Qing-Ying
2016-01-01
Sleep habits are associated with stroke in western populations, but this relation has been rarely investigated in China. Moreover, the differences among stroke subtypes remain unclear. This study aimed to explore the associations of total stroke, including ischemic and hemorrhagic type, with sleep habits of a population in southern China. We performed a case-control study in patients admitted to the hospital with first stroke and community control subjects. A total of 333 patients (n = 223, 67.0%, with ischemic stroke; n = 110, 33.0%, with hemorrhagic stroke) and 547 controls were enrolled in the study. Participants completed a structured questionnaire to identify sleep habits and other stroke risk factors. Least absolute shrinkage and selection operator (Lasso) and multiple logistic regression were performed to identify risk factors of disease. Incidence of stroke, and its subtypes, was significantly associated with snorting/gasping, snoring, sleep duration, and daytime napping. Snorting/gasping was identified as an important risk factor in the Lasso logistic regression model (Lasso β = 0.84), and the result was proven to be robust. This study showed the association between stroke and sleep habits in the southern Chinese population and might help in better detecting important sleep-related factors for stroke risk. PMID:27698374
Zheng, Qi; Peng, Limin
2016-01-01
Quantile regression provides a flexible platform for evaluating covariate effects on different segments of the conditional distribution of response. As the effects of covariates may change with quantile level, contemporaneously examining a spectrum of quantiles is expected to have a better capacity to identify variables with either partial or full effects on the response distribution, as compared to focusing on a single quantile. Under this motivation, we study a general adaptively weighted LASSO penalization strategy in the quantile regression setting, where a continuum of quantile indices is considered and coefficients are allowed to vary with quantile index. We establish the oracle properties of the resulting estimator of the coefficient function. Furthermore, we formally investigate a BIC-type uniform tuning parameter selector and show that it can ensure consistent model selection. Our numerical studies confirm the theoretical findings and illustrate an application of the new variable selection procedure. PMID:28008212
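As a simplified illustration, the adaptively weighted LASSO at a single quantile level τ can be written as below. This is a sketch under assumed notation, not the authors' estimator, which treats a continuum of τ values and coefficient functions β(τ); here ρ_τ is the quantile check loss, w_j are data-driven weights and λ is the tuning parameter.

```latex
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr), \qquad
\hat{\beta}(\tau) = \arg\min_{\beta}\;
  \sum_{i=1}^{n} \rho_\tau\!\bigl(y_i - x_i^{\top}\beta\bigr)
  + \lambda \sum_{j=1}^{p} w_j\,\lvert \beta_j \rvert .
```

A common choice in the adaptive LASSO literature is w_j = 1/|β̃_j| for some initial estimate β̃, so that covariates with small initial effects are penalised more heavily.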
A computational model for biosonar echoes from foliage
Gupta, Anupam Kumar; Lu, Ruijin; Zhu, Hongxiao
2017-01-01
Since many bat species thrive in densely vegetated habitats, echoes from foliage are likely to be of prime importance to the animals’ sensory ecology, be it as clutter that masks prey echoes or as sources of information about the environment. To better understand the characteristics of foliage echoes, a new model for the process that generates these signals has been developed. This model takes leaf size and orientation into account by representing the leaves as circular disks of varying diameter. The two added leaf parameters are of potential importance to the sensory ecology of bats, e.g., with respect to landmark recognition and flight guidance along vegetation contours. The full model is specified by a total of three parameters: leaf density, average leaf size, and average leaf orientation. It assumes that all leaf parameters are independently and identically distributed. Leaf positions were drawn from a uniform probability density function, sizes and orientations each from a Gaussian probability function. The model was found to reproduce the first-order amplitude statistics of measured example echoes and showed time-variant echo properties that depended on foliage parameters. Parameter estimation experiments using lasso regression have demonstrated that a single foliage parameter can be estimated with high accuracy if the other two parameters are known a priori. If only one parameter is known a priori, the other two can still be estimated, but with a reduced accuracy. Lasso regression did not support simultaneous estimation of all three parameters. Nevertheless, these results demonstrate that foliage echoes contain accessible information on foliage type and orientation that could play a role in supporting sensory tasks such as landmark identification and contour following in echolocating bats. PMID:28817631
A computational model for biosonar echoes from foliage.
Ming, Chen; Gupta, Anupam Kumar; Lu, Ruijin; Zhu, Hongxiao; Müller, Rolf
2017-01-01
Since many bat species thrive in densely vegetated habitats, echoes from foliage are likely to be of prime importance to the animals' sensory ecology, be it as clutter that masks prey echoes or as sources of information about the environment. To better understand the characteristics of foliage echoes, a new model for the process that generates these signals has been developed. This model takes leaf size and orientation into account by representing the leaves as circular disks of varying diameter. The two added leaf parameters are of potential importance to the sensory ecology of bats, e.g., with respect to landmark recognition and flight guidance along vegetation contours. The full model is specified by a total of three parameters: leaf density, average leaf size, and average leaf orientation. It assumes that all leaf parameters are independently and identically distributed. Leaf positions were drawn from a uniform probability density function, sizes and orientations each from a Gaussian probability function. The model was found to reproduce the first-order amplitude statistics of measured example echoes and showed time-variant echo properties that depended on foliage parameters. Parameter estimation experiments using lasso regression have demonstrated that a single foliage parameter can be estimated with high accuracy if the other two parameters are known a priori. If only one parameter is known a priori, the other two can still be estimated, but with a reduced accuracy. Lasso regression did not support simultaneous estimation of all three parameters. Nevertheless, these results demonstrate that foliage echoes contain accessible information on foliage type and orientation that could play a role in supporting sensory tasks such as landmark identification and contour following in echolocating bats.
Erdem, Cemal; Nagle, Alison M; Casa, Angelo J; Litzenburger, Beate C; Wang, Yu-Fen; Taylor, D Lansing; Lee, Adrian V; Lezon, Timothy R
2016-09-01
Insulin and insulin-like growth factor I (IGF1) influence cancer risk and progression through poorly understood mechanisms. To better understand the roles of insulin and IGF1 signaling in breast cancer, we combined proteomic screening with computational network inference to uncover differences in IGF1 and insulin induced signaling. Using reverse phase protein array, we measured the levels of 134 proteins in 21 breast cancer cell lines stimulated with IGF1 or insulin for up to 48 h. We then constructed directed protein expression networks using three separate methods: (i) lasso regression, (ii) conventional matrix inversion, and (iii) entropy maximization. These networks, named here as the time translation models, were analyzed and the inferred interactions were ranked by differential magnitude to identify pathway differences. The two top candidates, chosen for experimental validation, were shown to regulate IGF1/insulin induced phosphorylation events. First, acetyl-CoA carboxylase (ACC) knock-down was shown to increase the level of mitogen-activated protein kinase (MAPK) phosphorylation. Second, stable knock-down of E-Cadherin increased the phospho-Akt protein levels. Both of the knock-down perturbations incurred phosphorylation responses stronger in IGF1 stimulated cells compared with insulin. Overall, the time-translation modeling coupled to wet-lab experiments has proven to be powerful in inferring differential interactions downstream of IGF1 and insulin signaling, in vitro. © 2016 by The American Society for Biochemistry and Molecular Biology, Inc.
Aeroelastic Model Structure Computation for Envelope Expansion
NASA Technical Reports Server (NTRS)
Kukreja, Sunil L.
2007-01-01
Structure detection is a procedure for selecting a subset of candidate terms, from a full model description, that best describes the observed output. This is a necessary procedure to compute an efficient system description which may afford greater insight into the functionality of the system or a simpler controller design. Structure computation as a tool for black-box modeling may be of critical importance in the development of robust, parsimonious models for the flight-test community. Moreover, this approach may lead to efficient strategies for rapid envelope expansion that may save significant development time and costs. In this study, a least absolute shrinkage and selection operator (LASSO) technique is investigated for computing efficient model descriptions of non-linear aeroelastic systems. The LASSO minimises the residual sum of squares with the addition of an l(sub 1) penalty term on the parameter vector of the traditional l(sub 2) minimisation problem. Its use for structure detection is a natural extension of this constrained minimisation approach to pseudo-linear regression problems which produces some model parameters that are exactly zero and, therefore, yields a parsimonious system description. Applicability of this technique for model structure computation for the F/A-18 (McDonnell Douglas, now The Boeing Company, Chicago, Illinois) Active Aeroelastic Wing project using flight test data is shown for several flight conditions (Mach numbers) by identifying a parsimonious system description with a high percent fit for cross-validated data.
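The criterion described in this abstract can be written compactly in penalised form. The symbols are notation assumed here for illustration and are not taken from the report: y is the measured output, Φ the matrix of candidate regressor terms, θ the parameter vector and λ the penalty weight.

```latex
\hat{\theta}_{\mathrm{LASSO}}
  = \arg\min_{\theta}\; \lVert y - \Phi\theta \rVert_2^2 + \lambda \lVert \theta \rVert_1 ,
\qquad
\lVert \theta \rVert_1 = \sum_{j=1}^{p} \lvert \theta_j \rvert , \quad \lambda \ge 0 .
```

Setting λ = 0 recovers ordinary least squares, while larger values of λ drive more entries of θ to exactly zero, which is the property that makes the criterion usable for structure detection.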
NASA Astrophysics Data System (ADS)
Boucher, Thomas F.; Ozanne, Marie V.; Carmosino, Marco L.; Dyar, M. Darby; Mahadevan, Sridhar; Breves, Elly A.; Lepore, Kate H.; Clegg, Samuel M.
2015-05-01
The ChemCam instrument on the Mars Curiosity rover is generating thousands of LIBS spectra and bringing interest in this technique to public attention. The key to interpreting Mars or any other types of LIBS data are calibrations that relate laboratory standards to unknowns examined in other settings and enable predictions of chemical composition. Here, LIBS spectral data are analyzed using linear regression methods including partial least squares (PLS-1 and PLS-2), principal component regression (PCR), least absolute shrinkage and selection operator (lasso), elastic net, and linear support vector regression (SVR-Lin). These were compared against results from nonlinear regression methods including kernel principal component regression (K-PCR), polynomial kernel support vector regression (SVR-Py) and k-nearest neighbor (kNN) regression to discern the most effective models for interpreting chemical abundances from LIBS spectra of geological samples. The results were evaluated for 100 samples analyzed with 50 laser pulses at each of five locations averaged together. Wilcoxon signed-rank tests were employed to evaluate the statistical significance of differences among the nine models using their predicted residual sum of squares (PRESS) to make comparisons. For MgO, SiO2, Fe2O3, CaO, and MnO, the sparse models outperform all the others except for linear SVR, while for Na2O, K2O, TiO2, and P2O5, the sparse methods produce inferior results, likely because their emission lines in this energy range have lower transition probabilities. The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of the linear models. Nonlinear methods tend to overfit the data and predict less accurately, while the linear methods proved to be more generalizable with better predictive performance. 
These results are attributed to the high dimensionality of the data (6144 channels) relative to the small number of samples studied. The best-performing models were SVR-Lin for SiO2, MgO, Fe2O3, and Na2O, lasso for Al2O3, elastic net for MnO, and PLS-1 for CaO, TiO2, and K2O. Although these differences in model performance between methods were identified, most of the models produce comparable results when p ≤ 0.05 and all techniques except kNN produced statistically-indistinguishable results. It is likely that a combination of models could be used together to yield a lower total error of prediction, depending on the requirements of the user.
Paz-Linares, Deirel; Vega-Hernández, Mayrim; Rojas-López, Pedro A.; Valdés-Hernández, Pedro A.; Martínez-Montes, Eduardo; Valdés-Sosa, Pedro A.
2017-01-01
The estimation of EEG generating sources constitutes an Inverse Problem (IP) in Neuroscience. This is an ill-posed problem due to the non-uniqueness of the solution and regularization or prior information is needed to undertake Electrophysiology Source Imaging. Structured Sparsity priors can be attained through combinations of (L1 norm-based) and (L2 norm-based) constraints such as the Elastic Net (ENET) and Elitist Lasso (ELASSO) models. The former model is used to find solutions with a small number of smooth nonzero patches, while the latter imposes different degrees of sparsity simultaneously along different dimensions of the spatio-temporal matrix solutions. Both models have been addressed within the penalized regression approach, where the regularization parameters are selected heuristically, usually leading to non-optimal and computationally expensive solutions. The existing Bayesian formulation of ENET allows hyperparameter learning, but using the computationally intensive Monte Carlo/Expectation Maximization methods, which makes its application to the EEG IP impractical, while the ELASSO has not previously been considered in the Bayesian context. In this work, we attempt to solve the EEG IP using a Bayesian framework for ENET and ELASSO models. We propose a Structured Sparse Bayesian Learning algorithm based on combining the Empirical Bayes and the iterative coordinate descent procedures to estimate both the parameters and hyperparameters. Using realistic simulations and avoiding the inverse crime we illustrate that our methods are able to recover complicated source setups more accurately and with a more robust estimation of the hyperparameters and behavior under different sparsity scenarios than classical LORETA, ENET and LASSO Fusion solutions. We also solve the EEG IP using data from a visual attention experiment, finding more interpretable neurophysiological patterns with our methods.
The Matlab codes used in this work, including Simulations, Methods, Quality Measures and Visualization Routines are freely available on a public website. PMID:29200994
Paz-Linares, Deirel; Vega-Hernández, Mayrim; Rojas-López, Pedro A; Valdés-Hernández, Pedro A; Martínez-Montes, Eduardo; Valdés-Sosa, Pedro A
2017-01-01
The estimation of EEG generating sources constitutes an Inverse Problem (IP) in Neuroscience. This is an ill-posed problem due to the non-uniqueness of the solution and regularization or prior information is needed to undertake Electrophysiology Source Imaging. Structured Sparsity priors can be attained through combinations of (L1 norm-based) and (L2 norm-based) constraints such as the Elastic Net (ENET) and Elitist Lasso (ELASSO) models. The former model is used to find solutions with a small number of smooth nonzero patches, while the latter imposes different degrees of sparsity simultaneously along different dimensions of the spatio-temporal matrix solutions. Both models have been addressed within the penalized regression approach, where the regularization parameters are selected heuristically, usually leading to non-optimal and computationally expensive solutions. The existing Bayesian formulation of ENET allows hyperparameter learning, but using the computationally intensive Monte Carlo/Expectation Maximization methods, which makes its application to the EEG IP impractical, while the ELASSO has not previously been considered in the Bayesian context. In this work, we attempt to solve the EEG IP using a Bayesian framework for ENET and ELASSO models. We propose a Structured Sparse Bayesian Learning algorithm based on combining the Empirical Bayes and the iterative coordinate descent procedures to estimate both the parameters and hyperparameters. Using realistic simulations and avoiding the inverse crime we illustrate that our methods are able to recover complicated source setups more accurately and with a more robust estimation of the hyperparameters and behavior under different sparsity scenarios than classical LORETA, ENET and LASSO Fusion solutions. We also solve the EEG IP using data from a visual attention experiment, finding more interpretable neurophysiological patterns with our methods.
The Matlab codes used in this work, including Simulations, Methods, Quality Measures and Visualization Routines are freely available on a public website.
Abraham, Gad; Kowalczyk, Adam; Zobel, Justin; Inouye, Michael
2013-02-01
A central goal of medical genetics is to accurately predict complex disease from genotypes. Here, we present a comprehensive analysis of simulated and real data using lasso and elastic-net penalized support-vector machine models, a mixed-effects linear model, a polygenic score, and unpenalized logistic regression. In simulation, the sparse penalized models achieved lower false-positive rates and higher precision than the other methods for detecting causal SNPs. The common practice of prefiltering SNP lists for subsequent penalized modeling was examined and shown to substantially reduce the ability to recover the causal SNPs. Using genome-wide SNP profiles across eight complex diseases within cross-validation, lasso and elastic-net models achieved substantially better predictive ability in celiac disease, type 1 diabetes, and Crohn's disease, and had equivalent predictive ability in the rest, with the results in celiac disease strongly replicating between independent datasets. We investigated the effect of linkage disequilibrium on the predictive models, showing that the penalized methods leverage this information to their advantage, compared with methods that assume SNP independence. Our findings show that sparse penalized approaches are robust across different disease architectures, producing as good as or better phenotype predictions and variance explained. This has fundamental ramifications for the selection and future development of methods to genetically predict human disease. © 2012 WILEY PERIODICALS, INC.
Evaluation of digital soil mapping approaches with large sets of environmental covariates
NASA Astrophysics Data System (ADS)
Nussbaum, Madlene; Spiess, Kay; Baltensweiler, Andri; Grob, Urs; Keller, Armin; Greiner, Lucie; Schaepman, Michael E.; Papritz, Andreas
2018-01-01
The spatial assessment of soil functions requires maps of basic soil properties. Unfortunately, these are either missing for many regions or are not available at the desired spatial resolution or down to the required soil depth. The field-based generation of large soil datasets and conventional soil maps remains costly. Meanwhile, legacy soil data and comprehensive sets of spatial environmental data are available for many regions. Digital soil mapping (DSM) approaches relating soil data (responses) to environmental data (covariates) face the challenge of building statistical models from large sets of covariates originating, for example, from airborne imaging spectroscopy or multi-scale terrain analysis. We evaluated six approaches for DSM in three study regions in Switzerland (Berne, Greifensee, ZH forest) by mapping the effective soil depth available to plants (SD), pH, soil organic matter (SOM), effective cation exchange capacity (ECEC), clay, silt, gravel content and fine fraction bulk density for four soil depths (totalling 48 responses). Models were built from 300-500 environmental covariates by selecting linear models through (1) grouped lasso and (2) an ad hoc stepwise procedure for robust external-drift kriging (georob). For (3) geoadditive models we selected penalized smoothing spline terms by component-wise gradient boosting (geoGAM). We further used two tree-based methods: (4) boosted regression trees (BRTs) and (5) random forest (RF). Lastly, we computed (6) weighted model averages (MAs) from the predictions obtained from methods 1-5. Lasso, georob and geoGAM successfully selected strongly reduced sets of covariates (subsets of 3-6 % of all covariates). Differences in predictive performance, tested on independent validation data, were mostly small and did not reveal a single best method for 48 responses. Nevertheless, RF was often the best among methods 1-5 (28 of 48 responses), but was outcompeted by MA for 14 of these 28 responses. 
RF tended to over-fit the data. The performance of BRT was slightly worse than RF. GeoGAM performed poorly on some responses and was the best only for 7 of 48 responses. The prediction accuracy of lasso was intermediate. All models generally had small bias. Only the computationally very efficient lasso had slightly larger bias because it tended to under-fit the data. Summarizing, although differences were small, the frequencies of the best and worst performance clearly favoured RF if a single method is applied and MA if multiple prediction models can be developed.
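The weighted model averaging step (method 6) can be sketched as follows. The abstract does not say how the weights were chosen, so the inverse-MSE weighting here is an assumption, and the function names and numbers are hypothetical.

```python
def inverse_mse_weights(mse_by_model):
    """Weights proportional to 1/MSE, normalised to sum to one (assumed scheme)."""
    inverse = {name: 1.0 / mse for name, mse in mse_by_model.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def weighted_average(predictions, weights):
    """Combine per-model prediction lists into one list of averaged predictions."""
    n_points = len(next(iter(predictions.values())))
    return [
        sum(weights[model] * predictions[model][i] for model in predictions)
        for i in range(n_points)
    ]


# Hypothetical validation errors and predictions for two of the five methods.
weights = inverse_mse_weights({"lasso": 1.0, "rf": 1.0})
averaged = weighted_average({"lasso": [1.0, 2.0], "rf": [3.0, 4.0]}, weights)
```

Models with lower validation error receive more weight under this scheme; with equal errors it reduces to a plain mean.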
Wang, Maggie Haitian; Chong, Ka Chun; Storer, Malina; Pickering, John W; Endre, Zoltan H; Lau, Steven Yf; Kwok, Chloe; Lai, Maria; Chung, Hau Yin; Ying Zee, Benny Chung
2016-09-28
Selected ion flow tube-mass spectrometry (SIFT-MS) provides rapid, non-invasive measurements of a full-mass scan of volatile compounds in exhaled breath. Although various studies have suggested that breath metabolites may be indicators of human disease status, many of these studies have included few breath samples and large numbers of compounds, limiting their power to detect significant metabolites. This study employed a least absolute shrinkage and selection operator (LASSO) approach to SIFT-MS data of breath samples to preliminarily evaluate the ability of exhaled breath findings to monitor the efficacy of dialysis in hemodialysis patients. A process of model building and validation showed that blood creatinine and urea concentrations could be accurately predicted by LASSO-selected masses. Using various precursors, the LASSO models were able to predict creatinine and urea concentrations with high adjusted R-square (>80%) values. The correlation between actual concentrations and concentrations predicted by the LASSO model (using precursor H3O+) was high (Pearson correlation coefficient = 0.96). Moreover, use of full mass scan data provided a better prediction than compounds from selected ion mode. These findings warrant further investigations in larger patient cohorts. By employing a more powerful statistical approach to predict disease outcomes, breath analysis using SIFT-MS technology could be applicable in future to daily medical diagnoses.
A practical approach to Sasang constitutional diagnosis using vocal features
2013-01-01
Background Sasang constitutional medicine (SCM) is a type of tailored medicine that divides human beings into four Sasang constitutional (SC) types. Diagnosis of SC types is crucial to proper treatment in SCM. Voice characteristics have been used as an essential clue for diagnosing SC types. In the past, many studies tried to extract quantitative vocal features to make diagnosis models; however, these studies were flawed by limited data collected from one or a few sites, long recording time, and low accuracy. We propose a practical diagnosis model having only a few variables, which decreases model complexity. This in turn, makes our model appropriate for clinical applications. Methods A total of 2,341 participants’ voice recordings were used in making a SC classification model and to test the generalization ability of the model. Although the voice data consisted of five vowels and two repeated sentences per participant, we used only the sentence part for our study. A total of 21 features were extracted, and an advanced feature selection method—the least absolute shrinkage and selection operator (LASSO)—was applied to reduce the number of variables for classifier learning. A SC classification model was developed using multinomial logistic regression via LASSO. Results We compared the proposed classification model to the previous study, which used both sentences and five vowels from the same patient’s group. The classification accuracies for the test set were 47.9% and 40.4% for male and female, respectively. Our result showed that the proposed method was superior to the previous study in that it required shorter voice recordings, is more applicable to practical use, and had better generalization performance. Conclusions We proposed a practical SC classification method and showed that our model having fewer variables outperformed the model having many variables in the generalization test. 
We attempted to reduce the number of variables in two ways: 1) the initial number of candidate features was decreased by considering shorter voice recording, and 2) LASSO was introduced for reducing model complexity. The proposed method is suitable for an actual clinical environment. Moreover, we expect it to yield more stable results because of the model’s simplicity. PMID:24200041
TU-CD-BRB-01: Normal Lung CT Texture Features Improve Predictive Models for Radiation Pneumonitis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krafft, S; The University of Texas Graduate School of Biomedical Sciences, Houston, TX; Briere, T
2015-06-15
Purpose: Existing normal tissue complication probability (NTCP) models for radiation pneumonitis (RP) traditionally rely on dosimetric and clinical data but are limited in terms of performance and generalizability. Extraction of pre-treatment image features provides a potential new category of data that can improve NTCP models for RP. We consider quantitative measures of total lung CT intensity and texture in a framework for prediction of RP. Methods: Available clinical and dosimetric data was collected for 198 NSCLC patients treated with definitive radiotherapy. Intensity- and texture-based image features were extracted from the T50 phase of the 4D-CT acquired for treatment planning. A total of 3888 features (15 clinical, 175 dosimetric, and 3698 image features) were gathered and considered candidate predictors for modeling of RP grade≥3. A baseline logistic regression model with mean lung dose (MLD) was first considered. Additionally, a least absolute shrinkage and selection operator (LASSO) logistic regression was applied to the set of clinical and dosimetric features, and subsequently to the full set of clinical, dosimetric, and image features. Model performance was assessed by comparing area under the curve (AUC). Results: A simple logistic fit of MLD was an inadequate model of the data (AUC∼0.5). Including clinical and dosimetric parameters within the framework of the LASSO resulted in improved performance (AUC=0.648). Analysis of the full cohort of clinical, dosimetric, and image features provided further and significant improvement in model performance (AUC=0.727). Conclusions: To achieve significant gains in predictive modeling of RP, new categories of data should be considered in addition to clinical and dosimetric features. We have successfully incorporated CT image features into a framework for modeling RP and have demonstrated improved predictive performance. 
Validation and further investigation of CT image features in the context of RP NTCP modeling is warranted. This work was supported by the Rosalie B. Hite Fellowship in Cancer research awarded to SPK.
A new genome-mining tool redefines the lasso peptide biosynthetic landscape
Tietz, Jonathan I.; Schwalen, Christopher J.; Patel, Parth S.; Maxson, Tucker; Blair, Patricia M.; Tai, Hua-Chia; Zakai, Uzma I.; Mitchell, Douglas A.
2016-01-01
Ribosomally synthesized and post-translationally modified peptide (RiPP) natural products are attractive for genome-driven discovery and re-engineering, but limitations in bioinformatic methods and exponentially increasing genomic data make large-scale mining difficult. We report RODEO (Rapid ORF Description and Evaluation Online), which combines hidden Markov model-based analysis, heuristic scoring, and machine learning to identify biosynthetic gene clusters and predict RiPP precursor peptides. We initially focused on lasso peptides, which display intriguing physicochemical properties and bioactivities, but their hypervariability renders them challenging prospects for automated mining. Our approach yielded the most comprehensive mapping of lasso peptide space, revealing >1,300 compounds. We characterized the structures and bioactivities of six lasso peptides, prioritized based on predicted structural novelty, including an unprecedented handcuff-like topology and another with a citrulline modification exceptionally rare among bacteria. These combined insights significantly expand the knowledge of lasso peptides, and more broadly, provide a framework for future genome-mining efforts. PMID:28244986
Statistical validation of normal tissue complication probability models.
Xu, Cheng-Jian; van der Schaaf, Arjen; Van't Veld, Aart A; Langendijk, Johannes A; Schilstra, Cornelis
2012-09-01
To investigate the applicability and value of double cross-validation and permutation tests as established statistical approaches in the validation of normal tissue complication probability (NTCP) models. A penalized regression method, LASSO (least absolute shrinkage and selection operator), was used to build NTCP models for xerostomia after radiation therapy treatment of head-and-neck cancer. Model assessment was based on the likelihood function and the area under the receiver operating characteristic curve. Repeated double cross-validation showed the uncertainty and instability of the NTCP models and indicated that the statistical significance of model performance can be obtained by permutation testing. Repeated double cross-validation and permutation tests are recommended to validate NTCP models before clinical use. Copyright © 2012 Elsevier Inc. All rights reserved.
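A permutation test of model performance, as recommended above, can be sketched in a few lines. This is a simplified illustration and not the authors' procedure: it permutes the outcome labels against a fixed set of predicted scores, whereas a full analysis would refit the LASSO model inside each permutation. All names and data are hypothetical.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def permutation_p_value(labels, scores, n_perm=1000, seed=0):
    """Share of label permutations whose AUC reaches the observed AUC."""
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical complication labels (1 = event) and model scores.
observed_auc, p_value = permutation_p_value(
    [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], n_perm=200, seed=1
)
```

A small p-value indicates that the observed discrimination is unlikely to arise when the labels carry no information about the scores.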
Potential application of machine learning in health outcomes research and some statistical cautions.
Crown, William H
2015-03-01
Traditional analytic methods are often ill-suited to the evolving world of health care big data characterized by massive volume, complexity, and velocity. In particular, methods are needed that can estimate models efficiently using very large datasets containing healthcare utilization data, clinical data, data from personal devices, and many other sources. Although very large, such datasets can also be quite sparse (e.g., device data may only be available for a small subset of individuals), which creates problems for traditional regression models. Many machine learning methods address such limitations effectively but are still subject to the usual sources of bias that commonly arise in observational studies. Researchers using machine learning methods such as lasso or ridge regression should assess these models using conventional specification tests. Copyright © 2015 International Society for Pharmacoeconomics and Outcomes Research (ISPOR). Published by Elsevier Inc. All rights reserved.
Lien, Tonje G; Borgan, Ørnulf; Reppe, Sjur; Gautvik, Kaare; Glad, Ingrid Kristine
2018-03-07
Using high-dimensional penalized regression we studied genome-wide DNA-methylation in bone biopsies of 80 postmenopausal women in relation to their bone mineral density (BMD). The women showed BMD varying from severely osteoporotic to normal. Global gene expression data from the same individuals was available, and since DNA-methylation often affects gene expression, the overall aim of this paper was to include both of these omics data sets into an integrated analysis. The classical penalized regression uses one penalty, but we incorporated individual penalties for each of the DNA-methylation sites. These individual penalties were guided by the strength of association between DNA-methylations and gene transcript levels. DNA-methylations that were highly associated to one or more transcripts got lower penalties and were therefore favored compared to DNA-methylations showing less association to expression. Because of the complex pathways and interactions among genes, we investigated both the association between DNA-methylations and their corresponding cis gene, as well as the association between DNA-methylations and trans-located genes. Two integrating penalized methods were used: first, an adaptive group-regularized ridge regression, and secondly, variable selection was performed through a modified version of the weighted lasso. When information from gene expressions was integrated, predictive performance was considerably improved, in terms of predictive mean square error, compared to classical penalized regression without data integration. We found a 14.7% improvement in the ridge regression case and a 17% improvement for the lasso case. Our version of the weighted lasso with data integration found a list of 22 interesting methylation sites. Several corresponded to genes that are known to be important in bone formation. 
Using BMD as response and these 22 methylation sites as covariates, least squares regression analyses resulted in R² = 0.726, compared with an average R² = 0.438 for 10000 randomly selected groups of DNA-methylations with group size 22. Two recent types of penalized regression methods were adapted to integrate DNA-methylation and their association to gene expression in the analysis of bone mineral density. In both cases predictions clearly benefit from including the additional information on gene expressions.
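The idea of individual penalties can be illustrated with a generic weighted lasso solved by coordinate descent. This is a minimal sketch of the standard technique, not the authors' modified version. It minimises (1/2n)·||y − Xβ||² + λ·Σ w_j·|β_j|, and the weights, data, and function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for the lasso with a per-feature penalty lam * weights[j]."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam * weights[j]) / (z / n)
    return beta


# Hypothetical data: feature 0 gets a low penalty weight, feature 1 a high one.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [1.0, 0.5, -1.0, -0.5]
beta = weighted_lasso(X, y, lam=0.1, weights=[1.0, 3.0])
```

In this toy example the heavily penalised feature is dropped while the lightly penalised one is kept. This mirrors how methylation sites strongly associated with transcripts are favoured in the abstract.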
Alpha1 LASSO data bundles Lamont, OK
Gustafson, William Jr; Vogelmann, Andrew; Endo, Satoshi; Toto, Tami; Xiao, Heng; Li, Zhijin; Cheng, Xiaoping; Krishna, Bhargavi (ORCID:000000018828528X)
2016-08-03
A data bundle is a unified package consisting of LASSO LES input and output, observations, evaluation diagnostics, and model skill scores. LES input includes model configuration information and forcing data. LES output includes profile statistics and full domain fields of cloud and environmental variables. Model evaluation data consists of LES output and ARM observations co-registered on the same grid and sampling frequency. Model performance is quantified by skill scores and diagnostics in terms of cloud and environmental variables.
Covariate Selection for Multilevel Models with Missing Data
Marino, Miguel; Buxton, Orfeu M.; Li, Yi
2017-01-01
Missing covariate data hampers variable selection in multilevel regression settings. Current variable selection techniques for multiply-imputed data commonly address missingness in the predictors through list-wise deletion and stepwise-selection methods which are problematic. Moreover, most variable selection methods are developed for independent linear regression models and do not accommodate multilevel mixed effects regression models with incomplete covariate data. We develop a novel methodology that is able to perform covariate selection across multiply-imputed data for multilevel random effects models when missing data is present. Specifically, we propose to stack the multiply-imputed data sets from a multiple imputation procedure and to apply a group variable selection procedure through group lasso regularization to assess the overall impact of each predictor on the outcome across the imputed data sets. Simulations confirm the advantageous performance of the proposed method compared with the competing methods. We applied the method to reanalyze the Healthy Directions-Small Business cancer prevention study, which evaluated a behavioral intervention program targeting multiple risk-related behaviors in a working-class, multi-ethnic population. PMID:28239457
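The grouping idea behind the stacked approach can be sketched with the group-lasso proximal step. The sketch assumes that the coefficients of one predictor across the imputed data sets form a single group, so the predictor is kept or dropped in all of them at once. The function name and numbers are hypothetical and this is not the authors' implementation.

```python
import math


def group_soft_threshold(coefs, t):
    """Shrink a coefficient group by t in Euclidean norm; zero it if the norm <= t."""
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= t:
        return [0.0] * len(coefs)
    scale = 1.0 - t / norm
    return [scale * c for c in coefs]


# Hypothetical coefficients of two predictors in two imputed data sets.
strong_predictor = group_soft_threshold([3.0, 4.0], 1.0)  # shrunk but kept
weak_predictor = group_soft_threshold([0.3, 0.4], 1.0)    # removed everywhere
```

Because the threshold acts on the group norm, selection is consistent across imputations. This avoids the ambiguity of a variable being selected in some imputed data sets but not in others.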
MORADI, Ghobad; PIROOZI, Bakhtiar; SAFARI, Hossein; ESMAIL NASAB, Nader; MOHAMADI BOLBANABAD, Amjad; YARI, Arezoo
2017-01-01
Background: Pabon Lasso model was applied to assess the relative performance of hospitals affiliated to Kurdistan University of Medical Sciences (KUMS) before and after the implementation of Health Sector Evolution Plan (HSEP) in Iran. Methods: This cross-sectional study was carried out in 11 public hospitals affiliated to KUMS in 2015. Twelve months before and after the implementation of the first phase of HSEP, a checklist was used to collect data from computerized databases within the hospitals’ admission and discharge units. Pabon Lasso model includes three indices: bed turnover, bed occupancy ratio, and average length of stay. Results: Analysis of hospital performance showed an increase in mean of bed occupancy and turnover ratio, which changed from 65.40% and 86.22 times/year during 12 months before to 69.97% and 90.98 times/year during 12 months after HSEP, respectively. In line with Pabon Lasso model, before the implementation of HSEP, 27.27% and 36.36% of the hospitals were entirely efficient and inefficient, respectively, whilst after the implementation of HSEP, their condition changed to 18.18% and 27.27%, in order. Conclusion: Indicators of bed occupancy and turnover ratio had a 4% increase in the studied hospitals after the implementation of HSEP. Number of the hospitals in the efficient zone reduced because of the relative measurement of efficiency by Pabon Lasso model. Since more than 50% of the hospitals in the studied province have not yet reached their optimal bed occupancy ratio (more than 70%), short-term and suitable strategy for improving the efficiency is to stop further expansion of hospitals as well as developing the number of hospital beds. PMID:28435825
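The relative nature of the Pabon Lasso classification can be made concrete with a short sketch. It assumes the conventional four-zone layout, in which hospitals are compared with the group means of bed occupancy and bed turnover: zone 3 is efficient on both indices and zone 1 is inefficient on both. The hospital names and values are hypothetical.

```python
def pabon_lasso_zones(hospitals):
    """Map name -> zone (1-4) from (bed occupancy %, bed turnover per year)."""
    mean_occupancy = sum(v[0] for v in hospitals.values()) / len(hospitals)
    mean_turnover = sum(v[1] for v in hospitals.values()) / len(hospitals)
    zones = {}
    for name, (occupancy, turnover) in hospitals.items():
        high_occupancy = occupancy >= mean_occupancy
        high_turnover = turnover >= mean_turnover
        if high_occupancy and high_turnover:
            zones[name] = 3  # efficient on both indices
        elif not high_occupancy and not high_turnover:
            zones[name] = 1  # inefficient on both indices
        elif high_turnover:
            zones[name] = 2  # low occupancy, high turnover
        else:
            zones[name] = 4  # high occupancy, low turnover
    return zones


# Hypothetical hospitals: (bed occupancy %, bed turnover times/year).
zones = pabon_lasso_zones(
    {"A": (80.0, 100.0), "B": (50.0, 60.0), "C": (60.0, 95.0), "D": (75.0, 70.0)}
)
```

Because the cut-offs are group means, a hospital can leave the efficient zone even when its own indices improve, provided the group average rises faster. This is consistent with the abstract's explanation for the drop in efficient hospitals after HSEP.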
Folded concave penalized learning in identifying multimodal MRI marker for Parkinson’s disease
Liu, Hongcheng; Du, Guangwei; Zhang, Lijun; Lewis, Mechelle M.; Wang, Xue; Yao, Tao; Li, Runze; Huang, Xuemei
2016-01-01
Background Brain MRI holds promise to gauge different aspects of Parkinson’s disease (PD)-related pathological changes. Its analysis, however, is hindered by the high-dimensional nature of the data. New method This study introduces folded concave penalized (FCP) sparse logistic regression to identify biomarkers for PD from a large number of potential factors. The proposed statistical procedures target the challenges of high-dimensionality with limited data samples acquired. The maximization problem associated with the sparse logistic regression model is solved by local linear approximation. The proposed procedures then are applied to the empirical analysis of multimodal MRI data. Results From 45 features, the proposed approach identified 15 MRI markers and the UPSIT, which are known to be clinically relevant to PD. By combining the MRI and clinical markers, we can enhance substantially the specificity and sensitivity of the model, as indicated by the ROC curves. Comparison to existing methods We compare the folded concave penalized learning scheme with both the Lasso penalized scheme and the principal component analysis-based feature selection (PCA) in the Parkinson’s biomarker identification problem that takes into account both the clinical features and MRI markers. The folded concave penalty method demonstrates a substantially better clinical potential than both the Lasso and PCA in terms of specificity and sensitivity. Conclusions For the first time, we applied the FCP learning method to MRI biomarker discovery in PD. The proposed approach successfully identified MRI markers that are clinically relevant. Combining these biomarkers with clinical features can substantially enhance performance. PMID:27102045
Tighe, Patrick J.; Harle, Christopher A.; Hurley, Robert W.; Aytug, Haldun; Boezaart, Andre P.; Fillingim, Roger B.
2015-01-01
Background Given their ability to process highly dimensional datasets with hundreds of variables, machine learning algorithms may offer one solution to the vexing challenge of predicting postoperative pain. Methods Here, we report on the application of machine learning algorithms to predict postoperative pain outcomes in a retrospective cohort of 8071 surgical patients using 796 clinical variables. Five algorithms were compared in terms of their ability to forecast moderate to severe postoperative pain: Least Absolute Shrinkage and Selection Operator (LASSO), gradient-boosted decision tree, support vector machine, neural network, and k-nearest neighbor, with logistic regression included for baseline comparison. Results In forecasting moderate to severe postoperative pain for postoperative day (POD) 1, the LASSO algorithm, using all 796 variables, had the highest accuracy with an area under the receiver-operating curve (ROC) of 0.704. Next, the gradient-boosted decision tree had an ROC of 0.665 and the k-nearest neighbor algorithm had an ROC of 0.643. For POD 3, the LASSO algorithm, using all variables, again had the highest accuracy, with an ROC of 0.727. Logistic regression had a lower ROC of 0.5 for predicting pain outcomes on POD 1 and 3. Conclusions Machine learning algorithms, when combined with complex and heterogeneous data from electronic medical record systems, can forecast acute postoperative pain outcomes with accuracies similar to methods that rely only on variables specifically collected for pain outcome prediction. PMID:26031220
Gupta, Punkaj; Rettiganti, Mallikarjuna; Gossett, Jeffrey M; Daufeldt, Jennifer; Rice, Tom B; Wetzel, Randall C
2018-01-01
To create a novel tool to predict favorable neurologic outcomes during ICU stay among children with critical illness. Logistic regression models using adaptive lasso methodology were used to identify independent factors associated with favorable neurologic outcomes. A mixed effects logistic regression model was used to create the final prediction model including all predictors selected from the lasso model. Model validation was performed using a 10-fold internal cross-validation approach. Virtual Pediatric Systems (VPS, LLC, Los Angeles, CA) database. Patients less than 18 years old admitted to one of the participating ICUs in the Virtual Pediatric Systems database were included (2009-2015). None. A total of 160,570 patients from 90 hospitals qualified for inclusion. Of these, 1,675 patients (1.04%) were associated with a decline in Pediatric Cerebral Performance Category scale by at least 2 between ICU admission and ICU discharge (unfavorable neurologic outcome). The independent factors associated with unfavorable neurologic outcome included higher weight at ICU admission, higher Pediatric Index of Mortality-2 score at ICU admission, cardiac arrest, stroke, seizures, head/nonhead trauma, use of conventional mechanical ventilation and high-frequency oscillatory ventilation, prolonged hospital length of ICU stay, and prolonged use of mechanical ventilation. The presence of chromosomal anomaly, cardiac surgery, and utilization of nitric oxide were associated with favorable neurologic outcome. The final online prediction tool can be accessed at https://soipredictiontool.shinyapps.io/GNOScore/. Our model predicted 139,688 patients with favorable neurologic outcomes in an internal validation sample when the observed number of patients with favorable neurologic outcomes was among 139,591 patients. The area under the receiver operating curve for the validation model was 0.90. 
This proposed prediction tool encompasses 20 risk factors into one probability to predict favorable neurologic outcome during ICU stay among children with critical illness. Future studies should seek external validation and improved discrimination of this prediction tool.
Ramadan, Ahmed; Boss, Connor; Choi, Jongeun; Peter Reeves, N; Cholewicki, Jacek; Popovich, John M; Radcliffe, Clark J
2018-07-01
Estimating many parameters of biomechanical systems with limited data may achieve good fit but may also increase 95% confidence intervals in parameter estimates. This results in poor identifiability in the estimation problem. Therefore, we propose a novel method to select sensitive biomechanical model parameters that should be estimated, while fixing the remaining parameters to values obtained from preliminary estimation. Our method relies on identifying the parameters to which the measurement output is most sensitive. The proposed method is based on the Fisher information matrix (FIM). It was compared against the nonlinear least absolute shrinkage and selection operator (LASSO) method to guide modelers on the pros and cons of our FIM method. We present an application identifying a biomechanical parametric model of a head position-tracking task for ten human subjects. Using measured data, our method (1) reduced model complexity by only requiring five out of twelve parameters to be estimated, (2) significantly reduced parameter 95% confidence intervals by up to 89% of the original confidence interval, (3) maintained goodness of fit measured by variance accounted for (VAF) at 82%, (4) reduced computation time, where our FIM method was 164 times faster than the LASSO method, and (5) selected similar sensitive parameters to the LASSO method, where three out of five selected sensitive parameters were shared by FIM and LASSO methods.
Genotype-phenotype association study via new multi-task learning model
Huo, Zhouyuan; Shen, Dinggang
2018-01-01
Research on the associations between genetic variations and imaging phenotypes is developing with the advance in high-throughput genotype and brain image techniques. Regression analysis of single nucleotide polymorphisms (SNPs) and imaging measures as quantitative traits (QTs) has been proposed to identify the quantitative trait loci (QTL) via multi-task learning models. Recent studies consider the interlinked structures within SNPs and imaging QTs through group lasso, e.g. ℓ2,1-norm, leading to better predictive results and insights of SNPs. However, group sparsity is not enough for representing the correlation between multiple tasks and ℓ2,1-norm regularization is not robust either. In this paper, we propose a new multi-task learning model to analyze the associations between SNPs and QTs. We suppose that low-rank structure is also beneficial to uncover the correlation between genetic variations and imaging phenotypes. Finally, we conduct regression analysis of SNPs and QTs. Experimental results show that our model is more accurate in prediction than compared methods and presents new insights of SNPs. PMID:29218896
ORACLE INEQUALITIES FOR THE LASSO IN THE COX MODEL
Huang, Jian; Sun, Tingni; Ying, Zhiliang; Yu, Yi; Zhang, Cun-Hui
2013-01-01
We study the absolute penalized maximum partial likelihood estimator in sparse, high-dimensional Cox proportional hazards regression models where the number of time-dependent covariates can be larger than the sample size. We establish oracle inequalities based on natural extensions of the compatibility and cone invertibility factors of the Hessian matrix at the true regression coefficients. Similar results based on an extension of the restricted eigenvalue can be also proved by our method. However, the presented oracle inequalities are sharper since the compatibility and cone invertibility factors are always greater than the corresponding restricted eigenvalue. In the Cox regression model, the Hessian matrix is based on time-dependent covariates in censored risk sets, so that the compatibility and cone invertibility factors, and the restricted eigenvalue as well, are random variables even when they are evaluated for the Hessian at the true regression coefficients. Under mild conditions, we prove that these quantities are bounded from below by positive constants for time-dependent covariates, including cases where the number of covariates is of greater order than the sample size. Consequently, the compatibility and cone invertibility factors can be treated as positive constants in our oracle inequalities. PMID:24086091
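For orientation, the estimator studied above can be written in standard counting-process notation. This is a sketch under assumed notation that does not appear in the abstract: Z_i(t) denotes the time-dependent covariates, Y_i(t) the at-risk indicator, N_i(t) the counting process of subject i, and λ the penalty level.

```latex
\ell(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{\infty}
  \Big[\beta^{\top} Z_i(t)
       - \log\Big(\sum_{j=1}^{n} Y_j(t)\, e^{\beta^{\top} Z_j(t)}\Big)\Big]\, dN_i(t),
\qquad
\hat{\beta} = \operatorname*{arg\,min}_{\beta}\,
  \big\{ \ell(\beta) + \lambda \lVert \beta \rVert_{1} \big\}
```

Here ℓ(β) is the negative log partial likelihood scaled by the sample size. Its Hessian depends on the covariates in the censored risk sets, which is why the compatibility and cone invertibility factors are random quantities.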
Shrinkage Estimation of Varying Covariate Effects Based On Quantile Regression
Peng, Limin; Xu, Jinfeng; Kutner, Nancy
2013-01-01
Varying covariate effects often manifest meaningful heterogeneity in covariate-response associations. In this paper, we adopt a quantile regression model that assumes linearity at a continuous range of quantile levels as a tool to explore such data dynamics. The consideration of potential non-constancy of covariate effects necessitates a new perspective for variable selection, which, under the assumed quantile regression model, is to retain variables that have effects on all quantiles of interest as well as those that influence only part of quantiles considered. Current work on l1-penalized quantile regression either does not concern varying covariate effects or may not produce consistent variable selection in the presence of covariates with partial effects, a practical scenario of interest. In this work, we propose a shrinkage approach by adopting a novel uniform adaptive LASSO penalty. The new approach enjoys easy implementation without requiring smoothing. Moreover, it can consistently identify the true model (uniformly across quantiles) and achieve the oracle estimation efficiency. We further extend the proposed shrinkage method to the case where responses are subject to random right censoring. Numerical studies confirm the theoretical results and support the utility of our proposals. PMID:25332515
Jacquin, Laval; Cao, Tuong-Vi; Ahmadi, Nourollah
2016-01-01
One objective of this study was to provide readers with a clear and unified understanding of parametric statistical and kernel methods, used for genomic prediction, and to compare some of these in the context of rice breeding for quantitative traits. Furthermore, another objective was to provide a simple and user-friendly R package, named KRMM, which allows users to perform RKHS regression with several kernels. After introducing the concept of regularized empirical risk minimization, the connections between well-known parametric and kernel methods such as Ridge regression [i.e., genomic best linear unbiased predictor (GBLUP)] and reproducing kernel Hilbert space (RKHS) regression were reviewed. Ridge regression was then reformulated so as to show and emphasize the advantage of the kernel "trick" concept, exploited by kernel methods in the context of epistatic genetic architectures, over parametric frameworks used by conventional methods. Some parametric and kernel methods, namely least absolute shrinkage and selection operator (LASSO), GBLUP, support vector machine regression (SVR) and RKHS regression, were thereupon compared for their genomic predictive ability in the context of rice breeding using three real data sets. Among the compared methods, RKHS regression and SVR were often the most accurate methods for prediction followed by GBLUP and LASSO. An R function which allows users to perform RR-BLUP of marker effects, GBLUP and RKHS regression, with a Gaussian, Laplacian, polynomial or ANOVA kernel, in a reasonable computation time has been developed. Moreover, a modified version of this function, which allows users to tune kernels for RKHS regression, has also been developed and parallelized for HPC Linux clusters. The corresponding KRMM package and all scripts have been made publicly available.
Variance Component Selection With Applications to Microbiome Taxonomic Data.
Zhai, Jing; Kim, Juhyun; Knox, Kenneth S; Twigg, Homer L; Zhou, Hua; Zhou, Jin J
2018-01-01
High-throughput sequencing technology has enabled population-based studies of the role of the human microbiome in disease etiology and exposure response. Microbiome data are summarized as counts or composition of the bacterial taxa at different taxonomic levels. An important problem is to identify the bacterial taxa that are associated with a response. One method is to test the association of a specific taxon with phenotypes in a linear mixed effect model, which incorporates phylogenetic information among bacterial communities. Another type of approach considers all taxa in a joint model and achieves selection via a penalization method, which ignores phylogenetic information. In this paper, we consider regression analysis by treating bacterial taxa at different levels as multiple random effects. For each taxon, a kernel matrix is calculated based on distance measures in the phylogenetic tree and acts as one variance component in the joint model. Then taxonomic selection is achieved by the lasso (least absolute shrinkage and selection operator) penalty on variance components. Our method integrates biological information into the variable selection problem and greatly improves selection accuracies. Simulation studies demonstrate the superiority of our methods versus existing methods, for example, group-lasso. Finally, we apply our method to a longitudinal microbiome study of Human Immunodeficiency Virus (HIV) infected patients. We implement our method using the high performance computing language Julia. Software and detailed documentation are freely available at https://github.com/JingZhai63/VCselection.
A prediction scheme of tropical cyclone frequency based on lasso and random forest
NASA Astrophysics Data System (ADS)
Tan, Jinkai; Liu, Hexiang; Li, Mengya; Wang, Jun
2017-07-01
This study aims to propose a novel prediction scheme of tropical cyclone frequency (TCF) over the Western North Pacific (WNP). We considered large-scale meteorological factors, including the sea surface temperature, sea level pressure, the Niño-3.4 index, the wind shear, the vorticity, the subtropical high, and the sea ice cover, since the chronic change of these factors in the context of climate change would cause a gradual variation of the annual TCF. Specifically, we focus on the correlation between the year-to-year increment of these factors and TCF. The least absolute shrinkage and selection operator (Lasso) method was used for variable selection and dimension reduction from 11 initial predictors. Then, a prediction model based on random forest (RF) was established by using the training samples (1978-2011) for calibration and the testing samples (2012-2016) for validation. The RF model reproduces the major variation and trend of TCF in the calibration period, and it also fits the observed TCF well in the validation period, although with some deviations. Leave-one-out cross-validation of the model showed that most of the predicted TCF values are consistent with the observed TCF, with a high correlation coefficient. A comparison between results of the RF model and the multiple linear regression (MLR) model suggested that the RF is more practical and capable of giving reliable results of TCF prediction over the WNP.
Compound Identification Using Penalized Linear Regression on Metabolomics
Liu, Ruiqi; Wu, Dongfeng; Zhang, Xiang; Kim, Seongho
2014-01-01
Compound identification is often achieved by matching the experimental mass spectra to the mass spectra stored in a reference library based on mass spectral similarity. Because the number of compounds in the reference library is much larger than the range of mass-to-charge ratio (m/z) values, the data become high-dimensional and suffer from singularity. For this reason, penalized linear regressions such as ridge regression and the lasso are used instead of the ordinary least squares regression. Furthermore, two-step approaches using the dot product and Pearson’s correlation along with the penalized linear regression are proposed in this study. PMID:27212894
Cui, Zaixu; Gong, Gaolang
2018-06-02
Individualized behavioral/cognitive prediction using machine learning (ML) regression approaches is becoming increasingly applied. The specific ML regression algorithm and sample size are two key factors that non-trivially influence prediction accuracies. However, the effects of the ML regression algorithm and sample size on individualized behavioral/cognitive prediction performance have not been comprehensively assessed. To address this issue, the present study included six commonly used ML regression algorithms: ordinary least squares (OLS) regression, least absolute shrinkage and selection operator (LASSO) regression, ridge regression, elastic-net regression, linear support vector regression (LSVR), and relevance vector regression (RVR), to perform specific behavioral/cognitive predictions based on different sample sizes. Specifically, the publicly available resting-state functional MRI (rs-fMRI) dataset from the Human Connectome Project (HCP) was used, and whole-brain resting-state functional connectivity (rsFC) or rsFC strength (rsFCS) were extracted as prediction features. Twenty-five sample sizes (ranged from 20 to 700) were studied by sub-sampling from the entire HCP cohort. The analyses showed that rsFC-based LASSO regression performed remarkably worse than the other algorithms, and rsFCS-based OLS regression performed markedly worse than the other algorithms. Regardless of the algorithm and feature type, both the prediction accuracy and its stability exponentially increased with increasing sample size. The specific patterns of the observed algorithm and sample size effects were well replicated in the prediction using re-testing fMRI data, data processed by different imaging preprocessing schemes, and different behavioral/cognitive scores, thus indicating excellent robustness/generalization of the effects. 
The current findings provide critical insight into how the selected ML regression algorithm and sample size influence individualized predictions of behavior/cognition and offer important guidance for choosing the ML regression algorithm or sample size in relevant investigations. Copyright © 2018 Elsevier Inc. All rights reserved.
Equivalence of MAXENT and Poisson point process models for species distribution modeling in ecology.
Renner, Ian W; Warton, David I
2013-03-01
Modeling the spatial distribution of a species is a fundamental problem in ecology. A number of modeling methods have been developed, an extremely popular one being MAXENT, a maximum entropy modeling approach. In this article, we show that MAXENT is equivalent to a Poisson regression model and hence is related to a Poisson point process model, differing only in the intercept term, which is scale-dependent in MAXENT. We illustrate a number of improvements to MAXENT that follow from these relations. In particular, a point process model approach facilitates methods for choosing the appropriate spatial resolution, assessing model adequacy, and choosing the LASSO penalty parameter, all currently unavailable to MAXENT. The equivalence result represents a significant step in the unification of the species distribution modeling literature. Copyright © 2013, The International Biometric Society.
Predictive modeling of cardiovascular complications in incident hemodialysis patients.
Ion Titapiccolo, J; Ferrario, M; Barbieri, C; Marcelli, D; Mari, F; Gatti, E; Cerutti, S; Smyth, P; Signorini, M G
2012-01-01
The administration of hemodialysis (HD) treatment leads to the continuous collection of a vast quantity of medical data. Many variables related to the patient health status, to the treatment, and to dialyzer settings can be recorded and stored at each treatment session. In this study a dataset of 42 variables and 1526 patients extracted from the Fresenius Medical Care database EuCliD was used to develop and apply a random forest predictive model for the prediction of cardiovascular events in the first year of HD treatment. A ridge-lasso logistic regression algorithm was then applied to the subset of variables most involved in the prediction model to gain insight into the mechanisms underlying the incidence of cardiovascular complications in this high-risk population of patients.
Won, Sungho; Choi, Hosik; Park, Suyeon; Lee, Juyoung; Park, Changyi; Kwon, Sunghoon
2015-01-01
Owing to recent improvements in genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci, and these successful findings have substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which has been a big hurdle to building disease prediction models. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called "large P and small N" problem. Penalized regressions including least absolute shrinkage and selection operator (LASSO) and ridge regression limit the space of parameters, and this constraint enables the estimation of effects for a very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods, at least for the diseases under consideration.
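The way an L1 penalty constrains the parameter space can be illustrated with a minimal sketch of cyclic coordinate descent with soft-thresholding. This is a generic textbook formulation, not the implementation compared in the report above; the function names, the objective scaling (1/2n), and the toy data are assumptions made for illustration only.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values inside [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows with p entries each; no intercept is fitted
    (hypothetical helper, assumed for this sketch).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            if norm > 0:
                beta[j] = soft_threshold(rho / n, lam) / (norm / n)
            else:
                beta[j] = 0.0
    return beta


# Toy data (made up): a strong effect on feature 0, a weak effect on feature 1.
X_toy = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y_toy = [2.1, 1.9, -1.9, -2.1]

beta_lasso = lasso_coordinate_descent(X_toy, y_toy, lam=0.5)  # weak effect is zeroed
beta_ols = lasso_coordinate_descent(X_toy, y_toy, lam=0.0)    # no penalty
```

With the penalty switched on, the weak coefficient is set exactly to zero while the strong one is shrunk, which is the behaviour that makes the estimation feasible when predictors outnumber samples.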
[Atmospheric parameter estimation for LAMOST/GUOSHOUJING spectra].
Lu, Yu; Li, Xiang-Ru; Yang, Tan
2014-11-01
It is a key task to estimate the atmospheric parameters from the observed stellar spectra in exploring the nature of stars and the universe. With our Large Sky Area Multi-Object Fiber Spectroscopy Telescope (LAMOST), which began its formal sky survey in September 2012, we are obtaining a mass of stellar spectra at an unprecedented speed. This has brought a new opportunity and a challenge for the research of galaxies. Due to the complexity of the observing system, the noise in the spectra is relatively large. At the same time, the preprocessing procedures for the spectra, such as the wavelength calibration and the flux calibration, are also not ideal, so the spectra are slightly distorted. These factors make it highly difficult to estimate the atmospheric parameters for the measured stellar spectra. Estimating the atmospheric parameters for the massive stellar spectra of LAMOST is therefore an important issue. The key question of this study is how to eliminate noise and improve the accuracy and robustness of estimating the atmospheric parameters for the measured stellar spectra. We propose a regression model, SVM(lasso), for estimating the atmospheric parameters of LAMOST stellar spectra. The basic idea of this model is as follows. First, we use the Haar wavelet to filter the spectrum, suppressing the adverse effects of the spectral noise while retaining the most discriminative information in the spectrum. Second, we use the lasso algorithm for feature selection and extract the features that correlate strongly with the atmospheric parameters. Finally, the features are input to the support vector regression model for estimating the parameters. Because the model has better tolerance to the slight distortion and the noise of the spectrum, the accuracy of the measurement is improved. To evaluate the feasibility of the above scheme, we conduct extensive experiments on the 33 963 pilot survey spectra from LAMOST. 
The accuracy of three atmospheric parameters is log Teff: 0.006 8 dex, log g: 0.155 1 dex, [Fe/H]: 0.104 0 dex.
VARIABLE SELECTION FOR REGRESSION MODELS WITH MISSING DATA
Garcia, Ramon I.; Ibrahim, Joseph G.; Zhu, Hongtu
2009-01-01
We consider the variable selection problem for a class of statistical models with missing data, including missing covariate and/or response data. We investigate the smoothly clipped absolute deviation penalty (SCAD) and adaptive LASSO and propose a unified model selection and estimation procedure for use in the presence of missing data. We develop a computationally attractive algorithm for simultaneously optimizing the penalized likelihood function and estimating the penalty parameters. Particularly, we propose to use a model selection criterion, called the ICQ statistic, for selecting the penalty parameters. We show that the variable selection procedure based on ICQ automatically and consistently selects the important covariates and leads to efficient estimates with oracle properties. The methodology is very general and can be applied to numerous situations involving missing data, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Simulations are given to demonstrate the methodology and examine the finite sample performance of the variable selection procedures. Melanoma data from a cancer clinical trial is presented to illustrate the proposed methodology. PMID:20336190
Mevaere, Jimmy; Goulard, Christophe; Schneider, Olha; Sekurova, Olga N; Ma, Haiyan; Zirah, Séverine; Afonso, Carlos; Rebuffat, Sylvie; Zotchev, Sergey B; Li, Yanyan
2018-05-29
Lasso peptides are ribosomally synthesized and post-translationally modified peptides produced by bacteria. They are characterized by an unusual lariat-knot structure. Targeted genome scanning revealed a wide diversity of lasso peptides encoded in actinobacterial genomes, but cloning and heterologous expression of these clusters turned out to be problematic. To circumvent this, we developed an orthogonal expression system for heterologous production of actinobacterial lasso peptides in Streptomyces hosts based on a newly-identified regulatory circuit from Actinoalloteichus fjordicus. Six lasso peptide gene clusters, mainly originating from marine Actinobacteria, were chosen for proof-of-concept studies. By varying the Streptomyces expression hosts and a small set of culture conditions, three new lasso peptides were successfully produced and characterized by tandem MS. The newly developed expression system thus sets the stage to uncover and bioengineer the chemo-diversity of actinobacterial lasso peptides. Moreover, our data provide some considerations for future bioprospecting efforts for such peptides.
Complex lasso: new entangled motifs in proteins
NASA Astrophysics Data System (ADS)
Niemyska, Wanda; Dabrowski-Tumanski, Pawel; Kadlof, Michal; Haglund, Ellinor; Sułkowski, Piotr; Sulkowska, Joanna I.
2016-11-01
We identify new entangled motifs in proteins that we call complex lassos. Lassos arise in proteins with disulfide bridges (or in proteins with amide linkages), when termini of a protein backbone pierce through an auxiliary surface of minimal area, spanned on a covalent loop. We find that as much as 18% of all proteins with disulfide bridges in a non-redundant subset of PDB form complex lassos, and classify them into six distinct geometric classes, one of which resembles supercoiling known from DNA. Based on biological classification of proteins we find that lassos are much more common in viruses, plants and fungi than in other kingdoms of life. We also discuss how changes in the oxidation/reduction potential may affect the function of proteins with lassos. Lassos and associated surfaces of minimal area provide new and interesting geometric characteristics, with many potential applications, not only of proteins but also of other biomolecules.
Howard, Réka; Carriquiry, Alicia L.; Beavis, William D.
2014-01-01
Parametric and nonparametric methods have been developed for purposes of predicting phenotypes. These methods are based on retrospective analyses of empirical data consisting of genotypic and phenotypic scores. Recent reports have indicated that parametric methods are unable to predict phenotypes of traits with known epistatic genetic architectures. Herein, we review parametric methods including least squares regression, ridge regression, Bayesian ridge regression, least absolute shrinkage and selection operator (LASSO), Bayesian LASSO, best linear unbiased prediction (BLUP), Bayes A, Bayes B, Bayes C, and Bayes Cπ. We also review nonparametric methods including Nadaraya-Watson estimator, reproducing kernel Hilbert space, support vector machine regression, and neural networks. We assess the relative merits of these 14 methods in terms of accuracy and mean squared error (MSE) using simulated genetic architectures consisting of completely additive or two-way epistatic interactions in an F2 population derived from crosses of inbred lines. Each simulated genetic architecture explained either 30% or 70% of the phenotypic variability. The greatest impact on estimates of accuracy and MSE was due to genetic architecture. Parametric methods were unable to predict phenotypic values when the underlying genetic architecture was based entirely on epistasis. Parametric methods were slightly better than nonparametric methods for additive genetic architectures. Distinctions among parametric methods for additive genetic architectures were incremental. Heritability, i.e., proportion of phenotypic variability, had the second greatest impact on estimates of accuracy and MSE. PMID:24727289
Jackman, Patrick; Sun, Da-Wen; Elmasry, Gamal
2012-08-01
A new algorithm for the conversion of device dependent RGB colour data into device independent L*a*b* colour data without introducing noticeable error has been developed. By combining a linear colour space transform and advanced multiple regression methodologies it was possible to predict L*a*b* colour data with less than 2.2 colour units of error (CIE 1976). By transforming the red, green and blue colour components into new variables that better reflect the structure of the L*a*b* colour space, a low colour calibration error was immediately achieved (ΔE(CAL) = 14.1). Application of a range of regression models on the data further reduced the colour calibration error substantially (multilinear regression ΔE(CAL) = 5.4; response surface ΔE(CAL) = 2.9; PLSR ΔE(CAL) = 2.6; LASSO regression ΔE(CAL) = 2.1). Only the PLSR models deteriorated substantially under cross validation. The algorithm is adaptable and can be easily recalibrated to any working computer vision system. The algorithm was tested on a typical working laboratory computer vision system and delivered only a very marginal loss of colour information ΔE(CAL) = 2.35. Colour features derived on this system were able to safely discriminate between three classes of ham with 100% correct classification whereas colour features measured on a conventional colourimeter were not. Copyright © 2012 Elsevier Ltd. All rights reserved.
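The colour errors quoted above are expressed in CIE 1976 units, where the difference between two L*a*b* colours is the Euclidean distance between them. A minimal sketch follows; the function names are hypothetical, and averaging the per-sample differences to obtain a single calibration error is an assumption here, since the abstract does not state how ΔE(CAL) was aggregated.

```python
import math


def delta_e_cie76(lab_reference, lab_predicted):
    """CIE 1976 colour difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(lab_reference, lab_predicted)))


def mean_delta_e(reference_colours, predicted_colours):
    """Average CIE76 difference over paired samples (aggregation rule assumed)."""
    diffs = [delta_e_cie76(r, p) for r, p in zip(reference_colours, predicted_colours)]
    return sum(diffs) / len(diffs)


# Made-up colours for illustration only.
single_difference = delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0))
average_difference = mean_delta_e(
    [(50.0, 0.0, 0.0), (60.0, 10.0, 10.0)],
    [(50.0, 3.0, 4.0), (60.0, 10.0, 10.0)],
)
```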
Developing a dengue forecast model using machine learning: A case study in China.
Guo, Pi; Liu, Tao; Zhang, Qin; Wang, Li; Xiao, Jianpeng; Zhang, Qingying; Luo, Ganfeng; Li, Zhihao; He, Jianfeng; Zhang, Yonghui; Ma, Wenjun
2017-10-01
In China, dengue remains an important public health issue with expanded areas and increased incidence recently. Accurate and timely forecasts of dengue incidence in China are still lacking. We aimed to use the state-of-the-art machine learning algorithms to develop an accurate predictive model of dengue. Weekly dengue cases, Baidu search queries and climate factors (mean temperature, relative humidity and rainfall) during 2011-2014 in Guangdong were gathered. A dengue search index was constructed for developing the predictive models in combination with climate factors. The observed year and week were also included in the models to control for the long-term trend and seasonality. Several machine learning algorithms, including the support vector regression (SVR) algorithm, step-down linear regression model, gradient boosted regression tree algorithm (GBM), negative binomial regression model (NBM), least absolute shrinkage and selection operator (LASSO) linear regression model and generalized additive model (GAM), were used as candidate models to predict dengue incidence. Performance and goodness of fit of the models were assessed using the root-mean-square error (RMSE) and R-squared measures. The residuals of the models were examined using the autocorrelation and partial autocorrelation function analyses to check the validity of the models. The models were further validated using dengue surveillance data from five other provinces. The epidemics during the last 12 weeks and the peak of the 2014 large outbreak were accurately forecasted by the SVR model selected by a cross-validation technique. Moreover, the SVR model had the consistently smallest prediction error rates for tracking the dynamics of dengue and forecasting the outbreaks in other areas in China. The proposed SVR model achieved a superior performance in comparison with other forecasting techniques assessed in this study. The findings can help the government and community respond early to dengue epidemics.
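The two goodness-of-fit measures named above can be sketched as follows. This is a generic formulation with hypothetical function names and made-up numbers; in particular, R-squared is computed here as one minus the ratio of residual to total sum of squares, which is an assumption, since the abstract does not say which definition was used.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between paired observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up weekly counts for illustration only.
observed_cases = [2.0, 4.0, 6.0, 8.0]
predicted_cases = [3.0, 5.0, 7.0, 9.0]

error = rmse(observed_cases, predicted_cases)
fit = r_squared(observed_cases, predicted_cases)
```

A lower RMSE and a higher R-squared both indicate a closer fit, so the two measures are usually reported together when candidate models are ranked.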
MRM-Lasso: A Sparse Multiview Feature Selection Method via Low-Rank Analysis.
Yang, Wanqi; Gao, Yang; Shi, Yinghuan; Cao, Longbing
2015-11-01
Learning about multiview data involves many applications, such as video understanding, image classification, and social media. However, when the data dimension increases dramatically, it is important but very challenging to remove redundant features in multiview feature selection. In this paper, we propose a novel feature selection algorithm, multiview rank minimization-based Lasso (MRM-Lasso), which jointly utilizes Lasso for sparse feature selection and rank minimization for learning relevant patterns across views. Instead of simply integrating multiple Lasso models at the view level, we focus on sample-level performance (sample significance) and introduce pattern-specific weights into MRM-Lasso. The weights are utilized to measure the contribution of each sample to the labels in the current view. In addition, the latent correlation across different views is successfully captured by learning a low-rank matrix consisting of pattern-specific weights. The alternating direction method of multipliers is applied to optimize the proposed MRM-Lasso. Experiments on four real-life data sets show that features selected by MRM-Lasso have better multiview classification performance than the baselines. Moreover, pattern-specific weights are demonstrated to be significant for learning about multiview data, compared with view-specific weights.
Gerber, Brian D.; Kendall, William L.; Hooten, Mevin B.; Dubovsky, James A.; Drewien, Roderick C.
2015-01-01
Prediction is fundamental to scientific enquiry and application; however, ecologists tend to favour explanatory modelling. We discuss a predictive modelling framework to evaluate ecological hypotheses and to explore novel/unobserved environmental scenarios to assist conservation and management decision-makers. We apply this framework to develop an optimal predictive model for juvenile (<1 year old) sandhill crane Grus canadensis recruitment of the Rocky Mountain Population (RMP). We consider spatial climate predictors motivated by hypotheses of how drought across multiple time-scales and spring/summer weather affects recruitment. Our predictive modelling framework focuses on developing a single model that includes all relevant predictor variables, regardless of collinearity. This model is then optimized for prediction by controlling model complexity using a data-driven approach that marginalizes or removes irrelevant predictors from the model. Specifically, we highlight two approaches of statistical regularization, Bayesian least absolute shrinkage and selection operator (LASSO) and ridge regression. Our optimal predictive Bayesian LASSO and ridge regression models were similar and on average 37% superior in predictive accuracy to an explanatory modelling approach. Our predictive models confirmed a priori hypotheses that drought and cold summers negatively affect juvenile recruitment in the RMP. The effects of long-term drought can be alleviated by short-term wet spring–summer months; however, the alleviation of long-term drought has a much greater positive effect on juvenile recruitment. The number of freezing days and snowpack during the summer months can also negatively affect recruitment, while spring snowpack has a positive effect. Breeding habitat, mediated through climate, is a limiting factor on population growth of sandhill cranes in the RMP, which could become more limiting with a changing climate (i.e. increased drought). 
These effects are likely not unique to cranes. The alteration of hydrological patterns and water levels by drought may impact many migratory, wetland nesting birds in the Rocky Mountains and beyond. Generalizable predictive models (trained by out-of-sample fit and based on ecological hypotheses) are needed by conservation and management decision-makers. Statistical regularization improves predictions and provides a general framework for fitting models with a large number of predictors, even those with collinearity, to simultaneously identify an optimal predictive model while conducting rigorous Bayesian model selection. Our framework is important for understanding population dynamics under a changing climate and has direct applications for making harvest and habitat management decisions.
Mallick, Himel; Tiwari, Hemant K
2016-01-01
Count data are increasingly ubiquitous in genetic association studies, where it is possible to observe excess zero counts as compared to what is expected based on standard assumptions. For instance, in rheumatology, data are usually collected in multiple joints within a person or multiple sub-regions of a joint, and it is not uncommon that the phenotypes contain an enormous number of zeroes due to the presence of excessive zero counts in the majority of patients. Most existing statistical methods assume that the count phenotypes follow one of these four distributions with appropriate dispersion-handling mechanisms: Poisson, Zero-inflated Poisson (ZIP), Negative Binomial, and Zero-inflated Negative Binomial (ZINB). However, little is known about their implications in genetic association studies. Also, there is a relative paucity of literature on their usefulness with respect to model misspecification and variable selection. In this article, we have investigated the performance of several state-of-the-art approaches for handling zero-inflated count data along with a novel penalized regression approach with an adaptive LASSO penalty, by simulating data under a variety of disease models and linkage disequilibrium patterns. By taking into account data-adaptive weights in the estimation procedure, the proposed method provides greater flexibility in multi-SNP modeling of zero-inflated count phenotypes. A fast coordinate descent algorithm nested within an EM (expectation-maximization) algorithm is implemented for estimating the model parameters and conducting variable selection simultaneously. Results show that the proposed method has optimal performance in the presence of multicollinearity, as measured by both prediction accuracy and empirical power, which is especially apparent as the sample size increases. Moreover, the Type I error rates become more or less uncontrollable for the competing methods when a model is misspecified, a phenomenon routinely encountered in practice.
[Multi-mathematical modelings for compatibility optimization of Jiangzhi granules].
Yang, Ming; Zhang, Li; Ge, Yingli; Lu, Yanliu; Ji, Guang
2011-12-01
To investigate the method of "multi activity index evaluation and combination optimized of multi-component" for Chinese herbal formulas. According to the scheme of uniform experimental design, efficacy experiment, multi index evaluation, least absolute shrinkage and selection operator (LASSO) modeling, evolutionary optimization algorithm, and validation experiment, we optimized the combination of Jiangzhi granules based on the activity indexes of blood serum ALT, ALT, AST, TG, TC, HDL, LDL and TG level of liver tissues, and the ratio of liver tissue to body. Analytic hierarchy process (AHP) combined with criteria importance through intercriteria correlation (CRITIC) for multi activity index evaluation was more reasonable and objective; it reflected the information of the activity indexes' order and the objective sample data. LASSO algorithm modeling could accurately reflect the relationship between different combinations of Jiangzhi granules and the comprehensive activity indexes. The optimized combination of Jiangzhi granules showed better values of the comprehensive activity indexes than the original formula after the validation experiment. AHP combined with CRITIC can be used for multi activity index evaluation, and the LASSO algorithm is suitable for combination optimization of Chinese herbal formulas.
Predicting the Trends of Social Events on Chinese Social Media.
Zhou, Yang; Zhang, Lei; Liu, Xiaoqian; Zhang, Zhen; Bai, Shuotian; Zhu, Tingshao
2017-09-01
Growing interest in social events on social media came along with the rapid development of the Internet. Social events that occur in the "real" world can spread on social media (e.g., Sina Weibo) rapidly, which may trigger severe consequences and thus require the government's timely attention and responses. This article proposes to predict the trends of social events on Sina Weibo, which is currently the most popular social media in China. Based on the theories of social psychology and communication sciences, we extract an unprecedented amount of comprehensive and effective features that relate to the trends of social events on Chinese social media, and we construct trend prediction models by using three classical regression algorithms. We found that lasso regression performed better, with a precision of 0.78 and a recall of 0.88. The results of our experiments demonstrated the effectiveness of our proposed approach.
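Precision and recall, the two measures reported above, are ratios of counts from a confusion matrix. The sketch below uses the standard definitions with hypothetical function names; the counts are made up and are not taken from the study, and how the regression output was turned into discrete trend labels is not described in the abstract.

```python
def precision(true_positives, false_positives):
    """Share of predicted positives that are actually positive."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of actual positives that were predicted as positive."""
    return true_positives / (true_positives + false_negatives)


# Made-up counts for illustration only.
example_precision = precision(true_positives=8, false_positives=2)
example_recall = recall(true_positives=8, false_negatives=2)
```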
Comparison of LASSO and GPS time transfers
NASA Technical Reports Server (NTRS)
Lewandowski, W.; Petit, G.; Baumont, F.; Fridelance, P.; Gaignebet, J.; Grudler, P.; Veillet, C.; Wiant, J.; Klepczynski, W. J.
1994-01-01
The LASSO is a technique which should allow the comparison of remote atomic clocks with sub-nanosecond precision and accuracy. The first successful time transfer using LASSO has been carried out between the Observatoire de la Cote d'Azur in France and the McDonald Observatory in Texas, United States. This paper presents a preliminary comparison of LASSO time transfer with GPS common-view time transfer.
Statistical learning and selective inference.
Taylor, Jonathan; Tibshirani, Robert J
2015-06-23
We describe the problem of "selective inference." This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have "cherry-picked"--searched for the strongest associations--means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
Discovering graphical Granger causality using the truncating lasso penalty
Shojaie, Ali; Michailidis, George
2010-01-01
Motivation: Components of biological systems interact with each other in order to carry out vital cell functions. Such information can be used to improve estimation and inference, and to obtain better insights into the underlying cellular mechanisms. Discovering regulatory interactions among genes is therefore an important problem in systems biology. Whole-genome expression data over time provides an opportunity to determine how the expression levels of genes are affected by changes in transcription levels of other genes, and can therefore be used to discover regulatory interactions among genes. Results: In this article, we propose a novel penalization method, called truncating lasso, for estimation of causal relationships from time-course gene expression data. The proposed penalty can correctly determine the order of the underlying time series, and improves the performance of the lasso-type estimators. Moreover, the resulting estimate provides information on the time lag between activation of transcription factors and their effects on regulated genes. We provide an efficient algorithm for estimation of model parameters, and show that the proposed method can consistently discover causal relationships in the large p, small n setting. The performance of the proposed model is evaluated favorably in simulated, as well as real, data examples. Availability: The proposed truncating lasso method is implemented in the R-package ‘grangerTlasso’ and is freely available at http://www.stat.lsa.umich.edu/~shojaie/ Contact: shojaie@umich.edu PMID:20823316
Inferring metabolic networks using the Bayesian adaptive graphical lasso with informative priors.
Peterson, Christine; Vannucci, Marina; Karakas, Cemal; Choi, William; Ma, Lihua; Maletić-Savatić, Mirjana
2013-10-01
Metabolic processes are essential for cellular function and survival. We are interested in inferring a metabolic network in activated microglia, a major neuroimmune cell in the brain responsible for the neuroinflammation associated with neurological diseases, based on a set of quantified metabolites. To achieve this, we apply the Bayesian adaptive graphical lasso with informative priors that incorporate known relationships between covariates. To encourage sparsity, the Bayesian graphical lasso places double exponential priors on the off-diagonal entries of the precision matrix. The Bayesian adaptive graphical lasso allows each double exponential prior to have a unique shrinkage parameter. These shrinkage parameters share a common gamma hyperprior. We extend this model to create an informative prior structure by formulating tailored hyperpriors on the shrinkage parameters. By choosing parameter values for each hyperprior that shift probability mass toward zero for nodes that are close together in a reference network, we encourage edges between covariates with known relationships. This approach can improve the reliability of network inference when the sample size is small relative to the number of parameters to be estimated. When applied to the data on activated microglia, the inferred network includes both known relationships and associations of potential interest for further investigation.
Inferring metabolic networks using the Bayesian adaptive graphical lasso with informative priors
PETERSON, CHRISTINE; VANNUCCI, MARINA; KARAKAS, CEMAL; CHOI, WILLIAM; MA, LIHUA; MALETIĆ-SAVATIĆ, MIRJANA
2014-01-01
Metabolic processes are essential for cellular function and survival. We are interested in inferring a metabolic network in activated microglia, a major neuroimmune cell in the brain responsible for the neuroinflammation associated with neurological diseases, based on a set of quantified metabolites. To achieve this, we apply the Bayesian adaptive graphical lasso with informative priors that incorporate known relationships between covariates. To encourage sparsity, the Bayesian graphical lasso places double exponential priors on the off-diagonal entries of the precision matrix. The Bayesian adaptive graphical lasso allows each double exponential prior to have a unique shrinkage parameter. These shrinkage parameters share a common gamma hyperprior. We extend this model to create an informative prior structure by formulating tailored hyperpriors on the shrinkage parameters. By choosing parameter values for each hyperprior that shift probability mass toward zero for nodes that are close together in a reference network, we encourage edges between covariates with known relationships. This approach can improve the reliability of network inference when the sample size is small relative to the number of parameters to be estimated. When applied to the data on activated microglia, the inferred network includes both known relationships and associations of potential interest for further investigation. PMID:24533172
Efficient methods for overlapping group lasso.
Yuan, Lei; Liu, Jun; Ye, Jieping
2013-09-01
The group Lasso is an extension of the Lasso for feature selection on (predefined) nonoverlapping groups of features. The nonoverlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. Our methods and theoretical results are then generalized to tackle the general overlapping group Lasso formulation based on the l(q) norm. We further extend our algorithm to solve a nonconvex overlapping group Lasso formulation based on the capped norm regularization, which reduces the estimation bias introduced by the convex penalty. We have performed empirical evaluations using both a synthetic and the breast cancer gene expression dataset, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. Results also demonstrate the effectiveness of the nonconvex formulation for overlapping group Lasso.
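For orientation, the proximal operator has a closed form in the simpler non-overlapping case (block soft-thresholding). The overlapping case has no such closed form in general, which is why the abstract computes it through a dual problem. The sketch below covers only the non-overlapping case and is not the authors' algorithm; the names `group_soft_threshold`, `groups`, and `lam` are hypothetical.

```python
import math

def group_soft_threshold(v, groups, lam):
    """Proximal operator of lam * sum_g ||v_g||_2 for NON-overlapping groups.

    v      : list of floats
    groups : list of index lists that partition range(len(v)) (assumed)
    lam    : non-negative penalty weight
    """
    out = [0.0] * len(v)
    for idx in groups:
        norm = math.sqrt(sum(v[i] ** 2 for i in idx))
        # Shrink the whole block toward zero; the block becomes exactly
        # zero when its Euclidean norm does not exceed lam.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * v[i]
    return out
```

With `v = [3.0, 4.0, 0.1, 0.1]`, groups `[[0, 1], [2, 3]]` and `lam = 1.0`, the first block (norm 5) is scaled by 0.8 and the second block is set to zero, which is the group-level sparsity the penalty is meant to produce.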
NASA Astrophysics Data System (ADS)
Chan, H. M.; van der Velden, B. H. M.; E Loo, C.; Gilhuijs, K. G. A.
2017-08-01
We present a radiomics model to discriminate between patients at low risk and those at high risk of treatment failure at long-term follow-up based on eigentumors: principal components computed from volumes encompassing tumors in washin and washout images of pre-treatment dynamic contrast-enhanced (DCE-) MR images. Eigentumors were computed from the images of 563 patients from the MARGINS study. Subsequently, a least absolute shrinkage and selection operator (LASSO) selected candidates from the components that contained 90% of the variance of the data. The model for prediction of survival after treatment (median follow-up time 86 months) was based on logistic regression. Receiver operating characteristic (ROC) analysis was applied and area-under-the-curve (AUC) values were computed as measures of training and cross-validated performances. The discriminating potential of the model was confirmed using Kaplan-Meier survival curves and log-rank tests. From the 322 principal components that explained 90% of the variance of the data, the LASSO selected 28 components. The ROC curves of the model yielded AUC values of 0.88, 0.77 and 0.73, for the training, leave-one-out cross-validated and bootstrapped performances, respectively. The bootstrapped Kaplan-Meier survival curves confirmed significant separation for all tumors (P < 0.0001). Survival analysis on immunohistochemical subgroups shows significant separation for the estrogen-receptor subtype tumors (P < 0.0001) and the triple-negative subtype tumors (P = 0.0039), but not for tumors of the HER2 subtype (P = 0.41). The results of this retrospective study show the potential of early-stage pre-treatment eigentumors for use in prediction of treatment failure of breast cancer.
NASA Astrophysics Data System (ADS)
Gustafson, W. I., Jr.; Vogelmann, A. M.; Li, Z.; Cheng, X.; Endo, S.; Krishna, B.; Toto, T.; Xiao, H.
2017-12-01
Large-eddy simulation (LES) is a powerful tool for understanding atmospheric turbulence and cloud development. However, the results are sensitive to the choice of forcing data sets used to drive the LES model, and the most realistic forcing data is difficult to identify a priori. Knowing the sensitivity of boundary layer and cloud processes to forcing data selection is critical when using LES to understand atmospheric processes and when developing associated parameterizations. The U.S. Department of Energy Atmospheric Radiation Measurement (ARM) User Facility has been developing the capability to routinely generate ensembles of LES based on a selection of plausible input forcing data sets. The LES ARM Symbiotic Simulation and Observation (LASSO) project is initially generating simulations for shallow convection days at the ARM Southern Great Plains site in Oklahoma. This talk will examine 13 days with shallow convection selected from the period May-August 2016, with multiple forcing sources and spatial scales used to generate an LES ensemble for each of the days, resulting in hundreds of LES runs with coincident observations from ARM's extensive suite of in situ and retrieval-based products. This talk will focus particularly on the sensitivity of the cloud development and its relation to forcing data. Variability of the PBL characteristics, lifting condensation level, cloud base height, cloud fraction, and liquid water path will be examined. More information about the LASSO project can be found at https://www.arm.gov/capabilities/modeling/lasso.
Sparse Logistic Regression for Diagnosis of Liver Fibrosis in Rat by Using SCAD-Penalized Likelihood
Yan, Fang-Rong; Lin, Jin-Guan; Liu, Yu
2011-01-01
The objective of the present study is to find out the quantitative relationship between progression of liver fibrosis and the levels of certain serum markers using mathematic model. We provide the sparse logistic regression by using smoothly clipped absolute deviation (SCAD) penalized function to diagnose the liver fibrosis in rats. Not only does it give a sparse solution with high accuracy, it also provides the users with the precise probabilities of classification with the class information. In the simulative case and the experiment case, the proposed method is comparable to the stepwise linear discriminant analysis (SLDA) and the sparse logistic regression with least absolute shrinkage and selection operator (LASSO) penalty, by using receiver operating characteristic (ROC) with bayesian bootstrap estimating area under the curve (AUC) diagnostic sensitivity for selected variable. Results show that the new approach provides a good correlation between the serum marker levels and the liver fibrosis induced by thioacetamide (TAA) in rats. Meanwhile, this approach might also be used in predicting the development of liver cirrhosis. PMID:21716672
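For readers unfamiliar with SCAD, the penalty is linear (lasso-like) for small coefficients, quadratic in a transition zone, and constant for large ones, so large effects are not shrunk further. The sketch below uses the standard textbook definition. The abstract does not state the parameter values used, so the default `a = 3.7` is an assumption (a commonly used value), and the function name `scad_penalty` is hypothetical.

```python
def scad_penalty(t, lam, a=3.7):
    """Value of the SCAD penalty for a single coefficient t.

    lam : non-negative tuning parameter
    a   : shape parameter, must be > 2 (3.7 is an assumed default)
    """
    x = abs(t)
    if x <= lam:
        # Small coefficients: same as the lasso penalty.
        return lam * x
    if x <= a * lam:
        # Transition zone: quadratic piece joining the two regimes.
        return (2 * a * lam * x - x * x - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, hence no further shrinkage.
    return lam * lam * (a + 1) / 2
```

The three pieces join continuously at `|t| = lam` and `|t| = a * lam`, which is easy to confirm by evaluating the function on either side of those points.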
Genetic risk prediction using a spatial autoregressive model with adaptive lasso.
Wen, Yalu; Shen, Xiaoxi; Lu, Qing
2018-05-31
With rapidly evolving high-throughput technologies, studies are being initiated to accelerate the process toward precision medicine. The collection of the vast amounts of sequencing data provides us with great opportunities to systematically study the role of a deep catalog of sequencing variants in risk prediction. Nevertheless, the massive amount of noise signals and low frequencies of rare variants in sequencing data pose great analytical challenges on risk prediction modeling. Motivated by the development in spatial statistics, we propose a spatial autoregressive model with adaptive lasso (SARAL) for risk prediction modeling using high-dimensional sequencing data. The SARAL is a set-based approach, and thus, it reduces the data dimension and accumulates genetic effects within a single-nucleotide variant (SNV) set. Moreover, it allows different SNV sets having various magnitudes and directions of effect sizes, which reflects the nature of complex diseases. With the adaptive lasso implemented, SARAL can shrink the effects of noise SNV sets to be zero and, thus, further improve prediction accuracy. Through simulation studies, we demonstrate that, overall, SARAL is comparable to, if not better than, the genomic best linear unbiased prediction method. The method is further illustrated by an application to the sequencing data from the Alzheimer's Disease Neuroimaging Initiative. Copyright © 2018 John Wiley & Sons, Ltd.
Validating the LASSO algorithm by unmixing spectral signatures in multicolor phantoms
NASA Astrophysics Data System (ADS)
Samarov, Daniel V.; Clarke, Matthew; Lee, Ji Yoon; Allen, David; Litorja, Maritoni; Hwang, Jeeseong
2012-03-01
As hyperspectral imaging (HSI) sees increased implementation into the biological and medical fields it becomes increasingly important that the algorithms being used to analyze the corresponding output be validated. While certainly important under any circumstance, as this technology begins to see a transition from benchtop to bedside ensuring that the measurements being given to medical professionals are accurate and reproducible is critical. In order to address these issues work has been done in generating a collection of datasets which could act as a test bed for algorithms validation. Using a microarray spot printer a collection of three food color dyes, acid red 1 (AR), brilliant blue R (BBR) and erioglaucine (EG) are mixed together at different concentrations in varying proportions at different locations on a microarray chip. With the concentration and mixture proportions known at each location, using HSI an algorithm should in principle, based on estimates of abundances, be able to determine the concentrations and proportions of each dye at each location on the chip. These types of data are particularly important in the context of medical measurements as the resulting estimated abundances will be used to make critical decisions which can have a serious impact on an individual's health. In this paper we present a novel algorithm for processing and analyzing HSI data based on the LASSO algorithm (similar to "basis pursuit"). The LASSO is a statistical method for simultaneously performing model estimation and variable selection. In the context of estimating abundances in an HSI scene these so called "sparse" representations provided by the LASSO are appropriate as not every pixel will be expected to contain every endmember. The algorithm we present takes the general framework of the LASSO algorithm a step further and incorporates the rich spatial information which is available in HSI to further improve the estimates of abundance.
We show our algorithm's improvement over the standard LASSO using the dye mixture data as the test bed.
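To make the baseline concrete, the standard LASSO that the abstract compares against can be fitted by cyclic coordinate descent with soft-thresholding. The sketch below is only that plain baseline for one pixel's spectrum; it does not include the spatial extension described by the authors, and all names (`lasso_cd`, `soft_threshold`, `lam`, `n_iter`) are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    X : list of n rows, each a list of p floats (e.g. endmember spectra
        as columns, an assumption made for this illustration)
    y : list of n floats (e.g. one observed pixel spectrum)
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residual y - X b, with b = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b
```

Coefficients whose covariance with the residual stays below `lam` are set exactly to zero, which matches the sparsity argument in the abstract that not every pixel contains every endmember. A non-negativity constraint on abundances, often used in unmixing, is not included here.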
Novel harmonic regularization approach for variable selection in Cox's proportional hazards model.
Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan
2014-01-01
Variable selection is an important issue in regression and a number of variable selection methods have been proposed involving nonconvex penalty functions. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in the Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on the artificial datasets and four real microarray gene expression datasets, including the diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods.
Supervised group Lasso with applications to microarray data analysis
Ma, Shuangge; Song, Xiao; Huang, Jian
2007-01-01
Background: A tremendous amount of effort has been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take into consideration the cluster structure. Results: We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data. Conclusion: We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods. PMID:17316436
NASA Astrophysics Data System (ADS)
Gao, Wei; Li, Xiang-ru
2017-07-01
Multi-task learning analyzes multiple tasks jointly in order to exploit the correlations among them and thereby improve the accuracy of the results. Methods of this kind have been widely applied in machine learning, pattern recognition, computer vision, and other related fields. This paper investigates the application of multi-task learning to estimating the stellar atmospheric parameters, including the surface temperature (Teff), surface gravitational acceleration (lg g), and chemical abundance ([Fe/H]). First, the spectral features of the three stellar atmospheric parameters are extracted using the multi-task sparse group Lasso algorithm; then a support vector machine is used to estimate the atmospheric physical parameters. The proposed scheme is evaluated on both the Sloan stellar spectra and the theoretical spectra computed from the Kurucz's New Opacity Distribution Function (NEWODF) model. The mean absolute errors (MAEs) on the Sloan spectra are: 0.0064 for lg (Teff /K), 0.1622 for lg (g/(cm · s⁻²)), and 0.1221 dex for [Fe/H]; the MAEs on the synthetic spectra are 0.0006 for lg (Teff /K), 0.0098 for lg (g/(cm · s⁻²)), and 0.0082 dex for [Fe/H]. Experimental results show that the proposed scheme achieves high accuracy in the estimation of stellar atmospheric parameters.
LASSO experiment: Intercalibrations of the LASSO ranging stations
NASA Technical Reports Server (NTRS)
Gaignebet, J.; Hatat, J.-L.; Mangin, J. F.; Torre, J. M.; Klepczynski, William J.; Mccubin, L.; Wiant, J.; Rickefs, R.
1994-01-01
Presented are equations for time synchronization of laser ranging stations. The system consists of a satellite fitted with laser retroreflectors associated to a light detector and an event timer and two laser ranging stations with their own event timers. Methods of determining the Lasso intercalibration constant are given.
Developing a dengue forecast model using machine learning: A case study in China
Zhang, Qin; Wang, Li; Xiao, Jianpeng; Zhang, Qingying; Luo, Ganfeng; Li, Zhihao; He, Jianfeng; Zhang, Yonghui; Ma, Wenjun
2017-01-01
Background: In China, dengue remains an important public health issue with expanded areas and increased incidence recently. Accurate and timely forecasts of dengue incidence in China are still lacking. We aimed to use the state-of-the-art machine learning algorithms to develop an accurate predictive model of dengue. Methodology/Principal findings: Weekly dengue cases, Baidu search queries and climate factors (mean temperature, relative humidity and rainfall) during 2011–2014 in Guangdong were gathered. A dengue search index was constructed for developing the predictive models in combination with climate factors. The observed year and week were also included in the models to control for the long-term trend and seasonality. Several machine learning algorithms, including the support vector regression (SVR) algorithm, step-down linear regression model, gradient boosted regression tree algorithm (GBM), negative binomial regression model (NBM), least absolute shrinkage and selection operator (LASSO) linear regression model and generalized additive model (GAM), were used as candidate models to predict dengue incidence. Performance and goodness of fit of the models were assessed using the root-mean-square error (RMSE) and R-squared measures. The residuals of the models were examined using the autocorrelation and partial autocorrelation function analyses to check the validity of the models. The models were further validated using dengue surveillance data from five other provinces. The epidemics during the last 12 weeks and the peak of the 2014 large outbreak were accurately forecasted by the SVR model selected by a cross-validation technique. Moreover, the SVR model had the consistently smallest prediction error rates for tracking the dynamics of dengue and forecasting the outbreaks in other areas in China. Conclusion and significance: The proposed SVR model achieved a superior performance in comparison with other forecasting techniques assessed in this study.
The findings can help the government and community respond early to dengue epidemics. PMID:29036169
Gerber, Brian D; Kendall, William L; Hooten, Mevin B; Dubovsky, James A; Drewien, Roderick C
2015-09-01
1. Prediction is fundamental to scientific enquiry and application; however, ecologists tend to favour explanatory modelling. We discuss a predictive modelling framework to evaluate ecological hypotheses and to explore novel/unobserved environmental scenarios to assist conservation and management decision-makers. We apply this framework to develop an optimal predictive model for juvenile (<1 year old) sandhill crane Grus canadensis recruitment of the Rocky Mountain Population (RMP). We consider spatial climate predictors motivated by hypotheses of how drought across multiple time-scales and spring/summer weather affects recruitment. 2. Our predictive modelling framework focuses on developing a single model that includes all relevant predictor variables, regardless of collinearity. This model is then optimized for prediction by controlling model complexity using a data-driven approach that marginalizes or removes irrelevant predictors from the model. Specifically, we highlight two approaches of statistical regularization, Bayesian least absolute shrinkage and selection operator (LASSO) and ridge regression. 3. Our optimal predictive Bayesian LASSO and ridge regression models were similar and on average 37% superior in predictive accuracy to an explanatory modelling approach. Our predictive models confirmed a priori hypotheses that drought and cold summers negatively affect juvenile recruitment in the RMP. The effects of long-term drought can be alleviated by short-term wet spring-summer months; however, the alleviation of long-term drought has a much greater positive effect on juvenile recruitment. The number of freezing days and snowpack during the summer months can also negatively affect recruitment, while spring snowpack has a positive effect. 4. Breeding habitat, mediated through climate, is a limiting factor on population growth of sandhill cranes in the RMP, which could become more limiting with a changing climate (i.e. increased drought). 
These effects are likely not unique to cranes. The alteration of hydrological patterns and water levels by drought may impact many migratory, wetland nesting birds in the Rocky Mountains and beyond. 5. Generalizable predictive models (trained by out-of-sample fit and based on ecological hypotheses) are needed by conservation and management decision-makers. Statistical regularization improves predictions and provides a general framework for fitting models with a large number of predictors, even those with collinearity, to simultaneously identify an optimal predictive model while conducting rigorous Bayesian model selection. Our framework is important for understanding population dynamics under a changing climate and has direct applications for making harvest and habitat management decisions. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.
Dayon, Loïc; Guiraud, Seu Ping; Corthésy, John; Da Silva, Laeticia; Migliavacca, Eugenia; Tautvydaitė, Domilė; Oikonomidi, Aikaterini; Moullet, Barbara; Henry, Hugues; Métairon, Sylviane; Marquis, Julien; Descombes, Patrick; Collino, Sebastiano; Martin, François-Pierre J; Montoliu, Ivan; Kussmann, Martin; Wojcik, Jérôme; Bowman, Gene L; Popp, Julius
2017-06-17
Hyperhomocysteinemia is a risk factor for cognitive decline and dementia, including Alzheimer disease (AD). Homocysteine (Hcy) is a sulfur-containing amino acid and metabolite of the methionine pathway. The interrelated methionine, purine, and thymidylate cycles constitute the one-carbon metabolism that plays a critical role in the synthesis of DNA, neurotransmitters, phospholipids, and myelin. In this study, we tested the hypothesis that one-carbon metabolites beyond Hcy are relevant to cognitive function and cerebrospinal fluid (CSF) measures of AD pathology in older adults. Cross-sectional analysis was performed on matched CSF and plasma collected from 120 older community-dwelling adults with (n = 72) or without (n = 48) cognitive impairment. Liquid chromatography-mass spectrometry was performed to quantify one-carbon metabolites and their cofactors. Least absolute shrinkage and selection operator (LASSO) regression was initially applied to clinical and biomarker measures that generate the highest diagnostic accuracy of a priori-defined cognitive impairment (Clinical Dementia Rating-based) and AD pathology (i.e., CSF tau phosphorylated at threonine 181 [p-tau181]/β-Amyloid 1-42 peptide chain [Aβ1-42] > 0.0779) to establish a reference benchmark. Two other LASSO-determined models were generated that included the one-carbon metabolites in CSF and then plasma. Correlations of CSF and plasma one-carbon metabolites with CSF amyloid and tau were explored. LASSO-determined models were stratified by apolipoprotein E (APOE) ε4 carrier status. The diagnostic accuracy of cognitive impairment for the reference model was 80.8% and included age, years of education, Aβ1-42, tau, and p-tau181. A model including CSF cystathionine, methionine, S-adenosyl-L-homocysteine (SAH), S-adenosylmethionine (SAM), serine, cysteine, and 5-methyltetrahydrofolate (5-MTHF) improved the diagnostic accuracy to 87.4%.
A second model derived from plasma included cystathionine, glycine, methionine, SAH, SAM, serine, cysteine, and Hcy and reached a diagnostic accuracy of 87.5%. CSF SAH and 5-MTHF were associated with CSF tau and p-tau181. Plasma one-carbon metabolites were able to diagnose subjects with a positive CSF profile of AD pathology in APOE ε4 carriers. We observed significant improvements in the prediction of cognitive impairment by adding one-carbon metabolites. This is partially explained by associations with CSF tau and p-tau181, suggesting a role for one-carbon metabolism in the aggregation of tau and neuronal injury. These metabolites may be particularly critical in APOE ε4 carriers.
A Ranking Approach to Genomic Selection.
Blondel, Mathieu; Onogi, Akio; Iwata, Hiroyoshi; Ueda, Naonori
2015-01-01
Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
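The NDCG measure described above can be illustrated with a short sketch. The abstract does not specify the gain or discount functions, so the version below assumes one common form (linear gain on non-negative true values, logarithmic discount); `dcg_at_k` and `ndcg_at_k` are hypothetical names.

```python
import math

def dcg_at_k(values_in_ranked_order, k):
    """Discounted cumulative gain of the top-k items (linear gain, log2 discount)."""
    return sum(v / math.log2(i + 2)
               for i, v in enumerate(values_in_ranked_order[:k]))

def ndcg_at_k(true_values, predicted_scores, k):
    """NDCG@k: DCG of the predicted ranking divided by the DCG of the ideal ranking.

    true_values      : observed breeding values, assumed non-negative
    predicted_scores : model scores used only to order the individuals
    """
    order = sorted(range(len(true_values)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_values[i] for i in order]
    ideal = sorted(true_values, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Because the discount decays with rank, misplacing the individual with the highest breeding value costs more than swapping two low-ranked individuals. This top-weighted behaviour is not captured by Pearson correlation.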
Dit Fouque, Kevin Jeanne; Moreno, Javier; Hegemann, Julian D; Zirah, Séverine; Rebuffat, Sylvie; Fernandez-Lima, Francisco
2018-04-17
Lasso peptides are a fascinating class of bioactive ribosomal natural products characterized by a mechanically interlocked topology. In contrast to their branched-cyclic forms, lasso peptides have higher stability and have become a scaffold for drug development. However, the identification and separation of lasso peptides from their unthreaded topoisomers (branched-cyclic peptides) is analytically challenging since the higher stability is based solely on differences in their tertiary structures. In the present work, a fast and effective workflow is proposed for the separation and identification of lasso from branched-cyclic peptides based on differences in their mobility space under native nanoelectrospray ionization-trapped ion mobility spectrometry-mass spectrometry (nESI-TIMS-MS). The high mobility resolving power (R) of TIMS resulted in the separation of lasso and branched-cyclic topoisomers (R up to 250, 150 needed on average). The advantages of alkali metalation reagents (e.g., Na, K, and Cs salts) as a way to increase the analytical power of TIMS are demonstrated for topoisomers with similar mobilities as protonated species, efficiently turning the metal ion adduction into additional separation dimensions.
Logsdon, Benjamin A.; Carty, Cara L.; Reiner, Alexander P.; Dai, James Y.; Kooperberg, Charles
2012-01-01
Motivation: For many complex traits, including height, the majority of variants identified by genome-wide association studies (GWAS) have small effects, leaving a significant proportion of the heritable variation unexplained. Although many penalized multiple regression methodologies have been proposed to increase the power to detect associations for complex genetic architectures, they generally lack mechanisms for false-positive control and diagnostics for model over-fitting. Our methodology is the first penalized multiple regression approach that explicitly controls Type I error rates and provides model over-fitting diagnostics through a novel normally distributed statistic defined for every marker within the GWAS, based on results from a variational Bayes spike regression algorithm. Results: We compare the performance of our method to the lasso and single marker analysis on simulated data and demonstrate that our approach has superior performance in terms of power and Type I error control. In addition, using the Women's Health Initiative (WHI) SNP Health Association Resource (SHARe) GWAS of African-Americans, we show that our method has power to detect additional novel associations with body height. These findings replicate by reaching a stringent cutoff of marginal association in a larger cohort. Availability: An R-package, including an implementation of our variational Bayes spike regression (vBsr) algorithm, is available at http://kooperberg.fhcrc.org/soft.html. Contact: blogsdon@fhcrc.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22563072
Learning investment indicators through data extension
NASA Astrophysics Data System (ADS)
Dvořák, Marek
2017-07-01
Stock prices in the form of time series were analysed using univariate and multivariate statistical methods. After simple data preprocessing in the form of logarithmic differences, we augmented this univariate time series to a multivariate representation. This method makes use of sliding windows to calculate several dozen new variables using simple statistical tools such as first and second moments as well as more complicated statistics, such as auto-regression coefficients and residual analysis, followed by an optional quadratic transformation that was further used for data extension. These were used as explanatory variables in a regularized logistic LASSO regression which tried to estimate the Buy-Sell Index (BSI) from real stock market data.
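The data-extension step described in this abstract can be illustrated with a minimal Python sketch. It is not the author's code: the window width, the choice of mean and variance as the only window statistics, and the function names `log_differences`, `window_features`, and `quadratic_extension` are all assumptions made for illustration.

```python
import math

def log_differences(prices):
    """Return log(p[t]) - log(p[t-1]) for each pair of consecutive prices."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

def window_features(series, width):
    """Slide a window of `width` points and emit (mean, variance) per window."""
    rows = []
    for start in range(len(series) - width + 1):
        window = series[start:start + width]
        mean = sum(window) / width
        var = sum((x - mean) ** 2 for x in window) / width  # population variance
        rows.append((mean, var))
    return rows

def quadratic_extension(row):
    """Append all pairwise products (squares included) to a feature row."""
    extra = [row[i] * row[j] for i in range(len(row)) for j in range(i, len(row))]
    return tuple(row) + tuple(extra)

# Hypothetical price series, used only to exercise the functions.
prices = [100.0, 101.0, 99.5, 102.0, 103.5, 103.0]
returns = log_differences(prices)
features = [quadratic_extension(r) for r in window_features(returns, width=3)]
```

Each row of `features` would then serve as one observation of explanatory variables for a penalized logistic regression; the regression itself is omitted here.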
Jaspers, Arne; De Beéck, Tim Op; Brink, Michel S; Frencken, Wouter G P; Staes, Filip; Davis, Jesse J; Helsen, Werner F
2018-05-01
Machine learning may contribute to understanding the relationship between the external load and internal load in professional soccer. Therefore, the relationship between external load indicators (ELIs) and the rating of perceived exertion (RPE) was examined using machine learning techniques on a group and individual level. Training data were collected from 38 professional soccer players over 2 seasons. The external load was measured using global positioning system technology and accelerometry. The internal load was obtained using the RPE. Predictive models were constructed using 2 machine learning techniques, artificial neural networks and least absolute shrinkage and selection operator (LASSO) models, and 1 naive baseline method. The predictions were based on a large set of ELIs. Using each technique, 1 group model involving all players and 1 individual model for each player were constructed. These models' performance on predicting the reported RPE values for future training sessions was compared with the naive baseline's performance. Both the artificial neural network and LASSO models outperformed the baseline. In addition, the LASSO model made more accurate predictions for the RPE than did the artificial neural network model. Furthermore, decelerations were identified as important ELIs. Regardless of the applied machine learning technique, the group models resulted in equivalent or better predictions for the reported RPE values than the individual models. Machine learning techniques may have added value in predicting RPE for future sessions to optimize training design and evaluation. These techniques may also be used in conjunction with expert knowledge to select key ELIs for load monitoring.
NASA Technical Reports Server (NTRS)
Kukreja, Sunil L.; Brenner, Martin J.
2006-01-01
This viewgraph presentation reviews the 1. Motivation for the study 2. Nonlinear Model Form 3. Structure Detection 4. Least Absolute Shrinkage and Selection Operator (LASSO) 5. Objectives 6. Results 7. Assess LASSO as a Structure Detection Tool: Simulated Nonlinear Models 8. Applicability to Complex Systems: F/A-18 Active Aeroelastic Wing Flight Test Data. The authors conclude that 1. this is a novel approach for detecting the structure of highly over-parameterised nonlinear models in situations where other methods may be inadequate, 2. it is of practical significance in the analysis of aircraft dynamics during envelope expansion and could lead to more efficient control strategies, and 3. it could allow greater insight into the functionality of various systems dynamics by providing a quantitative model which is easily interpretable.
Zhang, Jinming; Cavallari, Jennifer M; Fang, Shona C; Weisskopf, Marc G; Lin, Xihong; Mittleman, Murray A; Christiani, David C
2017-01-01
Background Environmental and occupational exposure to metals is ubiquitous worldwide, and understanding the hazardous metal components in this complex mixture is essential for environmental and occupational regulations. Objective To identify hazardous components from metal mixtures that are associated with alterations in cardiac autonomic responses. Methods Urinary concentrations of 16 types of metals were examined and ‘acceleration capacity’ (AC) and ‘deceleration capacity’ (DC), indicators of cardiac autonomic effects, were quantified from ECG recordings among 54 welders. We fitted linear mixed-effects models with least absolute shrinkage and selection operator (LASSO) to identify metal components that are associated with AC and DC. The Bayesian Information Criterion was used as the criterion for model selection procedures. Results Mercury and chromium were selected for DC analysis, whereas mercury, chromium and manganese were selected for AC analysis through the LASSO approach. When we fitted the linear mixed-effects models with ‘selected’ metal components only, the effect of mercury remained significant. Every 1 µg/L increase in urinary mercury was associated with −0.58 ms (−1.03, –0.13) changes in DC and 0.67 ms (0.25, 1.10) changes in AC. Conclusion Our study suggests that exposure to several metals is associated with impaired cardiac autonomic functions. Our findings should be replicated in future studies with larger sample sizes. PMID:28663305
Covariate selection with group lasso and doubly robust estimation of causal effects
Koch, Brandon; Vock, David M.; Wolfson, Julian
2017-01-01
Summary The efficiency of doubly robust estimators of the average causal effect (ACE) of a treatment can be improved by including in the treatment and outcome models only those covariates which are related to both treatment and outcome (i.e., confounders) or related only to the outcome. However, it is often challenging to identify such covariates among the large number that may be measured in a given study. In this paper, we propose GLiDeR (Group Lasso and Doubly Robust Estimation), a novel variable selection technique for identifying confounders and predictors of outcome using an adaptive group lasso approach that simultaneously performs coefficient selection, regularization, and estimation across the treatment and outcome models. The selected variables and corresponding coefficient estimates are used in a standard doubly robust ACE estimator. We provide asymptotic results showing that, for a broad class of data generating mechanisms, GLiDeR yields a consistent estimator of the ACE when either the outcome or treatment model is correctly specified. A comprehensive simulation study shows that GLiDeR is more efficient than doubly robust methods using standard variable selection techniques and has substantial computational advantages over a recently proposed doubly robust Bayesian model averaging method. We illustrate our method by estimating the causal treatment effect of bilateral versus single-lung transplant on forced expiratory volume in one year after transplant using an observational registry. PMID:28636276
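For readers unfamiliar with the "standard doubly robust ACE estimator" that the selected covariates feed into, the following Python sketch shows the usual augmented inverse-probability-weighted form. It is a generic textbook version written under assumptions, not GLiDeR itself: the propensity scores `e` and outcome-model predictions `m1`, `m0` are taken as already fitted, and the function name `aipw_ace` is hypothetical.

```python
def aipw_ace(y, a, e, m1, m0):
    """Doubly robust (AIPW) estimate of the average causal effect.

    y  : observed outcomes
    a  : treatment indicators (1 = treated, 0 = control)
    e  : fitted propensity scores P(A=1 | covariates), strictly in (0, 1)
    m1 : outcome-model predictions under treatment
    m0 : outcome-model predictions under control
    """
    n = len(y)
    # Estimated mean outcome if everyone were treated.
    mu1 = sum(ai * yi / ei - (ai - ei) / ei * m1i
              for yi, ai, ei, m1i in zip(y, a, e, m1)) / n
    # Estimated mean outcome if no one were treated.
    mu0 = sum((1 - ai) * yi / (1 - ei) + (ai - ei) / (1 - ei) * m0i
              for yi, ai, ei, m0i in zip(y, a, e, m0)) / n
    return mu1 - mu0

# Toy data (assumed): the outcome model is exactly right, so the estimate
# recovers the true effect of 2.
ace = aipw_ace(y=[3.0, 1.0, 3.0, 1.0], a=[1, 0, 1, 0],
               e=[0.5, 0.5, 0.5, 0.5],
               m1=[3.0, 3.0, 3.0, 3.0], m0=[1.0, 1.0, 1.0, 1.0])
```

The estimate stays consistent if either the propensity model or the outcome model is correctly specified, which is the property the abstract refers to.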
Covariate selection with group lasso and doubly robust estimation of causal effects.
Koch, Brandon; Vock, David M; Wolfson, Julian
2018-03-01
The efficiency of doubly robust estimators of the average causal effect (ACE) of a treatment can be improved by including in the treatment and outcome models only those covariates which are related to both treatment and outcome (i.e., confounders) or related only to the outcome. However, it is often challenging to identify such covariates among the large number that may be measured in a given study. In this article, we propose GLiDeR (Group Lasso and Doubly Robust Estimation), a novel variable selection technique for identifying confounders and predictors of outcome using an adaptive group lasso approach that simultaneously performs coefficient selection, regularization, and estimation across the treatment and outcome models. The selected variables and corresponding coefficient estimates are used in a standard doubly robust ACE estimator. We provide asymptotic results showing that, for a broad class of data generating mechanisms, GLiDeR yields a consistent estimator of the ACE when either the outcome or treatment model is correctly specified. A comprehensive simulation study shows that GLiDeR is more efficient than doubly robust methods using standard variable selection techniques and has substantial computational advantages over a recently proposed doubly robust Bayesian model averaging method. We illustrate our method by estimating the causal treatment effect of bilateral versus single-lung transplant on forced expiratory volume in one year after transplant using an observational registry. © 2017, The International Biometric Society.
Human action recognition with group lasso regularized-support vector machine
NASA Astrophysics Data System (ADS)
Luo, Huiwu; Lu, Huanzhang; Wu, Yabei; Zhao, Fei
2016-05-01
The bag-of-visual-words (BOVW) and Fisher kernel are two popular models in human action recognition, and support vector machine (SVM) is the most commonly used classifier for the two models. We show two kinds of group structures in the feature representation constructed by BOVW and Fisher kernel, respectively, since the structural information of feature representation can be seen as a prior for the classifier and can improve the performance of the classifier, which has been verified in several areas. However, the standard SVM employs L2-norm regularization in its learning procedure, which penalizes each variable individually and cannot express the structural information of feature representation. We replace the L2-norm regularization with group lasso regularization in standard SVM, and a group lasso regularized-support vector machine (GLRSVM) is proposed. Then, we embed the group structural information of feature representation into GLRSVM. Finally, we introduce an algorithm to solve the optimization problem of GLRSVM by the alternating direction method of multipliers. The experiments evaluated on KTH, YouTube, and Hollywood2 datasets show that our method achieves promising results and improves on the state-of-the-art methods on KTH and YouTube datasets.
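The difference between a per-variable penalty and a group lasso penalty is easiest to see in the group-wise shrinkage step that splitting solvers typically apply. The Python sketch below shows that step (the proximal operator of a sum of group Euclidean norms) in isolation. It is an illustration under assumptions, not the GLRSVM algorithm: the grouping, the threshold `lam`, and the function name are hypothetical.

```python
import math

def group_soft_threshold(w, groups, lam):
    """Shrink each group of weights toward zero by `lam` in Euclidean norm.

    A group whose norm is at most `lam` is set to zero as a whole, which is
    how the group lasso penalty selects or discards entire feature groups.
    """
    out = list(w)
    for idx in groups:
        norm = math.sqrt(sum(w[i] ** 2 for i in idx))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * w[i]
    return out

# Two assumed groups: the first is strong (norm 5), the second weak (norm 0.5).
w = [3.0, 4.0, 0.3, -0.4]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(w, groups, lam=1.0)
```

With these values the first group is rescaled by 0.8 and the second group is removed entirely, whereas a ridge-type penalty would only shrink every coefficient without zeroing any.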
Multiscale Data Assimilation for Large-Eddy Simulations
NASA Astrophysics Data System (ADS)
Li, Z.; Cheng, X.; Gustafson, W. I., Jr.; Xiao, H.; Vogelmann, A. M.; Endo, S.; Toto, T.
2017-12-01
Large-eddy simulation (LES) is a powerful tool for understanding atmospheric turbulence, boundary layer physics and cloud development, and there is a great need for developing data assimilation methodologies that can constrain LES models. The U.S. Department of Energy Atmospheric Radiation Measurement (ARM) User Facility has been developing the capability to routinely generate ensembles of LES. The LES ARM Symbiotic Simulation and Observation (LASSO) project (https://www.arm.gov/capabilities/modeling/lasso) is generating simulations for shallow convection days at the ARM Southern Great Plains site in Oklahoma. One of the major objectives of LASSO is to develop the capability to observationally constrain LES using a hierarchy of ARM observations. We have implemented a multiscale data assimilation (MSDA) scheme, which allows data assimilation to be implemented separately for distinct spatial scales, so that the localized observations can be effectively assimilated to constrain the mesoscale fields in the LES area of about 15 km in width. The MSDA analysis is used to produce forcing data that drive LES. With this LES workflow we have examined 13 days with shallow convection selected from the period May-August 2016. We will describe the implementation of MSDA, present LES results, and address challenges and opportunities for applying data assimilation to LES studies.
Inferring epidemiological parameters from phylogenies using regression-ABC: A comparative study
Gascuel, Olivier
2017-01-01
Inferring epidemiological parameters such as the R0 from time-scaled phylogenies is a timely challenge. Most current approaches rely on likelihood functions, which raise specific issues that range from computing these functions to finding their maxima numerically. Here, we present a new regression-based Approximate Bayesian Computation (ABC) approach, which we base on a large variety of summary statistics intended to capture the information contained in the phylogeny and its corresponding lineage-through-time plot. The regression step involves the Least Absolute Shrinkage and Selection Operator (LASSO) method, which is a robust machine learning technique. It allows us to readily deal with the large number of summary statistics, while avoiding resorting to Markov Chain Monte Carlo (MCMC) techniques. To compare our approach to existing ones, we simulated target trees under a variety of epidemiological models and settings, and inferred parameters of interest using the same priors. We found that, for large phylogenies, the accuracy of our regression-ABC is comparable to that of likelihood-based approaches involving birth-death processes implemented in BEAST2. Our approach even outperformed these when inferring the host population size with a Susceptible-Infected-Removed epidemiological model. It also clearly outperformed a recent kernel-ABC approach when assuming a Susceptible-Infected epidemiological model with two host types. Lastly, by re-analyzing data from the early stages of the recent Ebola epidemic in Sierra Leone, we showed that regression-ABC provides more realistic estimates for the duration parameters (latency and infectiousness) than the likelihood-based method. Overall, ABC based on a large variety of summary statistics and a regression method able to perform variable selection and avoid overfitting is a promising approach to analyze large phylogenies. PMID:28263987
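The LASSO regression step mentioned in this abstract can be sketched in plain Python with cyclic coordinate descent. This covers only the regression, not the full ABC pipeline or the authors' implementation. In the regression-ABC setting, each row of `X` would hold the summary statistics of one simulated phylogeny and `y` the parameter value used to simulate it; the toy data below are assumptions chosen so that one column is informative and the other is not.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n))*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date in place
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Toy data (assumed): y depends only on the first column.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
beta = lasso_cd(X, y, lam=0.1)
```

The coefficient of the uninformative column is set exactly to zero while the informative one is slightly shrunk, which is the variable-selection behaviour that lets the approach handle a large number of summary statistics.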
Novel Harmonic Regularization Approach for Variable Selection in Cox's Proportional Hazards Model
Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan
2014-01-01
Variable selection is an important issue in regression and a number of variable selection methods have been proposed involving nonconvex penalty functions. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in the Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on the artificial datasets and four real microarray gene expression datasets, such as real diffuse large B-cell lymphoma (DLBCL), the lung cancer, and the AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods. PMID:25506389
Complete hazard ranking to analyze right-censored data: An ALS survival study.
Huang, Zhengnan; Zhang, Hongjiu; Boss, Jonathan; Goutman, Stephen A; Mukherjee, Bhramar; Dinov, Ivo D; Guan, Yuanfang
2017-12-01
Survival analysis represents an important outcome measure in clinical research and clinical trials; further, survival ranking may offer additional advantages in clinical trials. In this study, we developed GuanRank, a non-parametric ranking-based technique to transform patients' survival data into a linear space of hazard ranks. The transformation enables the utilization of machine learning base-learners including Gaussian process regression, Lasso, and random forest on survival data. The method was submitted to the DREAM Amyotrophic Lateral Sclerosis (ALS) Stratification Challenge. Ranked first place, the model gave more accurate ranking predictions on the PRO-ACT ALS dataset in comparison to Cox proportional hazard model. By utilizing right-censored data in its training process, the method demonstrated its state-of-the-art predictive power in ALS survival ranking. Its feature selection identified multiple important factors, some of which conflict with previous studies.
Veturi, Yogasudha; Ritchie, Marylyn D
2018-01-01
Transcriptome-wide association studies (TWAS) have recently been employed as an approach that can draw upon the advantages of genome-wide association studies (GWAS) and gene expression studies to identify genes associated with complex traits. Unlike standard GWAS, summary level data suffices for TWAS and offers improved statistical power. Two popular TWAS methods include either (a) imputing the cis genetic component of gene expression from smaller sized studies (using multi-SNP prediction or MP) into much larger effective sample sizes afforded by GWAS - TWAS-MP or (b) using summary-based Mendelian randomization - TWAS-SMR. Although these methods have been effective at detecting functional variants, it remains unclear how extensive variability in the genetic architecture of complex traits and diseases impacts TWAS results. Our goal was to investigate the different scenarios under which these methods yielded enough power to detect significant expression-trait associations. In this study, we conducted extensive simulations based on 6000 randomly chosen, unrelated Caucasian males from Geisinger's MyCode population to compare the power to detect cis expression-trait associations (within 500 kb of a gene) using the above-described approaches. To test TWAS across varying genetic backgrounds we simulated gene expression and phenotype using different quantitative trait loci per gene and cis-expression /trait heritability under genetic models that differentiate the effect of causality from that of pleiotropy. For each gene, on a training set ranging from 100 to 1000 individuals, we either (a) estimated regression coefficients with gene expression as the response using five different methods: LASSO, elastic net, Bayesian LASSO, Bayesian spike-slab, and Bayesian ridge regression or (b) performed eQTL analysis. 
We then sampled with replacement 50,000, 150,000, and 300,000 individuals respectively from the testing set of the remaining 5000 individuals and conducted GWAS on each set. Subsequently, we integrated the GWAS summary statistics derived from the testing set with the weights (or eQTLs) derived from the training set to identify expression-trait associations using (a) TWAS-MP (b) TWAS-SMR (c) eQTL-based GWAS, or (d) standalone GWAS. Finally, we examined the power to detect functionally relevant genes using the different approaches under the considered simulation scenarios. In general, we observed great similarities among TWAS-MP methods although the Bayesian methods resulted in improved power in comparison to LASSO and elastic net as the trait architecture grew more complex while training sample sizes and expression heritability remained small. Finally, we observed high power under causality but very low to moderate power under pleiotropy.
Data Shared Lasso: A Novel Tool to Discover Uplift.
Gross, Samuel M; Tibshirani, Robert
2016-09-01
A model is presented for the supervised learning problem where the observations come from a fixed number of pre-specified groups, and the regression coefficients may vary sparsely between groups. The model spans the continuum between individual models for each group and one model for all groups. The resulting algorithm is designed with a high dimensional framework in mind. The approach is applied to a sentiment analysis dataset to show its efficacy and interpretability. One particularly useful application is for finding sub-populations in a randomized trial for which an intervention (treatment) is beneficial, often called the uplift problem. Some new concepts are introduced that are useful for uplift analysis. The value is demonstrated in an application to a real world credit card promotion dataset. In this example, although sending the promotion has a very small average effect, by targeting a particular subgroup with the promotion one can obtain a 15% increase in the proportion of people who purchase the new credit card.
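One way to picture coefficients that "vary sparsely between groups" is to give every observation a shared coefficient vector plus a group-specific offset, and to fit both with an ordinary LASSO on an augmented design matrix. The Python sketch below builds such a matrix. This formulation, the scaling parameter `r`, and the function name `augment_row` are assumptions made for illustration and are not taken from the abstract.

```python
def augment_row(x, group, n_groups, r=1.0):
    """Return [x | 0 ... x/r ... 0]: shared block first, then one block per group.

    Only the block belonging to `group` is filled. Dividing by `r` makes the
    effective L1 penalty on that group's offsets r times the shared penalty.
    """
    p = len(x)
    row = list(x) + [0.0] * (p * n_groups)
    start = p * (1 + group)
    for j, v in enumerate(x):
        row[start + j] = v / r
    return row

# Three observations with two features each, from two assumed groups.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
group_of = [0, 1, 0]
Z = [augment_row(x, g, n_groups=2) for x, g in zip(X, group_of)]
```

Under this reading, a large `r` penalizes the offsets heavily and pushes the fit toward one model for all groups, while a small `r` lets each group approach its own model, which matches the continuum the abstract describes.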
Data Shared Lasso: A Novel Tool to Discover Uplift
Gross, Samuel M.; Tibshirani, Robert
2017-01-01
A model is presented for the supervised learning problem where the observations come from a fixed number of pre-specified groups, and the regression coefficients may vary sparsely between groups. The model spans the continuum between individual models for each group and one model for all groups. The resulting algorithm is designed with a high dimensional framework in mind. The approach is applied to a sentiment analysis dataset to show its efficacy and interpretability. One particularly useful application is for finding sub-populations in a randomized trial for which an intervention (treatment) is beneficial, often called the uplift problem. Some new concepts are introduced that are useful for uplift analysis. The value is demonstrated in an application to a real world credit card promotion dataset. In this example, although sending the promotion has a very small average effect, by targeting a particular subgroup with the promotion one can obtain a 15% increase in the proportion of people who purchase the new credit card. PMID:29056802
Ballert, C; Oberhauser, C; Biering-Sørensen, F; Stucki, G; Cieza, A
2012-10-01
Psychometric study analyzing the data of a cross-sectional, multicentric study with 1048 persons with spinal cord injury (SCI). To shed light on how to apply the Brief Core Sets for SCI of the International Classification of Functioning, Disability and Health (ICF) by determining whether the ICF categories contained in the Core Sets capture differences in overall health. Lasso regression was applied using overall health, rated by the patients and health professionals, as dependent variables and the ICF categories of the Comprehensive ICF Core Sets for SCI as independent variables. The ICF categories that best capture differences in overall health refer to areas of life such as self-care, relationships, economic self-sufficiency and community life. Only about 25% of the ICF categories of the Brief ICF Core Sets for the early post-acute and for long-term contexts were selected in the Lasso regression and differentiate, therefore, among levels of overall health. ICF categories such as d570 Looking after one's health, d870 Economic self-sufficiency, d620 Acquisition of goods and services and d910 Community life, which capture changes in overall health in patients with SCI, should be considered in addition to those of the Brief ICF Core Sets in clinical and epidemiological studies in persons with SCI.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ma Yingliang; Housden, R. James; Razavi, Reza
2013-07-15
Purpose: X-ray fluoroscopically guided cardiac electrophysiology (EP) procedures are commonly carried out to treat patients with arrhythmias. X-ray images have poor soft tissue contrast and, for this reason, overlay of a three-dimensional (3D) roadmap derived from preprocedural volumetric images can be used to add anatomical information. It is useful to know the position of the catheter electrodes relative to the cardiac anatomy, for example, to record ablation therapy locations during atrial fibrillation therapy. Also, the electrode positions of the coronary sinus (CS) catheter or lasso catheter can be used for road map motion correction. Methods: In this paper, the authors present a novel unified computational framework for image-based catheter detection and tracking without any user interaction. The proposed framework includes fast blob detection, shape-constrained searching and model-based detection. In addition, catheter tracking methods were designed based on the customized catheter models input from the detection method. Three real-time detection and tracking methods are derived from the computational framework to detect or track the three most common types of catheters in EP procedures: the ablation catheter, the CS catheter, and the lasso catheter. Since the proposed methods use the same blob detection method to extract key information from x-ray images, the ablation, CS, and lasso catheters can be detected and tracked simultaneously in real-time. Results: The catheter detection methods were tested on 105 different clinical fluoroscopy sequences taken from 31 clinical procedures. Two-dimensional (2D) detection errors of 0.50 ± 0.29, 0.92 ± 0.61, and 0.63 ± 0.45 mm as well as success rates of 99.4%, 97.2%, and 88.9% were achieved for the CS catheter, ablation catheter, and lasso catheter, respectively. With the tracking method, accuracies were increased to 0.45 ± 0.28, 0.64 ± 0.37, and 0.53 ± 0.38 mm and success rates increased to 100%, 99.2%, and 96.5% for the CS, ablation, and lasso catheters, respectively. Subjective clinical evaluation by three experienced electrophysiologists showed that the detection and tracking results were clinically acceptable. Conclusions: The proposed detection and tracking methods are automatic and can detect and track CS, ablation, and lasso catheters simultaneously and in real-time. The accuracy of the proposed methods is sub-mm and the methods are robust toward low-dose x-ray fluoroscopic images, which are mainly used during EP procedures to maintain low radiation dose.
Stabilizing l1-norm prediction models by supervised feature grouping.
Kamkar, Iman; Gupta, Sunil Kumar; Phung, Dinh; Venkatesh, Svetha
2016-02-01
Emerging Electronic Medical Records (EMRs) have transformed modern healthcare. These records have great potential to be used for building clinical prediction models. However, a problem in using them is their high dimensionality. Since a lot of information may not be relevant for prediction, the underlying complexity of the prediction models may not be high. A popular way to deal with this problem is to employ feature selection. Lasso and l1-norm based feature selection methods have shown promising results. However, in the presence of correlated features, these methods select features that change considerably with small changes in data. This prevents clinicians from obtaining a stable feature set, which is crucial for clinical decision making. Grouping correlated variables together can improve the stability of feature selection; however, such grouping is usually not known and needs to be estimated for optimal performance. Addressing this problem, we propose a new model that can simultaneously learn the grouping of correlated features and perform stable feature selection. We formulate the model as a constrained optimization problem and provide an efficient solution with guaranteed convergence. Our experiments with both synthetic and real-world datasets show that the proposed model is significantly more stable than Lasso and many existing state-of-the-art shrinkage and classification methods. We further show that in terms of prediction performance, the proposed method consistently outperforms Lasso and other baselines. Our model can be used for selecting stable risk factors for a variety of healthcare problems, so it can assist clinicians toward accurate decision making. Copyright © 2015 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Brun, Pierre-Thomas
2014-03-01
Trick roping evolved from humble origins as a cattle-catching tool into a sport that delights audiences the world over with its complex patterns or "tricks," such as the Merry-Go-Round, the Wedding-Ring, the Spoke-Jumping, the Texas Skip... Its implement is the lasso, a length of rope with a small loop ("honda") at one end through which the other end is passed to form a large loop. Here, we study the physics of the simplest rope trick, the Flat Loop, in which the motion of the lasso is forced by a uniform circular motion of the cowboy's/cowgirl's hand in a horizontal plane. To avoid accumulating twist in the rope, the cowboy/cowgirl rolls it between his/her thumb and forefinger while spinning it. The configuration of the rope is stationary in a reference frame that rotates with the hand. Exploiting this fact we derive a dynamical "string" model in which line tension is balanced by the centrifugal force and the rope's weight. Using a numerical continuation method, we calculate the steady shapes of a lasso with a fixed honda, examine their stability, and determine a bifurcation diagram exhibiting coat-hanger shapes and whirling modes in addition to flat loops. We then extend the model to a honda with finite sliding friction by using matched asymptotic expansions to determine the structure of the boundary layer where bending forces are significant, thereby obtaining a macroscopic criterion for frictional sliding of the honda. We compare our theoretical results with high-speed videos of a professional trick roper and experiments performed using a laboratory "robo-cowboy." Finally, we conclude with practical guidance on how to spin a lasso in the air based on the results of our analysis. With the support of Univ. Paris Sud (Lab. FAST/CNRS) and UPMC (d'Alembert/CNRS).
NASA Astrophysics Data System (ADS)
Vogelmann, A. M.; Zhang, D.; Kollias, P.; Endo, S.; Lamer, K.; Gustafson, W. I., Jr.; Romps, D. M.
2017-12-01
Continental boundary layer clouds are important to simulations of weather and climate because of their impact on surface budgets and vertical transports of energy and moisture; however, model-parameterized boundary layer clouds do not agree well with observations in part because small-scale turbulence and convection are not properly represented. To advance parameterization development and evaluation, observational constraints are needed on critical parameters such as cloud-base mass flux and its relationship to cloud cover and the sub-cloud boundary layer structure including vertical velocity variance and skewness. In this study, these constraints are derived from Doppler lidar observations and ensemble large-eddy simulations (LES) from the U.S. Department of Energy Atmospheric Radiation Measurement (ARM) Facility Southern Great Plains (SGP) site in Oklahoma. The Doppler lidar analysis will extend the single-site, long-term analysis of Lamer and Kollias [2015] and augment this information with the short-term but unique 1-2 year period since five Doppler lidars began operation at the SGP, providing critical information on regional variability. These observations will be compared to the statistics obtained from ensemble, routine LES conducted by the LES ARM Symbiotic Simulation and Observation (LASSO) project (https://www.arm.gov/capabilities/modeling/lasso). An Observation System Simulation Experiment (OSSE) will be presented that uses the LASSO LES fields to determine criteria for which relationships from Doppler lidar observations are adequately sampled to yield convergence. Any systematic differences between the observed and simulated relationships will be examined to understand factors contributing to the differences. Lamer, K., and P. Kollias (2015), Observations of fair-weather cumuli over land: Dynamical factors controlling cloud size and cover, Geophys. Res. Lett., 42, 8693-8701, doi:10.1002/2015GL064534
Charter for the ARM Atmospheric Modeling Advisory Group
DOE Office of Scientific and Technical Information (OSTI.GOV)
Advisory Group, ARM Atmospheric Modeling
The Atmospheric Modeling Advisory Group of the U.S. Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Climate Research Facility is guided by the following: 1. The group will provide feedback on the overall project plan including input on how to address priorities and trade-offs in the modeling and analysis workflow, making sure the modeling follows general best practices, and reviewing the recommendations provided to ARM for the workflow implementation. 2. The group will consist of approximately 6 members plus the PI and co-PI of the Large-Eddy Simulation (LES) ARM Symbiotic Simulation and Observation (LASSO) pilot project. The ARM Technical Director, or his designee, serves as an ex-officio member. This size is chosen based on the ability to efficiently conduct teleconferences and to span the general needs for input to the LASSO pilot project.
RRegrs: an R package for computer-aided model selection with multiple regression models.
Tsiliki, Georgia; Munteanu, Cristian R; Seoane, Jose A; Fernandez-Lozano, Carlos; Sarimveis, Haralambos; Willighagen, Egon L
2015-01-01
Predictive regression models can be created with many different modelling approaches. Choices need to be made for data set splitting, cross-validation methods, specific regression parameters and best model criteria, as they all affect the accuracy and efficiency of the produced predictive models, and therefore, raising model reproducibility and comparison issues. Cheminformatics and bioinformatics are extensively using predictive modelling and exhibit a need for standardization of these methodologies in order to assist model selection and speed up the process of predictive model development. A tool accessible to all users, irrespectively of their statistical knowledge, would be valuable if it tests several simple and complex regression models and validation schemes, produce unified reports, and offer the option to be integrated into more extensive studies. Additionally, such methodology should be implemented as a free programming package, in order to be continuously adapted and redistributed by others. We propose an integrated framework for creating multiple regression models, called RRegrs. The tool offers the option of ten simple and complex regression methods combined with repeated 10-fold and leave-one-out cross-validation. Methods include Multiple Linear regression, Generalized Linear Model with Stepwise Feature Selection, Partial Least Squares regression, Lasso regression, and Support Vector Machines Recursive Feature Elimination. The new framework is an automated fully validated procedure which produces standardized reports to quickly oversee the impact of choices in modelling algorithms and assess the model and cross-validation results. The methodology was implemented as an open source R package, available at https://www.github.com/enanomapper/RRegrs, by reusing and extending on the caret package. The universality of the new methodology is demonstrated using five standard data sets from different scientific fields. 
Its efficiency in cheminformatics and QSAR modelling is shown with three use cases: proteomics data for surface-modified gold nanoparticles, nano-metal oxides descriptor data, and molecular descriptors for acute aquatic toxicity data. The results show that for all data sets RRegrs reports models with equal or better performance for both training and test sets than those reported in the original publications. Its good performance as well as its adaptability in terms of parameter optimization could make RRegrs a popular framework to assist the initial exploration of predictive models, and with that, the design of more comprehensive in silico screening applications. Graphical abstract: RRegrs is a computer-aided model selection framework for R multiple regression models; this is a fully validated procedure with application to QSAR modelling.
Reconstructing (super)trees from data sets with missing distances: not all is lost.
Kettleborough, George; Dicks, Jo; Roberts, Ian N; Huber, Katharina T
2015-06-01
The wealth of phylogenetic information accumulated over many decades of biological research, coupled with recent technological advances in molecular sequence generation, presents significant opportunities for researchers to investigate relationships across and within the kingdoms of life. However, to make best use of this data wealth, several problems must first be overcome. One key problem is finding effective strategies to deal with missing data. Here, we introduce Lasso, a novel heuristic approach for reconstructing rooted phylogenetic trees from distance matrices with missing values, for data sets where a molecular clock may be assumed. Contrary to other phylogenetic methods on partial data sets, Lasso possesses desirable properties such as its reconstructed trees being both unique and edge-weighted. These properties are achieved by Lasso restricting its leaf set to a large subset of all possible taxa, which in many practical situations is the entire taxa set. Furthermore, the Lasso approach is distance-based, rendering it very fast to run and suitable for data sets of all sizes, including large data sets such as those generated by modern Next Generation Sequencing technologies. To better understand the performance of Lasso, we assessed it by means of artificial and real biological data sets, showing its effectiveness in the presence of missing data. Furthermore, by formulating the supermatrix problem as a particular case of the missing data problem, we assessed Lasso's ability to reconstruct supertrees. We demonstrate that, although not specifically designed for such a purpose, Lasso performs better than or comparably with five leading supertree algorithms on a challenging biological data set. Finally, we make freely available a software implementation of Lasso so that researchers may, for the first time, perform both rooted tree and supertree reconstruction with branch lengths on their own partial data sets. © The Author 2015. 
Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Belilovsky, Eugene; Gkirtzou, Katerina; Misyrlis, Michail; Konova, Anna B; Honorio, Jean; Alia-Klein, Nelly; Goldstein, Rita Z; Samaras, Dimitris; Blaschko, Matthew B
2015-12-01
We explore various sparse regularization techniques for analyzing fMRI data, such as the ℓ1 norm (often called LASSO in the context of a squared loss function), elastic net, and the recently introduced k-support norm. Employing sparsity regularization allows us to handle the curse of dimensionality, a problem commonly found in fMRI analysis. In this work we consider sparse regularization in both the regression and classification settings. We perform experiments on fMRI scans from cocaine-addicted as well as healthy control subjects. We show that in many cases, use of the k-support norm leads to better predictive performance, solution stability, and interpretability as compared to other standard approaches. We additionally analyze the advantages of using the absolute loss function versus the standard squared loss which leads to significantly better predictive performance for the regularization methods tested in almost all cases. Our results support the use of the k-support norm for fMRI analysis and on the clinical side, the generalizability of the I-RISA model of cocaine addiction. Copyright © 2015 Elsevier Ltd. All rights reserved.
Liu, Hongcheng; Yao, Tao; Li, Runze; Ye, Yinyu
2017-11-01
This paper concerns the folded concave penalized sparse linear regression (FCPSLR), a class of popular sparse recovery methods. Although FCPSLR yields desirable recovery performance when solved globally, computing a global solution is NP-complete. Despite some existing statistical performance analyses on local minimizers or on specific FCPSLR-based learning algorithms, it remains an open question whether local solutions that are known to admit fully polynomial-time approximation schemes (FPTAS) may already be sufficient to ensure the statistical performance, and whether that statistical performance can be non-contingent on the specific designs of computing procedures. To address these questions, this paper presents the following threefold results: (i) Any local solution (stationary point) is a sparse estimator, under some conditions on the parameters of the folded concave penalties. (ii) Perhaps more importantly, any local solution satisfying a significant subspace second-order necessary condition (S³ONC), which is weaker than the second-order KKT condition, yields a bounded error in approximating the true parameter with high probability. In addition, if the minimal signal strength is sufficient, the S³ONC solution likely recovers the oracle solution. This result also explicates that the goal of improving the statistical performance is consistent with the optimization criteria of minimizing the suboptimality gap in solving the non-convex programming formulation of FCPSLR. (iii) We apply (ii) to the special case of FCPSLR with minimax concave penalty (MCP) and show that under the restricted eigenvalue condition, any S³ONC solution with a better objective value than the Lasso solution entails the strong oracle property. In addition, such a solution generates a model error (ME) comparable to the optimal but exponential-time sparse estimator given a sufficient sample size, while the worst-case ME is comparable to the Lasso in general. 
Furthermore, to guarantee the S³ONC admits FPTAS.
Wu, Kai; Liu, Jing; Wang, Shuai
2016-01-01
Evolutionary games (EG) model a common type of interactions in various complex, networked, natural and social systems. Given such a system with only profit sequences being available, reconstructing the interacting structure of EG networks is fundamental to understand and control its collective dynamics. Existing approaches used to handle this problem, such as the lasso, a convex optimization method, need a user-defined constant to control the tradeoff between the natural sparsity of networks and measurement error (the difference between observed data and simulated data). However, a shortcoming of these approaches is that it is not easy to determine these key parameters which can maximize the performance. In contrast to these approaches, we first model the EG network reconstruction problem as a multiobjective optimization problem (MOP), and then develop a framework which involves multiobjective evolutionary algorithm (MOEA), followed by solution selection based on knee regions, termed as MOEANet, to solve this MOP. We also design an effective initialization operator based on the lasso for MOEA. We apply the proposed method to reconstruct various types of synthetic and real-world networks, and the results show that our approach is effective to avoid the above parameter selecting problem and can reconstruct EG networks with high accuracy. PMID:27886244
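The "user-defined constant" mentioned above can be made concrete with the generic lasso objective. The block below is a textbook form given for orientation only; the symbols $y$ (observed data), $\Phi$ (design matrix built from the game dynamics), $x$ (unknown connection weights) and $\lambda$ are assumptions for illustration and are not taken from the paper's own formulation.

```latex
\hat{x} \;=\; \arg\min_{x} \;
  \underbrace{\tfrac{1}{2}\,\lVert y - \Phi x \rVert_2^2}_{\text{measurement error}}
  \;+\; \lambda \underbrace{\lVert x \rVert_1}_{\text{sparsity}},
  \qquad \lambda \ge 0 .
```

A larger $\lambda$ favours sparser networks at the cost of a larger residual. A multiobjective treatment, as described in the abstract, instead keeps the two terms as separate objectives and picks a solution from the resulting trade-off front.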
NASA Astrophysics Data System (ADS)
Wu, Kai; Liu, Jing; Wang, Shuai
2016-11-01
Evolutionary games (EG) model a common type of interactions in various complex, networked, natural and social systems. Given such a system with only profit sequences being available, reconstructing the interacting structure of EG networks is fundamental to understand and control its collective dynamics. Existing approaches used to handle this problem, such as the lasso, a convex optimization method, need a user-defined constant to control the tradeoff between the natural sparsity of networks and measurement error (the difference between observed data and simulated data). However, a shortcoming of these approaches is that it is not easy to determine these key parameters which can maximize the performance. In contrast to these approaches, we first model the EG network reconstruction problem as a multiobjective optimization problem (MOP), and then develop a framework which involves multiobjective evolutionary algorithm (MOEA), followed by solution selection based on knee regions, termed as MOEANet, to solve this MOP. We also design an effective initialization operator based on the lasso for MOEA. We apply the proposed method to reconstruct various types of synthetic and real-world networks, and the results show that our approach is effective to avoid the above parameter selecting problem and can reconstruct EG networks with high accuracy.
Linear Modeling and Evaluation of Controls on Flow Response in Western Post-Fire Watersheds
NASA Astrophysics Data System (ADS)
Saxe, S.; Hogue, T. S.; Hay, L.
2015-12-01
This research investigates the impact of wildfires on watershed flow regimes throughout the western United States, specifically focusing on evaluation of fire events within specified subregions and determination of the impact of climate and geophysical variables in post-fire flow response. Fire events were collected through federal and state-level databases and streamflow data were collected from U.S. Geological Survey stream gages. 263 watersheds were identified with at least 10 years of continuous pre-fire daily streamflow records and 5 years of continuous post-fire daily flow records. For each watershed, percent changes in runoff ratio (RO), annual seven day low-flows (7Q2) and annual seven day high-flows (7Q10) were calculated from pre- to post-fire. Numerous independent variables were identified for each watershed and fire event, including topographic, land cover, climate, burn severity, and soils data. The national watersheds were divided into five regions through K-clustering and a lasso linear regression model, applying the Leave-One-Out calibration method, was calculated for each region. Nash-Sutcliffe Efficiency (NSE) was used to determine the accuracy of the resulting models. The regions encompassing the United States along and west of the Rocky Mountains, excluding the coastal watersheds, produced the most accurate linear models. The Pacific coast region models produced poor and inconsistent results, indicating that the regions need to be further subdivided. Presently, RO and HF response variables appear to be more easily modeled than LF. Results of linear regression modeling showed varying importance of watershed and fire event variables, with conflicting correlation between land cover types and soil types by region. The addition of further independent variables and constriction of current variables based on correlation indicators is ongoing and should allow for more accurate linear regression modeling.
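The evaluation loop described above (leave-one-out fitting scored by Nash-Sutcliffe Efficiency) can be sketched as follows. This is a minimal illustration, not the authors' workflow: the single-predictor lasso, the penalty value `lam`, and the toy data are all assumptions chosen so that the example stays dependency-free.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_1d(xs, ys, lam=0.1):
    """One-predictor lasso with an unpenalised intercept (illustrative stand-in).

    Minimises (1/(2n)) * sum((y - b0 - b1*x)^2) + lam * |b1|.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    slope = soft_threshold(cov, lam) / var
    return (mean_y - slope * mean_x, slope)


def predict_1d(model, x):
    intercept, slope = model
    return intercept + slope * x


def leave_one_out_predictions(xs, ys, fit, predict):
    """Predict each sample from a model fitted on all the other samples."""
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(predict(model, xs[i]))
    return preds


def nash_sutcliffe(observed, predicted):
    """NSE = 1 - SSE / (sum of squared deviations of observations from their mean)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical data: one watershed attribute vs. a flow-response metric.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
preds = leave_one_out_predictions(xs, ys, fit_lasso_1d, predict_1d)
nse = nash_sutcliffe(ys, preds)
```

An NSE of 1 indicates a perfect match, while an NSE of 0 means the model does no better than predicting the observed mean.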
Prediction-Oriented Marker Selection (PROMISE): With Application to High-Dimensional Regression.
Kim, Soyeon; Baladandayuthapani, Veerabhadran; Lee, J Jack
2017-06-01
In personalized medicine, biomarkers are used to select therapies with the highest likelihood of success based on an individual patient's biomarker/genomic profile. Two goals are to choose important biomarkers that accurately predict treatment outcomes and to cull unimportant biomarkers to reduce the cost of biological and clinical verifications. These goals are challenging due to the high dimensionality of genomic data. Variable selection methods based on penalized regression (e.g., the lasso and elastic net) have yielded promising results. However, selecting the right amount of penalization is critical to simultaneously achieving these two goals. Standard approaches based on cross-validation (CV) typically provide high prediction accuracy with high true positive rates but at the cost of too many false positives. Alternatively, stability selection (SS) controls the number of false positives, but at the cost of yielding too few true positives. To circumvent these issues, we propose prediction-oriented marker selection (PROMISE), which combines SS with CV to conflate the advantages of both methods. Our application of PROMISE with the lasso and elastic net in data analysis shows that, compared to CV, PROMISE produces sparse solutions, few false positives, and small type I + type II error, and maintains good prediction accuracy, with a marginal decrease in the true positive rates. Compared to SS, PROMISE offers better prediction accuracy and true positive rates. In summary, PROMISE can be applied in many fields to select regularization parameters when the goals are to minimize false positives and maximize prediction accuracy.
Wang, Zhu; Shuangge, Ma; Wang, Ching-Yun
2017-01-01
In health services and outcome research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the maximum likelihood function plus a penalty including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties including the standard error formulae are provided. A simulation study shows that the new algorithm not only has more accurate or at least comparable estimation, but also is more robust than the traditional stepwise variable selection. The proposed methods are applied to analyze the health care demand in Germany using an open-source R package mpath. PMID:26059498
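For readers comparing the three penalties named above, the following sketch evaluates each one for a single coefficient. The formulas are the standard textbook definitions; the default shape parameters (`a=3.7` for SCAD, `gamma=3.0` for MCP) are common conventions assumed here and are not taken from the paper or from the mpath package.

```python
def lasso_penalty(t, lam):
    """L1 penalty: grows linearly with |t| without bound."""
    return lam * abs(t)


def scad_penalty(t, lam, a=3.7):
    """SCAD: linear near zero, quadratic transition, then constant."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(t, lam, gamma=3.0):
    """MCP: starts at the lasso rate and flattens out at |t| = gamma * lam."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2 * gamma)
    return gamma * lam * lam / 2


# Compare the penalties on a small and a large coefficient (lam = 1).
small = (lasso_penalty(0.5, 1.0), scad_penalty(0.5, 1.0), mcp_penalty(0.5, 1.0))
large = (lasso_penalty(10.0, 1.0), scad_penalty(10.0, 1.0), mcp_penalty(10.0, 1.0))
```

For small coefficients all three behave similarly. For large coefficients SCAD and MCP level off, which is why they shrink strong signals less than the lasso does.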
Wan, Jian; Chen, Yi-Chieh; Morris, A Julian; Thennadil, Suresh N
2017-07-01
Near-infrared (NIR) spectroscopy is widely used in fields ranging from pharmaceutics to the food industry for analyzing chemical and physical properties of the substances concerned. Its advantages over other analytical techniques include available physical interpretation of spectral data, nondestructive nature and high speed of measurements, and little or no need for sample preparation. The successful application of NIR spectroscopy relies on three main aspects: pre-processing of spectral data to eliminate nonlinear variations due to temperature, light scattering effects and many others; selection of those wavelengths that contribute useful information; and identification of suitable calibration models using linear/nonlinear regression. Several methods have been developed for each of these three aspects and many comparative studies of different methods exist for an individual aspect or some combinations. However, there is still a lack of comparative studies for the interactions among these three aspects, which can shed light on what role each aspect plays in the calibration and how to combine various methods of each aspect together to obtain the best calibration model. This paper aims to provide such a comparative study based on four benchmark data sets using three typical pre-processing methods, namely, orthogonal signal correction (OSC), extended multiplicative signal correction (EMSC) and optical path-length estimation and correction (OPLEC); two existing wavelength selection methods, namely, stepwise forward selection (SFS) and genetic algorithm optimization combined with partial least squares regression for spectral data (GAPLSSP); and four popular regression methods, namely, partial least squares (PLS), least absolute shrinkage and selection operator (LASSO), least squares support vector machine (LS-SVM), and Gaussian process regression (GPR). 
The comparative study indicates that, in general, pre-processing of spectral data can play a significant role in the calibration while wavelength selection plays a marginal role and the combination of certain pre-processing, wavelength selection, and nonlinear regression methods can achieve superior performance over traditional linear regression-based calibration.
Optimal Sparse Upstream Sensor Placement for Hydrokinetic Turbines
NASA Astrophysics Data System (ADS)
Cavagnaro, Robert; Strom, Benjamin; Ross, Hannah; Hill, Craig; Polagye, Brian
2016-11-01
Accurate measurement of the flow field incident upon a hydrokinetic turbine is critical for performance evaluation during testing and setting boundary conditions in simulation. Additionally, turbine controllers may leverage real-time flow measurements. Particle image velocimetry (PIV) is capable of rendering a flow field over a wide spatial domain in a controlled, laboratory environment. However, PIV's lack of suitability for natural marine environments, high cost, and intensive post-processing diminish its potential for control applications. Conversely, sensors such as acoustic Doppler velocimeters (ADVs) are designed for field deployment and real-time measurement, but over a small spatial domain. Sparsity-promoting regression analysis such as LASSO is utilized to improve the efficacy of point measurements for real-time applications by determining optimal spatial placement for a small number of ADVs using a training set of PIV velocity fields and turbine data. The study is conducted in a flume (0.8 m² cross-sectional area, 1 m/s flow) with laboratory-scale axial and cross-flow turbines. Predicted turbine performance utilizing the optimal sparse sensor network and associated regression model is compared to actual performance with corresponding PIV measurements.
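The idea of using a sparsity-promoting regression to pick sensor locations can be illustrated with a small cyclic coordinate-descent lasso. This is a sketch under stated assumptions, not the study's method: each column of `X` stands for the velocity at one hypothetical candidate location across training snapshots, `y` stands for a turbine performance measure, and the data are synthetic. Locations whose coefficients remain nonzero are the ones the model would keep.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlate column j with the residual that excludes j's own contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Synthetic training set: 8 snapshots, 4 candidate locations (orthogonal columns).
c0 = [1, 1, 1, 1, -1, -1, -1, -1]
c1 = [1, 1, -1, -1, 1, 1, -1, -1]
c2 = [1, -1, 1, -1, 1, -1, 1, -1]
c3 = [1, -1, -1, 1, 1, -1, -1, 1]
X = [list(row) for row in zip(c0, c1, c2, c3)]
# The response depends only on locations 0 and 2.
y = [2.0 * a - 1.5 * b for a, b in zip(c0, c2)]

beta = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

With this design, only the two informative locations survive the penalty, and their coefficients are shrunk toward zero by roughly `lam`, which is the usual lasso bias.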
Progress of the LASSO experiment
NASA Technical Reports Server (NTRS)
Serene, B. E. H.
1981-01-01
The LASSO (Laser Synchronisation from Stationary Orbit) experiment is described. It is designed to demonstrate, for the first time, the feasibility of using laser techniques to synchronize remote atomic clocks with an accuracy of one nanosecond or better. The experiment uses ground-based laser stations and the SIRIO-2 geostationary satellite, to be launched towards the end of 1981. The qualification of the LASSO on-board equipment is discussed, with a brief description of the electrical and optical test equipment used. The progress of the operational organization is included.
NASA Astrophysics Data System (ADS)
Vogelmann, A. M.; Gustafson, W. I., Jr.; Toto, T.; Endo, S.; Cheng, X.; Li, Z.; Xiao, H.
2015-12-01
The Department of Energy's Atmospheric Radiation Measurement (ARM) Climate Research Facilities' Large-Eddy Simulation (LES) ARM Symbiotic Simulation and Observation (LASSO) Workflow is currently being designed to provide output from routine LES to complement its extensive observations. The modeling portion of the LASSO workflow is presented by Gustafson et al., which will initially focus on shallow convection over the ARM megasite in Oklahoma, USA. This presentation describes how the LES output will be combined with observations to construct multi-dimensional and dynamically consistent "data cubes", aimed at providing the best description of the atmospheric state for use in analyses by the community. The megasite observations are used to constrain large-eddy simulations that provide a complete spatial and temporal coverage of observables and, further, the simulations also provide information on processes that cannot be observed. Statistical comparisons of model output with their observables are used to assess the quality of a given simulated realization and its associated uncertainties. A data cube is a model-observation package that provides: (1) metrics of model-observation statistical summaries to assess the simulations and the ensemble spread; (2) statistical summaries of additional model property output that cannot be or are very difficult to observe; and (3) snapshots of the 4-D simulated fields from the integration period. Searchable metrics are provided that characterize the general atmospheric state to assist users in finding cases of interest, such as categorization of daily weather conditions and their specific attributes. The data cubes will be accompanied by tools designed for easy access to cube contents from within the ARM archive and externally, the ability to compare multiple data streams within an event as well as across events, and the ability to use common grids and time sampling, where appropriate.
A Hierarchical Poisson Log-Normal Model for Network Inference from RNA Sequencing Data
Gallopin, Mélina; Rau, Andrea; Jaffrézic, Florence
2013-01-01
Gene network inference from transcriptomic data is an important methodological challenge and a key aspect of systems biology. Although several methods have been proposed to infer networks from microarray data, there is a need for inference methods able to model RNA-seq data, which are count-based and highly variable. In this work we propose a hierarchical Poisson log-normal model with a Lasso penalty to infer gene networks from RNA-seq data; this model has the advantage of directly modelling discrete data and accounting for inter-sample variance larger than the sample mean. Using real microRNA-seq data from breast cancer tumors and simulations, we compare this method to a regularized Gaussian graphical model on log-transformed data, and a Poisson log-linear graphical model with a Lasso penalty on power-transformed data. For data simulated with large inter-sample dispersion, the proposed model performs better than the other methods in terms of sensitivity, specificity and area under the ROC curve. These results show the necessity of methods specifically designed for gene network inference from RNA-seq data. PMID:24147011
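The claim that a Poisson log-normal model accommodates variance larger than the mean can be checked with a short derivation. The block below treats a single gene count in a simplified univariate setting; the symbols $Y$, $Z$, $\mu$ and $\sigma^2$ are assumptions introduced for illustration and do not reproduce the paper's full hierarchical specification.

```latex
% Simplified univariate Poisson log-normal model (illustrative assumption):
%   Z \sim \mathcal{N}(\mu, \sigma^2), \qquad Y \mid Z \sim \mathrm{Poisson}(e^{Z}).
\begin{align*}
  m \;=\; \mathbb{E}[Y] &= \mathbb{E}\!\left[e^{Z}\right] = e^{\mu + \sigma^2/2}, \\
  \operatorname{Var}(Y)
    &= \mathbb{E}\!\left[\operatorname{Var}(Y \mid Z)\right]
       + \operatorname{Var}\!\left(\mathbb{E}[Y \mid Z]\right) \\
    &= \mathbb{E}\!\left[e^{Z}\right] + \operatorname{Var}\!\left(e^{Z}\right) \\
    &= m + m^{2}\left(e^{\sigma^{2}} - 1\right) \;\ge\; m .
\end{align*}
```

The variance exceeds the mean whenever $\sigma^2 > 0$, and the excess grows quadratically with the mean, which matches the overdispersion typically seen in RNA-seq counts. A plain Poisson model corresponds to $\sigma^2 = 0$.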
The B1 Protein Guides the Biosynthesis of a Lasso Peptide
NASA Astrophysics Data System (ADS)
Zhu, Shaozhou; Fage, Christopher D.; Hegemann, Julian D.; Mielcarek, Andreas; Yan, Dushan; Linne, Uwe; Marahiel, Mohamed A.
2016-10-01
Lasso peptides are a class of ribosomally synthesized and post-translationally modified peptides (RiPPs) with a unique lariat knot-like fold that endows them with extraordinary stability and biologically relevant activity. However, the biosynthetic mechanism of these fascinating molecules remains largely speculative. Generally, two enzymes (B for processing and C for cyclization) are required to assemble the unusual knot-like structure. Several subsets of lasso peptide gene clusters feature a “split” B protein on separate open reading frames (B1 and B2), suggesting distinct functions for the B protein in lasso peptide biosynthesis. Herein, we provide new insights into the role of the RiPP recognition element (RRE) PadeB1, characterizing its capacity to bind the paeninodin leader peptide and deliver its peptide substrate to PadeB2 for processing.
Variable selection in subdistribution hazard frailty models with competing risks data
Do Ha, Il; Lee, Minjung; Oh, Seungyoung; Jeong, Jong-Hyeon; Sylvester, Richard; Lee, Youngjo
2014-01-01
The proportional subdistribution hazards model (i.e. Fine-Gray model) has been widely used for analyzing univariate competing risks data. Recently, this model has been extended to clustered competing risks data via frailty. To the best of our knowledge, however, there has been no literature on variable selection method for such competing risks frailty models. In this paper, we propose a simple but unified procedure via a penalized h-likelihood (HL) for variable selection of fixed effects in a general class of subdistribution hazard frailty models, in which random effects may be shared or correlated. We consider three penalty functions (LASSO, SCAD and HL) in our variable selection procedure. We show that the proposed method can be easily implemented using a slight modification to existing h-likelihood estimation approaches. Numerical studies demonstrate that the proposed procedure using the HL penalty performs well, providing a higher probability of choosing the true model than LASSO and SCAD methods without losing prediction accuracy. The usefulness of the new method is illustrated using two actual data sets from multi-center clinical trials. PMID:25042872
Liquid electrolyte informatics using an exhaustive search with linear regression.
Sodeyama, Keitaro; Igarashi, Yasuhiko; Nakayama, Tomofumi; Tateyama, Yoshitaka; Okada, Masato
2018-06-14
Exploring new liquid electrolyte materials is a fundamental target for developing new high-performance lithium-ion batteries. In contrast to solid materials, disordered liquid solution properties have been less studied by data-driven information techniques. Here, we examined the estimation accuracy and efficiency of three information techniques, multiple linear regression (MLR), least absolute shrinkage and selection operator (LASSO), and exhaustive search with linear regression (ES-LiR), by using coordination energy and melting point as test liquid properties. We then confirmed that ES-LiR gives the most accurate estimation among the techniques. We also found that ES-LiR can provide the relationship between the "prediction accuracy" and "calculation cost" of the properties via a weight diagram of descriptors. This technique makes it possible to choose the balance between "accuracy" and "cost" when searching a very large number of new materials.
Supervised Learning for Dynamical System Learning.
Hefny, Ahmed; Downey, Carlton; Gordon, Geoffrey J
2015-01-01
Recently there has been substantial interest in spectral methods for learning dynamical systems. These methods are popular since they often offer a good tradeoff between computational and statistical efficiency. Unfortunately, they can be difficult to use and extend in practice: e.g., they can make it difficult to incorporate prior information such as sparsity or structure. To address this problem, we present a new view of dynamical system learning: we show how to learn dynamical systems by solving a sequence of ordinary supervised learning problems, thereby allowing users to incorporate prior knowledge via standard techniques such as L1 regularization. Many existing spectral methods are special cases of this new framework, using linear regression as the supervised learner. We demonstrate the effectiveness of our framework by showing examples where nonlinear regression or lasso let us learn better state representations than plain linear regression does; the correctness of these instances follows directly from our general analysis.
Gene regulatory network inference using fused LASSO on multiple data sets
Omranian, Nooshin; Eloundou-Mbebi, Jeanne M. O.; Mueller-Roeber, Bernd; Nikoloski, Zoran
2016-01-01
Devising computational methods to accurately reconstruct gene regulatory networks given gene expression data is key to systems biology applications. Here we propose a method for reconstructing gene regulatory networks by simultaneous consideration of data sets from different perturbation experiments and corresponding controls. The method imposes three biologically meaningful constraints: (1) expression levels of each gene should be explained by the expression levels of a small number of transcription factor coding genes, (2) networks inferred from different data sets should be similar with respect to the type and number of regulatory interactions, and (3) relationships between genes which exhibit similar differential behavior over the considered perturbations should be favored. We demonstrate that these constraints can be transformed in a fused LASSO formulation for the proposed method. The comparative analysis on transcriptomics time-series data from prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as a eukaryotic species, mouse, demonstrated that the proposed method has the advantages of the most recent approaches for regulatory network inference, while obtaining better performance and assigning higher scores to the true regulatory links. The study indicates that the combination of sparse regression techniques with other biologically meaningful constraints is a promising framework for gene regulatory network reconstructions. PMID:26864687
Sparse EEG/MEG source estimation via a group lasso
Lim, Michael; Ales, Justin M.; Cottereau, Benoit R.; Hastie, Trevor
2017-01-01
Non-invasive recordings of human brain activity through electroencephalography (EEG) or magnetoencephalography (MEG) are of value for both basic science and clinical applications in sensory, cognitive, and affective neuroscience. Here we introduce a new approach to estimating the intra-cranial sources of EEG/MEG activity measured from extra-cranial sensors. The approach is based on the group lasso, a sparse-prior inverse that has been adapted to take advantage of functionally-defined regions of interest for the definition of physiologically meaningful groups within a functionally-based common space. Detailed simulations using realistic source-geometries and data from a human Visual Evoked Potential experiment demonstrate that the group-lasso method has improved performance over traditional ℓ2 minimum-norm methods. In addition, we show that pooling source estimates across subjects over functionally defined regions of interest results in improvements in the accuracy of source estimates for both the group-lasso and minimum-norm approaches. PMID:28604790
DOE Office of Scientific and Technical Information (OSTI.GOV)
Trehan, Sumeet; Carlberg, Kevin T.; Durlofsky, Louis J.
2017-07-14
A machine learning–based framework for modeling the error introduced by surrogate models of parameterized dynamical systems is proposed. The framework entails the use of high-dimensional regression techniques (eg, random forests, and LASSO) to map a large set of inexpensively computed “error indicators” (ie, features) produced by the surrogate model at a given time instance to a prediction of the surrogate-model error in a quantity of interest (QoI). This eliminates the need for the user to hand-select a small number of informative features. The methodology requires a training set of parameter instances at which the time-dependent surrogate-model error is computed by simulating both the high-fidelity and surrogate models. Using these training data, the method first determines regression-model locality (via classification or clustering) and subsequently constructs a “local” regression model to predict the time-instantaneous error within each identified region of feature space. We consider 2 uses for the resulting error model: (1) as a correction to the surrogate-model QoI prediction at each time instance and (2) as a way to statistically model arbitrary functions of the time-dependent surrogate-model error (eg, time-integrated errors). We then apply the proposed framework to model errors in reduced-order models of nonlinear oil-water subsurface flow simulations, with time-varying well-control (bottom-hole pressure) parameters. The reduced-order models used in this work entail application of trajectory piecewise linearization in conjunction with proper orthogonal decomposition. Moreover, when the first use of the method is considered, numerical experiments demonstrate consistent improvement in accuracy in the time-instantaneous QoI prediction relative to the original surrogate model, across a large number of test cases. When the second use is considered, results show that the proposed method provides accurate statistical predictions of the time- and well-averaged errors.
Vidyasagar, Mathukumalli
2015-01-01
This article reviews several techniques from machine learning that can be used to study the problem of identifying a small number of features, from among tens of thousands of measured features, that can accurately predict a drug response. Prediction problems are divided into two categories: sparse classification and sparse regression. In classification, the clinical parameter to be predicted is binary, whereas in regression, the parameter is a real number. Well-known methods for both classes of problems are briefly discussed. These include the SVM (support vector machine) for classification and various algorithms such as ridge regression, LASSO (least absolute shrinkage and selection operator), and EN (elastic net) for regression. In addition, several well-established methods that do not directly fall into machine learning theory are also reviewed, including neural networks, PAM (pattern analysis for microarrays), SAM (significance analysis for microarrays), GSEA (gene set enrichment analysis), and k-means clustering. Several references indicative of the application of these methods to cancer biology are discussed.
Li, Richard Y.; Di Felice, Rosa; Rohs, Remo; Lidar, Daniel A.
2018-01-01
Transcription factors regulate gene expression, but how these proteins recognize and specifically bind to their DNA targets is still debated. Machine learning models are effective means to reveal interaction mechanisms. Here we studied the ability of a quantum machine learning approach to predict binding specificity. Using simplified datasets of a small number of DNA sequences derived from actual binding affinity experiments, we trained a commercially available quantum annealer to classify and rank transcription factor binding. The results were compared to state-of-the-art classical approaches for the same simplified datasets, including simulated annealing, simulated quantum annealing, multiple linear regression, LASSO, and extreme gradient boosting. Despite technological limitations, we find a slight advantage in classification performance and nearly equal ranking performance using the quantum annealer for these fairly small training data sets. Thus, we propose that quantum annealing might be an effective method to implement machine learning for certain computational biology problems. PMID:29652405
Fan, Yue; Wang, Xiao; Peng, Qinke
2017-01-01
Gene regulatory networks (GRNs) play an important role in cellular systems and are important for understanding biological processes. Many algorithms have been developed to infer the GRNs. However, most algorithms only pay attention to the gene expression data but do not consider the topology information in their inference process, while incorporating this information can partially compensate for the lack of reliable expression data. Here we develop a Bayesian group lasso with spike and slab priors to perform gene selection and estimation for nonparametric models. B-spline basis functions are used to capture the nonlinear relationships flexibly and penalties are used to avoid overfitting. Further, we incorporate the topology information into the Bayesian method as a prior. We present the application of our method on DREAM3 and DREAM4 datasets and two real biological datasets. The results show that our method performs better than existing methods and the topology information prior can improve the result.
mLASSO-Hum: A LASSO-based interpretable human-protein subcellular localization predictor.
Wan, Shibiao; Mak, Man-Wai; Kung, Sun-Yuan
2015-10-07
Knowing the subcellular compartments of human proteins is essential to shed light on the mechanisms of a broad range of human diseases. In computational methods for protein subcellular localization, knowledge-based methods (especially gene ontology (GO) based methods) are known to perform better than sequence-based methods. However, existing GO-based predictors often lack interpretability and suffer from overfitting due to the high dimensionality of feature vectors. To address these problems, this paper proposes an interpretable multi-label predictor, namely mLASSO-Hum, which can yield sparse and interpretable solutions for large-scale prediction of human protein subcellular localization. By using the one-vs-rest LASSO-based classifiers, 87 out of more than 8000 GO terms are found to play more significant roles in determining the subcellular localization. Based on these 87 essential GO terms, we can decide not only where a protein resides within a cell, but also why it is located there. To further exploit information from the remaining GO terms, a method based on the GO hierarchical information derived from the depth distance of GO terms is proposed. Experimental results show that mLASSO-Hum performs significantly better than state-of-the-art predictors. We also found that in addition to the GO terms from the cellular component category, GO terms from the other two categories also play important roles in the final classification decisions. For readers' convenience, the mLASSO-Hum server is available online at http://bioinfo.eie.polyu.edu.hk/mLASSOHumServer/. Copyright © 2015 Elsevier Ltd. All rights reserved.
GWAS-based machine learning approach to predict duloxetine response in major depressive disorder.
Maciukiewicz, Malgorzata; Marshe, Victoria S; Hauschild, Anne-Christin; Foster, Jane A; Rotzinger, Susan; Kennedy, James L; Kennedy, Sidney H; Müller, Daniel J; Geraci, Joseph
2018-04-01
Major depressive disorder (MDD) is one of the most prevalent psychiatric disorders and is commonly treated with antidepressant drugs. However, large variability is observed in terms of response to antidepressants. Machine learning (ML) models may be useful to predict treatment outcomes. A sample of 186 MDD patients who received treatment with duloxetine for up to 8 weeks were categorized as "responders" based on a MADRS change >50% from baseline; or "remitters" based on a MADRS score ≤10 at end point. The initial dataset (N = 186) was randomly divided into training and test sets in a nested 5-fold cross-validation, where 80% was used as a training set and 20% made up five independent test sets. We performed genome-wide logistic regression to identify potentially significant variants related to duloxetine response/remission and extracted the most promising predictors using LASSO regression. Subsequently, classification-regression trees (CRT) and support vector machines (SVM) were applied to construct models, using ten-fold cross-validation. With regard to response, none of the pairs performed significantly better than chance (accuracy p > .1). For remission, SVM achieved moderate performance with an accuracy = 0.52, a sensitivity = 0.58, and a specificity = 0.46, and 0.51 for all coefficients for CRT. The best performing SVM fold was characterized by an accuracy = 0.66 (p = .071), a sensitivity = 0.70, and a specificity = 0.61. In this study, the potential of using GWAS data to predict duloxetine outcomes was examined using ML models. The models were characterized by a promising sensitivity, but specificity remained moderate at best. The inclusion of additional non-genetic variables to create integrated models may improve prediction. Copyright © 2017. Published by Elsevier Ltd.
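The accuracy, sensitivity, and specificity figures quoted in the abstract above all derive from a binary confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' code; the function name `classification_metrics` and the 1 = remitter / 0 = non-remitter coding are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of true positives recovered
        "specificity": tn / (tn + fp),  # fraction of true negatives recovered
    }


if __name__ == "__main__":
    # Hypothetical labels, purely for illustration.
    print(classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

Sensitivity and specificity are computed on disjoint subsets of the cases, so a model can score well on one while remaining near chance on the other, which is the pattern the abstract reports.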
Accuracy of genomic breeding values for meat tenderness in Polled Nellore cattle.
Magnabosco, C U; Lopes, F B; Fragoso, R C; Eifert, E C; Valente, B D; Rosa, G J M; Sainz, R D
2016-07-01
Zebu (Bos indicus) cattle, mostly of the Nellore breed, comprise more than 80% of the beef cattle in Brazil, given their tolerance of the tropical climate and high resistance to ectoparasites. Despite their advantages for production in tropical environments, zebu cattle tend to produce tougher meat than Bos taurus breeds. Traditional genetic selection to improve meat tenderness is constrained by the difficulty and cost of phenotypic evaluation for meat quality. Therefore, genomic selection may be the best strategy to improve meat quality traits. This study was performed to compare the accuracies of different Bayesian regression models in predicting molecular breeding values for meat tenderness in Polled Nellore cattle. The data set was composed of Warner-Bratzler shear force (WBSF) of longissimus muscle from 205, 141, and 81 animals slaughtered in 2005, 2010, and 2012, respectively, which were selected and mated so as to create extreme segregation for WBSF. The animals were genotyped with either the Illumina BovineHD (HD; 777,000 from 90 samples) chip or the GeneSeek Genomic Profiler (GGP Indicus HD; 77,000 from 337 samples). The quality controls of SNP were Hardy-Weinberg proportion P-value ≥ 0.1%, minor allele frequency > 1%, and call rate > 90%. The FImpute program was used for imputation from the GGP Indicus HD chip to the HD chip. The effect of each SNP was estimated using ridge regression, least absolute shrinkage and selection operator (LASSO), Bayes A, Bayes B, and Bayes Cπ methods. Different numbers of SNP were used, with 1, 2, 3, 4, 5, 7, 10, 20, 40, 60, 80, or 100% of the markers preselected based on their significance test (P-value from genomewide association studies [GWAS]) or randomly sampled. The prediction accuracy was assessed by the correlation between genomic breeding value and the observed WBSF phenotype, using a leave-one-out cross-validation methodology. The prediction accuracies using all markers were all very similar for all models, ranging from 0.22 (Bayes Cπ) to 0.25 (Bayes B). When preselecting SNP based on GWAS results, the highest correlation (0.27) between WBSF and the genomic breeding value was achieved using the Bayesian LASSO model with 15,030 (3%) markers. Although this study used relatively few animals, the design of the segregating population ensured wide genetic variability for meat tenderness, which was important to achieve acceptable accuracy of genomic prediction. Although all models showed similar levels of prediction accuracy, some small advantages were observed with the Bayes B approach when higher numbers of markers were preselected based on their P-values resulting from a GWAS analysis.
Modeling Alzheimer's disease cognitive scores using multi-task sparse group lasso.
Liu, Xiaoli; Goncalves, André R; Cao, Peng; Zhao, Dazhe; Banerjee, Arindam
2018-06-01
Alzheimer's disease (AD) is a severe neurodegenerative disorder characterized by loss of memory and reduction in cognitive functions due to progressive degeneration of neurons and their connections, eventually leading to death. In this paper, we consider the problem of simultaneously predicting several different cognitive scores associated with categorizing subjects as normal, mild cognitive impairment (MCI), or Alzheimer's disease (AD) in a multi-task learning framework using features extracted from brain images obtained from ADNI (Alzheimer's Disease Neuroimaging Initiative). To solve the problem, we present a multi-task sparse group lasso (MT-SGL) framework, which estimates sparse features coupled across tasks, and can work with loss functions associated with any Generalized Linear Models. Through comparisons with a variety of baseline models using multiple evaluation metrics, we illustrate the promising predictive performance of MT-SGL on ADNI along with its ability to identify brain regions more likely to help characterize Alzheimer's disease progression. Copyright © 2017 Elsevier Ltd. All rights reserved.
Calle, M. Luz; Rothman, Nathaniel; Urrea, Víctor; Kogevinas, Manolis; Petrus, Sandra; Chanock, Stephen J.; Tardón, Adonina; García-Closas, Montserrat; González-Neira, Anna; Vellalta, Gemma; Carrato, Alfredo; Navarro, Arcadi; Lorente-Galdós, Belén; Silverman, Debra T.; Real, Francisco X.; Wu, Xifeng; Malats, Núria
2013-01-01
The relationship between inflammation and cancer is well established in several tumor types, including bladder cancer. We performed an association study between 886 inflammatory-gene variants and bladder cancer risk in 1,047 cases and 988 controls from the Spanish Bladder Cancer (SBC)/EPICURO Study. A preliminary exploration with the widely used univariate logistic regression approach did not identify any significant SNP after correcting for multiple testing. We further applied two more comprehensive methods to capture the complexity of bladder cancer genetic susceptibility: Bayesian Threshold LASSO (BTL), a regularized regression method, and AUC-Random Forest, a machine-learning algorithm. Both approaches explore the joint effect of markers. BTL analysis identified a signature of 37 SNPs in 34 genes showing an association with bladder cancer. AUC-RF detected an optimal predictive subset of 56 SNPs. 13 SNPs were identified by both methods in the total population. Using resources from the Texas Bladder Cancer study we were able to replicate 30% of the SNPs assessed. The associations between inflammatory SNPs and bladder cancer were reexamined among non-smokers to eliminate the effect of tobacco, one of the strongest and most prevalent environmental risk factors for this tumor. A 9-SNP signature was detected by BTL. Here we report, for the first time, a set of SNPs in inflammatory genes jointly associated with bladder cancer risk. These results highlight the importance of the complex structure of genetic susceptibility associated with cancer risk. PMID:24391818
Reducing the number of reconstructions needed for estimating channelized observer performance
NASA Astrophysics Data System (ADS)
Pineda, Angel R.; Miedema, Hope; Brenner, Melissa; Altaf, Sana
2018-03-01
A challenge for task-based optimization is the time required for each reconstructed image in applications where reconstructions are time consuming. Our goal is to reduce the number of reconstructions needed to estimate the area under the receiver operating characteristic curve (AUC) of the infinitely-trained optimal channelized linear observer. We explore the use of classifiers which either do not invert the channel covariance matrix or do feature selection. We also study the assumption that multiple low contrast signals in the same image of a non-linear reconstruction do not significantly change the estimate of the AUC. We compared the AUC of several classifiers (Hotelling, logistic regression, logistic regression using Firth bias reduction and the least absolute shrinkage and selection operator (LASSO)) with a small number of observations both for normal simulated data and images from a total variation reconstruction in magnetic resonance imaging (MRI). We used 10 Laguerre-Gauss channels and the Mann-Whitney estimator for AUC. For this data, our results show that at small sample sizes feature selection using the LASSO technique can decrease bias of the AUC estimation with increased variance and that for large sample sizes the difference between these classifiers is small. We also compared the use of multiple signals in a single reconstructed image to reduce the number of reconstructions in a total variation reconstruction for accelerated imaging in MRI. We found that AUC estimation using multiple low contrast signals in the same image resulted in similar AUC estimates as doing a single reconstruction per signal leading to a 13x reduction in the number of reconstructions needed.
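The Mann-Whitney estimator of AUC mentioned above is the fraction of (signal-present, signal-absent) score pairs that the observer ranks correctly, with ties counted as one half. A minimal sketch follows; the function name `mann_whitney_auc` and the example scores are assumptions for illustration and are not taken from the study.

```python
def mann_whitney_auc(signal_scores, noise_scores):
    """Estimate AUC as the proportion of correctly ordered (signal, noise) pairs.

    A pair is correctly ordered when the signal-present score exceeds the
    signal-absent score; tied pairs contribute one half.
    """
    if not signal_scores or not noise_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s in signal_scores:
        for n in noise_scores:
            if s > n:
                wins += 1.0
            elif s == n:
                wins += 0.5
    return wins / (len(signal_scores) * len(noise_scores))


if __name__ == "__main__":
    # Hypothetical observer outputs for three signal-present and three signal-absent images.
    print(mann_whitney_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

The double loop is quadratic in the number of images, which is acceptable at the small sample sizes the abstract is concerned with; a rank-based formula would be the usual choice for larger sets.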
Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures
Zwiener, Isabella; Frisch, Barbara; Binder, Harald
2014-01-01
Gene expression measurements have successfully been used for building prognostic signatures, i.e., for identifying a short list of important genes that can predict patient outcome. Mostly microarray measurements have been considered, and there is little advice available for building multivariable risk prediction models from RNA-Seq data. We specifically consider penalized regression techniques, such as the lasso and componentwise boosting, which can simultaneously consider all measurements and provide both multivariable regression models for prediction and automated variable selection. However, they might be affected by the typical skewness, mean-variance-dependency or extreme values of RNA-Seq covariates and therefore could benefit from transformations of the latter. In an analytical part, we highlight preferential selection of covariates with large variances, which is problematic due to the mean-variance dependency of RNA-Seq data. In a simulation study, we compare different transformations of RNA-Seq data for potentially improving detection of important genes. Specifically, we consider standardization, the log transformation, a variance-stabilizing transformation, the Box-Cox transformation, and rank-based transformations. In addition, the prediction performance for real data from patients with kidney cancer and acute myeloid leukemia is considered. We show that signature size, identification performance, and prediction performance critically depend on the choice of a suitable transformation. Rank-based transformations perform well in all scenarios and can even outperform complex variance-stabilizing approaches. Generally, the results illustrate that the distribution and potential transformations of RNA-Seq data need to be considered as a critical step when building risk prediction models by penalized regression techniques. PMID:24416353
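To illustrate why a rank-based transformation removes the influence of skewness and extreme counts, the sketch below maps one gene's values to average ranks scaled into (0, 1). This is only one possible rank-based variant and is not claimed to be the transformation used in the paper; the function name `rank_transform` and the scaling by n + 1 are assumptions.

```python
def rank_transform(values):
    """Replace each value by its average rank (ties share a rank), scaled by n + 1."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices in ascending value order
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n + 1) for r in ranks]


if __name__ == "__main__":
    # Hypothetical read counts for one gene across four samples.
    print(rank_transform([10, 1000, 10, 5]))
```

Because only the ordering of the samples survives, every covariate ends up on the same scale, so a penalized regression no longer favors genes merely because their raw counts have large variance.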
NASA Astrophysics Data System (ADS)
Rachmatia, H.; Kusuma, W. A.; Hasibuan, L. S.
2017-05-01
Selection in plant breeding could be more effective and more efficient if it is based on genomic data. Genomic selection (GS) is a new approach for plant-breeding selection that exploits genomic data through a mechanism called genomic prediction (GP). Most GP models use linear methods that ignore the effects of interactions among genes and of higher-order nonlinearities. The deep belief network (DBN), one of the architectures used in deep learning, is able to model data at a high level of abstraction that captures nonlinear effects in the data. This study implemented a DBN for developing a GP model utilizing whole-genome single nucleotide polymorphisms (SNPs) as data for training and testing. The case study was a set of traits in maize. The maize dataset was acquired from the Global Maize program of CIMMYT (International Maize and Wheat Improvement Center). Based on Pearson correlation, DBN outperforms the other methods, reproducing kernel Hilbert space (RKHS) regression, Bayesian LASSO (BL), and best linear unbiased predictor (BLUP), for traits presumed to be non-additive. DBN achieves a correlation of 0.579, on a scale from -1 to 1.
Predicting mortality over different time horizons: which data elements are needed?
Goldstein, Benjamin A; Pencina, Michael J; Montez-Rath, Maria E; Winkelmayer, Wolfgang C
2017-01-01
Electronic health records (EHRs) are a resource for "big data" analytics, containing a variety of data elements. We investigate how different categories of information contribute to prediction of mortality over different time horizons among patients undergoing hemodialysis treatment. We derived prediction models for mortality over 7 time horizons using EHR data on older patients from a national chain of dialysis clinics linked with administrative data using LASSO (least absolute shrinkage and selection operator) regression. We assessed how different categories of information relate to risk assessment and compared discrete models to time-to-event models. The best predictors used all the available data (c-statistic ranged from 0.72-0.76), with stronger models in the near term. While different variable groups showed different utility, exclusion of any particular group did not lead to a meaningfully different risk assessment. Discrete time models performed better than time-to-event models. Different variable groups were predictive over different time horizons, with vital signs most predictive for near-term mortality and demographics and comorbidities more important in long-term mortality. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved.
The Spike-and-Slab Lasso Generalized Linear Models for Prediction and Associated Genes Detection.
Tang, Zaixiang; Shen, Yueping; Zhang, Xinyan; Yi, Nengjun
2017-01-01
Large-scale "omics" data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, there are considerable challenges in analyzing high-dimensional molecular data, including the large number of potential molecular predictors, limited number of samples, and small effect of each predictor. We propose new Bayesian hierarchical generalized linear models, called spike-and-slab lasso GLMs, for prognostic prediction and detection of associated genes using large-scale molecular data. The proposed model employs a spike-and-slab mixture double-exponential prior for coefficients that can induce weak shrinkage on large coefficients, and strong shrinkage on irrelevant coefficients. We have developed a fast and stable algorithm to fit large-scale hierarchical GLMs by incorporating expectation-maximization (EM) steps into the fast cyclic coordinate descent algorithm. The proposed approach integrates nice features of two popular methods, i.e., penalized lasso and Bayesian spike-and-slab variable selection. The performance of the proposed method is assessed via extensive simulation studies. The results show that the proposed approach can provide not only more accurate estimates of the parameters, but also better prediction. We demonstrate the proposed procedure on two cancer data sets: a well-known breast cancer data set consisting of 295 tumors, and expression data of 4919 genes; and the ovarian cancer data set from TCGA with 362 tumors, and expression data of 5336 genes. Our analyses show that the proposed procedure can generate powerful models for predicting outcomes and detecting associated genes. The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/). Copyright © 2017 by the Genetics Society of America.
Häberle, Lothar; Hack, Carolin C; Heusinger, Katharina; Wagner, Florian; Jud, Sebastian M; Uder, Michael; Beckmann, Matthias W; Schulz-Wendtland, Rüdiger; Wittenberg, Thomas; Fasching, Peter A
2017-08-30
Tumors in radiologically dense breasts were overlooked on mammograms more often than tumors in low-density breasts. A fast, reproducible, and automated method of assessing percentage mammographic density (PMD) would be desirable to support decisions on whether ultrasonography should be provided for women in addition to mammography in diagnostic mammography units. PMD assessment has still not been included in clinical routine work, as there are issues of interobserver variability and the procedure is quite time consuming. This study investigated whether fully automatically generated texture features of mammograms can replace time-consuming semi-automatic PMD assessment to predict a patient's risk of having an invasive breast tumor that is visible on ultrasound but masked on mammography (mammography failure). This observational study included 1334 women with invasive breast cancer treated at a hospital-based diagnostic mammography unit. Ultrasound was available for the entire cohort as part of routine diagnosis. Computer-based threshold PMD assessments ("observed PMD") were carried out and 363 texture features were obtained from each mammogram. Several variable selection and regression techniques (univariate selection, lasso, boosting, random forest) were applied to predict PMD from the texture features. The predicted PMD values were each used as a new predictor for masking in logistic regression models together with clinical predictors. These four logistic regression models with predicted PMD were compared among themselves and with a logistic regression model with observed PMD. The most accurate masking prediction was determined by cross-validation. About 120 of the 363 texture features were selected for predicting PMD. Density predictions with boosting were the best substitute for observed PMD to predict masking. Overall, the corresponding logistic regression model performed better (cross-validated AUC, 0.747) than one without mammographic density (0.734), but less well than the one with the observed PMD (0.753). However, in patients with an assigned mammography failure risk >10%, covering about half of all masked tumors, the boosting-based model performed at least as accurately as the original PMD model. Automatically generated texture features can replace semi-automatically determined PMD in a prediction model for mammography failure, such that more than 50% of masked tumors could be discovered.
NASA Astrophysics Data System (ADS)
Fouque, Kevin Jeanne Dit; Lavanant, Hélène; Zirah, Séverine; Hegemann, Julian D.; Zimmermann, Marcel; Marahiel, Mohamed A.; Rebuffat, Sylvie; Afonso, Carlos
2017-02-01
Lasso peptides are characterized by a mechanically interlocked structure, where the C-terminal tail of the peptide is threaded and trapped within an N-terminal macrolactam ring. Their compact and stable structures have a significant impact on their biological and physical properties and make them highly interesting for drug development. Ion mobility - mass spectrometry (IM-MS) has been shown to be effective in discriminating the lasso topology from the corresponding branched-cyclic topoisomers in which the C-terminal tail is unthreaded. In fact, previous comparison of the IM-MS data of the two topologies has yielded three trends that allow differentiation of the lasso fold from the branched-cyclic structure: (1) the low abundance of highly charged ions, (2) the low change in collision cross sections (CCS) with increasing charge state and (3) a narrow ion mobility peak width. In this study, a three-dimensional plot was generated using three indicators based on these three trends: (1) mean charge divided by mass (ζ), (2) relative range of CCS covered by all protonated molecules (ΔΩ/Ω) and (3) mean ion mobility peak width (δΩ). The data were first collected on a set of twenty-one lasso peptides and eight branched-cyclic peptides. The indicators were obtained also for eight variants of the well-known lasso peptide MccJ25 obtained by site-directed mutagenesis and further extended to five linear peptides, two macrocyclic peptides and one disulfide-constrained peptide. In all cases, a clear clustering was observed between constrained and unconstrained structures, thus providing a new strategy to discriminate mechanically interlocked topologies.
Wang, Zhu; Ma, Shuangge; Wang, Ching-Yun
2015-09-01
In health services and outcome research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the likelihood function plus a penalty, including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties including the standard error formulae are provided. A simulation study shows that the new algorithm not only has more accurate or at least comparable estimation, but also is more robust than the traditional stepwise variable selection. The proposed methods are applied to analyze the health care demand in Germany using the open-source R package mpath. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
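The three penalties named in the abstract above have standard closed forms. The following is a minimal sketch, assuming the usual textbook definitions (SCAD with shape parameter a, conventionally 3.7, and MCP with shape parameter gamma). The function names are illustrative and are not taken from the mpath package.

```python
def lasso_penalty(t, lam):
    """L1 penalty: grows linearly in |t| without bound."""
    return lam * abs(t)


def scad_penalty(t, lam, a=3.7):
    """SCAD penalty (assumes a > 2): L1 near zero, quadratic taper, then flat."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(t, lam, gamma=3.0):
    """MCP penalty (assumes gamma > 1): tapers from the start, flat beyond gamma * lam."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2 * gamma)
    return gamma * lam * lam / 2
```

All three penalties behave alike close to zero, which is what produces sparsity. SCAD and MCP level off for large coefficients, so strong effects are not shrunk further, whereas the LASSO penalty keeps growing.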
Nuclear receptors (NRs) are important biological macromolecular transcription factors that are implicated in multiple biological pathways and may interact with other xenobiotics that are endocrine disruptors present in the environment. Examples of important NRs include the androg...
NASA Astrophysics Data System (ADS)
Czernecki, Bartosz; Nowosad, Jakub; Jabłońska, Katarzyna
2018-04-01
Changes in the timing of plant phenological phases are important proxies in contemporary climate research. However, most of the commonly used traditional phenological observations do not give any coherent spatial information. While consistent spatial data can be obtained from airborne sensors and preprocessed gridded meteorological data, not many studies robustly benefit from these data sources. Therefore, the main aim of this study is to create and evaluate different statistical models for reconstructing, predicting, and improving the quality of phenological phase monitoring with the use of satellite and meteorological products. A quality-controlled dataset of the 13 BBCH plant phenophases in Poland was collected for the period 2007-2014. For each phenophase, statistical models were built using the most commonly applied regression-based machine learning techniques, such as multiple linear regression, lasso, principal component regression, generalized boosted models, and random forest. The quality of the models was estimated using k-fold cross-validation. The obtained results showed varying potential for coupling meteorologically derived indices with remote sensing products in terms of phenological modeling; however, application of both data sources improves the models' accuracy by 0.6 to 4.6 days in terms of RMSE. It is shown that a robust prediction of early phenological phases is mostly related to meteorological indices, whereas for autumn phenophases, there is a stronger information signal provided by satellite-derived vegetation metrics. Choosing a specific set of predictors and applying robust preprocessing procedures is more important for the final results than the selection of a particular statistical model. The average RMSE for the best models of all phenophases is 6.3 days, while the individual RMSE values vary seasonally from 3.5 to 10 days. Models give a reliable proxy for ground observations with RMSE below 5 days for early spring and late spring phenophases. For other phenophases, RMSE values are higher and rise up to 9-10 days in the case of the earliest spring phenophases.
Tracking of time-varying genomic regulatory networks with a LASSO-Kalman smoother
2014-01-01
It is widely accepted that cellular requirements and environmental conditions dictate the architecture of genetic regulatory networks. Nonetheless, the status quo in regulatory network modeling and analysis assumes an invariant network topology over time. In this paper, we refocus on a dynamic perspective of genetic networks, one that can uncover substantial topological changes in network structure during biological processes such as developmental growth. We propose a novel outlook on the inference of time-varying genetic networks, from a limited number of noisy observations, by formulating the network estimation as a target tracking problem. We overcome the limited number of observations (small n large p problem) by performing tracking in a compressed domain. Assuming linear dynamics, we derive the LASSO-Kalman smoother, which recursively computes the minimum mean-square sparse estimate of the network connectivity at each time point. The LASSO operator, motivated by the sparsity of the genetic regulatory networks, allows simultaneous signal recovery and compression, thereby reducing the amount of required observations. The smoothing improves the estimation by incorporating all observations. We track the time-varying networks during the life cycle of the Drosophila melanogaster. The recovered networks show that few genes are permanent, whereas most are transient, acting only during specific developmental phases of the organism. PMID:24517200
The Highly Adaptive Lasso Estimator
Benkeser, David; van der Laan, Mark
2017-01-01
Estimation of a regression function is a common goal of statistical learning. We propose a novel nonparametric regression estimator that, in contrast to many existing methods, does not rely on local smoothness assumptions nor is it constructed using local smoothing techniques. Instead, our estimator respects global smoothness constraints by virtue of falling in a class of right-hand continuous functions with left-hand limits that have variation norm bounded by a constant. Using empirical process theory, we establish a fast minimal rate of convergence of our proposed estimator and illustrate how such an estimator can be constructed using standard software. In simulations, we show that the finite-sample performance of our estimator is competitive with other popular machine learning techniques across a variety of data generating mechanisms. We also illustrate competitive performance in real data examples using several publicly available data sets. PMID:29094111
Jang, In Sock; Dienstmann, Rodrigo; Margolin, Adam A; Guinney, Justin
2015-01-01
Complex mechanisms involving genomic aberrations in numerous proteins and pathways are believed to be a key cause of many diseases such as cancer. With recent advances in genomics, elucidating the molecular basis of cancer at a patient level is now feasible, and has led to personalized treatment strategies whereby a patient is treated according to his or her genomic profile. However, there is growing recognition that existing treatment modalities are overly simplistic, and do not fully account for the deep genomic complexity associated with sensitivity or resistance to cancer therapies. To overcome these limitations, large-scale pharmacogenomic screens of cancer cell lines--in conjunction with modern statistical learning approaches--have been used to explore the genetic underpinnings of drug response. While these analyses have demonstrated the ability to infer genetic predictors of compound sensitivity, to date most modeling approaches have been data-driven, i.e. they do not explicitly incorporate domain-specific knowledge (priors) in the process of learning a model. While a purely data-driven approach offers an unbiased perspective of the data--and may yield unexpected or novel insights--this strategy introduces challenges for both model interpretability and accuracy. In this study, we propose a novel prior-incorporated sparse regression model in which the choice of informative predictor sets is carried out by knowledge-driven priors (gene sets) in a stepwise fashion. Under regularization in a linear regression model, our algorithm is able to incorporate prior biological knowledge across the predictive variables thereby improving the interpretability of the final model with no loss--and often an improvement--in predictive performance. 
We evaluate the performance of our algorithm compared to well-known regularization methods such as LASSO, Ridge and Elastic net regression in the Cancer Cell Line Encyclopedia (CCLE) and Genomics of Drug Sensitivity in Cancer (Sanger) pharmacogenomics datasets, demonstrating that incorporation of the biological priors selected by our model confers improved predictability and interpretability, despite much fewer predictors, over existing state-of-the-art methods.
LASSO, two-way, and GPS time comparisons: A (very) preliminary status report
NASA Technical Reports Server (NTRS)
Veillet, Christian J. L.; Feraudy, D.; Torre, J. M.; Mangin, J. F.; Grudler, P.; Baumont, Francoise S.; Gaignebet, Jean C.; Hatat, J. L.; Hanson, Wayne; Clements, A.
1990-01-01
The first results are presented on the time transfer experiments between TUG (Graz, Austria) and OCA (Grasse, France) using common view Global Positioning System (GPS) and two-way stations at both sites. The present data, providing an rms of the clock offsets of 2 to 3 nanoseconds for a three month period, have to be further analyzed before any conclusions on the respective precision and accuracy of these techniques can be drawn. Two years after its start, the Laser Synchronization from Stationary Orbit (LASSO) experiment is finally giving its first results at TUG and OCA. The first analysis of three common sessions permitted researchers to conclude that the LASSO package on board Meteosat P2 is working satisfactorily, and that time transfer using this method should provide clock offsets at better than 1 nanosecond precision, and clock rates at better than 10^-12 s/s in a 5 to 10 minute session. A new method for extracting this information from the raw data sent by LASSO should enhance the performance of this experiment, exploiting the stability of the on-board oscillator.
LASSO observations at McDonald and OCA/CERGA: A preliminary analysis
NASA Technical Reports Server (NTRS)
Veillet, CH.; Fridelance, P.; Feraudy, D.; Boudon, Y.; Shelus, P. J.; Ricklefs, R. L.; Wiant, J. R.
1993-01-01
The Laser Synchronization from Synchronous Orbit (LASSO) observations between USA and Europe were made possible with the move of Meteosat 3/P2 toward 50 deg W. Two Lunar Laser Ranging stations participated in the observations: the MLRS at McDonald Observatory (Texas, USA) and OCA/CERGA (Grasse, France). Common sessions have been performed since 30 Apr. 1992, and will be continued up to the next Meteosat 3/P2 move further West (planned for January 1993). The preliminary analysis made with the data already collected by the end of Nov. 1992 shows that the precision which can be obtained from LASSO is better than 100 ps, the accuracy depending on how well the stations maintain their time metrology, as well as on the quality of the calibration (still to be made). To extract such a precision from the data, the processing has been drastically changed compared to the initial LASSO data analysis. It takes into account all the measurements made, timings on board, and echoes at each station. This complete use of the data dramatically increased the confidence in the synchronization results.
Chaibub Neto, Elias; Bare, J. Christopher; Margolin, Adam A.
2014-01-01
New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed to systematically and objectively evaluate competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Often times, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in planning of experiments we are better able to understand the strengths and weaknesses of competing algorithms leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where “omics” features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting and our simulations corroborate well established results concerning the conditions under which each one of these methods is expected to perform best while providing several novel insights. PMID:25289666
Prediction of gene expression with cis-SNPs using mixed models and regularization methods.
Zeng, Ping; Zhou, Xiang; Huang, Shuiping
2017-05-11
It has been shown that gene expression in human tissues is heritable, thus predicting gene expression using only SNPs becomes possible. The prediction of gene expression can offer important implications on the genetic architecture of individual functional associated SNPs and further interpretations of the molecular basis underlying human diseases. We compared three types of methods for predicting gene expression using only cis-SNPs, including the polygenic model, i.e. linear mixed model (LMM), two sparse models, i.e. Lasso and elastic net (ENET), and the hybrid of LMM and sparse model, i.e. Bayesian sparse linear mixed model (BSLMM). The three kinds of prediction methods have very different assumptions of underlying genetic architectures. These methods were evaluated using simulations under various scenarios, and were applied to the Geuvadis gene expression data. The simulations showed that these four prediction methods (i.e. Lasso, ENET, LMM and BSLMM) behaved best when their respective modeling assumptions were satisfied, but BSLMM had a robust performance across a range of scenarios. According to the R² of these models in the Geuvadis data, the four methods performed quite similarly. We did not observe any clustering or enrichment of predictive genes (defined as genes with R² ≥ 0.05) across the chromosomes, and also did not see any clear relationship between the proportion of the predictive genes and the proportion of genes in each chromosome. However, an interesting finding in the Geuvadis data was that highly predictive genes (e.g. R² ≥ 0.30) may have sparse genetic architectures since Lasso, ENET and BSLMM outperformed LMM for these genes; and this observation was validated in another gene expression data set. We further showed that the predictive genes were enriched in approximately independent LD blocks. 
Gene expression can be predicted with only cis-SNPs using well-developed prediction models and these predictive genes were enriched in some approximately independent LD blocks. The prediction of gene expression can shed some light on the functional interpretation for identified SNPs in GWASs.
Yang, Guanxue; Wang, Lin; Wang, Xiaofan
2017-06-07
Reconstruction of networks underlying complex systems is one of the most crucial problems in many areas of engineering and science. In this paper, rather than identifying parameters of complex systems governed by pre-defined models or taking some polynomial and rational functions as prior information for subsequent model selection, we put forward a general framework for nonlinear causal network reconstruction from time-series with limited observations. Using multi-source datasets obtained through a data-fusion strategy, we propose a novel method to handle nonlinearity and directionality of complex networked systems, namely group lasso nonlinear conditional Granger causality. Specifically, our method can exploit different sets of radial basis functions to approximate the nonlinear interactions between each pair of nodes and integrate sparsity into grouped variable selection. The performance of our approach is first assessed with two types of simulated datasets from a nonlinear vector autoregressive model and nonlinear dynamic models, and then verified on the benchmark datasets from DREAM3 Challenge4. Effects of data size and noise intensity are also discussed. All of the results demonstrate that the proposed method performs better in terms of a higher area under the precision-recall curve.
Predicting Kenya Short Rains Using the Indian Ocean SST
NASA Astrophysics Data System (ADS)
Peng, X.; Albertson, J. D.; Steinschneider, S.
2017-12-01
The rainfall over Eastern Africa is characterized by a typical bimodal monsoon system. The literature has shown that the monsoon system is closely connected with large-scale atmospheric motion, which is believed to be driven by sea surface temperature anomalies (SSTA). Therefore, we may make use of the predictability of SSTA in estimating the future Eastern Africa monsoon. In this study, we tried to predict the Kenya short rains (Oct, Nov and Dec rainfall) based on the Indian Ocean SSTA. The Least Absolute Shrinkage and Selection Operator (LASSO) regression is used to avoid over-fitting issues. Models for different lead times are trained using a 28-year training set (1979-2006) and are tested using a 10-year test set (2007-2016). Satisfactory prediction skill is achieved at relatively long lead times (i.e., 8 and 10 months) in terms of correlation coefficient and sign accuracy. Unlike some of the previous work, the prediction models are obtained from a data-driven method. A limited number of predictors is selected for each model, and these can be used in understanding the underlying physical connection. Still, further investigation is needed since the sampling variability issue cannot be excluded due to the limited sample size.
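The two skill measures mentioned above can be sketched in a few lines. This is an illustration only: the abstract does not define sign accuracy, so the definition used here (the share of years in which the predicted and observed anomalies have the same sign) is an assumption, and the function names are hypothetical.

```python
import math

# Year split described in the abstract: 28 training years, 10 test years.
train_years = list(range(1979, 2007))
test_years = list(range(2007, 2017))


def pearson_r(pred, obs):
    """Pearson correlation coefficient between predictions and observations."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


def sign_accuracy(pred, obs):
    """Assumed definition: fraction of cases where both anomalies are
    positive or both are non-positive."""
    hits = sum(1 for p, o in zip(pred, obs) if (p > 0) == (o > 0))
    return hits / len(pred)
```

With only ten test years, each year shifts the sign accuracy by 0.1, which is one way to see the sampling variability concern raised at the end of the abstract.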
Prediction of Multiple Infections After Severe Burn Trauma: a Prospective Cohort Study
Yan, Shuangchun; Tsurumi, Amy; Que, Yok-Ai; Ryan, Colleen M.; Bandyopadhaya, Arunava; Morgan, Alexander A.; Flaherty, Patrick J.; Tompkins, Ronald G.; Rahme, Laurence G.
2014-01-01
Objective: To develop predictive models for early triage of burn patients based on hyper-susceptibility to repeated infections. Background: Infection remains a major cause of mortality and morbidity after severe trauma, demanding new strategies to combat infections. Models for infection prediction are lacking. Methods: Secondary analysis of 459 burn patients (≥16 years old) with ≥20% total body surface area burns recruited from six US burn centers. We compared blood transcriptomes with a 180-h cut-off on the injury-to-transcriptome interval of 47 patients (≤1 infection episode) to those of 66 hyper-susceptible patients (multiple [≥2] infection episodes [MIE]). We used LASSO regression to select biomarkers and multivariate logistic regression to build models, the accuracy of which was assessed by area under receiver operating characteristic curve (AUROC) and cross-validation. Results: Three predictive models were developed using covariates of: (1) clinical characteristics; (2) expression profiles of 14 genomic probes; (3) combining (1) and (2). The genomic and clinical models were highly predictive of MIE status (AUROC_Genomic = 0.946 [95% CI, 0.906–0.986]; AUROC_Clinical = 0.864 [CI, 0.794–0.933]; AUROC_Genomic/AUROC_Clinical P = 0.044). The combined model has an increased AUROC_Combined of 0.967 (CI, 0.940–0.993) compared to the individual models (AUROC_Combined/AUROC_Clinical P = 0.0069). Hyper-susceptible patients show early alterations in immune-related signaling pathways, epigenetic modulation and chromatin remodeling. Conclusions: Early triage of burn patients more susceptible to infections can be made using clinical characteristics and/or genomic signatures. The genomic signature suggests that new insights into the pathophysiology of hyper-susceptibility to infection may lead to novel potential therapeutic or prophylactic targets. PMID:24950278
NASA Astrophysics Data System (ADS)
Li, Richard Y.; Di Felice, Rosa; Rohs, Remo; Lidar, Daniel A.
2018-03-01
Transcription factors regulate gene expression, but how these proteins recognize and specifically bind to their DNA targets is still debated. Machine learning models are effective means to reveal interaction mechanisms. Here we studied the ability of a quantum machine learning approach to classify and rank binding affinities. Using simplified data sets of a small number of DNA sequences derived from actual binding affinity experiments, we trained a commercially available quantum annealer to classify and rank transcription factor binding. The results were compared to state-of-the-art classical approaches for the same simplified data sets, including simulated annealing, simulated quantum annealing, multiple linear regression, LASSO, and extreme gradient boosting. Despite technological limitations, we find a slight advantage in classification performance and nearly equal ranking performance using the quantum annealer for these fairly small training data sets. Thus, we propose that quantum annealing might be an effective method to implement machine learning for certain computational biology problems.
Li, Juntao; Wang, Yanyan; Jiang, Tao; Xiao, Huimin; Song, Xuekun
2018-05-09
Diagnosing acute leukemia is a necessary prerequisite to treating it. Multi-classification on gene expression data of acute leukemia, which comprises B-cell acute lymphoblastic leukemia (BALL), T-cell acute lymphoblastic leukemia (TALL) and acute myeloid leukemia (AML), is helpful for diagnosis. However, selecting cancer-causing genes is a challenging problem in performing multi-classification. In this paper, weighted gene co-expression networks are employed to divide the genes into groups. Based on the resulting groups, a new regularized multinomial regression with overlapping group lasso penalty (MROGL) is presented to simultaneously perform multi-classification and select gene groups. By implementing this method on three-class acute leukemia data, the grouped genes which work synergistically are identified, and the overlapped genes shared by different groups are also highlighted. Moreover, MROGL outperforms five other methods in multi-classification accuracy. Copyright © 2017. Published by Elsevier B.V.
The ring residue proline 8 is crucial for the thermal stability of the lasso peptide caulosegnin II.
Hegemann, Julian D; Fage, Christopher D; Zhu, Shaozhou; Harms, Klaus; Di Leva, Francesco Saverio; Novellino, Ettore; Marinelli, Luciana; Marahiel, Mohamed A
2016-04-01
Lasso peptides are fascinating natural products with a unique structural fold that can exhibit tremendous thermal stability. Here, we investigate factors responsible for the thermal stability of caulosegnin II. By employing X-ray crystallography, mutational analysis and molecular dynamics simulations, the ring residue proline 8 was proven to be crucial for thermal stability.
Source sparsity control of sound field reproduction using the elastic-net and the lasso minimizers.
Gauthier, P-A; Lecomte, P; Berry, A
2017-04-01
Sound field reproduction is aimed at the reconstruction of a sound pressure field in an extended area using dense loudspeaker arrays. In some circumstances, sound field reproduction is targeted at the reproduction of a sound field captured using microphone arrays. Although methods and algorithms already exist to convert microphone array recordings to loudspeaker array signals, one remaining research question is how to control the spatial sparsity in the resulting loudspeaker array signals and what would be the resulting practical advantages. Sparsity is an interesting feature for spatial audio since it can drastically reduce the number of concurrently active reproduction sources and, therefore, increase the spatial contrast of the solution at the expense of a difference between the target and reproduced sound fields. In this paper, the application of the elastic-net cost function to sound field reproduction is compared to the lasso cost function. It is shown that the elastic-net can induce solution sparsity and overcomes limitations of the lasso: The elastic-net solves the non-uniqueness of the lasso solution, induces source clustering in the sparse solution, and provides a smoother solution within the activated source clusters.
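The contrast between the two cost functions can be illustrated on a toy problem. The sketch below is not the authors' formulation: it assumes real-valued data, the objective (1/2n)||y - Xw||^2 + lam1*||w||_1 + (lam2/2)*||w||_2^2, and a plain coordinate descent solver. The two identical columns stand in for two indistinguishable reproduction sources.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator used by L1-penalized coordinate updates."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def elastic_net_cd(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for the assumed objective; lam2 = 0 gives the lasso."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # mean squared value of column j
            for i in range(n):
                fit_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, lam1) / (z + lam2)
    return w


# Two identical columns: any split of the weight fits the data equally well.
X = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
w_lasso = elastic_net_cd(X, y, lam1=0.1, lam2=0.0)
w_enet = elastic_net_cd(X, y, lam1=0.1, lam2=0.5)
```

In this toy case the lasso run leaves essentially all of the weight on the column that is updated first, which is one of many equally good solutions, while the elastic-net run shares the weight equally between the two columns. That equal sharing is the clustering behavior the abstract describes.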
Prediction of Baseflow Index of Catchments using Machine Learning Algorithms
NASA Astrophysics Data System (ADS)
Yadav, B.; Hatfield, K.
2017-12-01
We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elastic net, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on the baseflow separated from the daily streamflow hydrograph using the HYSEP filter. The surrogate catchment attributes were compiled from multiple sources including digital elevation model, soil, land use, and climate data, as well as other publicly available ancillary and geospatial data. 80% of the catchments were used to train the ML algorithms, and the remaining 20% of the catchments were used as an independent test set to measure the generalization performance of the fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables, selected after careful evaluation of the bias-variance tradeoff, include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation. 
The most promising algorithms exceeding an accuracy score (r-square) of 0.7 on test data include support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Considering both the accuracy and the computational complexity of these algorithms, we identify the extremely randomized trees as the best performing algorithm for BFI prediction in ungauged basins.
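The tuning procedure described above (k-fold cross-validation combined with an exhaustive grid search) can be sketched as follows. This is a simplified illustration, not the study's pipeline: it assumes a single predictor, a no-intercept ridge model with a closed-form slope, contiguous unshuffled folds, and hypothetical function names.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_1d(x, y, lam):
    """Closed-form slope of a no-intercept ridge fit with one predictor."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def cv_mse(x, y, lam, k=5):
    """Validation mean squared error averaged over the k folds."""
    total = 0.0
    for fold in kfold_indices(len(x), k):
        held_out = set(fold)
        x_train = [x[i] for i in range(len(x)) if i not in held_out]
        y_train = [y[i] for i in range(len(x)) if i not in held_out]
        slope = fit_ridge_1d(x_train, y_train, lam)
        total += sum((y[i] - slope * x[i]) ** 2 for i in fold) / len(fold)
    return total / k


def grid_search(x, y, grid, k=5):
    """Return the penalty value in `grid` with the lowest cross-validated error."""
    return min(grid, key=lambda lam: cv_mse(x, y, lam, k))
```

In a setting like the one in the abstract, the grid search would be run on the 80% training portion only, and the 20% hold-out set would be touched once at the end to estimate generalization performance.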
DOE Office of Scientific and Technical Information (OSTI.GOV)
De Ruyck, Kim, E-mail: kim.deruyck@UGent.be; Sabbe, Nick; Oberije, Cary
2011-10-01
Purpose: To construct a model for the prediction of acute esophagitis in lung cancer patients receiving chemoradiotherapy by combining clinical data, treatment parameters, and genotyping profile. Patients and Methods: Data were available for 273 lung cancer patients treated with curative chemoradiotherapy. Clinical data included gender, age, World Health Organization performance score, nicotine use, diabetes, chronic disease, tumor type, tumor stage, lymph node stage, tumor location, and medical center. Treatment parameters included chemotherapy, surgery, radiotherapy technique, tumor dose, mean fractionation size, mean and maximal esophageal dose, and overall treatment time. A total of 332 genetic polymorphisms were considered in 112 candidate genes. The predicting model was achieved by lasso logistic regression for predictor selection, followed by classic logistic regression for unbiased estimation of the coefficients. Performance of the model was expressed as the area under the curve of the receiver operating characteristic and as the false-negative rate in the optimal point on the receiver operating characteristic curve. Results: A total of 110 patients (40%) developed acute esophagitis Grade ≥2 (Common Terminology Criteria for Adverse Events v3.0). The final model contained chemotherapy treatment, lymph node stage, mean esophageal dose, gender, overall treatment time, radiotherapy technique, rs2302535 (EGFR), rs16930129 (ENG), rs1131877 (TRAF3), and rs2230528 (ITGB2). The area under the curve was 0.87, and the false-negative rate was 16%. Conclusion: Prediction of acute esophagitis can be improved by combining clinical, treatment, and genetic factors. A multicomponent prediction model for acute esophagitis with a sensitivity of 84% was constructed with two clinical parameters, four treatment parameters, and four genetic polymorphisms.
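The two performance figures quoted above are linked by simple definitions. As a minimal sketch with hypothetical function names: the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and sensitivity is the complement of the false-negative rate, which is how a 16% false-negative rate corresponds to the 84% sensitivity in the conclusion.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and
    negative cases; tied scores count as half a win."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_from_fnr(fnr):
    """Sensitivity (true-positive rate) is one minus the false-negative rate."""
    return 1.0 - fnr
```

The pairwise form is quadratic in the number of cases, which is acceptable for a cohort of a few hundred patients; rank-based formulas give the same value more efficiently on larger data.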
Arribas-Gil, Ana; De la Cruz, Rolando; Lebarbier, Emilie; Meza, Cristian
2015-06-01
We propose a classification method for longitudinal data. The Bayes classifier is classically used to determine a classification rule where the underlying density in each class needs to be well modeled and estimated. This work is motivated by a real dataset of hormone levels measured at the early stages of pregnancy that can be used to predict normal versus abnormal pregnancy outcomes. The proposed model, which is a semiparametric linear mixed-effects model (SLMM), is a particular case of the semiparametric nonlinear mixed-effects class of models (SNMM) in which finite dimensional (fixed effects and variance components) and infinite dimensional (an unknown function) parameters have to be estimated. In SNMMs, maximum likelihood estimation is performed iteratively, alternating parametric and nonparametric procedures. However, if one can make the assumption that the random effects and the unknown function interact in a linear way, more efficient estimation methods can be used. Our contribution is the proposal of a unified estimation procedure based on a penalized EM-type algorithm. The Expectation and Maximization steps are explicit. In the Maximization step, the unknown function is estimated in a nonparametric fashion using a lasso-type procedure. A simulation study and an application on real data are performed. © 2015, The International Biometric Society.
A novel artificial neural network method for biomedical prediction based on matrix pseudo-inversion.
Cai, Binghuang; Jiang, Xia
2014-04-01
Biomedical prediction based on clinical and genome-wide data has become increasingly important in disease diagnosis and classification. To solve the prediction problem in an effective manner for the improvement of clinical care, we develop a novel Artificial Neural Network (ANN) method based on Matrix Pseudo-Inversion (MPI) for use in biomedical applications. The MPI-ANN is constructed as a three-layer (i.e., input, hidden, and output layers) feed-forward neural network, and the weights connecting the hidden and output layers are directly determined based on MPI without a lengthy learning iteration. The LASSO (Least Absolute Shrinkage and Selection Operator) method is also presented for comparative purposes. Single Nucleotide Polymorphism (SNP) simulated data and real breast cancer data are employed to validate the performance of the MPI-ANN method via 5-fold cross validation. Experimental results demonstrate the efficacy of the developed MPI-ANN for disease classification and prediction, in view of the significantly superior accuracy (i.e., the rate of correct predictions), as compared with LASSO. The results based on the real breast cancer data also show that the MPI-ANN has better performance than other machine learning methods (including support vector machine (SVM), logistic regression (LR), and an iterative ANN). In addition, experiments demonstrate that our MPI-ANN could be used for bio-marker selection as well. Copyright © 2013 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Samala, Ravi K.; Chan, Heang-Ping; Hadjiiski, Lubomir; Helvie, Mark A.; Kim, Renaid
2017-03-01
Understanding the key radiogenomic associations for breast cancer between DCE-MRI and micro-RNA expressions is the foundation for the discovery of radiomic features as biomarkers for assessing tumor progression and prognosis. We conducted a study to analyze the radiogenomic associations for breast cancer using the TCGA-TCIA data set. The core idea that tumor etiology is a function of the behavior of miRNAs is used to build the regression models. The associations based on regression are analyzed for three study outcomes: diagnosis, prognosis, and treatment. The diagnosis group consists of miRNAs associated with clinicopathologic features of breast cancer and significant aberration of expression in breast cancer patients. The prognosis group consists of miRNAs which are closely associated with tumor suppression and regulation of cell proliferation and differentiation. The treatment group consists of miRNAs that contribute significantly to the regulation of metastasis, thereby having the potential to be part of therapeutic mechanisms. As a first step, important miRNA expressions were identified and their ability to classify the clinical phenotypes based on the study outcomes was evaluated using the area under the ROC curve (AUC) as a figure-of-merit. The key mappings between the selected miRNAs and radiomic features were determined using least absolute shrinkage and selection operator (LASSO) regression analysis within a two-loop leave-one-out cross-validation strategy. These key associations indicated a number of radiomic features from DCE-MRI to be potential biomarkers for the three study outcomes.
Shao, Xiaolong; Li, Hui; Wang, Nan; Zhang, Qiang
2015-10-21
An electronic nose (e-nose) was used to characterize sesame oils processed by three different methods (hot-pressed, cold-pressed, and refined), as well as blends of the sesame oils and soybean oil. Seven classification and prediction methods, namely PCA, LDA, PLS, KNN, SVM, LASSO and RF, were used to analyze the e-nose data. The classification accuracy and MAUC were employed to evaluate the performance of these methods. The results indicated that sesame oils processed with different methods resulted in different sensor responses, with cold-pressed sesame oil producing the strongest sensor signals, followed by the hot-pressed sesame oil. The blends of pressed sesame oils with refined sesame oil were more difficult to distinguish than the blends of pressed sesame oils and refined soybean oil. LDA, KNN, and SVM outperformed the other classification methods in distinguishing sesame oil blends. KNN, LASSO, PLS, SVM (with linear kernel), and RF models could adequately predict the adulteration level (% of added soybean oil) in the sesame oil blends. Among the prediction models, KNN with k = 1 and 2 yielded the best prediction results.
Lin, Chen-Yen; Halabi, Susan
2017-01-01
We propose a minimand perturbation method to derive the confidence regions for the regularized estimators for the Cox’s proportional hazards model. Although the regularized estimation procedure produces a more stable point estimate, it remains challenging to provide an interval estimator or an analytic variance estimator for the associated point estimate. Based on the sandwich formula, the current variance estimator provides a simple approximation, but its finite sample performance is not entirely satisfactory. Besides, the sandwich formula can only provide variance estimates for the non-zero coefficients. In this article, we present a generic description for the perturbation method and then introduce a computation algorithm using the adaptive least absolute shrinkage and selection operator (LASSO) penalty. Through simulation studies, we demonstrate that our method can better approximate the limiting distribution of the adaptive LASSO estimator and produces more accurate inference compared with the sandwich formula. The simulation results also indicate the possibility of extending the applications to the adaptive elastic-net penalty. We further demonstrate our method using data from a phase III clinical trial in prostate cancer. PMID:29326496
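For orientation, the adaptive LASSO penalty named in the abstract above is usually written as a weighted L1 penalty added to the negative log partial likelihood. The abstract does not state the formula, so the display below is only the standard textbook form; the symbols $\ell_n$ (log partial likelihood), $\tilde{\beta}$ (an initial consistent estimate), $\lambda$ and $\gamma > 0$ are notation assumed here, not taken from the paper.

```latex
\hat{\beta}
  = \arg\min_{\beta}
    \left\{ -\ell_n(\beta) + \lambda \sum_{j=1}^{p} w_j \lvert \beta_j \rvert \right\},
\qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}} .
```

Coefficients with a small initial estimate receive a large weight and are shrunk harder, which is why the estimator can set some of them exactly to zero.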
Network Inference via the Time-Varying Graphical Lasso
Hallac, David; Park, Youngsuk; Boyd, Stephen; Leskovec, Jure
2018-01-01
Many important problems can be modeled as a system of interconnected entities, where each entity is recording time-dependent observations or measurements. In order to spot trends, detect anomalies, and interpret the temporal dynamics of such data, it is essential to understand the relationships between the different entities and how these relationships evolve over time. In this paper, we introduce the time-varying graphical lasso (TVGL), a method of inferring time-varying networks from raw time series data. We cast the problem in terms of estimating a sparse time-varying inverse covariance matrix, which reveals a dynamic network of interdependencies between the entities. Since dynamic network inference is a computationally expensive task, we derive a scalable message-passing algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in an efficient way. We also discuss several extensions, including a streaming algorithm to update the model and incorporate new observations in real time. Finally, we evaluate our TVGL algorithm on both real and synthetic datasets, obtaining interpretable results and outperforming state-of-the-art baselines in terms of both accuracy and scalability. PMID:29770256
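The estimation problem described above, a sparse inverse covariance matrix at each time point with a penalty on change over time, is commonly written as the optimization problem below. This is a sketch of the usual form; the abstract itself gives no formula, so the notation ($\Theta_t$ for the inverse covariance at time $t$, $S_t$ for the empirical covariance, $n_t$ for the number of observations, $\lVert\cdot\rVert_{\mathrm{od},1}$ for the off-diagonal L1 norm, and $\psi$ for the temporal penalty) should be read as assumed.

```latex
\min_{\Theta_1, \dots, \Theta_T \succ 0} \;
  \sum_{t=1}^{T} \left[ -\ell_t(\Theta_t)
    + \lambda \lVert \Theta_t \rVert_{\mathrm{od},1} \right]
  + \beta \sum_{t=2}^{T} \psi\!\left( \Theta_t - \Theta_{t-1} \right),
\qquad
\ell_t(\Theta_t) = n_t \left( \log\det \Theta_t - \operatorname{tr}(S_t \Theta_t) \right).
```

Here $\lambda$ controls the sparsity of each network and $\beta$ controls how strongly consecutive networks are tied together.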
Formation enthalpies for transition metal alloys using machine learning
NASA Astrophysics Data System (ADS)
Ubaru, Shashanka; Miedlar, Agnieszka; Saad, Yousef; Chelikowsky, James R.
2017-06-01
The enthalpy of formation is an important thermodynamic property. Developing fast and accurate methods for its prediction is of practical interest in a variety of applications. Material informatics techniques based on machine learning have recently been introduced in the literature as an inexpensive means of exploiting materials data, and can be used to examine a variety of thermodynamics properties. We investigate the use of such machine learning tools for predicting the formation enthalpies of binary intermetallic compounds that contain at least one transition metal. We consider certain easily available properties of the constituting elements complemented by some basic properties of the compounds, to predict the formation enthalpies. We show how choosing these properties (input features) based on a literature study (using prior physics knowledge) seems to outperform machine learning based feature selection methods such as sensitivity analysis and LASSO (least absolute shrinkage and selection operator) based methods. A nonlinear kernel based support vector regression method is employed to perform the predictions. The predictive ability of our model is illustrated via several experiments on a dataset containing 648 binary alloys. We train and validate the model using the formation enthalpies calculated using a model by Miedema, which is a popular semiempirical model used for the prediction of formation enthalpies of metal alloys.
Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun
2014-01-01
As an abstract mapping of the gene regulations in the cell, the gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets as well as to take the robustness to large errors or outliers into account. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulation datasets and real experimental datasets. It shows that Huber group LASSO outperforms the group LASSO in terms of both areas under receiver operating characteristic curves and areas under the precision-recall curves. The convergence analysis of the algorithm theoretically shows that the sequence generated from the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
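The robustness to outliers mentioned above comes from replacing the squared-error loss with the Huber loss. A minimal sketch of that loss follows; the function name and the threshold parameter `delta` are illustrative choices, and the group penalty and the authors' block-descent algorithm are not reproduced here.

```python
def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones.

    `delta` (an assumed, illustrative threshold) marks where the loss
    switches from quadratic to linear growth.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    # Linear branch, chosen so the two pieces meet smoothly at |residual| == delta
    return delta * (a - 0.5 * delta)
```

Because the loss grows only linearly beyond `delta`, a single large residual contributes far less than it would under squared error (2.5 instead of 4.5 for a residual of 3 with `delta=1`).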
Ke, Tracy; Fan, Jianqing; Wu, Yichao
2014-01-01
This paper explores the homogeneity of coefficients in high-dimensional regression, which extends the sparsity concept and is more general and suitable for many applications. Homogeneity arises when regression coefficients corresponding to neighboring geographical regions or a similar cluster of covariates are expected to be approximately the same. Sparsity corresponds to a special case of homogeneity with a large cluster of known atom zero. In this article, we propose a new method called clustering algorithm in regression via data-driven segmentation (CARDS) to explore homogeneity. New mathematics are provided on the gain that can be achieved by exploring homogeneity. Statistical properties of two versions of CARDS are analyzed. In particular, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy for homogeneous parameters than that without homogeneity exploration. When our methods are combined with sparsity exploration, further efficiency can be achieved beyond the exploration of sparsity alone. This provides additional insights into the power of exploring low-dimensional structures in high-dimensional regression: homogeneity and sparsity. Our results also shed light on the properties of the fused Lasso. The newly developed method is further illustrated by simulation studies and applications to real data. Supplementary materials for this article are available online. PMID:26085701
Network Data: Statistical Theory and New Models
2016-02-17
During this period of review, Bin Yu worked on many thrusts of high-dimensional statistical theory and methodologies. Her...research covered a wide range of topics in statistics including analysis and methods for spectral clustering for sparse and structured networks...2,7,8,21], sparse modeling (e.g. Lasso) [4,10,11,17,18,19], statistical guarantees for the EM algorithm [3], statistical analysis of algorithm leveraging
Resampling procedures to identify important SNPs using a consensus approach.
Pardy, Christopher; Motyer, Allan; Wilson, Susan
2011-11-29
Our goal is to identify common single-nucleotide polymorphisms (SNPs) (minor allele frequency > 1%) that add predictive accuracy above that gained by knowledge of easily measured clinical variables. We take an algorithmic approach to predict each phenotypic variable using a combination of phenotypic and genotypic predictors. We perform our procedure on the first simulated replicate and then validate against the others. Our procedure performs well when predicting Q1 but is less successful for the other outcomes. We use resampling procedures where possible to guard against false positives and to improve generalizability. The approach is based on finding a consensus regarding important SNPs by applying random forests and the least absolute shrinkage and selection operator (LASSO) on multiple subsamples. Random forests are used first to discard unimportant predictors, narrowing our focus to roughly 100 important SNPs. A cross-validation LASSO is then used to further select variables. We combine these procedures to guarantee that cross-validation can be used to choose a shrinkage parameter for the LASSO. If the clinical variables were unavailable, this prefiltering step would be essential. We perform the SNP-based analyses simultaneously rather than one at a time to estimate SNP effects in the presence of other causal variants. We analyzed the first simulated replicate of Genetic Analysis Workshop 17 without knowledge of the true model. Post-conference knowledge of the simulation parameters allowed us to investigate the limitations of our approach. We found that many of the false positives we identified were substantially correlated with genuine causal SNPs.
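The shrinkage parameter referred to above controls how many coefficients the LASSO sets exactly to zero. As a rough illustration, here is a minimal pure-Python sketch of LASSO fitted by cyclic coordinate descent for the objective (1/2n)·||y − Xb||² + lam·||b||₁. Coordinate descent is a standard way to fit the LASSO but is an assumption here, since the abstract does not say which solver was used; the function names, `lam`, and `n_iter` are all illustrative.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    X is a list of rows, y a list of responses. No intercept is fitted
    and columns are not standardized (simplifying assumptions).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes its own fit
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            step = new_bj - beta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * step
                beta[j] = new_bj
    return beta
```

In a cross-validated setting, a function like this would be called over a grid of `lam` values on each training fold, keeping the value with the lowest held-out error.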
Sparse image reconstruction for molecular imaging.
Ting, Michael; Raich, Raviv; Hero, Alfred O
2009-06-01
The application that motivates this paper is molecular imaging at the atomic level. When discretized at subatomic distances, the volume is inherently sparse. Noiseless measurements from an imaging technology can be modeled by convolution of the image with the system point spread function (psf). Such is the case with magnetic resonance force microscopy (MRFM), an emerging technology where imaging of an individual tobacco mosaic virus was recently demonstrated with nanometer resolution. We also consider additive white Gaussian noise (AWGN) in the measurements. Many prior works of sparse estimators have focused on the case when H has low coherence; however, the system matrix H in our application is the convolution matrix for the system psf. A typical convolution matrix has high coherence. This paper, therefore, does not assume a low coherence H. A discrete-continuous form of the Laplacian and atom at zero (LAZE) p.d.f. used by Johnstone and Silverman is formulated, and two sparse estimators are derived by maximizing the joint p.d.f. of the observation and image conditioned on the hyperparameters. A thresholding rule that generalizes the hard and soft thresholding rule appears in the course of the derivation. This so-called hybrid thresholding rule, when used in the iterative thresholding framework, gives rise to the hybrid estimator, a generalization of the lasso. Estimates of the hyperparameters for the lasso and hybrid estimator are obtained via Stein's unbiased risk estimate (SURE). A numerical study with a Gaussian psf and two sparse images shows that the hybrid estimator outperforms the lasso.
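The hard and soft thresholding rules that the hybrid rule generalizes can be sketched in a few lines. The functions below show only the two classical rules applied to a scalar; the hybrid rule itself is not given in the abstract and is not reproduced, and the function and parameter names are illustrative.

```python
def soft_threshold(x, t):
    """Soft rule: values with |x| <= t become 0, others move toward 0 by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def hard_threshold(x, t):
    """Hard rule: values with |x| <= t become 0, others are kept unchanged."""
    return x if abs(x) > t else 0.0
```

Soft thresholding is the rule that appears in iterative solvers for the lasso; hard thresholding keeps surviving values unshrunk, at the cost of a discontinuity at the threshold.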
A deep auto-encoder model for gene expression prediction.
Xie, Rui; Wen, Jia; Quitadamo, Andrew; Cheng, Jianlin; Shi, Xinghua
2017-11-17
Gene expression is a key intermediate level through which genotypes lead to a particular trait. Gene expression is affected by various factors including genotypes of genetic variants. With the aim of delineating the genetic impact on gene expression, we build a deep auto-encoder model to assess how well genetic variants contribute to gene expression changes. This new deep learning model is a regression-based predictive model based on the MultiLayer Perceptron and Stacked Denoising Auto-encoder (MLP-SAE). The model is trained using a stacked denoising auto-encoder for feature selection and a multilayer perceptron framework for backpropagation. We further improve the model by introducing dropout to prevent overfitting and improve performance. To demonstrate the usage of this model, we apply MLP-SAE to a real genomic dataset with genotypes and gene expression profiles measured in yeast. Our results show that the MLP-SAE model with dropout outperforms other models including Lasso, Random Forests and the MLP-SAE model without dropout. Using the MLP-SAE model with dropout, we show that gene expression quantifications predicted by the model solely based on genotypes align well with true gene expression patterns. We provide a deep auto-encoder model for predicting gene expression from SNP genotypes. This study demonstrates that deep learning is appropriate for tackling another genomic problem, i.e., building predictive models to understand genotypes' contribution to gene expression. With the emerging availability of richer genomic data, we anticipate that deep learning models will play a bigger role in modeling and interpreting genomics.
Oracle estimation of parametric models under boundary constraints.
Wong, Kin Yau; Goldberg, Yair; Fine, Jason P
2016-12-01
In many classical estimation problems, the parameter space has a boundary. In most cases, the standard asymptotic properties of the estimator do not hold when some of the underlying true parameters lie on the boundary. However, without knowledge of the true parameter values, confidence intervals constructed assuming that the parameters lie in the interior are generally over-conservative. A penalized estimation method is proposed in this article to address this issue. An adaptive lasso procedure is employed to shrink the parameters to the boundary, yielding oracle inference which adapts to whether or not the true parameters are on the boundary. When the true parameters are on the boundary, the inference is equivalent to that which would be achieved with a priori knowledge of the boundary, while if the converse is true, the inference is equivalent to that which is obtained in the interior of the parameter space. The method is demonstrated under two practical scenarios, namely the frailty survival model and linear regression with order-restricted parameters. Simulation studies and real data analyses show that the method performs well with realistic sample sizes and exhibits certain advantages over standard methods. © 2016, The International Biometric Society.
Colonic volvulus in the United States: trends, outcomes, and predictors of mortality.
Halabi, Wissam J; Jafari, Mehraneh D; Kang, Celeste Y; Nguyen, Vinh Q; Carmichael, Joseph C; Mills, Steven; Pigazzi, Alessio; Stamos, Michael J
2014-02-01
Colonic volvulus is a rare entity associated with high mortality rates. Most studies come from areas of high endemicity and are limited by small numbers. No studies have investigated trends, outcomes, and predictors of mortality at the national level. The Nationwide Inpatient Sample 2002-2010 was retrospectively reviewed for colonic volvulus cases admitted emergently. Patients' demographics, hospital factors, and outcomes of the different procedures were analyzed. The LASSO algorithm for logistic regression was used to build a predictive model for mortality in cases of sigmoid (SV) and cecal volvulus (CV) taking into account preoperative and operative variables. An estimated 3,351,152 cases of bowel obstruction were admitted in the United States over the study period. Colonic volvulus was found to be the cause in 63,749 cases (1.90%). The incidence of CV increased by 5.53% per year whereas the incidence of SV remained stable. SV was more common in elderly males (aged 70 years), African Americans, and patients with diabetes and neuropsychiatric disorders. In contrast, CV was more common in younger females. Nonsurgical decompression alone was used in 17% of cases. Among cases managed surgically, resective procedures were performed in 89% of cases, whereas operative detorsion with or without fixation procedures remained uncommon. Mortality rates were 9.44% for SV, 6.64% for CV, 17% for synchronous CV and SV, and 18% for transverse colon volvulus. The LASSO algorithm identified bowel gangrene and peritonitis, coagulopathy, age, the use of stoma, and chronic kidney disease as strong predictors of mortality. Colonic volvulus is a rare cause of bowel obstruction in the United States and is associated with high mortality rates. CV and SV affect different populations and the incidence of CV is on the rise. The presence of bowel gangrene and coagulopathy strongly predicts mortality, suggesting that prompt diagnosis and management are essential.
MO-FG-202-09: Virtual IMRT QA Using Machine Learning: A Multi-Institutional Validation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Valdes, G; Scheuermann, R; Solberg, T
Purpose: To validate a machine learning approach to Virtual IMRT QA for accurately predicting gamma passing rates using different QA devices at different institutions. Methods: A Virtual IMRT QA was constructed using a machine learning algorithm based on 416 IMRT plans, in which QA measurements were performed using diode-array detectors and a 3%local/3mm with 10% threshold. An independent set of 139 IMRT measurements from a different institution, with QA data based on portal dosimetry using the same gamma index and 10% threshold, was used to further test the algorithm. Plans were characterized by 90 different complexity metrics. A weighted Poisson regression with Lasso regularization was trained to predict passing rates using the complexity metrics as input. Results: In addition to predicting passing rates with 3% accuracy for all composite plans using diode-array detectors, passing rates for portal dosimetry on per-beam basis were predicted with an error <3.5% for 120 IMRT measurements. The remaining measurements (19) had large areas of low CU, where portal dosimetry has larger disagreement with the calculated dose and, as such, large errors were expected. These beams need to be further modeled to correct the under-response in low dose regions. Important features selected by Lasso to predict gamma passing rates were: complete irradiated area outline (CIAO) area, jaw position, fraction of MLC leafs with gaps smaller than 20 mm or 5mm, fraction of area receiving less than 50% of the total CU, fraction of the area receiving dose from penumbra, weighted Average Irregularity Factor, duty cycle among others. Conclusion: We have demonstrated that the Virtual IMRT QA can predict passing rates using different QA devices and across multiple institutions. Prediction of QA passing rates could have profound implications on the current IMRT process.
Zawbaa, Hossam M.; Szlȩk, Jakub; Grosan, Crina; Jachowicz, Renata; Mendyk, Aleksander
2016-01-01
Poly-lactide-co-glycolide (PLGA) is a copolymer of lactic and glycolic acid. Drug release from PLGA microspheres depends not only on polymer properties but also on drug type, particle size, morphology of microspheres, release conditions, etc. Selecting a subset of relevant properties for PLGA is a challenging machine learning task as there are over three hundred features to consider. In this work, we formulate the selection of critical attributes for PLGA as a multiobjective optimization problem with the aim of minimizing the error of predicting the dissolution profile while reducing the number of attributes selected. Four bio-inspired optimization algorithms (antlion optimization, binary version of antlion optimization, grey wolf optimization, and social spider optimization) are used to select the optimal feature set for predicting the dissolution profile of PLGA. Besides these, the LASSO algorithm is also used for comparison. Selection of crucial variables is performed under the assumption that both predictability and model simplicity are of equal importance to the final result. During the feature selection process, a set of input variables is employed to find minimum generalization error across different predictive models and their settings/architectures. The methodology is evaluated using predictive modeling for which various tools are chosen, such as Cubist, random forests, artificial neural networks (monotonic MLP, deep learning MLP), multivariate adaptive regression splines, classification and regression tree, and hybrid systems of fuzzy logic and evolutionary computations (fugeR). The experimental results are compared with the results reported by Szlȩk. We obtain a normalized root mean square error (NRMSE) of 15.97% versus 15.4%, and the number of selected input features is smaller, nine versus eleven. PMID:27315205
What explains health in persons with visual impairment?
2014-01-01
Background: Visual impairment is associated with important limitations in functioning. The International Classification of Functioning, Disability and Health (ICF) adopted by the World Health Organisation (WHO) relies on a globally accepted framework for classifying problems in functioning and the influence of contextual factors. Its comprehensive perspective, including biological, individual and social aspects of health, enables the ICF to describe the whole health experience of persons with visual impairment. The objectives of this study are (1) to analyze whether the ICF can be used to comprehensively describe the problems in functioning of persons with visual impairment and the environmental factors that influence their lives and (2) to select the ICF categories that best capture self-perceived health of persons with visual impairment. Methods: Data from 105 persons with visual impairment were collected, including socio-demographic data, vision-related data, the Extended ICF Checklist and the visual analogue scale of the EuroQoL-5D, to assess self-perceived health. Descriptive statistics and a Group Lasso regression were performed. The main outcome measures were functioning defined as impairments in Body functions and Body structures, limitations in Activities and restrictions in Participation, influencing Environmental factors and self-perceived health. Results: In total, 120 ICF categories covering a broad range of Body functions, Body structures, aspects of Activities and Participation and Environmental factors were identified. Thirteen ICF categories that best capture self-perceived health were selected based on the Group Lasso regression. While Activities-and-Participation categories were selected most frequently, the greatest impact on self-perceived health was found in Body-functions categories. The ICF can be used as a framework to comprehensively describe the problems of persons with visual impairment and the Environmental factors which influence their lives. Conclusions: There are plenty of ICF categories, Environmental-factors categories in particular, which are relevant to persons with visual impairment, but have hardly ever been taken into consideration in literature and visual impairment-specific patient-reported outcome measures. PMID:24886326
NELasso: Group-Sparse Modeling for Characterizing Relations Among Named Entities in News Articles.
Tariq, Amara; Karim, Asim; Foroosh, Hassan
2017-10-01
Named entities such as people, locations, and organizations play a vital role in characterizing online content. They often reflect information of interest and are frequently used in search queries. Although named entities can be detected reliably from textual content, extracting relations among them is more challenging, yet useful in various applications (e.g., news recommending systems). In this paper, we present a novel model and system for learning semantic relations among named entities from collections of news articles. We model each named entity occurrence with sparse structured logistic regression, and consider the words (predictors) to be grouped based on background semantics. This sparse group LASSO approach forces the weights of word groups that do not influence the prediction towards zero. The resulting sparse structure is utilized for defining the type and strength of relations. Our unsupervised system yields a named entities' network where each relation is typed, quantified, and characterized in context. These relations are the key to understanding news material over time and customizing newsfeeds for readers. Extensive evaluation of our system on articles from TIME magazine and BBC News shows that the learned relations correlate with static semantic relatedness measures like WLM, and capture the evolving relationships among named entities over time.
Li, Wen; Zhao, Li-Zhong; Ma, Dong-Wang; Wang, De-Zheng; Shi, Lei; Wang, Hong-Lei; Dong, Mo; Zhang, Shu-Yi; Cao, Lei; Zhang, Wei-Hua; Zhang, Xi-Peng; Zhang, Qing-Huai; Yu, Lin; Qin, Hai; Wang, Xi-Mo; Chen, Sam Li-Sheng
2018-05-01
We aimed to predict colorectal cancer (CRC) based on the demographic features and clinical correlates of personal symptoms and signs from Tianjin community-based CRC screening data. A total of 891,199 residents who were aged 60 to 74 and were screened in 2012 were enrolled. The Lasso logistic regression model was used to identify the predictors for CRC. Predictive validity was assessed by the receiver operating characteristic (ROC) curve. A bootstrapping method was also performed to validate this prediction model. CRC was best predicted by a model that included age, sex, education level, occupations, diarrhea, constipation, colon mucosa and bleeding, gallbladder disease, a stressful life event, family history of CRC, and a positive fecal immunochemical test (FIT). The area under curve (AUC) for the questionnaire with a FIT was 84% (95% CI: 82%-86%), followed by 76% (95% CI: 74%-79%) for a FIT alone, and 73% (95% CI: 71%-76%) for the questionnaire alone. With 500 bootstrap replications, the estimated optimism (<0.005) shows good discrimination in validation of the prediction model. A risk prediction model for CRC based on a series of symptoms and signs related to enteric diseases in combination with a FIT was developed from the first round of screening. The results of the current study are useful for increasing the awareness of high-risk subjects and for individual-risk-guided invitations or strategies to achieve mass screening for CRC.
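The AUC values quoted above summarize how well predicted risks rank cases above non-cases. A minimal sketch of that calculation follows, using the pairwise (Mann-Whitney) definition of the AUC; the function name and the 0/1 label coding are assumptions made for illustration, and the study's own software is not known from the abstract.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (case, non-case) pairs in which the case
    has the higher score, counting ties as half. Labels are 1 for a
    case and 0 for a non-case (an assumed coding).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A bootstrap estimate of optimism would repeat this calculation on models refitted to resampled data and compare the apparent AUC in each resample with the AUC of the same model on the original data.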
NASA Astrophysics Data System (ADS)
González, D. L., II; Angus, M. P.; Tetteh, I. K.; Bello, G. A.; Padmanabhan, K.; Pendse, S. V.; Srinivas, S.; Yu, J.; Semazzi, F.; Kumar, V.; Samatova, N. F.
2014-04-01
Decades of hypothesis-driven and/or first-principles research have been applied towards the discovery and explanation of the mechanisms that drive climate phenomena, such as western African Sahel summer rainfall variability. Although connections between various climate factors have been theorized, not all of the key relationships are fully understood. We propose a data-driven approach to identify candidate players in this climate system, which can help explain underlying mechanisms and/or even suggest new relationships, to facilitate building a more comprehensive and predictive model of the modulatory relationships influencing a climate phenomenon of interest. We applied coupled heterogeneous association rule mining (CHARM), Lasso multivariate regression, and Dynamic Bayesian networks to find relationships within a complex system, and explored means with which to obtain a consensus result from the application of such varied methodologies. Using this fusion of approaches, we identified relationships among climate factors that modulate Sahel rainfall, including well-known associations from prior climate knowledge, as well as promising discoveries that invite further research by the climate science community.
Shao, Xiaolong; Li, Hui; Wang, Nan; Zhang, Qiang
2015-01-01
An electronic nose (e-nose) was used to characterize sesame oils processed by three different methods (hot-pressed, cold-pressed, and refined), as well as blends of the sesame oils and soybean oil. Seven classification and prediction methods, namely PCA, LDA, PLS, KNN, SVM, LASSO and RF, were used to analyze the e-nose data. The classification accuracy and MAUC were employed to evaluate the performance of these methods. The results indicated that sesame oils processed with different methods resulted in different sensor responses, with cold-pressed sesame oil producing the strongest sensor signals, followed by the hot-pressed sesame oil. The blends of pressed sesame oils with refined sesame oil were more difficult to distinguish than the blends of pressed sesame oils and refined soybean oil. LDA, KNN, and SVM outperformed the other classification methods in distinguishing sesame oil blends. KNN, LASSO, PLS, SVM (with linear kernel), and RF models could adequately predict the adulteration level (% of added soybean oil) in the sesame oil blends. Among the prediction models, KNN with k = 1 and 2 yielded the best prediction results. PMID:26506350
Exploiting Genome Structure in Association Analysis
Kim, Seyoung
2014-01-01
A genome-wide association study involves examining a large number of single-nucleotide polymorphisms (SNPs) to identify SNPs that are significantly associated with the given phenotype, while trying to reduce the false positive rate. Although haplotype-based association methods have been proposed to accommodate correlation information across nearby SNPs that are in linkage disequilibrium, none of these methods directly incorporated structural information such as recombination events along the chromosome. In this paper, we propose a new approach called stochastic block lasso for association mapping that exploits prior knowledge on linkage disequilibrium structure in the genome such as recombination rates and distances between adjacent SNPs in order to increase the power of detecting true associations while reducing false positives. Following a typical linear regression framework with the genotypes as inputs and the phenotype as output, our proposed method employs a sparsity-enforcing Laplacian prior for the regression coefficients, augmented by a first-order Markov process along the sequence of SNPs that incorporates the prior information on the linkage disequilibrium structure. The Markov-chain prior models the structural dependencies between a pair of adjacent SNPs, and allows us to look for association SNPs in a coupled manner, combining strength from multiple nearby SNPs. Our results on HapMap-simulated datasets and mouse datasets show that there is a significant advantage in incorporating the prior knowledge on linkage disequilibrium structure for marker identification under whole-genome association. PMID:21548809
Functional Multi-Locus QTL Mapping of Temporal Trends in Scots Pine Wood Traits
Li, Zitong; Hallingbäck, Henrik R.; Abrahamsson, Sara; Fries, Anders; Gull, Bengt Andersson; Sillanpää, Mikko J.; García-Gil, M. Rosario
2014-01-01
Quantitative trait loci (QTL) mapping of wood properties in conifer species has focused on single time point measurements or on trait means based on heterogeneous wood samples (e.g., increment cores), thus ignoring systematic within-tree trends. In this study, functional QTL mapping was performed for a set of important wood properties in increment cores from a 17-yr-old Scots pine (Pinus sylvestris L.) full-sib family with the aim of detecting wood trait QTL for general intercepts (means) and for linear slopes by increasing cambial age. Two multi-locus functional QTL analysis approaches were proposed and their performances were compared on trait datasets comprising 2 to 9 time points, 91 to 455 individual tree measurements and genotype datasets of amplified length polymorphisms (AFLP), and single nucleotide polymorphism (SNP) markers. The first method was a multilevel LASSO analysis whereby trend parameter estimation and QTL mapping were conducted consecutively; the second method was our Bayesian linear mixed model whereby trends and underlying genetic effects were estimated simultaneously. We also compared several different hypothesis testing methods under either the LASSO or the Bayesian framework to perform QTL inference. In total, five and four significant QTL were observed for the intercepts and slopes, respectively, across wood traits such as earlywood percentage, wood density, radial fiberwidth, and spiral grain angle. Four of these QTL were represented by candidate gene SNPs, thus providing promising targets for future research in QTL mapping and molecular function. Bayesian and LASSO methods both detected similar sets of QTL given datasets that comprised large numbers of individuals. PMID:25305041
Functional multi-locus QTL mapping of temporal trends in Scots pine wood traits.
Li, Zitong; Hallingbäck, Henrik R; Abrahamsson, Sara; Fries, Anders; Gull, Bengt Andersson; Sillanpää, Mikko J; García-Gil, M Rosario
2014-10-09
Quantitative trait loci (QTL) mapping of wood properties in conifer species has focused on single time point measurements or on trait means based on heterogeneous wood samples (e.g., increment cores), thus ignoring systematic within-tree trends. In this study, functional QTL mapping was performed for a set of important wood properties in increment cores from a 17-yr-old Scots pine (Pinus sylvestris L.) full-sib family with the aim of detecting wood trait QTL for general intercepts (means) and for linear slopes by increasing cambial age. Two multi-locus functional QTL analysis approaches were proposed and their performances were compared on trait datasets comprising 2 to 9 time points, 91 to 455 individual tree measurements and genotype datasets of amplified length polymorphisms (AFLP), and single nucleotide polymorphism (SNP) markers. The first method was a multilevel LASSO analysis whereby trend parameter estimation and QTL mapping were conducted consecutively; the second method was our Bayesian linear mixed model whereby trends and underlying genetic effects were estimated simultaneously. We also compared several different hypothesis testing methods under either the LASSO or the Bayesian framework to perform QTL inference. In total, five and four significant QTL were observed for the intercepts and slopes, respectively, across wood traits such as earlywood percentage, wood density, radial fiberwidth, and spiral grain angle. Four of these QTL were represented by candidate gene SNPs, thus providing promising targets for future research in QTL mapping and molecular function. Bayesian and LASSO methods both detected similar sets of QTL given datasets that comprised large numbers of individuals. Copyright © 2014 Li et al.
Variable selection in discrete survival models including heterogeneity.
Groll, Andreas; Tutz, Gerhard
2017-04-01
Several variable selection procedures are available for continuous time-to-event data. However, if time is measured in a discrete way, so that many ties occur, models for continuous time are inadequate. We propose penalized likelihood methods that perform efficient variable selection in discrete survival modeling with explicit modeling of the heterogeneity in the population. The method is based on a combination of ridge and lasso type penalties that are tailored to the case of discrete survival. The performance is studied in simulation studies and an application to the birth of the first child.
Stochastic model search with binary outcomes for genome-wide association studies.
Russu, Alberto; Malovini, Alberto; Puca, Annibale A; Bellazzi, Riccardo
2012-06-01
The spread of case-control genome-wide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses. Our method is based on a latent variable model that links the observed outcomes to the underlying genetic variables. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor. BOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. Simulations show that BOSS achieves higher precisions than the reference methods while preserving good recall rates. In both experimental studies, BOSS successfully detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes. BOSS outperforms the other methods in terms of F-measure on simulated data. In the two real studies, BOSS successfully detects biologically relevant features, some of which are missed by univariate analysis and the three reference techniques. The proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results showed that BOSS proves effective in detecting relevant markers while providing a parsimonious model.
Robust Gaussian Graphical Modeling via l1 Penalization
Sun, Hokeun; Li, Hongzhe
2012-01-01
Gaussian graphical models have been widely used as an effective method for studying the conditional independency structure among genes and for constructing genetic networks. However, gene expression data typically have heavier tails or more outlying observations than the standard Gaussian distribution. Such outliers in gene expression data can lead to wrong inference on the dependency structure among the genes. We propose an l1-penalized estimation procedure for sparse Gaussian graphical models that is robustified against possible outliers. The likelihood function is weighted according to how far each observation deviates, where the deviation of an observation is measured by its own likelihood. An efficient computational algorithm based on the coordinate gradient descent method is developed to obtain the minimizer of the negative penalized robustified-likelihood, where nonzero elements of the concentration matrix represent the graphical links among the genes. After the graphical structure is obtained, we re-estimate the positive definite concentration matrix using an iterative proportional fitting algorithm. Through simulations, we demonstrate that the proposed robust method performs much better than the graphical Lasso for the Gaussian graphical models in terms of both graph structure selection and estimation when outliers are present. We apply the robust estimation procedure to an analysis of yeast gene expression data and show that the resulting graph has better biological interpretation than that obtained from the graphical Lasso. PMID:23020775
DOE Office of Scientific and Technical Information (OSTI.GOV)
Depeursinge, Adrien, E-mail: adrien.depeursinge@hevs.ch; Yanagawa, Masahiro; Leung, Ann N.
Purpose: To investigate the importance of presurgical computed tomography (CT) intensity and texture information from ground-glass opacities (GGO) and solid nodule components for the prediction of adenocarcinoma recurrence. Methods: For this study, 101 patients with surgically resected stage I adenocarcinoma were selected. During the follow-up period, 17 patients had disease recurrence with six associated cancer-related deaths. GGO and solid tumor components were delineated on presurgical CT scans by a radiologist. Computational texture models of GGO and solid regions were built using linear combinations of steerable Riesz wavelets learned with linear support vector machines (SVMs). Unlike other traditional texture attributes, the proposed texture models are designed to encode local image scales and directions that are specific to GGO and solid tissue. The responses of the locally steered models were used as texture attributes and compared to the responses of unaligned Riesz wavelets. The texture attributes were combined with CT intensities to predict tumor recurrence and patient hazard according to disease-free survival (DFS) time. Two families of predictive models were compared: LASSO and SVMs, and their survival counterparts: Cox-LASSO and survival SVMs. Results: The best-performing predictive model of patient hazard was associated with a concordance index (C-index) of 0.81 ± 0.02 and was based on the combination of the steered models and CT intensities with survival SVMs. The same feature group and the LASSO model yielded the highest area under the receiver operating characteristic curve (AUC) of 0.8 ± 0.01 for predicting tumor recurrence, although no statistically significant difference was found when compared to using intensity features solely. For all models, the performance was found to be significantly higher when image attributes were based on the solid components solely versus using the entire tumors (p < 3.08 × 10^-5).
Conclusions: This study constitutes a novel perspective on how to interpret imaging information from CT examinations by suggesting that most of the information related to adenocarcinoma aggressiveness is related to the intensity and morphological properties of solid components of the tumor. The prediction of adenocarcinoma relapse was found to have low specificity but very high sensitivity. Our results could be useful in clinical practice to identify patients for whom no recurrence is expected with a very high confidence using a presurgical CT scan only. It also provided an accurate estimation of the risk of recurrence after a given duration t from surgical resection (i.e., C-index = 0.81 ± 0.02).
Matching a Distribution by Matching Quantiles Estimation
Sgouropoulos, Nikolaos; Yao, Qiwei; Yastremiz, Claudia
2015-01-01
Motivated by the problem of selecting representative portfolios for backtesting counterparty credit risks, we propose a matching quantiles estimation (MQE) method for matching a target distribution by that of a linear combination of a set of random variables. An iterative procedure based on the ordinary least-squares estimation (OLS) is proposed to compute MQE. MQE can be easily modified by adding a LASSO penalty term if a sparse representation is desired, or by restricting the matching within certain range of quantiles to match a part of the target distribution. The convergence of the algorithm and the asymptotic properties of the estimation, both with or without LASSO, are established. A measure and an associated statistical test are proposed to assess the goodness-of-match. The finite sample properties are illustrated by simulation. An application in selecting a counterparty representative portfolio with a real dataset is reported. The proposed MQE also finds applications in portfolio tracking, which demonstrates the usefulness of combining MQE with LASSO. PMID:26692592
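The iterative OLS-based procedure can be sketched as follows, under the assumption that each iteration pairs the k-th smallest target value with the observation whose current fitted value is k-th smallest and then refits by ordinary least squares. This is an illustrative reading of the abstract, not the authors' code; the helper names, the simulated data, and the starting coefficients are all hypothetical.

```python
import random


def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def ols(rows, targets):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def fitted(rows, beta):
    """Linear combination of the variables for every observation."""
    return [sum(x * b for x, b in zip(r, beta)) for r in rows]


def quantile_loss(rows, beta, target):
    """Mean squared gap between sorted fitted values and sorted target values."""
    pairs = zip(sorted(fitted(rows, beta)), sorted(target))
    return sum((u - v) ** 2 for u, v in pairs) / len(target)


def mqe(rows, target, beta, iterations=20):
    """Alternate between rank-matching the target and refitting by OLS."""
    sorted_target = sorted(target)
    for _ in range(iterations):
        values = fitted(rows, beta)
        order = sorted(range(len(values)), key=values.__getitem__)
        reordered = [0.0] * len(values)
        # The row with the k-th smallest fit receives the k-th smallest target.
        for rank, row_index in enumerate(order):
            reordered[row_index] = sorted_target[rank]
        beta = ols(rows, reordered)
    return beta


# Simulated example: two variables, fifty observations, arbitrary target sample.
rng = random.Random(0)
rows = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(50)]
target = [rng.gauss(0.0, 2.0) for _ in range(50)]
start = [1.0, 0.0]
loss_before = quantile_loss(rows, start, target)
beta_hat = mqe(rows, target, start)
loss_after = quantile_loss(rows, beta_hat, target)
```

Under this reading, each pass cannot increase the quantile-matching loss, because the OLS step minimizes the squared error for the current pairing and sorting both sides gives the best possible pairing for the new coefficients.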
Similarity regularized sparse group lasso for cup to disc ratio computation.
Cheng, Jun; Zhang, Zhuo; Tao, Dacheng; Wong, Damon Wing Kee; Liu, Jiang; Baskaran, Mani; Aung, Tin; Wong, Tien Yin
2017-08-01
Automatic cup to disc ratio (CDR) computation from color fundus images has been shown to be promising for glaucoma detection. Over the past decade, many algorithms have been proposed. In this paper, we first review the recent work in the area and then present a novel similarity-regularized sparse group lasso method for automated CDR estimation. The proposed method reconstructs the testing disc image based on a set of reference disc images by integrating the similarity between the testing and reference disc images with the sparse group lasso constraints. The reconstruction coefficients are then used to estimate the CDR of the testing image. The proposed method has been validated using 650 images with manually annotated CDRs. Experimental results show an average CDR error of 0.0616 and a correlation coefficient of 0.7, outperforming other methods. The areas under curve in the diagnostic test reach 0.843 and 0.837 when manually and automatically segmented discs are used, respectively, which is also better than other methods.
Ratliff, John K; Balise, Ray; Veeravagu, Anand; Cole, Tyler S; Cheng, Ivan; Olshen, Richard A; Tian, Lu
2016-05-18
Postoperative metrics are increasingly important in determining standards of quality for physicians and hospitals. Although complications following spinal surgery have been described, procedural and patient variables have yet to be incorporated into a predictive model of adverse-event occurrence. We sought to develop a predictive model of complication occurrence after spine surgery. We used longitudinal prospective data from a national claims database and developed a predictive model incorporating complication type and frequency of occurrence following spine surgery procedures. We structured our model to assess the impact of features such as preoperative diagnosis, patient comorbidities, location in the spine, anterior versus posterior approach, whether fusion had been performed, whether instrumentation had been used, number of levels, and use of bone morphogenetic protein (BMP). We assessed a variety of adverse events. Prediction models were built using logistic regression with additive main effects and logistic regression with main effects as well as all 2 and 3-factor interactions. Least absolute shrinkage and selection operator (LASSO) regularization was used to select features. Competing approaches included boosted additive trees and the classification and regression trees (CART) algorithm. The final prediction performance was evaluated by estimating the area under a receiver operating characteristic curve (AUC) as predictions were applied to independent validation data and compared with the Charlson comorbidity score. The model was developed from 279,135 records of patients with a minimum duration of follow-up of 30 days. Preliminary assessment showed an adverse-event rate of 13.95%, well within norms reported in the literature. We used the first 80% of the records for training (to predict adverse events) and the remaining 20% of the records for validation. 
There was remarkable similarity among methods, with an AUC of 0.70 for predicting the occurrence of adverse events. The AUC using the Charlson comorbidity score was 0.61. The described model was more accurate than Charlson scoring (p < 0.01). We present a modeling effort based on administrative claims data that predicts the occurrence of complications after spine surgery. We believe that the development of a predictive modeling tool illustrating the risk of complication occurrence after spine surgery will aid in patient counseling and improve the accuracy of risk modeling strategies. Copyright © 2016 by The Journal of Bone and Joint Surgery, Incorporated.
Sabourin, Jeremy; Nobel, Andrew B.; Valdar, William
2014-01-01
Genomewide association studies sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple SNPs simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow-up studies. Current multi-SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA-dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights; it estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing a SNP prioritization that best identifies underlying true signals, we show that: our method easily outperforms a single marker analysis; when additive-only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive-only effects; and, when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation. PMID:25417853
Watanabe, Takanori; Kessler, Daniel; Scott, Clayton; Angstadt, Michael; Sripada, Chandra
2014-01-01
Substantial evidence indicates that major psychiatric disorders are associated with distributed neural dysconnectivity, leading to strong interest in using neuroimaging methods to accurately predict disorder status. In this work, we are specifically interested in a multivariate approach that uses features derived from whole-brain resting state functional connectomes. However, functional connectomes reside in a high dimensional space, which complicates model interpretation and introduces numerous statistical and computational challenges. Traditional feature selection techniques are used to reduce data dimensionality, but are blind to the spatial structure of the connectomes. We propose a regularization framework where the 6-D structure of the functional connectome (defined by pairs of points in 3-D space) is explicitly taken into account via the fused Lasso or the GraphNet regularizer. Our method only restricts the loss function to be convex and margin-based, allowing non-differentiable loss functions such as the hinge-loss to be used. Using the fused Lasso or GraphNet regularizer with the hinge-loss leads to a structured sparse support vector machine (SVM) with embedded feature selection. We introduce a novel efficient optimization algorithm based on the augmented Lagrangian and the classical alternating direction method, which can solve both fused Lasso and GraphNet regularized SVM with very little modification. We also demonstrate that the inner subproblems of the algorithm can be solved efficiently in analytic form by coupling the variable splitting strategy with a data augmentation scheme. Experiments on simulated data and resting state scans from a large schizophrenia dataset show that our proposed approach can identify predictive regions that are spatially contiguous in the 6-D “connectome space,” offering an additional layer of interpretability that could provide new insights about various disease processes. PMID:24704268
Detection of pesticide (Cyantraniliprole) residue on grapes using hyperspectral sensing
NASA Astrophysics Data System (ADS)
Mohite, Jayantrao; Karale, Yogita; Pappula, Srinivasu; Shabeer, Ahammed T. P.; Sawant, S. D.; Hingmire, Sandip
2017-05-01
Pesticide residues in fruits, vegetables and agricultural commodities are harmful to humans and are becoming a health concern. Detection of pesticide residues on various commodities in an open environment is a challenging task. Hyperspectral sensing is one of the recent technologies used to detect pesticide residues. This paper addresses the problem of detecting residues of the pesticide Cyantraniliprole on grapes in open fields using multi-temporal hyperspectral remote sensing data. The reflectance data of 686 samples of grapes with no, single and double dose application of Cyantraniliprole were collected by a handheld spectroradiometer (MS-720) with a wavelength range of 350 nm to 1052 nm. The data collection was carried out over a large feature set of 213 spectral bands during the period of March to May 2015. This large feature set may cause model over-fitting as well as increase the computational time, so in order to get the most relevant features, various feature selection techniques, viz. Principal Component Analysis (PCA), LASSO and Elastic Net regularization, have been used. Using these selected features, we evaluate the performance of various classifiers such as Artificial Neural Networks (ANN), Support Vector Machine (SVM), Random Forest (RF) and Extreme Gradient Boosting (XGBoost) to classify the grape samples with no, single or double application of Cyantraniliprole. The key finding of this paper is that most of the features selected by the LASSO lie in the ranges 350-373 nm and 940-990 nm consistently for all days. Experimental results also show that, using the relevant features selected by LASSO, SVM performs best among all classifiers, with an average prediction accuracy of 91.98% for all days.
M-estimation for robust sparse unmixing of hyperspectral images
NASA Astrophysics Data System (ADS)
Toomik, Maria; Lu, Shijian; Nelson, James D. B.
2016-10-01
Hyperspectral unmixing methods often use a conventional least squares based lasso which assumes that the data follows the Gaussian distribution. The normality assumption is an approximation which is generally invalid for real imagery data. We consider a robust (non-Gaussian) approach to sparse spectral unmixing of remotely sensed imagery which reduces the sensitivity of the estimator to outliers and relaxes the linearity assumption. The method consists of several appropriate penalties. We propose to use an lp norm with 0 < p < 1 in the sparse regression problem, which induces more sparsity in the results, but makes the problem non-convex. On the other hand, the problem, though non-convex, can be solved quite straightforwardly with an extensible algorithm based on iteratively reweighted least squares. To deal with the huge size of modern spectral libraries we introduce a library reduction step, similar to the multiple signal classification (MUSIC) array processing algorithm, which not only speeds up unmixing but also yields superior results. In the hyperspectral setting we extend the traditional least squares method to the robust heavy-tailed case and propose a generalised M-lasso solution. M-estimation replaces the Gaussian likelihood with a fixed function ρ(e) that restrains outliers. The M-estimate function reduces the effect of errors with large amplitudes or even assigns the outliers zero weights. Our experimental results on real hyperspectral data show that noise with large amplitudes (outliers) often exists in the data. This ability to mitigate the influence of such outliers can therefore offer greater robustness. Qualitative hyperspectral unmixing results on real hyperspectral image data corroborate the efficacy of the proposed method.
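The two behaviours described for the M-estimate function (down-weighting large errors, or giving outliers zero weight) correspond to well-known weight functions such as Huber's and Tukey's bisquare. The sketch below shows both, plus a tiny iteratively reweighted least squares loop for a location estimate. It is a generic illustration, not the proposed M-lasso; the tuning constants are the conventional textbook defaults, residuals are assumed to be already scaled to unit spread, and the data are invented.

```python
def huber_weight(residual, c=1.345):
    """Full weight for small residuals, weight c/|e| for large ones."""
    size = abs(residual)
    return 1.0 if size <= c else c / size


def tukey_weight(residual, c=4.685):
    """Smoothly decreasing weight that is exactly zero beyond the cutoff c."""
    size = abs(residual)
    if size >= c:
        return 0.0
    return (1.0 - (size / c) ** 2) ** 2


def irls_location(values, weight_fn, iterations=50):
    """Robust centre of `values` by iteratively reweighted least squares."""
    estimate = sorted(values)[len(values) // 2]  # start from a median-like value
    for _ in range(iterations):
        weights = [weight_fn(v - estimate) for v in values]
        total = sum(weights)
        if total == 0.0:
            break
        # Weighted least squares for a constant is just the weighted mean.
        estimate = sum(w * v for w, v in zip(weights, values)) / total
    return estimate


# Five well-behaved observations and one large-amplitude outlier.
data = [1.0, 1.2, 0.9, 1.1, 0.8, 50.0]
plain_mean = sum(data) / len(data)
robust_centre = irls_location(data, tukey_weight)
```

The ordinary mean is dragged far from the bulk of the data by the single outlier, whereas the bisquare-weighted estimate ignores it because its weight is zero.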
Melikian, George L; Rhee, Soo-Yon; Taylor, Jonathan; Fessel, W Jeffrey; Kaufman, David; Towner, William; Troia-Cancio, Paolo V; Zolopa, Andrew; Robbins, Gregory K; Kagan, Ron; Israelski, Dennis; Shafer, Robert W
2012-05-01
Determining the phenotypic impacts of reverse transcriptase (RT) mutations on individual nucleoside RT inhibitors (NRTIs) has remained a statistical challenge because clinical NRTI-resistant HIV-1 isolates usually contain multiple mutations, often in complex patterns, complicating the task of determining the relative contribution of each mutation to HIV drug resistance. Furthermore, the NRTIs have highly variable dynamic susceptibility ranges, making it difficult to determine the relative effect of an RT mutation on susceptibility to different NRTIs. In this study, we analyzed 1,273 genotyped HIV-1 isolates for which phenotypic results were obtained using the PhenoSense assay (Monogram, South San Francisco, CA). We used a parsimonious feature selection algorithm, LASSO, to assess the possible contributions of 177 mutations that occurred in 10 or more isolates in our data set. We then used least-squares regression to quantify the impact of each LASSO-selected mutation on each NRTI. Our study provides a comprehensive view of the most common NRTI resistance mutations. Because our results were standardized, the study provides the first analysis that quantifies the relative phenotypic effects of NRTI resistance mutations on each of the NRTIs. In addition, the study contains new findings on the relative impacts of thymidine analog mutations (TAMs) on susceptibility to abacavir and tenofovir; the impacts of several known but incompletely characterized mutations, including E40F, V75T, Y115F, and K219R; and a tentative role in reduced NRTI susceptibility for K64H, a novel NRTI resistance mutation.
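The two-stage idea of selecting predictors with LASSO and then quantifying their effects by least squares can be sketched as below. This is a minimal coordinate-descent illustration on invented, orthogonal toy features, not the study's pipeline; the function names, the penalty value, and the data are assumptions. The unpenalized refit reuses the same routine with a zero penalty.

```python
def soft_threshold(value, threshold):
    """Shrink a scalar towards zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, response, penalty, sweeps=200):
    """Minimise (1/2n)*||y - Xb||^2 + penalty*||b||_1 by cyclic coordinate descent.

    columns: list of feature columns (each a list of n floats).
    """
    n = len(response)
    beta = [0.0] * len(columns)
    residual = response[:]
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            scale = sum(x * x for x in col) / n
            if scale == 0.0:
                continue
            # Covariance of feature j with the residual that excludes its own effect.
            rho = sum(x * (r + beta[j] * x) for x, r in zip(col, residual)) / n
            updated = soft_threshold(rho, penalty) / scale
            if updated != beta[j]:
                delta = updated - beta[j]
                residual = [r - delta * x for r, x in zip(residual, col)]
                beta[j] = updated
    return beta


# Toy data: the response is 2*x1 + 0.1*x3, with x2 irrelevant.
x1 = [1.0, 1.0, -1.0, -1.0]
x2 = [1.0, -1.0, 1.0, -1.0]
x3 = [1.0, -1.0, -1.0, 1.0]
y = [2.1, 1.9, -2.1, -1.9]

# Stage 1: LASSO keeps only the strong predictor but shrinks its coefficient.
lasso_beta = lasso_coordinate_descent([x1, x2, x3], y, penalty=0.5)
selected = [j for j, b in enumerate(lasso_beta) if b != 0.0]

# Stage 2: least-squares refit on the selected columns (zero penalty).
refit_beta = lasso_coordinate_descent([[x1, x2, x3][j] for j in selected], y, penalty=0.0)
```

The refit removes the shrinkage bias of the first stage, which is one reason a least-squares step is often used when the size of each selected effect matters.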
NASA Astrophysics Data System (ADS)
Tang, Zhenchao; Liu, Zhenyu; Li, Ruili; Cui, Xinwei; Li, Hongjun; Dong, Enqing; Tian, Jie
2017-03-01
It is widely known that HIV infection causes white matter integrity impairments. Nevertheless, it is still unclear how the white matter anatomical structural connections are affected by HIV infection. In the current study, we employed a multivariate pattern analysis to explore the HIV-related alterations in white matter connections. Forty antiretroviral-therapy-naïve HIV patients and thirty healthy controls were enrolled. Firstly, an Automatic Anatomical Label (AAL) atlas based white matter structural network, a 90 × 90 FA-weighted matrix, was constructed for each subject. Then, the white matter connections derived from the structural network were entered into a lasso-logistic regression model to perform HIV-control group classification. Using leave-one-out cross validation, a classification accuracy (ACC) of 90% (P=0.002) and an area under the receiver operating characteristic curve (AUC) of 0.96 were obtained by the classification model. This result indicated that the white matter anatomical structural connections contributed greatly to HIV-control group classification, providing solid evidence that the white matter connections were affected by HIV infection. Specifically, 11 white matter connections were selected in the classification model, mainly crossing the regions of frontal lobe, Cingulum, Hippocampus, and Thalamus, which were reported to be damaged in previous HIV studies. This might suggest that the white matter connections adjacent to the HIV-related impaired regions were prone to be damaged.
Machine learning derived risk prediction of anorexia nervosa.
Guo, Yiran; Wei, Zhi; Keating, Brendan J; Hakonarson, Hakon
2016-01-20
Anorexia nervosa (AN) is a complex psychiatric disease with a moderate to strong genetic contribution. In addition to conventional genome wide association (GWA) studies, researchers have been using machine learning methods in conjunction with genomic data to predict risk of diseases in which genetics play an important role. In this study, we collected whole genome genotyping data on 3940 AN cases and 9266 controls from the Genetic Consortium for Anorexia Nervosa (GCAN), the Wellcome Trust Case Control Consortium 3 (WTCCC3), Price Foundation Collaborative Group and the Children's Hospital of Philadelphia (CHOP), and applied machine learning methods for predicting AN disease risk. The prediction performance is measured by area under the receiver operating characteristic curve (AUC), indicating how well the model distinguishes cases from unaffected control subjects. Logistic regression model with the lasso penalty technique generated an AUC of 0.693, while Support Vector Machines and Gradient Boosted Trees reached AUC's of 0.691 and 0.623, respectively. Using different sample sizes, our results suggest that larger datasets are required to optimize the machine learning models and achieve higher AUC values. To our knowledge, this is the first attempt to assess AN risk based on genome wide genotype level data. Future integration of genomic, environmental and family-based information is likely to improve the AN risk evaluation process, eventually benefitting AN patients and families in the clinical setting.
Prediction of drug synergy in cancer using ensemble-based machine learning techniques
NASA Astrophysics Data System (ADS)
Singh, Harpreet; Rana, Prashant Singh; Singh, Urvinder
2018-04-01
Drug synergy prediction plays a significant role in the medical field for inhibiting specific cancer agents. It can be developed as a pre-processing tool for therapeutic successes. Different drug-drug interactions can be examined using the drug synergy score, which requires efficient regression-based machine learning approaches to minimize prediction errors. Numerous machine learning techniques, such as neural networks, support vector machines, random forests, LASSO, and Elastic Nets, have been used in the past to meet this requirement. However, these techniques individually do not provide significant accuracy in predicting the drug synergy score. Therefore, the primary objective of this paper is to design a neuro-fuzzy-based ensembling approach. To achieve this, nine well-known machine learning techniques were implemented on the drug synergy data. Based on the accuracy of each model, the four techniques with the highest accuracy were selected to develop an ensemble-based machine learning model. These models are Random forest, Fuzzy Rules Using Genetic Cooperative-Competitive Learning method (GFS.GCCL), Adaptive-Network-Based Fuzzy Inference System (ANFIS) and Dynamic Evolving Neural-Fuzzy Inference System method (DENFIS). Ensembling is achieved by evaluating the biased weighted aggregation (i.e. adding more weight to the model with a higher prediction score) of the data predicted by the selected models. The proposed and existing machine learning techniques have been evaluated on drug synergy score data. The comparative analysis reveals that the proposed method outperforms the others in terms of accuracy, root mean square error and coefficient of correlation.
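One simple way to realize a "biased weighted aggregation" is to weight each model's predictions in proportion to its validation score. The abstract does not give the exact weighting rule, so the proportional scheme and the name `weighted_ensemble` below are assumptions made purely for illustration.

```python
def weighted_ensemble(predictions, scores):
    """Combine per-model predictions with weights proportional to model scores.

    predictions: one list of predicted values per model (equal lengths).
    scores: one non-negative quality score per model (higher is better).
    Assumption: weights are the scores normalized to sum to one.
    """
    total = sum(scores)
    weights = [s / total for s in scores]
    n_samples = len(predictions[0])
    return [
        sum(w * model_preds[i] for w, model_preds in zip(weights, predictions))
        for i in range(n_samples)
    ]


# Two hypothetical models; the second scored three times higher on validation.
combined = weighted_ensemble([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
```

With these inputs the better model receives weight 0.75, so the combined predictions lie closer to its outputs than to those of the weaker model.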
Variable selection in a flexible parametric mixture cure model with interval-censored data.
Scolas, Sylvie; El Ghouch, Anouar; Legrand, Catherine; Oulhaj, Abderrahim
2016-03-30
In standard survival analysis, it is generally assumed that every individual will experience someday the event of interest. However, this is not always the case, as some individuals may not be susceptible to this event. Also, in medical studies, it is frequent that patients come to scheduled interviews and that the time to the event is only known to occur between two visits. That is, the data are interval-censored with a cure fraction. Variable selection in such a setting is of outstanding interest. Covariates impacting the survival are not necessarily the same as those impacting the probability to experience the event. The objective of this paper is to develop a parametric but flexible statistical model to analyze data that are interval-censored and include a fraction of cured individuals when the number of potential covariates may be large. We use the parametric mixture cure model with an accelerated failure time regression model for the survival, along with the extended generalized gamma for the error term. To overcome the issue of non-stable and non-continuous variable selection procedures, we extend the adaptive LASSO to our model. By means of simulation studies, we show good performance of our method and discuss the behavior of estimates with varying cure and censoring proportion. Lastly, our proposed method is illustrated with a real dataset studying the time until conversion to mild cognitive impairment, a possible precursor of Alzheimer's disease. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Zhou, Yu; Du, Juan; Hou, Hong-Yan; Lu, Yan-Fang; Yu, Jing; Mao, Li-Yan; Wang, Feng; Sun, Zi-Yong
2017-01-01
Tuberculosis (TB) is a leading global public health problem. To achieve the End TB Strategy, non-invasive markers for diagnosis and treatment monitoring of TB disease are urgently needed, especially in high-endemic countries such as China. Interferon-gamma release assays (IGRAs) and the tuberculin skin test (TST), frequently used immunological methods for TB detection, are intrinsically unable to discriminate active tuberculosis (ATB) from latent tuberculosis infection (LTBI). Thus, the specificity of these methods in the diagnosis of ATB is dependent upon the local prevalence of LTBI. Pathogen-detecting methods such as acid-fast staining and culture all have limitations in clinical application. ImmunoScore (IS) is a promising new prognostic tool that has commonly been used for tumors. However, the importance of host immunity has also been demonstrated in TB pathogenesis, which implies the possibility of using an IS model for ATB diagnosis and therapy monitoring. In the present study, we focused on the performance of the IS model in the differentiation between ATB and LTBI and in treatment monitoring of TB disease. In total, we screened five immunological markers (four non-specific markers and one TB-specific marker) and successfully established an IS model using Lasso logistic regression analysis. As expected, the IS model can effectively distinguish ATB from LTBI (with a sensitivity of 95.7% and a specificity of 92.1%) and also has potential value in the treatment monitoring of TB disease.
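Sensitivity and specificity, as reported above, follow directly from the confusion counts of a binary classifier. The sketch below uses made-up labels (1 for ATB, 0 for LTBI); the function name `sensitivity_specificity` is a hypothetical helper, not part of the study.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels coded 1/0."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # cases correctly flagged
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # cases missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-cases correctly cleared
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # non-cases wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative data only: four active cases and five latent infections.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(actual, predicted)
```

Here three of four cases are detected and four of five non-cases are cleared, giving a sensitivity of 0.75 and a specificity of 0.8.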
NASA Astrophysics Data System (ADS)
Lucas, D. D.; Labute, M.; Chowdhary, K.; Debusschere, B.; Cameron-Smith, P. J.
2014-12-01
Simulating the atmospheric cycles of ozone, methane, and other radiatively important trace gases in global climate models is computationally demanding and requires the use of 100's of photochemical parameters with uncertain values. Quantitative analysis of the effects of these uncertainties on tracer distributions, radiative forcing, and other model responses is hindered by the "curse of dimensionality." We describe efforts to overcome this curse using ensemble simulations and advanced statistical methods. Uncertainties from 95 photochemical parameters in the trop-MOZART scheme were sampled using a Monte Carlo method and propagated through 10,000 simulations of the single column version of the Community Atmosphere Model (CAM). The variance of the ensemble was represented as a network with nodes and edges, and the topology and connections in the network were analyzed using lasso regression, Bayesian compressive sensing, and centrality measures from the field of social network theory. Despite the limited sample size for this high dimensional problem, our methods determined the key sources of variation and co-variation in the ensemble and identified important clusters in the network topology. Our results can be used to better understand the flow of photochemical uncertainty in simulations using CAM and other climate models. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and supported by the DOE Office of Science through the Scientific Discovery Through Advanced Computing (SciDAC).
NASA Astrophysics Data System (ADS)
Shan, X.; Zhang, K.; Zhuang, Y.; Fu, R.; Hong, Y.
2017-12-01
Seasonal prediction of rainfall during the dry-to-wet transition season in austral spring (September-November) over southern Amazonia is central to improving crop planting and fire mitigation in that region. Previous studies have identified the key large-scale atmospheric dynamic and thermodynamic pre-conditions during the dry season (June-August) that influence the rainfall anomalies during the dry-to-wet transition season over southern Amazonia. Based on these key pre-conditions during the dry season, we have evaluated several statistical models and developed a Neural Network based statistical prediction system to predict rainfall during the dry-to-wet transition for southern Amazonia (5-15°S, 50-70°W). Multivariate Empirical Orthogonal Function (EOF) analysis is applied to the following four fields during JJA from the ECMWF Reanalysis (ERA-Interim) spanning 1979 to 2015: geopotential height at 200 hPa, surface relative humidity, convective inhibition energy (CIN) index and convective available potential energy (CAPE), to filter out noise and highlight the most coherent spatial and temporal variations. The first 10 EOF modes are retained as inputs to the statistical models, accounting for at least 70% of the total variance in the predictor fields. We have tested several linear and non-linear statistical methods. While the regularized Ridge Regression and Lasso Regression can generally capture the spatial pattern and magnitude of rainfall anomalies, we found that the Neural Network performs best, with an accuracy greater than 80%, as expected from the non-linear dependence of the rainfall on the large-scale atmospheric thermodynamic conditions and circulation. Further tests of various prediction skill metrics and hindcasts also suggest that this Neural Network prediction approach can significantly improve seasonal prediction skill relative to the dynamic predictions and regression-based statistical predictions. 
Thus, this statistical prediction system shows potential to improve real-time seasonal rainfall predictions in the future.
Wang, D; Salah El-Basyoni, I; Stephen Baenziger, P; Crossa, J; Eskridge, K M; Dweikat, I
2012-11-01
Though epistasis has long been postulated to have a critical role in genetic regulation of important pathways as well as provide a major source of variation in the process of speciation, the importance of epistasis for genomic selection in the context of plant breeding is still being debated. In this paper, we report the results on the prediction of genetic values with epistatic effects for 280 accessions in the Nebraska Wheat Breeding Program using adaptive mixed least absolute shrinkage and selection operator (LASSO). The development of adaptive mixed LASSO, originally designed for association mapping, for the context of genomic selection is reported. The results show that adaptive mixed LASSO can be successfully applied to the prediction of genetic values while incorporating both marker main effects and epistatic effects. Especially, the prediction accuracy is substantially improved by the inclusion of two-locus epistatic effects (more than onefold in some cases as measured by cross-validation correlation coefficient), which is observed for multiple traits and planting locations. This points to significant potential in using non-additive genetic effects for genomic selection in crop breeding practices.
Sungsanpin, a lasso peptide from a deep-sea streptomycete.
Um, Soohyun; Kim, Young-Joo; Kwon, Hyuknam; Wen, He; Kim, Seong-Hwan; Kwon, Hak Cheol; Park, Sunghyouk; Shin, Jongheon; Oh, Dong-Chan
2013-05-24
Sungsanpin (1), a new 15-amino-acid peptide, was discovered from a Streptomyces species isolated from deep-sea sediment collected off Jeju Island, Korea. The planar structure of 1 was determined by 1D and 2D NMR spectroscopy, mass spectrometry, and UV spectroscopy. The absolute configurations of the stereocenters in this compound were assigned by derivatizations of the hydrolysate of 1 with Marfey's reagents and 2,3,4,6-tetra-O-acetyl-β-d-glucopyranosyl isothiocyanate, followed by LC-MS analysis. Careful analysis of the ROESY NMR spectrum and three-dimensional structure calculations revealed that sungsanpin possesses the features of a lasso peptide: eight amino acids (-Gly(1)-Phe-Gly-Ser-Lys-Pro-Ile-Asp(8)-) that form a cyclic peptide and seven amino acids (-Ser(9)-Phe-Gly-Leu-Ser-Trp-Leu(15)) that form a tail that loops through the ring. Sungsanpin is thus the first example of a lasso peptide isolated from a marine-derived microorganism. Sungsanpin displayed inhibitory activity in a cell invasion assay with the human lung cancer cell line A549.
Liu, Rong; Li, Xi; Zhang, Wei; Zhou, Hong-Hao
2015-01-01
Objective Multiple linear regression (MLR) and machine learning techniques in pharmacogenetic algorithm-based warfarin dosing have been reported. However, the performances of these algorithms in racially diverse groups have never been objectively evaluated and compared. In this literature-based study, we compared the performances of eight machine learning techniques with those of MLR in a large, racially diverse cohort. Methods MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied in warfarin dose algorithms in a cohort from the International Warfarin Pharmacogenetics Consortium database. Covariates obtained by stepwise regression from 80% of randomly selected patients were used to develop algorithms. To compare the performances of these algorithms, the mean percentage of patients whose predicted dose fell within 20% of the actual dose (mean percentage within 20%) and the mean absolute error (MAE) were calculated in the remaining 20% of patients. The performances of these techniques in different races, as well as across the dose ranges of therapeutic warfarin, were compared. Robust results were obtained after 100 rounds of resampling. Results BART, MARS and SVR were statistically indistinguishable and significantly outperformed all the other approaches in the whole cohort (MAE: 8.84–8.96 mg/week, mean percentage within 20%: 45.88%–46.35%). In the White population, MARS and BART showed higher mean percentage within 20% and lower mean MAE than those of MLR (all p values < 0.05). In the Asian population, SVR, BART, MARS and LAR performed the same as MLR. MLR and LAR performed best in the Black population. 
When patients were grouped in terms of warfarin dose range, all machine learning techniques except ANN and LAR showed significantly higher mean percentage within 20% and lower MAE (all p values < 0.05) than MLR in the low- and high-dose ranges. Conclusion Overall, the machine learning-based techniques BART, MARS and SVR performed better than MLR in warfarin pharmacogenetic dosing. Differences in the algorithms’ performances exist among the races. Moreover, machine learning-based algorithms tended to perform better than MLR in the low- and high-dose ranges. PMID:26305568
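The two evaluation metrics used above, mean absolute error and the percentage of patients whose predicted dose falls within 20% of the actual dose, can be computed as in the following sketch. The dose values are invented for illustration, and the function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between predicted and actual doses."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def percent_within_20(actual, predicted):
    """Percentage of patients whose predicted dose is within 20% of the actual dose."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= 0.2 * a)
    return 100.0 * hits / len(actual)


# Made-up weekly doses in mg/week for four patients.
actual_dose = [10.0, 20.0, 30.0, 40.0]
predicted_dose = [11.0, 25.0, 30.0, 30.0]
mae = mean_absolute_error(actual_dose, predicted_dose)
within_20 = percent_within_20(actual_dose, predicted_dose)
```

The two metrics can disagree: a model with a low MAE may still miss the 20% band for low-dose patients, where the same absolute error is a larger relative error.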
Olson, Dawn M; Prescott, Kristina K; Zeilinger, Adam R; Hou, Suqin; Coffin, Alisa W; Smith, Coby M; Ruberson, John R; Andow, David A
2018-06-06
Landscape factors can significantly influence arthropod populations. The economically important brown stink bug, Euschistus servus (Say) (Hemiptera: Pentatomidae), is a native, mobile, polyphagous and multivoltine pest of many crops in the southeastern United States, and understanding the relative influence of local and landscape factors on its reproduction may facilitate population management. Finite rate of population increase (λ) was estimated in four major crop hosts (maize, peanut, cotton, and soybean) over 3 yr in 16 landscapes of southern Georgia. A geographic information system (GIS) was used to characterize the surrounding landscape structure. LASSO regression was used to identify the subset of local and landscape characteristics and predator densities that account for variation in λ. The percentage area of maize, peanut, and woodland and pasture in the landscape and the connectivity of cropland had no influence on E. servus λ. The best model for explaining variation in λ included only four predictor variables: whether or not the sampled field was a soybean field, mean natural enemy density in the field, percentage area of cotton in the landscape and percentage area of soybean in the landscape. Soybean was the single most important variable for determining E. servus λ, with much greater reproduction in soybean fields than in other crop species. Penalized regression and post-selection inference provide conservative estimates of the landscape-scale determinants of E. servus reproduction and indicate that a relatively simple set of in-field and landscape variables influences reproduction in this species.
Stochastic model search with binary outcomes for genome-wide association studies
Malovini, Alberto; Puca, Annibale A; Bellazzi, Riccardo
2012-01-01
Objective The spread of case–control genome-wide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses. Materials and methods Our method is based on a latent variable model that links the observed outcomes to the underlying genetic variables. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor. Results BOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. Simulations show that BOSS achieves higher precisions than the reference methods while preserving good recall rates. In both experimental studies, BOSS successfully detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes. Discussion BOSS outperforms the other methods in terms of F-measure on simulated data. In the two real studies, BOSS successfully detects biologically relevant features, some of which are missed by univariate analysis and the three reference techniques. Conclusion The proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results showed that BOSS proves effective in detecting relevant markers while providing a parsimonious model. PMID:22534080
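Precision, recall and the F-measure for a variable selection method can be computed by comparing the set of selected predictors with the set of truly associated ones, which is only known in simulations. The sketch below uses arbitrary marker indices; `selection_metrics` is a hypothetical helper name.

```python
def selection_metrics(selected, truly_associated):
    """Return (precision, recall, F-measure) for a set of selected predictors."""
    selected = set(selected)
    truly_associated = set(truly_associated)
    true_positives = len(selected & truly_associated)
    precision = true_positives / len(selected)
    recall = true_positives / len(truly_associated)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative example: four markers selected, three truly causal, two overlap.
precision, recall, f_measure = selection_metrics({1, 2, 3, 4}, {2, 3, 5})
```

The F-measure is the harmonic mean of precision and recall, so a method that selects many markers to boost recall is penalized for the resulting false positives.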
NASA Astrophysics Data System (ADS)
Wang, Quanchao; Yu, Yang; Li, Fuhua; Zhang, Xiaojun; Xiang, Jianhai
2017-09-01
Genomic selection (GS) can be used to accelerate genetic improvement by shortening the selection interval. The successful application of GS depends largely on the accuracy of the prediction of genomic estimated breeding value (GEBV). This study is a first attempt to understand the practicality of GS in Litopenaeus vannamei and aims to evaluate models for GS on growth traits. The performance of GS models in L. vannamei was evaluated in a population consisting of 205 individuals, which were genotyped for 6 359 single nucleotide polymorphism (SNP) markers by specific length amplified fragment sequencing (SLAF-seq) and phenotyped for body length and body weight. Three GS models (RR-BLUP, BayesA, and Bayesian LASSO) were used to obtain the GEBV, and their predictive ability was assessed by the reliability of the GEBV and the bias of the predicted phenotypes. The mean reliability of the GEBVs for body length and body weight predicted by the different models was 0.296 and 0.411, respectively. For each trait, the performances of the three models were very similar to each other with respect to predictability. The regression coefficients estimated by the three models were close to one, suggesting near to zero bias for the predictions. Therefore, when GS was applied in a L. vannamei population for the studied scenarios, all three models appeared practicable. Further analyses suggested that improved estimation of the genomic prediction could be realized by increasing the size of the training population as well as the density of SNPs.
Modelling Ecuador's rainfall distribution according to geographical characteristics.
NASA Astrophysics Data System (ADS)
Tobar, Vladimiro; Wyseure, Guido
2017-04-01
It is known that rainfall is affected by terrain characteristics, and some studies have focused on its distribution over complex terrain. Ecuador's temporal and spatial rainfall distribution is affected by its location on the ITCZ, the marine currents in the Pacific, the Amazon rainforest, and the Andes mountain range. Although all these factors are important, we think that the last of these may hold a key to modelling the spatial and temporal distribution of rainfall. The study considered 30 years of monthly data from 319 rainfall stations having at least 10 years of data available. The relatively low density of stations and their location at accessible sites near main roads or rivers leave large and important areas ungauged, making it inappropriate to rely on traditional interpolation techniques to estimate regional rainfall for water balance. The aim of this research was to develop a useful model for seasonal rainfall distribution in Ecuador based on geographical characteristics to allow its spatial generalization. The target for modelling was the seasonal rainfall, characterized by nine percentiles for each of the 12 months of the year, which results in 108 response variables, later reduced to four principal components (PCs) comprising 94% of the total variability. Predictor variables for the model were: geographic coordinates, elevation, main wind effects from the Amazon and Coast, Valley and Hill indexes, and average and maximum elevation above the selected rainfall station to the east and to the west, for each of 18 directions (50-135°, by 5°), adding up to 79 predictors. A multiple linear regression model fitted by the Elastic-net algorithm with cross-validation was applied with each PC as the response to select the most important of the 79 predictor variables. The Elastic-net algorithm deals well with collinearity problems, while allowing variable selection in a blended approach between the Ridge and Lasso regression. 
The model fitting produced explained variances of 59%, 81%, 49% and 17% for PC1, PC2, PC3 and PC4, respectively, backing up the hypothesis of good correlation between geographical characteristics and seasonal rainfall patterns (comprised in the four principal components). With the obtained coefficients from the regression, the 108 rainfall percentiles for each station were back estimated giving very good results when compared with the original ones, with an overall 60% explained variance.
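The blend between Ridge and Lasso that the Elastic-net provides can be made concrete with a small coordinate-descent sketch. It assumes the commonly used objective (1/2n)·||y − Xβ||² + λ[α·||β||₁ + (1 − α)/2·||β||²] with centered data and no intercept; the parameter names `lam` and `alpha`, the function names, and the toy data are assumptions, and this is not the implementation used in the study.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha, n_sweeps=100):
    """Cyclic coordinate descent for the elastic net (alpha=1 lasso, alpha=0 ridge)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            col_scale = sum(X[i][j] ** 2 for i in range(n)) / n
            # L1 part thresholds the coefficient, L2 part shrinks it further.
            beta[j] = soft_threshold(rho, lam * alpha) / (col_scale + lam * (1.0 - alpha))
    return beta


# Toy centered data with two orthogonal predictors; the second is nearly irrelevant.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
lasso_fit = elastic_net(X, y, lam=0.1, alpha=1.0)
ridge_fit = elastic_net(X, y, lam=0.1, alpha=0.0)
```

In this toy case the lasso setting drops the weak second predictor entirely, while the ridge setting keeps it with a small nonzero coefficient; intermediate `alpha` values trade off between the two behaviors.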
Graph Lasso-Based Test for Evaluating Functional Brain Connectivity in Sickle Cell Disease.
Coloigner, Julie; Phlypo, Ronald; Coates, Thomas D; Lepore, Natasha; Wood, John C
2017-09-01
Sickle cell disease (SCD) is a vascular disorder that is often associated with recurrent ischemia-reperfusion injury, anemia, vasculopathy, and strokes. These cerebral injuries are associated with neurological dysfunction, limiting the full developing potential of the patient. However, recent large studies of SCD have demonstrated that cognitive impairment occurs even in the absence of brain abnormalities on conventional magnetic resonance imaging (MRI). These observations support an emerging consensus that brain injury in SCD is diffuse and that conventional neuroimaging often underestimates the extent of injury. In this article, we postulated that alterations in the cerebral connectivity may constitute a sensitive biomarker of SCD severity. Using functional MRI, a connectivity study analyzing the SCD patients individually was performed. First, a robust learning scheme based on graphical lasso model and Fréchet mean was used for estimating a consistent descriptor of healthy brain connectivity. Then, we tested a statistical method that provides an individual index of similarity between this healthy connectivity model and each SCD patient's connectivity matrix. Our results demonstrated that the reference connectivity model was not appropriate to model connectivity for only 4 out of 27 patients. After controlling for the gender, two separate predictors of this individual similarity index were the anemia (p = 0.02) and white matter hyperintensities (WMH) (silent stroke) (p = 0.03), so that patients with low hemoglobin level or with WMH have the least similarity to the reference connectivity model. Further studies are required to determine whether the resting-state connectivity changes reflect pathological changes or compensatory responses to chronic anemia.
NASA Astrophysics Data System (ADS)
González, D. L., II; Angus, M. P.; Tetteh, I. K.; Bello, G. A.; Padmanabhan, K.; Pendse, S. V.; Srinivas, S.; Yu, J.; Semazzi, F.; Kumar, V.; Samatova, N. F.
2015-01-01
Decades of hypothesis-driven and/or first-principles research have been applied towards the discovery and explanation of the mechanisms that drive climate phenomena, such as western African Sahel summer rainfall variability. Although connections between various climate factors have been theorized, not all of the key relationships are fully understood. We propose a data-driven approach to identify candidate players in this climate system, which can help explain underlying mechanisms and/or even suggest new relationships, to facilitate building a more comprehensive and predictive model of the modulatory relationships influencing a climate phenomenon of interest. We applied coupled heterogeneous association rule mining (CHARM), Lasso multivariate regression, and dynamic Bayesian networks to find relationships within a complex system, and explored means with which to obtain a consensus result from the application of such varied methodologies. Using this fusion of approaches, we identified relationships among climate factors that modulate Sahel rainfall. These relationships fall into two categories: well-known associations from prior climate knowledge, such as the relationship with the El Niño-Southern Oscillation (ENSO) and putative links, such as North Atlantic Oscillation, that invite further research.
Gonzalez, II, D. L.; Angus, M. P.; Tetteh, I. K.; ...
2015-01-13
Decades of hypothesis-driven and/or first-principles research have been applied towards the discovery and explanation of the mechanisms that drive climate phenomena, such as western African Sahel summer rainfall variability. Although connections between various climate factors have been theorized, not all of the key relationships are fully understood. We propose a data-driven approach to identify candidate players in this climate system, which can help explain underlying mechanisms and/or even suggest new relationships, to facilitate building a more comprehensive and predictive model of the modulatory relationships influencing a climate phenomenon of interest. We applied coupled heterogeneous association rule mining (CHARM), Lasso multivariate regression, and dynamic Bayesian networks to find relationships within a complex system, and explored means with which to obtain a consensus result from the application of such varied methodologies. Using this fusion of approaches, we identified relationships among climate factors that modulate Sahel rainfall. As a result, these relationships fall into two categories: well-known associations from prior climate knowledge, such as the relationship with the El Niño–Southern Oscillation (ENSO) and putative links, such as North Atlantic Oscillation, that invite further research.
Feature Grouping and Selection Over an Undirected Graph.
Yang, Sen; Yuan, Lei; Lai, Ying-Cheng; Shen, Xiaotong; Wonka, Peter; Ye, Jieping
2012-01-01
High-dimensional regression/classification continues to be an important and challenging problem, especially when features are highly correlated. Feature selection, combined with additional structure information on the features, has been considered to be promising in promoting regression/classification performance. Graph-guided fused lasso (GFlasso) has recently been proposed to facilitate feature selection and graph structure exploitation, when features exhibit certain graph structures. However, the formulation in GFlasso relies on pairwise sample correlations to perform feature grouping, which could introduce additional estimation bias. In this paper, we propose three new feature grouping and selection methods to resolve this issue. The first method employs a convex function to penalize the pairwise ℓ∞ norm of connected regression/classification coefficients, achieving simultaneous feature grouping and selection. The second method improves the first one by utilizing a non-convex function to reduce the estimation bias. The third one is the extension of the second method using a truncated ℓ1 regularization to further reduce the estimation bias. The proposed methods combine feature grouping and feature selection to enhance estimation accuracy. We employ the alternating direction method of multipliers (ADMM) and difference of convex functions (DC) programming to solve the proposed formulations. Our experimental results on synthetic data and two real datasets demonstrate the effectiveness of the proposed methods.
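A pairwise ℓ∞ penalty over a feature graph, as described for the first method, can be read as summing max(|β_i|, |β_j|) over every edge (i, j). The abstract does not spell out the exact formulation, so the sketch below is one plausible reading; the function name, the edge list, and the coefficients are illustrative assumptions.

```python
def pairwise_linf_penalty(beta, edges):
    """Sum of max(|beta_i|, |beta_j|) over the edges of an undirected feature graph.

    Assumed form of the penalty: the larger coefficient of each connected pair
    sets the cost, so the smaller one can grow toward it at no extra charge.
    """
    return sum(max(abs(beta[i]), abs(beta[j])) for i, j in edges)


# Three features connected in a chain: 0 - 1 - 2 (made-up coefficients).
coefficients = [1.0, -3.0, 2.0]
graph_edges = [(0, 1), (1, 2)]
penalty = pairwise_linf_penalty(coefficients, graph_edges)
```

Because the penalty depends on magnitudes only, connected coefficients are encouraged to share the same absolute value regardless of sign, which is the sense in which grouping and selection happen simultaneously.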
Christensen, Julie A E; Zoetmulder, Marielle; Koch, Henriette; Frandsen, Rune; Arvastson, Lars; Christensen, Søren R; Jennum, Poul; Sorensen, Helge B D
2014-09-30
Manual scoring of sleep relies on identifying certain characteristics in polysomnograph (PSG) signals. However, these characteristics are disrupted in patients with neurodegenerative diseases. This study evaluates sleep using a topic modeling and unsupervised learning approach to identify sleep topics directly from electroencephalography (EEG) and electrooculography (EOG). PSG data from control subjects were used to develop an EOG and an EEG topic model. The models were applied to PSG data from 23 control subjects, 25 patients with periodic leg movements (PLMs), 31 patients with idiopathic REM sleep behavior disorder (iRBD) and 36 patients with Parkinson's disease (PD). The data were divided into training and validation datasets and features reflecting EEG and EOG characteristics based on topics were computed. The most discriminative feature subset for separating iRBD/PD and PLM/controls was estimated using a Lasso-regularized regression model. The features with highest discriminability were the number and stability of EEG topics linked to REM and N3, respectively. Validation of the model indicated a sensitivity of 91.4% and a specificity of 68.8% when classifying iRBD/PD patients. The topics showed visual accordance with the manually scored sleep stages, and the features revealed sleep characteristics containing information indicative of neurodegeneration. This study suggests that the amount of N3 and the ability to maintain NREM and REM sleep have potential as early PD biomarkers. Data-driven analysis of sleep may contribute to the evaluation of neurodegenerative patients. Copyright © 2014 Elsevier B.V. All rights reserved.
Recognition of predictors for mid-long term runoff prediction based on lasso
NASA Astrophysics Data System (ADS)
Xie, S.; Huang, Y.
2017-12-01
Reliable and accurate mid-long term runoff prediction is of great importance in the integrated management of reservoirs, and many methods have been proposed to model runoff time series. Almost all of these models use a forecast lead time (LT) of 1 month, with previous runoff at different time lags as predictors. Runoff prediction with increased LT, although more beneficial, is uncommon in current research because the connection between previous and current runoff weakens as LT increases. In this study, 74 atmospheric circulation factors (ACFs) together with pre-runoff are therefore used as alternative predictors for mid-long term runoff prediction at Longyangxia reservoir. Because pre-runoff and the 74 ACFs at different time lags yield a large number of candidate factors, most of which are uninformative, lasso (the 'least absolute shrinkage and selection operator') is used to identify predictors. The results demonstrate that the 74 ACFs are beneficial for runoff prediction in both validation and test sets when LT is greater than 6. Six factors other than pre-runoff, most of them with large time lags, are frequently selected as predictors. To verify the effect of the 74 ACFs, 74 stochastic time series generated from the normalized 74 ACFs are used as model input. These 74 stochastic time series prove useless, which confirms the effect of the 74 ACFs on mid-long term runoff prediction.
Peikert, Tobias; Duan, Fenghai; Rajagopalan, Srinivasan; Karwoski, Ronald A; Clay, Ryan; Robb, Richard A; Qin, Ziling; Sicks, JoRean; Bartholmai, Brian J; Maldonado, Fabien
2018-01-01
Optimization of the clinical management of screen-detected lung nodules is needed to avoid unnecessary diagnostic interventions. Herein we demonstrate the potential value of a novel radiomics-based approach for the classification of screen-detected indeterminate nodules. Independent quantitative variables assessing various radiologic nodule features such as sphericity, flatness, elongation, spiculation, lobulation and curvature were developed from the NLST dataset using 726 indeterminate nodules (all ≥ 7 mm, benign, n = 318 and malignant, n = 408). Multivariate analysis was performed using the least absolute shrinkage and selection operator (LASSO) method for variable selection and regularization in order to enhance the prediction accuracy and interpretability of the multivariate model. The bootstrapping method was then applied for the internal validation and the optimism-corrected AUC was reported for the final model. Eight of the originally considered 57 quantitative radiologic features were selected by LASSO multivariate modeling. These 8 features include variables capturing Location: vertical location (Offset carina centroid z), Size: volume estimate (Minimum enclosing brick), Shape: flatness, Density: texture analysis (Score Indicative of Lesion/Lung Aggression/Abnormality (SILA) texture), and surface characteristics: surface complexity (Maximum shape index and Average shape index), and estimates of surface curvature (Average positive mean curvature and Minimum mean curvature), all with P<0.01. The optimism-corrected AUC for these 8 features is 0.939. Our novel radiomic LDCT-based approach for indeterminate screen-detected nodule characterization appears extremely promising; however, independent external validation is needed.
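An optimism-corrected AUC of the kind reported above is commonly obtained with a bootstrap estimator of the following form. This is a sketch of the standard approach, not a formula taken from the abstract; the symbols $B$, $\mathrm{AUC}_{\text{app}}$, $\mathrm{AUC}^{(b)}_{\text{boot}}$ and $\mathrm{AUC}^{(b)}_{\text{orig}}$ are notation introduced here.

```latex
\mathrm{AUC}_{\text{corrected}}
  = \mathrm{AUC}_{\text{app}}
  - \frac{1}{B}\sum_{b=1}^{B}
    \left(\mathrm{AUC}^{(b)}_{\text{boot}} - \mathrm{AUC}^{(b)}_{\text{orig}}\right)
```

Here $\mathrm{AUC}_{\text{app}}$ is the apparent AUC of the model fitted to and evaluated on the full data. For each of the $B$ bootstrap resamples, the whole modeling procedure (including the LASSO selection) is repeated on the resample; $\mathrm{AUC}^{(b)}_{\text{boot}}$ is that model's AUC on the resample and $\mathrm{AUC}^{(b)}_{\text{orig}}$ its AUC on the original data. The averaged difference estimates how much the apparent AUC overstates performance.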
von Brachel, Ruth; Hötzel, Katrin; Hirschfeld, Gerrit; Rieger, Elizabeth; Schmidt, Ulrike; Kosfelder, Joachim; Hechler, Tanja; Schulte, Dietmar; Vocks, Silja
2014-03-31
One of the main problems of Internet-delivered interventions for a range of disorders is the high dropout rate, yet little is known about the factors associated with this. We recently developed and tested a Web-based 6-session program to enhance motivation to change for women with anorexia nervosa, bulimia nervosa, or related subthreshold eating pathology. The aim of the present study was to identify predictors of dropout from this Web program. A total of 179 women took part in the study. We used survival analyses (Cox regression) to investigate the predictive effect of eating disorder pathology (assessed by the Eating Disorders Examination-Questionnaire; EDE-Q), depressive mood (Hopkins Symptom Checklist), motivation to change (University of Rhode Island Change Assessment Scale; URICA), and participants' age at dropout. To identify predictors, we used the least absolute shrinkage and selection operator (LASSO) method. The dropout rate was 50.8% (91/179) and was equally distributed across the 6 treatment sessions. The LASSO analysis revealed that higher scores on the Shape Concerns subscale of the EDE-Q, a higher frequency of binge eating episodes and vomiting, as well as higher depression scores significantly increased the probability of dropout. However, we did not find any effect of the URICA or age on dropout. Women with more severe eating disorder pathology and depressive mood had a higher likelihood of dropping out from a Web-based motivational enhancement program. Interventions such as ours need to address the specific needs of women with more severe eating disorder pathology and depressive mood and offer them additional support to prevent them from prematurely discontinuing treatment. PMID:24686856
NASA Astrophysics Data System (ADS)
Peña-Castro, A. F.; Dougherty, S. L.; Harrington, R. M.; Cochran, E. S.
2017-12-01
Oklahoma has recently experienced a large increase in seismicity that has been linked to injection of large volumes of wastewater into deep disposal wells, a by-product of oil and gas production. Recent studies have shown that areas with active fluid injection and induced seismicity, such as Oklahoma, may be susceptible to dynamic triggering during passage of seismic waves from large, remote earthquakes. In spring 2016, the 1833-station LArge-n Seismic Survey in Oklahoma (LASSO) array was deployed for 30 days to examine an area of active seismicity in Grant County, located in northern Oklahoma. Here we use the LASSO array to look for dynamic triggering caused by teleseismic earthquakes with magnitudes between Mw 6-8 that produce Peak-Ground-Velocities (PGVs) exceeding 10 μm/s at the LASSO array, consistent with PGV values seen to have triggered seismicity at other locations. We focus on examining seismicity around the shallow Mw 7.8 event in Ecuador on 04/16/2016 which generated the largest PGV at LASSO (250 μm/s). To establish if earthquake rates change during or following the passage of the teleseismic surface waves, we develop a catalog of earthquakes around the time of each teleseismic event. We first create a preliminary catalog using a Short-Term Average/Long-Term Average (STA/LTA) detection algorithm window spanning +/- 24 hours around each teleseism, requiring detection at a minimum of 110 LASSO stations to identify an event. Next, we enhance the STA/LTA catalog with manual detections for a period of +/- 1.5 hours around the time of the teleseismic P-wave arrival to explore if triggering occurs that is not detected by the automated procedure. All detected events are then located using standard location techniques. Any observed seismicity rate changes following the teleseismic arrivals will be compared with the short-term background rates to determine whether they are statistically significant. 
If triggering is observed, focal mechanisms will be determined to estimate fault plane orientations and resolve triggering stresses on receiver fault planes. Our preliminary results for the Mw 7.8 Ecuador event suggest there may be delayed triggering that starts roughly 4 hours after the teleseismic phase arrivals, with event rates increasing from 0-5 to 15-25 events per hour.
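The STA/LTA detection step mentioned in this abstract can be sketched for a single trace as follows. This is a minimal illustration that assumes trailing windows over squared amplitudes and a fixed trigger threshold; the window lengths, the threshold of 3.0, and the synthetic signal are hypothetical choices, and the 110-station coincidence requirement is not modelled.

```python
def sta_lta(signal, n_sta, n_lta):
    """Return the STA/LTA ratio of signal energy for each sample.

    Both averages use trailing windows that end at the current sample.
    The ratio is 0.0 until a full long-term window is available.
    """
    energy = [s * s for s in signal]
    ratios = [0.0] * len(signal)
    for i in range(n_lta - 1, len(signal)):
        sta = sum(energy[i - n_sta + 1 : i + 1]) / n_sta
        lta = sum(energy[i - n_lta + 1 : i + 1]) / n_lta
        ratios[i] = sta / lta if lta > 0 else 0.0
    return ratios


# Synthetic trace (hypothetical): unit-amplitude background with a 5-sample burst.
signal = [1.0 if i % 2 == 0 else -1.0 for i in range(50)] + [10.0] * 5 + [1.0, -1.0] * 10
ratios = sta_lta(signal, n_sta=5, n_lta=40)

# Samples whose ratio exceeds the (assumed) threshold are flagged as detections.
triggers = [i for i, r in enumerate(ratios) if r > 3.0]
print(triggers[0])  # index of the first flagged sample
```

On the background the ratio stays at 1, and it rises sharply as soon as the short window overlaps the burst, so the first flagged sample coincides with the burst onset. A network detector would additionally count how many stations trigger within a common time window before declaring an event.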
Schaake, Wouter; van der Schaaf, Arjen; van Dijk, Lisanne V; Bongaerts, Alfons H H; van den Bergh, Alfons C M; Langendijk, Johannes A
2016-06-01
Curative radiotherapy for prostate cancer may lead to anorectal side effects, including rectal bleeding, fecal incontinence, increased stool frequency and rectal pain. The main objective of this study was to develop multivariable NTCP models for these side effects. The study sample was composed of 262 patients with localized or locally advanced prostate cancer (stage T1-3). Anorectal toxicity was prospectively assessed using a standardized follow-up program. Different anatomical subregions within and around the anorectum were delineated. A LASSO logistic regression analysis was used to analyze dose volume effects on toxicity. In the univariable analysis, rectal bleeding, increase in stool frequency and fecal incontinence were significantly associated with a large number of dosimetric parameters. The collinearity between these predictors was high (VIF>5). In the multivariable model, rectal bleeding was associated with the anorectum (V70) and anticoagulant use, fecal incontinence was associated with the external sphincter (V15) and the iliococcygeal muscle (V55). Finally, increase in stool frequency was associated with the iliococcygeal muscle (V45) and the levator ani (V40). No significant associations were found for rectal pain. Different anorectal side effects are associated with different anatomical substructures within and around the anorectum. The dosimetric variables associated with these side effects can be used to optimize radiotherapy treatment planning aiming at prevention of specific side effects and to estimate the benefit of new radiation technologies. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
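The collinearity threshold quoted above can be read through the usual definition of the variance inflation factor. The notation $R_j^2$ (the coefficient of determination from regressing predictor $j$ on all other predictors) is introduced here and does not appear in the abstract.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 5 \iff R_j^2 > 0.8
```

A VIF above 5 therefore means that more than 80% of the variance of a dosimetric parameter is explained by the other parameters, which is the situation in which a penalized method such as LASSO is typically preferred over unpenalized multivariable regression.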
Yoo, Jin Eun
2018-01-01
A substantial body of research has been conducted on variables relating to students' mathematics achievement with TIMSS. However, most studies have employed conventional statistical methods, and have focused on a select few indicators instead of utilizing the hundreds of variables TIMSS provides. This study aimed to find a prediction model for students' mathematics achievement using as many TIMSS student and teacher variables as possible. Elastic net, the selected machine learning technique in this study, takes advantage of both LASSO and ridge in terms of variable selection and multicollinearity, respectively. A logistic regression model was also employed to predict TIMSS 2011 Korean 4th graders' mathematics achievement. Ten-fold cross-validation with mean squared error was employed to determine the elastic net regularization parameter. Among 162 TIMSS variables explored, 12 student and 5 teacher variables were selected in the elastic net model, and the prediction accuracy, sensitivity, and specificity were 76.06, 70.23, and 80.34%, respectively. This study showed that the elastic net method can be successfully applied to educational large-scale data by selecting a subset of variables with reasonable prediction accuracy and finding new variables to predict students' mathematics achievement. Newly found variables via machine learning can shed light on the existing theories from a totally different perspective, which in turn may lead to the creation of new theories or complement existing ones. This study also examined the current scale development convention from a machine learning perspective. PMID:29599736
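How the elastic net combines the LASSO and ridge penalties can be made concrete with the penalized logistic log-likelihood below. The parametrization with a mixing weight $\alpha$ and an overall strength $\lambda$ is a common convention assumed here; the abstract does not state which form or which value of $\alpha$ was used.

```latex
(\hat\beta_0, \hat\beta)
  = \arg\min_{\beta_0,\,\beta}\;
    -\frac{1}{n}\sum_{i=1}^{n}\Bigl[y_i\,\eta_i - \log\bigl(1 + e^{\eta_i}\bigr)\Bigr]
    + \lambda\Bigl[\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr],
\qquad
\eta_i = \beta_0 + x_i^{\top}\beta
```

Setting $\alpha = 1$ recovers the LASSO and $\alpha = 0$ recovers ridge regression. The $\ell_1$ term sets some coefficients exactly to zero (variable selection), while the $\ell_2$ term stabilizes the estimates when predictors are strongly correlated.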
van Dijk, Lisanne V; Brouwer, Charlotte L; van der Schaaf, Arjen; Burgerhof, Johannes G M; Beukinga, Roelof J; Langendijk, Johannes A; Sijtsema, Nanna M; Steenbakkers, Roel J H M
2017-02-01
Current models for the prediction of late patient-rated moderate-to-severe xerostomia (XER12m) and sticky saliva (STIC12m) after radiotherapy are based on dose-volume parameters and baseline xerostomia (XERbase) or sticky saliva (STICbase) scores. The purpose is to improve prediction of XER12m and STIC12m with patient-specific characteristics, based on CT image biomarkers (IBMs). Planning CT-scans and patient-rated outcome measures were prospectively collected for 249 head and neck cancer patients treated with definitive radiotherapy with or without systemic treatment. The potential IBMs represent geometric, CT intensity and textural characteristics of the parotid and submandibular glands. Lasso regularisation was used to create multivariable logistic regression models, which were internally validated by bootstrapping. The prediction of XER12m could be improved significantly by adding the IBM "Short Run Emphasis" (SRE), which quantifies heterogeneity of parotid tissue, to a model with mean contra-lateral parotid gland dose and XERbase. For STIC12m, the IBM maximum CT intensity of the submandibular gland was selected in addition to STICbase and mean dose to submandibular glands. Prediction of XER12m and STIC12m was improved by including IBMs representing heterogeneity and density of the salivary glands, respectively. These IBMs could guide additional research to the patient-specific response of healthy tissue to radiation dose. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Majumdar, Subhabrata; Basak, Subhash C
2018-04-26
Proper validation is an important aspect of QSAR modelling. External validation is one of the widely used validation methods in QSAR where the model is built on a subset of the data and validated on the rest of the samples. However, its effectiveness for datasets with a small number of samples but large number of predictors remains suspect. Calculating hundreds or thousands of molecular descriptors using currently available software has become the norm in QSAR research, owing to computational advances in the past few decades. Thus, for n chemical compounds and p descriptors calculated for each molecule, the typical chemometric dataset today has high value of p but small n (i.e. n < p). Motivated by the evidence of inadequacies of external validation in estimating the true predictive capability of a statistical model in recent literature, this paper performs an extensive and comparative study of this method with several other validation techniques. We compared four validation methods: leave-one-out, K-fold, external and multi-split validation, using statistical models built using the LASSO regression, which simultaneously performs variable selection and modelling. We used 300 simulated datasets and one real dataset of 95 congeneric amine mutagens for this evaluation. External validation metrics have high variation among different random splits of the data, hence are not recommended for predictive QSAR models. LOO has the overall best performance among all validation methods applied in our scenario. Results from external validation are too unstable for the datasets we analyzed. Based on our findings, we recommend using the LOO procedure for validating QSAR predictive models built on high-dimensional small-sample data. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
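The relation between K-fold and leave-one-out (LOO) validation compared in this abstract can be sketched as follows. This is a simplified illustration that swaps in a one-predictor least-squares line for the LASSO models used in the paper; the function names, the contiguous (unshuffled) folds, and the Q² definition based on the overall mean are assumptions made here.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def cv_q2(x, y, k):
    """Cross-validated Q^2 = 1 - PRESS / TSS; k == len(y) gives leave-one-out."""
    n = len(y)
    press = 0.0
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        press += sum((y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx)
    my = sum(y) / n
    tss = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / tss


# Hypothetical noise-free data: every held-out point is predicted exactly.
x = [float(i) for i in range(8)]
y = [2.0 * xi + 1.0 for xi in x]
q2_loo = cv_q2(x, y, k=len(y))  # leave-one-out
q2_4fold = cv_q2(x, y, k=4)     # 4-fold with contiguous folds
print(q2_loo, q2_4fold)
```

External validation corresponds to using a single split only, which is why its result depends strongly on which samples happen to land in the held-out set when n is small. In practice the sample order would be shuffled before forming folds.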
Pacheco, Antonio G; Grinsztejn, Beatriz; da Fonseca, Maria de Jesus M; Moreira, Ronaldo I; Veloso, Valdiléa G; Friedman, Ruth K; Santini-Oliveira, Marilia; Cardoso, Sandra W; Falcão, Melissa; Mill, José G; Bensenor, Isabela; Lotufo, Paulo; Chor, Dóra
2015-01-01
Combination antiretroviral therapy (cART) had a dramatic impact on the mortality profile in human immunodeficiency virus (HIV) infected individuals and increased their life-expectancy. Conditions associated with the aging process have been diagnosed more frequently among HIV-infected patients, particularly cardiovascular diseases. Patients followed in the Instituto de Pesquisa Clínica Evandro Chagas (IPEC) prospective cohort in Rio de Janeiro were submitted to the general procedures from the Brazilian Longitudinal Study of Adult Health, comprising several anthropometric, laboratory and imaging data. Carotid intima-media thickness (cIMT) was measured by ultrasonography, following the Mannheim protocol. Linear regression and proportional odds models were used to compare groups and covariables with respect to cIMT. The best model was chosen with the adaptive lasso procedure. A valid cIMT exam was available for 591 patients. Median cIMT was significantly larger for men than women (0.56 mm vs. 0.53 mm; p = 0.002; overall = 0.54 mm). In univariable linear regression analysis, both traditional risk factors for cardiovascular diseases (CVD) and HIV-specific characteristics were significantly associated with cIMT values, but the best multivariable model chosen included only traditional characteristics. Hypertension presented the strongest association with higher cIMT tertiles (OR = 2.51; 95%CI = 1.69-3.73), followed by current smoking (OR = 1.82; 95%CI = 1.19-2.79), family history of acute myocardial infarction or stroke (OR = 1.60; 95%CI = 1.10-2.32) and age (OR per year = 1.12; 95%CI = 1.10-1.14). Our results show that traditional CVD risk factors are the major players in determining increased cIMT among HIV-infected patients in Brazil. This finding reinforces the need for thorough assessment of those risk factors in these patients to guarantee that the incidence of CVD events remains under control.
Multiplex proteomics for prediction of major cardiovascular events in type 2 diabetes.
Nowak, Christoph; Carlsson, Axel C; Östgren, Carl Johan; Nyström, Fredrik H; Alam, Moudud; Feldreich, Tobias; Sundström, Johan; Carrero, Juan-Jesus; Leppert, Jerzy; Hedberg, Pär; Henriksen, Egil; Cordeiro, Antonio C; Giedraitis, Vilmantas; Lind, Lars; Ingelsson, Erik; Fall, Tove; Ärnlöv, Johan
2018-05-24
Multiplex proteomics could improve understanding and risk prediction of major adverse cardiovascular events (MACE) in type 2 diabetes. This study assessed 80 cardiovascular and inflammatory proteins for biomarker discovery and prediction of MACE in type 2 diabetes. We combined data from six prospective epidemiological studies of 30-77-year-old individuals with type 2 diabetes in whom 80 circulating proteins were measured by proximity extension assay. Multivariable-adjusted Cox regression was used in a discovery/replication design to identify biomarkers for incident MACE. We used gradient-boosted machine learning and lasso regularised Cox regression in a random 75% training subsample to assess whether adding proteins to risk factors included in the Swedish National Diabetes Register risk model would improve the prediction of MACE in the separate 25% test subsample. Of 1211 adults with type 2 diabetes (32% women), 211 experienced a MACE over a mean (±SD) of 6.4 ± 2.3 years. We replicated associations (<5% false discovery rate) between risk of MACE and eight proteins: matrix metalloproteinase (MMP)-12, IL-27 subunit α (IL-27a), kidney injury molecule (KIM)-1, fibroblast growth factor (FGF)-23, protein S100-A12, TNF receptor (TNFR)-1, TNFR-2 and TNF-related apoptosis-inducing ligand receptor (TRAIL-R)2. Addition of the 80-protein assay to established risk factors improved discrimination in the separate test sample from 0.686 (95% CI 0.682, 0.689) to 0.748 (95% CI 0.746, 0.751). A sparse model of 20 added proteins achieved a C statistic of 0.747 (95% CI 0.653, 0.842) in the test sample. We identified eight protein biomarkers, four of which are novel, for risk of MACE in community residents with type 2 diabetes, and found improved risk prediction by combining multiplex proteomics with an established risk model. Multiprotein arrays could be useful in identifying individuals with type 2 diabetes who are at highest risk of a cardiovascular event.
Noh, Hwayoung; Freisling, Heinz; Assi, Nada; Zamora-Ros, Raul; Achaintre, David; Affret, Aurélie; Mancini, Francesca; Boutron-Ruault, Marie-Christine; Flögel, Anna; Boeing, Heiner; Kühn, Tilman; Schübel, Ruth; Trichopoulou, Antonia; Naska, Androniki; Kritikou, Maria; Palli, Domenico; Pala, Valeria; Tumino, Rosario; Ricceri, Fulvio; Santucci de Magistris, Maria; Cross, Amanda; Slimani, Nadia; Scalbert, Augustin; Ferrari, Pietro
2017-07-25
We identified urinary polyphenol metabolite patterns by a novel algorithm that combines dimension reduction and variable selection methods to explain polyphenol-rich food intake, and compared their respective performance with that of single biomarkers in the European Prospective Investigation into Cancer and Nutrition (EPIC) study. The study included 475 adults from four European countries (Germany, France, Italy, and Greece). Dietary intakes were assessed with 24-h dietary recalls (24-HDR) and dietary questionnaires (DQ). Thirty-four polyphenols were measured by ultra-performance liquid chromatography-electrospray ionization-tandem mass spectrometry (UPLC-ESI-MS-MS) in 24-h urine. Reduced rank regression-based variable importance in projection (RRR-VIP) and least absolute shrinkage and selection operator (LASSO) methods were used to select polyphenol metabolites. Reduced rank regression (RRR) was then used to identify patterns in these metabolites, maximizing the explained variability in intake of pre-selected polyphenol-rich foods. The performance of RRR models was evaluated using internal cross-validation to control for over-optimistic findings from over-fitting. High performance was observed for explaining recent intake (24-HDR) of red wine (r = 0.65; AUC = 89.1%), coffee (r = 0.51; AUC = 89.1%), and olives (r = 0.35; AUC = 82.2%). These metabolite patterns performed better or equally well compared to single polyphenol biomarkers. Neither metabolite patterns nor single biomarkers performed well in explaining habitual intake (as reported in the DQ) of polyphenol-rich foods. This proposed strategy of biomarker pattern identification has the potential of expanding the currently still limited list of available dietary intake biomarkers.
Protein Biomarkers for Insulin Resistance and Type 2 Diabetes Risk in Two Large Community Cohorts
Nowak, Christoph; Sundström, Johan; Gustafsson, Stefan; Giedraitis, Vilmantas; Lind, Lars; Ingelsson, Erik; Fall, Tove
2016-01-01
Insulin resistance (IR) is a precursor of type 2 diabetes (T2D), and improved risk prediction and understanding of the pathogenesis are needed. We used a novel high-throughput 92-protein assay to identify circulating biomarkers for HOMA of IR in two cohorts of community residents without diabetes (n = 1,367) (mean age 73 ± 3.6 years). Adjusted linear regression identified cathepsin D and confirmed six proteins (leptin, renin, interleukin-1 receptor antagonist [IL-1ra], hepatocyte growth factor, fatty acid–binding protein 4, and tissue plasminogen activator [t-PA]) as IR biomarkers. Mendelian randomization analysis indicated a positive causal effect of IR on t-PA concentrations. Two biomarkers, IL-1ra (hazard ratio [HR] 1.28, 95% CI 1.03–1.59) and t-PA (HR 1.30, 1.02–1.65) were associated with incident T2D, and t-PA predicted 5-year transition to hyperglycemia (odds ratio 1.30, 95% CI 1.02–1.65). Additional adjustment for fasting glucose rendered both coefficients insignificant and revealed an association between renin and T2D (HR 0.79, 0.62–0.99). LASSO regression suggested a risk model including IL-1ra, t-PA, and the Framingham Offspring Study T2D score, but prediction improvement was nonsignificant (difference in C-index 0.02, 95% CI −0.08 to 0.12) over the T2D score only. In conclusion, proteomic blood profiling indicated cathepsin D as a new IR biomarker and suggested a causal effect of IR on t-PA. PMID:26420861
T2L2 on JASON-2: First Evaluation of the Flying Model
2007-01-01
Para, J.-M. Torre R&D Metrology CNRS/GEMINI Observatoire de la Côte d’Azur Caussol, France E-mail: philippe.guillemot@cnes.fr Abstract...Laser Link” experiment T2L2 [1], under development at OCA (Observatoire de la Côte d’Azur) and CNES (Centre National d’Etudes Spatiales), France, will be...Experimental Astronomy, 7, 191-207. [2] P. Fridelance and C. Veillet, 1995, “Operation and data analysis in the LASSO experiment,” Metrologia
A resilient domain decomposition polynomial chaos solver for uncertain elliptic PDEs
NASA Astrophysics Data System (ADS)
Mycek, Paul; Contreras, Andres; Le Maître, Olivier; Sargsyan, Khachik; Rizzi, Francesco; Morris, Karla; Safta, Cosmin; Debusschere, Bert; Knio, Omar
2017-07-01
A resilient method is developed for the solution of uncertain elliptic PDEs on extreme scale platforms. The method is based on a hybrid domain decomposition, polynomial chaos (PC) framework that is designed to address soft faults. Specifically, parallel and independent solves of multiple deterministic local problems are used to define PC representations of local Dirichlet boundary-to-boundary maps that are used to reconstruct the global solution. A LAD-lasso type regression is developed for this purpose. The performance of the resulting algorithm is tested on an elliptic equation with an uncertain diffusivity field. Different test cases are considered in order to analyze the impacts of correlation structure of the uncertain diffusivity field, the stochastic resolution, as well as the probability of soft faults. In particular, the computations demonstrate that, provided sufficiently many samples are generated, the method effectively overcomes the occurrence of soft faults.
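A LAD-lasso regression of the type mentioned above generally takes the form below. This is the generic estimator written in notation introduced here ($y_i$ for sampled responses, $\Psi_k$ for PC basis functions evaluated at sample $\xi_i$, $c_k$ for PC coefficients); the exact formulation used in the paper is not given in the abstract.

```latex
\hat{c} = \arg\min_{c}\;
  \sum_{i=1}^{N}\Bigl|\,y_i - \sum_{k=0}^{P} c_k\,\Psi_k(\xi_i)\Bigr|
  \;+\; \lambda \sum_{k=0}^{P} \lvert c_k\rvert
```

Replacing the squared residual with an absolute residual makes the fit far less sensitive to a few grossly wrong samples, which is presumably why this form suits data corrupted by soft faults, while the $\ell_1$ penalty keeps the PC representation sparse.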
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dyar, M. Darby; McCanta, Molly; Breves, Elly
2016-03-01
Pre-edge features in the K absorption edge of X-ray absorption spectra are commonly used to predict Fe3+ valence state in silicate glasses. However, this study shows that using the entire spectral region from the pre-edge into the extended X-ray absorption fine-structure region provides more accurate results when combined with multivariate analysis techniques. The least absolute shrinkage and selection operator (lasso) regression technique yields %Fe3+ values that are accurate to ±3.6% absolute when the full spectral region is employed. This method can be used across a broad range of glass compositions, is easily automated, and is demonstrated to yield accurate results from different synchrotrons. It will enable future studies involving X-ray mapping of redox gradients on standard thin sections at 1 × 1 μm pixel sizes.
Ching, Travers; Zhu, Xun; Garmire, Lana X
2018-04-01
Artificial neural networks (ANN) are computing architectures with many interconnections of simple neural-inspired computing elements, and have been applied to biomedical fields such as imaging analysis and diagnosis. We have developed a new ANN framework called Cox-nnet to predict patient prognosis from high throughput transcriptomics data. In 10 TCGA RNA-Seq data sets, Cox-nnet achieves the same or better predictive accuracy compared to other methods, including Cox-proportional hazards regression (with LASSO, ridge, and minimax concave penalty), Random Forests Survival and CoxBoost. Cox-nnet also reveals richer biological information, at both the pathway and gene levels. The outputs from the hidden layer node provide an alternative approach for survival-sensitive dimension reduction. In summary, we have developed a new method for accurate and efficient prognosis prediction on high throughput data, with functional biological insights. The source code is freely available at https://github.com/lanagarmire/cox-nnet.
Andrinopoulou, Eleni-Rosalina; Rizopoulos, Dimitris
2016-11-20
The joint modeling of longitudinal and survival data has recently received much attention. Several extensions of the standard joint model that consists of one longitudinal and one survival outcome have been proposed including the use of different association structures between the longitudinal and the survival outcomes. However, in general, relatively little attention has been given to the selection of the most appropriate functional form to link the two outcomes. In common practice, it is assumed that the underlying value of the longitudinal outcome is associated with the survival outcome. However, it could be that different characteristics of the patients' longitudinal profiles influence the hazard, for example not only the current value but also the slope or the area under the curve of the longitudinal outcome. The choice of which functional form to use is an important decision that needs to be investigated because it could influence the results. In this paper, we use a Bayesian shrinkage approach in order to determine the most appropriate functional forms. We propose a joint model that includes different association structures of different biomarkers and assume informative priors for the regression coefficients that correspond to the terms of the longitudinal process. Specifically, we assume Bayesian lasso, Bayesian ridge, Bayesian elastic net, and horseshoe. These methods are applied to a dataset consisting of patients with a chronic liver disease, where it is important to investigate which characteristics of the biomarkers have an influence on survival. Copyright © 2016 John Wiley & Sons, Ltd.
WE-E-BRE-05: Ensemble of Graphical Models for Predicting Radiation Pneumontis Risk
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, S; Ybarra, N; Jeyaseelan, K
Purpose: We propose a prior knowledge-based approach to construct an interaction graph of biological and dosimetric radiation pneumonitis (RP) covariates for the purpose of developing an RP risk classifier. Methods: We recruited 59 NSCLC patients who received curative radiotherapy with a minimum 6-month follow-up. Sixteen RP events were observed (CTCAE grade ≥2). Blood serum was collected from every patient before (pre-RT) and during RT (mid-RT). From each sample, the concentrations of the following five candidate biomarkers were taken as covariates: alpha-2-macroglobulin (α2M), angiotensin converting enzyme (ACE), transforming growth factor β (TGF-β), interleukin-6 (IL-6), and osteopontin (OPN). Dose-volumetric parameters were also included as covariates. The number of biological and dosimetric covariates was reduced by a variable selection scheme implemented by L1-regularized logistic regression (LASSO). The posterior probability distribution of interaction graphs between the selected variables was estimated from the data under the literature-based prior knowledge to weight more heavily the graphs that contain the expected associations. A graph ensemble was formed by averaging the most probable graphs weighted by their posterior, creating a Bayesian Network (BN)-based RP risk classifier. Results: The LASSO selected the following 7 RP covariates: (1) pre-RT concentration level of α2M, (2) α2M level mid-RT/pre-RT, (3) pre-RT IL6 level, (4) IL6 level mid-RT/pre-RT, (5) ACE mid-RT/pre-RT, (6) PTV volume, and (7) mean lung dose (MLD). The ensemble BN model achieved the maximum sensitivity/specificity of 81%/84% and outperformed univariate dosimetric predictors as shown by larger AUC values (0.78∼0.81) compared with MLD (0.61), V20 (0.65) and V30 (0.70). The ensembles obtained by incorporating the prior knowledge improved classification performance for the ensemble size 5∼50.
Conclusion: We demonstrated a probabilistic ensemble method to detect robust associations between RP covariates and its potential to improve RP prediction accuracy. Our Bayesian approach to incorporate prior knowledge can enhance efficiency in searching for such associations from data. The authors acknowledge partial support by: 1) CREATE Medical Physics Research Training Network grant of the Natural Sciences and Engineering Research Council (Grant number: 432290) and 2) The Terry Fox Foundation Strategic Training Initiative for Excellence in Radiation Research for the 21st Century (EIRR21)
Mallick, Himel; Tiwari, Hemant K.
2016-01-01
Count data are increasingly ubiquitous in genetic association studies, where it is possible to observe excess zero counts as compared to what is expected based on standard assumptions. For instance, in rheumatology, data are usually collected in multiple joints within a person or multiple sub-regions of a joint, and it is not uncommon that the phenotypes contain an enormous number of zeroes due to the presence of excessive zero counts in the majority of patients. Most existing statistical methods assume that the count phenotypes follow one of these four distributions with appropriate dispersion-handling mechanisms: Poisson, Zero-inflated Poisson (ZIP), Negative Binomial, and Zero-inflated Negative Binomial (ZINB). However, little is known about their implications in genetic association studies. Also, there is a relative paucity of literature on their usefulness with respect to model misspecification and variable selection. In this article, we have investigated the performance of several state-of-the-art approaches for handling zero-inflated count data along with a novel penalized regression approach with an adaptive LASSO penalty, by simulating data under a variety of disease models and linkage disequilibrium patterns. By taking into account data-adaptive weights in the estimation procedure, the proposed method provides greater flexibility in multi-SNP modeling of zero-inflated count phenotypes. A fast coordinate descent algorithm nested within an EM (expectation-maximization) algorithm is implemented for estimating the model parameters and conducting variable selection simultaneously. Results show that the proposed method has optimal performance in the presence of multicollinearity, as measured by both prediction accuracy and empirical power, which is especially apparent as the sample size increases. Moreover, the Type I error rates become more or less uncontrollable for the competing methods when a model is misspecified, a phenomenon routinely encountered in practice.
PMID:27066062
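For reference, the zero-inflated Poisson distribution and the adaptive LASSO penalty named in the abstract above can be written in their standard textbook forms. The symbols below (\pi for the structural-zero probability, \lambda for the Poisson mean, \tau for the tuning parameter, w_j for the data-adaptive weights, \tilde{\beta}_j for an initial coefficient estimate, and \gamma for the weight exponent) are notation assumed here for illustration; the paper's own parameterization may differ.

```latex
% Zero-inflated Poisson (ZIP) probability mass function
P(Y = 0) = \pi + (1 - \pi)\, e^{-\lambda}, \qquad
P(Y = y) = (1 - \pi)\, \frac{e^{-\lambda} \lambda^{y}}{y!}, \quad y = 1, 2, \dots

% Adaptive LASSO penalty with data-adaptive weights
\mathrm{pen}(\beta) = \tau \sum_{j=1}^{p} w_j \lvert \beta_j \rvert, \qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}}, \quad \gamma > 0
```

Setting \pi = 0 recovers the ordinary Poisson model, and setting all w_j = 1 recovers the ordinary LASSO penalty.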
Gómez, Luz Marina; Marchioni, Dirce Maria Lobo; dos Anjos, Fernanda Silva Nogueira; Molina, Maria del Carmen Bisi; Lotufo, Paulo Andrade; Benseñor, Isabela Judith Martins; Titan, Silvia Maria de Oliveira
2018-01-01
Coronary artery calcification (CAC) is a widespread condition in chronic kidney disease (CKD). Diet may play an important role in CAC, but this role is not clear. This study evaluated the association between macro- and micronutrient intakes and CAC in non-dialysis CKD patients. We analyzed the baseline data from 454 participants of the PROGREDIR study. Dietary intake was evaluated by a food frequency questionnaire. CAC was measured by computed tomography. After exclusion of participants with a coronary stent, 373 people remained for the analyses. The highest tertile of CAC was directly associated with the intake of phosphorus, calcium and magnesium. There was a higher intake of pantothenic acid and potassium in the second tertile. After adjustments for confounding variables, the intake of pantothenic acid, phosphorus, calcium and potassium remained associated with CAC in the generalized linear mixed models. In order to handle the collinearity between these nutrients, we used the LASSO (least absolute shrinkage and selection operator) regression to evaluate the nutrients associated with CAC variability. In this approach, the nutrients that most explained the variance of CAC were phosphorus, calcium and potassium. Prospective studies are needed to confirm these findings and assess the role of interventions regarding these micronutrients on CAC prevention and progression. PMID:29562658
Machado, Alisson Diego; Gómez, Luz Marina; Marchioni, Dirce Maria Lobo; Dos Anjos, Fernanda Silva Nogueira; Molina, Maria Del Carmen Bisi; Lotufo, Paulo Andrade; Benseñor, Isabela Judith Martins; Titan, Silvia Maria de Oliveira
2018-03-19
Coronary artery calcification (CAC) is a widespread condition in chronic kidney disease (CKD). Diet may play an important role in CAC, but this role is not clear. This study evaluated the association between macro- and micronutrient intakes and CAC in non-dialysis CKD patients. We analyzed the baseline data from 454 participants of the PROGREDIR study. Dietary intake was evaluated by a food frequency questionnaire. CAC was measured by computed tomography. After exclusion of participants with a coronary stent, 373 people remained for the analyses. The highest tertile of CAC was directly associated with the intake of phosphorus, calcium and magnesium. There was a higher intake of pantothenic acid and potassium in the second tertile. After adjustments for confounding variables, the intake of pantothenic acid, phosphorus, calcium and potassium remained associated with CAC in the generalized linear mixed models. In order to handle the collinearity between these nutrients, we used the LASSO (least absolute shrinkage and selection operator) regression to evaluate the nutrients associated with CAC variability. In this approach, the nutrients that most explained the variance of CAC were phosphorus, calcium and potassium. Prospective studies are needed to confirm these findings and assess the role of interventions regarding these micronutrients on CAC prevention and progression.
Continuous-time discrete-space models for animal movement
Hanks, Ephraim M.; Hooten, Mevin B.; Alldredge, Mat W.
2015-01-01
The processes influencing animal movement and resource selection are complex and varied. Past efforts to model behavioral changes over time used Bayesian statistical models with variable parameter space, such as reversible-jump Markov chain Monte Carlo approaches, which are computationally demanding and inaccessible to many practitioners. We present a continuous-time discrete-space (CTDS) model of animal movement that can be fit using standard generalized linear modeling (GLM) methods. This CTDS approach allows for the joint modeling of location-based as well as directional drivers of movement. Changing behavior over time is modeled using a varying-coefficient framework which maintains the computational simplicity of a GLM approach, and variable selection is accomplished using a group lasso penalty. We apply our approach to a study of two mountain lions (Puma concolor) in Colorado, USA.
Sparse Additive Ordinary Differential Equations for Dynamic Gene Regulatory Network Modeling.
Wu, Hulin; Lu, Tao; Xue, Hongqi; Liang, Hua
2014-04-02
The gene regulation network (GRN) is a high-dimensional complex system, which can be represented by various mathematical or statistical models. The ordinary differential equation (ODE) model is one of the popular dynamic GRN models. High-dimensional linear ODE models have been proposed to identify GRNs, but with a limitation of the linear regulation effect assumption. In this article, we propose a sparse additive ODE (SA-ODE) model, coupled with ODE estimation methods and adaptive group LASSO techniques, to model dynamic GRNs that could flexibly deal with nonlinear regulation effects. The asymptotic properties of the proposed method are established and simulation studies are performed to validate the proposed approach. An application example for identifying the nonlinear dynamic GRN of T-cell activation is used to illustrate the usefulness of the proposed method.
A Brief Survey of Modern Optimization for Statisticians
Lange, Kenneth; Chi, Eric C.; Zhou, Hua
2014-01-01
Modern computational statistics is turning more and more to high-dimensional optimization to handle the deluge of big data. Once a model is formulated, its parameters can be estimated by optimization. Because model parsimony is important, models routinely include nondifferentiable penalty terms such as the lasso. This sober reality complicates minimization and maximization. Our broad survey stresses a few important principles in algorithm design. Rather than view these principles in isolation, it is more productive to mix and match them. A few well chosen examples illustrate this point. Algorithm derivation is also emphasized, and theory is downplayed, particularly the abstractions of the convex calculus. Thus, our survey should be useful and accessible to a broad audience. PMID:25242858
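The difficulty with nondifferentiable penalties mentioned in the survey abstract above can be made concrete with a small sketch. The code below is not taken from the survey; it is a minimal pure-Python illustration of cyclic coordinate descent for the lasso, assuming the objective (1/(2n))‖y − Xb‖² + lam‖b‖₁. The function names `soft_threshold` and `lasso_coordinate_descent` are hypothetical names chosen for this example.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |x|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (assumed objective).

    X is a list of n rows (each a list of p floats); y is a list of n floats.
    Each coordinate update has a closed form given by soft-thresholding.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    # Mean squared value of each column, used to scale the update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the residual that excludes j's own fit
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta
```

The soft-thresholding step is what sets small coefficients exactly to zero, which is how the lasso performs variable selection despite the kink of the absolute value at the origin. A fixed iteration count is used here for brevity; a practical implementation would normally stop on a convergence tolerance.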
DOE Office of Scientific and Technical Information (OSTI.GOV)
Niedzielski, Joshua S., E-mail: jsniedzielski@mdanderson.org; University of Texas Houston Graduate School of Biomedical Science, Houston, Texas; Yang, Jinzhong
Purpose: We sought to investigate the ability of mid-treatment 18F-fluorodeoxyglucose positron emission tomography (PET) studies to objectively and spatially quantify esophageal injury in vivo from radiation therapy for non-small cell lung cancer. Methods and Materials: This retrospective study was approved by the local institutional review board, with written informed consent obtained before enrollment. We normalized 18F-fluorodeoxyglucose PET uptake to each patient's low-irradiated region (<5 Gy) of the esophagus, as a radiation response measure. Spatially localized metrics of normalized uptake (normalized standard uptake value [nSUV]) were derived for 79 patients undergoing concurrent chemoradiation therapy for non-small cell lung cancer. We used nSUV metrics to classify esophagitis grade at the time of the PET study, as well as maximum severity by treatment completion, according to National Cancer Institute Common Terminology Criteria for Adverse Events, using multivariate least absolute shrinkage and selection operator (LASSO) logistic regression and repeated 3-fold cross validation (training, validation, and test folds). This 3-fold cross-validation LASSO model procedure was used to predict toxicity progression from 43 asymptomatic patients during the PET study. Dose-volume metrics were also tested in both the multivariate classification and the symptom progression prediction analyses. Classification performance was quantified with the area under the curve (AUC) from receiver operating characteristic analysis on the test set from the 3-fold analyses. Results: Statistical analysis showed increasing nSUV is related to esophagitis severity. Axial-averaged maximum nSUV for 1 esophageal slice and esophageal length with at least 40% of axial-averaged nSUV both had AUCs of 0.85 for classifying grade 2 or higher esophagitis at the time of the PET study and AUCs of 0.91 and 0.92, respectively, for maximum grade 2 or higher by treatment completion.
Symptom progression was predicted with an AUC of 0.75. Dose metrics performed poorly at classifying esophagitis (AUC of 0.52, grade 2 or higher mid treatment) or predicting symptom progression (AUC of 0.67). Conclusions: Normalized uptake can objectively, locally, and noninvasively quantify esophagitis during radiation therapy and predict eventual symptoms from asymptomatic patients. Normalized uptake may provide patient-specific dose-response information not discernible from dose.
Niedzielski, Joshua S; Yang, Jinzhong; Liao, Zhongxing; Gomez, Daniel R; Stingo, Francesco; Mohan, Radhe; Martel, Mary K; Briere, Tina M; Court, Laurence E
2016-11-01
We sought to investigate the ability of mid-treatment 18F-fluorodeoxyglucose positron emission tomography (PET) studies to objectively and spatially quantify esophageal injury in vivo from radiation therapy for non-small cell lung cancer. This retrospective study was approved by the local institutional review board, with written informed consent obtained before enrollment. We normalized 18F-fluorodeoxyglucose PET uptake to each patient's low-irradiated region (<5 Gy) of the esophagus, as a radiation response measure. Spatially localized metrics of normalized uptake (normalized standard uptake value [nSUV]) were derived for 79 patients undergoing concurrent chemoradiation therapy for non-small cell lung cancer. We used nSUV metrics to classify esophagitis grade at the time of the PET study, as well as maximum severity by treatment completion, according to National Cancer Institute Common Terminology Criteria for Adverse Events, using multivariate least absolute shrinkage and selection operator (LASSO) logistic regression and repeated 3-fold cross validation (training, validation, and test folds). This 3-fold cross-validation LASSO model procedure was used to predict toxicity progression from 43 asymptomatic patients during the PET study. Dose-volume metrics were also tested in both the multivariate classification and the symptom progression prediction analyses. Classification performance was quantified with the area under the curve (AUC) from receiver operating characteristic analysis on the test set from the 3-fold analyses. Statistical analysis showed increasing nSUV is related to esophagitis severity. Axial-averaged maximum nSUV for 1 esophageal slice and esophageal length with at least 40% of axial-averaged nSUV both had AUCs of 0.85 for classifying grade 2 or higher esophagitis at the time of the PET study and AUCs of 0.91 and 0.92, respectively, for maximum grade 2 or higher by treatment completion. Symptom progression was predicted with an AUC of 0.75.
Dose metrics performed poorly at classifying esophagitis (AUC of 0.52, grade 2 or higher mid treatment) or predicting symptom progression (AUC of 0.67). Normalized uptake can objectively, locally, and noninvasively quantify esophagitis during radiation therapy and predict eventual symptoms from asymptomatic patients. Normalized uptake may provide patient-specific dose-response information not discernible from dose. Copyright © 2016 Elsevier Inc. All rights reserved.
Characterization and machine learning prediction of allele-specific DNA methylation.
He, Jianlin; Sun, Ming-an; Wang, Zhong; Wang, Qianfei; Li, Qing; Xie, Hehuang
2015-12-01
A large collection of Single Nucleotide Polymorphisms (SNPs) has been identified in the human genome. Currently, the epigenetic influences of SNPs on their neighboring CpG sites remain elusive. A growing body of evidence suggests that locus-specific information, including genomic features and local epigenetic state, may play important roles in the epigenetic readout of SNPs. In this study, we made use of mouse methylomes with known SNPs to develop statistical models for the prediction of SNP associated allele-specific DNA methylation (ASM). ASM has been classified into parent-of-origin dependent ASM (P-ASM) and sequence-dependent ASM (S-ASM), which comprises scattered-S-ASM (sS-ASM) and clustered-S-ASM (cS-ASM). We found that P-ASM and cS-ASM CpG sites are both enriched in CpG rich regions, promoters and exons, while sS-ASM CpG sites are enriched in simple repeat and regions with high frequent SNP occurrence. Using Lasso-grouped Logistic Regression (LGLR), we selected 21 out of 282 genomic and methylation related features that are powerful in distinguishing cS-ASM CpG sites and trained the classifiers with machine learning techniques. Based on 5-fold cross-validation, the logistic regression classifier was found to be the best for cS-ASM prediction with an ACC of 0.77, an AUC of 0.84 and an MCC of 0.54. Lastly, we applied the logistic regression classifier on human brain methylome and predicted 608 genes associated with cS-ASM. Gene ontology term enrichment analysis indicated that these cS-ASM associated genes are significantly enriched in the category coding for transcripts with alternative splicing forms. In summary, this study provided an analytical procedure for cS-ASM prediction and shed new light on the understanding of different types of ASM events. Published by Elsevier Inc.
Yu, Wenbao; Park, Taesung
2014-01-01
It is common to seek an optimal combination of markers for disease classification and prediction when multiple markers are available. Many approaches based on the area under the receiver operating characteristic curve (AUC) have been proposed. Existing works based on AUC in a high-dimensional context depend mainly on a non-parametric, smooth approximation of AUC, with no work using a parametric AUC-based approach for high-dimensional data. We propose an AUC-based approach using penalized regression (AucPR), which is a parametric method used for obtaining a linear combination for maximizing the AUC. To obtain the AUC maximizer in a high-dimensional context, we transform a classical parametric AUC maximizer, which is used in a low-dimensional context, into a regression framework and thus apply the penalized regression approach directly. Two kinds of penalization, lasso and elastic net, are considered. The parametric approach can avoid some of the difficulties of a conventional non-parametric AUC-based approach, such as the lack of an appropriate concave objective function and a prudent choice of the smoothing parameter. We apply the proposed AucPR for gene selection and classification using four real microarray data sets and synthetic data. Through numerical studies, AucPR is shown to perform better than the penalized logistic regression and the nonparametric AUC-based method, in the sense of AUC and sensitivity for a given specificity, particularly when there are many correlated genes. We propose a powerful parametric and easily-implementable linear classifier, AucPR, for gene selection and disease prediction for high-dimensional data. AucPR is recommended for its good prediction performance. Besides gene expression microarray data, AucPR can be applied to other types of high-dimensional omics data, such as miRNA and protein data.
Intra- and inter-rater reliability of digital image analysis for skin color measurement
Sommers, Marilyn; Beacham, Barbara; Baker, Rachel; Fargo, Jamison
2013-01-01
Background We determined the intra- and inter-rater reliability of data from digital image color analysis between an expert and novice analyst. Methods Following training, the expert and novice independently analyzed 210 randomly ordered images. Both analysts used Adobe® Photoshop lasso or color sampler tools based on the type of image file. After color correction with Pictocolor® in camera software, they recorded L*a*b* (L*=light/dark; a*=red/green; b*=yellow/blue) color values for all skin sites. We computed intra-rater and inter-rater agreement within anatomical region, color value (L*, a*, b*), and technique (lasso, color sampler) using a series of one-way intra-class correlation coefficients (ICCs). Results Results of ICCs for intra-rater agreement showed high levels of internal consistency reliability within each rater for the lasso technique (ICC ≥ 0.99) and somewhat lower, yet acceptable, level of agreement for the color sampler technique (ICC = 0.91 for expert, ICC = 0.81 for novice). Skin L*, skin b*, and labia L* values reached the highest level of agreement (ICC ≥ 0.92) and skin a*, labia b*, and vaginal wall b* were the lowest (ICC ≥ 0.64). Conclusion Data from novice analysts can achieve high levels of agreement with data from expert analysts with training and the use of a detailed, standard protocol. PMID:23551208
Intra- and inter-rater reliability of digital image analysis for skin color measurement.
Sommers, Marilyn; Beacham, Barbara; Baker, Rachel; Fargo, Jamison
2013-11-01
We determined the intra- and inter-rater reliability of data from digital image color analysis between an expert and novice analyst. Following training, the expert and novice independently analyzed 210 randomly ordered images. Both analysts used Adobe® Photoshop lasso or color sampler tools based on the type of image file. After color correction with Pictocolor® in camera software, they recorded L*a*b* (L*=light/dark; a*=red/green; b*=yellow/blue) color values for all skin sites. We computed intra-rater and inter-rater agreement within anatomical region, color value (L*, a*, b*), and technique (lasso, color sampler) using a series of one-way intra-class correlation coefficients (ICCs). Results of ICCs for intra-rater agreement showed high levels of internal consistency reliability within each rater for the lasso technique (ICC ≥ 0.99) and somewhat lower, yet acceptable, level of agreement for the color sampler technique (ICC = 0.91 for expert, ICC = 0.81 for novice). Skin L*, skin b*, and labia L* values reached the highest level of agreement (ICC ≥ 0.92) and skin a*, labia b*, and vaginal wall b* were the lowest (ICC ≥ 0.64). Data from novice analysts can achieve high levels of agreement with data from expert analysts with training and the use of a detailed, standard protocol. © 2013 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
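As an illustration of the agreement statistic named in the abstract above, the following sketch computes a one-way random-effects intra-class correlation coefficient from the between-target and within-target mean squares. The abstract does not say which one-way form was used (single-measure or average-measure), so the single-measure form, often written ICC(1,1), is an assumption here, and `icc_oneway` is a hypothetical helper name.

```python
def icc_oneway(ratings):
    """Single-measure one-way random-effects ICC (assumed form: ICC(1,1)).

    ratings: list of n targets, each a list of k measurements
    (for example, the same image scored by k raters or on k occasions).
    Raises ZeroDivisionError if every measurement is identical.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    target_means = [sum(row) / k for row in ratings]
    # Between-target mean square: spread of the target means
    ms_between = k * sum((m - grand_mean) ** 2 for m in target_means) / (n - 1)
    # Within-target mean square: disagreement among measurements of one target
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, target_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

Values close to 1 indicate that almost all variation is between targets rather than between repeated measurements of the same target, which is the sense in which the abstract reports ICC ≥ 0.99 as high agreement.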
Beyond Flint: National Trends in Drinking Water Quality Violations
NASA Astrophysics Data System (ADS)
Allaire, M.; Wu, H.; Lall, U.
2016-12-01
Ensuring safe water supply for communities across the U.S. represents an emerging challenge. Aging infrastructure, impaired source water, and strained community finances may increase vulnerability of water systems to quality violations. In the aftermath of Flint, there is a great need to assess the current state of U.S. drinking water quality. How widespread are violations? What are the spatial and temporal patterns in water quality? Which types of communities and systems are most vulnerable? This is the first national assessment of trends in drinking water quality violations across several decades. In 2015, 9% of community water systems violated health-related water quality standards. These non-compliant systems served nearly 23 million people. Thus, the challenge of providing safe drinking water extends beyond Flint and represents a nationwide concern. We use a panel dataset that includes every community water system in the United States from 1981 to 2010 to identify factors that lead to regulatory noncompliance. This study focuses on health-related violations of the Safe Drinking Water Act. Lasso regression informed selection of appropriate covariates, while logistic regressions modeled the probability of noncompliance. We find that compliance is positively associated with private ownership, purchased water supply, and greater household income. Yet, greater concentration of utility ownership and violations in prior years are associated with a higher likelihood of violation. The results suggest that purchased water contracts, which are growing among small utilities, could serve as a way to improve regulatory compliance in the future. However, persistence of violations and ownership concentration deserve attention from policymakers. Already, the EPA has begun to prioritize enforcement of persistent violators. Overall, as the revitalization of U.S. water infrastructure becomes a growing priority area, results of this study are intended to inform investment and policy.
Type 2 diabetes in Vietnam: a cross-sectional, prevalence-based cost-of-illness study.
Le, Nguyen Tu Dang; Dinh Pham, Luyen; Quang Vo, Trung
2017-01-01
According to the International Diabetes Federation, total global health care expenditures for diabetes tripled between 2003 and 2013 because of increases in the number of people with diabetes as well as in the average expenditures per patient. This study aims to provide accurate and timely information about the economic impacts of type 2 diabetes mellitus (T2DM) in Vietnam. The cost-of-illness estimates followed a prospective, prevalence-based approach from the societal perspective of T2DM with 392 selected diabetic patients who received treatment from a public hospital in Ho Chi Minh City, Vietnam, during the 2016 fiscal year. In this study, the annual cost per patient estimate was US $246.10 (95% CI 228.3, 267.2) for 392 patients, which accounted for about 12% (95% CI 11, 13) of the gross domestic product per capita in 2017. That includes US $127.30, US $34.40 and US $84.40 for direct medical costs, direct nonmedical expenditures, and indirect costs, respectively. The cost of pharmaceuticals accounted for the bulk of total expenditures in our study (27.5% of total costs and 53.2% of direct medical costs). A bootstrap analysis showed that female patients had a higher cost of treatment than men at US $48.90 (95% CI 3.1, 95.0); those who received insulin and oral antidiabetics (OAD) also had a statistically significant higher cost of treatment compared to those receiving OAD, US $445.90 (95% CI 181.2, 690.6). The Gradient Boosting Regression (Ensemble method) and Lasso Regression (Generalized Linear Models) were determined to be the best models to predict the cost of T2DM (R² = 65.3, mean square error [MSE] = 0.94; and R² = 64.75, MSE = 0.96, respectively). The findings of this study serve as a reference for policy decision making in diabetes management as well as adjustment of costs for patients in order to reduce the economic impact of the disease.
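The abstract above reports bootstrap confidence intervals for cost comparisons between patient groups but does not describe the procedure. The sketch below shows one common variant, a percentile bootstrap for a difference in group means, purely as an assumption about what such an analysis can look like; the function name `bootstrap_mean_diff_ci`, the number of resamples, and the percentile method are illustrative choices, not details from the study.

```python
import random


def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) (assumed method).

    a, b: lists of per-patient costs for the two groups being compared.
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(
            sum(resample_a) / len(resample_a) - sum(resample_b) / len(resample_b)
        )
    diffs.sort()
    lower_idx = int(round((alpha / 2) * n_boot))
    upper_idx = min(n_boot - 1, int(round((1 - alpha / 2) * n_boot)) - 1)
    return diffs[lower_idx], diffs[upper_idx]
```

An interval that excludes zero, such as the (3.1, 95.0) interval quoted in the abstract, is the usual basis for calling the difference between groups statistically significant at the chosen level.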
Allele frequency changes due to hitch-hiking in genomic selection programs
2014-01-01
Background Genomic selection makes it possible to reduce pedigree-based inbreeding over best linear unbiased prediction (BLUP) by increasing emphasis on own rather than family information. However, pedigree inbreeding might not accurately reflect loss of genetic variation and the true level of inbreeding due to changes in allele frequencies and hitch-hiking. This study aimed at understanding the impact of using long-term genomic selection on changes in allele frequencies, genetic variation and level of inbreeding. Methods Selection was performed in simulated scenarios with a population of 400 animals for 25 consecutive generations. Six genetic models were considered with different heritabilities and numbers of QTL (quantitative trait loci) affecting the trait. Four selection criteria were used, including selection on own phenotype and on estimated breeding values (EBV) derived using phenotype-BLUP, genomic BLUP and Bayesian Lasso. Changes in allele frequencies at QTL, markers and linked neutral loci were investigated for the different selection criteria and different scenarios, along with the loss of favourable alleles and the rate of inbreeding measured by pedigree and runs of homozygosity. Results For each selection criterion, hitch-hiking in the vicinity of the QTL appeared more extensive when accuracy of selection was higher and the number of QTL was lower. When inbreeding was measured by pedigree information, selection on genomic BLUP EBV resulted in lower levels of inbreeding than selection on phenotype BLUP EBV, but this did not always apply when inbreeding was measured by runs of homozygosity. Compared to genomic BLUP, selection on EBV from Bayesian Lasso led to less genetic drift, reduced loss of favourable alleles and more effectively controlled the rate of both pedigree and genomic inbreeding in all simulated scenarios. 
In addition, selection on EBV from Bayesian Lasso showed a higher selection differential for mendelian sampling terms than selection on genomic BLUP EBV. Conclusions Neutral variation can be shaped to a great extent by the hitch-hiking effects associated with selection, rather than just by genetic drift. When implementing long-term genomic selection, strategies for genomic control of inbreeding are essential, due to a considerable hitch-hiking effect, regardless of the method that is used for prediction of EBV. PMID:24495634
NASA Astrophysics Data System (ADS)
Krishna, B.; Gustafson, W. I., Jr.; Vogelmann, A. M.; Toto, T.; Devarakonda, R.; Palanisamy, G.
2016-12-01
This paper presents a new way of providing ARM data discovery through data analysis and visualization services. ARM stands for Atmospheric Radiation Measurement. This Program was created to study cloud formation processes and their influence on radiative transfer and also include additional measurements of aerosol and precipitation at various highly instrumented ground and mobile stations. The total volume of ARM data is roughly 900TB. The current search for ARM data is performed by using its metadata, such as the site name, instrument name, date, etc. NoSQL technologies were explored to improve the capabilities of data searching, not only by their metadata, but also by using the measurement values. Two technologies that are currently being implemented for testing are Apache Cassandra (noSQL database) and Apache Spark (noSQL based analytics framework). Both of these technologies were developed to work in a distributed environment and hence can handle large data for storing and analytics. D3.js is a JavaScript library that can generate interactive data visualizations in web browsers by making use of commonly used SVG, HTML5, and CSS standards. To test the performance of NoSQL for ARM data, we will be using ARM's popular measurements to locate the data based on its value. Recently noSQL technology has been applied to a pilot project called LASSO, which stands for LES ARM Symbiotic Simulation and Observation Workflow. LASSO will be packaging LES output and observations in "data bundles" and analyses will require the ability for users to analyze both observations and LES model output either individually or together across multiple time periods. The LASSO implementation strategy suggests that enormous data storage is required to store the above mentioned quantities. Thus noSQL was used to provide a powerful means to store portions of the data that provided users with search capabilities on each simulation's traits through a web application. 
Based on the user selection, plots are created dynamically along with ancillary information that enables the user to locate and download data that fulfilled their required traits.
Yue, Yong; Osipov, Arsen; Fraass, Benedick; Sandler, Howard; Zhang, Xiao; Nissen, Nicholas; Hendifar, Andrew; Tuli, Richard
2017-02-01
To stratify risks of pancreatic adenocarcinoma (PA) patients using pre- and post-radiotherapy (RT) PET/CT images, and to assess the prognostic value of texture variations in predicting therapy response of patients. Twenty-six PA patients treated with RT from 2011-2013 with pre- and post-treatment 18F-FDG-PET/CT scans were identified. Tumor locoregional texture was calculated using 3D kernel-based approach, and texture variations were identified by fitting discrepancies of texture maps of pre- and post-treatment images. A total of 48 texture and clinical variables were identified and evaluated for association with overall survival (OS). The prognostic heterogeneity features were selected using lasso/elastic net regression, and further were evaluated by multivariate Cox analysis. Median age was 69 y (range, 46-86 y). The texture map and temporal variations between pre- and post-treatment were well characterized by histograms and statistical fitting. The lasso analysis identified seven predictors (age, node stage, post-RT SUVmax, variations of homogeneity, variance, sum mean, and cluster tendency). The multivariate Cox analysis identified five significant variables: age, node stage, variations of homogeneity, variance, and cluster tendency (with P=0.020, 0.040, 0.065, 0.078, and 0.081, respectively). The patients were stratified into two groups based on the risk score of multivariate analysis with log-rank P=0.001: a low risk group (n=11) with a longer mean OS (29.3 months) and higher texture variation (>30%), and a high risk group (n=15) with a shorter mean OS (17.7 months) and lower texture variation (<15%). Locoregional metabolic texture response provides a feasible approach for evaluating and predicting clinical outcomes following treatment of PA with RT. The proposed method can be used to stratify patient risk and help select appropriate treatment strategies for individual patients toward implementing response-driven adaptive RT.
Primon, Juliana F.
2017-01-01
Tapeworms of the genus Anindobothrium Marques, Brooks & Lasso, 2001 are found in both marine and Neotropical freshwater stingrays of the family Potamotrygonidae. The patterns of host association within the genus support the most recent hypothesis about the history of diversification of potamotrygonids, which suggests that the ancestor of freshwater lineages of the Potamotrygonidae colonized South American river systems through marine incursion events. Despite the relevance of the genus Anindobothrium to understand the history of colonization and diversification of potamotrygonids, no additional efforts were done to better investigate the phylogenetic relationship of this taxon with other lineages of cestodes since its erection. This study is a result of recent collecting efforts to sample members of the genus in marine and freshwater potamotrygonids that enabled the most extensive documentation of the fauna of Anindobothrium parasitizing species of Styracura de Carvalho, Loboda & da Silva, Potamotrygon schroederi Fernández-Yépez, P. orbignyi (Castelnau) and P. yepezi Castex & Castello from six different countries, representing the eastern Pacific Ocean, Caribbean Sea, and river basins in South America (Rio Negro, Orinoco, and Maracaibo). The newly collected material provided additional specimens for morphological studies and molecular samples for subsequent phylogenetic analyses that allowed us to address the phylogenetic position of Anindobothrium and provide molecular and morphological evidence to recognize two additional species for the genus. The taxonomic actions that followed our analyses included the proposition of a new family, Anindobothriidae fam. n., to accommodate the genus Anindobothrium in the order Rhinebothriidea Healy, Caira, Jensen, Webster & Littlewood, 2009 and the description of two new species—one from the eastern Pacific Ocean, A. carrioni sp. n., and the other from the Caribbean Sea, A. inexpectatum sp. n. 
In addition, we also present a redescription of the type species of the genus, A. anacolum (Brooks, 1977) Marques, Brooks & Lasso, 2001, and of A. lisae Marques, Brooks & Lasso, 2001. Finally, we discuss the paleogeographical events mostly linked with the diversification of the genus and the protocols adopted to uncover cryptic diversity in Anindobothrium. PMID:28953933
Zhai, Binxu; Chen, Jianguo
2018-04-18
A stacked ensemble model is developed for forecasting and analyzing the daily average concentrations of fine particulate matter (PM2.5) in Beijing, China. Special feature extraction procedures, including those of simplification, polynomial, transformation and combination, are conducted before modeling to identify potentially significant features based on an exploratory data analysis. Stability feature selection and tree-based feature selection methods are applied to select important variables and evaluate the degrees of feature importance. Single models including LASSO, Adaboost, XGBoost and multi-layer perceptron optimized by the genetic algorithm (GA-MLP) are established in the level 0 space and are then integrated by support vector regression (SVR) in the level 1 space via stacked generalization. A feature importance analysis reveals that nitrogen dioxide (NO2) and carbon monoxide (CO) concentrations measured from the city of Zhangjiakou are taken as the most important elements of pollution factors for forecasting PM2.5 concentrations. Local extreme wind speeds and maximal wind speeds are considered to extend the most effects of meteorological factors to the cross-regional transportation of contaminants. Pollutants found in the cities of Zhangjiakou and Chengde have a stronger impact on air quality in Beijing than other surrounding factors. Our model evaluation shows that the ensemble model generally performs better than a single nonlinear forecasting model when applied to new data with a coefficient of determination (R²) of 0.90 and a root mean squared error (RMSE) of 23.69 μg/m³. For single pollutant grade recognition, the proposed model performs better when applied to days characterized by good air quality than when applied to days registering high levels of pollution. The overall classification accuracy level is 73.93%, with most misclassifications made among adjacent categories.
The results demonstrate the interpretability and generalizability of the stacked ensemble model. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.
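The two regression metrics quoted in this abstract can be computed as in the following minimal sketch. The function names are illustrative and not taken from the paper; the formulas are the standard definitions of RMSE and the coefficient of determination.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A forecast that always predicts the mean of the observations gets a coefficient of determination of 0, which is the baseline against which a value such as 0.90 is read.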
Application of L1/2 regularization logistic method in heart disease diagnosis.
Zhang, Bowen; Chai, Hua; Yang, Ziyi; Liang, Yong; Chu, Gejin; Liu, Xiaoying
2014-01-01
Heart disease has become the leading threat to human health, and its diagnosis depends on many features, such as age, blood pressure, heart rate and dozens of other physiological indicators. Although there are so many risk factors, doctors usually diagnose the disease depending on their intuition and experience, which requires a lot of knowledge and experience for correct determination. Finding the hidden medical information in the existing clinical data is a noticeable and powerful approach in the study of heart disease diagnosis. In this paper, a sparse logistic regression method is introduced to detect the key risk factors using L(1/2) regularization on real heart disease data. Experimental results show that the sparse logistic L(1/2) regularization method achieves fewer but informative key features than the Lasso, SCAD, MCP and Elastic net regularization approaches. At the same time, the proposed method can cut down the computational complexity, save the cost and time of medical tests and checkups, and reduce the number of attributes that need to be collected from patients.
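For orientation, an L(1/2)-penalized logistic regression is commonly written as the objective below. This is the generic form of the penalty, not necessarily the exact parameterization used in the paper; the symbols n (samples), d (features), π_i (predicted probability) and λ (penalty weight) are assumptions introduced here.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log \pi_i(\beta) + (1-y_i)\log\big(1-\pi_i(\beta)\big) \Big]
  + \lambda \sum_{j=1}^{d} |\beta_j|^{1/2},
\qquad
\pi_i(\beta) = \frac{1}{1+e^{-x_i^{\top}\beta}}
```

Replacing the exponent 1/2 with 1 recovers the Lasso penalty. The smaller exponent penalizes small nonzero coefficients more heavily relative to large ones, which is consistent with the abstract's report of fewer selected features.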
Cox-nnet: An artificial neural network method for prognosis prediction of high-throughput omics data
Ching, Travers; Zhu, Xun
2018-01-01
Artificial neural networks (ANN) are computing architectures with many interconnections of simple neural-inspired computing elements, and have been applied to biomedical fields such as imaging analysis and diagnosis. We have developed a new ANN framework called Cox-nnet to predict patient prognosis from high throughput transcriptomics data. In 10 TCGA RNA-Seq data sets, Cox-nnet achieves the same or better predictive accuracy compared to other methods, including Cox-proportional hazards regression (with LASSO, ridge, and minimax concave penalty), Random Forests Survival and CoxBoost. Cox-nnet also reveals richer biological information, at both the pathway and gene levels. The outputs from the hidden layer node provide an alternative approach for survival-sensitive dimension reduction. In summary, we have developed a new method for accurate and efficient prognosis prediction on high throughput data, with functional biological insights. The source code is freely available at https://github.com/lanagarmire/cox-nnet. PMID:29634719
Huo, Zhiguang; Tseng, George
2017-01-01
Cancer subtypes discovery is the first step to deliver personalized medicine to cancer patients. With the accumulation of massive multi-level omics datasets and established biological knowledge databases, omics data integration with incorporation of rich existing biological knowledge is essential for deciphering a biological mechanism behind the complex diseases. In this manuscript, we propose an integrative sparse K-means (is-K means) approach to discover disease subtypes with the guidance of prior biological knowledge via sparse overlapping group lasso. An algorithm using an alternating direction method of multiplier (ADMM) will be applied for fast optimization. Simulation and three real applications in breast cancer and leukemia will be used to compare is-K means with existing methods and demonstrate its superior clustering accuracy, feature selection, functional annotation of detected molecular features and computing efficiency. PMID:28959370
Predicting Solar Activity Using Machine-Learning Methods
NASA Astrophysics Data System (ADS)
Bobra, M.
2017-12-01
Of all the activity observed on the Sun, two of the most energetic events are flares and coronal mass ejections. However, we do not, as of yet, fully understand the physical mechanism that triggers solar eruptions. A machine-learning algorithm, which is favorable in cases where the amount of data is large, is one way to [1] empirically determine the signatures of this mechanism in solar image data and [2] use them to predict solar activity. In this talk, we discuss the application of various machine learning algorithms - specifically, a Support Vector Machine, a sparse linear regression (Lasso), and Convolutional Neural Network - to image data from the photosphere, chromosphere, transition region, and corona taken by instruments aboard the Solar Dynamics Observatory in order to predict solar activity on a variety of time scales. Such an approach may be useful since, at the present time, there are no physical models of flares available for real-time prediction. We discuss our results (Bobra and Couvidat, 2015; Bobra and Ilonidis, 2016; Jonas et al., 2017) as well as other attempts to predict flares using machine-learning (e.g. Ahmed et al., 2013; Nishizuka et al. 2017) and compare these results with the more traditional techniques used by the NOAA Space Weather Prediction Center (Crown, 2012). We also discuss some of the challenges in using machine-learning algorithms for space science applications.
Zhang, Lei; Masetti, Giulia; Colucci, Giuseppe; Salvi, Mario; Covelli, Danila; Eckstein, Anja; Kaiser, Ulrike; Draman, Mohd Shazli; Muller, Ilaria; Ludgate, Marian; Lucini, Luigi; Biscarini, Filippo
2018-05-30
Graves' Disease (GD) is an autoimmune condition in which thyroid-stimulating antibodies (TRAB) mimic thyroid-stimulating hormone function causing hyperthyroidism. 5% of GD patients develop inflammatory Graves' orbitopathy (GO) characterized by proptosis and attendant sight problems. A major challenge is to identify which GD patients are most likely to develop GO, a task that has so far relied on TRAB measurement. We screened sera/plasma from 14 GD, 19 GO and 13 healthy controls using high-throughput proteomics and miRNA sequencing (Illumina's HiSeq2000 and Agilent-6550 Funnel quadrupole-time-of-flight mass spectrometry) to identify potential biomarkers for diagnosis or prognosis evaluation. Euclidean distances and differential expression (DE) based on miRNA and protein quantification were analysed by multidimensional scaling (MDS) and multinomial regression respectively. We detected 3025 miRNAs and 1886 proteins and MDS revealed good separation of the 3 groups. Biomarkers were identified by combined DE and Lasso-penalized predictive models; accuracy of predictions was 0.86 (±0.18), and 5 miRNA and 20 proteins were found including Zonulin, Alpha-2 macroglobulin, Beta-2 glycoprotein 1 and Fibronectin. Functional analysis identified relevant metabolic pathways, including hippo signaling, bacterial invasion of epithelial cells and mRNA surveillance. Proteomic and miRNA analyses, combined with robust bioinformatics, identified circulating biomarkers applicable to diagnose GD, predict GO disease status and optimize patient management.
SU-F-R-24: Identifying Prognostic Imaging Biomarkers in Early Stage Lung Cancer Using Radiomics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zeng, X; Wu, J; Cui, Y
2016-06-15
Purpose: Patients diagnosed with early stage lung cancer have favorable outcomes when treated with surgery or stereotactic radiotherapy. However, a significant proportion (∼20%) of patients will develop metastatic disease and eventually die of the disease. The purpose of this work is to identify quantitative imaging biomarkers from CT for predicting overall survival in early stage lung cancer. Methods: In this institutional review board-approved HIPAA-compliant retrospective study, we analyzed the diagnostic CT scans of 110 patients with early stage lung cancer. Data from 70 patients were used for training/discovery purposes, while those of the remaining 40 patients were used for independent validation. We extracted 191 radiomic features, including statistical, histogram, morphological, and texture features. Cox proportional hazard regression model, coupled with the least absolute shrinkage and selection operator (LASSO), was used to predict overall survival based on the radiomic features. Results: The optimal prognostic model included three image features from the Law’s feature and wavelet texture. In the discovery cohort, this model achieved a concordance index or CI=0.67, and it separated the low-risk from high-risk groups in predicting overall survival (hazard ratio=2.72, log-rank p=0.007). In the independent validation cohort, this radiomic signature achieved a CI=0.62, and significantly stratified the low-risk and high-risk groups in terms of overall survival (hazard ratio=2.20, log-rank p=0.042). Conclusion: We identified CT imaging characteristics associated with overall survival in early stage lung cancer. If prospectively validated, this could potentially help identify high-risk patients who might benefit from adjuvant systemic therapy.
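The concordance index reported above measures how often a model ranks pairs of patients in the right order of risk. The following is a minimal sketch of Harrell's concordance index for right-censored data; the function name and argument layout are illustrative, and the study itself may have used a different implementation or tie-handling rule.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    times: observed follow-up times
    events: 1 if the event was observed, 0 if the subject was censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported values of 0.67 (discovery) and 0.62 (validation) in context.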
Prediction of treatment outcomes to exercise in patients with nonremitted major depressive disorder.
Rethorst, Chad D; South, Charles C; Rush, A John; Greer, Tracy L; Trivedi, Madhukar H
2017-12-01
Only one-third of patients with major depressive disorder (MDD) achieve remission with initial treatment. Consequently, current clinical practice relies on a "trial-and-error" approach to identify an effective treatment for each patient. The purpose of this report was to determine whether we could identify a set of clinical and biological parameters with potential clinical utility for prescription of exercise for treatment of MDD in a secondary analysis of the Treatment with Exercise Augmentation in Depression (TREAD) trial. Participants with nonremitted MDD were randomized to one of two exercise doses for 12 weeks. Participants were categorized as "remitters" (≤12 on the IDS-C), nonresponders (<30% drop in IDS-C), or neither. The least absolute shrinkage and selection operator (LASSO) and random forests were used to evaluate 30 variables as predictors of both remission and nonresponse. Predictors were used to model treatment outcomes using logistic regression. Of the 122 participants, 36 were categorized as remitters (29.5%), 56 as nonresponders (45.9%), and 30 as neither (24.6%). Predictors of remission were higher levels of brain-derived neurotrophic factor (BDNF) and IL-1B, greater depressive symptom severity, and higher postexercise positive affect. Predictors of treatment nonresponse were low cardiorespiratory fitness, lower levels of IL-6 and BDNF, and lower postexercise positive affect. Models including these predictors resulted in predictive values greater than 70% (true predicted remitters/all predicted remitters) with specificities greater than 25% (true predicted remitters/all remitters). Results indicate feasibility in identifying patients who will either remit or not respond to exercise as a treatment for MDD utilizing a clinical decision model that incorporates multiple patient characteristics. © 2017 Wiley Periodicals, Inc.
Technical report on semiautomatic segmentation using the Adobe Photoshop.
Park, Jin Seo; Chung, Min Suk; Hwang, Sung Bae; Lee, Yong Sook; Har, Dong-Hwan
2005-12-01
The purpose of this research is to enable users to semiautomatically segment the anatomical structures in magnetic resonance images (MRIs), computerized tomographs (CTs), and other medical images on a personal computer. The segmented images are used for making 3D images, which are helpful to medical education and research. To achieve this purpose, the following trials were performed. The entire body of a volunteer was scanned to make 557 MRIs. On Adobe Photoshop, contours of 19 anatomical structures in the MRIs were semiautomatically drawn using MAGNETIC LASSO TOOL and manually corrected using either LASSO TOOL or DIRECT SELECTION TOOL to make 557 segmented images. In a similar manner, 13 anatomical structures in 8,590 anatomical images were segmented. Proper segmentation was verified by making 3D images from the segmented images. Semiautomatic segmentation using Adobe Photoshop is expected to be widely used for segmentation of anatomical structures in various medical images.
Janet, Jon Paul; Kulik, Heather J
2017-11-22
Machine learning (ML) of quantum mechanical properties shows promise for accelerating chemical discovery. For transition metal chemistry where accurate calculations are computationally costly and available training data sets are small, the molecular representation becomes a critical ingredient in ML model predictive accuracy. We introduce a series of revised autocorrelation functions (RACs) that encode relationships of the heuristic atomic properties (e.g., size, connectivity, and electronegativity) on a molecular graph. We alter the starting point, scope, and nature of the quantities evaluated in standard ACs to make these RACs amenable to inorganic chemistry. On an organic molecule set, we first demonstrate superior standard AC performance to other presently available topological descriptors for ML model training, with mean unsigned errors (MUEs) for atomization energies on set-aside test molecules as low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs on set-aside test molecules in spin-state splitting in comparison to 15-20× higher errors for feature sets that encode whole-molecule structural information. Systematic feature selection methods including univariate filtering, recursive feature elimination, and direct optimization (e.g., random forest and LASSO) are compared. Random-forest- or LASSO-selected subsets 4-5× smaller than the full RAC set produce sub- to 1 kcal/mol spin-splitting MUEs, with good transferability to metal-ligand bond length prediction (0.004-5 Å MUE) and redox potential on a smaller data set (0.2-0.3 eV MUE). Evaluation of feature selection results across property sets reveals the relative importance of local, electronic descriptors (e.g., electronegativity, atomic number) in spin-splitting and distal, steric effects in redox potential and bond lengths.
Terminal spacecraft rendezvous and capture with LASSO model predictive control
NASA Astrophysics Data System (ADS)
Hartley, Edward N.; Gallieri, Marco; Maciejowski, Jan M.
2013-11-01
The recently investigated ℓasso model predictive control (MPC) is applied to the terminal phase of a spacecraft rendezvous and capture mission. The interaction between the cost function and the treatment of minimum impulse bit is also investigated. The propellant consumption with ℓasso MPC for the considered scenario is noticeably less than with a conventional quadratic cost and control actions are sparser in time. Propellant consumption and sparsity are competitive with those achieved using a zone-based ℓ1 cost function, whilst requiring fewer decision variables in the optimisation problem than the latter. The ℓasso MPC is demonstrated to meet tighter specifications on control precision and also avoids the risk of undesirable behaviours often associated with pure ℓ1 stage costs.
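As a rough illustration of the kind of cost the abstract contrasts with purely quadratic and purely ℓ1 costs, a quadratic stage cost with an added ℓ1 penalty on the control input can be written as below. This is a generic form under assumed notation (horizon N, weights Q, R, P, penalty weight λ); the paper's exact cost, constraints, and treatment of the minimum impulse bit are not given in the abstract and may differ.

```latex
J = \sum_{k=0}^{N-1} \left( x_k^{\top} Q x_k + u_k^{\top} R u_k + \lambda \lVert u_k \rVert_1 \right) + x_N^{\top} P x_N
```

Setting λ = 0 gives a conventional quadratic cost. The ℓ1 term is what encourages control actions that are exactly zero at many time steps, in line with the sparser thruster firings the abstract reports.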
Wu, Mon-Ju; Passos, Ives Cavalcante; Bauer, Isabelle E; Lavagnino, Luca; Cao, Bo; Zunta-Soares, Giovana B; Kapczinski, Flávio; Mwangi, Benson; Soares, Jair C
2016-03-01
Previous studies have reported that patients with bipolar disorder (BD) present with cognitive impairments during mood episodes as well as in the euthymic phase. However, it is still unknown whether reported neurocognitive abnormalities can objectively identify individual BD patients from healthy controls (HC). A total of 21 euthymic BD patients and 21 demographically matched HC were included in the current study. Participants performed the computerized Cambridge Neurocognitive Test Automated Battery (CANTAB) to assess cognitive performance. The least absolute shrinkage selection operator (LASSO) machine learning algorithm was implemented to identify neurocognitive signatures to distinguish individual BD patients from HC. The LASSO machine learning algorithm identified individual BD patients from HC with an accuracy of 71%, area under receiver operating characteristic curve of 0.7143 and significant at p=0.0053. The LASSO algorithm assigned individual subjects a probability score (0-healthy, 1-patient). Patients with rapid cycling (RC) were assigned increased probability scores as compared to patients without RC. A multivariate pattern of neurocognitive abnormalities comprising the affective Go/No-go and the Cambridge gambling task was relevant in distinguishing individual patients from HC. Our study sample was small as we only considered euthymic BD patients and demographically matched HC. Neurocognitive abnormalities can distinguish individual euthymic BD patients from HC with relatively high accuracy. In addition, patients with RC had more cognitive impairments compared to patients without RC. The predictive neurocognitive signature identified in the current study can potentially be used to provide individualized clinical inferences on BD patients. Copyright © 2015 Elsevier B.V. All rights reserved.
Efficient differentially private learning improves drug sensitivity prediction.
Honkela, Antti; Das, Mrinal; Nieminen, Arttu; Dikmen, Onur; Kaski, Samuel
2018-02-06
Users of a personalised recommendation system face a dilemma: recommendations can be improved by learning from data, but only if other users are willing to share their private information. Good personalised predictions are vitally important in precision medicine, but genomic information on which the predictions are based is also particularly sensitive, as it directly identifies the patients and hence cannot easily be anonymised. Differential privacy has emerged as a potentially promising solution: privacy is considered sufficient if presence of individual patients cannot be distinguished. However, differentially private learning with current methods does not improve predictions with feasible data sizes and dimensionalities. We show that useful predictors can be learned under powerful differential privacy guarantees, and even from moderately-sized data sets, by demonstrating significant improvements in the accuracy of private drug sensitivity prediction with a new robust private regression method. Our method matches the predictive accuracy of the state-of-the-art non-private lasso regression using only 4x more samples under relatively strong differential privacy guarantees. Good performance with limited data is achieved by limiting the sharing of private information by decreasing the dimensionality and by projecting outliers to fit tighter bounds, therefore needing to add less noise for equal privacy. The proposed differentially private regression method combines theoretical appeal and asymptotic efficiency with good prediction accuracy even with moderate-sized data. As already the simple-to-implement method shows promise on the challenging genomic data, we anticipate rapid progress towards practical applications in many fields. This article was reviewed by Zoltan Gaspari and David Kreil.
Identifying individual sleep apnea/hypoapnea epochs using smartphone-based pulse oximetry.
Garde, Ainara; Dekhordi, Parastoo; Ansermino, J Mark; Dumont, Guy A
2016-08-01
Sleep apnea, characterized by frequent pauses in breathing during sleep, poses a serious threat to the healthy growth and development of children. Polysomnography (PSG), the gold standard for sleep apnea diagnosis, is resource intensive and confined to sleep laboratories, thus reducing its accessibility. Pulse oximetry alone, providing blood oxygen saturation (SpO2) and blood volume changes in tissue (PPG), has the potential to identify children with sleep apnea. Thus, we aim to develop a tool for at-home sleep apnea screening that provides a detailed and automated 30 sec epoch-by-epoch sleep apnea analysis. We propose to extract features characterizing pulse oximetry (SpO2 and pulse rate variability [PRV], a surrogate measure of heart rate variability) to create a multivariate logistic regression model that identifies epochs containing apnea/hypoapnea events. Overnight pulse oximetry was collected using a smartphone-based pulse oximeter, simultaneously with standard PSG from 160 children at the British Columbia Children's hospital. The sleep technician manually scored all apnea/hypoapnea events during the PSG study. Based on these scores we labeled each epoch as containing or not containing apnea/hypoapnea. We randomly divided the subjects into training data (40%), used to develop the model applying the LASSO method, and testing data (60%), used to validate the model. The developed model was assessed epoch-by-epoch for each subject. The test dataset had a median area under the receiver operating characteristic (ROC) curve of 81%; the model provided a median accuracy of 74%, sensitivity of 75%, and specificity of 73% when using a risk threshold similar to the percentage of apnea/hypopnea epochs. Thus, providing a detailed epoch-by-epoch analysis with at-home pulse oximetry alone is feasible with accuracy, sensitivity and specificity values above 73%. However, the performance might decrease when analyzing subjects with a low number of apnea/hypoapnea events.
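The epoch-level accuracy, sensitivity and specificity quoted above follow from thresholding the model's predicted probabilities. A minimal sketch is given below; the function name is hypothetical, and choosing the threshold as the fraction of positive epochs is only one reading of the abstract's phrase "a risk threshold similar to the percentage of apnea/hypopnea epochs".

```python
def screening_metrics(labels, probabilities, threshold):
    """Accuracy, sensitivity and specificity of thresholded predictions.

    labels: 1 for an epoch containing an event, 0 otherwise
    probabilities: predicted event probability for each epoch
    threshold: an epoch is called positive when probability >= threshold
    """
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of true events detected
        "specificity": tn / (tn + fp),  # share of event-free epochs cleared
    }


def prevalence_threshold(labels):
    """Assumed threshold choice: the fraction of epochs that contain an event."""
    return sum(labels) / len(labels)
```

Lowering the threshold raises sensitivity at the cost of specificity, so tying it to the event prevalence is a way of balancing the two when events are rare.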
Change in BMI accurately predicted by social exposure to acquaintances.
Oloritun, Rahman O; Ouarda, Taha B M J; Moturu, Sai; Madan, Anmol; Pentland, Alex Sandy; Khayal, Inas
2013-01-01
Research has mostly focused on obesity and not on processes of BMI change more generally, although these may be key factors that lead to obesity. Studies have suggested that obesity is affected by social ties. However, these studies used survey-based data collection techniques that may be biased toward selecting only close friends and relatives. In this study, mobile phone sensing techniques were used to routinely capture social interaction data in an undergraduate dorm. By automating the capture of social interaction data, the limitations of self-reported social exposure data are avoided. This study attempts to understand and develop a model that best describes the change in BMI using social interaction data. We evaluated a cohort of 42 college students in a co-located university dorm, using social interaction data captured automatically via mobile phones together with survey-based health-related information. We determined the most predictive variables for change in BMI using the least absolute shrinkage and selection operator (LASSO) method. The selected variables, with gender, healthy diet category, and ability to manage stress, were used to build multiple linear regression models that estimate the effect of exposure and individual factors on change in BMI. We identified the best model using Akaike Information Criterion (AIC) and R². This study found a model that explains 68% (p<0.0001) of the variation in change in BMI. The model combined social interaction data, especially from acquaintances, and personal health-related information to explain change in BMI. This is the first study taking into account both interactions with different levels of social interaction and personal health-related information. Social interactions with acquaintances accounted for more than half the variation in change in BMI. This suggests the importance of not only individual health information but also the significance of social interactions with people we are exposed to, even people we may not consider as close friends.
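Model comparison by AIC, as described above, can be sketched for ordinary least-squares fits as follows. The formula is the common Gaussian form up to an additive constant; the function name is illustrative, and conventions differ on whether the error variance is counted among the parameters, so only differences between models fitted to the same data are meaningful.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC for a least-squares fit, up to an additive constant.

    rss: residual sum of squares of the fitted model
    n_obs: number of observations
    n_params: number of estimated coefficients (including the intercept)
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params
```

Between two candidate regressions on the same cohort, the one with the lower value is preferred: an extra predictor is only worth keeping if it reduces the residual sum of squares enough to offset the penalty of 2 per parameter.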
REMOVAL OF ALACHLOR FROM DRINKING WATER
Alachlor (Lasso) is a pre-emergent herbicide used in the production of corn and soybeans. U.S. EPA has studied control of alachlor in drinking water treatment processes to define treatability before setting maximum contaminant levels and to assist water utilities in selecting con...
Detection of Alzheimer's disease using group lasso SVM-based region selection
NASA Astrophysics Data System (ADS)
Sun, Zhuo; Fan, Yong; Lelieveldt, Boudewijn P. F.; van de Giessen, Martijn
2015-03-01
Alzheimer's disease (AD) is one of the most frequent forms of dementia and an increasingly challenging public health problem. In the last two decades, structural magnetic resonance imaging (MRI) has shown potential in distinguishing patients with Alzheimer's disease and elderly controls (CN). To obtain AD-specific biomarkers, previous research used either statistical testing to find statistically significant different regions between the two clinical groups, or l1 sparse learning to select isolated features in the image domain. In this paper, we propose a new framework that uses structural MRI to simultaneously distinguish the two clinical groups and find the bio-markers of AD, using a group lasso support vector machine (SVM). The group lasso term (mixed l1-l2 norm) introduces anatomical information from the image domain into the feature domain, such that the resulting set of selected voxels is more meaningful than that of the l1 sparse SVM. Because of large inter-structure size variation, we introduce a group specific normalization factor to deal with the structure size bias. Experiments have been performed on a well-designed AD vs. CN dataset to validate our method. Compared to the l1 sparse SVM approach, our method achieved better classification performance and a more meaningful biomarker selection. When we vary the training set, the regions selected by our method were more stable than those of the l1 sparse SVM. Classification experiments showed that our group normalization led to higher classification accuracy with fewer selected regions than the non-normalized method. Compared to the state-of-the-art AD vs. CN classification methods, our approach not only obtains a high accuracy with the same dataset, but more importantly, we simultaneously find the brain anatomies that are closely related to the disease.
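The mixed l1-l2 norm mentioned above sums the Euclidean norms of the weight vectors of each group (here, each anatomical structure). The sketch below shows the penalty with an optional per-group scaling by the square root of the group size. That scaling is the common convention for group lasso and is an assumption here; the abstract does not specify the paper's own normalization factor.

```python
import math


def group_lasso_penalty(weights, groups, normalize=True):
    """Mixed l1-l2 norm: sum over groups of the (scaled) Euclidean norm.

    weights: flat list of model weights, one per voxel/feature
    groups: list of index lists, one per anatomical structure
    normalize: scale each group's norm by sqrt(group size) (assumed convention)
    """
    total = 0.0
    for indices in groups:
        norm = math.sqrt(sum(weights[i] ** 2 for i in indices))
        scale = math.sqrt(len(indices)) if normalize else 1.0
        total += scale * norm
    return total
```

Because the penalty acts on whole-group norms, minimizing it tends to set all weights of a structure to zero together, which is why selection happens at the level of regions rather than isolated voxels.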
Problematic internet use as an age-related multifaceted problem: Evidence from a two-site survey.
Ioannidis, Konstantinos; Treder, Matthias S; Chamberlain, Samuel R; Kiraly, Franz; Redden, Sarah A; Stein, Dan J; Lochner, Christine; Grant, Jon E
2018-06-01
Problematic internet use (PIU; otherwise known as Internet Addiction) is a growing problem in modern societies. There is scarce knowledge of the demographic variables and specific internet activities associated with PIU and a limited understanding of how PIU should be conceptualized. Our aim was to identify specific internet activities associated with PIU and explore the moderating role of age and gender in those associations. We recruited 1749 participants aged 18 and above via media advertisements in an Internet-based survey at two sites, one in the US, and one in South Africa; we utilized Lasso regression for the analysis. Specific internet activities were associated with higher problematic internet use scores, including general surfing (lasso β: 2.1), internet gaming (β: 0.6), online shopping (β: 1.4), use of online auction websites (β: 0.027), social networking (β: 0.46) and use of online pornography (β: 1.0). Age moderated the relationship between PIU and role-playing-games (β: 0.33), online gambling (β: 0.15), use of auction websites (β: 0.35) and streaming media (β: 0.35), with older age associated with higher levels of PIU. There was inconclusive evidence for gender and gender × internet activities being associated with problematic internet use scores. Attention-deficit hyperactivity disorder (ADHD) and social anxiety disorder were associated with high PIU scores in young participants (age ≤ 25, β: 0.35 and 0.65 respectively), whereas generalized anxiety disorder (GAD) and obsessive-compulsive disorder (OCD) were associated with high PIU scores in the older participants (age > 55, β: 6.4 and 4.3 respectively). Many types of online behavior (e.g. shopping, pornography, general surfing) bear a stronger relationship with maladaptive use of the internet than gaming supporting the diagnostic classification of problematic internet use as a multifaceted disorder. 
Furthermore, internet activities and psychiatric diagnoses associated with problematic internet use vary with age, with public health implications. Crown Copyright © 2018. Published by Elsevier Ltd. All rights reserved.
Mooney, Stephen J; Joshi, Spruha; Cerdá, Magdalena; Kennedy, Gary J; Beard, John R; Rundle, Andrew G
2017-04-01
Background: Few older adults achieve recommended physical activity levels. We conducted a "neighborhood environment-wide association study (NE-WAS)" of neighborhood influences on physical activity among older adults, analogous, in a genetic context, to a genome-wide association study. Methods: Physical Activity Scale for the Elderly (PASE) and sociodemographic data were collected via telephone survey of 3,497 residents of New York City aged 65 to 75 years. Using Geographic Information Systems, we created 337 variables describing each participant's residential neighborhood's built, social, and economic context. We used survey-weighted regression models adjusting for individual-level covariates to test for associations between each neighborhood variable and (i) total PASE score, (ii) gardening activity, (iii) walking, and (iv) housework (as a negative control). We also applied two "Big Data" analytic techniques, LASSO regression, and Random Forests, to algorithmically select neighborhood variables predictive of these four physical activity measures. Results: Of all 337 measures, proportion of residents living in extreme poverty was most strongly associated with total physical activity [-0.85; (95% confidence interval, -1.14 to -0.56) PASE units per 1% increase in proportion of residents living with household incomes less than half the federal poverty line]. Only neighborhood socioeconomic status and disorder measures were associated with total activity and gardening, whereas a broader range of measures was associated with walking. As expected, no neighborhood measures were associated with housework after accounting for multiple comparisons. Conclusions: This systematic approach revealed patterns in the domains of neighborhood measures associated with physical activity. Impact: The NE-WAS approach appears to be a promising exploratory technique. Cancer Epidemiol Biomarkers Prev; 26(4); 495-504.
Superhero science: from fiction to fact
NASA Astrophysics Data System (ADS)
Follows, Michael
2017-11-01
At the 2016 Manchester Science Festival, a team of like-minded scientists came together to try to suss out the real-world science behind everything from Wonder Woman's lasso to the Hulk's gigantic transformation. The result is The Secret Science of Superheroes, an eclectic collection of essays.
Molecular structure of human KATP in complex with ATP and ADP
Lee, Kenneth Pak Kin
2017-01-01
In many excitable cells, KATP channels respond to intracellular adenosine nucleotides: ATP inhibits while ADP activates. We present two structures of the human pancreatic KATP channel, containing the ABC transporter SUR1 and the inward-rectifier K+ channel Kir6.2, in the presence of Mg2+ and nucleotides. These structures, referred to as quatrefoil and propeller forms, were determined by single-particle cryo-EM at 3.9 Å and 5.6 Å, respectively. In both forms, ATP occupies the inhibitory site in Kir6.2. The nucleotide-binding domains of SUR1 are dimerized with Mg2+-ATP in the degenerate site and Mg2+-ADP in the consensus site. A lasso extension forms an interface between SUR1 and Kir6.2 adjacent to the ATP site in the propeller form and is disrupted in the quatrefoil form. These structures support the role of SUR1 as an ADP sensor and highlight the lasso extension as a key regulatory element in ADP’s ability to override ATP inhibition. PMID:29286281
Assessing mood symptoms through heartbeat dynamics: An HRV study on cardiosurgical patients.
Gentili, Claudio; Messerotti Benvenuti, Simone; Palomba, Daniela; Greco, Alberto; Scilingo, Enzo Pasquale; Valenza, Gaetano
2017-12-01
Heart Rate Variability (HRV) is reduced both in depression and in coronary heart disease (CHD), suggesting common pathophysiological mechanisms for the two disorders. Within CHD, cardiac surgery patients (CSP) with postoperative depression are at greater risk of adverse cardiac events. Therefore, CSP would especially benefit from early diagnosis of depression. Here we tested whether HRV multi-feature analysis discriminates CSP with or without depression and provides an effective estimation of symptom severity. Thirty-one patients admitted to cardiac rehabilitation after first-time cardiac surgery were recruited. Depressive symptoms were assessed with the Center for Epidemiologic Studies Depression Scale (CES-D). HRV features in time, frequency, and nonlinear domains were extracted from 5-min ECG recordings at rest and used as predictors in a least absolute shrinkage and selection operator (LASSO) regression model to estimate patients' CES-D score and to predict depressive state. The model significantly predicted the CES-D score in all subjects (the total explained variance of CES-D score was 89.93%). It also discriminated depressed and non-depressed CSP with 86.75% accuracy. Seven of the ten most informative metrics belonged to the nonlinear domain. A higher number of patients, evaluated also with a structured clinical interview, would help to generalize the present findings. To our knowledge this is the first study using a multi-feature approach to evaluate depression in CSP. The high informative power of HRV nonlinear metrics suggests their possible pathophysiological role both in depression and in CHD. The high accuracy of the algorithm at the single-subject level opens the way to its translational use as a screening tool in clinical practice. Copyright © 2017 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Song, Jiangdian; Zang, Yali; Li, Weimin; Zhong, Wenzhao; Shi, Jingyun; Dong, Di; Fang, Mengjie; Liu, Zaiyi; Tian, Jie
2017-03-01
Accurately predicting the risk of disease progression and the benefit of tyrosine kinase inhibitor (TKI) therapy for stage IV non-small cell lung cancer (NSCLC) patients with activating epidermal growth factor receptor (EGFR) mutations using current staging methods is challenging. We postulated that integrating a classifier consisting of multiple computed tomography (CT) phenotypic features and other clinicopathological risk factors into a single model could improve risk stratification and prediction of progression-free survival (PFS) on EGFR TKIs for these patients. Patients confirmed as stage IV EGFR-mutant NSCLC who received EGFR TKIs with no resection, and who underwent pretreatment contrast-enhanced CT approximately 2 weeks before treatment, were enrolled. A six-CT-phenotypic-feature-based classifier constructed by the LASSO Cox regression model, and three clinicopathological factors (pathologic N category, performance status (PS) score, and intrapulmonary metastasis status), were used to construct a nomogram in a training set of 115 patients. The prognostic and predictive accuracy of this nomogram was then subjected to an external independent validation in 107 patients. PFS did not differ significantly between the training and independent validation sets by the Mann-Whitney U test (P = 0.2670). PFS of the patients could be predicted with good consistency compared with the actual survival. The C-index of the proposed individualized nomogram in the training set (0.707, 95% CI: 0.643, 0.771) and the independent validation set (0.715, 95% CI: 0.650, 0.780) showed the potential of clinical prognosis to predict PFS of stage IV EGFR-mutant NSCLC from EGFR TKIs. The individualized nomogram might facilitate patient counselling and individualise management of patients with this disease.
NASA Astrophysics Data System (ADS)
Tang, Zhenchao; Liu, Zhenyu; Zhang, Xiaoyan; Shi, Yanjie; Wang, Shou; Fang, Mengjie; Sun, Yingshi; Dong, Enqing; Tian, Jie
2018-02-01
Locally advanced rectal cancer (LARC) patients are routinely treated with neoadjuvant chemoradiotherapy (CRT) first and receive total excision afterwards. However, some LARC patients may be downstaged to T1N0M0/T0N0M0 after CRT, which would qualify them for local excision. The accurate pathological TNM stage, though, can only be obtained by pathological examination after surgery. We aimed to conduct a radiomics analysis of diffusion-weighted imaging (DWI) data to identify the patients in T1N0M0/T0N0M0 stages before surgery, in the hope of providing clinical surgery decision support. A total of 223 routinely treated LARC patients in Beijing Cancer Hospital were enrolled in the current study. DWI data and clinical characteristics were collected after CRT. According to the pathological TNM stage, the patients of T1N0M0 and T0N0M0 stages were labelled as 1 and the other patients were labelled as 0. The first 123 patients in chronological order were used as the training set, and the remaining patients as the validation set. 563 image features extracted from the DWI data and clinical characteristics were used as features. A two-sample t-test was conducted to pre-select the top 50% discriminating features. A least absolute shrinkage and selection operator (Lasso) logistic regression model was used to further select features and construct the classification model. Based on the 14 selected image features, an area under the receiver operating characteristic (ROC) curve (AUC) of 0.8781 and a classification accuracy (ACC) of 0.8432 were achieved in the training set. In the validation set, an AUC of 0.8707 and an ACC of 0.84 were observed.
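The pre-selection step in this pipeline (rank features by a two-sample t-test, keep the top 50%) can be sketched in a few lines. This is a minimal illustration only: the abstract does not say whether a pooled-variance or Welch test was used, so the pooled form below is an assumption, the toy data are invented, and the subsequent Lasso logistic regression step is not shown.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Pooled-variance two-sample t statistic for two lists of values."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    diff = mean(a) - mean(b)
    if se == 0:
        # Degenerate case: no spread in either group.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def preselect_top_half(X, y):
    """Return indices of the half of the features with the largest |t|, best first."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(t_statistic(pos, neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[: max(1, n_features // 2)]

# Invented toy data: 6 patients, 4 features; features 0 and 2 separate the labels.
X = [
    [1.0, 5.0, 0.1, 7.0],
    [1.2, 4.0, 0.2, 8.0],
    [0.9, 6.0, 0.3, 9.0],
    [3.0, 5.5, 2.1, 8.0],
    [3.1, 4.5, 2.2, 9.0],
    [2.9, 5.0, 2.3, 7.0],
]
y = [0, 0, 0, 1, 1, 1]
selected = preselect_top_half(X, y)
print(selected)
```

To avoid optimistic estimates, a filter like this should be computed on the training set only and then applied unchanged to the validation set.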
Correlates of sleep quality in midlife and beyond: a machine learning analysis.
Kaplan, Katherine A; Hardas, Prajesh P; Redline, Susan; Zeitzer, Jamie M
2017-06-01
In older adults, traditional metrics derived from polysomnography (PSG) are not well correlated with subjective sleep quality. Little is known about whether the association between PSG and subjective sleep quality changes with age, or whether quantitative electroencephalography (qEEG) is associated with sleep quality. Therefore, we examined the relationship between subjective sleep quality and objective sleep characteristics (standard PSG and qEEG) across middle to older adulthood. Using cross-sectional analyses of 3173 community-dwelling men and women aged between 39 and 90 participating in the Sleep Heart Health Study, we examined the relationship between a morning rating of the prior night's sleep quality (sleep depth and restfulness) and polysomnographic, and qEEG descriptors of that single night of sleep, along with clinical and demographic measures. Multivariable models were constructed using two machine learning methods, namely lasso penalized regressions and random forests. Little variance was explained across models. Greater objective sleep efficiency, reduced wake after sleep onset, and fewer sleep-to-wake stage transitions were each associated with higher sleep quality; qEEG variables contributed little explanatory power. The oldest adults reported the highest sleep quality even as objective sleep deteriorated such that they would rate their sleep better, given the same level of sleep efficiency. Despite this, there were no major differences in the predictors of subjective sleep across the age span. Standard metrics derived from PSG, including qEEG, contribute little to explaining subjective sleep quality in middle-aged to older adults. The objective correlates of subjective sleep quality do not appear to systematically change with age despite a change in the relationship between subjective sleep quality and objective sleep efficiency. Published by Elsevier B.V.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mbah, Chamberlain; Thierens, Hubert
Purpose: To identify the main causes underlying the failure of prediction models for radiation therapy toxicity to replicate. Methods and Materials: Data were used from two German cohorts, Individual Radiation Sensitivity (ISE) (n=418) and Mammary Carcinoma Risk Factor Investigation (MARIE) (n=409), of breast cancer patients with similar characteristics and radiation therapy treatments. The toxicity endpoint chosen was telangiectasia. The LASSO (least absolute shrinkage and selection operator) logistic regression method was used to build a predictive model for a dichotomized endpoint (Radiation Therapy Oncology Group/European Organization for the Research and Treatment of Cancer score 0, 1, or ≥2). Internal areas under the receiver operating characteristic curve (inAUCs) were calculated by a naïve approach whereby the training data (ISE) were also used for calculating the AUC. Cross-validation was also applied to calculate the AUC within the same cohort, a second type of inAUC. Internal AUCs from cross-validation were calculated within ISE and MARIE separately. Models trained on one dataset (ISE) were applied to a test dataset (MARIE) and AUCs calculated (exAUCs). Results: Internal AUCs from the naïve approach were generally larger than inAUCs from cross-validation owing to overfitting the training data. Internal AUCs from cross-validation were also generally larger than the exAUCs, reflecting heterogeneity in the predictors between cohorts. The best models with largest inAUCs from cross-validation within both cohorts had a number of common predictors: hypertension, normalized total boost, and presence of estrogen receptors. Surprisingly, the effect (coefficient in the prediction model) of hypertension on telangiectasia incidence was positive in ISE and negative in MARIE. Other predictors were also not common between the 2 cohorts, illustrating that overcoming overfitting does not solve the problem of replication failure of prediction models completely.
Conclusions: Overfitting and cohort heterogeneity are the 2 main causes of replication failure of prediction models across cohorts. Cross-validation and similar techniques (eg, bootstrapping) cope with overfitting, but the development of validated predictive models for radiation therapy toxicity requires strategies that deal with cohort heterogeneity.
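The internal and external AUCs compared in this abstract are the same statistic evaluated on different data. As a small sketch, the AUC can be computed as the fraction of (case, control) pairs in which the case receives the higher predicted score. The scores and labels below are invented for illustration and are not taken from the ISE or MARIE cohorts.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks from one fitted model.
internal_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # scored on its own training cohort
external_auc = auc([0.3, 0.6, 0.5, 0.4], [0, 0, 1, 1])    # scored on a different cohort
print(internal_auc, external_auc)
```

In this toy case the same model looks better on the cohort it was trained on than on the other cohort, which is the pattern the abstract attributes to overfitting and cohort heterogeneity.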
Liu, Shelley H; Bobb, Jennifer F; Lee, Kyu Ha; Gennings, Chris; Claus Henn, Birgit; Bellinger, David; Austin, Christine; Schnaas, Lourdes; Tellez-Rojo, Martha M; Hu, Howard; Wright, Robert O; Arora, Manish; Coull, Brent A
2018-07-01
The impact of neurotoxic chemical mixtures on children's health is a critical public health concern. It is well known that during early life, toxic exposures may impact cognitive function during critical time intervals of increased vulnerability, known as windows of susceptibility. Knowledge on time windows of susceptibility can help inform treatment and prevention strategies, as chemical mixtures may affect a developmental process that is operating at a specific life phase. There are several statistical challenges in estimating the health effects of time-varying exposures to multi-pollutant mixtures, such as: multi-collinearity among the exposures both within time points and across time points, and complex exposure-response relationships. To address these concerns, we develop a flexible statistical method, called lagged kernel machine regression (LKMR). LKMR identifies critical exposure windows of chemical mixtures, and accounts for complex non-linear and non-additive effects of the mixture at any given exposure window. Specifically, LKMR estimates how the effects of a mixture of exposures change with the exposure time window using a Bayesian formulation of a grouped, fused lasso penalty within a kernel machine regression (KMR) framework. A simulation study demonstrates the performance of LKMR under realistic exposure-response scenarios, and demonstrates large gains over approaches that consider each time window separately, particularly when serial correlation among the time-varying exposures is high. Furthermore, LKMR demonstrates gains over another approach that inputs all time-specific chemical concentrations together into a single KMR. We apply LKMR to estimate associations between neurodevelopment and metal mixtures in Early Life Exposures in Mexico and Neurotoxicology, a prospective cohort study of child health in Mexico City.
Zhang, Xinyu; Cao, Jiguo; Carroll, Raymond J
2015-03-01
We consider model selection and estimation in a context where there are competing ordinary differential equation (ODE) models, and all the models are special cases of a "full" model. We propose a computationally inexpensive approach that employs statistical estimation of the full model, followed by a combination of a least squares approximation (LSA) and the adaptive Lasso. We show the resulting method, here called the LSA method, to be an (asymptotically) oracle model selection method. The finite sample performance of the proposed LSA method is investigated with Monte Carlo simulations, in which we examine the percentage of selecting true ODE models, the efficiency of the parameter estimation compared to simply using the full and true models, and coverage probabilities of the estimated confidence intervals for ODE parameters, all of which have satisfactory performances. Our method is also demonstrated by selecting the best predator-prey ODE to model a lynx and hare population dynamical system among some well-known and biologically interpretable ODE models. © 2014, The International Biometric Society.
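For orientation, one common way to write a least squares approximation combined with an adaptive Lasso penalty is shown below. The notation is an assumption for illustration and is not taken from the paper: \tilde{\beta} denotes the unpenalized estimate of the full ODE model's parameters, \hat{\Sigma} its estimated covariance matrix, and \lambda, \gamma > 0 tuning constants.

```latex
\hat{\beta}_{\lambda}
  = \arg\min_{\beta}\;
    (\beta - \tilde{\beta})^{\top}\,\hat{\Sigma}^{-1}\,(\beta - \tilde{\beta})
    \;+\; \sum_{j=1}^{p} \lambda_j\,\lvert \beta_j \rvert,
\qquad
\lambda_j = \frac{\lambda}{\lvert \tilde{\beta}_j \rvert^{\gamma}} .
```

Under this form the ODE system only has to be fitted once, for the full model; the penalized step is then a quadratic problem in \beta, and parameters shrunk exactly to zero identify which sub-model is selected.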
fastBMA: scalable network inference and transitive reduction.
Hung, Ling-Hong; Shi, Kaiyuan; Wu, Migao; Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee
2017-10-01
Inferring genetic networks from genome-wide expression data is extremely demanding computationally. We have developed fastBMA, a distributed, parallel, and scalable implementation of Bayesian model averaging (BMA) for this purpose. fastBMA also includes a computationally efficient module for eliminating redundant indirect edges in the network by mapping the transitive reduction to an easily solved shortest-path problem. We evaluated the performance of fastBMA on synthetic data and experimental genome-wide time series yeast and human datasets. When using a single CPU core, fastBMA is up to 100 times faster than the next fastest method, LASSO, with increased accuracy. It is a memory-efficient, parallel, and distributed application that scales to human genome-wide expression data. A 10 000-gene regulation network can be obtained in a matter of hours using a 32-core cloud cluster (2 nodes of 16 cores). fastBMA is a significant improvement over its predecessor ScanBMA. It is more accurate and orders of magnitude faster than other fast network inference methods such as the one based on LASSO. The improved scalability allows it to calculate networks from genome scale data in a reasonable time frame. The transitive reduction method can improve accuracy in denser networks. fastBMA is available as code (M.I.T. license) from GitHub (https://github.com/lhhunghimself/fastBMA), as part of the updated networkBMA Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/networkBMA.html) and as ready-to-deploy Docker images (https://hub.docker.com/r/biodepot/fastbma/). © The Authors 2017. Published by Oxford University Press.
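The idea of casting transitive reduction as a shortest-path problem can be illustrated with a small sketch. This is not fastBMA's code: it assumes that each edge carries a posterior probability, that the cost of an edge is the negative log of that probability, and that a direct edge is redundant when some indirect path has a lower total cost. All names and the example network are hypothetical.

```python
import heapq
import math

def shortest_indirect(graph, src, dst):
    """Dijkstra distance from src to dst that may not use the direct src->dst edge."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            if u == src and v == dst:
                continue  # skip the direct edge being tested
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

def transitive_reduction(probs):
    """Keep an edge only if no indirect path is more probable than the edge itself."""
    graph = {}
    for (u, v), p in probs.items():
        graph.setdefault(u, {})[v] = -math.log(p)  # probabilities in (0, 1] give costs >= 0
    kept = {}
    for (u, v), p in probs.items():
        if shortest_indirect(graph, u, v) >= -math.log(p):
            kept[(u, v)] = p
    return kept

# Hypothetical three-gene network: A->C is weaker than the chain A->B->C.
probs = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.5}
reduced = transitive_reduction(probs)
print(sorted(reduced))
```

Because the costs are non-negative, Dijkstra's algorithm applies directly, and adding costs along a path corresponds to multiplying the edge probabilities.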
NASA Astrophysics Data System (ADS)
Ytsma, Cai R.; Dyar, M. Darby
2018-01-01
Hydrogen (H) is a critical element to measure on the surface of Mars because its presence in mineral structures is indicative of past hydrous conditions. The Curiosity rover uses the laser-induced breakdown spectrometer (LIBS) on the ChemCam instrument to analyze rocks for their H emission signal at 656.6 nm, from which H can be quantified. Previous LIBS calibrations for H used small data sets measured on standards and/or manufactured mixtures of hydrous minerals and rocks and applied univariate regression to spectra normalized in a variety of ways. However, matrix effects common to LIBS make these calibrations of limited usefulness when applied to the broad range of compositions on the Martian surface. In this study, 198 naturally-occurring hydrous geological samples covering a broad range of bulk compositions with directly-measured H content are used to create more robust prediction models for measuring H in LIBS data acquired under Mars conditions. Both univariate and multivariate prediction models, including partial least square (PLS) and the least absolute shrinkage and selection operator (Lasso), are compared using several different methods for normalization of H peak intensities. Data from the ChemLIBS Mars-analog spectrometer at Mount Holyoke College are compared against spectra from the same samples acquired using a ChemCam-like instrument at Los Alamos National Laboratory and the ChemCam instrument on Mars. Results show that all current normalization and data preprocessing variations for quantifying H result in models with statistically indistinguishable prediction errors (accuracies) ca. ± 1.5 weight percent (wt%) H2O, limiting the applications of LIBS in these implementations for geological studies. 
This error is too large to allow distinctions among the most common hydrous phases (basalts, amphiboles, micas) to be made, though some clays (e.g., chlorites with ≈ 12 wt% H2O, smectites with 15-20 wt% H2O) and hydrated phases (e.g., gypsum with ≈ 20 wt% H2O) may be differentiated from lower-H phases within the known errors. Analyses of the H emission peak in Curiosity calibration targets and rock and soil targets on the Martian surface suggest that shot-to-shot variations of the ChemCam laser on Mars lead to variations in intensity that are comparable to those represented by the breadth of H standards tested in this study.
Sanchez-Santos, Maria T.; Davey, Trish; Leyland, Kirsten M.; Allsopp, Adrian J.; Lanham-New, Susan A.; Judge, Andrew; Arden, Nigel K.; Fallowfield, Joanne L.
2017-01-01
Background: Stress fractures (SFs) are one of the more severe overuse injuries in military training, and therefore, knowledge of potential risk factors is needed to assist in developing mitigating strategies. Purpose: To develop a prediction model for risk of SF in Royal Marines (RM) recruits during an arduous military training program. Study Design: Case-control study; Level of evidence, 3. Methods: RM recruits (N = 1082; age range, 16-33 years) who enrolled between September 2009 and July 2010 were prospectively followed through the 32-week RM training program. SF diagnosis was confirmed from a positive radiograph or magnetic resonance imaging scan. Potential risk factors assessed at week 1 included recruit characteristics, anthropometric assessment, dietary supplement use, lifestyle habits, fitness assessment, blood samples, 25(OH)D, bone strength as measured by heel broadband ultrasound attenuation, history of physical activity, and previous and current food intake. A logistic least absolute shrinkage and selection operator (LASSO) regression with 10-fold cross-validation was used to select potential predictors among 47 candidate variables. Model performance was assessed using measures of discrimination (c-index) and calibration. Bootstrapping was used for internal validation of the developed model and to quantify optimism. Results: A total of 86 (8%) volunteer recruits presented at least 1 SF during training. Twelve variables were identified as the most important risk factors of SF. Variables strongly associated with SF were age, body weight, pretraining weightbearing exercise, pretraining cycling, and childhood intake of milk and milk products. The c-index for the prediction model, which represents the model performance in future volunteers, was 0.73 (optimism-corrected c-index, 0.68). Although 25(OH)D and VO2max had only a borderline statistically significant association with SF, the inclusion of these factors improved the performance of the model.
Conclusion: These findings will assist in identifying recruits at greater risk of SF during training and will support interventions to mitigate this injury risk. However, external validation of the model is still required. PMID:28804727
Predictors of High Profit and High Deficit Outliers under SwissDRG of a Tertiary Care Center
Mehra, Tarun; Müller, Christian Thomas Benedikt; Volbracht, Jörk; Seifert, Burkhardt; Moos, Rudolf
2015-01-01
Principles: Case weights of Diagnosis Related Groups (DRGs) are determined by the average cost of cases from a previous billing period. However, a significant number of cases are largely over- or underfunded. We therefore decided to analyze earning outliers of our hospital to search for predictors enabling a better grouping under SwissDRG. Methods: 28,893 inpatient cases without additional private insurance discharged from our hospital in 2012 were included in our analysis. Outliers were defined by the interquartile range method. Predictors for deficit and profit outliers were determined with logistic regressions. Predictors were shortlisted with the LASSO regularized logistic regression method and compared to results of Random forest analysis. 10 of these parameters were selected for quantile regression analysis to quantify their impact on earnings. Results: Psychiatric diagnosis and admission as an emergency case were significant predictors for higher deficit with negative regression coefficients for all analyzed quantiles (p<0.001). Admission from an external health care provider was a significant predictor for a higher deficit in all but the 90% quantile (p<0.001 for Q10, Q20, Q50, Q80 and p = 0.0017 for Q90). Burns predicted higher earnings for cases which were favorably remunerated (p<0.001 for the 90% quantile). Osteoporosis predicted a higher deficit in the most underfunded cases, but did not predict differences in earnings for balanced or profitable cases (Q10 and Q20: p<0.00, Q50: p = 0.10, Q80: p = 0.88 and Q90: p = 0.52). ICU stay, mechanical and patient clinical complexity level score (PCCL) predicted higher losses at the 10% quantile but also higher profits at the 90% quantile (p<0.001). Conclusion: We suggest considering psychiatric diagnosis, admission as an emergency case and admission from an external health care provider as DRG split criteria as they predict large, consistent and significant losses. PMID:26517545
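The interquartile range rule mentioned in the methods can be sketched as follows. The abstract does not give the multiplier or the quantile convention the authors used; the conventional 1.5 x IQR fences and Python's default quantile method are assumptions, and the earnings figures are invented.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Split out values below Q1 - k*IQR (deficit) and above Q3 + k*IQR (profit)."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    low = [v for v in values if v < lower]
    high = [v for v in values if v > upper]
    return low, high

# Hypothetical per-case earnings (arbitrary units); negative means a deficit.
earnings = [-50, 1, 2, 3, 4, 5, 6, 7, 8, 9, 60]
deficit_outliers, profit_outliers = iqr_outliers(earnings)
print(deficit_outliers, profit_outliers)
```

Cases flagged this way would then serve as the binary outcomes (deficit outlier or not, profit outlier or not) for the logistic and LASSO regressions described in the abstract.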
Leaf beetles lasso tamarisk without hurting the relatives in Texas and the Southwest USA
USDA-ARS's Scientific Manuscript database
This online trade journal article summarizes recent progress in biological control of tamarisks (Tamarix spp., Tamaricaceae) in North America. Tamarisks are a group of five exotic Eurasian shrub/tree species plus hybrids (also known collectively as saltcedar) that have invaded over 1 million hectar...
Aranda, A; Bonizzi, P; Karel, J; Peeters, R
2015-08-01
This study compares Dower's inverse transform and the Frank lead system for myocardial infarction (MI) identification. We selected a set of relevant features for MI detection from the vectorcardiogram and then used the lasso method to build one model for Dower's inverse transform and one for the Frank lead system. We then compared the performance of the two models on MI detection. The proposed methods were tested using the PhysioNet PTB database, which contains 550 records, of which 368 are MIs. Two main conclusions come from this study. The first is that Dower's inverse transform performs as well as the Frank leads in identifying MI patients. The second is that lead positions have a large influence on the accuracy of MI patient identification.
An experimental validation of genomic selection in octoploid strawberry
Gezan, Salvador A; Osorio, Luis F; Verma, Sujeet; Whitaker, Vance M
2017-01-01
The primary goal of genomic selection is to increase genetic gains for complex traits by predicting performance of individuals for which phenotypic data are not available. The objective of this study was to experimentally evaluate the potential of genomic selection in strawberry breeding and to define a strategy for its implementation. Four clonally replicated field trials, two in each of 2 years comprised of a total of 1628 individuals, were established in 2013–2014 and 2014–2015. Five complex yield and fruit quality traits with moderate to low heritability were assessed in each trial. High-density genotyping was performed with the Affymetrix Axiom IStraw90 single-nucleotide polymorphism array, and 17 479 polymorphic markers were chosen for analysis. Several methods were compared, including Genomic BLUP, Bayes B, Bayes C, Bayesian LASSO Regression, Bayesian Ridge Regression and Reproducing Kernel Hilbert Spaces. Cross-validation within training populations resulted in higher values than for true validations across trials. For true validations, Bayes B gave the highest predictive abilities on average and also the highest selection efficiencies, particularly for yield traits that were the lowest heritability traits. Selection efficiencies using Bayes B for parent selection ranged from 74% for average fruit weight to 34% for early marketable yield. A breeding strategy is proposed in which advanced selection trials are utilized as training populations and in which genomic selection can reduce the breeding cycle from 3 to 2 years for a subset of untested parents based on their predicted genomic breeding values. PMID:28090334
Johnson, Brent A
2009-10-01
We consider estimation and variable selection in the partial linear model for censored data. The partial linear model for censored data is a direct extension of the accelerated failure time model, which is itself an important alternative to the proportional hazards model. We extend rank-based lasso-type estimators to a model that may contain nonlinear effects. Variable selection in such a partial linear model has direct application to high-dimensional survival analyses that attempt to adjust for clinical predictors. In the microarray setting, previous methods can adjust for other clinical predictors by assuming that clinical and gene expression data enter the model linearly in the same fashion. Here, we select important variables after adjusting for prognostic clinical variables but the clinical effects are assumed nonlinear. Our estimator is based on stratification and can be extended naturally to account for multiple nonlinear effects. We illustrate the utility of our method through simulation studies and application to the Wisconsin prognostic breast cancer data set.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Song, J; Pollom, E; Durkee, B
2015-06-15
Purpose: To predict response to radiation treatment using computational FDG-PET and CT images in locally advanced head and neck cancer (HNC). Methods: 68 patients with Stage III-IVB HNC treated with chemoradiation were included in this retrospective study. For each patient, we analyzed primary tumor and lymph nodes on PET and CT scans acquired both prior to and during radiation treatment, which led to 8 combinations of image datasets. From each image set, we extracted high-throughput, radiomic features of the following types: statistical, morphological, textural, histogram, and wavelet, resulting in a total of 437 features. We then performed unsupervised redundancy removal and stability test on these features. To avoid over-fitting, we trained a logistic regression model with simultaneous feature selection based on least absolute shrinkage and selection operator (LASSO). To objectively evaluate the prediction ability, we performed 5-fold cross validation (CV) with 50 random repeats of stratified bootstrapping. Feature selection and model training were conducted solely on the training set and independently validated on the holdout test set. The receiver operating characteristic (ROC) curve of the pooled result was computed and the area under the ROC curve (AUC) was calculated as the figure of merit. Results: For predicting local-regional recurrence, our model built on pre-treatment PET of lymph nodes achieved the best performance (AUC=0.762) on 5-fold CV, which compared favorably with node volume and SUVmax (AUC=0.704 and 0.449, p<0.001). Wavelet coefficients turned out to be the most predictive features. Prediction of distant recurrence showed a similar trend, in which pre-treatment PET features of lymph nodes had the highest AUC of 0.705. Conclusion: The radiomics approach identified novel imaging features that are predictive of radiation treatment response. If prospectively validated in larger cohorts, they could aid in risk-adaptive treatment of HNC.
Kaplan, Katherine A; Hirshman, Jason; Hernandez, Beatriz; Stefanick, Marcia L; Hoffman, Andrew R; Redline, Susan; Ancoli-Israel, Sonia; Stone, Katie; Friedman, Leah; Zeitzer, Jamie M
2017-02-01
Reports of subjective sleep quality are frequently collected in research and clinical practice. It is unclear, however, how well polysomnographic measures of sleep correlate with subjective reports of prior-night sleep quality in elderly men and women. Furthermore, the relative importance of various polysomnographic, demographic and clinical characteristics in predicting subjective sleep quality is not known. We sought to determine the correlates of subjective sleep quality in older adults using more recently developed machine learning algorithms that are suitable for selecting and ranking important variables. Community-dwelling older men (n=1024) and women (n=459), a subset of those participating in the Osteoporotic Fractures in Men study and the Study of Osteoporotic Fractures study, respectively, completed a single night of at-home polysomnographic recording of sleep followed by a set of morning questions concerning the prior night's sleep quality. Questionnaires concerning demographics and psychological characteristics were also collected prior to the overnight recording and entered into multivariable models. Two machine learning algorithms, lasso penalized regression and random forests, determined variable selection and the ordering of variable importance separately for men and women. Thirty-eight sleep, demographic and clinical correlates of sleep quality were considered. Together, these multivariable models explained only 11-17% of the variance in predicting subjective sleep quality. Objective sleep efficiency emerged as the strongest correlate of subjective sleep quality across all models, and across both sexes. Greater total sleep time and sleep stage transitions were also significant objective correlates of subjective sleep quality. The amount of slow wave sleep obtained was not determined to be important. Overall, the commonly obtained measures of polysomnographically-defined sleep contributed little to subjective ratings of prior-night sleep quality. 
Though they explained relatively little of the variance, sleep efficiency, total sleep time and sleep stage transitions were among the most important objective correlates. Published by Elsevier B.V.
Kaplan, Katherine A.; Hirshman, Jason; Hernandez, Beatriz; Stefanick, Marcia L.; Hoffman, Andrew R.; Redline, Susan; Ancoli-Israel, Sonia; Stone, Katie; Friedman, Leah; Zeitzer, Jamie M.
2016-01-01
Background Reports of subjective sleep quality are frequently collected in research and clinical practice. It is unclear, however, how well polysomnographic measures of sleep correlate with subjective reports of prior-night sleep quality in elderly men and women. Furthermore, the relative importance of various polysomnographic, demographic and clinical characteristics in predicting subjective sleep quality is not known. We sought to determine the correlates of subjective sleep quality in older adults using more recently developed machine learning algorithms that are suitable for selecting and ranking important variables. Methods Community-dwelling older men (n=1024) and women (n=459), a subset of those participating in the Osteoporotic Fractures in Men study and the Study of Osteoporotic Fractures study, respectively, completed a single night of at-home polysomnographic recording of sleep followed by a set of morning questions concerning the prior night's sleep quality. Questionnaires concerning demographics and psychological characteristics were also collected prior to the overnight recording and entered into multivariable models. Two machine learning algorithms, lasso penalized regression and random forests, determined variable selection and the ordering of variable importance separately for men and women. Results Thirty-eight sleep, demographic and clinical correlates of sleep quality were considered. Together, these multivariable models explained only 11-17% of the variance in predicting subjective sleep quality. Objective sleep efficiency emerged as the strongest correlate of subjective sleep quality across all models, and across both sexes. Greater total sleep time and sleep stage transitions were also significant objective correlates of subjective sleep quality. The amount of slow wave sleep obtained was not determined to be important. 
Conclusions Overall, the commonly obtained measures of polysomnographically-defined sleep contributed little to subjective ratings of prior-night sleep quality. Though they explained relatively little of the variance, sleep efficiency, total sleep time and sleep stage transitions were among the most important objective correlates. PMID:27889439
Woo, Hyekyung; Cho, Youngtae; Shim, Eunyoung; Lee, Jong-Koo; Lee, Chang-Gun; Kim, Seong Hwan
2016-07-04
As suggested as early as in 2006, logs of queries submitted to search engines seeking information could be a source for detection of emerging influenza epidemics if changes in the volume of search queries are monitored (infodemiology). However, selecting queries that are most likely to be associated with influenza epidemics is a particular challenge when it comes to generating better predictions. In this study, we describe a methodological extension for detecting influenza outbreaks using search query data; we provide a new approach for query selection through the exploration of contextual information gleaned from social media data. Additionally, we evaluate whether it is possible to use these queries for monitoring and predicting influenza epidemics in South Korea. Our study was based on freely available weekly influenza incidence data and query data originating from the search engine on the Korean website Daum between April 3, 2011 and April 5, 2014. To select queries related to influenza epidemics, several approaches were applied: (1) exploring influenza-related words in social media data, (2) identifying the chief concerns related to influenza, and (3) using Web query recommendations. Optimal feature selection by least absolute shrinkage and selection operator (Lasso) and support vector machine for regression (SVR) were used to construct a model predicting influenza epidemics. In total, 146 queries related to influenza were generated through our initial query selection approach. A considerable proportion of optimal features for final models were derived from queries with reference to the social media data. The SVR model performed well: the prediction values were highly correlated with the recent observed influenza-like illness (r=.956; P<.001) and virological incidence rate (r=.963; P<.001). These results demonstrate the feasibility of using search queries to enhance influenza surveillance in South Korea. 
In addition, an approach for query selection using social media data seems ideal for supporting influenza surveillance based on search query data.
Pena, Michelle J; Heinzel, Andreas; Rossing, Peter; Parving, Hans-Henrik; Dallmann, Guido; Rossing, Kasper; Andersen, Steen; Mayer, Bernd; Heerspink, Hiddo J L
2016-07-05
Individual patients show a large variability in albuminuria response to angiotensin receptor blockers (ARB). Identifying novel biomarkers that predict ARB response may help tailor therapy. We aimed to discover and validate a serum metabolite classifier that predicts albuminuria response to ARBs in patients with diabetes mellitus and micro- or macroalbuminuria. Liquid chromatography-tandem mass spectrometry metabolomics was performed on serum samples. Data from patients with type 2 diabetes and microalbuminuria (n = 49) treated with irbesartan 300 mg/day were used for discovery. LASSO and ridge regression were performed to develop the classifier. Improvement in albuminuria response prediction was assessed by calculating differences in R² between a reference model of clinical parameters and a model with clinical parameters and the classifier. The classifier was externally validated in patients with type 1 diabetes and macroalbuminuria (n = 50) treated with losartan 100 mg/day. Molecular process analysis was performed to link metabolites to molecular mechanisms contributing to ARB response. In discovery, median change in urinary albumin excretion (UAE) was -42 % [Q1-Q3: -69 to -8]. The classifier, consisting of 21 metabolites, was significantly associated with UAE response to irbesartan (p < 0.001) and improved prediction of UAE response on top of the clinical reference model (R² increase from 0.10 to 0.70; p < 0.001). In external validation, median change in UAE was -43 % [Q1-Q3: -63 to -23]. The classifier improved prediction of UAE response to losartan (R² increase from 0.20 to 0.59; p < 0.001). Specifically, ADMA impacting eNOS activity appears to be a relevant factor in ARB response. A serum metabolite classifier was discovered and externally validated to significantly improve prediction of albuminuria response to ARBs in diabetes mellitus.
Woo, Hyekyung; Shim, Eunyoung; Lee, Jong-Koo; Lee, Chang-Gun; Kim, Seong Hwan
2016-01-01
Background As suggested as early as in 2006, logs of queries submitted to search engines seeking information could be a source for detection of emerging influenza epidemics if changes in the volume of search queries are monitored (infodemiology). However, selecting queries that are most likely to be associated with influenza epidemics is a particular challenge when it comes to generating better predictions. Objective In this study, we describe a methodological extension for detecting influenza outbreaks using search query data; we provide a new approach for query selection through the exploration of contextual information gleaned from social media data. Additionally, we evaluate whether it is possible to use these queries for monitoring and predicting influenza epidemics in South Korea. Methods Our study was based on freely available weekly influenza incidence data and query data originating from the search engine on the Korean website Daum between April 3, 2011 and April 5, 2014. To select queries related to influenza epidemics, several approaches were applied: (1) exploring influenza-related words in social media data, (2) identifying the chief concerns related to influenza, and (3) using Web query recommendations. Optimal feature selection by least absolute shrinkage and selection operator (Lasso) and support vector machine for regression (SVR) were used to construct a model predicting influenza epidemics. Results In total, 146 queries related to influenza were generated through our initial query selection approach. A considerable proportion of optimal features for final models were derived from queries with reference to the social media data. The SVR model performed well: the prediction values were highly correlated with the recent observed influenza-like illness (r=.956; P<.001) and virological incidence rate (r=.963; P<.001). Conclusions These results demonstrate the feasibility of using search queries to enhance influenza surveillance in South Korea. 
In addition, an approach for query selection using social media data seems ideal for supporting influenza surveillance based on search query data. PMID:27377323
Monasta, Lorenzo; Pierobon, Chiara; Princivalle, Andrea; Martelossi, Stefano; Marcuzzi, Annalisa; Pasini, Francesco; Perbellini, Luigi
2017-01-01
Inflammatory bowel diseases (IBD) profoundly affect quality of life and have been gradually increasing in incidence, prevalence and severity in many areas of the world, and in children in particular. Patients with suspected IBD require careful history and clinical examination, while definitive diagnosis relies on endoscopic and histological findings. The aim of the present study was to investigate whether the alveolar air of pediatric patients with IBD presents a specific volatile organic compounds' (VOCs) pattern when compared to controls. Patients 10-17 years of age, were divided into four groups: Crohn's disease (CD), ulcerative colitis (UC), controls with gastrointestinal symptomatology, and surgical controls with no evidence of gastrointestinal problems. Alveolar breath was analyzed by ion molecule reaction mass spectrometry. Four models were built starting from 81 molecules plus the age of subjects as independent variables, adopting a penalizing LASSO logistic regression approach: 1) IBDs vs. controls, finally based on 18 VOCs plus age (sensitivity = 95%, specificity = 69%, AUC = 0.925); 2) CD vs. UC, finally based on 13 VOCs plus age (sensitivity = 94%, specificity = 76%, AUC = 0.934); 3) IBDs vs. gastroenterological controls, finally based on 15 VOCs plus age (sensitivity = 94%, specificity = 65%, AUC = 0.918); 4) IBDs vs. controls, built starting from the 21 directly or indirectly calibrated molecules only, and finally based on 12 VOCs plus age (sensitivity = 94%, specificity = 71%, AUC = 0.888). The molecules identified by the models were carefully studied in relation to the concerned outcomes. This study, with the creation of models based on VOCs profiles, precise instrumentation and advanced statistical methods, can contribute to the development of new non-invasive, fast and relatively inexpensive diagnostic tools, with high sensitivity and specificity. 
It also represents a crucial step towards gaining further insights on the etiology of IBD through the analysis of specific molecules which are the expression of the particular metabolism that characterizes these patients.
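The three figures reported for each model above (sensitivity, specificity and AUC) can be computed from predicted scores as in the sketch below. The labels, scores and the 0.45 cut-off are invented for illustration, and the pairwise AUC formula is the standard rank-based definition rather than anything specific to this study.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Classify score >= threshold as positive and return (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    """Fraction of (case, control) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for three cases (1) and three controls (0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
sens, spec = sensitivity_specificity(labels, scores, threshold=0.45)
auc_value = auc(labels, scores)
print(sens, spec, auc_value)
```

Sensitivity and specificity depend on the chosen cut-off, while the AUC summarizes ranking quality over all cut-offs, which is why a model can report a high AUC alongside a modest specificity.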
Ortega Hinojosa, Alberto M; Davies, Molly M; Jarjour, Sarah; Burnett, Richard T; Mann, Jennifer K; Hughes, Edward; Balmes, John R; Turner, Michelle C; Jerrett, Michael
2014-10-01
Globally and in the United States, smoking and obesity are leading causes of death and disability. Reliable estimates of prevalence for these risk factors are often missing variables in public health surveillance programs. This may limit the capacity of public health surveillance to target interventions or to assess associations between other environmental risk factors (e.g., air pollution) and health because smoking and obesity are often important confounders. To generate prevalence estimates of smoking and obesity rates over small areas for the United States (i.e., at the ZIP code and census tract levels). We predicted smoking and obesity prevalence using a combined approach first using a lasso-based variable selection procedure followed by a two-level random effects regression with a Poisson link clustered on state and county. We used data from the Behavioral Risk Factor Surveillance System (BRFSS) from 1991 to 2010 to estimate the model. We used 10-fold cross-validated mean squared errors and the variance of the residuals to test our model. To downscale the estimates we combined the prediction equations with 1990 and 2000 U.S. Census data for each of the four five-year time periods in this time range at the ZIP code and census tract levels. Several sensitivity analyses were conducted using models that included only basic terms, that accounted for spatial autocorrelation, and used Generalized Linear Models that did not include random effects. The two-level random effects model produced improved estimates compared to the fixed effects-only models. Estimates were particularly improved for the two-thirds of the conterminous U.S. where BRFSS data were available to estimate the county level random effects. We downscaled the smoking and obesity rate predictions to derive ZIP code and census tract estimates. To our knowledge these smoking and obesity predictions are the first to be developed for the entire conterminous U.S. for census tracts and ZIP codes. 
Our estimates could have significant utility for public health surveillance. Copyright © 2014. Published by Elsevier Inc.
Saphir, A
1999-02-08
In Texas, they do things differently, and they do things big. Hospitals in the Lone Star State have been banding together more often and more effectively than elsewhere. Swinging their lassos, they are riding herd on HMOs, enjoying record profits and making ever-larger deals.
California Lassos a Lone Star as Its Savior
ERIC Educational Resources Information Center
Fain, Paul
2008-01-01
For at least four decades, the University of California has been the international gold standard in public higher education. The system's 10 campuses are magnets for top-notch faculty members and students. With an annual budget of $18-billion, the university includes five medical centers and three national laboratories. And one of every 10 members…
Welcome to Fermilab Butterflies!!
, fascinating insects, and there's a lot to learn about them! Join our expert, Tom Peterson, and explore the Meet Tom Peterson, Fermilab's Butterfly Expert Go to our Butterfly Links Have fun! Graphics and Page Design: Rory Parilac, Content: Tom Peterson and Rory Parilac Database and Lasso Code: Liz Quigg Web
Sparsest representations and approximations of an underdetermined linear system
NASA Astrophysics Data System (ADS)
Tardivel, Patrick J. C.; Servien, Rémi; Concordet, Didier
2018-05-01
In an underdetermined linear system of equations, constrained ℓ1 minimization methods such as the basis pursuit or the lasso are often used to recover one of the sparsest representations or approximations of the system. The null space property is a sufficient and ‘almost’ necessary condition to recover a sparsest representation with the basis pursuit. Unfortunately, this property cannot be easily checked. On the other hand, the mutual coherence is an easily checkable sufficient condition ensuring that the basis pursuit recovers one of the sparsest representations. Because the mutual coherence condition is too strong, it is hardly met in practice. Even if one of these conditions holds, to our knowledge, there is no theoretical result ensuring that the lasso solution is one of the sparsest approximations. In this article, we study a novel constrained problem that gives, without any condition, one of the sparsest representations or approximations. To solve this problem, we provide a numerical method and we prove its convergence. Numerical experiments show that this approach gives better results than both the basis pursuit problem and the reweighted ℓ1 minimization problem.
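To make the "easily checkable" claim concrete, here is a minimal sketch of the mutual coherence check. It uses the classical bound that basis pursuit recovers any representation with fewer than (1 + 1/μ)/2 nonzero entries; the matrix and the function names are illustrative assumptions, not taken from the article.

```python
import math

def mutual_coherence(A):
    """Largest absolute normalized inner product between distinct columns of A.

    A is a list of rows; its columns are the atoms of the system.
    """
    n_rows, n_cols = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n_rows)] for j in range(n_cols)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    mu = 0.0
    for a in range(n_cols):
        for b in range(a + 1, n_cols):
            dot = sum(x * y for x, y in zip(cols[a], cols[b]))
            mu = max(mu, abs(dot) / (norms[a] * norms[b]))
    return mu

def guaranteed_sparsity(mu):
    """Largest integer k with k < (1 + 1/mu) / 2 (assumes mu > 0)."""
    bound = 0.5 * (1.0 + 1.0 / mu)
    return math.ceil(bound) - 1

# Toy 2 x 3 underdetermined system: two canonical atoms plus their sum.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
mu = mutual_coherence(A)
k_max = guaranteed_sparsity(mu)
print(mu, k_max)
```

Even in this tiny example the guarantee covers only 1-sparse representations, which illustrates why the abstract calls the coherence condition too strong to be met often in practice.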
Comparison of Penalty Functions for Sparse Canonical Correlation Analysis
Chalise, Prabhakar; Fridley, Brooke L.
2011-01-01
Canonical correlation analysis (CCA) is a widely used multivariate method for assessing the association between two sets of variables. However, when the number of variables far exceeds the number of subjects, such as in the case of large-scale genomic studies, the traditional CCA method is not appropriate. In addition, when the variables are highly correlated the sample covariance matrices become unstable or undefined. To overcome these two issues, sparse canonical correlation analysis (SCCA) for multiple data sets has been proposed using a Lasso type of penalty. However, these methods do not have direct control over the sparsity of the solution. An additional step that uses the Bayesian Information Criterion (BIC) has also been suggested to further filter out unimportant features. In this paper, a comparison of four penalty functions (Lasso, Elastic-net, SCAD and Hard-threshold) for SCCA with and without the BIC filtering step has been carried out using both real and simulated genotypic and mRNA expression data. This study indicates that the SCAD penalty with BIC filter would be a preferable penalty function for application of SCCA to genomic data. PMID:21984855
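The four penalties compared above differ in how they shrink a single coefficient. The sketch below gives the textbook univariate thresholding rules (orthonormal-design case) for each; whether the paper applies exactly these forms inside its SCCA updates is an assumption, and `a = 3.7` is the default commonly suggested for SCAD.

```python
import math

def soft_threshold(z, lam):
    """Lasso: shrink toward zero by lam, exactly zero when |z| <= lam."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def elastic_net_threshold(z, lam1, lam2):
    """Elastic net: soft-threshold by lam1, then ridge-shrink by 1 / (1 + lam2)."""
    return soft_threshold(z, lam1) / (1.0 + lam2)

def hard_threshold(z, lam):
    """Hard threshold: keep z unchanged if |z| > lam, otherwise zero."""
    return z if abs(z) > lam else 0.0

def scad_threshold(z, lam, a=3.7):
    """SCAD: lasso-like for small |z|, no shrinkage for large |z|."""
    az = abs(z)
    if az <= 2.0 * lam:
        return soft_threshold(z, lam)
    if az <= a * lam:
        return ((a - 1.0) * z - math.copysign(a * lam, z)) / (a - 2.0)
    return z

for z in (0.5, 1.5, 3.0, 5.0):
    print(z, soft_threshold(z, 1.0), elastic_net_threshold(z, 1.0, 1.0),
          hard_threshold(z, 1.0), scad_threshold(z, 1.0))
```

The printout shows the practical difference: Lasso and Elastic-net keep shrinking large coefficients, whereas SCAD and the hard threshold leave them untouched, which is one reason SCAD is often preferred when large effects should be estimated without bias.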
El Alaoui, Adil; Sbiyaa, Mouhcine; Bah, Aliou; Rabhi, Ilyas; mezzani, Amine; Marzouki, Amine; Boutayeb, Fawzi
2015-01-01
Leprosy is an infectious disease caused by a mycobacterium (M. leprae, or Hansen's bacillus) whose nerve tropism is destructive to Schwann cells. Truncal neuropathies secondary to leprosy occur predominantly where nerve trunks cross inextensible osteoligamentous tunnels, such as the retro-epitrochlear groove through which the ulnar nerve passes. Many studies have addressed nerve damage secondary to leprosy, especially involvement of the ulnar nerve, which manifests as clawing of the fingers. Treatment in this setting is palliative and relies on several techniques described in the literature. We report a case of ulnar claw hand in a leprosy patient treated by Zancolli lasso tendon transfer. PMID:26985277
Wager, Tor D.; Atlas, Lauren Y.; Leotti, Lauren A.; Rilling, James K.
2012-01-01
Recent studies have identified brain correlates of placebo analgesia, but none have assessed how accurately patterns of brain activity can predict individual differences in placebo responses. We reanalyzed data from two fMRI studies of placebo analgesia (N = 47), using patterns of fMRI activity during the anticipation and experience of pain to predict new subjects’ scores on placebo analgesia and placebo-induced changes in pain processing. We used a cross-validated regression procedure, LASSO-PCR, which provided both unbiased estimates of predictive accuracy and interpretable maps of which regions are most important for prediction. Increased anticipatory activity in a frontoparietal network and decreases in a posterior insular/temporal network predicted placebo analgesia. Patterns of anticipatory activity across the cortex predicted a moderate amount of variance in the placebo response (~12% overall, ~40% for study 2 alone), which is substantial considering the multiple likely contributing factors. The most predictive regions were those associated with emotional appraisal, rather than cognitive control or pain processing. During pain, decreases in limbic and paralimbic regions most strongly predicted placebo analgesia. Responses within canonical pain-processing regions explained significant variance in placebo analgesia, but the pattern of effects was inconsistent with widespread decreases in nociceptive processing. Together, the findings suggest that engagement of emotional appraisal circuits drives individual variation in placebo analgesia, rather than early suppression of nociceptive processing. This approach provides a framework that will allow prediction accuracy to increase as new studies provide more precise information for future predictive models. PMID:21228154
Effect of Spatial Resolution for Characterizing Soil Properties from Imaging Spectrometer Data
NASA Astrophysics Data System (ADS)
Dutta, D.; Kumar, P.; Greenberg, J. A.
2015-12-01
The feasibility of quantifying soil constituents over large areas using airborne hyperspectral data [0.35 - 2.5 μm] in an ensemble bootstrapping lasso algorithmic framework has been demonstrated previously [1]. However the effects of coarsening the spatial resolution of hyperspectral data on the quantification of soil constituents are unknown. We use Airborne Visible Infrared Imaging Spectrometer (AVIRIS) data collected at 7.6m resolution over Birds Point New Madrid (BPNM) floodway for up-scaling and generating multiple coarser resolution datasets including the 60m Hyperspectral Infrared Imager (HyspIRI) like data. HyspIRI is a proposed visible shortwave/thermal infrared mission, which will provide global data over a spectral range of 0.35 - 2.5μm at a spatial resolution of 60m. Our results show that the lasso method, which is based on point scale observational data, is scalable. We found consistent good model performance (R2) values (0.79 < R2 < 0.82) and correct classifications as per USDA soil texture classes at multiple spatial resolutions. The results further demonstrate that the attributes of the pdf for different soil constituents across the landscape and the within-pixel variance are well preserved across scales. Our analysis provides a methodological framework with a sufficient set of metrics for assessing the performance of scaling up analysis from point scale observational data and may be relevant for other similar remote sensing studies. [1] Dutta, D.; Goodwell, A.E.; Kumar, P.; Garvey, J.E.; Darmody, R.G.; Berretta, D.P.; Greenberg, J.A., "On the Feasibility of Characterizing Soil Properties From AVIRIS Data," Geoscience and Remote Sensing, IEEE Transactions on, vol.53, no.9, pp.5133,5147, Sept. 2015. doi: 10.1109/TGRS.2015.2417547.
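As a rough illustration of what an ensemble bootstrapping lasso involves, the sketch below fits a lasso by coordinate descent on bootstrap resamples and reports the mean coefficients and selection frequencies. It is not the authors' implementation: the objective scaling, the absence of an intercept, the function names and the toy data are all assumptions made for the example.

```python
import random

def soft(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return z - t if z > t else z + t if z < -t else 0.0

def lasso_cd(X, y, lam, sweeps=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)  # residual y - Xb for b = 0
    for _ in range(sweeps):
        for j in range(p):
            zj = sum(X[i][j] ** 2 for i in range(n)) / n
            if zj == 0.0:
                continue  # all-zero column in this sample: coefficient stays 0
            rho = sum(X[i][j] * (r[i] + X[i][j] * b[j]) for i in range(n)) / n
            new = soft(rho, lam) / zj
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                b[j] = new
    return b

def bootstrap_lasso(X, y, lam, n_boot=50, seed=0):
    """Average lasso coefficients and selection frequency over bootstrap resamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    total, counts = [0.0] * p, [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        b = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            total[j] += b[j]
            counts[j] += b[j] != 0.0
    return [t / n_boot for t in total], [c / n_boot for c in counts]

# Toy data: feature 0 carries the signal, feature 1 is near-noise.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
coef = lasso_cd(X, y, lam=0.1)
mean_coef, freq = bootstrap_lasso(X, y, lam=0.1)
print(coef, mean_coef, freq)
```

In a spectral setting the columns of `X` would be band reflectances and `y` a measured soil constituent; the selection frequencies then indicate which bands are chosen consistently across resamples.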
Logsdon, Benjamin A.; Mezey, Jason
2010-01-01
Cellular gene expression measurements contain regulatory information that can be used to discover novel network relationships. Here, we present a new algorithm for network reconstruction powered by the adaptive lasso, a theoretically and empirically well-behaved method for selecting the regulatory features of a network. Any algorithms designed for network discovery that make use of directed probabilistic graphs require perturbations, produced by either experiments or naturally occurring genetic variation, to successfully infer unique regulatory relationships from gene expression data. Our approach makes use of appropriately selected cis-expression Quantitative Trait Loci (cis-eQTL), which provide a sufficient set of independent perturbations for maximum network resolution. We compare the performance of our network reconstruction algorithm to four other approaches: the PC-algorithm, QTLnet, the QDG algorithm, and the NEO algorithm, all of which have been used to reconstruct directed networks among phenotypes leveraging QTL. We show that the adaptive lasso can outperform these algorithms for networks of ten genes and ten cis-eQTL, and is competitive with the QDG algorithm for networks with thirty genes and thirty cis-eQTL, with rich topologies and hundreds of samples. Using this novel approach, we identify unique sets of directed relationships in Saccharomyces cerevisiae when analyzing genome-wide gene expression data for an intercross between a wild strain and a lab strain. We recover novel putative network relationships between a tyrosine biosynthesis gene (TYR1), and genes involved in endocytosis (RCY1), the spindle checkpoint (BUB2), sulfonate catabolism (JLP1), and cell-cell communication (PRM7). Our algorithm provides a synthesis of feature selection methods and graphical model theory that has the potential to reveal new directed regulatory relationships from the analysis of population level genetic and gene expression data. PMID:21152011
Linking metabolic network features to phenotypes using sparse group lasso.
Samal, Satya Swarup; Radulescu, Ovidiu; Weber, Andreas; Fröhlich, Holger
2017-11-01
Integration of metabolic networks with '-omics' data has been a subject of recent research in order to better understand the behaviour of such networks with respect to differences between biological and clinical phenotypes. Under the conditions of steady state of the reaction network and the non-negativity of fluxes, metabolic networks can be algebraically decomposed into a set of sub-pathways often referred to as extreme currents (ECs). Our objective is to find the statistical association of such sub-pathways with given clinical outcomes, resulting in a particular instance of a self-contained gene set analysis method. In this direction, we propose a method based on sparse group lasso (SGL) to identify phenotype associated ECs based on gene expression data. SGL selects a sparse set of feature groups and also introduces sparsity within each group. Features in our model are clusters of ECs, and feature groups are defined based on correlations among these features. We apply our method to metabolic networks from KEGG database and study the association of network features to prostate cancer (where the outcome is tumor and normal, respectively) as well as glioblastoma multiforme (where the outcome is survival time). In addition, simulations show the superior performance of our method compared to global test, which is an existing self-contained gene set analysis method. R code (compatible with version 3.2.5) is available from http://www.abi.bit.uni-bonn.de/index.php?id=17. samal@combine.rwth-aachen.de or frohlich@bit.uni-bonn.de. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
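The statement that SGL selects a sparse set of groups and also introduces sparsity within each group can be seen in the proximal operator of the penalty. The sketch below uses the common formulation with penalty α·λ·‖β‖₁ + (1−α)·λ·Σ_g √p_g·‖β_g‖₂; this weighting and all names and numbers here are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def sgl_prox(groups, lam, alpha, step=1.0):
    """Proximal operator of the sparse group lasso penalty, applied group by group.

    Step 1 soft-thresholds each coefficient (within-group sparsity);
    step 2 shrinks the whole group and may set it to zero (group sparsity).
    """
    out = []
    for g in groups:
        t1 = step * alpha * lam
        s = [math.copysign(max(abs(v) - t1, 0.0), v) for v in g]
        norm = math.sqrt(sum(v * v for v in s))
        t2 = step * (1.0 - alpha) * lam * math.sqrt(len(g))
        scale = max(1.0 - t2 / norm, 0.0) if norm > 0.0 else 0.0
        out.append([scale * v for v in s])
    return out

# Three hypothetical feature groups (e.g. clusters of extreme currents).
groups = [[3.0, 0.2], [0.3, -0.4], [0.8, -0.8]]
result = sgl_prox(groups, lam=1.0, alpha=0.5)
print(result)
```

In the output the first group survives with one of its two coefficients set to zero, while the other two groups are removed entirely, which is the two-level sparsity pattern described above.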
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jang, Dae-Heung; Anderson-Cook, Christine Michaela
2017-04-12
When there are constraints on resources, an unreplicated factorial or fractional factorial design can allow efficient exploration of numerous factor and interaction effects. A half-normal plot is a common graphical tool used to compare the relative magnitude of effects and to identify important effects from these experiments when no estimate of error from the experiment is available. An alternative is to use a least absolute shrinkage and selection operator plot to examine the pattern of model selection terms from an experiment. We examine how both the half-normal and least absolute shrinkage and selection operator plots are impacted by the absence of individual observations or an outlier, and the robustness of conclusions obtained from these 2 techniques for identifying important effects from factorial experiments. The methods are illustrated with 2 examples from the literature.
Linear and nonlinear variable selection in competing risks data.
Ren, Xiaowei; Li, Shanshan; Shen, Changyu; Yu, Zhangsheng
2018-06-15
Subdistribution hazard model for competing risks data has been applied extensively in clinical researches. Variable selection methods of linear effects for competing risks data have been studied in the past decade. There is no existing work on selection of potential nonlinear effects for subdistribution hazard model. We propose a two-stage procedure to select the linear and nonlinear covariate(s) simultaneously and estimate the selected covariate effect(s). We use spectral decomposition approach to distinguish the linear and nonlinear parts of each covariate and adaptive LASSO to select each of the 2 components. Extensive numerical studies are conducted to demonstrate that the proposed procedure can achieve good selection accuracy in the first stage and small estimation biases in the second stage. The proposed method is applied to analyze a cardiovascular disease data set with competing death causes. Copyright © 2018 John Wiley & Sons, Ltd.
Predicting human age using regional morphometry and inter-regional morphological similarity
NASA Astrophysics Data System (ADS)
Wang, Xun-Heng; Li, Lihua
2016-03-01
The goal of this study is to predict human age using neuro-metrics derived from structural MRI, and to investigate the relationships between age and predictive neuro-metrics. To this end, a cohort of healthy subjects was recruited from the 1000 Functional Connectomes Project. The ages of the participants ranged from 7 to 83 years (36.17 ± 20.46). The structural MRI for each subject was preprocessed using FreeSurfer, resulting in regional cortical thickness, mean curvature, regional volume and regional surface area for 148 anatomical parcellations. The individual age was predicted from the combination of regional and inter-regional neuro-metrics. The prediction accuracy is r = 0.835, p < 0.00001, evaluated by the Pearson correlation coefficient between predicted ages and actual ages. Moreover, the LASSO linear regression also found certain predictive features, most of which were inter-regional features. The turning point of the developmental trajectories in the human brain was around 40 years old based on regional cortical thickness. In conclusion, structural MRI measures could be potential biomarkers for aging in the human brain. Human age could be successfully predicted from the combination of regional morphometry and inter-regional morphological similarity. The inter-regional measures could be beneficial for investigating the human brain connectome.
Penalized unsupervised learning with outliers
Witten, Daniela M.
2013-01-01
We consider the problem of performing unsupervised learning in the presence of outliers – that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an “error” term, and we penalize the errors using a group lasso penalty in order to encourage most of the observations’ errors to exactly equal zero. We show that this approach can be used in order to develop extensions of K-means clustering and principal components analysis that result in accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with M-estimation are explored. PMID:23875057
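The penalized error term described in the abstract above can be illustrated with a small sketch. It assumes the per-observation subproblem is written as 0.5*||r - e||^2 + lam*||e||_2, where r is the residual of an observation from its assigned cluster center; under that assumed scaling the minimizer is a group soft-thresholding of r, so observations close to their center get an error of exactly zero. The function names, the fixed cluster assignment, and the scaling of `lam` are illustrative assumptions, not the paper's implementation.

```python
import math


def group_soft_threshold(residual, lam):
    """Minimizer over e of 0.5*||residual - e||^2 + lam*||e||_2 (assumed scaling)."""
    norm = math.sqrt(sum(r * r for r in residual))
    if norm <= lam:
        # Residual is small: the whole error vector is shrunk to exactly zero.
        return [0.0] * len(residual)
    scale = 1.0 - lam / norm
    return [scale * r for r in residual]


def flag_outliers(points, centers, labels, lam):
    """Return indices of observations whose error vector is non-zero.

    `labels[i]` is the index of the center currently assigned to points[i];
    a full algorithm would alternate this step with re-estimating the centers.
    """
    flagged = []
    for i, (x, k) in enumerate(zip(points, labels)):
        residual = [xj - cj for xj, cj in zip(x, centers[k])]
        if any(e != 0.0 for e in group_soft_threshold(residual, lam)):
            flagged.append(i)
    return flagged
```

With a single center at the origin and `lam = 1`, only points farther than distance 1 from the center receive a non-zero error and are flagged.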
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mizukami, Wataru, E-mail: wataru.mizukami@bristol.ac.uk; Tew, David P., E-mail: david.tew@bristol.ac.uk; Habershon, Scott, E-mail: S.Habershon@warwick.ac.uk
2014-10-14
We present a new approach to semi-global potential energy surface fitting that uses the least absolute shrinkage and selection operator (LASSO) constrained least squares procedure to exploit an extremely flexible form for the potential function, while at the same time controlling the risk of overfitting and avoiding the introduction of unphysical features such as divergences or high-frequency oscillations. Drawing from a massively redundant set of overlapping distributed multi-dimensional Gaussian functions of inter-atomic separations we build a compact full-dimensional surface for malonaldehyde, fit to explicitly correlated coupled cluster CCSD(T)(F12*) energies with a root-mean-square deviation accuracy of 0.3%–0.5% up to 25 000 cm⁻¹ above equilibrium. Importance-sampled diffusion Monte Carlo calculations predict zero point energies for malonaldehyde and its deuterated isotopologue of 14 715.4(2) and 13 997.9(2) cm⁻¹ and hydrogen transfer tunnelling splittings of 21.0(4) and 3.2(4) cm⁻¹, respectively, which are in excellent agreement with the experimental values of 21.583 and 2.915(4) cm⁻¹.
Kwong, Qi Bin; Ong, Ai Ling; Teh, Chee Keng; Chew, Fook Tim; Tammi, Martti; Mayes, Sean; Kulaveerasingam, Harikrishna; Yeoh, Suat Hui; Harikrishna, Jennifer Ann; Appleton, David Ross
2017-06-06
Genomic selection (GS) uses genome-wide markers to select individuals with the desired overall combination of breeding traits. A total of 1,218 individuals from a commercial population of Ulu Remis x AVROS (UR x AVROS) were genotyped using the OP200K array. The traits of interest included: shell-to-fruit ratio (S/F, %), mesocarp-to-fruit ratio (M/F, %), kernel-to-fruit ratio (K/F, %), fruit per bunch (F/B, %), oil per bunch (O/B, %) and oil per palm (O/P, kg/palm/year). Genomic heritabilities of these traits were estimated to be in the range of 0.40 to 0.80. GS methods assessed were RR-BLUP, Bayes A (BA), Cπ (BC), Lasso (BL) and Ridge Regression (BRR). All methods resulted in almost equal prediction accuracy. The accuracy achieved ranged from 0.40 to 0.70, correlating with the heritability of traits. By selecting the most important markers, RR-BLUP B has the potential to outperform other methods. The marker density for certain traits can be further reduced based on the linkage disequilibrium (LD). Together with in silico breeding, GS is now being used in oil palm breeding programs to hasten parental palm selection.
NASA Astrophysics Data System (ADS)
Anderson, R. B.; Finch, N.; Clegg, S. M.; Graff, T. G.; Morris, R. V.; Laura, J.; Gaddis, L. R.
2017-12-01
Machine learning is a powerful but underutilized approach that can enable planetary scientists to derive meaningful results from the rapidly-growing quantity of available spectral data. For example, regression methods such as Partial Least Squares (PLS) and Least Absolute Shrinkage and Selection Operator (LASSO), can be used to determine chemical concentrations from ChemCam and SuperCam Laser-Induced Breakdown Spectroscopy (LIBS) data [1]. Many scientists are interested in testing different spectral data processing and machine learning methods, but few have the time or expertise to write their own software to do so. We are therefore developing a free open-source library of software called the Python Spectral Analysis Tool (PySAT) along with a flexible, user-friendly graphical interface to enable scientists to process and analyze point spectral data without requiring significant programming or machine-learning expertise. A related but separately-funded effort is working to develop a graphical interface for orbital data [2]. The PySAT point-spectra tool includes common preprocessing steps (e.g. interpolation, normalization, masking, continuum removal, dimensionality reduction), plotting capabilities, and capabilities to prepare data for machine learning such as creating stratified folds for cross validation, defining training and test sets, and applying calibration transfer so that data collected on different instruments or under different conditions can be used together. The tool leverages the scikit-learn library [3] to enable users to train and compare the results from a variety of multivariate regression methods. It also includes the ability to combine multiple "sub-models" into an overall model, a method that has been shown to improve results and is currently used for ChemCam data [4]. Although development of the PySAT point-spectra tool has focused primarily on the analysis of LIBS spectra, the relevant steps and methods are applicable to any spectral data. 
The tool is available at https://github.com/USGS-Astrogeology/PySAT_Point_Spectra_GUI. [1] Clegg, S.M., et al. (2017) Spectrochim Acta B. 129, 64-85. [2] Gaddis, L. et al. (2017) 3rd Planetary Data Workshop, #1986. [3] http://scikit-learn.org/ [4] Anderson, R.B., et al. (2017) Spectrochim. Acta B. 129, 49-57.
A Hybrid Supervised/Unsupervised Machine Learning Approach to Solar Flare Prediction
NASA Astrophysics Data System (ADS)
Benvenuto, Federico; Piana, Michele; Campi, Cristina; Massone, Anna Maria
2018-01-01
This paper introduces a novel method for flare forecasting, combining prediction accuracy with the ability to identify the most relevant predictive variables. This result is obtained by means of a two-step approach: first, a supervised regularization method for regression, namely, LASSO is applied, where a sparsity-enhancing penalty term allows the identification of the significance with which each data feature contributes to the prediction; then, an unsupervised fuzzy clustering technique for classification, namely, Fuzzy C-Means, is applied, where the regression outcome is partitioned through the minimization of a cost function and without focusing on the optimization of a specific skill score. This approach is therefore hybrid, since it combines supervised and unsupervised learning; realizes classification in an automatic, skill-score-independent way; and provides effective prediction performances even in the case of imbalanced data sets. Its prediction power is verified against NOAA Space Weather Prediction Center data, using as a test set, data in the range between 1996 August and 2010 December and as training set, data in the range between 1988 December and 1996 June. To validate the method, we computed several skill scores typically utilized in flare prediction and compared the values provided by the hybrid approach with the ones provided by several standard (non-hybrid) machine learning methods. The results showed that the hybrid approach performs classification better than all other supervised methods and with an effectiveness comparable to the one of clustering methods; but, in addition, it provides a reliable ranking of the weights with which the data properties contribute to the forecast.
L2-Boosting algorithm applied to high-dimensional problems in genomic selection.
González-Recio, Oscar; Weigel, Kent A; Gianola, Daniel; Naya, Hugo; Rosa, Guilherme J M
2010-06-01
The L(2)-Boosting algorithm is one of the most promising machine-learning techniques that has appeared in recent decades. It may be applied to high-dimensional problems such as whole-genome studies, and it is relatively simple from a computational point of view. In this study, we used this algorithm in a genomic selection context to make predictions of yet to be observed outcomes. Two data sets were used: (1) productive lifetime predicted transmitting abilities from 4702 Holstein sires genotyped for 32 611 single nucleotide polymorphisms (SNPs) derived from the Illumina BovineSNP50 BeadChip, and (2) progeny averages of food conversion rate, pre-corrected by environmental and mate effects, in 394 broilers genotyped for 3481 SNPs. Each of these data sets was split into training and testing sets, the latter comprising dairy or broiler sires whose ancestors were in the training set. Two weak learners, ordinary least squares (OLS) and non-parametric (NP) regression were used for the L2-Boosting algorithm, to provide a stringent evaluation of the procedure. This algorithm was compared with BL [Bayesian LASSO (least absolute shrinkage and selection operator)] and BayesA regression. Learning tasks were carried out in the training set, whereas validation of the models was performed in the testing set. Pearson correlations between predicted and observed responses in the dairy cattle (broiler) data set were 0.65 (0.33), 0.53 (0.37), 0.66 (0.26) and 0.63 (0.27) for OLS-Boosting, NP-Boosting, BL and BayesA, respectively. The smallest bias and mean-squared errors (MSEs) were obtained with OLS-Boosting in both the dairy cattle (0.08 and 1.08, respectively) and broiler (-0.011 and 0.006) data sets, respectively. In the dairy cattle data set, the BL was more accurate (bias=0.10 and MSE=1.10) than BayesA (bias=1.26 and MSE=2.81), whereas no differences between these two methods were found in the broiler data set. 
L2-Boosting with a suitable learner was found to be a competitive alternative for genomic selection applications, providing high accuracy and low bias in genomic-assisted evaluations with a relatively short computational time.
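To make the boosting idea in the abstract above concrete, here is a minimal sketch of componentwise L2-Boosting with a single-predictor OLS weak learner. It is an assumption that the weak learner is applied one predictor at a time; the function name, the shrinkage value, and the requirement that the columns of `X` are centered (so a no-intercept fit per predictor is adequate) are also illustrative choices, not details taken from the study.

```python
def l2_boost(X, y, n_steps=100, shrinkage=0.1):
    """Componentwise L2-Boosting sketch; X is a list of rows with centered columns."""
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coefs = [0.0] * p
    residuals = [yi - intercept for yi in y]
    columns = [[row[j] for row in X] for j in range(p)]
    for _ in range(n_steps):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j, col in enumerate(columns):
            sxx = sum(v * v for v in col)
            if sxx == 0.0:
                continue  # constant (all-zero after centering) predictor
            sxr = sum(v * r for v, r in zip(col, residuals))
            gain = sxr * sxr / sxx  # drop in residual sum of squares for predictor j
            if gain > best_gain:
                best_j, best_beta, best_gain = j, sxr / sxx, gain
        if best_j is None:
            break  # no predictor reduces the residual sum of squares
        # Take a shrunken step along the best single-predictor OLS fit.
        coefs[best_j] += shrinkage * best_beta
        residuals = [r - shrinkage * best_beta * v
                     for r, v in zip(residuals, columns[best_j])]
    return intercept, coefs
```

Stopping early (a small `n_steps`) acts as the regularizer here; in practice the number of steps would be chosen on held-out data.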
The Ins & Outs of Developing a Field-Based Science Project: Learning by Lassoing Lizards
ERIC Educational Resources Information Center
Matthews, Catherine E.; Huffling, Lacey D.; Benavides, Aerin
2014-01-01
We describe a field-based lizard project we did with high school students as a part of our summer Herpetological Research Experiences. We describe data collection on lizards captured, identified, and marked as a part of our mark-recapture study. We also describe other lizard projects that are ongoing in the United States and provide resources for…
Processing ARM VAP data on an AWS cluster
NASA Astrophysics Data System (ADS)
Martin, T.; Macduff, M.; Shippert, T.
2017-12-01
The Atmospheric Radiation Measurement (ARM) Data Management Facility (DMF) manages over 18,000 processes and 1.3 TB of data each day. This includes many Value Added Products (VAPs) that make use of multiple instruments to produce the derived products that are scientifically relevant. A thermodynamic and cloud profile VAP is being developed to provide input to the ARM Large-eddy simulation (LES) ARM Symbiotic Simulation and Observation (LASSO) project (https://www.arm.gov/capabilities/vaps/lasso-122). This algorithm is CPU intensive and its processing requirements exceeded the available DMF computing capacity. Amazon Web Services (AWS) along with CfnCluster was investigated to see how it would perform. This cluster environment is cost effective and scales dynamically based on demand. We were able to take advantage of autoscaling, which allowed the cluster to grow and shrink based on the size of the processing queue. We were also able to take advantage of the AWS spot market to further reduce the cost. Our test was successful and showed that cloud resources can be used to efficiently and effectively process time series data. This poster will present the resources and methodology used to successfully run the algorithm.
Prodinger, Birgit; Cieza, Alarcos; Oberhauser, Cornelia; Bickenbach, Jerome; Üstün, Tevfik Bedirhan; Chatterji, Somnath; Stucki, Gerold
2016-06-01
Objective: To develop a comprehensive set of the International Classification of Functioning, Disability and Health (ICF) categories as a minimal standard for reporting and assessing functioning and disability in clinical populations along the continuum of care. The specific aims were to specify the domains of functioning recommended for an ICF Rehabilitation Set and to identify a minimal set of environmental factors (EFs) to be used alongside the ICF Rehabilitation Set when describing disability across individuals and populations with various health conditions. Design: Secondary analysis of existing data sets using regression methods (Random Forests and Group Lasso regression) and expert consultations. Setting: Along the continuum of care, including acute, early postacute, and long-term and community rehabilitation settings. Participants: Persons (N=9863) with various health conditions participated in primary studies. The number of respondents for whom the dependent variable data were available and used in this analysis was 9264. Interventions: Not applicable. Main Outcome Measures: For regression analyses, self-reported general health was used as a dependent variable. The ICF categories from the functioning component and the EF component were used as independent variables for the development of the ICF Rehabilitation Set and the minimal set of EFs, respectively. Results: Thirty ICF categories to be complemented with 12 EFs were identified as relevant to the identified ICF sets. The ICF Rehabilitation Set consists of 9 ICF categories from the component body functions and 21 from the component activities and participation. The minimal set of EFs contains 12 categories spanning all chapters of the EF component of the ICF. Conclusions: The identified sets proposed serve as minimal generic sets of aspects of functioning in clinical populations for reporting data within and across health conditions, time, clinical settings including rehabilitation, and countries. 
These sets present a reference framework for harmonizing existing information on disability across general and clinical populations. Copyright © 2016 American Congress of Rehabilitation Medicine. Published by Elsevier Inc. All rights reserved.
Efficient SRAM yield optimization with mixture surrogate modeling
NASA Astrophysics Data System (ADS)
Zhongjian, Jiang; Zuochang, Ye; Yan, Wang
2016-12-01
Highly repeated cells such as SRAM cells usually require an extremely low failure rate to ensure a moderate chip yield. Though fast Monte Carlo methods such as importance sampling and its variants can be used for yield estimation, they are still very expensive if one needs to perform optimization based on such estimations. Typically, yield calculation requires a large number of SPICE simulations, which account for the largest proportion of the time spent in the yield calculation. In this paper, a new method is proposed to address this issue. The key idea is to establish an efficient mixture surrogate model based on the design variables and process variables. To construct the model, a certain number of sample points are obtained by SPICE simulation, and the mixture surrogate model is trained on these points with the lasso algorithm. Experimental results show that the proposed model calculates the yield accurately and brings significant speed-ups to the calculation of the failure rate. Based on the model, we developed a further accelerated algorithm to enhance the speed of the yield calculation. It is suitable for high-dimensional process variables and multi-performance applications.
SLOPE—ADAPTIVE VARIABLE SELECTION VIA CONVEX OPTIMIZATION
Bogdan, Małgorzata; van den Berg, Ewout; Sabatti, Chiara; Su, Weijie; Candès, Emmanuel J.
2015-01-01
We introduce a new estimator for the vector of coefficients β in the linear model y = Xβ + z, where X has dimensions n × p with p possibly larger than n. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to $\min_{b \in \mathbb{R}^p} \frac{1}{2}\|y - Xb\|_{\ell_2}^2 + \lambda_1 |b|_{(1)} + \lambda_2 |b|_{(2)} + \cdots + \lambda_p |b|_{(p)}$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ and $|b|_{(1)} \ge |b|_{(2)} \ge \cdots \ge |b|_{(p)}$ are the decreasing absolute values of the entries of b. This is a convex program and we demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical ℓ1 procedures such as the Lasso. Here, the regularizer is a sorted ℓ1 norm, which penalizes the regression coefficients according to their rank: the higher the rank (that is, the stronger the signal), the larger the penalty. This is similar to the Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289–300] procedure (BH) which compares more significant p-values with more stringent thresholds. One notable choice of the sequence $\{\lambda_i\}$ is given by the BH critical values $\lambda_{BH}(i) = z(1 - i \cdot q/(2p))$, where q ∈ (0, 1) and z(α) is the αth quantile of a standard normal distribution. SLOPE aims to provide finite sample guarantees on the selected model; of special interest is the false discovery rate (FDR), defined as the expected proportion of irrelevant regressors among all selected predictors. Under orthogonal designs, SLOPE with $\lambda_{BH}$ provably controls FDR at level q. Moreover, it also appears to have appreciable inferential properties under more general designs X while having substantial power, as demonstrated in a series of experiments running on both simulated and real data. PMID:26709357
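The sorted ℓ1 penalty and the BH critical values in the abstract above can be written out directly. The sketch below only evaluates the SLOPE objective for a given coefficient vector; it does not implement the authors' solution algorithm. The function names and the list-of-rows representation of X are illustrative assumptions, and the normal quantile comes from Python's standard `statistics.NormalDist`.

```python
from statistics import NormalDist


def bh_lambdas(p, q):
    """BH-style sequence lambda_BH(i) = z(1 - i*q/(2p)) for i = 1..p, 0 < q < 1."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(1 - i * q / (2 * p)) for i in range(1, p + 1)]


def sorted_l1_penalty(b, lambdas):
    """Sum of lambda_i * |b|_(i), pairing the largest |b| with the largest lambda."""
    magnitudes = sorted((abs(x) for x in b), reverse=True)
    return sum(lam * m for lam, m in zip(lambdas, magnitudes))


def slope_objective(y, X, b, lambdas):
    """0.5 * ||y - Xb||_2^2 plus the sorted-L1 penalty (X is a list of rows)."""
    residuals = [yi - sum(xij * bj for xij, bj in zip(row, b))
                 for yi, row in zip(y, X)]
    return 0.5 * sum(r * r for r in residuals) + sorted_l1_penalty(b, lambdas)
```

Because the quantile argument decreases with i, `bh_lambdas` returns a non-increasing sequence, matching the ordering constraint on the λ values; setting all λ values equal recovers the ordinary Lasso penalty.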
Targeting TWIST1 through loss of function inhibits tumorigenicity of human glioblastoma.
Mikheev, Andrei M; Mikheeva, Svetlana A; Severs, Liza J; Funk, Cory C; Huang, Lei; McFaline-Figueroa, José L; Schwensen, Jeanette; Trapnell, Cole; Price, Nathan D; Wong, Stephen; Rostomily, Robert C
2018-05-13
Twist1 (TW) is a bHLH transcription factor (TF) and master regulator of the epithelial to mesenchymal transition (EMT). In vitro, TW promotes mesenchymal change, invasion and self-renewal in glioblastoma (GBM) cells. However, the potential therapeutic relevance of TW has not been established through loss of function studies in human GBM cell xenograft models. The effects of TW loss of function (gene editing and knockdown) on inhibition of tumorigenicity of U87MG and GBM4 glioma stem cells were tested in orthotopic xenograft models and by conditional knockdown in established flank xenograft tumors. RNAseq and the analysis of tumors investigated putative TW-associated mechanisms. Multiple bioinformatics tools revealed significant alteration of ECM, membrane receptors, signaling transduction kinases and cytoskeleton dynamics, leading to identification of PI3K/AKT signaling. We experimentally show alteration of AKT activity and periostin (POSTN) expression in vivo and/or in vitro. For the first time, we show that TW knockout inhibits AKT activity in U87MG cells in vivo independent of PTEN mutation. The clinical relevance of TW and candidate mechanisms was established by analysis of the TCGA and ENCODE databases. TW expression was associated with decreased patient survival, and LASSO regression analysis identified POSTN as one of the top targets of TW in human GBM. While we previously demonstrated the role of TW in promoting EMT and invasion of glioma cells, these studies provide direct experimental evidence supporting a pro-tumorigenic role of TW independent of invasion in vivo and the therapeutic relevance of targeting TW in human GBM. Further, the role of TW in driving POSTN expression and AKT signaling suggests actionable targets, which could be leveraged to mitigate the oncogenic effects of TW in GBM. Molecular Oncology (2018) © 2018 The Authors. Published by FEBS Press and John Wiley & Sons Ltd.
Muenchhoff, Maximilian; Healy, Michael; Singh, Ravesh; Roider, Julia; Groll, Andreas; Kindra, Chirjeev; Sibaya, Thobekile; Moonsamy, Angeline; McGregor, Callum; Phan, Michelle Q; Palma, Alejandro; Kloverpris, Henrik; Leslie, Alasdair; Bobat, Raziya; LaRussa, Philip; Ndung'u, Thumbi; Goulder, Philip; Sobieszczyk, Magdalena E; Archary, Mohendran
2018-01-01
This observational study aimed to describe immunopathogenesis and treatment outcomes in children with and without severe acute malnutrition (SAM) and HIV-infection. We studied markers of microbial translocation (16sDNA), intestinal damage (iFABP), monocyte activation (sCD14), T-cell activation (CD38, HLA-DR) and immune exhaustion (PD1) in 32 HIV-infected children with and 41 HIV-infected children without SAM prior to initiation of antiretroviral therapy (ART) and cross-sectionally compared these children to 15 HIV-uninfected children with and 19 HIV-uninfected children without SAM. We then prospectively measured these markers and correlated them to treatment outcomes in the HIV-infected children at 48 weeks following initiation of ART. Plasma levels of 16sDNA, iFABP and sCD14 were measured by quantitative real time PCR, ELISA and Luminex, respectively. T cell phenotype markers were measured by flow cytometry. Multiple regression analysis was performed using generalized linear models (GLMs) and the least absolute shrinkage and selection operator (LASSO) approach for variable selection. Microbial translocation, T cell activation and exhaustion were increased in HIV-uninfected children with SAM compared to HIV-uninfected children without SAM. In HIV-infected children microbial translocation, immune activation, and exhaustion was strongly increased but did not differ by SAM-status. SAM was associated with increased mortality rates early after ART initiation. Malnutrition, age, microbial translocation, monocyte, and CD8 T cell activation were independently associated with decreased rates of CD4% immune recovery after 48 weeks of ART. SAM is associated with increased microbial translocation, immune activation, and immune exhaustion in HIV-uninfected children and with worse prognosis and impaired immune recovery in HIV-infected children on ART.
Blood Lead Levels and Associated Factors among Children in Guiyu of China: A Population-Based Study
Guo, Pi; Xu, Xijin; Huang, Binliang; Sun, Di; Zhang, Jian; Chen, Xiaojuan; Zhang, Qin
2014-01-01
Objectives: Children's health problems caused by electronic waste (e-waste) lead exposure in China remain. This study aimed to assess children's blood lead levels (BLLs) in Guiyu, China, and to investigate risk factors for elevated BLLs among children in Guiyu. Material and Methods: 842 children under 11 years of age from Guiyu and Haojiang were enrolled in this population-based study during 2011–2013. Participants completed a lifestyle and residential environment questionnaire, their physical growth indices were measured, and blood samples were taken. Blood samples were tested to assess BLLs. Children's BLLs were compared between the two groups, and factors associated with elevated BLLs among Guiyu children were analyzed with a group Lasso logistic regression model. Results: Children living in Guiyu had significantly higher BLLs (7.06 µg/dL) than Haojiang children (5.89 µg/dL) (P<0.05). Subgroup analyses of BLLs exceeding 10 µg/dL showed that the proportion of high-level BLLs among Guiyu children (24.80%) was greater than that in Haojiang (12.84%) (P<0.05). Boys had greater BLLs than girls, irrespective of area (P<0.05). The number of e-waste piles or recycling workshops around the house (odds ratio, 2.28; 95% confidence interval [CI], 1.37 to 3.87) significantly contributed to the elevated BLLs of children in Guiyu, and girls had a lower risk (odds ratio, 0.51; 95% CI, 0.31 to 0.83) of e-waste lead exposure than boys. Conclusions: This analysis reinforces the importance of shifting e-waste recycling piles or workshops to non-populated areas as part of a comprehensive response to e-waste lead exposure control in Guiyu. Correcting the problem of lead poisoning in children in Guiyu should be a long-term mission. PMID:25136795
Rozanova, Julia; Morozova, Olga; Azbel, Lyuba; Bachireddy, Chethan; Izenberg, Jacob M; Kiriazova, Tetiana; Dvoryak, Sergiy; Altice, Frederick L
2018-05-04
Facing competing demands with limited resources following release from prison, people who inject drugs (PWID) may neglect health needs, with grave implications including relapse, overdose, and non-continuous care. We examined the relative importance of health-related tasks after release compared to tasks of everyday life among a total sample of 577 drug users incarcerated in Ukraine, Azerbaijan, and Kyrgyzstan. A proxy measure of whether participants identified a task as applicable (easy or hard) versus not applicable was used to determine the importance of each task. Correlates of the importance of health-related reentry tasks were analyzed using logistic regression, with a parsimonious model being derived using Bayesian lasso method. Despite all participants having substance use disorders and high prevalence of comorbidities, participants in all three countries prioritized finding a source of income, reconnecting with family, and staying out of prison over receiving treatment for substance use disorders, general health conditions, and initiating methadone treatment. Participants with poorer general health were more likely to prioritize treatment for substance use disorders. While prior drug injection and opioid agonist treatment (OAT) correlated with any interest in methadone in all countries, only in Ukraine did a small number of participants prioritize getting methadone as the most important post-release task. While community-based OAT is available in all three countries and prison-based OAT only in Kyrgyzstan, Kyrgyz prisoners were less likely to choose help staying off drugs and getting methadone. Overall, prisoners consider methadone treatment inapplicable to their pre-release planning. Future studies that involve patient decision-making and scale-up of OAT within prison settings are needed to better improve individual and public health.
Chen, Hongda; Zucknick, Manuela; Werner, Simone; Knebel, Phillip; Brenner, Hermann
2015-07-15
Novel noninvasive blood-based screening tests are strongly desirable for early detection of colorectal cancer. We aimed to conduct a head-to-head comparison of the diagnostic performance of 92 plasma-based tumor-associated protein biomarkers for early detection of colorectal cancer in a true screening setting. Among all available 35 carriers of colorectal cancer and a representative sample of 54 men and women free of colorectal neoplasms recruited in a cohort of screening colonoscopy participants in 2005-2012 (N = 5,516), the plasma levels of 92 protein biomarkers were measured. ROC analyses were conducted to evaluate the diagnostic performance. A multimarker algorithm was developed through the Lasso logistic regression model and validated in an independent validation set. The .632+ bootstrap method was used to adjust for the potential overestimation of diagnostic performance. Seventeen protein markers were identified to show statistically significant differences in plasma levels between colorectal cancer cases and controls. The adjusted area under the ROC curves (AUC) of these 17 individual markers ranged from 0.55 to 0.70. An eight-marker classifier was constructed that increased the adjusted AUC to 0.77 [95% confidence interval (CI), 0.59-0.91]. When validating this algorithm in an independent validation set, the AUC was 0.76 (95% CI, 0.65-0.85), and sensitivities at cutoff levels yielding 80% and 90% specificities were 65% (95% CI, 41-80%) and 44% (95% CI, 24-72%), respectively. The identified profile of protein biomarkers could contribute to the development of a powerful multimarker blood-based test for early detection of colorectal cancer. ©2015 American Association for Cancer Research.
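The .632+ bootstrap adjustment mentioned in the abstract above combines an optimistic apparent estimate with a pessimistic out-of-bootstrap estimate. The sketch below follows one common formulation of the Efron and Tibshirani estimator, written in terms of an error rate; how the study adapted it to the AUC is not stated in the abstract, so the function name, argument names, and the use of an error rate are assumptions made for illustration.

```python
def bootstrap_632_plus(apparent_err, oob_err, no_info_err):
    """.632+ estimate from apparent, out-of-bootstrap, and no-information error."""
    # The out-of-bootstrap error is capped at the no-information error rate.
    oob_err = min(oob_err, no_info_err)
    if oob_err > apparent_err and no_info_err > apparent_err:
        # Relative overfitting rate, between 0 (none) and 1 (maximal).
        overfit = (oob_err - apparent_err) / (no_info_err - apparent_err)
    else:
        overfit = 0.0
    # The weight grows from 0.632 (no overfitting) towards 1 (heavy overfitting).
    weight = 0.632 / (1.0 - 0.368 * overfit)
    return (1.0 - weight) * apparent_err + weight * oob_err
```

When the apparent and out-of-bootstrap errors agree, the function returns that common value; as overfitting increases, the estimate moves towards the out-of-bootstrap error.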
Sihong Chen; Jing Qin; Xing Ji; Baiying Lei; Tianfu Wang; Dong Ni; Jie-Zhi Cheng
2017-03-01
The gap between the computational and semantic features is one of the major factors that bottleneck computer-aided diagnosis (CAD) performance in clinical usage. To bridge this gap, we exploit three multi-task learning (MTL) schemes to leverage heterogeneous computational features derived from deep learning models of stacked denoising autoencoder (SDAE) and convolutional neural network (CNN), as well as hand-crafted Haar-like and HoG features, for the description of 9 semantic features for lung nodules in CT images. We posit that there may exist relations among the semantic features of "spiculation", "texture", "margin", etc., that can be explored with the MTL. The Lung Image Database Consortium (LIDC) data is adopted in this study for the rich annotation resources. The LIDC nodules were quantitatively scored w.r.t. 9 semantic features by 12 radiologists from several institutes in the U.S.A. By treating each semantic feature as an individual task, the MTL schemes select and map the heterogeneous computational features toward the radiologists' ratings with cross validation evaluation schemes on the randomly selected 2400 nodules from the LIDC dataset. The experimental results suggest that the predicted semantic scores from the three MTL schemes are closer to the radiologists' ratings than the scores from single-task LASSO and elastic net regression methods. The proposed semantic attribute scoring scheme may provide richer quantitative assessments of nodules for better support of diagnostic decision and management. Meanwhile, the capability of the automatic association of medical image contents with the clinical semantic terms by our method may also assist the development of medical search engines.
Healy, Michael; Singh, Ravesh; Roider, Julia; Groll, Andreas; Kindra, Chirjeev; Sibaya, Thobekile; Moonsamy, Angeline; McGregor, Callum; Phan, Michelle Q.; Palma, Alejandro; Kloverpris, Henrik; Leslie, Alasdair; Bobat, Raziya; LaRussa, Philip; Ndung'u, Thumbi; Goulder, Philip; Sobieszczyk, Magdalena E.; Archary, Mohendran
2018-01-01
Abstract This observational study aimed to describe immunopathogenesis and treatment outcomes in children with and without severe acute malnutrition (SAM) and HIV-infection. We studied markers of microbial translocation (16sDNA), intestinal damage (iFABP), monocyte activation (sCD14), T-cell activation (CD38, HLA-DR) and immune exhaustion (PD1) in 32 HIV-infected children with and 41 HIV-infected children without SAM prior to initiation of antiretroviral therapy (ART) and cross-sectionally compared these children to 15 HIV-uninfected children with and 19 HIV-uninfected children without SAM. We then prospectively measured these markers and correlated them to treatment outcomes in the HIV-infected children at 48 weeks following initiation of ART. Plasma levels of 16sDNA, iFABP and sCD14 were measured by quantitative real time PCR, ELISA and Luminex, respectively. T cell phenotype markers were measured by flow cytometry. Multiple regression analysis was performed using generalized linear models (GLMs) and the least absolute shrinkage and selection operator (LASSO) approach for variable selection. Microbial translocation, T cell activation and exhaustion were increased in HIV-uninfected children with SAM compared to HIV-uninfected children without SAM. In HIV-infected children microbial translocation, immune activation, and exhaustion was strongly increased but did not differ by SAM-status. SAM was associated with increased mortality rates early after ART initiation. Malnutrition, age, microbial translocation, monocyte, and CD8 T cell activation were independently associated with decreased rates of CD4% immune recovery after 48 weeks of ART. SAM is associated with increased microbial translocation, immune activation, and immune exhaustion in HIV-uninfected children and with worse prognosis and impaired immune recovery in HIV-infected children on ART. PMID:28670966
Lv, Yufeng; Wei, Wenhao; Huang, Zhong; Chen, Zhichao; Fang, Yuan; Pan, Lili; Han, Xueqiong; Xu, Zihai
2018-06-20
The aim of this study was to develop a novel long non-coding RNA (lncRNA) expression signature to accurately predict early recurrence for patients with hepatocellular carcinoma (HCC) after curative resection. Using expression profiles downloaded from The Cancer Genome Atlas database, we identified multiple lncRNAs with differential expression between the early recurrence (ER) group and the non-early recurrence (non-ER) group of HCC. Least absolute shrinkage and selection operator (LASSO) logistic regression models were used to develop a lncRNA-based classifier for predicting ER in the training set. An independent test set was used to validate the predictive value of this classifier. Furthermore, a co-expression network based on these lncRNAs and their highly related genes was constructed, and Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses of genes in the network were performed. We identified 10 differentially expressed lncRNAs, including 3 that were upregulated and 7 that were downregulated in the ER group. The lncRNA-based classifier was constructed based on 7 lncRNAs (AL035661.1, PART1, AC011632.1, AC109588.1, AL365361.1, LINC00861 and LINC02084), and its accuracy was 0.83 in the training set, 0.87 in the test set and 0.84 in the total set. ROC curve analysis showed the AUROC was 0.741 in the training set, 0.824 in the test set and 0.765 in the total set. A functional enrichment analysis suggested that the genes highly related to 4 of the lncRNAs were involved in the immune system. This 7-lncRNA expression profile can effectively predict early recurrence after surgical resection for HCC. This article is protected by copyright. All rights reserved.
Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data.
Becker, Natalia; Toedt, Grischa; Lichter, Peter; Benner, Axel
2011-05-09
Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds rapidly and more precisely a global optimal solution. Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often better predicted than Elastic Net in terms of misclassification error. Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations. The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. 
We were first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions on the optimization of tuning parameters. The penalized SVM classification algorithms as well as fixed grid and interval search for finding appropriate tuning parameters were implemented in our freely available R package 'penalizedSVM'. We conclude that the Elastic SCAD SVM is a flexible and robust tool for classification and feature selection tasks for high-dimensional data such as microarray data sets.
Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data
2011-01-01
Background Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds rapidly and more precisely a global optimal solution. Results Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often better predicted than Elastic Net in terms of misclassification error. Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations. Conclusions The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. 
We were first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions on the optimization of tuning parameters. The penalized SVM classification algorithms as well as fixed grid and interval search for finding appropriate tuning parameters were implemented in our freely available R package 'penalizedSVM'. We conclude that the Elastic SCAD SVM is a flexible and robust tool for classification and feature selection tasks for high-dimensional data such as microarray data sets. PMID:21554689
Acute toxicity of selected herbicides and surfactants to larvae of the midge Chironomus riparius
Buhl, Kevin J.; Faerber, Neil L.
1989-01-01
The acute toxicities of eight commercial herbicides and two surfactants to early fourth instar larvae of the midge Chironomus riparius were determined under static conditions. The formulated herbicides tested were Eradicane® (EPTC), Fargo® (triallate), Lasso® (alachlor), ME4 Brominal® (bromoxynil), Ramrod® (propachlor), Rodeo® (glyphosate), Sencor® (metribuzin), and Sutan (+)® (butylate); the two surfactants were Activator N.F.® and Ortho X-77®. In addition, technical grade alachlor, metribuzin, propachlor, and triallate were tested for comparison with the formulated herbicides. The relative toxicity of the commercial formulations, based on percent active ingredient, varied considerably. The EC50 values ranged from 1.23 mg/L for Fargo® to 5,600 mg/L for Rodeo®. Fargo®, ME4 Brominal®, and Ramrod® were moderately toxic to midge larvae; Lasso®, Sutan (+)®, and Eradicane® were slightly toxic; and Sencor® and Rodeo® were practically non-toxic. The 48-hr EC50 values of the two surfactants were nearly identical and were considered moderately toxic to midges. For two of the herbicides in which the technical grade material was tested, the inert ingredients in the formulations had a significant effect on the toxicity of the active ingredients. Fargo® was twice as toxic as technical grade triallate, whereas Sencor® was considerably less toxic than technical grade metribuzin. A comparison of the slope function values indicated that the toxic action of all the compounds occurred within a relatively narrow range. Published acute toxicity data on these compounds for other freshwater biota were tabulated and compared with our results. In general, the relative order of toxicity to C. riparius was similar to those for other freshwater invertebrates and fish. Maximum concentrations of each herbicide in bulk runoff during a projected “critical” runoff event were calculated as a percentage of the application rate lost in a given volume of runoff. 
A comparison between estimated maximum herbicide concentrations in runoff and results of acute tests indicated that Ramrod®, ME4 Brominal®, and Lasso® pose the greatest direct risk to midge larvae during a storm event.
Sherer, Eric A; Sale, Mark E; Pollock, Bruce G; Belani, Chandra P; Egorin, Merrill J; Ivy, Percy S; Lieberman, Jeffrey A; Manuck, Stephen B; Marder, Stephen R; Muldoon, Matthew F; Scher, Howard I; Solit, David B; Bies, Robert R
2012-08-01
A limitation in traditional stepwise population pharmacokinetic model building is the difficulty in handling interactions between model components. To address this issue, a method was previously introduced which couples NONMEM parameter estimation and model fitness evaluation to a single-objective, hybrid genetic algorithm for global optimization of the model structure. In this study, the generalizability of this approach for pharmacokinetic model building is evaluated by comparing (1) correct and spurious covariate relationships in a simulated dataset resulting from automated stepwise covariate modeling, Lasso methods, and single-objective hybrid genetic algorithm approaches to covariate identification and (2) information criteria values, model structures, convergence, and model parameter values resulting from manual stepwise versus single-objective, hybrid genetic algorithm approaches to model building for seven compounds. Both manual stepwise and single-objective, hybrid genetic algorithm approaches to model building were applied, blinded to the results of the other approach, for selection of the compartment structure as well as inclusion and model form of inter-individual and inter-occasion variability, residual error, and covariates from a common set of model options. For the simulated dataset, stepwise covariate modeling identified three of four true covariates and two spurious covariates; Lasso identified two of four true and 0 spurious covariates; and the single-objective, hybrid genetic algorithm identified three of four true covariates and one spurious covariate. 
For the clinical datasets, the Akaike information criterion was a median of 22.3 points lower (range of 470.5 point decrease to 0.1 point decrease) for the best single-objective hybrid genetic-algorithm candidate model versus the final manual stepwise model: the Akaike information criterion was lower by greater than 10 points for four compounds and differed by less than 10 points for three compounds. The root mean squared error and absolute mean prediction error of the best single-objective hybrid genetic algorithm candidates were a median of 0.2 points higher (range of 38.9 point decrease to 27.3 point increase) and 0.02 points lower (range of 0.98 point decrease to 0.74 point increase), respectively, than that of the final stepwise models. In addition, the best single-objective, hybrid genetic algorithm candidate models had successful convergence and covariance steps for each compound, used the same compartment structure as the manual stepwise approach for 6 of 7 (86 %) compounds, and identified 54 % (7 of 13) of covariates included by the manual stepwise approach and 16 covariate relationships not included by manual stepwise models. The model parameter values between the final manual stepwise and best single-objective, hybrid genetic algorithm models differed by a median of 26.7 % (q₁ = 4.9 % and q₃ = 57.1 %). Finally, the single-objective, hybrid genetic algorithm approach was able to identify models capable of estimating absorption rate parameters for four compounds that the manual stepwise approach did not identify. The single-objective, hybrid genetic algorithm represents a general pharmacokinetic model building methodology whose ability to rapidly search the feasible solution space leads to nearly equivalent or superior model fits to pharmacokinetic data.
Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization
Liu, Jin; Huang, Jian; Ma, Shuangge
2013-01-01
Summary In cancer diagnosis studies, high-throughput gene profiling has been extensively conducted, searching for genes whose expressions may serve as markers. Data generated from such studies have the “large d, small n” feature, with the number of genes profiled much larger than the sample size. Penalization has been extensively adopted for simultaneous estimation and marker selection. Because of small sample sizes, markers identified from the analysis of single datasets can be unsatisfactory. A cost-effective remedy is to conduct integrative analysis of multiple heterogeneous datasets. In this article, we investigate composite penalization methods for estimation and marker selection in integrative analysis. The proposed methods use the minimax concave penalty (MCP) as the outer penalty. Under the homogeneity model, the ridge penalty is adopted as the inner penalty. Under the heterogeneity model, the Lasso penalty and MCP are adopted as the inner penalty. Effective computational algorithms based on coordinate descent are developed. Numerical studies, including simulation and analysis of practical cancer datasets, show satisfactory performance of the proposed methods. PMID:24578589
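Coordinate descent for penalties such as the Lasso and MCP reduces to a closed-form univariate update per coefficient. The sketch below shows those two standard thresholding operators for a single standardized predictor. It is an illustration under stated assumptions, not the composite (outer MCP plus inner ridge, Lasso or MCP) algorithm of the article; the names `soft_threshold`, `mcp_threshold`, `lam` and `gamma` are chosen here for illustration only.

```python
def soft_threshold(z, lam):
    """Lasso univariate update: shrink z toward zero by lam, clipping at zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_threshold(z, lam, gamma):
    """MCP univariate update for a standardized predictor (requires gamma > 1).

    Small inputs are soft-thresholded and then rescaled; inputs larger than
    gamma * lam are left unshrunk, which is how MCP reduces the bias of the
    Lasso on large coefficients.
    """
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1")
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z


# z plays the role of the unpenalized univariate least-squares solution.
for z in (0.5, 2.0, 4.0):
    print(z, soft_threshold(z, 1.0), mcp_threshold(z, 1.0, 3.0))
```

The two operators agree on which coefficients are set to zero; they differ only in how much the surviving coefficients are shrunk.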
A network perspective on comorbid depression in adolescents with obsessive-compulsive disorder.
Jones, Payton J; Mair, Patrick; Riemann, Bradley C; Mugno, Beth L; McNally, Richard J
2018-01-01
People with obsessive-compulsive disorder [OCD] frequently suffer from depression, a comorbidity associated with greater symptom severity and suicide risk. We examined the associations between OCD and depression symptoms in 87 adolescents with primary OCD. We computed an association network, a graphical LASSO, and a directed acyclic graph (DAG) to model symptom interactions. Models showed OCD and depression as separate syndromes linked by bridge symptoms. Bridges between the two disorders emerged between obsessional problems in the OCD syndrome, and guilt, concentration problems, and sadness in the depression syndrome. A directed network indicated that OCD symptoms directionally precede depression symptoms. Concentration impairment emerged as a highly central node that may be distinctive to adolescents. We conclude that the network approach to mental disorders provides a new way to understand the etiology and maintenance of comorbid OCD-depression. Network analysis can improve research and treatment of mental disorder comorbidities by generating hypotheses concerning potential causal symptom structures and by identifying symptoms that may bridge disorders. Copyright © 2017 Elsevier Ltd. All rights reserved.
Reward-related neural activity and structure predict future substance use in dysregulated youth.
Bertocci, M A; Bebko, G; Versace, A; Iyengar, S; Bonar, L; Forbes, E E; Almeida, J R C; Perlman, S B; Schirda, C; Travis, M J; Gill, M K; Diwadkar, V A; Sunshine, J L; Holland, S K; Kowatch, R A; Birmaher, B; Axelson, D A; Frazier, T W; Arnold, L E; Fristad, M A; Youngstrom, E A; Horwitz, S M; Findling, R L; Phillips, M L
2017-06-01
Identifying youth who may engage in future substance use could facilitate early identification of substance use disorder vulnerability. We aimed to identify biomarkers that predicted future substance use in psychiatrically un-well youth. LASSO regression for variable selection was used to predict substance use 24.3 months after neuroimaging assessment in 73 behaviorally and emotionally dysregulated youth aged 13.9 (s.d. = 2.0) years, 30 female, from three clinical sites in the Longitudinal Assessment of Manic Symptoms (LAMS) study. Predictor variables included neural activity during a reward task, cortical thickness, and clinical and demographic variables. Future substance use was associated with higher left middle prefrontal cortex activity, lower left ventral anterior insula activity, thicker caudal anterior cingulate cortex, higher depression and lower mania scores, not using antipsychotic medication, more parental stress, older age. This combination of variables explained 60.4% of the variance in future substance use, and accurately classified 83.6%. These variables explained a large proportion of the variance, were useful classifiers of future substance use, and showed the value of combining multiple domains to provide a comprehensive understanding of substance use development. This may be a step toward identifying neural measures that can identify future substance use disorder risk, and act as targets for therapeutic interventions.
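How LASSO regression performs variable selection can be seen in a small coordinate-descent sketch. The abstract does not say which implementation was used, so everything below (the function `lasso_cd`, the objective scaling 1/(2n), the toy design matrix) is an assumption for illustration rather than the study's pipeline.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows, y a list of responses; no intercept is fitted, so the
    data are assumed to be centred beforehand.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for the starting value beta = 0
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # an all-zero column carries no information
            # Average product of column j with the partial residual
            # (the residual with predictor j's current contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_ss[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy orthogonal design: the response depends strongly on column 0, weakly on column 1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]
beta = lasso_cd(X, y, lam=0.5)
print(beta)
```

In this toy example the weak predictor is removed from the model entirely while the strong one is kept with a shrunken coefficient, which is the selection behaviour the abstract relies on. In practice the penalty `lam` would be chosen by cross-validation.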
Clinical application of ICF key codes to evaluate patients with dysphagia following stroke
Dong, Yi; Zhang, Chang-Jie; Shi, Jie; Deng, Jinggui; Lan, Chun-Na
2016-01-01
Abstract This study was aimed to identify and evaluate the International Classification of Functioning (ICF) key codes for dysphagia in stroke patients. Thirty patients with dysphagia after stroke were enrolled in our study. To evaluate the ICF dysphagia scale, 6 scales were used as comparisons, namely the Barthel Index (BI), Repetitive Saliva Swallowing Test (RSST), Kubota Water Swallowing Test (KWST), Frenchay Dysarthria Assessment, Mini-Mental State Examination (MMSE), and the Montreal Cognitive Assessment (MoCA). Multiple regression analysis was performed to quantitate the relationship between the ICF scale and the other 7 scales. In addition, 60 ICF scales were analyzed by the least absolute shrinkage and selection operator (LASSO) method. A total of 21 ICF codes were identified, which were closely related with the other scales. These included 13 codes from Body Function, 1 from Body Structure, 3 from Activities and Participation, and 4 from Environmental Factors. A topographic network map with 30 ICF key codes was also generated to visualize their relationships. The number of ICF codes identified is in line with other well-established evaluation methods. The network topographic map generated here could be used as an instruction tool in future evaluations. We also found that attention functions and biting were critical codes of these scales, and could be used as treatment targets. PMID:27661012
Metabolomics biomarkers to predict acamprosate treatment response in alcohol-dependent subjects.
Hinton, David J; Vázquez, Marely Santiago; Geske, Jennifer R; Hitschfeld, Mario J; Ho, Ada M C; Karpyak, Victor M; Biernacka, Joanna M; Choi, Doo-Sup
2017-05-31
Precision medicine for alcohol use disorder (AUD) allows optimal treatment of the right patient with the right drug at the right time. Here, we generated multivariable models incorporating clinical information and serum metabolite levels to predict acamprosate treatment response. The sample of 120 patients was randomly split into a training set (n = 80) and test set (n = 40) five independent times. Treatment response was defined as complete abstinence (no alcohol consumption during 3 months of acamprosate treatment) while nonresponse was defined as any alcohol consumption during this period. In each of the five training sets, we built a predictive model using a least absolute shrinkage and selection operator (LASSO) penalized selection method and then evaluated the predictive performance of each model in the corresponding test set. The models predicted acamprosate treatment response with a mean sensitivity and specificity in the test sets of 0.83 and 0.31, respectively, suggesting our model performed well at predicting responders, but not non-responders (i.e. many non-responders were predicted to respond). Studies with larger sample sizes and additional biomarkers will expand the clinical utility of predictive algorithms for pharmaceutical response in AUD.
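The interpretation above (high sensitivity with low specificity means that many non-responders are predicted to respond) follows directly from how the two quantities are computed. A minimal sketch, using a hypothetical helper `sensitivity_specificity` and made-up labels rather than the study's data:

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1 = responder,
    0 = non-responder."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # responders predicted to respond
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # responders missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-responders correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-responders predicted to respond
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative labels only: 6 responders and 4 non-responders.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

A model that predicts "responder" for almost everyone produces this pattern: nearly all true responders are caught, but most non-responders are misclassified.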
Wardenaar, K J; van Loo, H M; Cai, T; Fava, M; Gruber, M J; Li, J; de Jonge, P; Nierenberg, A A; Petukhova, M V; Rose, S; Sampson, N A; Schoevers, R A; Wilcox, M A; Alonso, J; Bromet, E J; Bunting, B; Florescu, S E; Fukao, A; Gureje, O; Hu, C; Huang, Y Q; Karam, A N; Levinson, D; Medina Mora, M E; Posada-Villa, J; Scott, K M; Taib, N I; Viana, M C; Xavier, M; Zarkov, Z; Kessler, R C
2014-11-01
Although variation in the long-term course of major depressive disorder (MDD) is not strongly predicted by existing symptom subtype distinctions, recent research suggests that prediction can be improved by using machine learning methods. However, it is not known whether these distinctions can be refined by added information about co-morbid conditions. The current report presents results on this question. Data came from 8261 respondents with lifetime DSM-IV MDD in the World Health Organization (WHO) World Mental Health (WMH) Surveys. Outcomes included four retrospectively reported measures of persistence/severity of course (years in episode; years in chronic episodes; hospitalization for MDD; disability due to MDD). Machine learning methods (regression tree analysis; lasso, ridge and elastic net penalized regression) followed by k-means cluster analysis were used to augment previously detected subtypes with information about prior co-morbidity to predict these outcomes. Predicted values were strongly correlated across outcomes. Cluster analysis of predicted values found three clusters with consistently high, intermediate or low values. The high-risk cluster (32.4% of cases) accounted for 56.6-72.9% of high persistence, high chronicity, hospitalization and disability. This high-risk cluster had both higher sensitivity and likelihood ratio positive (LR+; relative proportions of cases in the high-risk cluster versus other clusters having the adverse outcomes) than in a parallel analysis that excluded measures of co-morbidity as predictors. Although the results using the retrospective data reported here suggest that useful MDD subtyping distinctions can be made with machine learning and clustering across multiple indicators of illness persistence/severity, replication with prospective data is needed to confirm this preliminary conclusion.
Model Selection in the Analysis of Photoproduction Data
NASA Astrophysics Data System (ADS)
Landay, Justin
2017-01-01
Scattering experiments provide one of the most powerful and useful tools for probing matter to better understand its fundamental properties governed by the strong interaction. As the spectroscopy of the excited states of nucleons enters a new era of precision ushered in by improved experiments at Jefferson Lab and other facilities around the world, traditional partial-wave analysis methods must be adjusted accordingly. In this poster, we present a rigorous set of statistical tools and techniques that we implemented; most notably, the LASSO method, which serves for the selection of the simplest model, allowing us to avoid overfitting. In the case of establishing the spectrum of excited baryons, it avoids overpopulation of the spectrum and thus the occurrence of false positives. This is a prerequisite to reliably compare theories like lattice QCD or quark models to experiments. Here, we demonstrate the principle by simultaneously fitting three observables in neutral pion photoproduction (the differential cross section, beam asymmetry and target polarization) across thousands of data points. Other authors include Michael Doring, Bin Hu, and Raquel Molina.
DL-ADR: a novel deep learning model for classifying genomic variants into adverse drug reactions.
Liang, Zhaohui; Huang, Jimmy Xiangji; Zeng, Xing; Zhang, Gang
2016-08-10
Genomic variations are associated with the metabolism and the occurrence of adverse reactions of many therapeutic agents. The polymorphisms on over 2000 locations of cytochrome P450 enzymes (CYP) due to many factors such as ethnicity, mutations, and inheritance contribute to the diversity of response and side effects of various drugs. The associations of the single nucleotide polymorphisms (SNPs), the internal pharmacokinetic patterns and the vulnerability of specific adverse reactions have become one of the research interests of pharmacogenomics. Conventional genome-wide association studies (GWAS) mainly focus on the relation of single or multiple SNPs to a specific risk factor, which is a one-to-many relation. However, there are no robust methods to establish a many-to-many network which can combine the direct and indirect associations between multiple SNPs and a series of events (e.g. adverse reactions, metabolic patterns, prognostic factors etc.). In this paper, we present a novel deep learning model based on generative stochastic networks and a hidden Markov chain to classify the observed samples with SNPs on five loci of two genes (CYP2D6 and CYP1A2) respectively to the vulnerable population of 14 types of adverse reactions. A supervised deep learning model is proposed in this study. A revised generative stochastic networks (GSN) model, with transitions driven by the hidden Markov chain, is used. The data of the training set are collected from clinical observation. The training set is composed of 83 observations of blood samples with the genotypes respectively on CYP2D6*2, *10, *14 and CYP1A2*1C, *1F. The samples are genotyped by the polymerase chain reaction (PCR) method. A hidden Markov chain is used as the transition operator to simulate the probabilistic distribution. The model can perform learning at lower cost compared to the conventional maximum likelihood method because the transition distribution is conditional on the previous state of the hidden Markov chain. 
A least square loss (LASSO) algorithm and a k-Nearest Neighbors (kNN) algorithm are used as the baselines for comparison and to evaluate the performance of our proposed deep learning model. There are 53 adverse reactions reported during the observation. They are assigned to 14 categories. In the comparison of classification accuracy, the deep learning model shows superiority over the LASSO and kNN model with a rate over 80 %. In the comparison of reliability, the deep learning model shows the best stability among the three models. Machine learning provides a new method to explore the complex associations among genomic variations and multiple events in pharmacogenomics studies. The new deep learning algorithm is capable of classifying various SNPs to the corresponding adverse reactions. We expect that as more genomic variations are added as features and more observations are made, the deep learning model can improve its performance and can act as a black-box but reliable verifier for other GWAS studies.
Techniques on semiautomatic segmentation using the Adobe Photoshop
NASA Astrophysics Data System (ADS)
Park, Jin Seo; Chung, Min Suk; Hwang, Sung Bae
2005-04-01
The purpose of this research is to enable anybody to semiautomatically segment the anatomical structures in MRIs, CTs, and other medical images on a personal computer. The segmented images are used for making three-dimensional images, which are helpful in medical education and research. To achieve this purpose, the following trials were performed. The entire body of a volunteer was MR scanned to make 557 MRIs, which were transferred to a personal computer. In Adobe Photoshop, contours of 19 anatomical structures in the MRIs were semiautomatically drawn using the MAGNETIC LASSO TOOL and then manually corrected using either the LASSO TOOL or the DIRECT SELECTION TOOL to make 557 segmented images. In a likewise manner, 11 anatomical structures in the 8,500 anatomical images were segmented. Also, 12 brain and 10 heart anatomical structures in anatomical images were segmented. Proper segmentation was verified by making and examining the coronal, sagittal, and three-dimensional images from the segmented images. During semiautomatic segmentation in Adobe Photoshop, a suitable algorithm could be used, the extent of automatization could be regulated, a convenient user interface could be used, and software bugs rarely occurred. The techniques of semiautomatic segmentation using Adobe Photoshop are expected to be widely used for segmentation of the anatomical structures in various medical images.
(T2L2) Time Transfer by Laser Link
NASA Technical Reports Server (NTRS)
Veillet, Christian; Fridelance, Patricia
1995-01-01
T2L2 (Time Transfer by Laser Link) is a new generation time transfer experiment based on the principles of LASSO (Laser Synchronization from Synchronous Orbit) and used with an operational procedure developed at OCA (Observatoire de la Cote d'Azur) during the active intercontinental phase of LASSO. The hardware improvements could lead to a precision better than 10 ps for time transfer (flying clock monitoring or ground based clock comparison). It has been developed in France in the frame of the PHARAO project (cooled atom clock in orbit) involving CNES and different laboratories, but T2L2 could fly on any spacecraft carrying a stable clock or oscillator. A GPS satellite would be a good candidate, as T2L2 could link the flying clock directly to ground clocks using light, enabling important accuracy checks for both time and geodesy. Radioastron (a flying VLBI antenna with an H-maser) is also envisioned, pending a PHARAO flight. The ultimate goal of T2L2 is to be part of more ambitious missions, such as SORT (Solar Orbit Relativity Test), which aims to examine aspects of gravitation in the vicinity of the Sun.
LOAD-ENHANCED MOVEMENT QUALITY SCREENING AND TACTICAL ATHLETICISM: AN EXTENSION OF EVIDENCE
Schmitz, Randy J.; Rhea, Christopher K.; Ross, Scott E.
2017-01-01
Background Military organizations use movement quality screening for prediction of injury risk and performance potential. Currently, evidence of an association between movement quality and performance is limited. Recent work has demonstrated that external loading strengthens the relationship between movement screens and performance outcomes. Such loading may therefore steer us toward robust implementations of movement quality screens while maintaining their appeal as cost effective, field-expedient tools. Purpose The purpose of the current study was to quantify the effect of external load-bearing on the relationship between clinically rated movement quality and tactical performance outcomes while addressing the noted limitations. Study Design Crossover Trial. Methods Fifty young adults (25 male, 25 female, 22.98 ± 3.09 years, 171.95 ± 11.46 cm, 71.77 ± 14.03 kg) completed the Functional Movement Screen™ with (FMS™W) and without (FMS™C) a weight vest in randomized order. Following FMS™ testing, criterion measures of tactical performance were administered, including agility T-Tests, sprints, a 400-meter run, the Mobility for Battle (MOB) course, and a simulated casualty rescue. For each performance outcome, regression models were selected via group lasso with smoothed FMS™ item scores as candidate predictor variables. Results For all outcomes, proportion of variance accounted for was greater in FMS™W (R² = 0.22 [T-Test], 0.29 [Sprint], 0.17 [400 meter], 0.29 [MOB], and 0.11 [casualty rescue]) than in FMS™C (R² = 0.00 [T-Test], 0.11 [Sprint], 0.00 [400 meter], 0.19 [MOB], and 0.00 [casualty rescue]). From the FMS™W condition, beneficial performance effects (p<0.05) were observed for Deep Squat (sprint, casualty rescue), Hurdle Step (T-Agility, 400 meter run), Inline Lunge (sprint, MOB), and Trunk Stability Push Up (all models). Similar effects for FMS™C item scores were limited to Trunk Stability Push Up (p<0.05, all models).
Conclusions The present study extends evidence supporting the validity of load-enhanced movement quality screening as a predictor of tactical performance ability. Future designs should seek to identify mechanisms explaining this effect. Level of Evidence 3 PMID:28593095
LOAD-ENHANCED MOVEMENT QUALITY SCREENING AND TACTICAL ATHLETICISM: AN EXTENSION OF EVIDENCE.
Glass, Stephen M; Schmitz, Randy J; Rhea, Christopher K; Ross, Scott E
2017-06-01
Military organizations use movement quality screening for prediction of injury risk and performance potential. Currently, evidence of an association between movement quality and performance is limited. Recent work has demonstrated that external loading strengthens the relationship between movement screens and performance outcomes. Such loading may therefore steer us toward robust implementations of movement quality screens while maintaining their appeal as cost effective, field-expedient tools. The purpose of the current study was to quantify the effect of external load-bearing on the relationship between clinically rated movement quality and tactical performance outcomes while addressing the noted limitations. Crossover Trial. Fifty young adults (25 male, 25 female, 22.98 ± 3.09 years, 171.95 ± 11.46 cm, 71.77 ± 14.03 kg) completed the Functional Movement Screen™ with (FMS™W) and without (FMS™C) a weight vest in randomized order. Following FMS™ testing, criterion measures of tactical performance were administered, including agility T-Tests, sprints, a 400-meter run, the Mobility for Battle (MOB) course, and a simulated casualty rescue. For each performance outcome, regression models were selected via group lasso with smoothed FMS™ item scores as candidate predictor variables. For all outcomes, proportion of variance accounted for was greater in FMS™W (R² = 0.22 [T-Test], 0.29 [Sprint], 0.17 [400 meter], 0.29 [MOB], and 0.11 [casualty rescue]) than in FMS™C (R² = 0.00 [T-Test], 0.11 [Sprint], 0.00 [400 meter], 0.19 [MOB], and 0.00 [casualty rescue]). From the FMS™W condition, beneficial performance effects (p<0.05) were observed for Deep Squat (sprint, casualty rescue), Hurdle Step (T-Agility, 400 meter run), Inline Lunge (sprint, MOB), and Trunk Stability Push Up (all models). Similar effects for FMS™C item scores were limited to Trunk Stability Push Up (p<0.05, all models).
The present study extends evidence supporting the validity of load-enhanced movement quality screening as a predictor of tactical performance ability. Future designs should seek to identify mechanisms explaining this effect. Level of Evidence: 3.
Analysis of Genome-Wide Association Studies with Multiple Outcomes Using Penalization
Liu, Jin; Huang, Jian; Ma, Shuangge
2012-01-01
Genome-wide association studies have been extensively conducted, searching for markers for biologically meaningful outcomes and phenotypes. Penalization methods have been adopted in the analysis of the joint effects of a large number of SNPs (single nucleotide polymorphisms) and marker identification. This study is partly motivated by the analysis of heterogeneous stock mice dataset, in which multiple correlated phenotypes and a large number of SNPs are available. Existing penalization methods designed to analyze a single response variable cannot accommodate the correlation among multiple response variables. With multiple response variables sharing the same set of markers, joint modeling is first employed to accommodate the correlation. The group Lasso approach is adopted to select markers associated with all the outcome variables. An efficient computational algorithm is developed. Simulation study and analysis of the heterogeneous stock mice dataset show that the proposed method can outperform existing penalization methods. PMID:23272092
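The abstract does not spell out its computational algorithm, so the following is only a minimal sketch of the group-wise shrinkage step that group Lasso solvers commonly use, assuming each SNP's coefficients across all outcomes form one group. The function name and the numbers are hypothetical illustrations, not the authors' implementation.

```python
import math


def group_soft_threshold(coefs, lam):
    """Shrink one coefficient group; zero the whole group if its norm <= lam."""
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= lam:
        return [0.0] * len(coefs)
    scale = 1.0 - lam / norm
    return [scale * c for c in coefs]


# Hypothetical effects of two SNPs on three correlated outcomes.
strong_snp = group_soft_threshold([3.0, 4.0, 0.0], lam=1.0)   # kept, shrunk
weak_snp = group_soft_threshold([0.3, -0.4, 0.0], lam=1.0)    # dropped entirely
```

Because the penalty acts on the norm of the whole group, a marker is either retained for all outcomes or removed for all of them, which is the behavior the abstract describes as selecting markers associated with all the outcome variables.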
The Mexican Military Approaches the 21st Century: Coping with a New World Order
1994-02-21
known today by the initials PRI (Partido Revolucionario Institucional), began the process of institutionalizing civilian political power.1 The...Herrera Lasso M. and Gonzalez, Balance y Perspectivas, pp. 397-399. 8. Roberto Vizcaino, "La Seguridad del País, Fin Primordial del Estado," Proceso...pp. 120-121. 10. Raúl Benitez Manaut, "Las Fuerzas Armadas Mexicanas y su Relación con el Estado, el Sistema Politico y la Sociedad," paper presented
Model selection for pion photoproduction
Landay, J.; Doring, M.; Fernandez-Ramirez, C.; ...
2017-01-12
Partial-wave analysis of meson and photon-induced reactions is needed to enable the comparison of many theoretical approaches to data. In both energy-dependent and independent parametrizations of partial waves, the selection of the model amplitude is crucial. Principles of the S matrix are implemented to a different degree in different approaches, but an often overlooked aspect concerns the selection of undetermined coefficients and functional forms for fitting, leading to a minimal yet sufficient parametrization. We present an analysis of low-energy neutral pion photoproduction using the least absolute shrinkage and selection operator (LASSO) in combination with criteria from information theory and K-fold cross validation. These methods are not yet widely known in the analysis of excited hadrons but will become relevant in the era of precision spectroscopy. The principle is first illustrated with synthetic data; then, its feasibility for real data is demonstrated by analyzing the latest available measurements of differential cross sections (dσ/dΩ), photon-beam asymmetries (Σ), and target asymmetry differential cross sections (dσ_T/dΩ ≡ T dσ/dΩ) in the low-energy regime.
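To make the combination of LASSO and K-fold cross validation concrete, here is a minimal pure-Python sketch on synthetic data. It assumes a plain linear model fitted by cyclic coordinate descent and covers only the cross-validation part (not the information criteria); all names, the data, and the candidate penalties are illustrative assumptions, not the authors' partial-wave code.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # inner product of feature j with the partial residual
            zj = 0.0   # squared norm of feature j
            for i in range(n):
                others = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                zj += X[i][j] ** 2
            b[j] = soft_threshold(rho / n, lam) / (zj / n)
    return b


def kfold_mse(X, y, lam, k=5):
    """Mean squared prediction error of the LASSO fit under K-fold CV."""
    n = len(X)
    sq_err = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        b = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred = sum(x * bj for x, bj in zip(X[i], b))
            sq_err += (y[i] - pred) ** 2
    return sq_err / n


# Synthetic data (illustrative): only the first two of four terms matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
y = [2.0 * row[0] - 1.0 * row[1] + rng.gauss(0, 0.1) for row in X]

coefs = lasso_cd(X, y, lam=0.1)
best_lam = min([0.01, 0.1, 1.0], key=lambda lam: kfold_mse(X, y, lam))
```

The penalty removes the two superfluous terms while keeping the two that generate the data, and cross validation rejects the overly strong penalty. That is the sense in which the method yields a minimal yet sufficient parametrization.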
Natural Language-based Machine Learning Models for the Annotation of Clinical Radiology Reports.
Zech, John; Pain, Margaret; Titano, Joseph; Badgeley, Marcus; Schefflein, Javin; Su, Andres; Costa, Anthony; Bederson, Joshua; Lehar, Joseph; Oermann, Eric Karl
2018-05-01
Purpose To compare different methods for generating features from radiology reports and to develop a method to automatically identify findings in these reports. Materials and Methods In this study, 96 303 head computed tomography (CT) reports were obtained. The linguistic complexity of these reports was compared with that of alternative corpora. Head CT reports were preprocessed, and machine-analyzable features were constructed by using bag-of-words (BOW), word embedding, and Latent Dirichlet allocation-based approaches. Ultimately, 1004 head CT reports were manually labeled for findings of interest by physicians, and a subset of these were deemed critical findings. Lasso logistic regression was used to train models for physician-assigned labels on 602 of 1004 head CT reports (60%) using the constructed features, and the performance of these models was validated on a held-out 402 of 1004 reports (40%). Models were scored by area under the receiver operating characteristic curve (AUC), and aggregate AUC statistics were reported for (a) all labels, (b) critical labels, and (c) the presence of any critical finding in a report. Sensitivity, specificity, accuracy, and F1 score were reported for the best performing model's (a) predictions of all labels and (b) identification of reports containing critical findings. Results The best-performing model (BOW with unigrams, bigrams, and trigrams plus average word embeddings vector) had a held-out AUC of 0.966 for identifying the presence of any critical head CT finding and an average 0.957 AUC across all head CT findings. Sensitivity and specificity for identifying the presence of any critical finding were 92.59% (175 of 189) and 89.67% (191 of 213), respectively. Average sensitivity and specificity across all findings were 90.25% (1898 of 2103) and 91.72% (18 351 of 20 007), respectively. 
Simpler BOW methods achieved results competitive with those of more sophisticated approaches, with an average AUC for presence of any critical finding of 0.951 for unigram BOW versus 0.966 for the best-performing model. The Yule I of the head CT corpus was 34, markedly lower than that of the Reuters corpus (at 103) or I2B2 discharge summaries (at 271), indicating lower linguistic complexity. Conclusion Automated methods can be used to identify findings in radiology reports. The success of this approach benefits from the standardized language of these reports. With this method, a large labeled corpus can be generated for applications such as deep learning. © RSNA, 2018 Online supplemental material is available for this article.
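The reported sensitivity and specificity figures can be checked directly from the counts given in the abstract. The short sketch below only redoes that arithmetic; the helper name `rate` is an illustrative choice.

```python
def rate(hits, total):
    """Percentage of `hits` out of `total`."""
    return 100.0 * hits / total


# Counts as reported in the abstract.
sens_critical = rate(175, 189)      # critical findings correctly flagged
spec_critical = rate(191, 213)      # non-critical reports correctly cleared
sens_all = rate(1898, 2103)         # averaged over all findings
spec_all = rate(18351, 20007)

# The two denominators for critical findings add up to the held-out set.
held_out_reports = 189 + 213
```

The denominators 189 and 213 sum to 402, the size of the held-out set, so the critical-finding figures are per-report rates on the validation reports.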
Fast Adaptive Least Trimmed Squares for Robust Evaluation of Quality of Experience
2014-07-01
fact that not every Internet user is trustworthy. In other words, due to the lack of supervision when subjects perform experiments in crowdsourcing, they...21], [22], etc. However, a major challenge of crowdsourcing QoE evaluation is that not every Internet user is trustworthy. That is, some raters try...regularization paths of the LASSO problem could provide us an order on samples tending to be outliers. Such an approach is inspired by Huber’s celebrated work on
NASA Astrophysics Data System (ADS)
Tao, Quan
Because of their extraordinary characteristics such as quantum confinement and large surface-to-volume ratio, semiconducting nanostructures such as nanowires or nanotubes hold great potential in sensing chemical vapors. Nanowire or nanotube based gas sensors usually possess appealing advantages such as high sensitivity, high stability, fast recovery time, and electrically controllable properties. To better predict the composition and concentration of the target gas, nanostructures made from heterogeneous materials are employed to provide more predictors. In recent years, it has become possible to synthesize nanowires and nanotubes routinely through different methods. The techniques of fabricating nanowire or nanotube based sensor arrays, however, encounter obstacles and deserve further investigation. Dielectrophoresis (DEP), which refers to the motion of submicron particles inside a non-uniform electric field, has long been recognized as a nondestructive, easily implementable, and efficient approach to manipulate nanostructures onto electronic circuitries. However, owing to limited understanding of the process, devices fabricated through DEP often end up with an unpredictable number of arbitrarily aligned nanostructures. In this study, we first generalize the classical DEP formulas so that they can be applied to the more general case in which a nanostructure is subjected to a non-uniform electric field with arbitrary orientation. A comprehensive model is then constructed to investigate the trajectory and alignment of DEP assembled nanostructures, which can be verified by experimental observations. The simulation results assist us in fabricating a gas sensor array with zinc oxide (ZnO) nanowires and carbon nanotubes (CNTs). It is then demonstrated that the device can sense ammonia (NH3) well at room temperature, which circumvents the high temperature condition usually required for nanowire based gas sensor applications.
An effective approach to recover the device using DC biases to locally heat up the nanostructures is then proposed and implemented to accelerate the recovery process of the device without the requirement of heating up the whole device. As the sensors are characterized under different NH3 concentrations, the outputs are analyzed using regression methods to estimate the concentration of NH3. The quadratic model with the lasso is demonstrated to provide the best performance for the collected data.
Model-Based Control of Observer Bias for the Analysis of Presence-Only Data in Ecology
Warton, David I.; Renner, Ian W.; Ramp, Daniel
2013-01-01
Presence-only data, where information is available concerning species presence but not species absence, are subject to bias due to observers being more likely to visit and record sightings at some locations than others (hereafter “observer bias”). In this paper, we describe and evaluate a model-based approach to accounting for observer bias directly – by modelling presence locations as a function of known observer bias variables (such as accessibility variables) in addition to environmental variables, then conditioning on a common level of bias to make predictions of species occurrence free of such observer bias. We implement this idea using point process models with a LASSO penalty, a new presence-only method related to maximum entropy modelling, that implicitly addresses the “pseudo-absence problem” of where to locate pseudo-absences (and how many). The proposed method of bias-correction is evaluated using systematically collected presence/absence data for 62 plant species endemic to the Blue Mountains near Sydney, Australia. It is shown that modelling and controlling for observer bias significantly improves the accuracy of predictions made using presence-only data, and usually improves predictions as compared to pseudo-absence or “inventory” methods of bias correction based on absences from non-target species. Future research will consider the potential for improving the proposed bias-correction approach by estimating the observer bias simultaneously across multiple species. PMID:24260167
Pecetti, Luciano; Brummer, E. Charles; Palmonari, Alberto; Tava, Aldo
2017-01-01
Genetic progress for forage quality has been poor in alfalfa (Medicago sativa L.), the most-grown forage legume worldwide. This study aimed at exploring opportunities for marker-assisted selection (MAS) and genomic selection of forage quality traits based on breeding values of parent plants. Some 154 genotypes from a broadly-based reference population were genotyped by genotyping-by-sequencing (GBS), and phenotyped for leaf-to-stem ratio, leaf and stem contents of protein, neutral detergent fiber (NDF) and acid detergent lignin (ADL), and leaf and stem NDF digestibility after 24 hours (NDFD), of their dense-planted half-sib progenies in three growing conditions (summer harvest, full irrigation; summer harvest, suspended irrigation; autumn harvest). Trait-marker analyses were performed on progeny values averaged over conditions, owing to modest germplasm × condition interaction. Genomic selection exploited 11,450 polymorphic SNP markers, whereas a subset of 8,494 M. truncatula-aligned markers were used for a genome-wide association study (GWAS). GWAS confirmed the polygenic control of quality traits and, in agreement with phenotypic correlations, indicated substantially different genetic control of a given trait in stems and leaves. It detected several SNPs in different annotated genes that were highly linked to stem protein content. Also, it identified a small genomic region on chromosome 8 with high concentration of annotated genes associated with leaf ADL, including one gene probably involved in the lignin pathway. Three genomic selection models, i.e., Ridge-regression BLUP, Bayes B and Bayesian Lasso, displayed similar prediction accuracy, whereas SVR-lin was less accurate. Accuracy values were moderate (0.3–0.4) for stem NDFD and leaf protein content, modest for leaf ADL and NDFD, and low to very low for the other traits. 
Along with previous results for the same germplasm set, this study indicates that GBS data can be exploited to improve both quality traits (by genomic selection or MAS) and forage yield. PMID:28068350
Biazzi, Elisa; Nazzicari, Nelson; Pecetti, Luciano; Brummer, E Charles; Palmonari, Alberto; Tava, Aldo; Annicchiarico, Paolo
2017-01-01
Genetic progress for forage quality has been poor in alfalfa (Medicago sativa L.), the most-grown forage legume worldwide. This study aimed at exploring opportunities for marker-assisted selection (MAS) and genomic selection of forage quality traits based on breeding values of parent plants. Some 154 genotypes from a broadly-based reference population were genotyped by genotyping-by-sequencing (GBS), and phenotyped for leaf-to-stem ratio, leaf and stem contents of protein, neutral detergent fiber (NDF) and acid detergent lignin (ADL), and leaf and stem NDF digestibility after 24 hours (NDFD), of their dense-planted half-sib progenies in three growing conditions (summer harvest, full irrigation; summer harvest, suspended irrigation; autumn harvest). Trait-marker analyses were performed on progeny values averaged over conditions, owing to modest germplasm × condition interaction. Genomic selection exploited 11,450 polymorphic SNP markers, whereas a subset of 8,494 M. truncatula-aligned markers were used for a genome-wide association study (GWAS). GWAS confirmed the polygenic control of quality traits and, in agreement with phenotypic correlations, indicated substantially different genetic control of a given trait in stems and leaves. It detected several SNPs in different annotated genes that were highly linked to stem protein content. Also, it identified a small genomic region on chromosome 8 with high concentration of annotated genes associated with leaf ADL, including one gene probably involved in the lignin pathway. Three genomic selection models, i.e., Ridge-regression BLUP, Bayes B and Bayesian Lasso, displayed similar prediction accuracy, whereas SVR-lin was less accurate. Accuracy values were moderate (0.3-0.4) for stem NDFD and leaf protein content, modest for leaf ADL and NDFD, and low to very low for the other traits. 
Along with previous results for the same germplasm set, this study indicates that GBS data can be exploited to improve both quality traits (by genomic selection or MAS) and forage yield.
Chen, Ligong; Schiffer, Jarad M; Dalton, Shannon; Sabourin, Carol L; Niemuth, Nancy A; Plikaytis, Brian D; Quinn, Conrad P
2014-11-01
Humoral and cell-mediated immune correlates of protection (COP) for inhalation anthrax in a rhesus macaque (Macaca mulatta) model were determined. The immunological and survival data were from 114 vaccinated and 23 control animals exposed to Bacillus anthracis spores at 12, 30, or 52 months after the first vaccination. The vaccinated animals received a 3-dose intramuscular priming series (3-i.m.) of anthrax vaccine adsorbed (AVA) (BioThrax) at 0, 1, and 6 months. The immune responses were modulated by administering a range of vaccine dilutions. Together with the vaccine dilution dose and interval between the first vaccination and challenge, each of 80 immune response variables to anthrax toxin protective antigen (PA) at every available study time point was analyzed as a potential COP by logistic regression penalized by least absolute shrinkage and selection operator (LASSO) or elastic net. The anti-PA IgG level at the last available time point before challenge (last) and lymphocyte stimulation index (SI) at months 2 and 6 were identified consistently as a COP. Anti-PA IgG levels and lethal toxin neutralization activity (TNA) at months 6 and 7 (peak) and the frequency of gamma interferon (IFN-γ)-secreting cells at month 6 also had statistically significant positive correlations with survival. The ratio of interleukin 4 (IL-4) mRNA to IFN-γ mRNA at month 6 also had a statistically significant negative correlation with survival. TNA had lower accuracy as a COP than did anti-PA IgG response. Following the 3-i.m. priming with AVA, the anti-PA IgG responses at the time of exposure or at month 7 were practicable and accurate metrics for correlating vaccine-induced immunity with protection against inhalation anthrax. Copyright © 2014, American Society for Microbiology. All Rights Reserved.
Benchmarking of surgical complications in gynaecological oncology: prospective multicentre study.
Burnell, M; Iyer, R; Gentry-Maharaj, A; Nordin, A; Liston, R; Manchanda, R; Das, N; Gornall, R; Beardmore-Gray, A; Hillaby, K; Leeson, S; Linder, A; Lopes, A; Meechan, D; Mould, T; Nevin, J; Olaitan, A; Rufford, B; Shanbhag, S; Thackeray, A; Wood, N; Reynolds, K; Ryan, A; Menon, U
2016-12-01
To explore the impact of risk-adjustment on surgical complication rates (CRs) for benchmarking gynaecological oncology centres. Prospective cohort study. Ten UK accredited gynaecological oncology centres. Women undergoing major surgery on a gynaecological oncology operating list. Patient co-morbidity, surgical procedures and intra-operative (IntraOp) complications were recorded contemporaneously by surgeons for 2948 major surgical procedures. Postoperative (PostOp) complications were collected from hospitals and patients. Risk-prediction models for IntraOp and PostOp complications were created using penalised (lasso) logistic regression using over 30 potential patient/surgical risk factors. Observed and risk-adjusted IntraOp and PostOp CRs for individual hospitals were calculated. Benchmarking using colour-coded funnel plots and observed-to-expected ratios was undertaken. Overall, IntraOp CR was 4.7% (95% CI 4.0-5.6) and PostOp CR was 25.7% (95% CI 23.7-28.2). The observed CRs for all hospitals were under the upper 95% control limit for both IntraOp and PostOp funnel plots. Risk-adjustment and use of observed-to-expected ratio resulted in one hospital moving to the >95-98% CI (red) band for IntraOp CRs. Use of only hospital-reported data for PostOp CRs would have resulted in one hospital being unfairly allocated to the red band. There was little concordance between IntraOp and PostOp CRs. The funnel plots and overall IntraOp (≈5%) and PostOp (≈26%) CRs could be used for benchmarking gynaecological oncology centres. Hospital benchmarking using risk-adjusted CRs allows fairer institutional comparison. IntraOp and PostOp CRs are best assessed separately. As hospital under-reporting is common for postoperative complications, use of patient-reported outcomes is important. Risk-adjusted benchmarking of surgical complications for ten UK gynaecological oncology centres allows fairer comparison. © 2016 Royal College of Obstetricians and Gynaecologists.
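The abstract does not say how its funnel-plot control limits were computed. As a hedged illustration, the sketch below assumes the common normal approximation to the binomial around the overall complication rate, and computes an observed-to-expected ratio from per-patient predicted risks. The hospital volumes and risks used here are hypothetical.

```python
import math


def funnel_limits(p_overall, n, z=1.96):
    """Approximate control limits for a hospital with n procedures.

    Assumes a normal approximation to the binomial (an assumption,
    not necessarily the method used in the study).
    """
    se = math.sqrt(p_overall * (1.0 - p_overall) / n)
    return max(0.0, p_overall - z * se), min(1.0, p_overall + z * se)


def observed_to_expected(observed_events, predicted_risks):
    """Observed complications divided by the sum of model-predicted risks."""
    return observed_events / sum(predicted_risks)


# Overall IntraOp rate from the abstract; hospital sizes are hypothetical.
small_low, small_high = funnel_limits(0.047, 300)
large_low, large_high = funnel_limits(0.047, 3000)

# Hypothetical hospital: 12 complications, 200 patients at 5% predicted risk.
oe_ratio = observed_to_expected(12, [0.05] * 200)
```

The limits narrow as volume grows, which gives the plot its funnel shape, and an observed-to-expected ratio above 1 indicates more complications than the case mix would predict.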
Lin, Nan; Jiang, Junhai; Guo, Shicheng; Xiong, Momiao
2015-01-01
Due to the advancement in sensor technology, the growing volume of medical image data makes it possible to visualize the anatomical changes in biological tissues. As a consequence, medical images have the potential to enhance the diagnosis of disease, the prediction of clinical outcomes and the characterization of disease progression. At the same time, the growing data dimensions pose great methodological and computational challenges for the representation and selection of features in image cluster analysis. To address these challenges, we first extend functional principal component analysis (FPCA) from one dimension to two dimensions to fully capture the spatial variation of the image signals. The image signals contain a large number of redundant features which provide no additional information for clustering analysis. The widely used methods for removing the irrelevant features are sparse clustering algorithms using a lasso-type penalty to select the features. However, the accuracy of clustering using a lasso-type penalty depends on the selection of the penalty parameters and the threshold value. In practice, they are difficult to determine. Recently, randomized algorithms have received a great deal of attention in big data analysis. This paper presents a randomized algorithm for accurate feature selection in image clustering analysis. The proposed method is applied to both the liver and kidney cancer histology image data from the TCGA database. The results demonstrate that the randomized feature selection method coupled with functional principal component analysis substantially outperforms the current sparse clustering algorithms in image cluster analysis. PMID:26196383
Ramezani, Alireza; Ahmadieh, Hamid; Azarmina, Mohsen; Soheilian, Masoud; Dehghan, Mohammad H; Mohebbi, Mohammad R
2009-12-01
To evaluate the validity of a new method for the quantitative analysis of fundus or angiographic images using Photoshop 7.0 (Adobe, USA) software by comparing with clinical evaluation. Four hundred and eighteen fundus and angiographic images of diabetic patients were evaluated by three retina specialists and then by computing using Photoshop 7.0 software. Four variables were selected for comparison: amount of hard exudates (HE) on color pictures, amount of HE on red-free pictures, severity of leakage, and the size of the foveal avascular zone (FAZ). The coefficient of agreement (Kappa) between the two methods in the amount of HE on color and red-free photographs were 85% (0.69) and 79% (0.59), respectively. The agreement for severity of leakage was 72% (0.46). In the two methods for the evaluation of the FAZ size using the magic and lasso software tools, the agreement was 54% (0.09) and 89% (0.77), respectively. Agreement in the estimation of the FAZ size by the lasso magnetic tool was excellent and was almost as good in the quantification of HE on color and on red-free images. Considering the agreement of this new technique for the measurement of variables in fundus images using Photoshop software with the clinical evaluation, this method seems to have sufficient validity to be used for the quantitative analysis of HE, leakage, and FAZ size on the angiograms of diabetic patients.
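The abstract reports each comparison as a percent agreement with a kappa value in parentheses. The sketch below assumes the statistic is Cohen's kappa for two sets of categorical grades (the abstract does not name the variant) and shows how the two numbers relate. The grades are made up for illustration.

```python
from collections import Counter


def agreement_and_kappa(grades_a, grades_b):
    """Return (observed agreement, Cohen's kappa) for two grade lists."""
    n = len(grades_a)
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    count_a, count_b = Counter(grades_a), Counter(grades_b)
    # Agreement expected by chance from each method's marginal frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical severity grades: clinical reading vs. software measurement.
clinical = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
software = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

observed, kappa = agreement_and_kappa(clinical, software)
```

In this toy case the methods agree on 80% of images, yet kappa is only about 0.6, because half of that agreement would be expected by chance. The same effect explains how a 54% raw agreement can coexist with a kappa as low as 0.09 in the abstract.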
Metabolomic prediction of yield in hybrid rice.
Xu, Shizhong; Xu, Yang; Gong, Liang; Zhang, Qifa
2016-10-01
Rice (Oryza sativa) provides a staple food source for more than 50% of the world's population. An increase in yield can significantly contribute to global food security. Hybrid breeding can potentially help to meet this goal because hybrid rice often shows a considerable increase in yield when compared with pure-bred cultivars. We recently developed a marker-guided prediction method for hybrid yield and showed a substantial increase in yield through genomic hybrid breeding. We now have transcriptomic and metabolomic data as potential resources for prediction. Using six prediction methods, including least absolute shrinkage and selection operator (LASSO), best linear unbiased prediction (BLUP), stochastic search variable selection, partial least squares, and support vector machines using the radial basis function and polynomial kernel function, we found that the predictability of hybrid yield can be further increased using these omic data. LASSO and BLUP are the most efficient methods for yield prediction. For high heritability traits, genomic data remain the most efficient predictors. When metabolomic data are used, the predictability of hybrid yield is almost doubled compared with genomic prediction. Of the 21 945 potential hybrids derived from 210 recombinant inbred lines, selection of the top 10 hybrids predicted from metabolites would lead to a ~30% increase in yield. We hypothesize that each metabolite represents a biologically built-in genetic network for yield; thus, using metabolites for prediction is equivalent to using information integrated from these hidden genetic networks for yield prediction. © 2016 The Authors The Plant Journal © 2016 John Wiley & Sons Ltd.
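The figure of 21 945 potential hybrids follows from pairing the 210 inbred lines, assuming every unordered pair of distinct lines counts once (reciprocal crosses not distinguished). A one-line check:

```python
import math

n_lines = 210
# Number of distinct unordered pairs of parental lines.
n_hybrids = math.comb(n_lines, 2)   # 210 * 209 / 2
```

Selecting the top 10 predicted hybrids therefore means keeping fewer than 0.05% of all possible crosses.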
Contact isolation is a risk factor for venous thromboembolism in trauma patients.
Reed, Christopher R; Ferguson, Robert A; Peng, Yiming; Collier, Bryan R; Bradburn, Eric H; Toms, Alice R; Fogel, Sandy L; Baker, Christopher C; Hamill, Mark E
2015-11-01
Contact isolation (CI) is a series of precautions used to prevent the transmission of medically significant infectious pathogens in the health care setting. Our institution's implementation of CI includes limiting patient movement to the assigned room. Our objective was to define the association between CI and venous thromboembolism (VTE) at our Level I trauma center. Our institution's prospective trauma database was retrospectively queried for all patients admitted to the trauma service between January 1, 2011, and December 31, 2012. Data including demographics, Injury Severity Score (ISS), preexisting medical conditions, injury type, and VTE development were collected. CI status data were obtained from our institution's infection control database. χ2 was used to examine the unadjusted relationship between CI status and VTE. As the groups were not equivalent, logistic regression was then used to examine the relationship between CI and VTE while adjusting for relevant covariates including sex, age, ISS, and comorbidities. Of the 4,423 trauma patients admitted during the study period, 4,318 (97.6%) had complete records and were included in subsequent analyses. A total of 249 (5.8%) of the patients were on CI. VTE occurred in 44 patients (17.7%) on CI versus 141 patients (3.5%) who were not isolated (p < 0.0001; odds ratio, 6.0; 95% confidence interval, 4.1-8.6). With the use of lasso [least absolute shrinkage and selection operator] regression to adjust for patient risk factors, this relationship remained highly significant (p < 0.0001; odds ratio, 2.61; 95% confidence interval, 1.7-4.0). CI, ISS, hospital length of stay, and cardiac comorbidity were associated with VTE. After adjustment for other risk factors, CI remained most strongly associated with VTE. Although any medical intervention may come with unintended consequences, the risks and benefits of CI in this population need to be reevaluated. 
Further study is planned to identify opportunities to mitigate this increased VTE risk. Prognostic/epidemiologic study, level III; therapeutic study, level IV.
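The unadjusted odds ratio quoted above can be reproduced from the counts in the abstract. The sketch below is only an illustration: it assumes a standard 2x2 table with a Woolf (log-scale) confidence interval, which the abstract does not name as the authors' method, and the function name `odds_ratio_ci` is hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Odds ratio with a Woolf (log-scale) confidence interval.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    The interval method is an assumption; the abstract does not state it.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Counts taken from the abstract: 4,318 patients, 249 on contact isolation (CI),
# 44 VTE events on CI and 141 VTE events among those not isolated.
ci_total, ci_vte = 249, 44
non_total, non_vte = 4318 - 249, 141

or_, lo, hi = odds_ratio_ci(ci_vte, ci_total - ci_vte,
                            non_vte, non_total - non_vte)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Under these assumptions the result rounds to the reported odds ratio of 6.0 with an interval of 4.1 to 8.6. The adjusted estimate of 2.61 cannot be reproduced this way, since it depends on patient-level covariates.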
Predicting fatty acid profiles in blood based on food intake and the FADS1 rs174546 SNP.
Hallmann, Jacqueline; Kolossa, Silvia; Gedrich, Kurt; Celis-Morales, Carlos; Forster, Hannah; O'Donovan, Clare B; Woolhead, Clara; Macready, Anna L; Fallaize, Rosalind; Marsaux, Cyril F M; Lambrinou, Christina-Paulina; Mavrogianni, Christina; Moschonis, George; Navas-Carretero, Santiago; San-Cristobal, Rodrigo; Godlewska, Magdalena; Surwiłło, Agnieszka; Mathers, John C; Gibney, Eileen R; Brennan, Lorraine; Walsh, Marianne C; Lovegrove, Julie A; Saris, Wim H M; Manios, Yannis; Martinez, Jose Alfredo; Traczyk, Iwona; Gibney, Michael J; Daniel, Hannelore
2015-12-01
A high intake of n-3 PUFA provides health benefits via changes in the n-6/n-3 ratio in blood. In addition to such dietary PUFAs, variants in the fatty acid desaturase 1 (FADS1) gene are also associated with altered PUFA profiles. We used mathematical modeling to predict levels of PUFA in whole blood, based on multiple hypothesis testing and bootstrapped LASSO selected food items, anthropometric and lifestyle factors, and the rs174546 genotypes in FADS1 from 1607 participants (Food4Me Study). The models were developed using data from the first reported time point (training set) and their predictive power was evaluated using data from the last reported time point (test set). Among other food items, fish, pizza, chicken, and cereals were identified as being associated with the PUFA profiles. Using these food items and the rs174546 genotypes as predictors, models explained 26-43% of the variability in PUFA concentrations in the training set and 22-33% in the test set. Selecting food items using multiple hypothesis testing is a valuable contribution to determine predictors, as our models' predictive power is higher compared to analogous studies. As a unique feature, we additionally confirmed our models' power based on a test set. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Huo, Zhiguang; Ding, Ying; Liu, Silvia; Oesterreich, Steffi; Tseng, George
2016-01-01
Disease phenotyping by omics data has become a popular approach that potentially can lead to better personalized treatment. Identifying disease subtypes via unsupervised machine learning is the first step towards this goal. In this paper, we extend a sparse K-means method towards a meta-analytic framework to identify novel disease subtypes when expression profiles of multiple cohorts are available. The lasso regularization and meta-analysis identify a unique set of gene features for subtype characterization. An additional pattern matching reward function guarantees consistent subtype signatures across studies. The method was evaluated by simulations and leukemia and breast cancer data sets. The identified disease subtypes from meta-analysis were characterized with improved accuracy and stability compared to single study analysis. The breast cancer model was applied to an independent METABRIC dataset and generated improved survival difference between subtypes. These results provide a basis for diagnosis and development of targeted treatments for disease subgroups. PMID:27330233
Relativistic theory for picosecond time transfer in the vicinity of Earth
NASA Technical Reports Server (NTRS)
Petit, G.; Wolf, P.
1994-01-01
The problem of light propagation is treated in a geocentric reference system with the goal of ensuring picosecond accuracy for time transfer techniques using electromagnetic signals in the vicinity of the Earth. We give an explicit formula for a one way time transfer, to be applied when the spatial coordinates of the time transfer stations are known in a geocentric reference system rotating with the Earth. This expression is extended, at the same accuracy level of one picosecond, to the special cases of two way and LASSO time transfers via geostationary satellites.
cDREM: inferring dynamic combinatorial gene regulation.
Wise, Aaron; Bar-Joseph, Ziv
2015-04-01
Genes are often combinatorially regulated by multiple transcription factors (TFs). Such combinatorial regulation plays an important role in development and facilitates the ability of cells to respond to different stresses. While a number of approaches have utilized sequence and ChIP-based datasets to study combinational regulation, these have often ignored the combinational logic and the dynamics associated with such regulation. Here we present cDREM, a new method for reconstructing dynamic models of combinatorial regulation. cDREM integrates time series gene expression data with (static) protein interaction data. The method is based on a hidden Markov model and utilizes the sparse group Lasso to identify small subsets of combinatorially active TFs, their time of activation, and the logical function they implement. We tested cDREM on yeast and human data sets. Using yeast we show that the predicted combinatorial sets agree with other high throughput genomic datasets and improve upon prior methods developed to infer combinatorial regulation. Applying cDREM to study human response to flu, we were able to identify several combinatorial TF sets, some of which were known to regulate immune response while others represent novel combinations of important TFs.
A robust sparse-modeling framework for estimating schizophrenia biomarkers from fMRI.
Dillon, Keith; Calhoun, Vince; Wang, Yu-Ping
2017-01-30
Our goal is to identify the brain regions most relevant to mental illness using neuroimaging. State of the art machine learning methods commonly suffer from repeatability difficulties in this application, particularly when using large and heterogeneous populations for samples. We revisit both dimensionality reduction and sparse modeling, and recast them in a common optimization-based framework. This allows us to combine the benefits of both types of methods in an approach which we call unambiguous components. We use this to estimate the image component with a constrained variability, which is best correlated with the unknown disease mechanism. We apply the method to the estimation of neuroimaging biomarkers for schizophrenia, using task fMRI data from a large multi-site study. The proposed approach yields an improvement in both robustness of the estimate and classification accuracy. We find that unambiguous components incorporate roughly two thirds of the same brain regions as sparsity-based methods LASSO and elastic net, while roughly one third of the selected regions differ. Further, unambiguous components achieve superior classification accuracy in differentiating cases from controls. Unambiguous components provide a robust way to estimate important regions of imaging data. Copyright © 2016 Elsevier B.V. All rights reserved.
Upweighting rare favourable alleles increases long-term genetic gain in genomic selection programs.
Liu, Huiming; Meuwissen, Theo H E; Sørensen, Anders C; Berg, Peer
2015-03-21
The short-term impact of using different genomic prediction (GP) models in genomic selection has been intensively studied, but their long-term impact is poorly understood. Furthermore, long-term genetic gain of genomic selection is expected to improve by using Jannink's weighting (JW) method, in which rare favourable marker alleles are upweighted in the selection criterion. In this paper, we extend the JW method by including an additional parameter to decrease the emphasis on rare favourable alleles over the time horizon, with the purpose of further improving the long-term genetic gain. We call this new method dynamic weighting (DW). The paper explores the long-term impact of different GP models with or without weighting methods. Different selection criteria were tested by simulating a population of 500 animals with truncation selection of five males and 50 females. Selection criteria included unweighted and weighted genomic estimated breeding values using the JW or DW methods, for which ridge regression (RR) and Bayesian lasso (BL) were used to estimate marker effects. The impacts of these selection criteria were compared under three genetic architectures, i.e. varying numbers of QTL for the trait and for two time horizons of 15 (TH15) or 40 (TH40) generations. For unweighted GP, BL resulted in up to 21.4% higher long-term genetic gain and 23.5% lower rate of inbreeding under TH40 than RR. For weighted GP, DW resulted in 1.3 to 5.5% higher long-term gain compared to unweighted GP. JW, however, showed a 6.8% lower long-term genetic gain relative to unweighted GP when BL was used to estimate the marker effects. Under TH40, both DW and JW obtained significantly higher genetic gain than unweighted GP. With DW, the long-term genetic gain was increased by up to 30.8% relative to unweighted GP, and also increased by 8% relative to JW, although at the expense of a lower short-term gain. 
Irrespective of the number of QTL simulated, BL is superior to RR in maintaining genetic variance and therefore results in higher long-term genetic gain. Moreover, DW is a promising method with which high long-term genetic gain can be expected within a fixed time frame.
Wu, Lin; Wang, Yang; Pan, Shirui
2017-12-01
It is now well established that sparse representation models work effectively for many visual recognition tasks, and have pushed forward the success of dictionary learning therein. Recent studies on dictionary learning focus on learning discriminative atoms instead of purely reconstructive ones. However, the existence of intraclass diversities (i.e., data objects within the same category that exhibit large visual dissimilarities) and interclass similarities (i.e., data objects from distinct classes that share many visual similarities) makes it challenging to learn effective recognition models. To this end, a large number of labeled data objects are required to learn models which can effectively characterize these subtle differences. However, access to labeled data objects is always limited, making it difficult to learn a monolithic dictionary that can be discriminative enough. To address the above limitations, in this paper, we propose a weakly-supervised dictionary learning method to automatically learn a discriminative dictionary by fully exploiting visual attribute correlations rather than label priors. In particular, the intrinsic attribute correlations are deployed as a critical cue to guide the process of object categorization, and then a set of subdictionaries are jointly learned with respect to each category. The resulting dictionary is highly discriminative and leads to intraclass-diversity-aware sparse representations. Extensive experiments on image classification and object recognition are conducted to show the effectiveness of our approach.
Cloud Type Classification (cldtype) Value-Added Product
DOE Office of Scientific and Technical Information (OSTI.GOV)
Flynn, Donna; Shi, Yan; Lim, K-S
The Cloud Type (cldtype) value-added product (VAP) provides an automated cloud type classification based on macrophysical quantities derived from vertically pointing lidar and radar. Up to 10 layers of clouds are classified into seven cloud types based on predetermined and site-specific thresholds of cloud top, base and thickness. Examples of thresholds for selected U.S. Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Climate Research Facility sites are provided in Tables 1 and 2. Inputs for the cldtype VAP include lidar and radar cloud boundaries obtained from the Active Remotely Sensed Cloud Location (ARSCL) and Surface Meteorological Systems (MET) data. Rain rates from MET are used to determine when radar signal attenuation precludes accurate cloud detection. Temporal resolution and vertical resolution for cldtype are 1 minute and 30 m respectively and match the resolution of ARSCL. The cldtype classification is an initial step for further categorization of clouds. It was developed for use by the Shallow Cumulus VAP to identify potential periods of interest to the LASSO model and is intended to find clouds of interest for a variety of users.
Brown, Jeremiah R; MacKenzie, Todd A; Maddox, Thomas M; Fly, James; Tsai, Thomas T; Plomondon, Mary E; Nielson, Christopher D; Siew, Edward D; Resnic, Frederic S; Baker, Clifton R; Rumsfeld, John S; Matheny, Michael E
2015-12-11
Acute kidney injury (AKI) occurs frequently after cardiac catheterization and percutaneous coronary intervention. Although a clinical risk model exists for percutaneous coronary intervention, no models exist for both procedures, nor do existing models account for risk factors prior to the index admission. We aimed to develop such a model for use in prospective automated surveillance programs in the Veterans Health Administration. We collected data on all patients undergoing cardiac catheterization or percutaneous coronary intervention in the Veterans Health Administration from January 01, 2009 to September 30, 2013, excluding patients with chronic dialysis, end-stage renal disease, renal transplant, and missing pre- and postprocedural creatinine measurement. We used 4 AKI definitions in model development and included risk factors from up to 1 year prior to the procedure and at presentation. We developed our prediction models for postprocedural AKI using the least absolute shrinkage and selection operator (LASSO) and internally validated using bootstrapping. We developed models using 115 633 angiogram procedures and externally validated using 27 905 procedures from a New England cohort. Models had cross-validated C-statistics of 0.74 (95% CI: 0.74-0.75) for AKI, 0.83 (95% CI: 0.82-0.84) for AKIN2, 0.74 (95% CI: 0.74-0.75) for contrast-induced nephropathy, and 0.89 (95% CI: 0.87-0.90) for dialysis. We developed a robust, externally validated clinical prediction model for AKI following cardiac catheterization or percutaneous coronary intervention to automatically identify high-risk patients before and immediately after a procedure in the Veterans Health Administration. Work is ongoing to incorporate these models into routine clinical practice. © 2015 The Authors. Published on behalf of the American Heart Association, Inc., by Wiley Blackwell.
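The C-statistics reported above measure discrimination: the probability that a randomly chosen patient who developed the outcome was assigned a higher predicted risk than a randomly chosen patient who did not. A minimal sketch of that pairwise definition follows. The function name `c_statistic` and the toy risks are hypothetical and are not taken from the study.

```python
def c_statistic(risks, outcomes):
    """Concordance (C) statistic for a binary outcome.

    Counts every (event, non-event) pair; a pair is concordant when the
    event has the higher predicted risk, and ties count as one half.
    """
    concordant = 0.0
    pairs = 0
    for risk_event, y_event in zip(risks, outcomes):
        if y_event != 1:
            continue
        for risk_control, y_control in zip(risks, outcomes):
            if y_control != 0:
                continue
            pairs += 1
            if risk_event > risk_control:
                concordant += 1.0
            elif risk_event == risk_control:
                concordant += 0.5
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return concordant / pairs

# Invented predicted risks and outcomes (1 = event), for illustration only.
toy_risks = [0.10, 0.40, 0.35, 0.80, 0.20]
toy_outcomes = [0, 0, 1, 1, 0]
c_value = c_statistic(toy_risks, toy_outcomes)
print(f"C-statistic = {c_value:.3f}")
```

In the toy data, five of the six event/non-event pairs are ordered correctly, so the statistic is 5/6. The quadratic loop is fine for a sketch; a rank-based formula would be preferable at the scale of the cohort described above.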
Management of benign biliary strictures with a novel retrievable self-expandable metal stent.
Hu, Bing; Leung, Joseph W; Gao, Dao Jian; Wang, Tian Tian; Wu, Jun
2014-03-01
Endoscopic placement of covered self-expandable metal stent (SEMS) has gained popularity in the management of benign biliary strictures (BBS). The existing SEMS has been designed primarily to palliate malignant biliary obstruction and has a high frequency of stent migration, difficulty in retrieval and stricture recurrence after stent removal. This study aimed to design a novel retrievable SEMS dedicated to the treatment of extrahepatic BBS and evaluate its clinical efficacy and safety. A short fully covered SEMS (FCSEMS) with a retrieval lasso was designed for the specific treatment of BBS. A total of 45 patients with segmental extrahepatic BBS were included in this study. The stent was placed entirely inside the bile duct with only the retrieval lasso extending from the papilla. The stents were recommended to be in situ for 6 to 12 months before removal. The FCSEMS was successfully placed in all 45 patients. In all, 33 patients had their FCSEMS successfully removed after a mean period of 8.6 ± 3.7 (range 2-15.5) months. Stent migration occurred in 9.1% of the patients. During a mean follow-up of 18.9 months after stent removal, recurrent stricture was found in 2 (6.1%) patients and was successfully treated with a second FCSEMS. Overall, the strictures resolved in 30/33 (90.9%) patients. Intraductal placement of a short FCSEMS is suitable for the treatment of segmental extrahepatic BBS. This new removable design offered prolonged stenting and drainage for BBS for up to one year with minimal complications. © 2013 Chinese Medical Association Shanghai Branch, Chinese Society of Gastroenterology, Renji Hospital Affiliated to Shanghai Jiaotong University School of Medicine and Wiley Publishing Asia Pty Ltd.
Elbes, Delphine; Magat, Julie; Govari, Assaf; Ephrath, Yaron; Vieillot, Delphine; Beeckler, Christopher; Weerasooriya, Rukshen; Jais, Pierre; Quesson, Bruno
2017-03-01
Interventional cardiac catheter mapping is routinely guided by X-ray fluoroscopy, although radiation exposure remains a significant concern. Feasibility of catheter ablation for common flutter has recently been demonstrated under magnetic resonance imaging (MRI) guidance. The benefit of catheter ablation under MRI could be significant for complex arrhythmias such as atrial fibrillation (AF), but MRI-compatible multi-electrode catheters such as Lasso have not yet been developed. This study aimed at demonstrating the feasibility and safety of using a multi-electrode catheter [magnetic resonance (MR)-compatible Lasso] during MRI for cardiac mapping. We also aimed at measuring the level of interference between MR and electrophysiological (EP) systems. Experiments were performed in vivo in sheep (N = 5) using a multi-electrode, circular, steerable, MR-compatible diagnostic catheter. The most common MRI sequences (1.5T) relevant for cardiac examination were run with the catheter positioned in the right atrium. High-quality electrograms were recorded while imaging with a maximal signal-to-noise ratio (peak-to-peak signal amplitude/peak-to-peak noise amplitude) ranging from 5.8 to 165. Importantly, MRI image quality was unchanged. Artefacts induced by MRI sequences during mapping were demonstrated to be compatible with clinical use. Phantom data demonstrated that this 10-pole circular catheter can be used safely with a maximum of 4°C increase in temperature. This new MR-compatible 10-pole catheter appears to be safe and effective. Combining MR and multipolar EP in a single session offers the possibility to correlate substrate information (scar, fibrosis) and EP mapping as well as online monitoring of lesion formation and electrical endpoint. Published on behalf of the European Society of Cardiology. All rights reserved. © The Author 2016. For permissions please email: journals.permissions@oup.com.
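The abstract defines the signal-to-noise ratio as peak-to-peak signal amplitude divided by peak-to-peak noise amplitude. That definition can be sketched as below; the function names and sample values are hypothetical, and how the signal and noise windows were chosen in the study is not stated.

```python
def peak_to_peak(samples):
    """Peak-to-peak amplitude of a window of samples."""
    return max(samples) - min(samples)

def signal_to_noise(signal_window, noise_window):
    """SNR as defined in the abstract: ratio of peak-to-peak amplitudes."""
    noise_amplitude = peak_to_peak(noise_window)
    if noise_amplitude == 0:
        raise ValueError("noise window has zero peak-to-peak amplitude")
    return peak_to_peak(signal_window) / noise_amplitude

# Invented electrogram samples (arbitrary units), for illustration only.
electrogram = [-1.0, 0.25, 2.5, -0.5]
baseline = [-0.25, 0.0, 0.25, 0.125]
snr = signal_to_noise(electrogram, baseline)
print(f"SNR = {snr:.1f}")
```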
Characterizing spatial structure of sediment E. coli populations to inform sampling design.
Piorkowski, Gregory S; Jamieson, Rob C; Hansen, Lisbeth Truelstrup; Bezanson, Greg S; Yost, Chris K
2014-01-01
Escherichia coli can persist in streambed sediments and influence water quality monitoring programs through their resuspension into overlying waters. This study examined the spatial patterns in E. coli concentration and population structure within streambed morphological features during baseflow and following stormflow to inform sampling strategies for representative characterization of E. coli populations within a stream reach. E. coli concentrations in bed sediments were significantly different (p = 0.002) among monitoring sites during baseflow, and significant interactive effects (p = 0.002) occurred among monitoring sites and morphological features following stormflow. Least absolute shrinkage and selection operator (LASSO) regression revealed that water velocity and effective particle size (D10) explained E. coli concentration during baseflow, whereas sediment organic carbon, water velocity and median particle diameter (D50) were important explanatory variables following stormflow. Principal Coordinate Analysis illustrated the site-scale differences in sediment E. coli populations between disconnected stream segments. Also, E. coli populations were similar among depositional features within a reach, but differed in relation to high velocity features (e.g., riffles). Canonical correspondence analysis resolved that E. coli population structure was primarily explained by spatial (26.9–31.7 %) over environmental variables (9.2–13.1 %). Spatial autocorrelation existed among monitoring sites and morphological features for both sampling events, and gradients in mean particle diameter and water velocity influenced E. coli population structure for the baseflow and stormflow sampling events, respectively. Representative characterization of streambed E. coli requires sampling of depositional and high velocity environments to accommodate strain selectivity among these features owing to sediment and water velocity heterogeneity.
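LASSO regression selects explanatory variables by shrinking small coefficients exactly to zero. The sketch below illustrates that mechanism with a plain coordinate-descent solver for the objective (1/(2n))·||y − Xb||² + λ·||b||₁. It is a generic textbook formulation, not the software used in the study, and the data are invented: three centred, mutually orthogonal predictors, one with a strong effect, one with a weak effect, and one with none.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n))*||y - X b||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(v * (r + v * beta[j])
                      for v, r in zip(column, residual)) / n
            updated = soft_threshold(rho, lam) / scale
            delta = updated - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for r, v in zip(residual, column)]
                beta[j] = updated
    return beta

# Invented toy design: columns are centred and mutually orthogonal.
X = [[1, 1, 1],
     [1, -1, -1],
     [-1, 1, -1],
     [-1, -1, 1]]
# Response built as 2.0*col0 + 0.05*col1 + 0.0*col2.
y = [2.05, 1.95, -1.95, -2.05]

beta = lasso_coordinate_descent(X, y, lam=0.1)
print(beta)
```

With λ = 0.1 the strong coefficient is shrunk from 2.0 to about 1.9, while the weak and null coefficients are set exactly to zero. This is the sense in which LASSO "reveals" a small set of explanatory variables.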
Generalized t-statistic for two-group classification.
Komori, Osamu; Eguchi, Shinto; Copas, John B
2015-06-01
In the classic discriminant model of two multivariate normal distributions with equal variance matrices, the linear discriminant function is optimal both in terms of the log likelihood ratio and in terms of maximizing the standardized difference (the t-statistic) between the means of the two distributions. In a typical case-control study, normality may be sensible for the control sample but heterogeneity and uncertainty in diagnosis may suggest that a more flexible model is needed for the cases. We generalize the t-statistic approach by finding the linear function which maximizes a standardized difference but with data from one of the groups (the cases) filtered by a possibly nonlinear function U. We study conditions for consistency of the method and find the function U which is optimal in the sense of asymptotic efficiency. Optimality may also extend to other measures of discriminatory efficiency such as the area under the receiver operating characteristic curve. The optimal function U depends on a scalar probability density function which can be estimated non-parametrically using a standard numerical algorithm. A lasso-like version for variable selection is implemented by adding L1-regularization to the generalized t-statistic. Two microarray data sets in the study of asthma and various cancers are used as motivating examples. © 2014, The International Biometric Society.
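The classical result that this abstract generalizes can be written out explicitly. The notation below is an assumption introduced here for illustration (the abstract defines no symbols): $\mu_0$ and $\mu_1$ are the control and case means, $\Sigma$ is the common variance matrix, and $\beta$ is the coefficient vector of the linear function $\beta^\top x$.

```latex
% Standardized difference (t-statistic) of the linear score \beta^\top x
t(\beta) \;=\; \frac{\beta^\top(\mu_1 - \mu_0)}{\sqrt{\beta^\top \Sigma\, \beta}}

% By the Cauchy--Schwarz inequality applied to \Sigma^{1/2}\beta and
% \Sigma^{-1/2}(\mu_1 - \mu_0), the maximizer is the linear discriminant direction
\hat{\beta} \;\propto\; \Sigma^{-1}(\mu_1 - \mu_0),
\qquad
\max_{\beta}\, t(\beta) \;=\; \sqrt{(\mu_1 - \mu_0)^\top \Sigma^{-1} (\mu_1 - \mu_0)} .
```

The maximum equals the Mahalanobis distance between the two group means. The generalization described in the abstract replaces the case data with a filtered version through the function U before forming the standardized difference; that step is not reproduced here.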
Yeh, James S; Austad, Kirsten E; Franklin, Jessica M; Chimonas, Susan; Campbell, Eric G; Avorn, Jerry; Kesselheim, Aaron S
2014-10-01
Professional societies use metrics to evaluate medical schools' policies regarding interactions of students and faculty with the pharmaceutical and medical device industries. We compared these metrics and determined which US medical schools' industry interaction policies were associated with student behaviors. Using survey responses from a national sample of 1,610 US medical students, we compared their reported industry interactions with their schools' American Medical Student Association (AMSA) PharmFree Scorecard and average Institute on Medicine as a Profession (IMAP) Conflicts of Interest Policy Database score. We used hierarchical logistic regression models to determine the association between policies and students' gift acceptance, interactions with marketing representatives, and perceived adequacy of faculty-industry separation. We adjusted for year in training, medical school size, and level of US National Institutes of Health (NIH) funding. We used LASSO regression models to identify specific policies associated with the outcomes. We found that IMAP and AMSA scores had similar median values (1.75 [interquartile range 1.50-2.00] versus 1.77 [1.50-2.18], adjusted to compare scores on the same scale). Scores on AMSA and IMAP shared policy dimensions were not closely correlated (gift policies, r = 0.28, 95% CI 0.11-0.44; marketing representative access policies, r = 0.51, 95% CI 0.36-0.63). Students from schools with the most stringent industry interaction policies were less likely to report receiving gifts (AMSA score, odds ratio [OR]: 0.37, 95% CI 0.19-0.72; IMAP score, OR 0.45, 95% CI 0.19-1.04) and less likely to interact with marketing representatives (AMSA score, OR 0.33, 95% CI 0.15-0.69; IMAP score, OR 0.37, 95% CI 0.14-0.95) than students from schools with the lowest ranked policy scores. 
The association became nonsignificant when fully adjusted for NIH funding level, whereas adjusting for year of education, size of school, and publicly versus privately funded school did not alter the association. Policies limiting gifts, meals, and speaking bureaus were associated with students reporting having not received gifts and having not interacted with marketing representatives. Policy dimensions reflecting the regulation of industry involvement in educational activities (e.g., continuing medical education, travel compensation, and scholarships) were associated with perceived separation between faculty and industry. The study is limited by potential for recall bias and the cross-sectional nature of the survey, as school curricula and industry interaction policies may have changed since the time of the survey administration and study analysis. As medical schools review policies regulating medical students' industry interactions, limitations on receipt of gifts and meals and participation of faculty in speaking bureaus should be emphasized, and policy makers should pay greater attention to less research-intensive institutions.
Forecasting influenza in Hong Kong with Google search queries and statistical model fusion.
Xu, Qinneng; Gel, Yulia R; Ramirez Ramirez, L Leticia; Nezafati, Kusha; Zhang, Qingpeng; Tsui, Kwok-Leung
2017-01-01
The objective of this study is to investigate the predictive utility of online social media and web search queries, particularly Google search data, to forecast new cases of influenza-like-illness (ILI) in general outpatient clinics (GOPC) in Hong Kong. To mitigate the impact of sensitivity to self-excitement (i.e., fickle media interest) and other artifacts of online social media data, in our approach we fuse multiple offline and online data sources. Four individual models: generalized linear model (GLM), least absolute shrinkage and selection operator (LASSO), autoregressive integrated moving average (ARIMA), and deep learning (DL) with Feedforward Neural Networks (FNN) are employed to forecast ILI-GOPC both one week and two weeks in advance. The covariates include Google search queries, meteorological data, and previously recorded offline ILI. To our knowledge, this is the first study that introduces deep learning methodology into surveillance of infectious diseases and investigates its predictive utility. Furthermore, to exploit the strengths of each individual forecasting model, we use statistical model fusion, using Bayesian model averaging (BMA), which allows a systematic integration of multiple forecast scenarios. For each model, an adaptive approach is used to capture the recent relationship between ILI and covariates. DL with FNN appears to deliver the most competitive predictive performance among the four considered individual models. Combining all four models in a comprehensive BMA framework further improves such predictive evaluation metrics as root mean squared error (RMSE) and mean absolute predictive error (MAPE). Nevertheless, DL with FNN remains the preferred method for predicting locations of influenza peaks. The proposed approach can be viewed as a feasible alternative to forecast ILI in Hong Kong or other countries where ILI has no constant seasonal trend and influenza data resources are limited. 
The proposed methodology is easily tractable and computationally efficient.
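The evaluation metrics and the idea of fusing forecasts can be sketched in a few lines. Everything here is illustrative: the forecasts are invented, the weights are fixed by hand rather than estimated as posterior model probabilities as BMA would do, and "mean absolute predictive error" is assumed to mean the average absolute difference between forecast and observation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mean_absolute_predictive_error(actual, predicted):
    """Average absolute forecast error (assumed reading of 'MAPE' above)."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

def fuse_forecasts(forecasts, weights):
    """Weighted average of several models' forecasts.

    In BMA the weights would be posterior model probabilities; here they
    are supplied directly, which is a simplification.
    """
    total = sum(weights)
    horizon = len(forecasts[0])
    return [sum(w * f[t] for w, f in zip(weights, forecasts)) / total
            for t in range(horizon)]

# Invented weekly case counts and two hypothetical model forecasts.
observed = [10, 12, 15, 11]
model_a = [9, 13, 14, 12]
model_b = [12, 12, 17, 10]

fused = fuse_forecasts([model_a, model_b], weights=[0.5, 0.5])
for name, forecast in [("A", model_a), ("B", model_b), ("fused", fused)]:
    print(name, round(rmse(observed, forecast), 3),
          round(mean_absolute_predictive_error(observed, forecast), 3))
```

In this toy case the two models err in opposite directions, so the fused forecast has lower RMSE and lower absolute error than either model alone. Averaging helps most when individual errors are not strongly correlated.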
Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.
Chan, Ariel W; Hamblin, Martha T; Jannink, Jean-Luc
2016-01-01
Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. 
We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.
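The masking procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the `mask_fraction` parameter, and the site-mean imputer (a stand-in for Beagle or glmnet) are all assumptions, and the correlation is pooled over all masked entries rather than computed per site or per individual.

```python
import numpy as np

def masked_imputation_accuracy(dosages, impute, mask_fraction=0.1, seed=0):
    """Hide a random subset of observed dosages, impute them, and return the
    Pearson correlation between the hidden true values and the imputed ones."""
    rng = np.random.default_rng(seed)
    # Only entries that were actually observed can be masked and scored.
    obs_idx = np.flatnonzero(~np.isnan(dosages))
    n_mask = max(2, int(mask_fraction * obs_idx.size))
    chosen = rng.choice(obs_idx, size=n_mask, replace=False)
    rows, cols = np.unravel_index(chosen, dosages.shape)

    masked = dosages.copy()
    masked[rows, cols] = np.nan          # hide the selected genotypes
    imputed = impute(masked)             # hypothetical imputer callback

    return np.corrcoef(dosages[rows, cols], imputed[rows, cols])[0, 1]

def site_mean_impute(dosages):
    """Placeholder imputer (assumption): fill gaps with the site (column) mean."""
    filled = dosages.copy()
    site_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = site_means[cols]
    return filled
```

Any imputer that accepts a matrix with NaN gaps and returns a filled matrix of the same shape can be passed as `impute`, which makes it easy to compare methods on identical masks by reusing the same `seed`.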
MicroRNA Expression-Based Model Indicates Event-Free Survival in Pediatric Acute Myeloid Leukemia
Lim, Emilia L.; Trinh, Diane L.; Ries, Rhonda E.; Wang, Jim; Gerbing, Robert B.; Ma, Yussanne; Topham, James; Hughes, Maya; Pleasance, Erin; Mungall, Andrew J.; Moore, Richard; Zhao, Yongjun; Aplenc, Richard; Sung, Lillian; Kolb, E. Anders; Gamis, Alan; Smith, Malcolm; Gerhard, Daniela S.; Alonzo, Todd A.; Meshinchi, Soheil; Marra, Marco A.
2017-01-01
Purpose Children with acute myeloid leukemia (AML) whose disease is refractory to standard induction chemotherapy or who experience relapse after initial response have dismal outcomes. We sought to comprehensively profile pediatric AML microRNA (miRNA) samples to identify dysregulated genes and assess the utility of miRNAs for improved outcome prediction. Patients and Methods To identify miRNA biomarkers that are associated with treatment failure, we performed a comprehensive sequence-based characterization of the pediatric AML miRNA landscape. miRNA sequencing was performed on 1,362 samples (1,303 primary, 22 refractory, and 37 relapse samples). One hundred sixty-four matched samples (127 primary and 37 relapse samples) were analyzed by using RNA sequencing. Results By using penalized lasso Cox proportional hazards regression, we identified 36 miRNAs whose expression levels at diagnosis were highly associated with event-free survival. Combined expression of the 36 miRNAs was used to create a novel miRNA-based risk classification scheme (AMLmiR36). This new miRNA-based risk classifier identifies those patients who are at high risk (hazard ratio, 2.830; P ≤ .001) or low risk (hazard ratio, 0.323; P ≤ .001) of experiencing treatment failure, independent of conventional karyotype or mutation status. The performance of AMLmiR36 was independently assessed by using 878 patients from two different clinical trials (AAML0531 and AAML1031). Our analysis also revealed that miR-106a-363 was abundantly expressed in relapse and refractory samples, and several candidate targets of miR-106a-5p were involved in oxidative phosphorylation, a process that is suppressed in treatment-resistant leukemic cells. Conclusion To assess the utility of miRNAs for outcome prediction in patients with pediatric AML, we designed and validated a miRNA-based risk classification scheme. 
We also hypothesized that the abundant expression of miR-106a could increase treatment resistance via modulation of genes that are involved in oxidative phosphorylation. PMID:29068783
Predictive Modeling of Risk Factors and Complications of Cataract Surgery
Gaskin, Gregory L; Pershing, Suzann; Cole, Tyler S; Shah, Nigam H
2016-01-01
Purpose To quantify the relationship between aggregated preoperative risk factors and cataract surgery complications, as well as to build a model predicting outcomes on an individual level, given a constellation of demographic, baseline, preoperative, and intraoperative patient characteristics. Setting Stanford Hospital and Clinics between 1994 and 2013. Design Retrospective cohort study. Methods Patients age 40 or older who received cataract surgery between 1994 and 2013. Risk factors, complications, and demographic information were extracted from the Electronic Health Record (EHR), based on International Classification of Diseases, 9th edition (ICD-9) codes, Current Procedural Terminology (CPT) codes, drug prescription information, and text data mining using natural language processing. We used a bootstrapped least absolute shrinkage and selection operator (LASSO) model to identify highly predictive variables. We built random forest classifiers for each complication to create predictive models. Results Our data corroborated existing literature on postoperative complications, including the association of intraoperative complications, complex cataract surgery, black race, and/or prior eye surgery with an increased risk of any postoperative complications. We also found a number of other, less well-described risk factors, including systemic diabetes mellitus, young age (<60 years old), and hyperopia as risk factors for complex cataract surgery and intra- and post-operative complications. Our predictive models based on aggregated risk factors outperformed existing published models. Conclusions The constellations of risk factors and complications described here can guide new avenues of research and provide specific, personalized risk assessment for a patient considering cataract surgery. The predictive capacity of our models can enable risk stratification of patients, which has utility as a teaching tool as well as informing quality/value-based reimbursements. PMID:26692059
Poisson Mixture Regression Models for Heart Disease Prediction.
Mufudza, Chipo; Erol, Hamza
2016-01-01
Early heart disease control can be achieved by high disease prediction and diagnosis efficiency. This paper focuses on the use of model based clustering techniques to predict and diagnose heart disease via Poisson mixture regression models. Analysis and application of Poisson mixture regression models is here addressed under two different classes: standard and concomitant variable mixture regression models. Results show that a two-component concomitant variable Poisson mixture regression model predicts heart disease better than both the standard Poisson mixture regression model and the ordinary general linear Poisson regression model due to its low Bayesian Information Criteria value. Furthermore, a Zero Inflated Poisson Mixture Regression model turned out to be the best model for heart prediction over all models as it both clusters individuals into high or low risk category and predicts rate to heart disease componentwise given clusters available. It is deduced that heart disease prediction can be effectively done by identifying the major risks componentwise using Poisson mixture regression model. PMID:27999611
Parametric regression model for survival data: Weibull regression model as an example
2016-01-01
The Weibull regression model is one of the most popular parametric regression models because it provides an estimate of the baseline hazard function as well as coefficients for covariates. Because of technical difficulties, the Weibull regression model is seldom used in the medical literature compared with the semi-parametric proportional hazards model. To familiarize clinical investigators with the Weibull regression model, this article introduces some basic knowledge of the model and then illustrates how to fit it with R software. The SurvRegCensCov package is useful for converting estimated coefficients to clinically relevant statistics such as the hazard ratio (HR) and event time ratio (ETR). Model adequacy can be assessed by inspecting Kaplan-Meier curves stratified by a categorical variable. The eha package provides an alternative method for fitting the Weibull regression model. The check.dist() function helps to assess goodness-of-fit of the model. Variable selection is based on the importance of a covariate, which can be tested using the anova() function. Alternatively, backward elimination starting from a full model is an efficient way to develop a model. Visualizing the Weibull regression model after model development is also worthwhile because it provides another way to report findings. PMID:28149846
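The conversion from fitted coefficients to HR and ETR mentioned above follows from the standard link between the accelerated failure time (AFT) and proportional hazards forms of the Weibull model. The sketch below is in Python rather than R and is not the SurvRegCensCov API; it assumes a log-linear AFT fit of the form log T = intercept + coef * x + scale * W, and the function and argument names are hypothetical.

```python
import math

def weibull_aft_to_hr_etr(aft_coef, aft_scale):
    """Convert one covariate's Weibull AFT coefficient into a hazard ratio
    and an event time ratio (both per one-unit increase in the covariate)."""
    if aft_scale <= 0:
        raise ValueError("aft_scale must be positive")
    shape = 1.0 / aft_scale                     # Weibull shape parameter
    hazard_ratio = math.exp(-aft_coef * shape)  # PH coefficient is -coef / scale
    event_time_ratio = math.exp(aft_coef)       # multiplicative change in event time
    return hazard_ratio, event_time_ratio
```

Under this parameterization, a positive AFT coefficient lengthens event times (ETR above 1) and correspondingly lowers the hazard (HR below 1).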
Introduction to the use of regression models in epidemiology.
Bender, Ralf
2009-01-01
Regression modeling is one of the most important statistical techniques used in analytical epidemiology. By means of regression models the effect of one or several explanatory variables (e.g., exposures, subject characteristics, risk factors) on a response variable such as mortality or cancer can be investigated. From multiple regression models, adjusted effect estimates can be obtained that take the effect of potential confounders into account. Regression methods can be applied in all epidemiologic study designs so that they represent a universal tool for data analysis in epidemiology. Different kinds of regression models have been developed depending on the measurement scale of the response variable and the study design. The most important methods are linear regression for continuous outcomes, logistic regression for binary outcomes, Cox regression for time-to-event data, and Poisson regression for frequencies and rates. This chapter provides a nontechnical introduction to these regression models with illustrating examples from cancer research.
Wang, Xun-Heng; Jiao, Yun; Li, Lihua
2017-10-24
Attention deficit hyperactivity disorder (ADHD) is a common brain disorder with high prevalence in school-age children. Previously developed machine learning-based methods have discriminated patients with ADHD from normal controls by providing label information of the disease for individuals. Inattention and impulsivity are the two most significant clinical symptoms of ADHD. However, predicting clinical symptoms (i.e., inattention and impulsivity) is a challenging task based on neuroimaging data. The goal of this study is twofold: to build predictive models for clinical symptoms of ADHD based on resting-state fMRI and to mine brain networks for predictive patterns of inattention and impulsivity. To achieve this goal, a cohort of 74 boys with ADHD and a cohort of 69 age-matched normal controls were recruited from the ADHD-200 Consortium. Both structural and resting-state fMRI images were obtained for each participant. Temporal patterns between and within intrinsic connectivity networks (ICNs) were applied as raw features in the predictive models. Specifically, sample entropy was taken as an intra-ICN feature, and phase synchronization (PS) was used as an inter-ICN feature. The predictive models were based on the least absolute shrinkage and selection operator (LASSO) algorithm. The performance of the predictive model for inattention is r=0.79 (p < 10^-8), and the performance of the predictive model for impulsivity is r=0.48 (p < 10^-8). The ICN-related predictive patterns may provide valuable information for investigating the brain network mechanisms of ADHD. In summary, the predictive models for clinical symptoms could be beneficial for personalizing ADHD medications. Copyright © 2017 IBRO. Published by Elsevier Ltd. All rights reserved.
Interpretation of commonly used statistical regression models.
Kasza, Jessica; Wolfe, Rory
2014-01-01
A review of some regression models commonly used in respiratory health applications is provided in this article. Simple linear regression, multiple linear regression, logistic regression and ordinal logistic regression are considered. The focus of this article is on the interpretation of the regression coefficients of each model, which are illustrated through the application of these models to a respiratory health research study. © 2013 The Authors. Respirology © 2013 Asian Pacific Society of Respirology.
NASA Astrophysics Data System (ADS)
Zhao, Jianhua; Zeng, Haishan; Kalia, Sunil; Lui, Harvey
2017-02-01
Background: Raman spectroscopy is a non-invasive optical technique which can measure molecular vibrational modes within tissue. A large-scale clinical study (n = 518) has demonstrated that real-time Raman spectroscopy could distinguish malignant from benign skin lesions with good diagnostic accuracy; this was validated by a follow-up independent study (n = 127). Objective: Most of the previous diagnostic algorithms have typically been based on analyzing the full band of the Raman spectra, either in the fingerprint or high wavenumber regions. Our objective in this presentation is to explore wavenumber selection based analysis in Raman spectroscopy for skin cancer diagnosis. Methods: A wavenumber selection algorithm was implemented using variably-sized wavenumber windows, which were determined by the correlation coefficient between wavenumbers. Wavenumber windows were chosen based on accumulated frequency from leave-one-out cross-validated stepwise regression or least absolute shrinkage and selection operator (LASSO). The diagnostic algorithms were then generated from the selected wavenumber windows using multivariate statistical analyses, including principal component and general discriminant analysis (PC-GDA) and partial least squares (PLS). A total cohort of 645 confirmed lesions from 573 patients encompassing skin cancers, precancers and benign skin lesions were included. Lesion measurements were divided into training cohort (n = 518) and testing cohort (n = 127) according to the measurement time. Result: The area under the receiver operating characteristic curve (ROC) improved from 0.861-0.891 to 0.891-0.911 and the diagnostic specificity for sensitivity levels of 0.99-0.90 increased respectively from 0.17-0.65 to 0.20-0.75 by selecting specific wavenumber windows for analysis. Conclusion: Wavenumber selection based analysis in Raman spectroscopy improves skin cancer diagnostic specificity at high sensitivity levels.
Köhler, M; Ziegler, A G; Beyerlein, A
2016-06-01
Women with gestational diabetes mellitus (GDM) have an increased risk of diabetes postpartum. We developed a score to predict the long-term risk of postpartum diabetes using clinical and anamnestic variables recorded during or shortly after delivery. Data from 257 GDM women who were prospectively followed for diabetes outcome over 20 years of follow-up were used to develop and validate the risk score. Participants were divided into training and test sets. The risk score was calculated using Lasso Cox regression and divided into four risk categories, and its prediction performance was assessed in the test set. Postpartum diabetes developed in 110 women. The training-set risk score, computed as 5 × body mass index in early pregnancy (kg/m^2) + 132 if GDM was treated with insulin (otherwise 0) + 44 if the woman had a family history of diabetes (otherwise 0) - 35 if the woman lactated (otherwise 0), had R^2 values of 0.23, 0.25, and 0.33 at 5, 10, and 15 years postpartum, respectively, and a C-Index of 0.75. Application of the risk score in the test set resulted in an observed risk of postpartum diabetes at 5 years of 11% for low risk scores ≤140, 29% for scores 141-220, 64% for scores 221-300, and 80% for scores >300. The derived risk score is easy to calculate, allows accurate prediction of GDM-related postpartum diabetes, and may thus be a useful prediction tool for clinicians and general practitioners.
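The score stated in the abstract is simple enough to compute directly. The sketch below uses only the coefficients and cut-offs quoted above. The function names are hypothetical, and treating the bands as continuous (so a non-integer score such as 140.5 falls in the 141-220 band) is an assumption, since the abstract lists integer ranges only.

```python
def gdm_risk_score(bmi_early_pregnancy, insulin_treated, family_history, lactated):
    """Risk score from the abstract: 5 x BMI + 132 (insulin) + 44 (family history) - 35 (lactation)."""
    score = 5.0 * bmi_early_pregnancy
    if insulin_treated:
        score += 132
    if family_history:
        score += 44
    if lactated:
        score -= 35
    return score

def gdm_risk_band(score):
    """Return (band label, observed 5-year diabetes risk in the test set) as reported."""
    if score <= 140:
        return "<=140", 0.11
    if score <= 220:
        return "141-220", 0.29
    if score <= 300:
        return "221-300", 0.64
    return ">300", 0.80
```

For example, a woman with an early-pregnancy BMI of 25 kg/m^2 who was treated with insulin, has no family history of diabetes, and did not lactate scores 5 × 25 + 132 = 257, which falls in the 221-300 band.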
Bahrami, Naeim; Hartman, Stephen J; Chang, Yu-Hsuan; Delfanti, Rachel; White, Nathan S; Karunamuni, Roshan; Seibert, Tyler M; Dale, Anders M; Hattangadi-Gluth, Jona A; Piccioni, David; Farid, Nikdokht; McDonald, Carrie R
2018-06-02
Molecular markers of WHO grade II/III glioma are known to have important prognostic and predictive implications and may be associated with unique imaging phenotypes. The purpose of this study is to determine whether three clinically relevant molecular markers identified in gliomas (IDH, 1p/19q, and MGMT status) show distinct quantitative MRI characteristics on FLAIR imaging. Sixty-one patients with grade II/III gliomas who had molecular data and MRI available prior to radiation were included. Quantitative MRI features were extracted that measured tissue heterogeneity (homogeneity and pixel correlation) and FLAIR border distinctiveness (edge contrast; EC). T-tests were conducted to determine whether patients with different genotypes differ across the features. Logistic regression with LASSO regularization was used to determine the optimal combination of MRI and clinical features for predicting molecular subtypes. Patients with IDH wildtype tumors showed greater signal heterogeneity (p = 0.001) and lower EC (p = 0.008) within the FLAIR region compared to IDH mutant tumors. Among patients with IDH mutant tumors, 1p/19q co-deleted tumors had greater signal heterogeneity (p = 0.002) and lower EC (p = 0.005) compared to 1p/19q intact tumors. MGMT methylated tumors showed lower EC (p = 0.03) compared to the unmethylated group. The combination of FLAIR border distinctness, heterogeneity, and pixel correlation optimally classified tumors by IDH status. Quantitative imaging characteristics of FLAIR heterogeneity and border pattern in grade II/III gliomas may provide unique information for determining molecular status at time of initial diagnostic imaging, which may then guide subsequent surgical and medical management.
Barratt, Daniel T.; Klepstad, Pål; Dale, Ola; Kaasa, Stein; Somogyi, Andrew A.
2015-01-01
Common adverse symptoms of cancer and chemotherapy are a major health burden; chief among these is pain, with opioids including transdermal fentanyl the mainstay of treatment. Innate immune activation has been implicated generally in pain, opioid analgesia, cognitive dysfunction, and sickness type symptoms reported by cancer patients. We aimed to determine if genetic polymorphisms in neuroimmune activation pathways alter the serum fentanyl concentration-response relationships for pain control, cognitive dysfunction, and other adverse symptoms, in cancer pain patients. Cancer pain patients (468) receiving transdermal fentanyl were genotyped for 31 single nucleotide polymorphisms in 19 genes: CASP1, BDNF, CRP, LY96, IL6, IL1B, TGFB1, TNF, IL10, IL2, TLR2, TLR4, MYD88, IL6R, OPRM1, ARRB2, COMT, STAT6 and ABCB1. Lasso and backward stepwise generalised linear regression were used to identify non-genetic and genetic predictors, respectively, of pain control (average Brief Pain Inventory < 4), cognitive dysfunction (Mini-Mental State Examination ≤ 23), sickness response and opioid adverse event complaint. Serum fentanyl concentrations did not predict between-patient variability in these outcomes, nor did genetic factors predict pain control, sickness response or opioid adverse event complaint. Carriers of the MYD88 rs6853 variant were half as likely to have cognitive dysfunction (11/111) as wild-type patients (69/325), with a relative risk of 0.45 (95% CI: 0.27 to 0.76) when accounting for major non-genetic predictors (age, Karnofsky functional score). This supports the involvement of innate immune signalling in cognitive dysfunction, and identifies MyD88 signalling pathways as a potential focus for predicting and reducing the burden of cognitive dysfunction in cancer pain patients. PMID:26332828
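As a rough check on the "half as likely" statement, the unadjusted relative risk can be computed from the raw counts given above (11/111 carriers versus 69/325 wild-type patients). This sketch is illustrative only: the reported 0.45 is adjusted for age and Karnofsky score, so the crude value (roughly 0.47) is not expected to match it exactly, and the log-scale Wald interval used here is an assumption rather than the authors' method.

```python
import math

def crude_relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed, z=1.96):
    """Unadjusted relative risk with a Wald confidence interval on the log scale."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Counts quoted in the abstract: MYD88 rs6853 carriers vs wild-type patients.
rr, lower, upper = crude_relative_risk(11, 111, 69, 325)
```

The crude interval is somewhat wider than the adjusted one reported in the abstract, which is consistent with the adjustment for non-genetic predictors explaining part of the outcome variability.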
Genomic selection for fruit quality traits in apple (Malus×domestica Borkh.).
Kumar, Satish; Chagné, David; Bink, Marco C A M; Volz, Richard K; Whitworth, Claire; Carlisle, Charmaine
2012-01-01
The genome sequence of apple (Malus×domestica Borkh.) was published more than a year ago, which helped develop an 8K SNP chip to assist in implementing genomic selection (GS). In apple breeding programmes, GS can be used to obtain genomic breeding values (GEBV) for choosing next-generation parents or selections for further testing as potential commercial cultivars at a very early stage. Thus GS has the potential to accelerate breeding efficiency significantly because of decreased generation interval or increased selection intensity. We evaluated the accuracy of GS in a population of 1120 seedlings generated from a factorial mating design of four female and two male parents. All seedlings were genotyped using an Illumina Infinium chip comprising 8,000 single nucleotide polymorphisms (SNPs), and were phenotyped for various fruit quality traits. Random-regression best linear unbiased prediction (RR-BLUP) and the Bayesian LASSO method were used to obtain GEBV, and compared using a cross-validation approach for their accuracy to predict unobserved BLUP-BV. Accuracies were very similar for both methods, varying from 0.70 to 0.90 for various fruit quality traits. The selection response per unit time using GS compared with the traditional BLUP-based selection was very high (>100%), especially for low-heritability traits. Genome-wide average estimated linkage disequilibrium (LD) between adjacent SNPs was 0.32, with a relatively slow decay of LD in the long range (r^2 = 0.33 and 0.19 at 100 kb and 1,000 kb respectively), contributing to the higher accuracy of GS. Distribution of estimated SNP effects revealed involvement of large effect genes with likely pleiotropic effects. These results demonstrated that genomic selection is a credible alternative to conventional selection for fruit quality traits.
Watanabe, Hiroyuki; Miyazaki, Hiroyasu
2006-01-01
Over- and/or under-correction of QT intervals for changes in heart rate may lead to misleading conclusions and/or masking the potential of a drug to prolong the QT interval. This study examines a nonparametric regression model (Loess Smoother) to adjust the QT interval for differences in heart rate, with an improved fitness over a wide range of heart rates. 240 sets of (QT, RR) observations collected from each of 8 conscious and non-treated beagle dogs were used as the materials for investigation. The fitness of the nonparametric regression model to the QT-RR relationship was compared with four models (individual linear regression, common linear regression, and Bazett's and Fridericia's correction models) with reference to Akaike's Information Criterion (AIC). Residuals were visually assessed. The bias-corrected AIC of the nonparametric regression model was the best of the models examined in this study. Although the parametric models did not fit, the nonparametric regression model improved the fitting at both fast and slow heart rates. The nonparametric regression model is the more flexible method compared with the parametric method. The mathematical fit for linear regression models was unsatisfactory at both fast and slow heart rates, while the nonparametric regression model showed significant improvement at all heart rates in beagle dogs.
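For reference, the two fixed-exponent corrections named above have simple closed forms: Bazett divides QT by the square root of the RR interval, and Fridericia divides by its cube root. The sketch below assumes RR is expressed in seconds and QT in milliseconds; the function names are hypothetical, and the Loess-based correction studied in the abstract is not reproduced here.

```python
def qtc_bazett(qt_ms, rr_s):
    """Bazett correction: QTc = QT / RR^(1/2), with RR in seconds."""
    if rr_s <= 0:
        raise ValueError("RR interval must be positive")
    return qt_ms / rr_s ** 0.5

def qtc_fridericia(qt_ms, rr_s):
    """Fridericia correction: QTc = QT / RR^(1/3), with RR in seconds."""
    if rr_s <= 0:
        raise ValueError("RR interval must be positive")
    return qt_ms / rr_s ** (1.0 / 3.0)
```

Both formulas leave QT unchanged at RR = 1 s (60 beats per minute) and diverge from each other as the heart rate moves away from that point, which is one reason a fixed exponent can over- or under-correct at fast and slow rates.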
Genome Mining for Ribosomally Synthesized Natural Products
Velásquez, Juan E.; van der Donk, Wilfred
2011-01-01
In recent years, the number of known peptide natural products that are synthesized via the ribosomal pathway has rapidly grown. Taking advantage of sequence homology among genes encoding precursor peptides or biosynthetic proteins, in silico mining of genomes combined with molecular biology approaches has guided the discovery of a large number of new ribosomal natural products, including lantipeptides, cyanobactins, linear thiazole/oxazole-containing peptides, microviridins, lasso peptides, amatoxins, cyclotides, and conopeptides. In this review, we describe the strategies used for the identification of these ribosomally-synthesized and posttranslationally modified peptides (RiPPs) and the structures of newly identified compounds. The increasing number of chemical entities and their remarkable structural and functional diversity may lead to novel pharmaceutical applications. PMID:21095156
Birse, Kenzie; Arnold, Kelly B; Novak, Richard M; McCorrister, Stuart; Shaw, Souradet; Westmacott, Garrett R; Ball, Terry B; Lauffenburger, Douglas A; Burgener, Adam
2015-09-01
The variable infectivity and transmissibility of HIV/SHIV has been recently associated with the menstrual cycle, with particular susceptibility observed during the luteal phase in nonhuman primate models and ex vivo human explant cultures, but the mechanism is poorly understood. Here, we performed an unbiased, mass spectrometry-based proteomic analysis to better understand the mucosal immunological processes underpinning this observed susceptibility to HIV infection. Cervicovaginal lavage samples (n = 19) were collected, characterized as follicular or luteal phase using days since last menstrual period, and analyzed by tandem mass spectrometry. Biological insights from these data were gained using a spectrum of computational methods, including hierarchical clustering, pathway analysis, gene set enrichment analysis, and partial least-squares discriminant analysis with LASSO feature selection. Of the 384 proteins identified, 43 were differentially abundant between phases (P < 0.05, ≥2-fold change). Cell-cell adhesion proteins and antiproteases were reduced, and leukocyte recruitment (interleukin-8 pathway, P = 1.41E-5) and extravasation proteins (P = 5.62E-4) were elevated during the luteal phase. LASSO/PLSDA identified a minimal profile of 18 proteins that best distinguished the luteal phase. This profile included cytoskeletal elements and proteases known to be involved in cellular movement. Gene set enrichment analysis associated CD4(+) T cell and neutrophil gene set signatures with the luteal phase (P < 0.05). Taken together, our findings indicate a strong association between proteins involved in tissue remodeling and leukocyte infiltration with the luteal phase, which may represent potential hormone-associated mechanisms of increased susceptibility to HIV. Recent studies have discovered an enhanced susceptibility to HIV infection during the progesterone-dominant luteal phase of the menstrual cycle. 
However, the mechanism responsible for this enhanced susceptibility has not yet been determined. Understanding the source of this vulnerability will be important for designing efficacious HIV prevention technologies for women. Furthermore, these findings may also be extrapolated to better understand the impact of exogenous hormone application, such as the use of hormonal contraceptives, on HIV acquisition risk. Hormonal contraceptives are the most widely used contraceptive method in sub-Saharan Africa, the most HIV-burdened area of the world. For this reason, research conducted to better understand how hormones impact host immunity and susceptibility factors important for HIV infection is a global health priority. Copyright © 2015, American Society for Microbiology. All Rights Reserved.
Stargate GTM: Bridging Descriptor and Activity Spaces.
Gaspar, Héléna A; Baskin, Igor I; Marcou, Gilles; Horvath, Dragos; Varnek, Alexandre
2015-11-23
Predicting the activity profile of a molecule or discovering structures possessing a specific activity profile are two important goals in chemoinformatics, which could be achieved by bridging activity and molecular descriptor spaces. In this paper, we introduce the "Stargate" version of the Generative Topographic Mapping approach (S-GTM) in which two different multidimensional spaces (e.g., structural descriptor space and activity space) are linked through a common 2D latent space. In the S-GTM algorithm, the manifolds are trained simultaneously in two initial spaces using the probabilities in the 2D latent space calculated as a weighted geometric mean of probability distributions in both spaces. S-GTM has the following interesting features: (1) activities are involved during the training procedure; therefore, the method is supervised, unlike conventional GTM; (2) using molecular descriptors of a given compound as input, the model predicts a whole activity profile, and (3) using an activity profile as input, areas populated by relevant chemical structures can be detected. To assess the performance of S-GTM prediction models, a descriptor space (ISIDA descriptors) of a set of 1325 GPCR ligands was related to a B-dimensional (B = 1 or 8) activity space corresponding to pKi values for eight different targets. S-GTM outperforms conventional GTM for individual activities and performs similarly to the Lasso multitask learning algorithm, although it is still slightly less accurate than the Random Forest method.
Data-driven confounder selection via Markov and Bayesian networks.
Häggström, Jenny
2018-06-01
To unbiasedly estimate a causal effect on an outcome, unconfoundedness is often assumed. If there is sufficient knowledge on the underlying causal structure then existing confounder selection criteria can be used to select subsets of the observed pretreatment covariates, X, sufficient for unconfoundedness, if such subsets exist. Here, estimation of these target subsets is considered when the underlying causal structure is unknown. The proposed method is to model the causal structure by a probabilistic graphical model, for example, a Markov or Bayesian network, estimate this graph from observed data and select the target subsets given the estimated graph. The approach is evaluated by simulation both in a high-dimensional setting where unconfoundedness holds given X and in a setting where unconfoundedness only holds given subsets of X. Several common target subsets are investigated and the selected subsets are compared with respect to accuracy in estimating the average causal effect. The proposed method is implemented with existing software that can easily handle high-dimensional data, in terms of large samples and large number of covariates. The results from the simulation study show that, if unconfoundedness holds given X, this approach is very successful in selecting the target subsets, outperforming alternative approaches based on random forests and LASSO, and that the subset estimating the target subset containing all causes of outcome yields the smallest MSE in the average causal effect estimation. © 2017, The International Biometric Society.
Modified Regression Correlation Coefficient for Poisson Regression Model
NASA Astrophysics Data System (ADS)
Kaengthong, Nattacha; Domthong, Uthumporn
2017-09-01
This study examines widely used indicators of the predictive power of the Generalized Linear Model (GLM), which often have some restrictions. We are interested in the regression correlation coefficient for a Poisson regression model. This is a measure of predictive power, defined by the relationship between the dependent variable (Y) and the expected value of the dependent variable given the independent variables [E(Y|X)] for the Poisson regression model. The dependent variable is distributed as Poisson. The purpose of this research was to modify the regression correlation coefficient for the Poisson regression model. We also compare the proposed modified regression correlation coefficient with the traditional regression correlation coefficient in the case of two or more independent variables and in the presence of multicollinearity among the independent variables. The results show that the proposed regression correlation coefficient is better than the traditional regression correlation coefficient in terms of bias and root mean square error (RMSE).
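The traditional regression correlation coefficient described above is the correlation between Y and the fitted E(Y|X). A minimal sketch follows, assuming a log-link Poisson model fitted by Newton-Raphson in plain NumPy. The function names are hypothetical, and the modified coefficient proposed in the study is not reproduced because the abstract does not define it.

```python
import numpy as np

def fit_poisson(X, y, n_iter=50, tol=1e-10):
    """Fit a log-link Poisson regression by Newton-Raphson; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    beta[0] = np.log(np.mean(y))            # start from the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X1 @ beta)
        gradient = X1.T @ (y - mu)          # score vector
        hessian = X1.T @ (X1 * mu[:, None]) # Fisher information
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def regression_correlation(X, y):
    """Pearson correlation between y and the fitted Poisson mean E(Y|X)."""
    beta = fit_poisson(X, y)
    mu_hat = np.exp(np.column_stack([np.ones(len(y)), X]) @ beta)
    return np.corrcoef(y, mu_hat)[0, 1]
```

With strongly collinear predictors the information matrix becomes ill-conditioned and the Newton step may be unstable, which is one practical reason the multicollinear case is of interest.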
Richetin, Juliette; Preti, Emanuele; Costantini, Giulio; De Panfilis, Chiara
2017-01-01
We argue that the series of traits characterizing Borderline Personality Disorder samples do not weigh equally. In this regard, we believe that network approaches employed recently in Personality and Psychopathology research to provide information about the differential relationships among symptoms would be useful to test our claim. To our knowledge, this approach has never been applied to personality disorders. We applied network analysis to the nine Borderline Personality Disorder traits to explore their relationships in two samples drawn from university students and clinical populations (N = 1317 and N = 96, respectively). We used the Fused Graphical Lasso, a technique that allows estimating networks from different populations separately while considering their similarities and differences. Moreover, we examined centrality indices to determine the relative importance of each symptom in each network. The general structure of the two networks was very similar in the two samples, although some differences were detected. Results indicate the centrality of mainly affective instability, identity, and effort to avoid abandonment aspects in Borderline Personality Disorder. Results are consistent with the new DSM Alternative Model for Personality Disorders. We discuss them in terms of implications for therapy.