Does rational selection of training and test sets improve the outcome of QSAR modeling?
Martin, Todd M; Harten, Paul; Young, Douglas M; Muratov, Eugene N; Golbraikh, Alexander; Zhu, Hao; Tropsha, Alexander
2012-10-22
Prior to using a quantitative structure activity relationship (QSAR) model for external predictions, its predictive power should be established and validated. In the absence of a true external data set, the best way to validate the predictive ability of a model is to perform its statistical external validation. In statistical external validation, the overall data set is divided into training and test sets. Commonly, this splitting is performed using random division. Rational splitting methods can divide data sets into training and test sets in an intelligent fashion. The purpose of this study was to determine whether rational division methods lead to more predictive models compared to random division. A special data splitting procedure was used to facilitate the comparison between random and rational division methods. For each toxicity end point, the overall data set was divided into a modeling set (80% of the overall set) and an external evaluation set (20% of the overall set) using random division. The modeling set was then subdivided into a training set (80% of the modeling set) and a test set (20% of the modeling set) using rational division methods and by using random division. The Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms were used as the rational division methods. The hierarchical clustering, random forest, and k-nearest neighbor (kNN) methods were used to develop QSAR models based on the training sets. For kNN QSAR, multiple training and test sets were generated, and multiple QSAR models were built. The results of this study indicate that models based on rational division methods generate better statistical results for the test sets than models based on random division, but the predictive power of both types of models are comparable.
Boligon, A A; Baldi, F; Mercadante, M E Z; Lobo, R B; Pereira, R J; Albuquerque, L G
2011-06-28
We quantified the potential increase in accuracy of expected breeding value for weights of Nelore cattle, from birth to mature age, using multi-trait and random regression models on Legendre polynomials and B-spline functions. A total of 87,712 weight records from 8144 females were used, recorded every three months from birth to mature age from the Nelore Brazil Program. For random regression analyses, all female weight records from birth to eight years of age (data set I) were considered. From this general data set, a subset was created (data set II), which included only nine weight records: at birth, weaning, 365 and 550 days of age, and 2, 3, 4, 5, and 6 years of age. Data set II was analyzed using random regression and multi-trait models. The model of analysis included the contemporary group as fixed effects and age of dam as a linear and quadratic covariable. In the random regression analyses, average growth trends were modeled using a cubic regression on orthogonal polynomials of age. Residual variances were modeled by a step function with five classes. Legendre polynomials of fourth and sixth order were utilized to model the direct genetic and animal permanent environmental effects, respectively, while third-order Legendre polynomials were considered for maternal genetic and maternal permanent environmental effects. Quadratic polynomials were applied to model all random effects in random regression models on B-spline functions. Direct genetic and animal permanent environmental effects were modeled using three segments or five coefficients, and genetic maternal and maternal permanent environmental effects were modeled with one segment or three coefficients in the random regression models on B-spline functions. For both data sets (I and II), animals ranked differently according to expected breeding value obtained by random regression or multi-trait models. With random regression models, the highest gains in accuracy were obtained at ages with a low number of weight records. The results indicate that random regression models provide more accurate expected breeding values than the traditionally finite multi-trait models. Thus, higher genetic responses are expected for beef cattle growth traits by replacing a multi-trait model with random regression models for genetic evaluation. B-spline functions could be applied as an alternative to Legendre polynomials to model covariance functions for weights from birth to mature age.
NASA Astrophysics Data System (ADS)
Bennett, D. L.; Brene, N.; Nielsen, H. B.
1987-01-01
The goal of random dynamics is the derivation of the laws of Nature as we know them (standard model) from inessential assumptions. The inessential assumptions made here are expressed as sets of general models at extremely high energies: gauge glass and spacetime foam. Both sets of models lead tentatively to the standard model.
Li, Baoyue; Lingsma, Hester F; Steyerberg, Ewout W; Lesaffre, Emmanuel
2011-05-23
Logistic random effects models are a popular tool to analyze multilevel also called hierarchical data with a binary or ordinal outcome. Here, we aim to compare different statistical software implementations of these models. We used individual patient data from 8509 patients in 231 centers with moderate and severe Traumatic Brain Injury (TBI) enrolled in eight Randomized Controlled Trials (RCTs) and three observational studies. We fitted logistic random effects regression models with the 5-point Glasgow Outcome Scale (GOS) as outcome, both dichotomized as well as ordinal, with center and/or trial as random effects, and as covariates age, motor score, pupil reactivity or trial. We then compared the implementations of frequentist and Bayesian methods to estimate the fixed and random effects. Frequentist approaches included R (lme4), Stata (GLLAMM), SAS (GLIMMIX and NLMIXED), MLwiN ([R]IGLS) and MIXOR, Bayesian approaches included WinBUGS, MLwiN (MCMC), R package MCMCglmm and SAS experimental procedure MCMC.Three data sets (the full data set and two sub-datasets) were analysed using basically two logistic random effects models with either one random effect for the center or two random effects for center and trial. For the ordinal outcome in the full data set also a proportional odds model with a random center effect was fitted. The packages gave similar parameter estimates for both the fixed and random effects and for the binary (and ordinal) models for the main study and when based on a relatively large number of level-1 (patient level) data compared to the number of level-2 (hospital level) data. However, when based on relatively sparse data set, i.e. when the numbers of level-1 and level-2 data units were about the same, the frequentist and Bayesian approaches showed somewhat different results. The software implementations differ considerably in flexibility, computation time, and usability. There are also differences in the availability of additional tools for model evaluation, such as diagnostic plots. The experimental SAS (version 9.2) procedure MCMC appeared to be inefficient. On relatively large data sets, the different software implementations of logistic random effects regression models produced similar results. Thus, for a large data set there seems to be no explicit preference (of course if there is no preference from a philosophical point of view) for either a frequentist or Bayesian approach (if based on vague priors). The choice for a particular implementation may largely depend on the desired flexibility, and the usability of the package. For small data sets the random effects variances are difficult to estimate. In the frequentist approaches the MLE of this variance was often estimated zero with a standard error that is either zero or could not be determined, while for Bayesian methods the estimates could depend on the chosen "non-informative" prior of the variance parameter. The starting value for the variance parameter may be also critical for the convergence of the Markov chain.
Mixed models approaches for joint modeling of different types of responses.
Ivanova, Anna; Molenberghs, Geert; Verbeke, Geert
2016-01-01
In many biomedical studies, one jointly collects longitudinal continuous, binary, and survival outcomes, possibly with some observations missing. Random-effects models, sometimes called shared-parameter models or frailty models, received a lot of attention. In such models, the corresponding variance components can be employed to capture the association between the various sequences. In some cases, random effects are considered common to various sequences, perhaps up to a scaling factor; in others, there are different but correlated random effects. Even though a variety of data types has been considered in the literature, less attention has been devoted to ordinal data. For univariate longitudinal or hierarchical data, the proportional odds mixed model (POMM) is an instance of the generalized linear mixed model (GLMM; Breslow and Clayton, 1993). Ordinal data are conveniently replaced by a parsimonious set of dummies, which in the longitudinal setting leads to a repeated set of dummies. When ordinal longitudinal data are part of a joint model, the complexity increases further. This is the setting considered in this paper. We formulate a random-effects based model that, in addition, allows for overdispersion. Using two case studies, it is shown that the combination of random effects to capture association with further correction for overdispersion can improve the model's fit considerably and that the resulting models allow to answer research questions that could not be addressed otherwise. Parameters can be estimated in a fairly straightforward way, using the SAS procedure NLMIXED.
2011-01-01
Background Logistic random effects models are a popular tool to analyze multilevel also called hierarchical data with a binary or ordinal outcome. Here, we aim to compare different statistical software implementations of these models. Methods We used individual patient data from 8509 patients in 231 centers with moderate and severe Traumatic Brain Injury (TBI) enrolled in eight Randomized Controlled Trials (RCTs) and three observational studies. We fitted logistic random effects regression models with the 5-point Glasgow Outcome Scale (GOS) as outcome, both dichotomized as well as ordinal, with center and/or trial as random effects, and as covariates age, motor score, pupil reactivity or trial. We then compared the implementations of frequentist and Bayesian methods to estimate the fixed and random effects. Frequentist approaches included R (lme4), Stata (GLLAMM), SAS (GLIMMIX and NLMIXED), MLwiN ([R]IGLS) and MIXOR, Bayesian approaches included WinBUGS, MLwiN (MCMC), R package MCMCglmm and SAS experimental procedure MCMC. Three data sets (the full data set and two sub-datasets) were analysed using basically two logistic random effects models with either one random effect for the center or two random effects for center and trial. For the ordinal outcome in the full data set also a proportional odds model with a random center effect was fitted. Results The packages gave similar parameter estimates for both the fixed and random effects and for the binary (and ordinal) models for the main study and when based on a relatively large number of level-1 (patient level) data compared to the number of level-2 (hospital level) data. However, when based on relatively sparse data set, i.e. when the numbers of level-1 and level-2 data units were about the same, the frequentist and Bayesian approaches showed somewhat different results. The software implementations differ considerably in flexibility, computation time, and usability. There are also differences in the availability of additional tools for model evaluation, such as diagnostic plots. The experimental SAS (version 9.2) procedure MCMC appeared to be inefficient. Conclusions On relatively large data sets, the different software implementations of logistic random effects regression models produced similar results. Thus, for a large data set there seems to be no explicit preference (of course if there is no preference from a philosophical point of view) for either a frequentist or Bayesian approach (if based on vague priors). The choice for a particular implementation may largely depend on the desired flexibility, and the usability of the package. For small data sets the random effects variances are difficult to estimate. In the frequentist approaches the MLE of this variance was often estimated zero with a standard error that is either zero or could not be determined, while for Bayesian methods the estimates could depend on the chosen "non-informative" prior of the variance parameter. The starting value for the variance parameter may be also critical for the convergence of the Markov chain. PMID:21605357
An instrumental variable random-coefficients model for binary outcomes
Chesher, Andrew; Rosen, Adam M
2014-01-01
In this paper, we study a random-coefficients model for a binary outcome. We allow for the possibility that some or even all of the explanatory variables are arbitrarily correlated with the random coefficients, thus permitting endogeneity. We assume the existence of observed instrumental variables Z that are jointly independent with the random coefficients, although we place no structure on the joint determination of the endogenous variable X and instruments Z, as would be required for a control function approach. The model fits within the spectrum of generalized instrumental variable models, and we thus apply identification results from our previous studies of such models to the present context, demonstrating their use. Specifically, we characterize the identified set for the distribution of random coefficients in the binary response model with endogeneity via a collection of conditional moment inequalities, and we investigate the structure of these sets by way of numerical illustration. PMID:25798048
Good, Andrew C; Hermsmeier, Mark A
2007-01-01
Research into the advancement of computer-aided molecular design (CAMD) has a tendency to focus on the discipline of algorithm development. Such efforts are often wrought to the detriment of the data set selection and analysis used in said algorithm validation. Here we highlight the potential problems this can cause in the context of druglikeness classification. More rigorous efforts are applied to the selection of decoy (nondruglike) molecules from the ACD. Comparisons are made between model performance using the standard technique of random test set creation with test sets derived from explicit ontological separation by drug class. The dangers of viewing druglike space as sufficiently coherent to permit simple classification are highlighted. In addition the issues inherent in applying unfiltered data and random test set selection to (Q)SAR models utilizing large and supposedly heterogeneous databases are discussed.
Genomic-Enabled Prediction Kernel Models with Random Intercepts for Multi-environment Trials.
Cuevas, Jaime; Granato, Italo; Fritsche-Neto, Roberto; Montesinos-Lopez, Osval A; Burgueño, Juan; Bandeira E Sousa, Massaine; Crossa, José
2018-03-28
In this study, we compared the prediction accuracy of the main genotypic effect model (MM) without G×E interactions, the multi-environment single variance G×E deviation model (MDs), and the multi-environment environment-specific variance G×E deviation model (MDe) where the random genetic effects of the lines are modeled with the markers (or pedigree). With the objective of further modeling the genetic residual of the lines, we incorporated the random intercepts of the lines ([Formula: see text]) and generated another three models. Each of these 6 models were fitted with a linear kernel method (Genomic Best Linear Unbiased Predictor, GB) and a Gaussian Kernel (GK) method. We compared these 12 model-method combinations with another two multi-environment G×E interactions models with unstructured variance-covariances (MUC) using GB and GK kernels (4 model-method). Thus, we compared the genomic-enabled prediction accuracy of a total of 16 model-method combinations on two maize data sets with positive phenotypic correlations among environments, and on two wheat data sets with complex G×E that includes some negative and close to zero phenotypic correlations among environments. The two models (MDs and MDE with the random intercept of the lines and the GK method) were computationally efficient and gave high prediction accuracy in the two maize data sets. Regarding the more complex G×E wheat data sets, the prediction accuracy of the model-method combination with G×E, MDs and MDe, including the random intercepts of the lines with GK method had important savings in computing time as compared with the G×E interaction multi-environment models with unstructured variance-covariances but with lower genomic prediction accuracy. Copyright © 2018 Cuevas et al.
Genomic-Enabled Prediction Kernel Models with Random Intercepts for Multi-environment Trials
Cuevas, Jaime; Granato, Italo; Fritsche-Neto, Roberto; Montesinos-Lopez, Osval A.; Burgueño, Juan; Bandeira e Sousa, Massaine; Crossa, José
2018-01-01
In this study, we compared the prediction accuracy of the main genotypic effect model (MM) without G×E interactions, the multi-environment single variance G×E deviation model (MDs), and the multi-environment environment-specific variance G×E deviation model (MDe) where the random genetic effects of the lines are modeled with the markers (or pedigree). With the objective of further modeling the genetic residual of the lines, we incorporated the random intercepts of the lines (l) and generated another three models. Each of these 6 models were fitted with a linear kernel method (Genomic Best Linear Unbiased Predictor, GB) and a Gaussian Kernel (GK) method. We compared these 12 model-method combinations with another two multi-environment G×E interactions models with unstructured variance-covariances (MUC) using GB and GK kernels (4 model-method). Thus, we compared the genomic-enabled prediction accuracy of a total of 16 model-method combinations on two maize data sets with positive phenotypic correlations among environments, and on two wheat data sets with complex G×E that includes some negative and close to zero phenotypic correlations among environments. The two models (MDs and MDE with the random intercept of the lines and the GK method) were computationally efficient and gave high prediction accuracy in the two maize data sets. Regarding the more complex G×E wheat data sets, the prediction accuracy of the model-method combination with G×E, MDs and MDe, including the random intercepts of the lines with GK method had important savings in computing time as compared with the G×E interaction multi-environment models with unstructured variance-covariances but with lower genomic prediction accuracy. PMID:29476023
Prediction of aquatic toxicity mode of action using linear discriminant and random forest models.
Martin, Todd M; Grulke, Christopher M; Young, Douglas M; Russom, Christine L; Wang, Nina Y; Jackson, Crystal R; Barron, Mace G
2013-09-23
The ability to determine the mode of action (MOA) for a diverse group of chemicals is a critical part of ecological risk assessment and chemical regulation. However, existing MOA assignment approaches in ecotoxicology have been limited to a relatively few MOAs, have high uncertainty, or rely on professional judgment. In this study, machine based learning algorithms (linear discriminant analysis and random forest) were used to develop models for assigning aquatic toxicity MOA. These methods were selected since they have been shown to be able to correlate diverse data sets and provide an indication of the most important descriptors. A data set of MOA assignments for 924 chemicals was developed using a combination of high confidence assignments, international consensus classifications, ASTER (ASessment Tools for the Evaluation of Risk) predictions, and weight of evidence professional judgment based an assessment of structure and literature information. The overall data set was randomly divided into a training set (75%) and a validation set (25%) and then used to develop linear discriminant analysis (LDA) and random forest (RF) MOA assignment models. The LDA and RF models had high internal concordance and specificity and were able to produce overall prediction accuracies ranging from 84.5 to 87.7% for the validation set. These results demonstrate that computational chemistry approaches can be used to determine the acute toxicity MOAs across a large range of structures and mechanisms.
ERIC Educational Resources Information Center
Shire, Stephanie Y.; Chang, Ya-Chih; Shih, Wendy; Bracaglia, Suzanne; Kodjoe, Maria; Kasari, Connie
2017-01-01
Background: Interventions found to be effective in research settings are often not as effective when implemented in community settings. Considering children with autism, studies have rarely examined the efficacy of laboratory-tested interventions on child outcomes in community settings using randomized controlled designs. Methods: One hundred and…
Rotolo, Federico; Paoletti, Xavier; Burzykowski, Tomasz; Buyse, Marc; Michiels, Stefan
2017-01-01
Surrogate endpoints are often used in clinical trials instead of well-established hard endpoints for practical convenience. The meta-analytic approach relies on two measures of surrogacy: one at the individual level and one at the trial level. In the survival data setting, a two-step model based on copulas is commonly used. We present a new approach which employs a bivariate survival model with an individual random effect shared between the two endpoints and correlated treatment-by-trial interactions. We fit this model using auxiliary mixed Poisson models. We study via simulations the operating characteristics of this mixed Poisson approach as compared to the two-step copula approach. We illustrate the application of the methods on two individual patient data meta-analyses in gastric cancer, in the advanced setting (4069 patients from 20 randomized trials) and in the adjuvant setting (3288 patients from 14 randomized trials).
Modelling wildland fire propagation by tracking random fronts
NASA Astrophysics Data System (ADS)
Pagnini, G.; Mentrelli, A.
2013-11-01
Wildland fire propagation is studied in literature by two alternative approaches, namely the reaction-diffusion equation and the level-set method. These two approaches are considered alternative each other because the solution of the reaction-diffusion equation is generally a continuous smooth function that has an exponential decay and an infinite support, while the level-set method, which is a front tracking technique, generates a sharp function with a finite support. However, these two approaches can indeed be considered complementary and reconciled. Turbulent hot-air transport and fire spotting are phenomena with a random character that are extremely important in wildland fire propagation. As a consequence the fire front gets a random character, too. Hence a tracking method for random fronts is needed. In particular, the level-set contourn is here randomized accordingly to the probability density function of the interface particle displacement. Actually, when the level-set method is developed for tracking a front interface with a random motion, the resulting averaged process emerges to be governed by an evolution equation of the reaction-diffusion type. In this reconciled approach, the rate of spread of the fire keeps the same key and characterizing role proper to the level-set approach. The resulting model emerges to be suitable to simulate effects due to turbulent convection as fire flank and backing fire, the faster fire spread because of the actions by hot air pre-heating and by ember landing, and also the fire overcoming a firebreak zone that is a case not resolved by models based on the level-set method. Moreover, from the proposed formulation it follows a correction for the rate of spread formula due to the mean jump-length of firebrands in the downwind direction for the leeward sector of the fireline contour.
Set statistics in conductive bridge random access memory device with Cu/HfO{sub 2}/Pt structure
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Meiyun; Long, Shibing, E-mail: longshibing@ime.ac.cn; Wang, Guoming
2014-11-10
The switching parameter variation of resistive switching memory is one of the most important challenges in its application. In this letter, we have studied the set statistics of conductive bridge random access memory with a Cu/HfO{sub 2}/Pt structure. The experimental distributions of the set parameters in several off resistance ranges are shown to nicely fit a Weibull model. The Weibull slopes of the set voltage and current increase and decrease logarithmically with off resistance, respectively. This experimental behavior is perfectly captured by a Monte Carlo simulator based on the cell-based set voltage statistics model and the Quantum Point Contact electronmore » transport model. Our work provides indications for the improvement of the switching uniformity.« less
Using partial site aggregation to reduce bias in random utility travel cost models
NASA Astrophysics Data System (ADS)
Lupi, Frank; Feather, Peter M.
1998-12-01
We propose a "partial aggregation" strategy for defining the recreation sites that enter choice sets in random utility models. Under the proposal, the most popular sites and sites that will be the subject of policy analysis enter choice sets as individual sites while remaining sites are aggregated into groups of similar sites. The scheme balances the desire to include all potential substitute sites in the choice sets with practical data and modeling constraints. Unlike fully aggregate models, our analysis and empirical applications suggest that the partial aggregation approach reasonably approximates the results of a disaggregate model. The partial aggregation approach offers all of the data and computational advantages of models with aggregate sites but does not suffer from the same degree of bias as fully aggregate models.
ERIC Educational Resources Information Center
Yu, Bing; Hong, Guanglei
2012-01-01
This study uses simulation examples representing three types of treatment assignment mechanisms in data generation (the random intercept and slopes setting, the random intercept setting, and a third setting with a cluster-level treatment and an individual-level outcome) in order to determine optimal procedures for reducing bias and improving…
Random forest models to predict aqueous solubility.
Palmer, David S; O'Boyle, Noel M; Glen, Robert C; Mitchell, John B O
2007-01-01
Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 degrees C gave an r2 = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets are compared to the chemical space occupied by molecules in the MDL drug data report, on the basis of molecular descriptors selected by the regression analysis.
NASA Astrophysics Data System (ADS)
Berger, Noam; Mukherjee, Chiranjib; Okamura, Kazuki
2018-03-01
We prove a quenched large deviation principle (LDP) for a simple random walk on a supercritical percolation cluster (SRWPC) on {Z^d} ({d ≥ 2}). The models under interest include classical Bernoulli bond and site percolation as well as models that exhibit long range correlations, like the random cluster model, the random interlacement and the vacant set of random interlacements (for {d ≥ 3}) and the level sets of the Gaussian free field ({d≥ 3}). Inspired by the methods developed by Kosygina et al. (Commun Pure Appl Math 59:1489-1521, 2006) for proving quenched LDP for elliptic diffusions with a random drift, and by Yilmaz (Commun Pure Appl Math 62(8):1033-1075, 2009) and Rosenbluth (Quenched large deviations for multidimensional random walks in a random environment: a variational formula. Ph.D. thesis, NYU, arXiv:0804.1444v1) for similar results regarding elliptic random walks in random environment, we take the point of view of the moving particle and prove a large deviation principle for the quenched distribution of the pair empirical measures of the environment Markov chain in the non-elliptic case of SRWPC. Via a contraction principle, this reduces easily to a quenched LDP for the distribution of the mean velocity of the random walk and both rate functions admit explicit variational formulas. The main difficulty in our set up lies in the inherent non-ellipticity as well as the lack of translation-invariance stemming from conditioning on the fact that the origin belongs to the infinite cluster. We develop a unifying approach for proving quenched large deviations for SRWPC based on exploiting coercivity properties of the relative entropies in the context of convex variational analysis, combined with input from ergodic theory and invoking geometric properties of the supercritical percolation cluster.
NASA Astrophysics Data System (ADS)
Berger, Noam; Mukherjee, Chiranjib; Okamura, Kazuki
2017-12-01
We prove a quenched large deviation principle (LDP) for a simple random walk on a supercritical percolation cluster (SRWPC) on {Z^d} ({d ≥ 2} ). The models under interest include classical Bernoulli bond and site percolation as well as models that exhibit long range correlations, like the random cluster model, the random interlacement and the vacant set of random interlacements (for {d ≥ 3} ) and the level sets of the Gaussian free field ({d≥ 3} ). Inspired by the methods developed by Kosygina et al. (Commun Pure Appl Math 59:1489-1521, 2006) for proving quenched LDP for elliptic diffusions with a random drift, and by Yilmaz (Commun Pure Appl Math 62(8):1033-1075, 2009) and Rosenbluth (Quenched large deviations for multidimensional random walks in a random environment: a variational formula. Ph.D. thesis, NYU, arXiv:0804.1444v1) for similar results regarding elliptic random walks in random environment, we take the point of view of the moving particle and prove a large deviation principle for the quenched distribution of the pair empirical measures of the environment Markov chain in the non-elliptic case of SRWPC. Via a contraction principle, this reduces easily to a quenched LDP for the distribution of the mean velocity of the random walk and both rate functions admit explicit variational formulas. The main difficulty in our set up lies in the inherent non-ellipticity as well as the lack of translation-invariance stemming from conditioning on the fact that the origin belongs to the infinite cluster. We develop a unifying approach for proving quenched large deviations for SRWPC based on exploiting coercivity properties of the relative entropies in the context of convex variational analysis, combined with input from ergodic theory and invoking geometric properties of the supercritical percolation cluster.
NASA Astrophysics Data System (ADS)
Gong, Yue-Feng; Song, Zhi-Tang; Ling, Yun; Liu, Yan; Li, Yi-Jin
2010-06-01
A three-dimensional finite element model for phase change random access memory is established to simulate electric, thermal and phase state distribution during (SET) operation. The model is applied to simulate the SET behaviors of the heater addition structure (HS) and the ring-type contact in the bottom electrode (RIB) structure. The simulation results indicate that the small bottom electrode contactor (BEC) is beneficial for heat efficiency and reliability in the HS cell, and the bottom electrode contactor with size Fx = 80 nm is a good choice for the RIB cell. Also shown is that the appropriate SET pulse time is 100 ns for the low power consumption and fast operation.
Improvement of Predictive Ability by Uniform Coverage of the Target Genetic Space
Bustos-Korts, Daniela; Malosetti, Marcos; Chapman, Scott; Biddulph, Ben; van Eeuwijk, Fred
2016-01-01
Genome-enabled prediction provides breeders with the means to increase the number of genotypes that can be evaluated for selection. One of the major challenges in genome-enabled prediction is how to construct a training set of genotypes from a calibration set that represents the target population of genotypes, where the calibration set is composed of a training and validation set. A random sampling protocol of genotypes from the calibration set will lead to low quality coverage of the total genetic space by the training set when the calibration set contains population structure. As a consequence, predictive ability will be affected negatively, because some parts of the genotypic diversity in the target population will be under-represented in the training set, whereas other parts will be over-represented. Therefore, we propose a training set construction method that uniformly samples the genetic space spanned by the target population of genotypes, thereby increasing predictive ability. To evaluate our method, we constructed training sets alongside with the identification of corresponding genomic prediction models for four genotype panels that differed in the amount of population structure they contained (maize Flint, maize Dent, wheat, and rice). Training sets were constructed using uniform sampling, stratified-uniform sampling, stratified sampling and random sampling. We compared these methods with a method that maximizes the generalized coefficient of determination (CD). Several training set sizes were considered. We investigated four genomic prediction models: multi-locus QTL models, GBLUP models, combinations of QTL and GBLUPs, and Reproducing Kernel Hilbert Space (RKHS) models. For the maize and wheat panels, construction of the training set under uniform sampling led to a larger predictive ability than under stratified and random sampling. The results of our methods were similar to those of the CD method. For the rice panel, all training set construction methods led to similar predictive ability, a reflection of the very strong population structure in this panel. PMID:27672112
ERIC Educational Resources Information Center
Fific, Mario; Little, Daniel R.; Nosofsky, Robert M.
2010-01-01
We formalize and provide tests of a set of logical-rule models for predicting perceptual classification response times (RTs) and choice probabilities. The models are developed by synthesizing mental-architecture, random-walk, and decision-bound approaches. According to the models, people make independent decisions about the locations of stimuli…
Zou, Zhengxia; Shi, Zhenwei
2018-03-01
We propose a new paradigm for target detection in high resolution aerial remote sensing images under small target priors. Previous remote sensing target detection methods frame the detection as learning of detection model + inference of class-label and bounding-box coordinates. Instead, we formulate it from a Bayesian view that at inference stage, the detection model is adaptively updated to maximize its posterior that is determined by both training and observation. We call this paradigm "random access memories (RAM)." In this paradigm, "Memories" can be interpreted as any model distribution learned from training data and "random access" means accessing memories and randomly adjusting the model at detection phase to obtain better adaptivity to any unseen distribution of test data. By leveraging some latest detection techniques e.g., deep Convolutional Neural Networks and multi-scale anchors, experimental results on a public remote sensing target detection data set show our method outperforms several other state of the art methods. We also introduce a new data set "LEarning, VIsion and Remote sensing laboratory (LEVIR)", which is one order of magnitude larger than other data sets of this field. LEVIR consists of a large set of Google Earth images, with over 22 k images and 10 k independently labeled targets. RAM gives noticeable upgrade of accuracy (an mean average precision improvement of 1% ~ 4%) of our baseline detectors with acceptable computational overhead.
Tangen, C M; Koch, G G
1999-03-01
In the randomized clinical trial setting, controlling for covariates is expected to produce variance reduction for the treatment parameter estimate and to adjust for random imbalances of covariates between the treatment groups. However, for the logistic regression model, variance reduction is not obviously obtained. This can lead to concerns about the assumptions of the logistic model. We introduce a complementary nonparametric method for covariate adjustment. It provides results that are usually compatible with expectations for analysis of covariance. The only assumptions required are based on randomization and sampling arguments. The resulting treatment parameter is a (unconditional) population average log-odds ratio that has been adjusted for random imbalance of covariates. Data from a randomized clinical trial are used to compare results from the traditional maximum likelihood logistic method with those from the nonparametric logistic method. We examine treatment parameter estimates, corresponding standard errors, and significance levels in models with and without covariate adjustment. In addition, we discuss differences between unconditional population average treatment parameters and conditional subpopulation average treatment parameters. Additional features of the nonparametric method, including stratified (multicenter) and multivariate (multivisit) analyses, are illustrated. Extensions of this methodology to the proportional odds model are also made.
NASA Astrophysics Data System (ADS)
Norajitra, Tobias; Meinzer, Hans-Peter; Maier-Hein, Klaus H.
2015-03-01
During image segmentation, 3D Statistical Shape Models (SSM) usually conduct a limited search for target landmarks within one-dimensional search profiles perpendicular to the model surface. In addition, landmark appearance is modeled only locally based on linear profiles and weak learners, altogether leading to segmentation errors from landmark ambiguities and limited search coverage. We present a new method for 3D SSM segmentation based on 3D Random Forest Regression Voting. For each surface landmark, a Random Regression Forest is trained that learns a 3D spatial displacement function between the according reference landmark and a set of surrounding sample points, based on an infinite set of non-local randomized 3D Haar-like features. Landmark search is then conducted omni-directionally within 3D search spaces, where voxelwise forest predictions on landmark position contribute to a common voting map which reflects the overall position estimate. Segmentation experiments were conducted on a set of 45 CT volumes of the human liver, of which 40 images were randomly chosen for training and 5 for testing. Without parameter optimization, using a simple candidate selection and a single resolution approach, excellent results were achieved, while faster convergence and better concavity segmentation were observed, altogether underlining the potential of our approach in terms of increased robustness from distinct landmark detection and from better search coverage.
Cure fraction model with random effects for regional variation in cancer survival.
Seppä, Karri; Hakulinen, Timo; Kim, Hyon-Jung; Läärä, Esa
2010-11-30
Assessing regional differences in the survival of cancer patients is important but difficult when separate regions are small or sparsely populated. In this paper, we apply a mixture cure fraction model with random effects to cause-specific survival data of female breast cancer patients collected by the population-based Finnish Cancer Registry. Two sets of random effects were used to capture the regional variation in the cure fraction and in the survival of the non-cured patients, respectively. This hierarchical model was implemented in a Bayesian framework using a Metropolis-within-Gibbs algorithm. To avoid poor mixing of the Markov chain, when the variance of either set of random effects was close to zero, posterior simulations were based on a parameter-expanded model with tailor-made proposal distributions in Metropolis steps. The random effects allowed the fitting of the cure fraction model to the sparse regional data and the estimation of the regional variation in 10-year cause-specific breast cancer survival with a parsimonious number of parameters. Before 1986, the capital of Finland clearly stood out from the rest, but since then all the 21 hospital districts have achieved approximately the same level of survival. Copyright © 2010 John Wiley & Sons, Ltd.
Fific, Mario; Little, Daniel R; Nosofsky, Robert M
2010-04-01
We formalize and provide tests of a set of logical-rule models for predicting perceptual classification response times (RTs) and choice probabilities. The models are developed by synthesizing mental-architecture, random-walk, and decision-bound approaches. According to the models, people make independent decisions about the locations of stimuli along a set of component dimensions. Those independent decisions are then combined via logical rules to determine the overall categorization response. The time course of the independent decisions is modeled via random-walk processes operating along individual dimensions. Alternative mental architectures are used as mechanisms for combining the independent decisions to implement the logical rules. We derive fundamental qualitative contrasts for distinguishing among the predictions of the rule models and major alternative models of classification RT. We also use the models to predict detailed RT-distribution data associated with individual stimuli in tasks of speeded perceptual classification. PsycINFO Database Record (c) 2010 APA, all rights reserved.
Models for the hotspot distribution
NASA Technical Reports Server (NTRS)
Jurdy, Donna M.; Stefanick, Michael
1990-01-01
Published hotspot catalogs all show a hemispheric concentration beyond what can be expected by chance. Cumulative distributions about the center of concentration are described by a power law with a fractal dimension closer to 1 than 2. Random sets of the corresponding sizes do not show this effect. A simple shift of the random sets away from a point would produce distributions similar to those of hotspot sets. The possible relation of the hotspots to the locations of ridges and subduction zones is tested using large sets of randomly-generated points to estimate areas within given distances of the plate boundaries. The probability of finding the observed number of hotspots within 10 deg of the ridges is about what is expected.
A unifying framework for marginalized random intercept models of correlated binary outcomes
Swihart, Bruce J.; Caffo, Brian S.; Crainiceanu, Ciprian M.
2013-01-01
We demonstrate that many current approaches for marginal modeling of correlated binary outcomes produce likelihoods that are equivalent to the copula-based models herein. These general copula models of underlying latent threshold random variables yield likelihood-based models for marginal fixed effects estimation and interpretation in the analysis of correlated binary data with exchangeable correlation structures. Moreover, we propose a nomenclature and set of model relationships that substantially elucidates the complex area of marginalized random intercept models for binary data. A diverse collection of didactic mathematical and numerical examples are given to illustrate concepts. PMID:25342871
NASA Astrophysics Data System (ADS)
Goudarzi, Nasser
2016-04-01
In this work, two new and powerful chemometrics methods are applied for the modeling and prediction of the 19F chemical shift values of some fluorinated organic compounds. The radial basis function-partial least square (RBF-PLS) and random forest (RF) are employed to construct the models to predict the 19F chemical shifts. In this study, we didn't used from any variable selection method and RF method can be used as variable selection and modeling technique. Effects of the important parameters affecting the ability of the RF prediction power such as the number of trees (nt) and the number of randomly selected variables to split each node (m) were investigated. The root-mean-square errors of prediction (RMSEP) for the training set and the prediction set for the RBF-PLS and RF models were 44.70, 23.86, 29.77, and 23.69, respectively. Also, the correlation coefficients of the prediction set for the RBF-PLS and RF models were 0.8684 and 0.9313, respectively. The results obtained reveal that the RF model can be used as a powerful chemometrics tool for the quantitative structure-property relationship (QSPR) studies.
The Ising model coupled to 2d orders
NASA Astrophysics Data System (ADS)
Glaser, Lisa
2018-04-01
In this article we make first steps in coupling matter to causal set theory in the path integral. We explore the case of the Ising model coupled to the 2d discrete Einstein Hilbert action, restricted to the 2d orders. We probe the phase diagram in terms of the Wick rotation parameter β and the Ising coupling j and find that the matter and the causal sets together give rise to an interesting phase structure. The couplings give rise to five different phases. The causal sets take on random or crystalline characteristics as described in Surya (2012 Class. Quantum Grav. 29 132001) and the Ising model can be correlated or uncorrelated on the random orders and correlated, uncorrelated or anti-correlated on the crystalline orders. We find that at least one new phase transition arises, in which the Ising spins push the causal set into the crystalline phase.
Marchese Robinson, Richard L; Palczewska, Anna; Palczewski, Jan; Kidley, Nathan
2017-08-28
The ability to interpret the predictions made by quantitative structure-activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. These programs are the rfFC package ( https://r-forge.r-project.org/R/?group_id=1725 ) for the R statistical programming language and the Python program HeatMapWrapper [ https://doi.org/10.5281/zenodo.495163 ] for heat map generation.
Calibration of Predictor Models Using Multiple Validation Experiments
NASA Technical Reports Server (NTRS)
Crespo, Luis G.; Kenny, Sean P.; Giesy, Daniel P.
2015-01-01
This paper presents a framework for calibrating computational models using data from several and possibly dissimilar validation experiments. The offset between model predictions and observations, which might be caused by measurement noise, model-form uncertainty, and numerical error, drives the process by which uncertainty in the models parameters is characterized. The resulting description of uncertainty along with the computational model constitute a predictor model. Two types of predictor models are studied: Interval Predictor Models (IPMs) and Random Predictor Models (RPMs). IPMs use sets to characterize uncertainty, whereas RPMs use random vectors. The propagation of a set through a model makes the response an interval valued function of the state, whereas the propagation of a random vector yields a random process. Optimization-based strategies for calculating both types of predictor models are proposed. Whereas the formulations used to calculate IPMs target solutions leading to the interval value function of minimal spread containing all observations, those for RPMs seek to maximize the models' ability to reproduce the distribution of observations. Regarding RPMs, we choose a structure for the random vector (i.e., the assignment of probability to points in the parameter space) solely dependent on the prediction error. As such, the probabilistic description of uncertainty is not a subjective assignment of belief, nor is it expected to asymptotically converge to a fixed value, but instead it casts the model's ability to reproduce the experimental data. This framework enables evaluating the spread and distribution of the predicted response of target applications depending on the same parameters beyond the validation domain.
Modelling wildland fire propagation by tracking random fronts
NASA Astrophysics Data System (ADS)
Pagnini, G.; Mentrelli, A.
2014-08-01
Wildland fire propagation is studied in the literature by two alternative approaches, namely the reaction-diffusion equation and the level-set method. These two approaches are considered alternatives to each other because the solution of the reaction-diffusion equation is generally a continuous smooth function that has an exponential decay, and it is not zero in an infinite domain, while the level-set method, which is a front tracking technique, generates a sharp function that is not zero inside a compact domain. However, these two approaches can indeed be considered complementary and reconciled. Turbulent hot-air transport and fire spotting are phenomena with a random nature and they are extremely important in wildland fire propagation. Consequently, the fire front gets a random character, too; hence, a tracking method for random fronts is needed. In particular, the level-set contour is randomised here according to the probability density function of the interface particle displacement. Actually, when the level-set method is developed for tracking a front interface with a random motion, the resulting averaged process emerges to be governed by an evolution equation of the reaction-diffusion type. In this reconciled approach, the rate of spread of the fire keeps the same key and characterising role that is typical of the level-set approach. The resulting model emerges to be suitable for simulating effects due to turbulent convection, such as fire flank and backing fire, the faster fire spread being because of the actions by hot-air pre-heating and by ember landing, and also due to the fire overcoming a fire-break zone, which is a case not resolved by models based on the level-set method. Moreover, from the proposed formulation, a correction follows for the formula of the rate of spread which is due to the mean jump length of firebrands in the downwind direction for the leeward sector of the fireline contour. The presented study constitutes a proof of concept, and it needs to be subjected to a future validation.
ERIC Educational Resources Information Center
Chorpita, Bruce F.; Daleiden, Eric L.
2009-01-01
This study applied the distillation and matching model to 322 randomized clinical trials for child mental health treatments. The model involved initial data reduction of 615 treatment protocol descriptions by means of a set of codes describing discrete clinical strategies, referred to as practice elements. Practice elements were then summarized in…
Fuzzy Stochastic Petri Nets for Modeling Biological Systems with Uncertain Kinetic Parameters
Liu, Fei; Heiner, Monika; Yang, Ming
2016-01-01
Stochastic Petri nets (SPNs) have been widely used to model randomness which is an inherent feature of biological systems. However, for many biological systems, some kinetic parameters may be uncertain due to incomplete, vague or missing kinetic data (often called fuzzy uncertainty), or naturally vary, e.g., between different individuals, experimental conditions, etc. (often called variability), which has prevented a wider application of SPNs that require accurate parameters. Considering the strength of fuzzy sets to deal with uncertain information, we apply a specific type of stochastic Petri nets, fuzzy stochastic Petri nets (FSPNs), to model and analyze biological systems with uncertain kinetic parameters. FSPNs combine SPNs and fuzzy sets, thereby taking into account both randomness and fuzziness of biological systems. For a biological system, SPNs model the randomness, while fuzzy sets model kinetic parameters with fuzzy uncertainty or variability by associating each parameter with a fuzzy number instead of a crisp real value. We introduce a simulation-based analysis method for FSPNs to explore the uncertainties of outputs resulting from the uncertainties associated with input parameters, which works equally well for bounded and unbounded models. We illustrate our approach using a yeast polarization model having an infinite state space, which shows the appropriateness of FSPNs in combination with simulation-based analysis for modeling and analyzing biological systems with uncertain information. PMID:26910830
Regularization of the big bang singularity with random perturbations
NASA Astrophysics Data System (ADS)
Belbruno, Edward; Xue, BingKan
2018-03-01
We show how to regularize the big bang singularity in the presence of random perturbations modeled by Brownian motion using stochastic methods. We prove that the physical variables in a contracting universe dominated by a scalar field can be continuously and uniquely extended through the big bang as a function of time to an expanding universe only for a discrete set of values of the equation of state satisfying special co-prime number conditions. This result significantly generalizes a previous result (Xue and Belbruno 2014 Class. Quantum Grav. 31 165002) that did not model random perturbations. This result implies that the extension from a contracting to an expanding universe for the discrete set of co-prime equation of state is robust, which is a surprising result. Implications for a purely expanding universe are discussed, such as a non-smooth, randomly varying scale factor near the big bang.
NASA Astrophysics Data System (ADS)
Gong, Yue-Feng; Song, Zhi-Tang; Ling, Yun; Liu, Yan; Feng, Song-Lin
2009-11-01
A three-dimensional finite element model for phase change random access memory (PCRAM) is established for comprehensive electrical and thermal analysis during SET operation. The SET behaviours of the heater addition structure (HS) and the ring-type contact in bottom electrode (RIB) structure are compared with each other. There are two ways to reduce the RESET current, applying a high resistivity interfacial layer and building a new device structure. The simulation results indicate that the variation of SET current with different power reduction ways is little. This study takes the RESET and SET operation current into consideration, showing that the RIB structure PCRAM cell is suitable for future devices with high heat efficiency and high-density, due to its high heat efficiency in RESET operation.
On chemical distances and shape theorems in percolation models with long-range correlations
NASA Astrophysics Data System (ADS)
Drewitz, Alexander; Ráth, Balázs; Sapozhnikov, Artëm
2014-08-01
In this paper, we provide general conditions on a one parameter family of random infinite subsets of {{Z}}^d to contain a unique infinite connected component for which the chemical distances are comparable to the Euclidean distance. In addition, we show that these conditions also imply a shape theorem for the corresponding infinite connected component. By verifying these conditions for specific models, we obtain novel results about the structure of the infinite connected component of the vacant set of random interlacements and the level sets of the Gaussian free field. As a byproduct, we obtain alternative proofs to the corresponding results for random interlacements in the work of Černý and Popov ["On the internal distance in the interlacement set," Electron. J. Probab. 17(29), 1-25 (2012)], and while our main interest is in percolation models with long-range correlations, we also recover results in the spirit of the work of Antal and Pisztora ["On the chemical distance for supercritical Bernoulli percolation," Ann Probab. 24(2), 1036-1048 (1996)] for Bernoulli percolation. Finally, as a corollary, we derive new results about the (chemical) diameter of the largest connected component in the complement of the trace of the random walk on the torus.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gong, Y; Yu, J; Yeung, V
Purpose: Artificial neural networks (ANN) can be used to discover complex relations within datasets to help with medical decision making. This study aimed to develop an ANN method to predict two-year overall survival of patients with peri-ampullary cancer (PAC) following resection. Methods: Data were collected from 334 patients with PAC following resection treated in our institutional pancreatic tumor registry between 2006 and 2012. The dataset contains 14 variables including age, gender, T-stage, tumor differentiation, positive-lymph-node ratio, positive resection margins, chemotherapy, radiation therapy, and tumor histology.After censoring for two-year survival analysis, 309 patients were left, of which 44 patients (∼15%) weremore » randomly selected to form testing set. The remaining 265 cases were randomly divided into training set (211 cases, ∼80% of 265) and validation set (54 cases, ∼20% of 265) for 20 times to build 20 ANN models. Each ANN has one hidden layer with 5 units. The 20 ANN models were ranked according to their concordance index (c-index) of prediction on validation sets. To further improve prediction, the top 10% of ANN models were selected, and their outputs averaged for prediction on testing set. Results: By random division, 44 cases in testing set and the remaining 265 cases have approximately equal two-year survival rates, 36.4% and 35.5% respectively. The 20 ANN models, which were trained and validated on the 265 cases, yielded mean c-indexes as 0.59 and 0.63 on validation sets and the testing set, respectively. C-index was 0.72 when the two best ANN models (top 10%) were used in prediction on testing set. The c-index of Cox regression analysis was 0.63. Conclusion: ANN improved survival prediction for patients with PAC. More patient data and further analysis of additional factors may be needed for a more robust model, which will help guide physicians in providing optimal post-operative care. This project was supported by PA CURE Grant.« less
Maze, Mervyn
2016-02-01
The purpose of this report is to facilitate an understanding of the possible application of xenon for neuroprotection in critical care settings. This narrative review appraises the literature assessing the efficacy and safety of xenon in preclinical models of acute ongoing neurologic injury. Databases of the published literature (MEDLINE® and EMBASE™) were appraised for peer-reviewed manuscripts addressing the use of xenon in both preclinical models and disease states of acute ongoing neurologic injury. For randomized clinical trials not yet reported, the investigators' declarations in the National Institutes of Health clinical trials website were considered. While not a primary focus of this review, to date, xenon cannot be distinguished as superior for surgical anesthesia over existing alternatives in adults. Nevertheless, studies in a variety of preclinical disease models from multiple laboratories have consistently shown xenon's neuroprotective properties. These properties are enhanced in settings where xenon is combined with hypothermia. Small randomized clinical trials are underway to explore xenon's efficacy and safety in clinical settings of acute neurologic injury where hypothermia is the current standard of care. According to the evidence to date, the neuroprotective efficacy of xenon in preclinical models and its safety in clinical anesthesia set the stage for the launch of randomized clinical trials to determine whether these encouraging neuroprotective findings can be translated into clinical utility.
Structure of random discrete spacetime
NASA Technical Reports Server (NTRS)
Brightwell, Graham; Gregory, Ruth
1991-01-01
The usual picture of spacetime consists of a continuous manifold, together with a metric of Lorentzian signature which imposes a causal structure on the spacetime. A model, first suggested by Bombelli et al., is considered in which spacetime consists of a discrete set of points taken at random from a manifold, with only the causal structure on this set remaining. This structure constitutes a partially ordered set (or poset). Working from the poset alone, it is shown how to construct a metric on the space which closely approximates the metric on the original spacetime manifold, how to define the effective dimension of the spacetime, and how such quantities may depend on the scale of measurement. Possible desirable features of the model are discussed.
The structure of random discrete spacetime
NASA Technical Reports Server (NTRS)
Brightwell, Graham; Gregory, Ruth
1990-01-01
The usual picture of spacetime consists of a continuous manifold, together with a metric of Lorentzian signature which imposes a causal structure on the spacetime. A model, first suggested by Bombelli et al., is considered in which spacetime consists of a discrete set of points taken at random from a manifold, with only the causal structure on this set remaining. This structure constitutes a partially ordered set (or poset). Working from the poset alone, it is shown how to construct a metric on the space which closely approximates the metric on the original spacetime manifold, how to define the effective dimension of the spacetime, and how such quantities may depend on the scale of measurement. Possible desirable features of the model are discussed.
Preference heterogeneity in a count data model of demand for off-highway vehicle recreation
Thomas P Holmes; Jeffrey E Englin
2010-01-01
This paper examines heterogeneity in the preferences for OHV recreation by applying the random parameters Poisson model to a data set of off-highway vehicle (OHV) users at four National Forest sites in North Carolina. The analysis develops estimates of individual consumer surplus and finds that estimates are systematically affected by the random parameter specification...
Testing homogeneity in Weibull-regression models.
Bolfarine, Heleno; Valença, Dione M
2005-10-01
In survival studies with families or geographical units it may be of interest testing whether such groups are homogeneous for given explanatory variables. In this paper we consider score type tests for group homogeneity based on a mixing model in which the group effect is modelled as a random variable. As opposed to hazard-based frailty models, this model presents survival times that conditioned on the random effect, has an accelerated failure time representation. The test statistics requires only estimation of the conventional regression model without the random effect and does not require specifying the distribution of the random effect. The tests are derived for a Weibull regression model and in the uncensored situation, a closed form is obtained for the test statistic. A simulation study is used for comparing the power of the tests. The proposed tests are applied to real data sets with censored data.
A random spatial network model based on elementary postulates
Karlinger, Michael R.; Troutman, Brent M.
1989-01-01
A model for generating random spatial networks that is based on elementary postulates comparable to those of the random topology model is proposed. In contrast to the random topology model, this model ascribes a unique spatial specification to generated drainage networks, a distinguishing property of some network growth models. The simplicity of the postulates creates an opportunity for potential analytic investigations of the probabilistic structure of the drainage networks, while the spatial specification enables analyses of spatially dependent network properties. In the random topology model all drainage networks, conditioned on magnitude (number of first-order streams), are equally likely, whereas in this model all spanning trees of a grid, conditioned on area and drainage density, are equally likely. As a result, link lengths in the generated networks are not independent, as usually assumed in the random topology model. For a preliminary model evaluation, scale-dependent network characteristics, such as geometric diameter and link length properties, and topologic characteristics, such as bifurcation ratio, are computed for sets of drainage networks generated on square and rectangular grids. Statistics of the bifurcation and length ratios fall within the range of values reported for natural drainage networks, but geometric diameters tend to be relatively longer than those for natural networks.
Common Randomness Principles of Secrecy
ERIC Educational Resources Information Center
Tyagi, Himanshu
2013-01-01
This dissertation concerns the secure processing of distributed data by multiple terminals, using interactive public communication among themselves, in order to accomplish a given computational task. In the setting of a probabilistic multiterminal source model in which several terminals observe correlated random signals, we analyze secure…
Fox, Eric W; Hill, Ryan A; Leibowitz, Scott G; Olsen, Anthony R; Thornbrugh, Darren J; Weber, Marc H
2017-07-01
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used or stepwise procedures are employed which iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Stream Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
A social preference valuations set for EQ-5D health states in Flanders, Belgium.
Cleemput, Irina
2010-04-01
This study aimed at deriving a preference valuation set for EQ-5D health states from the general Flemish public in Belgium. A EuroQol valuation instrument with 16 health states to be valued on a visual analogue scale was sent to a random sample of 2,754 adults. The initial response rate was 35%. Eventually, 548 (20%) respondents provided useable valuations for modeling. Valuations for 245 health states were modeled using a random effects model. The selection of the model was based on two criteria: health state valuations must be consistent, and the difference with the directly observed valuations must be small. A model including a value decrement if any health dimension of the EQ-5D is on the worst level was selected to construct the social health state valuation set. A comparison with health state valuations from other countries showed similarities, especially with those from New Zealand. The use of a single preference valuation set across different health economic evaluations within a country is highly preferable to increase their usability for policy makers. This study contributes to the standardization of outcome measurement in economic evaluations in Belgium.
Nowosad, Jakub; Stach, Alfred; Kasprzyk, Idalia; Weryszko-Chmielewska, Elżbieta; Piotrowska-Weryszko, Krystyna; Puc, Małgorzata; Grewling, Łukasz; Pędziszewska, Anna; Uruska, Agnieszka; Myszkowska, Dorota; Chłopek, Kazimiera; Majkowska-Wojciechowska, Barbara
The aim of the study was to create and evaluate models for predicting high levels of daily pollen concentration of Corylus , Alnus , and Betula using a spatiotemporal correlation of pollen count. For each taxon, a high pollen count level was established according to the first allergy symptoms during exposure. The dataset was divided into a training set and a test set, using a stratified random split. For each taxon and city, the model was built using a random forest method. Corylus models performed poorly. However, the study revealed the possibility of predicting with substantial accuracy the occurrence of days with high pollen concentrations of Alnus and Betula using past pollen count data from monitoring sites. These results can be used for building (1) simpler models, which require data only from aerobiological monitoring sites, and (2) combined meteorological and aerobiological models for predicting high levels of pollen concentration.
On the generation of log-Lévy distributions and extreme randomness
NASA Astrophysics Data System (ADS)
Eliazar, Iddo; Klafter, Joseph
2011-10-01
The log-normal distribution is prevalent across the sciences, as it emerges from the combination of multiplicative processes and the central limit theorem (CLT). The CLT, beyond yielding the normal distribution, also yields the class of Lévy distributions. The log-Lévy distributions are the Lévy counterparts of the log-normal distribution, they appear in the context of ultraslow diffusion processes, and they are categorized by Mandelbrot as belonging to the class of extreme randomness. In this paper, we present a natural stochastic growth model from which both the log-normal distribution and the log-Lévy distributions emerge universally—the former in the case of deterministic underlying setting, and the latter in the case of stochastic underlying setting. In particular, we establish a stochastic growth model which universally generates Mandelbrot’s extreme randomness.
USDA-ARS?s Scientific Manuscript database
(Co)variance components for calving ease and stillbirth in US Holsteins were estimated using a single-trait threshold animal model and two different sets of data edits. Six sets of approximately 250,000 records each were created by randomly selecting herd codes without replacement from the data used...
Multisource passive acoustic tracking: an application of random finite set data fusion
NASA Astrophysics Data System (ADS)
Ali, Andreas M.; Hudson, Ralph E.; Lorenzelli, Flavio; Yao, Kung
2010-04-01
Multisource passive acoustic tracking is useful in animal bio-behavioral study by replacing or enhancing human involvement during and after field data collection. Multiple simultaneous vocalizations are a common occurrence in a forest or a jungle, where many species are encountered. Given a set of nodes that are capable of producing multiple direction-of-arrivals (DOAs), such data needs to be combined into meaningful estimates. Random Finite Set provides the mathematical probabilistic model, which is suitable for analysis and optimal estimation algorithm synthesis. Then the proposed algorithm has been verified using a simulation and a controlled test experiment.
Hsieh, Chung-Ho; Lu, Ruey-Hwa; Lee, Nai-Hsin; Chiu, Wen-Ta; Hsu, Min-Huei; Li, Yu-Chuan Jack
2011-01-01
Diagnosing acute appendicitis clinically is still difficult. We developed random forests, support vector machines, and artificial neural network models to diagnose acute appendicitis. Between January 2006 and December 2008, patients who had a consultation session with surgeons for suspected acute appendicitis were enrolled. Seventy-five percent of the data set was used to construct models including random forest, support vector machines, artificial neural networks, and logistic regression. Twenty-five percent of the data set was withheld to evaluate model performance. The area under the receiver operating characteristic curve (AUC) was used to evaluate performance, which was compared with that of the Alvarado score. Data from a total of 180 patients were collected, 135 used for training and 45 for testing. The mean age of patients was 39.4 years (range, 16-85). Final diagnosis revealed 115 patients with and 65 without appendicitis. The AUC of random forest, support vector machines, artificial neural networks, logistic regression, and Alvarado was 0.98, 0.96, 0.91, 0.87, and 0.77, respectively. The sensitivity, specificity, positive, and negative predictive values of random forest were 94%, 100%, 100%, and 87%, respectively. Random forest performed better than artificial neural networks, logistic regression, and Alvarado. We demonstrated that random forest can predict acute appendicitis with good accuracy and, deployed appropriately, can be an effective tool in clinical decision making. Copyright © 2011 Mosby, Inc. All rights reserved.
Bayesian Genomic Prediction with Genotype × Environment Interaction Kernel Models
Cuevas, Jaime; Crossa, José; Montesinos-López, Osval A.; Burgueño, Juan; Pérez-Rodríguez, Paulino; de los Campos, Gustavo
2016-01-01
The phenomenon of genotype × environment (G × E) interaction in plant breeding decreases selection accuracy, thereby negatively affecting genetic gains. Several genomic prediction models incorporating G × E have been recently developed and used in genomic selection of plant breeding programs. Genomic prediction models for assessing multi-environment G × E interaction are extensions of a single-environment model, and have advantages and limitations. In this study, we propose two multi-environment Bayesian genomic models: the first model considers genetic effects (u) that can be assessed by the Kronecker product of variance–covariance matrices of genetic correlations between environments and genomic kernels through markers under two linear kernel methods, linear (genomic best linear unbiased predictors, GBLUP) and Gaussian (Gaussian kernel, GK). The other model has the same genetic component as the first model (u) plus an extra component, f, that captures random effects between environments that were not captured by the random effects u. We used five CIMMYT data sets (one maize and four wheat) that were previously used in different studies. Results show that models with G × E always have superior prediction ability than single-environment models, and the higher prediction ability of multi-environment models with u and f over the multi-environment model with only u occurred 85% of the time with GBLUP and 45% of the time with GK across the five data sets. The latter result indicated that including the random effect f is still beneficial for increasing prediction ability after adjusting by the random effect u. PMID:27793970
Bayesian Genomic Prediction with Genotype × Environment Interaction Kernel Models.
Cuevas, Jaime; Crossa, José; Montesinos-López, Osval A; Burgueño, Juan; Pérez-Rodríguez, Paulino; de Los Campos, Gustavo
2017-01-05
The phenomenon of genotype × environment (G × E) interaction in plant breeding decreases selection accuracy, thereby negatively affecting genetic gains. Several genomic prediction models incorporating G × E have been recently developed and used in genomic selection of plant breeding programs. Genomic prediction models for assessing multi-environment G × E interaction are extensions of a single-environment model, and have advantages and limitations. In this study, we propose two multi-environment Bayesian genomic models: the first model considers genetic effects [Formula: see text] that can be assessed by the Kronecker product of variance-covariance matrices of genetic correlations between environments and genomic kernels through markers under two linear kernel methods, linear (genomic best linear unbiased predictors, GBLUP) and Gaussian (Gaussian kernel, GK). The other model has the same genetic component as the first model [Formula: see text] plus an extra component, F: , that captures random effects between environments that were not captured by the random effects [Formula: see text] We used five CIMMYT data sets (one maize and four wheat) that were previously used in different studies. Results show that models with G × E always have superior prediction ability than single-environment models, and the higher prediction ability of multi-environment models with [Formula: see text] over the multi-environment model with only u occurred 85% of the time with GBLUP and 45% of the time with GK across the five data sets. The latter result indicated that including the random effect f is still beneficial for increasing prediction ability after adjusting by the random effect [Formula: see text]. Copyright © 2017 Cuevas et al.
DOT National Transportation Integrated Search
2016-09-01
We consider the problem of solving mixed random linear equations with k components. This is the noiseless setting of mixed linear regression. The goal is to estimate multiple linear models from mixed samples in the case where the labels (which sample...
Kück, Patrick; Meusemann, Karen; Dambach, Johannes; Thormann, Birthe; von Reumont, Björn M; Wägele, Johann W; Misof, Bernhard
2010-03-31
Methods of alignment masking, which refers to the technique of excluding alignment blocks prior to tree reconstructions, have been successful in improving the signal-to-noise ratio in sequence alignments. However, the lack of formally well defined methods to identify randomness in sequence alignments has prevented a routine application of alignment masking. In this study, we compared the effects on tree reconstructions of the most commonly used profiling method (GBLOCKS) which uses a predefined set of rules in combination with alignment masking, with a new profiling approach (ALISCORE) based on Monte Carlo resampling within a sliding window, using different data sets and alignment methods. While the GBLOCKS approach excludes variable sections above a certain threshold which choice is left arbitrary, the ALISCORE algorithm is free of a priori rating of parameter space and therefore more objective. ALISCORE was successfully extended to amino acids using a proportional model and empirical substitution matrices to score randomness in multiple sequence alignments. A complex bootstrap resampling leads to an even distribution of scores of randomly similar sequences to assess randomness of the observed sequence similarity. Testing performance on real data, both masking methods, GBLOCKS and ALISCORE, helped to improve tree resolution. The sliding window approach was less sensitive to different alignments of identical data sets and performed equally well on all data sets. Concurrently, ALISCORE is capable of dealing with different substitution patterns and heterogeneous base composition. ALISCORE and the most relaxed GBLOCKS gap parameter setting performed best on all data sets. Correspondingly, Neighbor-Net analyses showed the most decrease in conflict. Alignment masking improves signal-to-noise ratio in multiple sequence alignments prior to phylogenetic reconstruction. Given the robust performance of alignment profiling, alignment masking should routinely be used to improve tree reconstructions. Parametric methods of alignment profiling can be easily extended to more complex likelihood based models of sequence evolution which opens the possibility of further improvements.
Artificial neural network study on organ-targeting peptides
NASA Astrophysics Data System (ADS)
Jung, Eunkyoung; Kim, Junhyoung; Choi, Seung-Hoon; Kim, Minkyoung; Rhee, Hokyoung; Shin, Jae-Min; Choi, Kihang; Kang, Sang-Kee; Lee, Nam Kyung; Choi, Yun-Jaie; Jung, Dong Hyun
2010-01-01
We report a new approach to studying organ targeting of peptides on the basis of peptide sequence information. The positive control data sets consist of organ-targeting peptide sequences identified by the peroral phage-display technique for four organs, and the negative control data are prepared from random sequences. The capacity of our models to make appropriate predictions is validated by statistical indicators including sensitivity, specificity, enrichment curve, and the area under the receiver operating characteristic (ROC) curve (the ROC score). VHSE descriptor produces statistically significant training models and the models with simple neural network architectures show slightly greater predictive power than those with complex ones. The training and test set statistics indicate that our models could discriminate between organ-targeting and random sequences. We anticipate that our models will be applicable to the selection of organ-targeting peptides for generating peptide drugs or peptidomimetics.
1989-08-01
Random variables for the conditional exponential distribution are generated using the inverse transform method. C1) Generate U - UCO,i) (2) Set s - A ln...e - [(x+s - 7)/ n] 0 + [Cx-T)/n]0 c. Random variables from the conditional weibull distribution are generated using the inverse transform method. C1...using a standard normal transformation and the inverse transform method. B - 3 APPENDIX 3 DISTRIBUTIONS SUPPORTED BY THE MODEL (1) Generate Y - PCX S
Model's sparse representation based on reduced mixed GMsFE basis methods
NASA Astrophysics Data System (ADS)
Jiang, Lijian; Li, Qiuqi
2017-06-01
In this paper, we propose a model's sparse representation based on reduced mixed generalized multiscale finite element (GMsFE) basis methods for elliptic PDEs with random inputs. A typical application for the elliptic PDEs is the flow in heterogeneous random porous media. Mixed generalized multiscale finite element method (GMsFEM) is one of the accurate and efficient approaches to solve the flow problem in a coarse grid and obtain the velocity with local mass conservation. When the inputs of the PDEs are parameterized by the random variables, the GMsFE basis functions usually depend on the random parameters. This leads to a large number degree of freedoms for the mixed GMsFEM and substantially impacts on the computation efficiency. In order to overcome the difficulty, we develop reduced mixed GMsFE basis methods such that the multiscale basis functions are independent of the random parameters and span a low-dimensional space. To this end, a greedy algorithm is used to find a set of optimal samples from a training set scattered in the parameter space. Reduced mixed GMsFE basis functions are constructed based on the optimal samples using two optimal sampling strategies: basis-oriented cross-validation and proper orthogonal decomposition. Although the dimension of the space spanned by the reduced mixed GMsFE basis functions is much smaller than the dimension of the original full order model, the online computation still depends on the number of coarse degree of freedoms. To significantly improve the online computation, we integrate the reduced mixed GMsFE basis methods with sparse tensor approximation and obtain a sparse representation for the model's outputs. The sparse representation is very efficient for evaluating the model's outputs for many instances of parameters. To illustrate the efficacy of the proposed methods, we present a few numerical examples for elliptic PDEs with multiscale and random inputs. In particular, a two-phase flow model in random porous media is simulated by the proposed sparse representation method.
Model's sparse representation based on reduced mixed GMsFE basis methods
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jiang, Lijian, E-mail: ljjiang@hnu.edu.cn; Li, Qiuqi, E-mail: qiuqili@hnu.edu.cn
2017-06-01
In this paper, we propose a model's sparse representation based on reduced mixed generalized multiscale finite element (GMsFE) basis methods for elliptic PDEs with random inputs. A typical application for the elliptic PDEs is the flow in heterogeneous random porous media. Mixed generalized multiscale finite element method (GMsFEM) is one of the accurate and efficient approaches to solve the flow problem in a coarse grid and obtain the velocity with local mass conservation. When the inputs of the PDEs are parameterized by the random variables, the GMsFE basis functions usually depend on the random parameters. This leads to a largemore » number degree of freedoms for the mixed GMsFEM and substantially impacts on the computation efficiency. In order to overcome the difficulty, we develop reduced mixed GMsFE basis methods such that the multiscale basis functions are independent of the random parameters and span a low-dimensional space. To this end, a greedy algorithm is used to find a set of optimal samples from a training set scattered in the parameter space. Reduced mixed GMsFE basis functions are constructed based on the optimal samples using two optimal sampling strategies: basis-oriented cross-validation and proper orthogonal decomposition. Although the dimension of the space spanned by the reduced mixed GMsFE basis functions is much smaller than the dimension of the original full order model, the online computation still depends on the number of coarse degree of freedoms. To significantly improve the online computation, we integrate the reduced mixed GMsFE basis methods with sparse tensor approximation and obtain a sparse representation for the model's outputs. The sparse representation is very efficient for evaluating the model's outputs for many instances of parameters. To illustrate the efficacy of the proposed methods, we present a few numerical examples for elliptic PDEs with multiscale and random inputs. In particular, a two-phase flow model in random porous media is simulated by the proposed sparse representation method.« less
On a Stochastic Failure Model under Random Shocks
NASA Astrophysics Data System (ADS)
Cha, Ji Hwan
2013-02-01
In most conventional settings, the events caused by an external shock are initiated at the moments of its occurrence. In this paper, we study a new classes of shock model, where each shock from a nonhomogeneous Poisson processes can trigger a failure of a system not immediately, as in classical extreme shock models, but with delay of some random time. We derive the corresponding survival and failure rate functions. Furthermore, we study the limiting behaviour of the failure rate function where it is applicable.
Johnson, Jason K.; Oyen, Diane Adele; Chertkov, Michael; ...
2016-12-01
Inference and learning of graphical models are both well-studied problems in statistics and machine learning that have found many applications in science and engineering. However, exact inference is intractable in general graphical models, which suggests the problem of seeking the best approximation to a collection of random variables within some tractable family of graphical models. In this paper, we focus on the class of planar Ising models, for which exact inference is tractable using techniques of statistical physics. Based on these techniques and recent methods for planarity testing and planar embedding, we propose a greedy algorithm for learning the bestmore » planar Ising model to approximate an arbitrary collection of binary random variables (possibly from sample data). Given the set of all pairwise correlations among variables, we select a planar graph and optimal planar Ising model defined on this graph to best approximate that set of correlations. Finally, we demonstrate our method in simulations and for two applications: modeling senate voting records and identifying geo-chemical depth trends from Mars rover data.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Jason K.; Oyen, Diane Adele; Chertkov, Michael
Inference and learning of graphical models are both well-studied problems in statistics and machine learning that have found many applications in science and engineering. However, exact inference is intractable in general graphical models, which suggests the problem of seeking the best approximation to a collection of random variables within some tractable family of graphical models. In this paper, we focus on the class of planar Ising models, for which exact inference is tractable using techniques of statistical physics. Based on these techniques and recent methods for planarity testing and planar embedding, we propose a greedy algorithm for learning the bestmore » planar Ising model to approximate an arbitrary collection of binary random variables (possibly from sample data). Given the set of all pairwise correlations among variables, we select a planar graph and optimal planar Ising model defined on this graph to best approximate that set of correlations. Finally, we demonstrate our method in simulations and for two applications: modeling senate voting records and identifying geo-chemical depth trends from Mars rover data.« less
2014-01-01
Background Meta-regression is becoming increasingly used to model study level covariate effects. However this type of statistical analysis presents many difficulties and challenges. Here two methods for calculating confidence intervals for the magnitude of the residual between-study variance in random effects meta-regression models are developed. A further suggestion for calculating credible intervals using informative prior distributions for the residual between-study variance is presented. Methods Two recently proposed and, under the assumptions of the random effects model, exact methods for constructing confidence intervals for the between-study variance in random effects meta-analyses are extended to the meta-regression setting. The use of Generalised Cochran heterogeneity statistics is extended to the meta-regression setting and a Newton-Raphson procedure is developed to implement the Q profile method for meta-analysis and meta-regression. WinBUGS is used to implement informative priors for the residual between-study variance in the context of Bayesian meta-regressions. Results Results are obtained for two contrasting examples, where the first example involves a binary covariate and the second involves a continuous covariate. Intervals for the residual between-study variance are wide for both examples. Conclusions Statistical methods, and R computer software, are available to compute exact confidence intervals for the residual between-study variance under the random effects model for meta-regression. These frequentist methods are almost as easily implemented as their established counterparts for meta-analysis. Bayesian meta-regressions are also easily performed by analysts who are comfortable using WinBUGS. Estimates of the residual between-study variance in random effects meta-regressions should be routinely reported and accompanied by some measure of their uncertainty. Confidence and/or credible intervals are well-suited to this purpose. PMID:25196829
Generation of kth-order random toposequences
NASA Astrophysics Data System (ADS)
Odgers, Nathan P.; McBratney, Alex. B.; Minasny, Budiman
2008-05-01
The model presented in this paper derives toposequences from a digital elevation model (DEM). It is written in ArcInfo Macro Language (AML). The toposequences are called kth-order random toposequences, because they take a random path uphill to the top of a hill and downhill to a stream or valley bottom from a randomly selected seed point, and they are located in a streamshed of order k according to a particular stream-ordering system. We define a kth-order streamshed as the area of land that drains directly to a stream segment of stream order k. The model attempts to optimise the spatial configuration of a set of derived toposequences iteratively by using simulated annealing to maximise the total sum of distances between each toposequence hilltop in the set. The user is able to select the order, k, of the derived toposequences. Toposequences are useful for determining soil sampling locations for use in collecting soil data for digital soil mapping applications. Sampling locations can be allocated according to equal elevation or equal-distance intervals along the length of the toposequence, for example. We demonstrate the use of this model for a study area in the Hunter Valley of New South Wales, Australia. Of the 64 toposequences derived, 32 were first-order random toposequences according to Strahler's stream-ordering system, and 32 were second-order random toposequences. The model that we present in this paper is an efficient method for sampling soil along soil toposequences. The soils along a toposequence are related to each other by the topography they are found in, so soil data collected by this method is useful for establishing soil-landscape rules for the preparation of digital soil maps.
Maximizing lipocalin prediction through balanced and diversified training set and decision fusion.
Nath, Abhigyan; Subbiah, Karthikeyan
2015-12-01
Lipocalins are short in sequence length and perform several important biological functions. These proteins are having less than 20% sequence similarity among paralogs. Experimentally identifying them is an expensive and time consuming process. The computational methods based on the sequence similarity for allocating putative members to this family are also far elusive due to the low sequence similarity existing among the members of this family. Consequently, the machine learning methods become a viable alternative for their prediction by using the underlying sequence/structurally derived features as the input. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. A near perfect learning can be achieved by training the model with diverse types of input instances belonging to the different regions of the entire input space. Furthermore, the prediction performance can be improved through balancing the training set as the imbalanced data sets will tend to produce the prediction bias towards majority class and its sub-classes. This paper is aimed to achieve (i) the high generalization ability without any classification bias through the diversified and balanced training sets as well as (ii) enhanced the prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we have first used the unsupervised Kmeans clustering algorithm to create diversified clusters of input patterns and created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters. Finally, probability based classifier fusion scheme was applied on boosted random forest algorithm (which produced greater sensitivity) and K nearest neighbour algorithm (which produced greater specificity) to achieve the enhanced predictive performance than that of individual base classifiers. The performance of the learned models trained on Kmeans preprocessed training set is far better than the randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results have established that diversifying training set improves the performance of predictive models through superior generalization ability and balancing the training set improves prediction accuracy. For smaller data sets, unsupervised Kmeans based sampling can be an effective technique to increase generalization than that of the usual random splitting method. Copyright © 2015 Elsevier Ltd. All rights reserved.
Local dependence in random graph models: characterization, properties and statistical inference
Schweinberger, Michael; Handcock, Mark S.
2015-01-01
Summary Dependent phenomena, such as relational, spatial and temporal phenomena, tend to be characterized by local dependence in the sense that units which are close in a well-defined sense are dependent. In contrast with spatial and temporal phenomena, though, relational phenomena tend to lack a natural neighbourhood structure in the sense that it is unknown which units are close and thus dependent. Owing to the challenge of characterizing local dependence and constructing random graph models with local dependence, many conventional exponential family random graph models induce strong dependence and are not amenable to statistical inference. We take first steps to characterize local dependence in random graph models, inspired by the notion of finite neighbourhoods in spatial statistics and M-dependence in time series, and we show that local dependence endows random graph models with desirable properties which make them amenable to statistical inference. We show that random graph models with local dependence satisfy a natural domain consistency condition which every model should satisfy, but conventional exponential family random graph models do not satisfy. In addition, we establish a central limit theorem for random graph models with local dependence, which suggests that random graph models with local dependence are amenable to statistical inference. We discuss how random graph models with local dependence can be constructed by exploiting either observed or unobserved neighbourhood structure. In the absence of observed neighbourhood structure, we take a Bayesian view and express the uncertainty about the neighbourhood structure by specifying a prior on a set of suitable neighbourhood structures. We present simulation results and applications to two real world networks with ‘ground truth’. PMID:26560142
Mathematical and physical meaning of the Bell inequalities
NASA Astrophysics Data System (ADS)
Santos, Emilio
2016-09-01
It is shown that the Bell inequalities are closely related to the triangle inequalities involving distance functions amongst pairs of random variables with values \\{0,1\\}. A hidden variables model may be defined as a mapping between a set of quantum projection operators and a set of random variables. The model is noncontextual if there is a joint probability distribution. The Bell inequalities are necessary conditions for its existence. The inequalities are most relevant when measurements are performed at space-like separation, thus showing a conflict between quantum mechanics and local realism (Bell's theorem). The relations of the Bell inequalities with contextuality, the Kochen-Specker theorem, and quantum entanglement are briefly discussed.
Decision tree modeling using R.
Zhang, Zhongheng
2016-08-01
In machine learning field, decision tree learner is powerful and easy to interpret. It employs recursive binary partitioning algorithm that splits the sample in partitioning variable with the strongest association with the response variable. The process continues until some stopping criteria are met. In the example I focus on conditional inference tree, which incorporates tree-structured regression models into conditional inference procedures. While growing a single tree is subject to small changes in the training data, random forests procedure is introduced to address this problem. The sources of diversity for random forests come from the random sampling and restricted set of input variables to be selected. Finally, I introduce R functions to perform model based recursive partitioning. This method incorporates recursive partitioning into conventional parametric model building.
Soares, Marta O.; Palmer, Stephen; Ades, Anthony E.; Harrison, David; Shankar-Hari, Manu; Rowan, Kathy M.
2015-01-01
Cost-effectiveness analysis (CEA) models are routinely used to inform health care policy. Key model inputs include relative effectiveness of competing treatments, typically informed by meta-analysis. Heterogeneity is ubiquitous in meta-analysis, and random effects models are usually used when there is variability in effects across studies. In the absence of observed treatment effect modifiers, various summaries from the random effects distribution (random effects mean, predictive distribution, random effects distribution, or study-specific estimate [shrunken or independent of other studies]) can be used depending on the relationship between the setting for the decision (population characteristics, treatment definitions, and other contextual factors) and the included studies. If covariates have been measured that could potentially explain the heterogeneity, then these can be included in a meta-regression model. We describe how covariates can be included in a network meta-analysis model and how the output from such an analysis can be used in a CEA model. We outline a model selection procedure to help choose between competing models and stress the importance of clinical input. We illustrate the approach with a health technology assessment of intravenous immunoglobulin for the management of adult patients with severe sepsis in an intensive care setting, which exemplifies how risk of bias information can be incorporated into CEA models. We show that the results of the CEA and value-of-information analyses are sensitive to the model and highlight the importance of sensitivity analyses when conducting CEA in the presence of heterogeneity. The methods presented extend naturally to heterogeneity in other model inputs, such as baseline risk. PMID:25712447
Welton, Nicky J; Soares, Marta O; Palmer, Stephen; Ades, Anthony E; Harrison, David; Shankar-Hari, Manu; Rowan, Kathy M
2015-07-01
Cost-effectiveness analysis (CEA) models are routinely used to inform health care policy. Key model inputs include relative effectiveness of competing treatments, typically informed by meta-analysis. Heterogeneity is ubiquitous in meta-analysis, and random effects models are usually used when there is variability in effects across studies. In the absence of observed treatment effect modifiers, various summaries from the random effects distribution (random effects mean, predictive distribution, random effects distribution, or study-specific estimate [shrunken or independent of other studies]) can be used depending on the relationship between the setting for the decision (population characteristics, treatment definitions, and other contextual factors) and the included studies. If covariates have been measured that could potentially explain the heterogeneity, then these can be included in a meta-regression model. We describe how covariates can be included in a network meta-analysis model and how the output from such an analysis can be used in a CEA model. We outline a model selection procedure to help choose between competing models and stress the importance of clinical input. We illustrate the approach with a health technology assessment of intravenous immunoglobulin for the management of adult patients with severe sepsis in an intensive care setting, which exemplifies how risk of bias information can be incorporated into CEA models. We show that the results of the CEA and value-of-information analyses are sensitive to the model and highlight the importance of sensitivity analyses when conducting CEA in the presence of heterogeneity. The methods presented extend naturally to heterogeneity in other model inputs, such as baseline risk. © The Author(s) 2015.
From micro-correlations to macro-correlations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Eliazar, Iddo, E-mail: iddo.eliazar@intel.com
2016-11-15
Random vectors with a symmetric correlation structure share a common value of pair-wise correlation between their different components. The symmetric correlation structure appears in a multitude of settings, e.g. mixture models. In a mixture model the components of the random vector are drawn independently from a general probability distribution that is determined by an underlying parameter, and the parameter itself is randomized. In this paper we study the overall correlation of high-dimensional random vectors with a symmetric correlation structure. Considering such a random vector, and terming its pair-wise correlation “micro-correlation”, we use an asymptotic analysis to derive the random vector’smore » “macro-correlation” : a score that takes values in the unit interval, and that quantifies the random vector’s overall correlation. The method of obtaining macro-correlations from micro-correlations is then applied to a diverse collection of frameworks that demonstrate the method’s wide applicability.« less
QSAR as a random event: modeling of nanoparticles uptake in PaCa2 cancer cells.
Toropov, Andrey A; Toropova, Alla P; Puzyn, Tomasz; Benfenati, Emilio; Gini, Giuseppina; Leszczynska, Danuta; Leszczynski, Jerzy
2013-06-01
Quantitative structure-property/activity relationships (QSPRs/QSARs) are a tool to predict various endpoints for various substances. The "classic" QSPR/QSAR analysis is based on the representation of the molecular structure by the molecular graph. However, simplified molecular input-line entry system (SMILES) gradually becomes most popular representation of the molecular structure in the databases available on the Internet. Under such circumstances, the development of molecular descriptors calculated directly from SMILES becomes attractive alternative to "classic" descriptors. The CORAL software (http://www.insilico.eu/coral) is provider of SMILES-based optimal molecular descriptors which are aimed to correlate with various endpoints. We analyzed data set on nanoparticles uptake in PaCa2 pancreatic cancer cells. The data set includes 109 nanoparticles with the same core but different surface modifiers (small organic molecules). The concept of a QSAR as a random event is suggested in opposition to "classic" QSARs which are based on the only one distribution of available data into the training and the validation sets. In other words, five random splits into the "visible" training set and the "invisible" validation set were examined. The SMILES-based optimal descriptors (obtained by the Monte Carlo technique) for these splits are calculated with the CORAL software. The statistical quality of all these models is good. Copyright © 2013 Elsevier Ltd. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pražnikar, Jure; University of Primorska,; Turk, Dušan, E-mail: dusan.turk@ijs.si
2014-12-01
The maximum-likelihood free-kick target, which calculates model error estimates from the work set and a randomly displaced model, proved superior in the accuracy and consistency of refinement of crystal structures compared with the maximum-likelihood cross-validation target, which calculates error estimates from the test set and the unperturbed model. The refinement of a molecular model is a computational procedure by which the atomic model is fitted to the diffraction data. The commonly used target in the refinement of macromolecular structures is the maximum-likelihood (ML) function, which relies on the assessment of model errors. The current ML functions rely on cross-validation. Theymore » utilize phase-error estimates that are calculated from a small fraction of diffraction data, called the test set, that are not used to fit the model. An approach has been developed that uses the work set to calculate the phase-error estimates in the ML refinement from simulating the model errors via the random displacement of atomic coordinates. It is called ML free-kick refinement as it uses the ML formulation of the target function and is based on the idea of freeing the model from the model bias imposed by the chemical energy restraints used in refinement. This approach for the calculation of error estimates is superior to the cross-validation approach: it reduces the phase error and increases the accuracy of molecular models, is more robust, provides clearer maps and may use a smaller portion of data for the test set for the calculation of R{sub free} or may leave it out completely.« less
Asymptotic laws for random knot diagrams
NASA Astrophysics Data System (ADS)
Chapman, Harrison
2017-06-01
We study random knotting by considering knot and link diagrams as decorated, (rooted) topological maps on spheres and pulling them uniformly from among sets of a given number of vertices n, as first established in recent work with Cantarella and Mastin. The knot diagram model is an exciting new model which captures both the random geometry of space curve models of knotting as well as the ease of computing invariants from diagrams. We prove that unknot diagrams are asymptotically exponentially rare, an analogue of Sumners and Whittington’s landmark result for self-avoiding polygons. Our proof uses the same key idea: we first show that knot diagrams obey a pattern theorem, which describes their fractal structure. We examine how quickly this behavior occurs in practice. As a consequence, almost all diagrams are asymmetric, simplifying sampling from this model. We conclude with experimental data on knotting in this model. This model of random knotting is similar to those studied by Diao et al, and Dunfield et al.
NASA Astrophysics Data System (ADS)
Wang, Yu; Fan, Jie; Xu, Ye; Sun, Wei; Chen, Dong
2017-06-01
Effective application of carbon capture, utilization and storage (CCUS) systems could help to alleviate the influence of climate change by reducing carbon dioxide (CO2) emissions. The research objective of this study is to develop an equilibrium chance-constrained programming model with bi-random variables (ECCP model) for supporting the CCUS management system under random circumstances. The major advantage of the ECCP model is that it tackles random variables as bi-random variables with a normal distribution, where the mean values follow a normal distribution. This could avoid irrational assumptions and oversimplifications in the process of parameter design and enrich the theory of stochastic optimization. The ECCP model is solved by an equilibrium change-constrained programming algorithm, which provides convenience for decision makers to rank the solution set using the natural order of real numbers. The ECCP model is applied to a CCUS management problem, and the solutions could be useful in helping managers to design and generate rational CO2-allocation patterns under complexities and uncertainties.
Alternative Multiple Imputation Inference for Mean and Covariance Structure Modeling
ERIC Educational Resources Information Center
Lee, Taehun; Cai, Li
2012-01-01
Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure…
2006-06-01
levels of automation applied as per Figure 13. .................................. 60 x THIS PAGE...models generated for this thesis were set to run for 60 minutes. To run the simulation for the set time, the analyst provides a random number seed to...1984). The IMPRINT 59 workload value of 60 has been used by a consensus of workload modeling SMEs to represent the ‘high’ threshold, while the
Merging Marine Ecosystem Models and Genomics
NASA Astrophysics Data System (ADS)
Coles, V.; Hood, R. R.; Stukel, M. R.; Moran, M. A.; Paul, J. H.; Satinsky, B.; Zielinski, B.; Yager, P. L.
2015-12-01
oceanography. One of the grand challenges of oceanography is to develop model techniques to more effectively incorporate genomic information. As one approach, we developed an ecosystem model whose community is determined by randomly assigning functional genes to build each organism's "DNA". Microbes are assigned a size that sets their baseline environmental responses using allometric response cuves. These responses are modified by the costs and benefits conferred by each gene in an organism's genome. The microbes are embedded in a general circulation model where environmental conditions shape the emergent population. This model is used to explore whether organisms constructed from randomized combinations of metabolic capability alone can self-organize to create realistic oceanic biogeochemical gradients. Realistic community size spectra and chlorophyll-a concentrations emerge in the model. The model is run repeatedly with randomly-generated microbial communities and each time realistic gradients in community size spectra, chlorophyll-a, and forms of nitrogen develop. This supports the hypothesis that the metabolic potential of a community rather than the realized species composition is the primary factor setting vertical and horizontal environmental gradients. Vertical distributions of nitrogen and transcripts for genes involved in nitrification are broadly consistent with observations. Modeled gene and transcript abundance for nitrogen cycling and processing of land-derived organic material match observations along the extreme gradients in the Amazon River plume, and they help to explain the factors controlling observed variability.
Discretisation Schemes for Level Sets of Planar Gaussian Fields
NASA Astrophysics Data System (ADS)
Beliaev, D.; Muirhead, S.
2018-01-01
Smooth random Gaussian functions play an important role in mathematical physics, a main example being the random plane wave model conjectured by Berry to give a universal description of high-energy eigenfunctions of the Laplacian on generic compact manifolds. Our work is motivated by questions about the geometry of such random functions, in particular relating to the structure of their nodal and level sets. We study four discretisation schemes that extract information about level sets of planar Gaussian fields. Each scheme recovers information up to a different level of precision, and each requires a maximum mesh-size in order to be valid with high probability. The first two schemes are generalisations and enhancements of similar schemes that have appeared in the literature (Beffara and Gayet in Publ Math IHES, 2017. https://doi.org/10.1007/s10240-017-0093-0; Mischaikow and Wanner in Ann Appl Probab 17:980-1018, 2007); these give complete topological information about the level sets on either a local or global scale. As an application, we improve the results in Beffara and Gayet (2017) on Russo-Seymour-Welsh estimates for the nodal set of positively-correlated planar Gaussian fields. The third and fourth schemes are, to the best of our knowledge, completely new. The third scheme is specific to the nodal set of the random plane wave, and provides global topological information about the nodal set up to `visible ambiguities'. The fourth scheme gives a way to approximate the mean number of excursion domains of planar Gaussian fields.
NASA Astrophysics Data System (ADS)
Chan, C. H.; Brown, G.; Rikvold, P. A.
2017-05-01
A generalized approach to Wang-Landau simulations, macroscopically constrained Wang-Landau, is proposed to simulate the density of states of a system with multiple macroscopic order parameters. The method breaks a multidimensional random-walk process in phase space into many separate, one-dimensional random-walk processes in well-defined subspaces. Each of these random walks is constrained to a different set of values of the macroscopic order parameters. When the multivariable density of states is obtained for one set of values of fieldlike model parameters, the density of states for any other values of these parameters can be obtained by a simple transformation of the total system energy. All thermodynamic quantities of the system can then be rapidly calculated at any point in the phase diagram. We demonstrate how to use the multivariable density of states to draw the phase diagram, as well as order-parameter probability distributions at specific phase points, for a model spin-crossover material: an antiferromagnetic Ising model with ferromagnetic long-range interactions. The fieldlike parameters in this model are an effective magnetic field and the strength of the long-range interaction.
A Visual Detection Learning Model
NASA Technical Reports Server (NTRS)
Beard, Bettina L.; Ahumada, Albert J., Jr.; Trejo, Leonard (Technical Monitor)
1998-01-01
Our learning model has memory templates representing the target-plus-noise and noise-alone stimulus sets. The best correlating template determines the response. The correlations and the feedback participate in the additive template updating rule. The model can predict the relative thresholds for detection in random, fixed and twin noise.
Testing self-regulation interventions to increase walking using factorial randomized N-of-1 trials.
Sniehotta, Falko F; Presseau, Justin; Hobbs, Nicola; Araújo-Soares, Vera
2012-11-01
To investigate the suitability of N-of-1 randomized controlled trials (RCTs) as a means of testing the effectiveness of behavior change techniques based on self-regulation theory (goal setting and self-monitoring) for promoting walking in healthy adult volunteers. A series of N-of-1 RCTs in 10 normal and overweight adults ages 19-67 (M = 36.9 years). We randomly allocated 60 days within each individual to text message-prompted daily goal-setting and/or self-monitoring interventions in accordance with a 2 (step-count goal prompt vs. alternative goal prompt) × 2 (self-monitoring: open vs. blinded Omron-HJ-113-E pedometer) factorial design. Aggregated data were analyzed using random intercept multilevel models. Single cases were analyzed individually. The primary outcome was daily pedometer step counts over 60 days. Single-case analyses showed that 4 participants significantly increased walking: 2 on self-monitoring days and 2 on goal-setting days, compared with control days. Six participants did not benefit from the interventions. In aggregated analyses, mean step counts were higher on goal-setting days (8,499.9 vs. 7,956.3) and on self-monitoring days (8,630.3 vs. 7,825.9). Multilevel analyses showed a significant effect of the self-monitoring condition (p = .01), the goal-setting condition approached significance (p = .08), and there was a small linear increase in walking over time (p = .03). N-of-1 randomized trials are a suitable means to test behavioral interventions in individual participants.
Edge union of networks on the same vertex set
NASA Astrophysics Data System (ADS)
Loe, Chuan Wen; Jeldtoft Jensen, Henrik
2013-06-01
Random network generators such as Erdős-Rényi, Watts-Strogatz and Barabási-Albert models are used as models to study real-world networks. Let G1(V, E1) and G2(V, E2) be two such networks on the same vertex set V. This paper studies the degree distribution and clustering coefficient of the resultant networks, G(V, E1∪E2).
Random covering of the circle: the configuration-space of the free deposition process
NASA Astrophysics Data System (ADS)
Huillet, Thierry
2003-12-01
Consider a circle of circumference 1. Throw at random n points, sequentially, on this circle and append clockwise an arc (or rod) of length s to each such point. The resulting random set (the free gas of rods) is a collection of a random number of clusters with random sizes. It models a free deposition process on a 1D substrate. For such processes, we shall consider the occurrence times (number of rods) and probabilities, as n grows, of the following configurations: those avoiding rod overlap (the hard-rod gas), those for which the largest gap is smaller than rod length s (the packing gas), those (parking configurations) for which hard rod and packing constraints are both fulfilled and covering configurations. Special attention is paid to the statistical properties of each such (rare) configuration in the asymptotic density domain when ns = rgr, for some finite density rgr of points. Using results from spacings in the random division of the circle, explicit large deviation rate functions can be computed in each case from state equations. Lastly, a process consisting in selecting at random one of these specific equilibrium configurations (called the observable) can be modelled. When particularized to the parking model, this system produces parking configurations differently from Rényi's random sequential adsorption model.
Predicting Coastal Flood Severity using Random Forest Algorithm
NASA Astrophysics Data System (ADS)
Sadler, J. M.; Goodall, J. L.; Morsy, M. M.; Spencer, K.
2017-12-01
Coastal floods have become more common recently and are predicted to further increase in frequency and severity due to sea level rise. Predicting floods in coastal cities can be difficult due to the number of environmental and geographic factors which can influence flooding events. Built stormwater infrastructure and irregular urban landscapes add further complexity. This paper demonstrates the use of machine learning algorithms in predicting street flood occurrence in an urban coastal setting. The model is trained and evaluated using data from Norfolk, Virginia USA from September 2010 - October 2016. Rainfall, tide levels, water table levels, and wind conditions are used as input variables. Street flooding reports made by city workers after named and unnamed storm events, ranging from 1-159 reports per event, are the model output. Results show that Random Forest provides predictive power in estimating the number of flood occurrences given a set of environmental conditions with an out-of-bag root mean squared error of 4.3 flood reports and a mean absolute error of 0.82 flood reports. The Random Forest algorithm performed much better than Poisson regression. From the Random Forest model, total daily rainfall was by far the most important factor in flood occurrence prediction, followed by daily low tide and daily higher high tide. The model demonstrated here could be used to predict flood severity based on forecast rainfall and tide conditions and could be further enhanced using more complete street flooding data for model training.
Rating the raters in a mixed model: An approach to deciphering the rater reliability
NASA Astrophysics Data System (ADS)
Shang, Junfeng; Wang, Yougui
2013-05-01
Rating the raters has attracted extensive attention in recent years. Ratings are quite complex in that the subjective assessment and a number of criteria are involved in a rating system. Whenever the human judgment is a part of ratings, the inconsistency of ratings is the source of variance in scores, and it is therefore quite natural for people to verify the trustworthiness of ratings. Accordingly, estimation of the rater reliability will be of great interest and an appealing issue. To facilitate the evaluation of the rater reliability in a rating system, we propose a mixed model where the scores of the ratees offered by a rater are described with the fixed effects determined by the ability of the ratees and the random effects produced by the disagreement of the raters. In such a mixed model, for the rater random effects, we derive its posterior distribution for the prediction of random effects. To quantitatively make a decision in revealing the unreliable raters, the predictive influence function (PIF) serves as a criterion which compares the posterior distributions of random effects between the full data and rater-deleted data sets. The benchmark for this criterion is also discussed. This proposed methodology of deciphering the rater reliability is investigated in the multiple simulated and two real data sets.
A powerful and efficient set test for genetic markers that handles confounders
Listgarten, Jennifer; Lippert, Christoph; Kang, Eun Yong; Xiang, Jing; Kadie, Carl M.; Heckerman, David
2013-01-01
Motivation: Approaches for testing sets of variants, such as a set of rare or common variants within a gene or pathway, for association with complex traits are important. In particular, set tests allow for aggregation of weak signal within a set, can capture interplay among variants and reduce the burden of multiple hypothesis testing. Until now, these approaches did not address confounding by family relatedness and population structure, a problem that is becoming more important as larger datasets are used to increase power. Results: We introduce a new approach for set tests that handles confounders. Our model is based on the linear mixed model and uses two random effects—one to capture the set association signal and one to capture confounders. We also introduce a computational speedup for two random-effects models that makes this approach feasible even for extremely large cohorts. Using this model with both the likelihood ratio test and score test, we find that the former yields more power while controlling type I error. Application of our approach to richly structured Genetic Analysis Workshop 14 data demonstrates that our method successfully corrects for population structure and family relatedness, whereas application of our method to a 15 000 individual Crohn’s disease case–control cohort demonstrates that it additionally recovers genes not recoverable by univariate analysis. Availability: A Python-based library implementing our approach is available at http://mscompbio.codeplex.com. Contact: jennl@microsoft.com or lippert@microsoft.com or heckerma@microsoft.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23599503
Diffusion in randomly perturbed dissipative dynamics
NASA Astrophysics Data System (ADS)
Rodrigues, Christian S.; Chechkin, Aleksei V.; de Moura, Alessandro P. S.; Grebogi, Celso; Klages, Rainer
2014-11-01
Dynamical systems having many coexisting attractors present interesting properties from both fundamental theoretical and modelling points of view. When such dynamics is under bounded random perturbations, the basins of attraction are no longer invariant and there is the possibility of transport among them. Here we introduce a basic theoretical setting which enables us to study this hopping process from the perspective of anomalous transport using the concept of a random dynamical system with holes. We apply it to a simple model by investigating the role of hyperbolicity for the transport among basins. We show numerically that our system exhibits non-Gaussian position distributions, power-law escape times, and subdiffusion. Our simulation results are reproduced consistently from stochastic continuous time random walk theory.
Stacked Denoising Autoencoders Applied to Star/Galaxy Classification
NASA Astrophysics Data System (ADS)
Qin, Hao-ran; Lin, Ji-ming; Wang, Jun-yi
2017-04-01
In recent years, the deep learning algorithm, with the characteristics of strong adaptability, high accuracy, and structural complexity, has become more and more popular, but it has not yet been used in astronomy. In order to solve the problem that the star/galaxy classification accuracy is high for the bright source set, but low for the faint source set of the Sloan Digital Sky Survey (SDSS) data, we introduced the new deep learning algorithm, namely the SDA (stacked denoising autoencoder) neural network and the dropout fine-tuning technique, which can greatly improve the robustness and antinoise performance. We randomly selected respectively the bright source sets and faint source sets from the SDSS DR12 and DR7 data with spectroscopic measurements, and made preprocessing on them. Then, we randomly selected respectively the training sets and testing sets without replacement from the bright source sets and faint source sets. At last, using these training sets we made the training to obtain the SDA models of the bright sources and faint sources in the SDSS DR7 and DR12, respectively. We compared the test result of the SDA model on the DR12 testing set with the test results of the Library for Support Vector Machines (LibSVM), J48 decision tree, Logistic Model Tree (LMT), Support Vector Machine (SVM), Logistic Regression, and Decision Stump algorithm, and compared the test result of the SDA model on the DR7 testing set with the test results of six kinds of decision trees. The experiments show that the SDA has a better classification accuracy than other machine learning algorithms for the faint source sets of DR7 and DR12. Especially, when the completeness function is used as the evaluation index, compared with the decision tree algorithms, the correctness rate of SDA has improved about 15% for the faint source set of SDSS-DR7.
Ashrafi, Parivash; Sun, Yi; Davey, Neil; Adams, Roderick G; Wilkinson, Simon C; Moss, Gary Patrick
2018-03-01
The aim of this study was to investigate how to improve predictions from Gaussian Process models by optimising the model hyperparameters. Optimisation methods, including Grid Search, Conjugate Gradient, Random Search, Evolutionary Algorithm and Hyper-prior, were evaluated and applied to previously published data. Data sets were also altered in a structured manner to reduce their size, which retained the range, or 'chemical space' of the key descriptors to assess the effect of the data range on model quality. The Hyper-prior Smoothbox kernel results in the best models for the majority of data sets, and they exhibited significantly better performance than benchmark quantitative structure-permeability relationship (QSPR) models. When the data sets were systematically reduced in size, the different optimisation methods generally retained their statistical quality, whereas benchmark QSPR models performed poorly. The design of the data set, and possibly also the approach to validation of the model, is critical in the development of improved models. The size of the data set, if carefully controlled, was not generally a significant factor for these models and that models of excellent statistical quality could be produced from substantially smaller data sets. © 2018 Royal Pharmaceutical Society.
Solving large test-day models by iteration on data and preconditioned conjugate gradient.
Lidauer, M; Strandén, I; Mäntysaari, E A; Pösö, J; Kettunen, A
1999-12-01
A preconditioned conjugate gradient method was implemented into an iteration on a program for data estimation of breeding values, and its convergence characteristics were studied. An algorithm was used as a reference in which one fixed effect was solved by Gauss-Seidel method, and other effects were solved by a second-order Jacobi method. Implementation of the preconditioned conjugate gradient required storing four vectors (size equal to number of unknowns in the mixed model equations) in random access memory and reading the data at each round of iteration. The preconditioner comprised diagonal blocks of the coefficient matrix. Comparison of algorithms was based on solutions of mixed model equations obtained by a single-trait animal model and a single-trait, random regression test-day model. Data sets for both models used milk yield records of primiparous Finnish dairy cows. Animal model data comprised 665,629 lactation milk yields and random regression test-day model data of 6,732,765 test-day milk yields. Both models included pedigree information of 1,099,622 animals. The animal model ¿random regression test-day model¿ required 122 ¿305¿ rounds of iteration to converge with the reference algorithm, but only 88 ¿149¿ were required with the preconditioned conjugate gradient. To solve the random regression test-day model with the preconditioned conjugate gradient required 237 megabytes of random access memory and took 14% of the computation time needed by the reference algorithm.
NASA Astrophysics Data System (ADS)
Erener, Arzu; Sivas, A. Abdullah; Selcuk-Kestel, A. Sevtap; Düzgün, H. Sebnem
2017-07-01
All of the quantitative landslide susceptibility mapping (QLSM) methods requires two basic data types, namely, landslide inventory and factors that influence landslide occurrence (landslide influencing factors, LIF). Depending on type of landslides, nature of triggers and LIF, accuracy of the QLSM methods differs. Moreover, how to balance the number of 0 (nonoccurrence) and 1 (occurrence) in the training set obtained from the landslide inventory and how to select which one of the 1's and 0's to be included in QLSM models play critical role in the accuracy of the QLSM. Although performance of various QLSM methods is largely investigated in the literature, the challenge of training set construction is not adequately investigated for the QLSM methods. In order to tackle this challenge, in this study three different training set selection strategies along with the original data set is used for testing the performance of three different regression methods namely Logistic Regression (LR), Bayesian Logistic Regression (BLR) and Fuzzy Logistic Regression (FLR). The first sampling strategy is proportional random sampling (PRS), which takes into account a weighted selection of landslide occurrences in the sample set. The second method, namely non-selective nearby sampling (NNS), includes randomly selected sites and their surrounding neighboring points at certain preselected distances to include the impact of clustering. Selective nearby sampling (SNS) is the third method, which concentrates on the group of 1's and their surrounding neighborhood. A randomly selected group of landslide sites and their neighborhood are considered in the analyses similar to NNS parameters. It is found that LR-PRS, FLR-PRS and BLR-Whole Data set-ups, with order, yield the best fits among the other alternatives. The results indicate that in QLSM based on regression models, avoidance of spatial correlation in the data set is critical for the model's performance.
Rigorously testing multialternative decision field theory against random utility models.
Berkowitsch, Nicolas A J; Scheibehenne, Benjamin; Rieskamp, Jörg
2014-06-01
Cognitive models of decision making aim to explain the process underlying observed choices. Here, we test a sequential sampling model of decision making, multialternative decision field theory (MDFT; Roe, Busemeyer, & Townsend, 2001), on empirical grounds and compare it against 2 established random utility models of choice: the probit and the logit model. Using a within-subject experimental design, participants in 2 studies repeatedly choose among sets of options (consumer products) described on several attributes. The results of Study 1 showed that all models predicted participants' choices equally well. In Study 2, in which the choice sets were explicitly designed to distinguish the models, MDFT had an advantage in predicting the observed choices. Study 2 further revealed the occurrence of multiple context effects within single participants, indicating an interdependent evaluation of choice options and correlations between different context effects. In sum, the results indicate that sequential sampling models can provide relevant insights into the cognitive process underlying preferential choices and thus can lead to better choice predictions. PsycINFO Database Record (c) 2014 APA, all rights reserved.
Prediction models for clustered data: comparison of a random intercept and standard regression model
2013-01-01
Background When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Methods Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. Results The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. Conclusion The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters. PMID:23414436
Bouwmeester, Walter; Twisk, Jos W R; Kappen, Teus H; van Klei, Wilton A; Moons, Karel G M; Vergouwe, Yvonne
2013-02-15
When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters.
The estimation of branching curves in the presence of subject-specific random effects.
Elmi, Angelo; Ratcliffe, Sarah J; Guo, Wensheng
2014-12-20
Branching curves are a technique for modeling curves that change trajectory at a change (branching) point. Currently, the estimation framework is limited to independent data, and smoothing splines are used for estimation. This article aims to extend the branching curve framework to the longitudinal data setting where the branching point varies by subject. If the branching point is modeled as a random effect, then the longitudinal branching curve framework is a semiparametric nonlinear mixed effects model. Given existing issues with using random effects within a smoothing spline, we express the model as a B-spline based semiparametric nonlinear mixed effects model. Simple, clever smoothness constraints are enforced on the B-splines at the change point. The method is applied to Women's Health data where we model the shape of the labor curve (cervical dilation measured longitudinally) before and after treatment with oxytocin (a labor stimulant). Copyright © 2014 John Wiley & Sons, Ltd.
A comparison of methods for estimating the random effects distribution of a linear mixed model.
Ghidey, Wendimagegn; Lesaffre, Emmanuel; Verbeke, Geert
2010-12-01
This article reviews various recently suggested approaches to estimate the random effects distribution in a linear mixed model, i.e. (1) the smoothing by roughening approach of Shen and Louis,(1) (2) the semi-non-parametric approach of Zhang and Davidian,(2) (3) the heterogeneity model of Verbeke and Lesaffre( 3) and (4) a flexible approach of Ghidey et al. (4) These four approaches are compared via an extensive simulation study. We conclude that for the considered cases, the approach of Ghidey et al. (4) often shows to have the smallest integrated mean squared error for estimating the random effects distribution. An analysis of a longitudinal dental data set illustrates the performance of the methods in a practical example.
Iterative random vs. Kennard-Stone sampling for IR spectrum-based classification task using PLS2-DA
NASA Astrophysics Data System (ADS)
Lee, Loong Chuen; Liong, Choong-Yeun; Jemain, Abdul Aziz
2018-04-01
External testing (ET) is preferred over auto-prediction (AP) or k-fold-cross-validation in estimating more realistic predictive ability of a statistical model. With IR spectra, Kennard-stone (KS) sampling algorithm is often used to split the data into training and test sets, i.e. respectively for model construction and for model testing. On the other hand, iterative random sampling (IRS) has not been the favored choice though it is theoretically more likely to produce reliable estimation. The aim of this preliminary work is to compare performances of KS and IRS in sampling a representative training set from an attenuated total reflectance - Fourier transform infrared spectral dataset (of four varieties of blue gel pen inks) for PLS2-DA modeling. The `best' performance achievable from the dataset is estimated with AP on the full dataset (APF, error). Both IRS (n = 200) and KS were used to split the dataset in the ratio of 7:3. The classic decision rule (i.e. maximum value-based) is employed for new sample prediction via partial least squares - discriminant analysis (PLS2-DA). Error rate of each model was estimated repeatedly via: (a) AP on full data (APF, error); (b) AP on training set (APS, error); and (c) ET on the respective test set (ETS, error). A good PLS2-DA model is expected to produce APS, error and EVS, error that is similar to the APF, error. Bearing that in mind, the similarities between (a) APS, error vs. APF, error; (b) ETS, error vs. APF, error and; (c) APS, error vs. ETS, error were evaluated using correlation tests (i.e. Pearson and Spearman's rank test), using series of PLS2-DA models computed from KS-set and IRS-set, respectively. Overall, models constructed from IRS-set exhibits more similarities between the internal and external error rates than the respective KS-set, i.e. less risk of overfitting. In conclusion, IRS is more reliable than KS in sampling representative training set.
Modeling Epidemics Spreading on Social Contact Networks.
Zhang, Zhaoyang; Wang, Honggang; Wang, Chonggang; Fang, Hua
2015-09-01
Social contact networks and the way people interact with each other are the key factors that impact on epidemics spreading. However, it is challenging to model the behavior of epidemics based on social contact networks due to their high dynamics. Traditional models such as susceptible-infected-recovered (SIR) model ignore the crowding or protection effect and thus has some unrealistic assumption. In this paper, we consider the crowding or protection effect and develop a novel model called improved SIR model. Then, we use both deterministic and stochastic models to characterize the dynamics of epidemics on social contact networks. The results from both simulations and real data set conclude that the epidemics are more likely to outbreak on social contact networks with higher average degree. We also present some potential immunization strategies, such as random set immunization, dominating set immunization, and high degree set immunization to further prove the conclusion.
Deeb, Omar; Shaik, Basheerulla; Agrawal, Vijay K
2014-10-01
Quantitative Structure-Activity Relationship (QSAR) models for binding affinity constants (log Ki) of 78 flavonoid ligands towards the benzodiazepine site of GABA (A) receptor complex were calculated using the machine learning methods: artificial neural network (ANN) and support vector machine (SVM) techniques. The models obtained were compared with those obtained using multiple linear regression (MLR) analysis. The descriptor selection and model building were performed with 10-fold cross-validation using the training data set. The SVM and MLR coefficient of determination values are 0.944 and 0.879, respectively, for the training set and are higher than those of ANN models. Though the SVM model shows improvement of training set fitting, the ANN model was superior to SVM and MLR in predicting the test set. Randomization test is employed to check the suitability of the models.
Modeling Epidemics Spreading on Social Contact Networks
ZHANG, ZHAOYANG; WANG, HONGGANG; WANG, CHONGGANG; FANG, HUA
2016-01-01
Social contact networks and the way people interact with each other are the key factors that impact on epidemics spreading. However, it is challenging to model the behavior of epidemics based on social contact networks due to their high dynamics. Traditional models such as susceptible-infected-recovered (SIR) model ignore the crowding or protection effect and thus has some unrealistic assumption. In this paper, we consider the crowding or protection effect and develop a novel model called improved SIR model. Then, we use both deterministic and stochastic models to characterize the dynamics of epidemics on social contact networks. The results from both simulations and real data set conclude that the epidemics are more likely to outbreak on social contact networks with higher average degree. We also present some potential immunization strategies, such as random set immunization, dominating set immunization, and high degree set immunization to further prove the conclusion. PMID:27722037
Universality in chaos: Lyapunov spectrum and random matrix theory.
Hanada, Masanori; Shimada, Hidehiko; Tezuka, Masaki
2018-02-01
We propose the existence of a new universality in classical chaotic systems when the number of degrees of freedom is large: the statistical property of the Lyapunov spectrum is described by random matrix theory. We demonstrate it by studying the finite-time Lyapunov exponents of the matrix model of a stringy black hole and the mass-deformed models. The massless limit, which has a dual string theory interpretation, is special in that the universal behavior can be seen already at t=0, while in other cases it sets in at late time. The same pattern is demonstrated also in the product of random matrices.
Universality in chaos: Lyapunov spectrum and random matrix theory
NASA Astrophysics Data System (ADS)
Hanada, Masanori; Shimada, Hidehiko; Tezuka, Masaki
2018-02-01
We propose the existence of a new universality in classical chaotic systems when the number of degrees of freedom is large: the statistical property of the Lyapunov spectrum is described by random matrix theory. We demonstrate it by studying the finite-time Lyapunov exponents of the matrix model of a stringy black hole and the mass-deformed models. The massless limit, which has a dual string theory interpretation, is special in that the universal behavior can be seen already at t =0 , while in other cases it sets in at late time. The same pattern is demonstrated also in the product of random matrices.
Transcription, intercellular variability and correlated random walk.
Müller, Johannes; Kuttler, Christina; Hense, Burkhard A; Zeiser, Stefan; Liebscher, Volkmar
2008-11-01
We develop a simple model for the random distribution of a gene product. It is assumed that the only source of variance is due to switching transcription on and off by a random process. Under the condition that the transition rates between on and off are constant we find that the amount of mRNA follows a scaled Beta distribution. Additionally, a simple positive feedback loop is considered. The simplicity of the model allows for an explicit solution also in this setting. These findings in turn allow, e.g., for easy parameter scans. We find that bistable behavior translates into bimodal distributions. These theoretical findings are in line with experimental results.
Macroscopic damping model for structural dynamics with random polycrystalline configurations
NASA Astrophysics Data System (ADS)
Yang, Yantao; Cui, Junzhi; Yu, Yifan; Xiang, Meizhen
2018-06-01
In this paper the macroscopic damping model for dynamical behavior of the structures with random polycrystalline configurations at micro-nano scales is established. First, the global motion equation of a crystal is decomposed into a set of motion equations with independent single degree of freedom (SDOF) along normal discrete modes, and then damping behavior is introduced into each SDOF motion. Through the interpolation of discrete modes, the continuous representation of damping effects for the crystal is obtained. Second, from energy conservation law the expression of the damping coefficient is derived, and the approximate formula of damping coefficient is given. Next, the continuous damping coefficient for polycrystalline cluster is expressed, the continuous dynamical equation with damping term is obtained, and then the concrete damping coefficients for a polycrystalline Cu sample are shown. Finally, by using statistical two-scale homogenization method, the macroscopic homogenized dynamical equation containing damping term for the structures with random polycrystalline configurations at micro-nano scales is set up.
Humphreys, Keith; Blodgett, Janet C; Wagner, Todd H
2014-11-01
Observational studies of Alcoholics Anonymous' (AA) effectiveness are vulnerable to self-selection bias because individuals choose whether or not to attend AA. The present study, therefore, employed an innovative statistical technique to derive a selection bias-free estimate of AA's impact. Six data sets from 5 National Institutes of Health-funded randomized trials (1 with 2 independent parallel arms) of AA facilitation interventions were analyzed using instrumental variables models. Alcohol-dependent individuals in one of the data sets (n = 774) were analyzed separately from the rest of sample (n = 1,582 individuals pooled from 5 data sets) because of heterogeneity in sample parameters. Randomization itself was used as the instrumental variable. Randomization was a good instrument in both samples, effectively predicting increased AA attendance that could not be attributed to self-selection. In 5 of the 6 data sets, which were pooled for analysis, increased AA attendance that was attributable to randomization (i.e., free of self-selection bias) was effective at increasing days of abstinence at 3-month (B = 0.38, p = 0.001) and 15-month (B = 0.42, p = 0.04) follow-up. However, in the remaining data set, in which preexisting AA attendance was much higher, further increases in AA involvement caused by the randomly assigned facilitation intervention did not affect drinking outcome. For most individuals seeking help for alcohol problems, increasing AA attendance leads to short- and long-term decreases in alcohol consumption that cannot be attributed to self-selection. However, for populations with high preexisting AA involvement, further increases in AA attendance may have little impact. Copyright © 2014 by the Research Society on Alcoholism.
Optimal strategy analysis based on robust predictive control for inventory system with random demand
NASA Astrophysics Data System (ADS)
Saputra, Aditya; Widowati, Sutrisno
2017-12-01
In this paper, the optimal strategy for a single product single supplier inventory system with random demand is analyzed by using robust predictive control with additive random parameter. We formulate the dynamical system of this system as a linear state space with additive random parameter. To determine and analyze the optimal strategy for the given inventory system, we use robust predictive control approach which gives the optimal strategy i.e. the optimal product volume that should be purchased from the supplier for each time period so that the expected cost is minimal. A numerical simulation is performed with some generated random inventory data. We simulate in MATLAB software where the inventory level must be controlled as close as possible to a set point decided by us. From the results, robust predictive control model provides the optimal strategy i.e. the optimal product volume that should be purchased and the inventory level was followed the given set point.
NASA Astrophysics Data System (ADS)
Higuchi, Kazuhide; Miyaji, Kousuke; Johguchi, Koh; Takeuchi, Ken
2012-02-01
This paper proposes a verify-programming method for the resistive random access memory (ReRAM) cell which achieves a 50-times higher endurance and a fast set and reset compared with the conventional method. The proposed verify-programming method uses the incremental pulse width with turnback (IPWWT) for the reset and the incremental voltage with turnback (IVWT) for the set. With the combination of IPWWT reset and IVWT set, the endurance-cycle increases from 48 ×103 to 2444 ×103 cycles. Furthermore, the measured data retention-time after 20 ×103 set/reset cycles is estimated to be 10 years. Additionally, the filamentary based physical model is proposed to explain the set/reset failure mechanism with various set/reset pulse shapes. The reset pulse width and set voltage correspond to the width and length of the conductive-filament, respectively. Consequently, since the proposed IPWWT and IVWT recover set and reset failures of ReRAM cells, the endurance-cycles are improved.
On the mixing time of geographical threshold graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bradonjic, Milan
In this paper, we study the mixing time of random graphs generated by the geographical threshold graph (GTG) model, a generalization of random geometric graphs (RGG). In a GTG, nodes are distributed in a Euclidean space, and edges are assigned according to a threshold function involving the distance between nodes as well as randomly chosen node weights. The motivation for analyzing this model is that many real networks (e.g., wireless networks, the Internet, etc.) need to be studied by using a 'richer' stochastic model (which in this case includes both a distance between nodes and weights on the nodes). Wemore » specifically study the mixing times of random walks on 2-dimensional GTGs near the connectivity threshold. We provide a set of criteria on the distribution of vertex weights that guarantees that the mixing time is {Theta}(n log n).« less
Pathwise upper semi-continuity of random pullback attractors along the time axis
NASA Astrophysics Data System (ADS)
Cui, Hongyong; Kloeden, Peter E.; Wu, Fuke
2018-07-01
The pullback attractor of a non-autonomous random dynamical system is a time-indexed family of random sets, typically having the form {At(ṡ) } t ∈ R with each At(ṡ) a random set. This paper is concerned with the nature of such time-dependence. It is shown that the upper semi-continuity of the mapping t ↦At(ω) for each ω fixed has an equivalence relationship with the uniform compactness of the local union ∪s∈IAs(ω) , where I ⊂ R is compact. Applied to a semi-linear degenerate parabolic equation with additive noise and a wave equation with multiplicative noise we show that, in order to prove the above locally uniform compactness and upper semi-continuity, no additional conditions are required, in which sense the two properties appear to be general properties satisfied by a large number of real models.
On the information content of hydrological signatures and their relationship to catchment attributes
NASA Astrophysics Data System (ADS)
Addor, Nans; Clark, Martyn P.; Prieto, Cristina; Newman, Andrew J.; Mizukami, Naoki; Nearing, Grey; Le Vine, Nataliya
2017-04-01
Hydrological signatures, which are indices characterizing hydrologic behavior, are increasingly used for the evaluation, calibration and selection of hydrological models. Their key advantage is to provide more direct insights into specific hydrological processes than aggregated metrics (e.g., the Nash-Sutcliffe efficiency). A plethora of signatures now exists, which enable characterizing a variety of hydrograph features, but also makes the selection of signatures for new studies challenging. Here we propose that the selection of signatures should be based on their information content, which we estimated using several approaches, all leading to similar conclusions. To explore the relationship between hydrological signatures and the landscape, we extended a previously published data set of hydrometeorological time series for 671 catchments in the contiguous United States, by characterizing the climatic conditions, topography, soil, vegetation and stream network of each catchment. This new catchment attributes data set will soon be in open access, and we are looking forward to introducing it to the community. We used this data set in a data-learning algorithm (random forests) to explore whether hydrological signatures could be inferred from catchment attributes alone. We find that some signatures can be predicted remarkably well by random forests and, interestingly, the same signatures are well captured when simulating discharge using a conceptual hydrological model. We discuss what this result reveals about our understanding of hydrological processes shaping hydrological signatures. We also identify which catchment attributes exert the strongest control on catchment behavior, in particular during extreme hydrological events. Overall, climatic attributes have the most significant influence, and strongly condition how well hydrological signatures can be predicted by random forests and simulated by the hydrological model. In contrast, soil characteristics at the catchment scale are not found to be significant predictors by random forests, which raises questions on how to best use soil data for hydrological modeling, for instance for parameter estimation. We finally demonstrate that signatures with high spatial variability are poorly captured by random forests and model simulations, which makes their regionalization delicate. We conclude with a ranking of signatures based on their information content, and propose that the signatures with high information content are best suited for model calibration, model selection and understanding hydrologic similarity.
Extension of mixture-of-experts networks for binary classification of hierarchical data.
Ng, Shu-Kay; McLachlan, Geoffrey J
2007-09-01
For many applied problems in the context of medically relevant artificial intelligence, the data collected exhibit a hierarchical or clustered structure. Ignoring the interdependence between hierarchical data can result in misleading classification. In this paper, we extend the mechanism for mixture-of-experts (ME) networks for binary classification of hierarchical data. Another extension is to quantify cluster-specific information on data hierarchy by random effects via the generalized linear mixed-effects model (GLMM). The extension of ME networks is implemented by allowing for correlation in the hierarchical data in both the gating and expert networks via the GLMM. The proposed model is illustrated using a real thyroid disease data set. In our study, we consider 7652 thyroid diagnosis records from 1984 to early 1987 with complete information on 20 attribute values. We obtain 10 independent random splits of the data into a training set and a test set in the proportions 85% and 15%. The test sets are used to assess the generalization performance of the proposed model, based on the percentage of misclassifications. For comparison, the results obtained from the ME network with independence assumption are also included. With the thyroid disease data, the misclassification rate on test sets for the extended ME network is 8.9%, compared to 13.9% for the ME network. In addition, based on model selection methods described in Section 2, a network with two experts is selected. These two expert networks can be considered as modeling two groups of patients with high and low incidence rates. Significant variation among the predicted cluster-specific random effects is detected in the patient group with low incidence rate. It is shown that the extended ME network outperforms the ME network for binary classification of hierarchical data. With the thyroid disease data, useful information on the relative log odds of patients with diagnosed conditions at different periods can be evaluated. This information can be taken into consideration for the assessment of treatment planning of the disease. The proposed extended ME network thus facilitates a more general approach to incorporate data hierarchy mechanism in network modeling.
Modeling the nitrogen cycle one gene at a time
NASA Astrophysics Data System (ADS)
Coles, V.; Stukel, M. R.; Hood, R. R.; Moran, M. A.; Paul, J. H.; Satinsky, B.; Zielinski, B.; Yager, P. L.
2016-02-01
Marine ecosystem models are lagging the revolution in microbial oceanography. As a result, modeling of the nitrogen cycle has largely failed to leverage new genomic information on nitrogen cycling pathways and the organisms that mediate them. We developed a nitrogen based ecosystem model whose community is determined by randomly assigning functional genes to build each organism's "DNA". Microbes are assigned a size that sets their baseline environmental responses using allometric response curves. These responses are modified by the costs and benefits conferred by each gene in an organism's genome. The microbes are embedded in a general circulation model where environmental conditions shape the emergent population. This model is used to explore whether organisms constructed from randomized combinations of metabolic capability alone can self-organize to create realistic oceanic biogeochemical gradients. Community size spectra and chlorophyll-a concentrations emerge in the model with reasonable fidelity to observations. The model is run repeatedly with randomly-generated microbial communities and each time realistic gradients in community size spectra, chlorophyll-a, and forms of nitrogen develop. This supports the hypothesis that the metabolic potential of a community rather than the realized species composition is the primary factor setting vertical and horizontal environmental gradients. Vertical distributions of nitrogen and transcripts for genes involved in nitrification are broadly consistent with observations. Modeled gene and transcript abundance for nitrogen cycling and processing of land-derived organic material match observations along the extreme gradients in the Amazon River plume, and they help to explain the factors controlling observed variability.
Effects of ignition location models on the burn patterns of simulated wildfires
Bar-Massada, A.; Syphard, A.D.; Hawbaker, T.J.; Stewart, S.I.; Radeloff, V.C.
2011-01-01
Fire simulation studies that use models such as FARSITE often assume that ignition locations are distributed randomly, because spatially explicit information about actual ignition locations are difficult to obtain. However, many studies show that the spatial distribution of ignition locations, whether human-caused or natural, is non-random. Thus, predictions from fire simulations based on random ignitions may be unrealistic. However, the extent to which the assumption of ignition location affects the predictions of fire simulation models has never been systematically explored. Our goal was to assess the difference in fire simulations that are based on random versus non-random ignition location patterns. We conducted four sets of 6000 FARSITE simulations for the Santa Monica Mountains in California to quantify the influence of random and non-random ignition locations and normal and extreme weather conditions on fire size distributions and spatial patterns of burn probability. Under extreme weather conditions, fires were significantly larger for non-random ignitions compared to random ignitions (mean area of 344.5 ha and 230.1 ha, respectively), but burn probability maps were highly correlated (r = 0.83). Under normal weather, random ignitions produced significantly larger fires than non-random ignitions (17.5 ha and 13.3 ha, respectively), and the spatial correlations between burn probability maps were not high (r = 0.54), though the difference in the average burn probability was small. The results of the study suggest that the location of ignitions used in fire simulation models may substantially influence the spatial predictions of fire spread patterns. However, the spatial bias introduced by using a random ignition location model may be minimized if the fire simulations are conducted under extreme weather conditions when fire spread is greatest. ?? 2010 Elsevier Ltd.
Approximate ground states of the random-field Potts model from graph cuts
NASA Astrophysics Data System (ADS)
Kumar, Manoj; Kumar, Ravinder; Weigel, Martin; Banerjee, Varsha; Janke, Wolfhard; Puri, Sanjay
2018-05-01
While the ground-state problem for the random-field Ising model is polynomial, and can be solved using a number of well-known algorithms for maximum flow or graph cut, the analog random-field Potts model corresponds to a multiterminal flow problem that is known to be NP-hard. Hence an efficient exact algorithm is very unlikely to exist. As we show here, it is nevertheless possible to use an embedding of binary degrees of freedom into the Potts spins in combination with graph-cut methods to solve the corresponding ground-state problem approximately in polynomial time. We benchmark this heuristic algorithm using a set of quasiexact ground states found for small systems from long parallel tempering runs. For a not-too-large number q of Potts states, the method based on graph cuts finds the same solutions in a fraction of the time. We employ the new technique to analyze the breakup length of the random-field Potts model in two dimensions.
Modeling Local Interactions during the Motion of Cyanobacteria
Galante, Amanda; Wisen, Susanne; Bhaya, Devaki; Levy, Doron
2012-01-01
Synechocystis sp., a common unicellular freshwater cyanobacterium, has been used as a model organism to study phototaxis, an ability to move in the direction of a light source. This microorganism displays a number of additional characteristics such as delayed motion, surface dependence, and a quasi-random motion, where cells move in a seemingly disordered fashion instead of in the direction of the light source, a global force on the system. These unexplained motions are thought to be modulated by local interactions between cells such as intercellular communication. In this paper, we consider only local interactions of these phototactic cells in order to mathematically model this quasi-random motion. We analyze an experimental data set to illustrate the presence of quasi-random motion and then derive a stochastic dynamic particle system modeling interacting phototactic cells. The simulations of our model are consistent with experimentally observed phototactic motion. PMID:22713858
Remote Sensing/gis Integration for Site Planning and Resource Management
NASA Technical Reports Server (NTRS)
Fellows, J. D.
1982-01-01
The development of an interactive/batch gridded information system (array of cells georeferenced to USGS quad sheets) and interfacing application programs (e.g., hydrologic models) is discussed. This system allows non-programer users to request any data set(s) stored in the data base by inputing any random polygon's (watershed, political zone) boundary points. The data base information contained within this polygon can be used to produce maps, statistics, and define model parameters for the area. Present/proposed conditions for the area may be compared by inputing future usage (land cover, soils, slope, etc.). This system, known as the Hydrologic Analysis Program (HAP), is especially effective in the real time analysis of proposed land cover changes on runoff hydrographs and graphics/statistics resource inventories of random study area/watersheds.
Central Limit Theorem for Exponentially Quasi-local Statistics of Spin Models on Cayley Graphs
NASA Astrophysics Data System (ADS)
Reddy, Tulasi Ram; Vadlamani, Sreekar; Yogeshwaran, D.
2018-04-01
Central limit theorems for linear statistics of lattice random fields (including spin models) are usually proven under suitable mixing conditions or quasi-associativity. Many interesting examples of spin models do not satisfy mixing conditions, and on the other hand, it does not seem easy to show central limit theorem for local statistics via quasi-associativity. In this work, we prove general central limit theorems for local statistics and exponentially quasi-local statistics of spin models on discrete Cayley graphs with polynomial growth. Further, we supplement these results by proving similar central limit theorems for random fields on discrete Cayley graphs taking values in a countable space, but under the stronger assumptions of α -mixing (for local statistics) and exponential α -mixing (for exponentially quasi-local statistics). All our central limit theorems assume a suitable variance lower bound like many others in the literature. We illustrate our general central limit theorem with specific examples of lattice spin models and statistics arising in computational topology, statistical physics and random networks. Examples of clustering spin models include quasi-associated spin models with fast decaying covariances like the off-critical Ising model, level sets of Gaussian random fields with fast decaying covariances like the massive Gaussian free field and determinantal point processes with fast decaying kernels. Examples of local statistics include intrinsic volumes, face counts, component counts of random cubical complexes while exponentially quasi-local statistics include nearest neighbour distances in spin models and Betti numbers of sub-critical random cubical complexes.
NASA Astrophysics Data System (ADS)
Rychlik, Igor; Mao, Wengang
2018-02-01
The wind speed variability in the North Atlantic has been successfully modelled using a spatio-temporal transformed Gaussian field. However, this type of model does not correctly describe the extreme wind speeds attributed to tropical storms and hurricanes. In this study, the transformed Gaussian model is further developed to include the occurrence of severe storms. In this new model, random components are added to the transformed Gaussian field to model rare events with extreme wind speeds. The resulting random field is locally stationary and homogeneous. The localized dependence structure is described by time- and space-dependent parameters. The parameters have a natural physical interpretation. To exemplify its application, the model is fitted to the ECMWF ERA-Interim reanalysis data set. The model is applied to compute long-term wind speed distributions and return values, e.g., 100- or 1000-year extreme wind speeds, and to simulate random wind speed time series at a fixed location or spatio-temporal wind fields around that location.
Polynomial order selection in random regression models via penalizing adaptively the likelihood.
Corrales, J D; Munilla, S; Cantet, R J C
2015-08-01
Orthogonal Legendre polynomials (LP) are used to model the shape of additive genetic and permanent environmental effects in random regression models (RRM). Frequently, the Akaike (AIC) and the Bayesian (BIC) information criteria are employed to select LP order. However, it has been theoretically shown that neither AIC nor BIC is simultaneously optimal in terms of consistency and efficiency. Thus, the goal was to introduce a method, 'penalizing adaptively the likelihood' (PAL), as a criterion to select LP order in RRM. Four simulated data sets and real data (60,513 records, 6675 Colombian Holstein cows) were employed. Nested models were fitted to the data, and AIC, BIC and PAL were calculated for all of them. Results showed that PAL and BIC identified with probability of one the true LP order for the additive genetic and permanent environmental effects, but AIC tended to favour over parameterized models. Conversely, when the true model was unknown, PAL selected the best model with higher probability than AIC. In the latter case, BIC never favoured the best model. To summarize, PAL selected a correct model order regardless of whether the 'true' model was within the set of candidates. © 2015 Blackwell Verlag GmbH.
Martínez-Martínez, F; Rupérez-Moreno, M J; Martínez-Sober, M; Solves-Llorens, J A; Lorente, D; Serrano-López, A J; Martínez-Sanchis, S; Monserrat, C; Martín-Guerrero, J D
2017-11-01
This work presents a data-driven method to simulate, in real-time, the biomechanical behavior of the breast tissues in some image-guided interventions such as biopsies or radiotherapy dose delivery as well as to speed up multimodal registration algorithms. Ten real breasts were used for this work. Their deformation due to the displacement of two compression plates was simulated off-line using the finite element (FE) method. Three machine learning models were trained with the data from those simulations. Then, they were used to predict in real-time the deformation of the breast tissues during the compression. The models were a decision tree and two tree-based ensemble methods (extremely randomized trees and random forest). Two different experimental setups were designed to validate and study the performance of these models under different conditions. The mean 3D Euclidean distance between nodes predicted by the models and those extracted from the FE simulations was calculated to assess the performance of the models in the validation set. The experiments proved that extremely randomized trees performed better than the other two models. The mean error committed by the three models in the prediction of the nodal displacements was under 2 mm, a threshold usually set for clinical applications. The time needed for breast compression prediction is sufficiently short to allow its use in real-time (<0.2 s). Copyright © 2017 Elsevier Ltd. All rights reserved.
Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry
2014-01-01
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914
Shah, Anoop D; Bartlett, Jonathan W; Carpenter, James; Nicholas, Owen; Hemingway, Harry
2014-03-15
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
Empowering nurses for work engagement and health in hospital settings.
Laschinger, Heather K Spence; Finegan, Joan
2005-10-01
Employee empowerment has become an increasingly important factor in determining employee health and wellbeing in restructured healthcare settings. The authors tested a theoretical model which specified the relationships among structural empowerment, 6 areas of worklife that promote employee engagement, and staff nurses' physical and mental health. A predictive, non-experimental design was used to test the model in a random sample of staff nurses. The authors discuss their findings and the implication for nurse administrators.
Asymptotic Equivalence of Probability Measures and Stochastic Processes
NASA Astrophysics Data System (ADS)
Touchette, Hugo
2018-03-01
Let P_n and Q_n be two probability measures representing two different probabilistic models of some system (e.g., an n-particle equilibrium system, a set of random graphs with n vertices, or a stochastic process evolving over a time n) and let M_n be a random variable representing a "macrostate" or "global observable" of that system. We provide sufficient conditions, based on the Radon-Nikodym derivative of P_n and Q_n, for the set of typical values of M_n obtained relative to P_n to be the same as the set of typical values obtained relative to Q_n in the limit n→ ∞. This extends to general probability measures and stochastic processes the well-known thermodynamic-limit equivalence of the microcanonical and canonical ensembles, related mathematically to the asymptotic equivalence of conditional and exponentially-tilted measures. In this more general sense, two probability measures that are asymptotically equivalent predict the same typical or macroscopic properties of the system they are meant to model.
Stevens, Forrest R; Gaughan, Andrea E; Linard, Catherine; Tatem, Andrew J
2015-01-01
High resolution, contemporary data on human population distributions are vital for measuring impacts of population growth, monitoring human-environment interactions and for planning and policy development. Many methods are used to disaggregate census data and predict population densities for finer scale, gridded population data sets. We present a new semi-automated dasymetric modeling approach that incorporates detailed census and ancillary data in a flexible, "Random Forest" estimation technique. We outline the combination of widely available, remotely-sensed and geospatial data that contribute to the modeled dasymetric weights and then use the Random Forest model to generate a gridded prediction of population density at ~100 m spatial resolution. This prediction layer is then used as the weighting surface to perform dasymetric redistribution of the census counts at a country level. As a case study we compare the new algorithm and its products for three countries (Vietnam, Cambodia, and Kenya) with other common gridded population data production methodologies. We discuss the advantages of the new method and increases over the accuracy and flexibility of those previous approaches. Finally, we outline how this algorithm will be extended to provide freely-available gridded population data sets for Africa, Asia and Latin America.
McKim, James M.; Hartung, Thomas; Kleensang, Andre; Sá-Rocha, Vanessa
2016-01-01
Supervised learning methods promise to improve integrated testing strategies (ITS), but must be adjusted to handle high dimensionality and dose–response data. ITS approaches are currently fueled by the increasing mechanistic understanding of adverse outcome pathways (AOP) and the development of tests reflecting these mechanisms. Simple approaches to combine skin sensitization data sets, such as weight of evidence, fail due to problems in information redundancy and high dimension-ality. The problem is further amplified when potency information (dose/response) of hazards would be estimated. Skin sensitization currently serves as the foster child for AOP and ITS development, as legislative pressures combined with a very good mechanistic understanding of contact dermatitis have led to test development and relatively large high-quality data sets. We curated such a data set and combined a recursive variable selection algorithm to evaluate the information available through in silico, in chemico and in vitro assays. Chemical similarity alone could not cluster chemicals’ potency, and in vitro models consistently ranked high in recursive feature elimination. This allows reducing the number of tests included in an ITS. Next, we analyzed with a hidden Markov model that takes advantage of an intrinsic inter-relationship among the local lymph node assay classes, i.e. the monotonous connection between local lymph node assay and dose. The dose-informed random forest/hidden Markov model was superior to the dose-naive random forest model on all data sets. Although balanced accuracy improvement may seem small, this obscures the actual improvement in misclassifications as the dose-informed hidden Markov model strongly reduced "false-negatives" (i.e. extreme sensitizers as non-sensitizer) on all data sets. PMID:26046447
Luechtefeld, Thomas; Maertens, Alexandra; McKim, James M; Hartung, Thomas; Kleensang, Andre; Sá-Rocha, Vanessa
2015-11-01
Supervised learning methods promise to improve integrated testing strategies (ITS), but must be adjusted to handle high dimensionality and dose-response data. ITS approaches are currently fueled by the increasing mechanistic understanding of adverse outcome pathways (AOP) and the development of tests reflecting these mechanisms. Simple approaches to combine skin sensitization data sets, such as weight of evidence, fail due to problems in information redundancy and high dimensionality. The problem is further amplified when potency information (dose/response) of hazards would be estimated. Skin sensitization currently serves as the foster child for AOP and ITS development, as legislative pressures combined with a very good mechanistic understanding of contact dermatitis have led to test development and relatively large high-quality data sets. We curated such a data set and combined a recursive variable selection algorithm to evaluate the information available through in silico, in chemico and in vitro assays. Chemical similarity alone could not cluster chemicals' potency, and in vitro models consistently ranked high in recursive feature elimination. This allows reducing the number of tests included in an ITS. Next, we analyzed with a hidden Markov model that takes advantage of an intrinsic inter-relationship among the local lymph node assay classes, i.e. the monotonous connection between local lymph node assay and dose. The dose-informed random forest/hidden Markov model was superior to the dose-naive random forest model on all data sets. Although balanced accuracy improvement may seem small, this obscures the actual improvement in misclassifications as the dose-informed hidden Markov model strongly reduced " false-negatives" (i.e. extreme sensitizers as non-sensitizer) on all data sets. Copyright © 2015 John Wiley & Sons, Ltd.
Signs of universality in the structure of culture
NASA Astrophysics Data System (ADS)
Băbeanu, Alexandru-Ionuţ; Talman, Leandros; Garlaschelli, Diego
2017-11-01
Understanding the dynamics of opinions, preferences and of culture as whole requires more use of empirical data than has been done so far. It is clear that an important role in driving this dynamics is played by social influence, which is the essential ingredient of many quantitative models. Such models require that all traits are fixed when specifying the "initial cultural state". Typically, this initial state is randomly generated, from a uniform distribution over the set of possible combinations of traits. However, recent work has shown that the outcome of social influence dynamics strongly depends on the nature of the initial state. If the latter is sampled from empirical data instead of being generated in a uniformly random way, a higher level of cultural diversity is found after long-term dynamics, for the same level of propensity towards collective behavior in the short-term. Moreover, if the initial state is randomized by shuffling the empirical traits among people, the level of long-term cultural diversity is in-between those obtained for the empirical and uniformly random counterparts. The current study repeats the analysis for multiple empirical data sets, showing that the results are remarkably similar, although the matrix of correlations between cultural variables clearly differs across data sets. This points towards robust structural properties inherent in empirical cultural states, possibly due to universal laws governing the dynamics of culture in the real world. The results also suggest that this dynamics might be characterized by criticality and involve mechanisms beyond social influence.
Standardized Mean Differences in Two-Level Cross-Classified Random Effects Models
ERIC Educational Resources Information Center
Lai, Mark H. C.; Kwok, Oi-Man
2014-01-01
Multilevel modeling techniques are becoming more popular in handling data with multilevel structure in educational and behavioral research. Recently, researchers have paid more attention to cross-classified data structure that naturally arises in educational settings. However, unlike traditional single-level research, methodological studies about…
Ensemble Methods for Classification of Physical Activities from Wrist Accelerometry.
Chowdhury, Alok Kumar; Tjondronegoro, Dian; Chandran, Vinod; Trost, Stewart G
2017-09-01
To investigate whether the use of ensemble learning algorithms improve physical activity recognition accuracy compared to the single classifier algorithms, and to compare the classification accuracy achieved by three conventional ensemble machine learning methods (bagging, boosting, random forest) and a custom ensemble model comprising four algorithms commonly used for activity recognition (binary decision tree, k nearest neighbor, support vector machine, and neural network). The study used three independent data sets that included wrist-worn accelerometer data. For each data set, a four-step classification framework consisting of data preprocessing, feature extraction, normalization and feature selection, and classifier training and testing was implemented. For the custom ensemble, decisions from the single classifiers were aggregated using three decision fusion methods: weighted majority vote, naïve Bayes combination, and behavior knowledge space combination. Classifiers were cross-validated using leave-one subject out cross-validation and compared on the basis of average F1 scores. In all three data sets, ensemble learning methods consistently outperformed the individual classifiers. Among the conventional ensemble methods, random forest models provided consistently high activity recognition; however, the custom ensemble model using weighted majority voting demonstrated the highest classification accuracy in two of the three data sets. Combining multiple individual classifiers using conventional or custom ensemble learning methods can improve activity recognition accuracy from wrist-worn accelerometer data.
A closer look at cross-validation for assessing the accuracy of gene regulatory networks and models.
Tabe-Bordbar, Shayan; Emad, Amin; Zhao, Sihai Dave; Sinha, Saurabh
2018-04-26
Cross-validation (CV) is a technique to assess the generalizability of a model to unseen data. This technique relies on assumptions that may not be satisfied when studying genomics datasets. For example, random CV (RCV) assumes that a randomly selected set of samples, the test set, well represents unseen data. This assumption doesn't hold true where samples are obtained from different experimental conditions, and the goal is to learn regulatory relationships among the genes that generalize beyond the observed conditions. In this study, we investigated how the CV procedure affects the assessment of supervised learning methods used to learn gene regulatory networks (or in other applications). We compared the performance of a regression-based method for gene expression prediction estimated using RCV with that estimated using a clustering-based CV (CCV) procedure. Our analysis illustrates that RCV can produce over-optimistic estimates of the model's generalizability compared to CCV. Next, we defined the 'distinctness' of test set from training set and showed that this measure is predictive of performance of the regression method. Finally, we introduced a simulated annealing method to construct partitions with gradually increasing distinctness and showed that performance of different gene expression prediction methods can be better evaluated using this method.
Kinematic Methods of Designing Free Form Shells
NASA Astrophysics Data System (ADS)
Korotkiy, V. A.; Khmarova, L. I.
2017-11-01
The geometrical shell model is formed in light of the set requirements expressed through surface parameters. The shell is modelled using the kinematic method according to which the shell is formed as a continuous one-parameter set of curves. The authors offer a kinematic method based on the use of second-order curves with a variable eccentricity as a form-making element. Additional guiding ruled surfaces are used to control the designed surface form. The authors made a software application enabling to plot a second-order curve specified by a random set of five coplanar points and tangents.
Seehaus, Frank; Schwarze, Michael; Flörkemeier, Thilo; von Lewinski, Gabriela; Kaptein, Bart L; Jakubowitz, Eike; Hurschler, Christof
2016-05-01
Implant migration can be accurately quantified by model-based Roentgen stereophotogrammetric analysis (RSA), using an implant surface model to locate the implant relative to the bone. In a clinical situation, a single reverse engineering (RE) model for each implant type and size is used. It is unclear to what extent the accuracy and precision of migration measurement is affected by implant manufacturing variability unaccounted for by a single representative model. Individual RE models were generated for five short-stem hip implants of the same type and size. Two phantom analyses and one clinical analysis were performed: "Accuracy-matched models": one stem was assessed, and the results from the original RE model were compared with randomly selected models. "Accuracy-random model": each of the five stems was assessed and analyzed using one randomly selected RE model. "Precision-clinical setting": implant migration was calculated for eight patients, and all five available RE models were applied to each case. For the two phantom experiments, the 95%CI of the bias ranged from -0.28 mm to 0.30 mm for translation and -2.3° to 2.5° for rotation. In the clinical setting, precision is less than 0.5 mm and 1.2° for translation and rotation, respectively, except for rotations about the proximodistal axis (<4.1°). High accuracy and precision of model-based RSA can be achieved and are not biased by using a single representative RE model. At least for implants similar in shape to the investigated short-stem, individual models are not necessary. © 2015 Orthopaedic Research Society. Published by Wiley Periodicals, Inc. J Orthop Res 34:903-910, 2016. © 2015 Orthopaedic Research Society. Published by Wiley Periodicals, Inc.
A Comparison of Forecast Error Generators for Modeling Wind and Load Uncertainty
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lu, Ning; Diao, Ruisheng; Hafen, Ryan P.
2013-12-18
This paper presents four algorithms to generate random forecast error time series, including a truncated-normal distribution model, a state-space based Markov model, a seasonal autoregressive moving average (ARMA) model, and a stochastic-optimization based model. The error time series are used to create real-time (RT), hour-ahead (HA), and day-ahead (DA) wind and load forecast time series that statistically match historically observed forecasting data sets, used for variable generation integration studies. A comparison is made using historical DA load forecast and actual load values to generate new sets of DA forecasts with similar stoical forecast error characteristics. This paper discusses and comparesmore » the capabilities of each algorithm to preserve the characteristics of the historical forecast data sets.« less
Adapted random sampling patterns for accelerated MRI.
Knoll, Florian; Clason, Christian; Diwoky, Clemens; Stollberger, Rudolf
2011-02-01
Variable density random sampling patterns have recently become increasingly popular for accelerated imaging strategies, as they lead to incoherent aliasing artifacts. However, the design of these sampling patterns is still an open problem. Current strategies use model assumptions like polynomials of different order to generate a probability density function that is then used to generate the sampling pattern. This approach relies on the optimization of design parameters which is very time consuming and therefore impractical for daily clinical use. This work presents a new approach that generates sampling patterns by making use of power spectra of existing reference data sets and hence requires neither parameter tuning nor an a priori mathematical model of the density of sampling points. The approach is validated with downsampling experiments, as well as with accelerated in vivo measurements. The proposed approach is compared with established sampling patterns, and the generalization potential is tested by using a range of reference images. Quantitative evaluation is performed for the downsampling experiments using RMS differences to the original, fully sampled data set. Our results demonstrate that the image quality of the method presented in this paper is comparable to that of an established model-based strategy when optimization of the model parameter is carried out and yields superior results to non-optimized model parameters. However, no random sampling pattern showed superior performance when compared to conventional Cartesian subsampling for the considered reconstruction strategy.
The Construct of Creativity: Structural Model for Self-Reported Creativity Ratings
ERIC Educational Resources Information Center
Kaufman, James C.; Cole, Jason C.; Baer, John
2009-01-01
Several thousand subjects completed self-report questionnaires about their own creativity in 56 discrete domains. This sample was then randomly divided into three subsamples that were subject to factor analyses that compared an oblique model (with a set of correlated factors) and a hierarchical model (with a single second-order, or hierarchical,…
Technical Note: Introduction of variance component analysis to setup error analysis in radiotherapy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Matsuo, Yukinori, E-mail: ymatsuo@kuhp.kyoto-u.ac.
Purpose: The purpose of this technical note is to introduce variance component analysis to the estimation of systematic and random components in setup error of radiotherapy. Methods: Balanced data according to the one-factor random effect model were assumed. Results: Analysis-of-variance (ANOVA)-based computation was applied to estimate the values and their confidence intervals (CIs) for systematic and random errors and the population mean of setup errors. The conventional method overestimates systematic error, especially in hypofractionated settings. The CI for systematic error becomes much wider than that for random error. The ANOVA-based estimation can be extended to a multifactor model considering multiplemore » causes of setup errors (e.g., interpatient, interfraction, and intrafraction). Conclusions: Variance component analysis may lead to novel applications to setup error analysis in radiotherapy.« less
Liu, Gui-Song; Guo, Hao-Song; Pan, Tao; Wang, Ji-Hua; Cao, Gan
2014-10-01
Based on Savitzky-Golay (SG) smoothing screening, principal component analysis (PCA) combined with separately supervised linear discriminant analysis (LDA) and unsupervised hierarchical clustering analysis (HCA) were used for non-destructive visible and near-infrared (Vis-NIR) detection for breed screening of transgenic sugarcane. A random and stability-dependent framework of calibration, prediction, and validation was proposed. A total of 456 samples of sugarcane leaves planting in the elongating stage were collected from the field, which was composed of 306 transgenic (positive) samples containing Bt and Bar gene and 150 non-transgenic (negative) samples. A total of 156 samples (negative 50 and positive 106) were randomly selected as the validation set; the remaining samples (negative 100 and positive 200, a total of 300 samples) were used as the modeling set, and then the modeling set was subdivided into calibration (negative 50 and positive 100, a total of 150 samples) and prediction sets (negative 50 and positive 100, a total of 150 samples) for 50 times. The number of SG smoothing points was ex- panded, while some modes of higher derivative were removed because of small absolute value, and a total of 264 smoothing modes were used for screening. The pairwise combinations of first three principal components were used, and then the optimal combination of principal components was selected according to the model effect. Based on all divisions of calibration and prediction sets and all SG smoothing modes, the SG-PCA-LDA and SG-PCA-HCA models were established, the model parameters were optimized based on the average prediction effect for all divisions to produce modeling stability. Finally, the model validation was performed by validation set. With SG smoothing, the modeling accuracy and stability of PCA-LDA, PCA-HCA were signif- icantly improved. For the optimal SG-PCA-LDA model, the recognition rate of positive and negative validation samples were 94.3%, 96.0%; and were 92.5%, 98.0% for the optimal SG-PCA-LDA model, respectively. Vis-NIR spectro- scopic pattern recognition combined with SG smoothing could be used for accurate recognition of transgenic sugarcane leaves, and provided a convenient screening method for transgenic sugarcane breeding.
NASA Astrophysics Data System (ADS)
Monthus, Cécile
2018-03-01
For the many-body-localized phase of random Majorana models, a general strong disorder real-space renormalization procedure known as RSRG-X (Pekker et al 2014 Phys. Rev. X 4 011052) is described to produce the whole set of excited states, via the iterative construction of the local integrals of motion (LIOMs). The RG rules are then explicitly derived for arbitrary quadratic Hamiltonians (free-fermions models) and for the Kitaev chain with local interactions involving even numbers of consecutive Majorana fermions. The emphasis is put on the advantages of the Majorana language over the usual quantum spin language to formulate unified RSRG-X rules.
Hart, Carl R; Reznicek, Nathan J; Wilson, D Keith; Pettit, Chris L; Nykaza, Edward T
2016-05-01
Many outdoor sound propagation models exist, ranging from highly complex physics-based simulations to simplified engineering calculations, and more recently, highly flexible statistical learning methods. Several engineering and statistical learning models are evaluated by using a particular physics-based model, namely, a Crank-Nicholson parabolic equation (CNPE), as a benchmark. Narrowband transmission loss values predicted with the CNPE, based upon a simulated data set of meteorological, boundary, and source conditions, act as simulated observations. In the simulated data set sound propagation conditions span from downward refracting to upward refracting, for acoustically hard and soft boundaries, and low frequencies. Engineering models used in the comparisons include the ISO 9613-2 method, Harmonoise, and Nord2000 propagation models. Statistical learning methods used in the comparisons include bagged decision tree regression, random forest regression, boosting regression, and artificial neural network models. Computed skill scores are relative to sound propagation in a homogeneous atmosphere over a rigid ground. Overall skill scores for the engineering noise models are 0.6%, -7.1%, and 83.8% for the ISO 9613-2, Harmonoise, and Nord2000 models, respectively. Overall skill scores for the statistical learning models are 99.5%, 99.5%, 99.6%, and 99.6% for bagged decision tree, random forest, boosting, and artificial neural network regression models, respectively.
A random matrix approach to credit risk.
Münnix, Michael C; Schäfer, Rudi; Guhr, Thomas
2014-01-01
We estimate generic statistical properties of a structural credit risk model by considering an ensemble of correlation matrices. This ensemble is set up by Random Matrix Theory. We demonstrate analytically that the presence of correlations severely limits the effect of diversification in a credit portfolio if the correlations are not identically zero. The existence of correlations alters the tails of the loss distribution considerably, even if their average is zero. Under the assumption of randomly fluctuating correlations, a lower bound for the estimation of the loss distribution is provided.
A Random Matrix Approach to Credit Risk
Guhr, Thomas
2014-01-01
We estimate generic statistical properties of a structural credit risk model by considering an ensemble of correlation matrices. This ensemble is set up by Random Matrix Theory. We demonstrate analytically that the presence of correlations severely limits the effect of diversification in a credit portfolio if the correlations are not identically zero. The existence of correlations alters the tails of the loss distribution considerably, even if their average is zero. Under the assumption of randomly fluctuating correlations, a lower bound for the estimation of the loss distribution is provided. PMID:24853864
Forecasting Space Weather-Induced GPS Performance Degradation Using Random Forest
NASA Astrophysics Data System (ADS)
Filjar, R.; Filic, M.; Milinkovic, F.
2017-12-01
Space weather and ionospheric dynamics have a profound effect on positioning performance of the Global Satellite Navigation System (GNSS). However, the quantification of that effect is still the subject of scientific activities around the world. In the latest contribution to the understanding of the space weather and ionospheric effects on satellite-based positioning performance, we conducted a study of several candidates for forecasting method for space weather-induced GPS positioning performance deterioration. First, a 5-days set of experimentally collected data was established, encompassing the space weather and ionospheric activity indices (including: the readings of the Sudden Ionospheric Disturbance (SID) monitors, components of geomagnetic field strength, global Kp index, Dst index, GPS-derived Total Electron Content (TEC) samples, standard deviation of TEC samples, and sunspot number) and observations of GPS positioning error components (northing, easting, and height positioning error) derived from the Adriatic Sea IGS reference stations' RINEX raw pseudorange files in quiet space weather periods. This data set was split into the training and test sub-sets. Then, a selected set of supervised machine learning methods based on Random Forest was applied to the experimentally collected data set in order to establish the appropriate regional (the Adriatic Sea) forecasting models for space weather-induced GPS positioning performance deterioration. The forecasting models were developed in the R/rattle statistical programming environment. The forecasting quality of the regional forecasting models developed was assessed, and the conclusions drawn on the advantages and shortcomings of the regional forecasting models for space weather-caused GNSS positioning performance deterioration.
Geometric Modeling of Inclusions as Ellipsoids
NASA Technical Reports Server (NTRS)
Bonacuse, Peter J.
2008-01-01
Nonmetallic inclusions in gas turbine disk alloys can have a significant detrimental impact on fatigue life. Because large inclusions that lead to anomalously low lives occur infrequently, probabilistic approaches can be utilized to avoid the excessively conservative assumption of lifing to a large inclusion in a high stress location. A prerequisite to modeling the impact of inclusions on the fatigue life distribution is a characterization of the inclusion occurrence rate and size distribution. To help facilitate this process, a geometric simulation of the inclusions was devised. To make the simulation problem tractable, the irregularly sized and shaped inclusions were modeled as arbitrarily oriented, three independent dimensioned, ellipsoids. Random orientation of the ellipsoid is accomplished through a series of three orthogonal rotations of axes. In this report, a set of mathematical models for the following parameters are described: the intercepted area of a randomly sectioned ellipsoid, the dimensions and orientation of the intercepted ellipse, the area of a randomly oriented sectioned ellipse, the depth and width of a randomly oriented sectioned ellipse, and the projected area of a randomly oriented ellipsoid. These parameters are necessary to determine an inclusion s potential to develop a propagating fatigue crack. Without these mathematical models, computationally expensive search algorithms would be required to compute these parameters.
Jones, Andrew S; Taktak, Azzam G F; Helliwell, Timothy R; Fenton, John E; Birchall, Martin A; Husband, David J; Fisher, Anthony C
2006-06-01
The accepted method of modelling and predicting failure/survival, Cox's proportional hazards model, is theoretically inferior to neural network derived models for analysing highly complex systems with large datasets. A blinded comparison of the neural network versus the Cox's model in predicting survival utilising data from 873 treated patients with laryngeal cancer. These were divided randomly and equally into a training set and a study set and Cox's and neural network models applied in turn. Data were then divided into seven sets of binary covariates and the analysis repeated. Overall survival was not significantly different on Kaplan-Meier plot, or with either test model. Although the network produced qualitatively similar results to Cox's model it was significantly more sensitive to differences in survival curves for age and N stage. We propose that neural networks are capable of prediction in systems involving complex interactions between variables and non-linearity.
NASA Astrophysics Data System (ADS)
Eliazar, Iddo; Klafter, Joseph
2008-05-01
Many random populations can be modeled as a countable set of points scattered randomly on the positive half-line. The points may represent magnitudes of earthquakes and tornados, masses of stars, market values of public companies, etc. In this article we explore a specific class of random such populations we coin ` Paretian Poisson processes'. This class is elemental in statistical physics—connecting together, in a deep and fundamental way, diverse issues including: the Poisson distribution of the Law of Small Numbers; Paretian tail statistics; the Fréchet distribution of Extreme Value Theory; the one-sided Lévy distribution of the Central Limit Theorem; scale-invariance, renormalization and fractality; resilience to random perturbations.
Two-lane rural highways safety performance functions.
DOT National Transportation Integrated Search
2016-05-01
This report documents findings from a comprehensive set of safety performance functions developed for the entire : state two-lane rural highway system in Washington. The findings indicate that random parameter models and : heterogeneous negative bino...
NASA Astrophysics Data System (ADS)
Panunzio, Alfonso M.; Puel, G.; Cottereau, R.; Simon, S.; Quost, X.
2017-03-01
This paper describes the construction of a stochastic model of urban railway track geometry irregularities, based on experimental data. The considered irregularities are track gauge, superelevation, horizontal and vertical curvatures. They are modelled as random fields whose statistical properties are extracted from a large set of on-track measurements of the geometry of an urban railway network. About 300-1000 terms are used in the Karhunen-Loève/Polynomial Chaos expansions to represent the random fields with appropriate accuracy. The construction of the random fields is then validated by comparing on-track measurements of the contact forces and numerical dynamics simulations for different operational conditions (train velocity and car load) and horizontal layouts (alignment, curve). The dynamics simulations are performed both with and without randomly generated geometrical irregularities for the track. The power spectrum densities obtained from the dynamics simulations with the model of geometrical irregularities compare extremely well with those obtained from the experimental contact forces. Without irregularities, the spectrum is 10-50 dB too low.
Urdea, Mickey; Kolberg, Janice; Wilber, Judith; Gerwien, Robert; Moler, Edward; Rowe, Michael; Jorgensen, Paul; Hansen, Torben; Pedersen, Oluf; Jørgensen, Torben; Borch-Johnsen, Knut
2009-01-01
Background Improved identification of subjects at high risk for development of type 2 diabetes would allow preventive interventions to be targeted toward individuals most likely to benefit. In previous research, predictive biomarkers were identified and used to develop multivariate models to assess an individual's risk of developing diabetes. Here we describe the training and validation of the PreDx™ Diabetes Risk Score (DRS) model in a clinical laboratory setting using baseline serum samples from subjects in the Inter99 cohort, a population-based primary prevention study of cardiovascular disease. Methods Among 6784 subjects free of diabetes at baseline, 215 subjects progressed to diabetes (converters) during five years of follow-up. A nested case-control study was performed using serum samples from 202 converters and 597 randomly selected nonconverters. Samples were randomly assigned to equally sized training and validation sets. Seven biomarkers were measured using assays developed for use in a clinical reference laboratory. Results The PreDx DRS model performed better on the training set (area under the curve [AUC] = 0.837) than fasting plasma glucose alone (AUC = 0.779). When applied to the sequestered validation set, the PreDx DRS showed the same performance (AUC = 0.838), thus validating the model. This model had a better AUC than any other single measure from a fasting sample. Moreover, the model provided further risk stratification among high-risk subpopulations with impaired fasting glucose or metabolic syndrome. Conclusions The PreDx DRS provides the absolute risk of diabetes conversion in five years for subjects identified to be “at risk” using the clinical factors. PMID:20144324
Urdea, Mickey; Kolberg, Janice; Wilber, Judith; Gerwien, Robert; Moler, Edward; Rowe, Michael; Jorgensen, Paul; Hansen, Torben; Pedersen, Oluf; Jørgensen, Torben; Borch-Johnsen, Knut
2009-07-01
Improved identification of subjects at high risk for development of type 2 diabetes would allow preventive interventions to be targeted toward individuals most likely to benefit. In previous research, predictive biomarkers were identified and used to develop multivariate models to assess an individual's risk of developing diabetes. Here we describe the training and validation of the PreDx Diabetes Risk Score (DRS) model in a clinical laboratory setting using baseline serum samples from subjects in the Inter99 cohort, a population-based primary prevention study of cardiovascular disease. Among 6784 subjects free of diabetes at baseline, 215 subjects progressed to diabetes (converters) during five years of follow-up. A nested case-control study was performed using serum samples from 202 converters and 597 randomly selected nonconverters. Samples were randomly assigned to equally sized training and validation sets. Seven biomarkers were measured using assays developed for use in a clinical reference laboratory. The PreDx DRS model performed better on the training set (area under the curve [AUC] = 0.837) than fasting plasma glucose alone (AUC = 0.779). When applied to the sequestered validation set, the PreDx DRS showed the same performance (AUC = 0.838), thus validating the model. This model had a better AUC than any other single measure from a fasting sample. Moreover, the model provided further risk stratification among high-risk subpopulations with impaired fasting glucose or metabolic syndrome. The PreDx DRS provides the absolute risk of diabetes conversion in five years for subjects identified to be "at risk" using the clinical factors. Copyright 2009 Diabetes Technology Society.
Zhan, Xue-yan; Zhao, Na; Lin, Zhao-zhou; Wu, Zhi-sheng; Yuan, Rui-juan; Qiao, Yan-jiang
2014-12-01
The appropriate algorithm for calibration set selection was one of the key technologies for a good NIR quantitative model. There are different algorithms for calibration set selection, such as Random Sampling (RS) algorithm, Conventional Selection (CS) algorithm, Kennard-Stone(KS) algorithm and Sample set Portioning based on joint x-y distance (SPXY) algorithm, et al. However, there lack systematic comparisons between two algorithms of the above algorithms. The NIR quantitative models to determine the asiaticoside content in Centella total glucosides were established in the present paper, of which 7 indexes were classified and selected, and the effects of CS algorithm, KS algorithm and SPXY algorithm for calibration set selection on the accuracy and robustness of NIR quantitative models were investigated. The accuracy indexes of NIR quantitative models with calibration set selected by SPXY algorithm were significantly different from that with calibration set selected by CS algorithm or KS algorithm, while the robustness indexes, such as RMSECV and |RMSEP-RMSEC|, were not significantly different. Therefore, SPXY algorithm for calibration set selection could improve the predicative accuracy of NIR quantitative models to determine asiaticoside content in Centella total glucosides, and have no significant effect on the robustness of the models, which provides a reference to determine the appropriate algorithm for calibration set selection when NIR quantitative models are established for the solid system of traditional Chinese medcine.
NASA Astrophysics Data System (ADS)
Ángel López Comino, José; Stich, Daniel; Ferreira, Ana M. G.; Morales Soto, José
2015-04-01
The inversion of seismic data for extended fault slip distributions provides us detailed models of earthquake sources. The validity of the solutions depends on the fit between observed and synthetic seismograms generated with the source model. However, there may exist more than one model that fit the data in a similar way, leading to a multiplicity of solutions. This underdetermined problem has been analyzed and studied by several authors, who agree that inverting for a single best model may become overly dependent on the details of the procedure. We have addressed this resolution problem by using a global search that scans the solutions domain using random slipmaps, applying a Popperian inversion strategy that involves the generation of a representative set of slip distributions. The proposed technique solves the forward problem for a large set of models calculating their corresponding synthetic seismograms. Then, we propose to perform extended fault inversion through falsification, that is, falsify inappropriate trial models that do not reproduce the data within a reasonable level of mismodelling. The remainder of surviving trial models forms our set of coequal solutions. Thereby the ambiguities that might exist can be detected by taking a look at the solutions, allowing for an efficient assessment of the resolution. The solution set may contain only members with similar slip distributions, or else uncover some fundamental ambiguity like, for example, different patterns of main slip patches or different patterns of rupture propagation. For a feasibility study, the proposed resolution test has been evaluated using teleseismic body wave recordings from the September 5th 2012 Nicoya, Costa Rica earthquake. Note that the inversion strategy can be applied to any type of seismic, geodetic or tsunami data for which we can handle the forward problem. A 2D von Karman distribution is used to describe the spectrum of heterogeneity in slipmaps, and we generate possible models by spectral synthesis for random phase, keeping the rake angle, rupture velocity and slip velocity function fixed. The 2012 Nicoya earthquake turns out to be relatively well constrained from 50 teleseismic waveforms. The solution set contains 252 out of 10.000 trial models with normalized L1-fit within 5 percent from the global minimum. The set includes only similar solutions -a single centred slip patch- with minor differences. Uncertainties are related to the details of the slip maximum, including the amount of peak slip (2m to 3.5m), as well as the characteristics of peripheral slip below 1 m. Synthetic tests suggest that slip patterns like Nicoya may be a fortunate case, while it may be more difficult to unambiguously reconstruct more distributed slip from teleseismic data.
Tóth, Gergely; Bodai, Zsolt; Héberger, Károly
2013-10-01
Coefficient of determination (R (2)) and its leave-one-out cross-validated analogue (denoted by Q (2) or R cv (2) ) are the most frequantly published values to characterize the predictive performance of models. In this article we use R (2) and Q (2) in a reversed aspect to determine uncommon points, i.e. influential points in any data sets. The term (1 - Q (2))/(1 - R (2)) corresponds to the ratio of predictive residual sum of squares and the residual sum of squares. The ratio correlates to the number of influential points in experimental and random data sets. We propose an (approximate) F test on (1 - Q (2))/(1 - R (2)) term to quickly pre-estimate the presence of influential points in training sets of models. The test is founded upon the routinely calculated Q (2) and R (2) values and warns the model builders to verify the training set, to perform influence analysis or even to change to robust modeling.
Kaspi, Omer; Yosipof, Abraham; Senderowitz, Hanoch
2017-06-06
An important aspect of chemoinformatics and material-informatics is the usage of machine learning algorithms to build Quantitative Structure Activity Relationship (QSAR) models. The RANdom SAmple Consensus (RANSAC) algorithm is a predictive modeling tool widely used in the image processing field for cleaning datasets from noise. RANSAC could be used as a "one stop shop" algorithm for developing and validating QSAR models, performing outlier removal, descriptors selection, model development and predictions for test set samples using applicability domain. For "future" predictions (i.e., for samples not included in the original test set) RANSAC provides a statistical estimate for the probability of obtaining reliable predictions, i.e., predictions within a pre-defined number of standard deviations from the true values. In this work we describe the first application of RNASAC in material informatics, focusing on the analysis of solar cells. We demonstrate that for three datasets representing different metal oxide (MO) based solar cell libraries RANSAC-derived models select descriptors previously shown to correlate with key photovoltaic properties and lead to good predictive statistics for these properties. These models were subsequently used to predict the properties of virtual solar cells libraries highlighting interesting dependencies of PV properties on MO compositions.
Should multiple imputation be the method of choice for handling missing data in randomized trials?
Sullivan, Thomas R; White, Ian R; Salter, Amy B; Ryan, Philip; Lee, Katherine J
2016-01-01
The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see multiple imputation used to handle missing data. However in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation is carried out separately by randomized group. PMID:28034175
Should multiple imputation be the method of choice for handling missing data in randomized trials?
Sullivan, Thomas R; White, Ian R; Salter, Amy B; Ryan, Philip; Lee, Katherine J
2016-01-01
The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see multiple imputation used to handle missing data. However in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation is carried out separately by randomized group.
Stochastic arbitrage return and its implication for option pricing
NASA Astrophysics Data System (ADS)
Fedotov, Sergei; Panayides, Stephanos
2005-01-01
The purpose of this work is to explore the role that random arbitrage opportunities play in pricing financial derivatives. We use a non-equilibrium model to set up a stochastic portfolio, and for the random arbitrage return, we choose a stationary ergodic random process rapidly varying in time. We exploit the fact that option price and random arbitrage returns change on different time scales which allows us to develop an asymptotic pricing theory involving the central limit theorem for random processes. We restrict ourselves to finding pricing bands for options rather than exact prices. The resulting pricing bands are shown to be independent of the detailed statistical characteristics of the arbitrage return. We find that the volatility “smile” can also be explained in terms of random arbitrage opportunities.
Identifying differentially expressed genes in cancer patients using a non-parameter Ising model.
Li, Xumeng; Feltus, Frank A; Sun, Xiaoqian; Wang, James Z; Luo, Feng
2011-10-01
Identification of genes and pathways involved in diseases and physiological conditions is a major task in systems biology. In this study, we developed a novel non-parameter Ising model to integrate protein-protein interaction network and microarray data for identifying differentially expressed (DE) genes. We also proposed a simulated annealing algorithm to find the optimal configuration of the Ising model. The Ising model was applied to two breast cancer microarray data sets. The results showed that more cancer-related DE sub-networks and genes were identified by the Ising model than those by the Markov random field model. Furthermore, cross-validation experiments showed that DE genes identified by Ising model can improve classification performance compared with DE genes identified by Markov random field model. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Bernal's road to random packing and the structure of liquids
NASA Astrophysics Data System (ADS)
Finney, John L.
2013-11-01
Until the 1960s, liquids were generally regarded as either dense gases or disordered solids, and theoretical attempts at understanding their structures and properties were largely based on those concepts. Bernal, himself a crystallographer, was unhappy with either approach, preferring to regard simple liquids as 'homogeneous, coherent and essentially irregular assemblages of molecules containing no crystalline regions'. He set about realizing this conceptual model through a detailed examination of the structures and properties of random packings of spheres. In order to test the relevance of the model to real liquids, ways had to be found to realize and characterize random packings. This was at a time when computing was slow and in its infancy, so he and his collaborators set about building models in the laboratory, and examining aspects of their structures in order to characterize them in ways which would enable comparison with the properties of real liquids. Some of the imaginative - often time consuming and frustrating - routes followed are described, as well the comparisons made with the properties of simple liquids. With the increase of the power of computers in the 1960s, computational approaches became increasingly exploited in random packing studies. This enabled the use of packing concepts, and the tools developed to characterize them, in understanding systems as diverse as metallic glasses, crystal-liquid interfaces, protein structures, enzyme-substrate interactions and the distribution of galaxies, as well as their exploitation in, for example, oil extraction, understanding chromatographic separation columns, and packed beds in industrial processes.
Damage Propagation Modeling for Aircraft Engine Prognostics
NASA Technical Reports Server (NTRS)
Saxena, Abhinav; Goebel, Kai; Simon, Don; Eklund, Neil
2008-01-01
This paper describes how damage propagation can be modeled within the modules of aircraft gas turbine engines. To that end, response surfaces of all sensors are generated via a thermo-dynamical simulation model for the engine as a function of variations of flow and efficiency of the modules of interest. An exponential rate of change for flow and efficiency loss was imposed for each data set, starting at a randomly chosen initial deterioration set point. The rate of change of the flow and efficiency denotes an otherwise unspecified fault with increasingly worsening effect. The rates of change of the faults were constrained to an upper threshold but were otherwise chosen randomly. Damage propagation was allowed to continue until a failure criterion was reached. A health index was defined as the minimum of several superimposed operational margins at any given time instant and the failure criterion is reached when health index reaches zero. Output of the model was the time series (cycles) of sensed measurements typically available from aircraft gas turbine engines. The data generated were used as challenge data for the Prognostics and Health Management (PHM) data competition at PHM 08.
Dietary interventions to prevent and manage diabetes in worksite settings: a meta-analysis.
Shrestha, Archana; Karmacharya, Biraj Man; Khudyakov, Polyna; Weber, Mary Beth; Spiegelman, Donna
2018-01-25
The translation of lifestyle intervention to improve glucose tolerance into the workplace has been rare. The objective of this meta-analysis is to summarize the evidence for the effectiveness of dietary interventions in worksite settings on lowering blood sugar levels. We searched for studies in PubMed, Embase, Econlit, Ovid, Cochrane, Web of Science, and Cumulative Index to Nursing and Allied Health Literature. Search terms were as follows: (1) Exposure-based: nutrition/diet/dietary intervention/health promotion/primary prevention/health behavior/health education/food /program evaluation; (2) Outcome-based: diabetes/hyperglycemia/glucose/HbA1c/glycated hemoglobin; and (3) Setting-based: workplace/worksite/occupational/industry/job/employee. We manually searched review articles and reference lists of articles identified from 1969 to December 2016. We tested for between-studies heterogeneity and calculated the pooled effect sizes for changes in HbA1c (%) and fasting glucose (mg/dl) using random effect models for meta-analysis in 2016. A total of 17 articles out of 1663 initially selected articles were included in the meta-analysis. With a random-effects model, worksite dietary interventions led to a pooled -0.18% (95% CI, -0.29 to -0.06; P<0.001) difference in HbA1c. With the random-effects model, the interventions resulted in 2.60 mg/dl lower fasting glucose with borderline significance (95% CI: -5.27 to 0.08, P=0.06). In the multivariate meta-regression model, the interventions with high percent of female participants and that used the intervention directly delivered to individuals, rather the environment changes, were associated with more effective interventions. Workplace dietary interventions can improve HbA1c. The effects were larger for the interventions with greater number of female participants and with individual-level interventions.
Stevens, Forrest R.; Gaughan, Andrea E.; Linard, Catherine; Tatem, Andrew J.
2015-01-01
High resolution, contemporary data on human population distributions are vital for measuring impacts of population growth, monitoring human-environment interactions and for planning and policy development. Many methods are used to disaggregate census data and predict population densities for finer scale, gridded population data sets. We present a new semi-automated dasymetric modeling approach that incorporates detailed census and ancillary data in a flexible, “Random Forest” estimation technique. We outline the combination of widely available, remotely-sensed and geospatial data that contribute to the modeled dasymetric weights and then use the Random Forest model to generate a gridded prediction of population density at ~100 m spatial resolution. This prediction layer is then used as the weighting surface to perform dasymetric redistribution of the census counts at a country level. As a case study we compare the new algorithm and its products for three countries (Vietnam, Cambodia, and Kenya) with other common gridded population data production methodologies. We discuss the advantages of the new method and increases over the accuracy and flexibility of those previous approaches. Finally, we outline how this algorithm will be extended to provide freely-available gridded population data sets for Africa, Asia and Latin America. PMID:25689585
A tale of two "forests": random forest machine learning AIDS tropical forest carbon mapping.
Mascaro, Joseph; Asner, Gregory P; Knapp, David E; Kennedy-Bowdoin, Ty; Martin, Roberta E; Anderson, Christopher; Higgins, Mark; Chadwick, K Dana
2014-01-01
Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, both with and without spatial contextual modeling by including--in the latter case--x, and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (i.e., called "out-of-bag"), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best preforming run of Random Forest, and explained 59% of LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha(-1) when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation.
A Tale of Two “Forests”: Random Forest Machine Learning Aids Tropical Forest Carbon Mapping
Mascaro, Joseph; Asner, Gregory P.; Knapp, David E.; Kennedy-Bowdoin, Ty; Martin, Roberta E.; Anderson, Christopher; Higgins, Mark; Chadwick, K. Dana
2014-01-01
Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, both with and without spatial contextual modeling by including—in the latter case—x, and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (i.e., called “out-of-bag”), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best preforming run of Random Forest, and explained 59% of LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha−1 when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation. PMID:24489686
Motifs in triadic random graphs based on Steiner triple systems
NASA Astrophysics Data System (ADS)
Winkler, Marco; Reichardt, Jörg
2013-08-01
Conventionally, pairwise relationships between nodes are considered to be the fundamental building blocks of complex networks. However, over the last decade, the overabundance of certain subnetwork patterns, i.e., the so-called motifs, has attracted much attention. It has been hypothesized that these motifs, instead of links, serve as the building blocks of network structures. Although the relation between a network's topology and the general properties of the system, such as its function, its robustness against perturbations, or its efficiency in spreading information, is the central theme of network science, there is still a lack of sound generative models needed for testing the functional role of subgraph motifs. Our work aims to overcome this limitation. We employ the framework of exponential random graph models (ERGMs) to define models based on triadic substructures. The fact that only a small portion of triads can actually be set independently poses a challenge for the formulation of such models. To overcome this obstacle, we use Steiner triple systems (STSs). These are partitions of sets of nodes into pair-disjoint triads, which thus can be specified independently. Combining the concepts of ERGMs and STSs, we suggest generative models capable of generating ensembles of networks with nontrivial triadic Z-score profiles. Further, we discover inevitable correlations between the abundance of triad patterns, which occur solely for statistical reasons and need to be taken into account when discussing the functional implications of motif statistics. Moreover, we calculate the degree distributions of our triadic random graphs analytically.
Implications of crater distributions on Venus
NASA Technical Reports Server (NTRS)
Kaula, W. M.
1993-01-01
The horizontal locations of craters on Venus are consistent with randomness. However, (1) randomness does not make crater counts useless for age indications; (2) consistency does not imply necessity or optimality; and (3) horizontal location is not the only reference frame against which to test models. Re (1), the apparent smallness of resurfacing areas means that a region on the order of one percent of the planet with a typical number of craters, 5-15, will have a range of feature ages of several 100 My. Re (2), models of resurfacing somewhat similar to Earth's can be found that are also consistent and more optimal than random: i.e., resurfacing occurring in clusters, that arise and die away in lime intervals on the order of 50 My. These agree with the observation that there are more areas of high crater density, and fewer of moderate density, than optimal for random. Re (3), 799 crater elevations were tested; there are more at low elevations and fewer at high elevations than optimal for random: i.e., 54.6 percent below the median. Only one of 40 random sets of 799 was as extreme.
Elephant random walks and their connection to Pólya-type urns
NASA Astrophysics Data System (ADS)
Baur, Erich; Bertoin, Jean
2016-11-01
In this paper, we explain the connection between the elephant random walk (ERW) and an urn model à la Pólya and derive functional limit theorems for the former. The ERW model was introduced in [Phys. Rev. E 70, 045101 (2004), 10.1103/PhysRevE.70.045101] to study memory effects in a highly non-Markovian setting. More specifically, the ERW is a one-dimensional discrete-time random walk with a complete memory of its past. The influence of the memory is measured in terms of a memory parameter p between zero and one. In the past years, a considerable effort has been undertaken to understand the large-scale behavior of the ERW, depending on the choice of p . Here, we use known results on urns to explicitly solve the ERW in all memory regimes. The method works as well for ERWs in higher dimensions and is widely applicable to related models.
Network Dynamics of Innovation Processes.
Iacopini, Iacopo; Milojević, Staša; Latora, Vito
2018-01-26
We introduce a model for the emergence of innovations, in which cognitive processes are described as random walks on the network of links among ideas or concepts, and an innovation corresponds to the first visit of a node. The transition matrix of the random walk depends on the network weights, while in turn the weight of an edge is reinforced by the passage of a walker. The presence of the network naturally accounts for the mechanism of the "adjacent possible," and the model reproduces both the rate at which novelties emerge and the correlations among them observed empirically. We show this by using synthetic networks and by studying real data sets on the growth of knowledge in different scientific disciplines. Edge-reinforced random walks on complex topologies offer a new modeling framework for the dynamics of correlated novelties and are another example of coevolution of processes and networks.
Network Dynamics of Innovation Processes
NASA Astrophysics Data System (ADS)
Iacopini, Iacopo; Milojević, Staša; Latora, Vito
2018-01-01
We introduce a model for the emergence of innovations, in which cognitive processes are described as random walks on the network of links among ideas or concepts, and an innovation corresponds to the first visit of a node. The transition matrix of the random walk depends on the network weights, while in turn the weight of an edge is reinforced by the passage of a walker. The presence of the network naturally accounts for the mechanism of the "adjacent possible," and the model reproduces both the rate at which novelties emerge and the correlations among them observed empirically. We show this by using synthetic networks and by studying real data sets on the growth of knowledge in different scientific disciplines. Edge-reinforced random walks on complex topologies offer a new modeling framework for the dynamics of correlated novelties and are another example of coevolution of processes and networks.
Approximate scaling properties of RNA free energy landscapes
NASA Technical Reports Server (NTRS)
Baskaran, S.; Stadler, P. F.; Schuster, P.
1996-01-01
RNA free energy landscapes are analysed by means of "time-series" that are obtained from random walks restricted to excursion sets. The power spectra, the scaling of the jump size distribution, and the scaling of the curve length measured with different yard stick lengths are used to describe the structure of these "time series". Although they are stationary by construction, we find that their local behavior is consistent with both AR(1) and self-affine processes. Random walks confined to excursion sets (i.e., with the restriction that the fitness value exceeds a certain threshold at each step) exhibit essentially the same statistics as free random walks. We find that an AR(1) time series is in general approximately self-affine on timescales up to approximately the correlation length. We present an empirical relation between the correlation parameter rho of the AR(1) model and the exponents characterizing self-affinity.
Provably secure Rabin-p cryptosystem in hybrid setting
NASA Astrophysics Data System (ADS)
Asbullah, Muhammad Asyraf; Ariffin, Muhammad Rezal Kamel
2016-06-01
In this work, we design an efficient and provably secure hybrid cryptosystem depicted by a combination of the Rabin-p cryptosystem with an appropriate symmetric encryption scheme. We set up a hybrid structure which is proven secure in the sense of indistinguishable against the chosen-ciphertext attack. We presume that the integer factorization problem is hard and the hash function that modeled as a random function.
Bohmanova, J; Miglior, F; Jamrozik, J; Misztal, I; Sullivan, P G
2008-09-01
A random regression model with both random and fixed regressions fitted by Legendre polynomials of order 4 was compared with 3 alternative models fitting linear splines with 4, 5, or 6 knots. The effects common for all models were a herd-test-date effect, fixed regressions on days in milk (DIM) nested within region-age-season of calving class, and random regressions for additive genetic and permanent environmental effects. Data were test-day milk, fat and protein yields, and SCS recorded from 5 to 365 DIM during the first 3 lactations of Canadian Holstein cows. A random sample of 50 herds consisting of 96,756 test-day records was generated to estimate variance components within a Bayesian framework via Gibbs sampling. Two sets of genetic evaluations were subsequently carried out to investigate performance of the 4 models. Models were compared by graphical inspection of variance functions, goodness of fit, error of prediction of breeding values, and stability of estimated breeding values. Models with splines gave lower estimates of variances at extremes of lactations than the model with Legendre polynomials. Differences among models in goodness of fit measured by percentages of squared bias, correlations between predicted and observed records, and residual variances were small. The deviance information criterion favored the spline model with 6 knots. Smaller error of prediction and higher stability of estimated breeding values were achieved by using spline models with 5 and 6 knots compared with the model with Legendre polynomials. In general, the spline model with 6 knots had the best overall performance based upon the considered model comparison criteria.
Using Audit Information to Adjust Parameter Estimates for Data Errors in Clinical Trials
Shepherd, Bryan E.; Shaw, Pamela A.; Dodd, Lori E.
2013-01-01
Background Audits are often performed to assess the quality of clinical trial data, but beyond detecting fraud or sloppiness, the audit data is generally ignored. In earlier work using data from a non-randomized study, Shepherd and Yu (2011) developed statistical methods to incorporate audit results into study estimates, and demonstrated that audit data could be used to eliminate bias. Purpose In this manuscript we examine the usefulness of audit-based error-correction methods in clinical trial settings where a continuous outcome is of primary interest. Methods We demonstrate the bias of multiple linear regression estimates in general settings with an outcome that may have errors and a set of covariates for which some may have errors and others, including treatment assignment, are recorded correctly for all subjects. We study this bias under different assumptions including independence between treatment assignment, covariates, and data errors (conceivable in a double-blinded randomized trial) and independence between treatment assignment and covariates but not data errors (possible in an unblinded randomized trial). We review moment-based estimators to incorporate the audit data and propose new multiple imputation estimators. The performance of estimators is studied in simulations. Results When treatment is randomized and unrelated to data errors, estimates of the treatment effect using the original error-prone data (i.e., ignoring the audit results) are unbiased. In this setting, both moment and multiple imputation estimators incorporating audit data are more variable than standard analyses using the original data. In contrast, in settings where treatment is randomized but correlated with data errors and in settings where treatment is not randomized, standard treatment effect estimates will be biased. And in all settings, parameter estimates for the original, error-prone covariates will be biased. Treatment and covariate effect estimates can be corrected by incorporating audit data using either the multiple imputation or moment-based approaches. Bias, precision, and coverage of confidence intervals improve as the audit size increases. Limitations The extent of bias and the performance of methods depend on the extent and nature of the error as well as the size of the audit. This work only considers methods for the linear model. Settings much different than those considered here need further study. Conclusions In randomized trials with continuous outcomes and treatment assignment independent of data errors, standard analyses of treatment effects will be unbiased and are recommended. However, if treatment assignment is correlated with data errors or other covariates, naive analyses may be biased. In these settings, and when covariate effects are of interest, approaches for incorporating audit results should be considered. PMID:22848072
Analysis of overdispersed count data by mixtures of Poisson variables and Poisson processes.
Hougaard, P; Lee, M L; Whitmore, G A
1997-12-01
Count data often show overdispersion compared to the Poisson distribution. Overdispersion is typically modeled by a random effect for the mean, based on the gamma distribution, leading to the negative binomial distribution for the count. This paper considers a larger family of mixture distributions, including the inverse Gaussian mixture distribution. It is demonstrated that it gives a significantly better fit for a data set on the frequency of epileptic seizures. The same approach can be used to generate counting processes from Poisson processes, where the rate or the time is random. A random rate corresponds to variation between patients, whereas a random time corresponds to variation within patients.
Melfsen, Andreas; Hartung, Eberhard; Haeussermann, Angelika
2013-02-01
The robustness of in-line raw milk analysis with near-infrared spectroscopy (NIRS) was tested with respect to the prediction of the raw milk contents fat, protein and lactose. Near-infrared (NIR) spectra of raw milk (n = 3119) were acquired on three different farms during the milking process of 354 milkings over a period of six months. Calibration models were calculated for: a random data set of each farm (fully random internal calibration); first two thirds of the visits per farm (internal calibration); whole datasets of two of the three farms (external calibration), and combinations of external and internal datasets. Validation was done either on the remaining data set per farm (internal validation) or on data of the remaining farms (external validation). Excellent calibration results were obtained when fully randomised internal calibration sets were used for milk analysis. In this case, RPD values of around ten, five and three for the prediction of fat, protein and lactose content, respectively, were achieved. Farm internal calibrations achieved much poorer prediction results especially for the prediction of protein and lactose with RPD values of around two and one respectively. The prediction accuracy improved when validation was done on spectra of an external farm, mainly due to the higher sample variation in external calibration sets in terms of feeding diets and individual cow effects. The results showed that further improvements were achieved when additional farm information was added to the calibration set. One of the main requirements towards a robust calibration model is the ability to predict milk constituents in unknown future milk samples. The robustness and quality of prediction increases with increasing variation of, e.g., feeding and cow individual milk composition in the calibration model.
Ranked set sampling: cost and optimal set size.
Nahhas, Ramzi W; Wolfe, Douglas A; Chen, Haiying
2002-12-01
McIntyre (1952, Australian Journal of Agricultural Research 3, 385-390) introduced ranked set sampling (RSS) as a method for improving estimation of a population mean in settings where sampling and ranking of units from the population are inexpensive when compared with actual measurement of the units. Two of the major factors in the usefulness of RSS are the set size and the relative costs of the various operations of sampling, ranking, and measurement. In this article, we consider ranking error models and cost models that enable us to assess the effect of different cost structures on the optimal set size for RSS. For reasonable cost structures, we find that the optimal RSS set sizes are generally larger than had been anticipated previously. These results will provide a useful tool for determining whether RSS is likely to lead to an improvement over simple random sampling in a given setting and, if so, what RSS set size is best to use in this case.
Yap, Melvin J; Balota, David A; Cortese, Michael J; Watson, Jason M
2006-12-01
This article evaluates 2 competing models that address the decision-making processes mediating word recognition and lexical decision performance: a hybrid 2-stage model of lexical decision performance and a random-walk model. In 2 experiments, nonword type and word frequency were manipulated across 2 contrasts (pseudohomophone-legal nonword and legal-illegal nonword). When nonwords became more wordlike (i.e., BRNTA vs. BRANT vs. BRANE), response latencies to nonwords were slowed and the word frequency effect increased. More important, distributional analyses revealed that the Nonword Type = Word Frequency interaction was modulated by different components of the response time distribution, depending on the specific nonword contrast. A single-process random-walk model was able to account for this particular set of findings more successfully than the hybrid 2-stage model. (c) 2006 APA, all rights reserved.
Yu, Wenxi; Liu, Yang; Ma, Zongwei; Bi, Jun
2017-08-01
Using satellite-based aerosol optical depth (AOD) measurements and statistical models to estimate ground-level PM 2.5 is a promising way to fill the areas that are not covered by ground PM 2.5 monitors. The statistical models used in previous studies are primarily Linear Mixed Effects (LME) and Geographically Weighted Regression (GWR) models. In this study, we developed a new regression model between PM 2.5 and AOD using Gaussian processes in a Bayesian hierarchical setting. Gaussian processes model the stochastic nature of the spatial random effects, where the mean surface and the covariance function is specified. The spatial stochastic process is incorporated under the Bayesian hierarchical framework to explain the variation of PM 2.5 concentrations together with other factors, such as AOD, spatial and non-spatial random effects. We evaluate the results of our model and compare them with those of other, conventional statistical models (GWR and LME) by within-sample model fitting and out-of-sample validation (cross validation, CV). The results show that our model possesses a CV result (R 2 = 0.81) that reflects higher accuracy than that of GWR and LME (0.74 and 0.48, respectively). Our results indicate that Gaussian process models have the potential to improve the accuracy of satellite-based PM 2.5 estimates.
Modeling groundwater nitrate concentrations in private wells in Iowa
Wheeler, David C.; Nolan, Bernard T.; Flory, Abigail R.; DellaValle, Curt T.; Ward, Mary H.
2015-01-01
Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r-square = 0.77) and was acceptable in the testing set (r-square = 0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort.
Wang, Ming; Long, Qi
2016-09-01
Prediction models for disease risk and prognosis play an important role in biomedical research, and evaluating their predictive accuracy in the presence of censored data is of substantial interest. The standard concordance (c) statistic has been extended to provide a summary measure of predictive accuracy for survival models. Motivated by a prostate cancer study, we address several issues associated with evaluating survival prediction models based on c-statistic with a focus on estimators using the technique of inverse probability of censoring weighting (IPCW). Compared to the existing work, we provide complete results on the asymptotic properties of the IPCW estimators under the assumption of coarsening at random (CAR), and propose a sensitivity analysis under the mechanism of noncoarsening at random (NCAR). In addition, we extend the IPCW approach as well as the sensitivity analysis to high-dimensional settings. The predictive accuracy of prediction models for cancer recurrence after prostatectomy is assessed by applying the proposed approaches. We find that the estimated predictive accuracy for the models in consideration is sensitive to NCAR assumption, and thus identify the best predictive model. Finally, we further evaluate the performance of the proposed methods in both settings of low-dimensional and high-dimensional data under CAR and NCAR through simulations. © 2016, The International Biometric Society.
Modeling groundwater nitrate concentrations in private wells in Iowa.
Wheeler, David C; Nolan, Bernard T; Flory, Abigail R; DellaValle, Curt T; Ward, Mary H
2015-12-01
Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r-square=0.77) and was acceptable in the testing set (r-square=0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort. Copyright © 2015 Elsevier B.V. All rights reserved.
Biehler, J; Wall, W A
2018-02-01
If computational models are ever to be used in high-stakes decision making in clinical practice, the use of personalized models and predictive simulation techniques is a must. This entails rigorous quantification of uncertainties as well as harnessing available patient-specific data to the greatest extent possible. Although researchers are beginning to realize that taking uncertainty in model input parameters into account is a necessity, the predominantly used probabilistic description for these uncertain parameters is based on elementary random variable models. In this work, we set out for a comparison of different probabilistic models for uncertain input parameters using the example of an uncertain wall thickness in finite element models of abdominal aortic aneurysms. We provide the first comparison between a random variable and a random field model for the aortic wall and investigate the impact on the probability distribution of the computed peak wall stress. Moreover, we show that the uncertainty about the prevailing peak wall stress can be reduced if noninvasively available, patient-specific data are harnessed for the construction of the probabilistic wall thickness model. Copyright © 2017 John Wiley & Sons, Ltd.
George, Steven Z; Teyhen, Deydre S; Wu, Samuel S; Wright, Alison C; Dugan, Jessica L; Yang, Guijun; Robinson, Michael E; Childs, John D
2009-07-01
The general population has a pessimistic view of low back pain (LBP), and evidence-based information has been used to positively influence LBP beliefs in previously reported mass media studies. However, there is a lack of randomized trials investigating whether LBP beliefs can be modified in primary prevention settings. This cluster randomized clinical trial investigated the effect of an evidence-based psychosocial educational program (PSEP) on LBP beliefs for soldiers completing military training. A military setting was selected for this clinical trial, because LBP is a common cause of soldier disability. Companies of soldiers (n = 3,792) were recruited, and cluster randomized to receive a PSEP or no education (control group, CG). The PSEP consisted of an interactive seminar, and soldiers were issued the Back Book for reference material. The primary outcome measure was the back beliefs questionnaire (BBQ), which assesses inevitable consequences of and ability to cope with LBP. The BBQ was administered before randomization and 12 weeks later. A linear mixed model was fitted for the BBQ at the 12-week follow-up, and a generalized linear mixed model was fitted for the dichotomous outcomes on BBQ change of greater than two points. Sensitivity analyses were performed to account for drop out. BBQ scores (potential range: 9-45) improved significantly from baseline of 25.6 +/- 5.7 (mean +/- SD) to 26.9 +/- 6.2 for those receiving the PSEP, while there was a significant decline from 26.1 +/- 5.7 to 25.6 +/- 6.0 for those in the CG. The adjusted mean BBQ score at follow-up for those receiving the PSEP was 1.49 points higher than those in the CG (P < 0.0001). The adjusted odds ratio of BBQ improvement of greater than two points for those receiving the PSEP was 1.51 (95% CI = 1.22-1.86) times that of those in the CG. BBQ improvement was also mildly associated with race and college education. Sensitivity analyses suggested minimal influence of drop out. In conclusion, soldiers that received the PSEP had an improvement in their beliefs related to the inevitable consequences of and ability to cope with LBP. This is the first randomized trial to show positive influence on LBP beliefs in a primary prevention setting, and these findings have potentially important public health implications for prevention of LBP.
The use of propensity scores to assess the generalizability of results from randomized trials
Stuart, Elizabeth A.; Cole, Stephen R.; Bradshaw, Catherine P.; Leaf, Philip J.
2014-01-01
Randomized trials remain the most accepted design for estimating the effects of interventions, but they do not necessarily answer a question of primary interest: Will the program be effective in a target population in which it may be implemented? In other words, are the results generalizable? There has been very little statistical research on how to assess the generalizability, or “external validity,” of randomized trials. We propose the use of propensity-score-based metrics to quantify the similarity of the participants in a randomized trial and a target population. In this setting the propensity score model predicts participation in the randomized trial, given a set of covariates. The resulting propensity scores are used first to quantify the difference between the trial participants and the target population, and then to match, subclassify, or weight the control group outcomes to the population, assessing how well the propensity score-adjusted outcomes track the outcomes actually observed in the population. These metrics can serve as a first step in assessing the generalizability of results from randomized trials to target populations. This paper lays out these ideas, discusses the assumptions underlying the approach, and illustrates the metrics using data on the evaluation of a schoolwide prevention program called Positive Behavioral Interventions and Supports. PMID:24926156
Hengl, Tomislav; Heuvelink, Gerard B. M.; Kempen, Bas; Leenaars, Johan G. B.; Walsh, Markus G.; Shepherd, Keith D.; Sila, Andrew; MacMillan, Robert A.; Mendes de Jesus, Jorge; Tamene, Lulseged; Tondoh, Jérôme E.
2015-01-01
80% of arable land in Africa has low soil fertility and suffers from physical soil problems. Additionally, significant amounts of nutrients are lost every year due to unsustainable soil management practices. This is partially the result of insufficient use of soil management knowledge. To help bridge the soil information gap in Africa, the Africa Soil Information Service (AfSIS) project was established in 2008. Over the period 2008–2014, the AfSIS project compiled two point data sets: the Africa Soil Profiles (legacy) database and the AfSIS Sentinel Site database. These data sets contain over 28 thousand sampling locations and represent the most comprehensive soil sample data sets of the African continent to date. Utilizing these point data sets in combination with a large number of covariates, we have generated a series of spatial predictions of soil properties relevant to the agricultural management—organic carbon, pH, sand, silt and clay fractions, bulk density, cation-exchange capacity, total nitrogen, exchangeable acidity, Al content and exchangeable bases (Ca, K, Mg, Na). We specifically investigate differences between two predictive approaches: random forests and linear regression. Results of 5-fold cross-validation demonstrate that the random forests algorithm consistently outperforms the linear regression algorithm, with average decreases of 15–75% in Root Mean Squared Error (RMSE) across soil properties and depths. Fitting and running random forests models takes an order of magnitude more time and the modelling success is sensitive to artifacts in the input data, but as long as quality-controlled point data are provided, an increase in soil mapping accuracy can be expected. Results also indicate that globally predicted soil classes (USDA Soil Taxonomy, especially Alfisols and Mollisols) help improve continental scale soil property mapping, and are among the most important predictors. This indicates a promising potential for transferring pedological knowledge from data rich countries to countries with limited soil data. PMID:26110833
Hengl, Tomislav; Heuvelink, Gerard B M; Kempen, Bas; Leenaars, Johan G B; Walsh, Markus G; Shepherd, Keith D; Sila, Andrew; MacMillan, Robert A; Mendes de Jesus, Jorge; Tamene, Lulseged; Tondoh, Jérôme E
2015-01-01
80% of arable land in Africa has low soil fertility and suffers from physical soil problems. Additionally, significant amounts of nutrients are lost every year due to unsustainable soil management practices. This is partially the result of insufficient use of soil management knowledge. To help bridge the soil information gap in Africa, the Africa Soil Information Service (AfSIS) project was established in 2008. Over the period 2008-2014, the AfSIS project compiled two point data sets: the Africa Soil Profiles (legacy) database and the AfSIS Sentinel Site database. These data sets contain over 28 thousand sampling locations and represent the most comprehensive soil sample data sets of the African continent to date. Utilizing these point data sets in combination with a large number of covariates, we have generated a series of spatial predictions of soil properties relevant to the agricultural management--organic carbon, pH, sand, silt and clay fractions, bulk density, cation-exchange capacity, total nitrogen, exchangeable acidity, Al content and exchangeable bases (Ca, K, Mg, Na). We specifically investigate differences between two predictive approaches: random forests and linear regression. Results of 5-fold cross-validation demonstrate that the random forests algorithm consistently outperforms the linear regression algorithm, with average decreases of 15-75% in Root Mean Squared Error (RMSE) across soil properties and depths. Fitting and running random forests models takes an order of magnitude more time and the modelling success is sensitive to artifacts in the input data, but as long as quality-controlled point data are provided, an increase in soil mapping accuracy can be expected. Results also indicate that globally predicted soil classes (USDA Soil Taxonomy, especially Alfisols and Mollisols) help improve continental scale soil property mapping, and are among the most important predictors. This indicates a promising potential for transferring pedological knowledge from data rich countries to countries with limited soil data.
Artificial neural networks modelling the prednisolone nanoprecipitation in microfluidic reactors.
Ali, Hany S M; Blagden, Nicholas; York, Peter; Amani, Amir; Brook, Toni
2009-06-28
This study employs artificial neural networks (ANNs) to create a model to identify relationships between variables affecting drug nanoprecipitation using microfluidic reactors. The input variables examined were saturation levels of prednisolone, solvent and antisolvent flow rates, microreactor inlet angles and internal diameters, while particle size was the single output. ANNs software was used to analyse a set of data obtained by random selection of the variables. The developed model was then assessed using a separate set of validation data and provided good agreement with the observed results. The antisolvent flow rate was found to have the dominant role on determining final particle size.
Determining Scale-dependent Patterns in Spatial and Temporal Datasets
NASA Astrophysics Data System (ADS)
Roy, A.; Perfect, E.; Mukerji, T.; Sylvester, L.
2016-12-01
Spatial and temporal datasets of interest to Earth scientists often contain plots of one variable against another, e.g., rainfall magnitude vs. time or fracture aperture vs. spacing. Such data, comprised of distributions of events along a transect / timeline along with their magnitudes, can display persistent or antipersistent trends, as well as random behavior, that may contain signatures of underlying physical processes. Lacunarity is a technique that was originally developed for multiscale analysis of data. In a recent study we showed that lacunarity can be used for revealing changes in scale-dependent patterns in fracture spacing data. Here we present a further improvement in our technique, with lacunarity applied to various non-binary datasets comprised of event spacings and magnitudes. We test our technique on a set of four synthetic datasets, three of which are based on an autoregressive model and have magnitudes at every point along the "timeline" thus representing antipersistent, persistent, and random trends. The fourth dataset is made up of five clusters of events, each containing a set of random magnitudes. The concept of lacunarity ratio, LR, is introduced; this is the lacunarity of a given dataset normalized to the lacunarity of its random counterpart. It is demonstrated that LR can successfully delineate scale-dependent changes in terms of antipersistence and persistence in the synthetic datasets. This technique is then applied to three different types of data: a hundred-year rainfall record from Knoxville, TN, USA, a set of varved sediments from Marca Shale, and a set of fracture aperture and spacing data from NE Mexico. While the rainfall data and varved sediments both appear to be persistent at small scales, at larger scales they both become random. On the other hand, the fracture data shows antipersistence at small scale (within cluster) and random behavior at large scales. Such differences in behavior with respect to scale-dependent changes in antipersistence to random, persistence to random, or otherwise, maybe be related to differences in the physicochemical properties and processes contributing to multiscale datasets.
NASA Astrophysics Data System (ADS)
Ma, Ligang; Ma, Fenglan; Li, Jiadan; Gu, Qing; Yang, Shengtian; Ding, Jianli
2017-04-01
Land degradation, specifically soil salinization has rendered large areas of China west sterile and unproductive while diminishing the productivity of adjacent lands and other areas where salting is less severe. Up to now despite decades of research in soil mapping, few accurate and up-to-date information on the spatial extent and variability of soil salinity are available for large geographic regions. This study explores the po-tentials of assessing soil salinity via linear and random forest modeling of remote sensing based environmental factors and indirect indicators. A case study is presented for the arid oases of Tarim and Junggar Basin, Xinjiang, China using time series land surface temperature (LST), evapotranspiration (ET), TRMM precipitation (TRM), DEM product and vegetation indexes as well as their second order products. In par-ticular, the location of the oasis, the best feature sets, different salinity degrees and modeling approaches were fully examined. All constructed models were evaluated for their fit to the whole data set and their performance in a leave-one-field-out spatial cross-validation. In addition, the Kruskal-Wallis rank test was adopted for the statis-tical comparison of different models. Overall, the random forest model outperformed the linear model for the two basins, all salinity degrees and datasets. As for feature set, LST and ET were consistently identified to be the most important factors for two ba-sins while the contribution of vegetation indexes vary with location. What's more, models performances are promising for the salinity ranges that are most relevant to agricultural productivity.
Votano, Joseph R; Parham, Marc; Hall, L Mark; Hall, Lowell H; Kier, Lemont B; Oloff, Scott; Tropsha, Alexander
2006-11-30
Four modeling techniques, using topological descriptors to represent molecular structure, were employed to produce models of human serum protein binding (% bound) on a data set of 1008 experimental values, carefully screened from publicly available sources. To our knowledge, this data is the largest set on human serum protein binding reported for QSAR modeling. The data was partitioned into a training set of 808 compounds and an external validation test set of 200 compounds. Partitioning was accomplished by clustering the compounds in a structure descriptor space so that random sampling of 20% of the whole data set produced an external test set that is a good representative of the training set with respect to both structure and protein binding values. The four modeling techniques include multiple linear regression (MLR), artificial neural networks (ANN), k-nearest neighbors (kNN), and support vector machines (SVM). With the exception of the MLR model, the ANN, kNN, and SVM QSARs were ensemble models. Training set correlation coefficients and mean absolute error ranged from r2=0.90 and MAE=7.6 for ANN to r2=0.61 and MAE=16.2 for MLR. Prediction results from the validation set yielded correlation coefficients and mean absolute errors which ranged from r2=0.70 and MAE=14.1 for ANN to a low of r2=0.59 and MAE=18.3 for the SVM model. Structure descriptors that contribute significantly to the models are discussed and compared with those found in other published models. For the ANN model, structure descriptor trends with respect to their affects on predicted protein binding can assist the chemist in structure modification during the drug design process.
2012-03-01
with each SVM discriminating between a pair of the N total speakers in the data set. The (( + 1))/2 classifiers then vote on the final...classification of a test sample. The Random Forest classifier is an ensemble classifier that votes amongst decision trees generated with each node using...Forest vote , and the effects of overtraining will be mitigated by the fact that each decision tree is overtrained differently (due to the random
Learning accurate and interpretable models based on regularized random forests regression
2014-01-01
Background Many biology related research works combine data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. Thus it will be beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation while preserving the prediction performance. Methods In this study, we focus on regression problems for biological data where target outcomes are continuous. In general, models constructed from linear regression approaches are relatively easy to interpret. However, many practical biological applications are nonlinear in essence where we can hardly find a direct linear relationship between input and output. Nonlinear regression techniques can reveal nonlinear relationship of data, but are generally hard for human to interpret. We propose a rule based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features. Results We tested the approach on some biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forests regression. Conclusion It demonstrates high potential in aiding prediction and interpretation of nonlinear relationships of the subject being studied. PMID:25350120
Chan, C H; Chan, E Y; Ng, D K; Chow, P Y; Kwok, K L
2006-11-01
Paediatric risk of mortality and paediatric index of mortality (PIM) are the commonly-used mortality prediction models (MPM) in children admitted to paediatric intensive care unit (PICU). The current study was undertaken to develop a better MPM using artificial neural network, a domain of artificial intelligence. The purpose of this retrospective case series was to compare an artificial neural network (ANN) model and PIM with the observed mortality in a cohort of patients admitted to a five-bed PICU in a Hong Kong non-teaching general hospital. The patients were under the age of 17 years and admitted to our PICU from April 2001 to December 2004. Data were collected from each patient admitted to our PICU. All data were randomly allocated to either the training or validation set. The data from the training set were used to construct a series of ANN models. The data from the validation set were used to validate the ANN and PIM models. The accuracy of ANN models and PIM was assessed by area under the receiver operator characteristics (ROC) curve and calibration. All data were randomly allocated to either the training (n=274) or validation set (n=273). Three ANN models were developed using the data from the training set, namely ANN8 (trained with variables required for PIM), ANN9 (trained with variables required for PIM and pre-ICU intubation) and ANN23 (trained with variables required for ANN9 and 14 principal ICU diagnoses). Three ANN models and PIM were used to predict mortality in the validation set. We found that PIM and ANN9 had a high ROC curve (PIM: 0.808, 95 percent confidence interval 0.552 to 1.000, ANN9: 0.957, 95 percent confidence interval 0.915 to 1.000), whereas ANN8 and ANN23 gave a suboptimal area under the ROC curve. ANN8 required only five variables for the calculation of risk, compared with eight for PIM. The current study demonstrated the process of predictive mortality risk model development using ANN. Further multicentre studies are required to produce a representative ANN-based mortality prediction model for use in different PICUs.
Toward a Principled Sampling Theory for Quasi-Orders
Ünlü, Ali; Schrepp, Martin
2016-01-01
Quasi-orders, that is, reflexive and transitive binary relations, have numerous applications. In educational theories, the dependencies of mastery among the problems of a test can be modeled by quasi-orders. Methods such as item tree or Boolean analysis that mine for quasi-orders in empirical data are sensitive to the underlying quasi-order structure. These data mining techniques have to be compared based on extensive simulation studies, with unbiased samples of randomly generated quasi-orders at their basis. In this paper, we develop techniques that can provide the required quasi-order samples. We introduce a discrete doubly inductive procedure for incrementally constructing the set of all quasi-orders on a finite item set. A randomization of this deterministic procedure allows us to generate representative samples of random quasi-orders. With an outer level inductive algorithm, we consider the uniform random extensions of the trace quasi-orders to higher dimension. This is combined with an inner level inductive algorithm to correct the extensions that violate the transitivity property. The inner level correction step entails sampling biases. We propose three algorithms for bias correction and investigate them in simulation. It is evident that, on even up to 50 items, the new algorithms create close to representative quasi-order samples within acceptable computing time. Hence, the principled approach is a significant improvement to existing methods that are used to draw quasi-orders uniformly at random but cannot cope with reasonably large item sets. PMID:27965601
Toward a Principled Sampling Theory for Quasi-Orders.
Ünlü, Ali; Schrepp, Martin
2016-01-01
Quasi-orders, that is, reflexive and transitive binary relations, have numerous applications. In educational theories, the dependencies of mastery among the problems of a test can be modeled by quasi-orders. Methods such as item tree or Boolean analysis that mine for quasi-orders in empirical data are sensitive to the underlying quasi-order structure. These data mining techniques have to be compared based on extensive simulation studies, with unbiased samples of randomly generated quasi-orders at their basis. In this paper, we develop techniques that can provide the required quasi-order samples. We introduce a discrete doubly inductive procedure for incrementally constructing the set of all quasi-orders on a finite item set. A randomization of this deterministic procedure allows us to generate representative samples of random quasi-orders. With an outer level inductive algorithm, we consider the uniform random extensions of the trace quasi-orders to higher dimension. This is combined with an inner level inductive algorithm to correct the extensions that violate the transitivity property. The inner level correction step entails sampling biases. We propose three algorithms for bias correction and investigate them in simulation. It is evident that, on even up to 50 items, the new algorithms create close to representative quasi-order samples within acceptable computing time. Hence, the principled approach is a significant improvement to existing methods that are used to draw quasi-orders uniformly at random but cannot cope with reasonably large item sets.
A Random Finite Set Approach to Space Junk Tracking and Identification
2014-09-03
Final 3. DATES COVERED (From - To) 31 Jan 13 – 29 Apr 14 4. TITLE AND SUBTITLE A Random Finite Set Approach to Space Junk Tracking and...01-2013 to 29-04-2014 4. TITLE AND SUBTITLE A Random Finite Set Approach to Space Junk Tracking and Identification 5a. CONTRACT NUMBER FA2386-13...Prescribed by ANSI Std Z39-18 A Random Finite Set Approach to Space Junk Tracking and Indentification Ba-Ngu Vo1, Ba-Tuong Vo1, 1Department of
Extending existing structural identifiability analysis methods to mixed-effects models.
Janzén, David L I; Jirstrand, Mats; Chappell, Michael J; Evans, Neil D
2018-01-01
The concept of structural identifiability for state-space models is expanded to cover mixed-effects state-space models. Two methods applicable for the analytical study of the structural identifiability of mixed-effects models are presented. The two methods are based on previously established techniques for non-mixed-effects models; namely the Taylor series expansion and the input-output form approach. By generating an exhaustive summary, and by assuming an infinite number of subjects, functions of random variables can be derived which in turn determine the distribution of the system's observation function(s). By considering the uniqueness of the analytical statistical moments of the derived functions of the random variables, the structural identifiability of the corresponding mixed-effects model can be determined. The two methods are applied to a set of examples of mixed-effects models to illustrate how they work in practice. Copyright © 2017 Elsevier Inc. All rights reserved.
Shirazi, M; Zeinaloo, A A; Parikh, S V; Sadeghi, M; Taghva, A; Arbabi, M; Kashani, A Sabouri; Alaeddini, F; Lonka, K; Wahlström, R
2008-04-01
The Prochaska model of readiness to change has been proposed to be used in educational interventions to improve medical care. To evaluate the impact on readiness to change of an educational intervention on management of depressive disorders based on a modified version of the Prochaska model in comparison with a standard programme of continuing medical education (CME). This is a randomized controlled trial within primary care practices in southern Tehran, Iran. The participants included 192 general physicians working in primary care (GPs) were recruited after random selection and randomized to intervention (96) and control (96). Intervention consisted of interactive, learner-centred educational methods in large and small group settings depending on the GPs' stages of readiness to change. Change in stage of readiness to change measured by the modified version of the Prochaska questionnaire was the The final number of participants was 78 (81%) in the intervention arm and 81 (84%) in the control arm. Significantly (P < 0.01), more GPs (57/96 = 59% versus 12/96 = 12%) in the intervention group changed to higher stages of readiness to change. The intervention effect was 46% points (P < 0.001) and 50% points (P < 0.001) in the large and small group setting, respectively. Educational formats that suit different stages of learning can support primary care doctors to reach higher stages of behavioural change in the topic of depressive disorders. Our findings have practical implications for conducting CME programmes in Iran and are possibly also applicable in other parts of the world.
Causal mediation analysis for longitudinal data with exogenous exposure
Bind, M.-A. C.; Vanderweele, T. J.; Coull, B. A.; Schwartz, J. D.
2016-01-01
Mediation analysis is a valuable approach to examine pathways in epidemiological research. Prospective cohort studies are often conducted to study biological mechanisms and often collect longitudinal measurements on each participant. Mediation formulae for longitudinal data have been developed. Here, we formalize the natural direct and indirect effects using a causal framework with potential outcomes that allows for an interaction between the exposure and the mediator. To allow different types of longitudinal measures of the mediator and outcome, we assume two generalized mixed-effects models for both the mediator and the outcome. The model for the mediator has subject-specific random intercepts and random exposure slopes for each cluster, and the outcome model has random intercepts and random slopes for the exposure, the mediator, and their interaction. We also expand our approach to settings with multiple mediators and derive the mediated effects, jointly through all mediators. Our method requires the absence of time-varying confounding with respect to the exposure and the mediator. This assumption is achieved in settings with exogenous exposure and mediator, especially when exposure and mediator are not affected by variables measured at earlier time points. We apply the methodology to data from the Normative Aging Study and estimate the direct and indirect effects, via DNA methylation, of air pollution, and temperature on intercellular adhesion molecule 1 (ICAM-1) protein levels. Our results suggest that air pollution and temperature have a direct effect on ICAM-1 protein levels (i.e. not through a change in ICAM-1 DNA methylation) and that temperature has an indirect effect via a change in ICAM-1 DNA methylation. PMID:26272993
Polgreen, Linnea A; Anthony, Christopher; Carr, Lucas; Simmering, Jacob E; Evans, Nicholas J; Foster, Eric D; Segre, Alberto M; Cremer, James F; Polgreen, Philip M
2018-01-01
Activity-monitoring devices may increase activity, but their effectiveness in sedentary, diseased, and less-motivated populations is unknown. Subjects with diabetes or pre-diabetes were given a Fitbit and randomized into three groups: Fitbit only, Fitbit with reminders, and Fitbit with both reminders and goal setting. Subjects in the reminders group were sent text-message reminders to wear their Fitbit. The goal-setting group was sent a daily text message asking for a step goal. All subjects had three in-person visits (baseline, 3 and 6 months). We modelled daily steps and goal setting using linear mixed-effects models. 138 subjects participated with 48 in the Fitbit-only, 44 in the reminders, and 46 in the goal-setting groups. Daily steps decreased for all groups during the study. Average daily steps were 7123, 6906, and 6854 for the Fitbit-only, the goal-setting, and the reminders groups, respectively. The reminders group was 17.2 percentage points more likely to wear their Fitbit than the Fitbit-only group. Setting a goal was associated with a significant increase of 791 daily steps, but setting more goals did not lead to step increases. In a population of patients with diabetes or pre-diabetes, individualized reminders to wear their Fitbit and elicit personal step goals did not lead to increases in daily steps, although daily steps were higher on days when goals were set. Our intervention improved engagement and data collection, important goals for activity surveillance. This study demonstrates that new, more-effective interventions for increasing activity in patients with pre-diabetes and diabetes are needed.
Anthony, Christopher; Carr, Lucas; Simmering, Jacob E.; Evans, Nicholas J.; Foster, Eric D.; Segre, Alberto M.; Cremer, James F.; Polgreen, Philip M.
2018-01-01
Introduction Activity-monitoring devices may increase activity, but their effectiveness in sedentary, diseased, and less-motivated populations is unknown. Methods Subjects with diabetes or pre-diabetes were given a Fitbit and randomized into three groups: Fitbit only, Fitbit with reminders, and Fitbit with both reminders and goal setting. Subjects in the reminders group were sent text-message reminders to wear their Fitbit. The goal-setting group was sent a daily text message asking for a step goal. All subjects had three in-person visits (baseline, 3 and 6 months). We modelled daily steps and goal setting using linear mixed-effects models. Results 138 subjects participated with 48 in the Fitbit-only, 44 in the reminders, and 46 in the goal-setting groups. Daily steps decreased for all groups during the study. Average daily steps were 7123, 6906, and 6854 for the Fitbit-only, the goal-setting, and the reminders groups, respectively. The reminders group was 17.2 percentage points more likely to wear their Fitbit than the Fitbit-only group. Setting a goal was associated with a significant increase of 791 daily steps, but setting more goals did not lead to step increases. Conclusion In a population of patients with diabetes or pre-diabetes, individualized reminders to wear their Fitbit and elicit personal step goals did not lead to increases in daily steps, although daily steps were higher on days when goals were set. Our intervention improved engagement and data collection, important goals for activity surveillance. This study demonstrates that new, more-effective interventions for increasing activity in patients with pre-diabetes and diabetes are needed. PMID:29718931
Experimental Errors in QSAR Modeling Sets: What We Can Do and What We Cannot Do.
Zhao, Linlin; Wang, Wenyi; Sedykh, Alexander; Zhu, Hao
2017-06-30
Numerous chemical data sets have become available for quantitative structure-activity relationship (QSAR) modeling studies. However, the quality of different data sources may be different based on the nature of experimental protocols. Therefore, potential experimental errors in the modeling sets may lead to the development of poor QSAR models and further affect the predictions of new compounds. In this study, we explored the relationship between the ratio of questionable data in the modeling sets, which was obtained by simulating experimental errors, and the QSAR modeling performance. To this end, we used eight data sets (four continuous endpoints and four categorical endpoints) that have been extensively curated both in-house and by our collaborators to create over 1800 various QSAR models. Each data set was duplicated to create several new modeling sets with different ratios of simulated experimental errors (i.e., randomizing the activities of part of the compounds) in the modeling process. A fivefold cross-validation process was used to evaluate the modeling performance, which deteriorates when the ratio of experimental errors increases. All of the resulting models were also used to predict external sets of new compounds, which were excluded at the beginning of the modeling process. The modeling results showed that the compounds with relatively large prediction errors in cross-validation processes are likely to be those with simulated experimental errors. However, after removing a certain number of compounds with large prediction errors in the cross-validation process, the external predictions of new compounds did not show improvement. Our conclusion is that the QSAR predictions, especially consensus predictions, can identify compounds with potential experimental errors. But removing those compounds by the cross-validation procedure is not a reasonable means to improve model predictivity due to overfitting.
Experimental Errors in QSAR Modeling Sets: What We Can Do and What We Cannot Do
2017-01-01
Numerous chemical data sets have become available for quantitative structure–activity relationship (QSAR) modeling studies. However, the quality of different data sources may be different based on the nature of experimental protocols. Therefore, potential experimental errors in the modeling sets may lead to the development of poor QSAR models and further affect the predictions of new compounds. In this study, we explored the relationship between the ratio of questionable data in the modeling sets, which was obtained by simulating experimental errors, and the QSAR modeling performance. To this end, we used eight data sets (four continuous endpoints and four categorical endpoints) that have been extensively curated both in-house and by our collaborators to create over 1800 various QSAR models. Each data set was duplicated to create several new modeling sets with different ratios of simulated experimental errors (i.e., randomizing the activities of part of the compounds) in the modeling process. A fivefold cross-validation process was used to evaluate the modeling performance, which deteriorates when the ratio of experimental errors increases. All of the resulting models were also used to predict external sets of new compounds, which were excluded at the beginning of the modeling process. The modeling results showed that the compounds with relatively large prediction errors in cross-validation processes are likely to be those with simulated experimental errors. However, after removing a certain number of compounds with large prediction errors in the cross-validation process, the external predictions of new compounds did not show improvement. Our conclusion is that the QSAR predictions, especially consensus predictions, can identify compounds with potential experimental errors. But removing those compounds by the cross-validation procedure is not a reasonable means to improve model predictivity due to overfitting. PMID:28691113
Zevin, Boris; Dedy, Nicolas J; Bonrath, Esther M; Grantcharov, Teodor P
2017-05-01
There is no comprehensive simulation-enhanced training curriculum to address cognitive, psychomotor, and nontechnical skills for an advanced minimally invasive procedure. 1) To develop and provide evidence of validity for a comprehensive simulation-enhanced training (SET) curriculum for an advanced minimally invasive procedure; (2) to demonstrate transfer of acquired psychomotor skills from a simulation laboratory to live porcine model; and (3) to compare training outcomes of SET curriculum group and chief resident group. University. This prospective single-blinded, randomized, controlled trial allocated 20 intermediate-level surgery residents to receive either conventional training (control) or SET curriculum training (intervention). The SET curriculum consisted of cognitive, psychomotor, and nontechnical training modules. Psychomotor skills in a live anesthetized porcine model in the OR was the primary outcome. Knowledge of advanced minimally invasive and bariatric surgery and nontechnical skills in a simulated OR crisis scenario were the secondary outcomes. Residents in the SET curriculum group went on to perform a laparoscopic jejunojejunostomy in the OR. Cognitive, psychomotor, and nontechnical skills of SET curriculum group were also compared to a group of 12 chief surgery residents. SET curriculum group demonstrated superior psychomotor skills in a live porcine model (56 [47-62] versus 44 [38-53], P<.05) and superior nontechnical skills (41 [38-45] versus 31 [24-40], P<.01) compared with conventional training group. SET curriculum group and conventional training group demonstrated equivalent knowledge (14 [12-15] versus 13 [11-15], P = 0.47). SET curriculum group demonstrated equivalent psychomotor skills in the live porcine model and in the OR in a human patient (56 [47-62] versus 63 [61-68]; P = .21). SET curriculum group demonstrated inferior knowledge (13 [11-15] versus 16 [14-16]; P<.05), equivalent psychomotor skill (63 [61-68] versus 68 [62-74]; P = .50), and superior nontechnical skills (41 [38-45] versus 34 [27-35], P<.01) compared with chief resident group. Completion of the SET curriculum resulted in superior training outcomes, compared with conventional surgery training. Implementation of the SET curriculum can standardize training for an advanced minimally invasive procedure and can ensure that comprehensive proficiency milestones are met before exposure to patient care. Copyright © 2017 American Society for Bariatric Surgery. Published by Elsevier Inc. All rights reserved.
Hurdle models for multilevel zero-inflated data via h-likelihood.
Molas, Marek; Lesaffre, Emmanuel
2010-12-30
Count data often exhibit overdispersion. One type of overdispersion arises when there is an excess of zeros in comparison with the standard Poisson distribution. Zero-inflated Poisson and hurdle models have been proposed to perform a valid likelihood-based analysis to account for the surplus of zeros. Further, data often arise in clustered, longitudinal or multiple-membership settings. The proper analysis needs to reflect the design of a study. Typically random effects are used to account for dependencies in the data. We examine the h-likelihood estimation and inference framework for hurdle models with random effects for complex designs. We extend the h-likelihood procedures to fit hurdle models, thereby extending h-likelihood to truncated distributions. Two applications of the methodology are presented. Copyright © 2010 John Wiley & Sons, Ltd.
A Bayesian, generalized frailty model for comet assays.
Ghebretinsae, Aklilu Habteab; Faes, Christel; Molenberghs, Geert; De Boeck, Marlies; Geys, Helena
2013-05-01
This paper proposes a flexible modeling approach for so-called comet assay data regularly encountered in preclinical research. While such data consist of non-Gaussian outcomes in a multilevel hierarchical structure, traditional analyses typically completely or partly ignore this hierarchical nature by summarizing measurements within a cluster. Non-Gaussian outcomes are often modeled using exponential family models. This is true not only for binary and count data, but also for, example, time-to-event outcomes. Two important reasons for extending this family are for (1) the possible occurrence of overdispersion, meaning that the variability in the data may not be adequately described by the models, which often exhibit a prescribed mean-variance link, and (2) the accommodation of a hierarchical structure in the data, owing to clustering in the data. The first issue is dealt with through so-called overdispersion models. Clustering is often accommodated through the inclusion of random subject-specific effects. Though not always, one conventionally assumes such random effects to be normally distributed. In the case of time-to-event data, one encounters, for example, the gamma frailty model (Duchateau and Janssen, 2007 ). While both of these issues may occur simultaneously, models combining both are uncommon. Molenberghs et al. ( 2010 ) proposed a broad class of generalized linear models accommodating overdispersion and clustering through two separate sets of random effects. Here, we use this method to model data from a comet assay with a three-level hierarchical structure. Although a conjugate gamma random effect is used for the overdispersion random effect, both gamma and normal random effects are considered for the hierarchical random effect. Apart from model formulation, we place emphasis on Bayesian estimation. Our proposed method has an upper hand over the traditional analysis in that it (1) uses the appropriate distribution stipulated in the literature; (2) deals with the complete hierarchical nature; and (3) uses all information instead of summary measures. The fit of the model to the comet assay is compared against the background of more conventional model fits. Results indicate the toxicity of 1,2-dimethylhydrazine dihydrochloride at different dose levels (low, medium, and high).
Fuzzy Random λ-Mean SAD Portfolio Selection Problem: An Ant Colony Optimization Approach
NASA Astrophysics Data System (ADS)
Thakur, Gour Sundar Mitra; Bhattacharyya, Rupak; Mitra, Swapan Kumar
2010-10-01
To reach the investment goal, one has to select a combination of securities among different portfolios containing large number of securities. Only the past records of each security do not guarantee the future return. As there are many uncertain factors which directly or indirectly influence the stock market and there are also some newer stock markets which do not have enough historical data, experts' expectation and experience must be combined with the past records to generate an effective portfolio selection model. In this paper the return of security is assumed to be Fuzzy Random Variable Set (FRVS), where returns are set of random numbers which are in turn fuzzy numbers. A new λ-Mean Semi Absolute Deviation (λ-MSAD) portfolio selection model is developed. The subjective opinions of the investors to the rate of returns of each security are taken into consideration by introducing a pessimistic-optimistic parameter vector λ. λ-Mean Semi Absolute Deviation (λ-MSAD) model is preferred as it follows absolute deviation of the rate of returns of a portfolio instead of the variance as the measure of the risk. As this model can be reduced to Linear Programming Problem (LPP) it can be solved much faster than quadratic programming problems. Ant Colony Optimization (ACO) is used for solving the portfolio selection problem. ACO is a paradigm for designing meta-heuristic algorithms for combinatorial optimization problem. Data from BSE is used for illustration.
J. Breidenbach; E. Kublin; R. McGaughey; H.-E. Andersen; S. Reutebuch
2008-01-01
For this study, hierarchical data sets--in that several sample plots are located within a stand--were analyzed for study sites in the USA and Germany. The German data had an additional hierarchy as the stands are located within four distinct public forests. Fixed-effects models and mixed-effects models with a random intercept on the stand level were fit to each data...
A computational approach to compare regression modelling strategies in prediction research.
Pajouheshnia, Romin; Pestman, Wiebe R; Teerenstra, Steven; Groenwold, Rolf H H
2016-08-25
It is often unclear which approach to fit, assess and adjust a model will yield the most accurate prediction model. We present an extension of an approach for comparing modelling strategies in linear regression to the setting of logistic regression and demonstrate its application in clinical prediction research. A framework for comparing logistic regression modelling strategies by their likelihoods was formulated using a wrapper approach. Five different strategies for modelling, including simple shrinkage methods, were compared in four empirical data sets to illustrate the concept of a priori strategy comparison. Simulations were performed in both randomly generated data and empirical data to investigate the influence of data characteristics on strategy performance. We applied the comparison framework in a case study setting. Optimal strategies were selected based on the results of a priori comparisons in a clinical data set and the performance of models built according to each strategy was assessed using the Brier score and calibration plots. The performance of modelling strategies was highly dependent on the characteristics of the development data in both linear and logistic regression settings. A priori comparisons in four empirical data sets found that no strategy consistently outperformed the others. The percentage of times that a model adjustment strategy outperformed a logistic model ranged from 3.9 to 94.9 %, depending on the strategy and data set. However, in our case study setting the a priori selection of optimal methods did not result in detectable improvement in model performance when assessed in an external data set. The performance of prediction modelling strategies is a data-dependent process and can be highly variable between data sets within the same clinical domain. A priori strategy comparison can be used to determine an optimal logistic regression modelling strategy for a given data set before selecting a final modelling approach.
Chow, Sy-Miin; Lu, Zhaohua; Sherwood, Andrew; Zhu, Hongtu
2016-03-01
The past decade has evidenced the increased prevalence of irregularly spaced longitudinal data in social sciences. Clearly lacking, however, are modeling tools that allow researchers to fit dynamic models to irregularly spaced data, particularly data that show nonlinearity and heterogeneity in dynamical structures. We consider the issue of fitting multivariate nonlinear differential equation models with random effects and unknown initial conditions to irregularly spaced data. A stochastic approximation expectation-maximization algorithm is proposed and its performance is evaluated using a benchmark nonlinear dynamical systems model, namely, the Van der Pol oscillator equations. The empirical utility of the proposed technique is illustrated using a set of 24-h ambulatory cardiovascular data from 168 men and women. Pertinent methodological challenges and unresolved issues are discussed.
Chow, Sy- Miin; Lu, Zhaohua; Zhu, Hongtu; Sherwood, Andrew
2014-01-01
The past decade has evidenced the increased prevalence of irregularly spaced longitudinal data in social sciences. Clearly lacking, however, are modeling tools that allow researchers to fit dynamic models to irregularly spaced data, particularly data that show nonlinearity and heterogeneity in dynamical structures. We consider the issue of fitting multivariate nonlinear differential equation models with random effects and unknown initial conditions to irregularly spaced data. A stochastic approximation expectation–maximization algorithm is proposed and its performance is evaluated using a benchmark nonlinear dynamical systems model, namely, the Van der Pol oscillator equations. The empirical utility of the proposed technique is illustrated using a set of 24-h ambulatory cardiovascular data from 168 men and women. Pertinent methodological challenges and unresolved issues are discussed. PMID:25416456
Computer simulation of the probability that endangered whales will interact with oil spills
DOE Office of Scientific and Technical Information (OSTI.GOV)
Reed, M.; Jayko, K.; Bowles, A.
1987-03-01
A numerical model system was developed to assess quantitatively the probability that endangered bowhead and gray whales will encounter spilled oil in Alaskan waters. Bowhead and gray whale migration and diving-surfacing models, and an oil-spill trajectory model comprise the system. The migration models were developed from conceptual considerations, then calibrated with and tested against observations. The movement of a whale point is governed by a random walk algorithm which stochastically follows a migratory pathway. The oil-spill model, developed under a series of other contracts, accounts for transport and spreading behavior in open water and in the presence of sea ice.more » Historical wind records and heavy, normal, or light ice cover data sets are selected at random to provide stochastic oil-spill scenarios for whale-oil interaction simulations.« less
Phelps, G.A.
2008-01-01
This report describes some simple spatial statistical methods to explore the relationships of scattered points to geologic or other features, represented by points, lines, or areas. It also describes statistical methods to search for linear trends and clustered patterns within the scattered point data. Scattered points are often contained within irregularly shaped study areas, necessitating the use of methods largely unexplored in the point pattern literature. The methods take advantage of the power of modern GIS toolkits to numerically approximate the null hypothesis of randomly located data within an irregular study area. Observed distributions can then be compared with the null distribution of a set of randomly located points. The methods are non-parametric and are applicable to irregularly shaped study areas. Patterns within the point data are examined by comparing the distribution of the orientation of the set of vectors defined by each pair of points within the data with the equivalent distribution for a random set of points within the study area. A simple model is proposed to describe linear or clustered structure within scattered data. A scattered data set of damage to pavement and pipes, recorded after the 1989 Loma Prieta earthquake, is used as an example to demonstrate the analytical techniques. The damage is found to be preferentially located nearer a set of mapped lineaments than randomly scattered damage, suggesting range-front faulting along the base of the Santa Cruz Mountains is related to both the earthquake damage and the mapped lineaments. The damage also exhibit two non-random patterns: a single cluster of damage centered in the town of Los Gatos, California, and a linear alignment of damage along the range front of the Santa Cruz Mountains, California. The linear alignment of damage is strongest between 45? and 50? northwest. This agrees well with the mean trend of the mapped lineaments, measured as 49? northwest.
Automatic Estimation of Osteoporotic Fracture Cases by Using Ensemble Learning Approaches.
Kilic, Niyazi; Hosgormez, Erkan
2016-03-01
Ensemble learning methods are one of the most powerful tools for the pattern classification problems. In this paper, the effects of ensemble learning methods and some physical bone densitometry parameters on osteoporotic fracture detection were investigated. Six feature set models were constructed including different physical parameters and they fed into the ensemble classifiers as input features. As ensemble learning techniques, bagging, gradient boosting and random subspace (RSM) were used. Instance based learning (IBk) and random forest (RF) classifiers applied to six feature set models. The patients were classified into three groups such as osteoporosis, osteopenia and control (healthy), using ensemble classifiers. Total classification accuracy and f-measure were also used to evaluate diagnostic performance of the proposed ensemble classification system. The classification accuracy has reached to 98.85 % by the combination of model 6 (five BMD + five T-score values) using RSM-RF classifier. The findings of this paper suggest that the patients will be able to be warned before a bone fracture occurred, by just examining some physical parameters that can easily be measured without invasive operations.
Uncertainty Analysis in 3D Equilibrium Reconstruction
Cianciosa, Mark R.; Hanson, James D.; Maurer, David A.
2018-02-21
Reconstruction is an inverse process where a parameter space is searched to locate a set of parameters with the highest probability of describing experimental observations. Due to systematic errors and uncertainty in experimental measurements, this optimal set of parameters will contain some associated uncertainty. This uncertainty in the optimal parameters leads to uncertainty in models derived using those parameters. V3FIT is a three-dimensional (3D) equilibrium reconstruction code that propagates uncertainty from the input signals, to the reconstructed parameters, and to the final model. Here in this paper, we describe the methods used to propagate uncertainty in V3FIT. Using the resultsmore » of whole shot 3D equilibrium reconstruction of the Compact Toroidal Hybrid, this propagated uncertainty is validated against the random variation in the resulting parameters. Two different model parameterizations demonstrate how the uncertainty propagation can indicate the quality of a reconstruction. As a proxy for random sampling, the whole shot reconstruction results in a time interval that will be used to validate the propagated uncertainty from a single time slice.« less
Uncertainty Analysis in 3D Equilibrium Reconstruction
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cianciosa, Mark R.; Hanson, James D.; Maurer, David A.
Reconstruction is an inverse process where a parameter space is searched to locate a set of parameters with the highest probability of describing experimental observations. Due to systematic errors and uncertainty in experimental measurements, this optimal set of parameters will contain some associated uncertainty. This uncertainty in the optimal parameters leads to uncertainty in models derived using those parameters. V3FIT is a three-dimensional (3D) equilibrium reconstruction code that propagates uncertainty from the input signals, to the reconstructed parameters, and to the final model. Here in this paper, we describe the methods used to propagate uncertainty in V3FIT. Using the resultsmore » of whole shot 3D equilibrium reconstruction of the Compact Toroidal Hybrid, this propagated uncertainty is validated against the random variation in the resulting parameters. Two different model parameterizations demonstrate how the uncertainty propagation can indicate the quality of a reconstruction. As a proxy for random sampling, the whole shot reconstruction results in a time interval that will be used to validate the propagated uncertainty from a single time slice.« less
Evaluation of variable selection methods for random forests and omics data sets.
Degenhardt, Frauke; Seifert, Stephan; Szymczak, Silke
2017-10-16
Machine learning methods and in particular random forests are promising approaches for prediction based on high dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the objective is the identification of involved variables to find active networks and pathways, approaches that aim to select all relevant variables should be preferred. We evaluated several variable selection procedures based on simulated data as well as publicly available experimental methylation and gene expression data. Our comparison included the Boruta algorithm, the Vita method, recurrent relative variable importance, a permutation approach and its parametric variant (Altmann) as well as recursive feature elimination (RFE). In our simulation studies, Boruta was the most powerful approach, followed closely by the Vita method. Both approaches demonstrated similar stability in variable selection, while Vita was the most robust approach under a pure null model without any predictor variables related to the outcome. In the analysis of the different experimental data sets, Vita demonstrated slightly better stability in variable selection and was less computationally intensive than Boruta.In conclusion, we recommend the Boruta and Vita approaches for the analysis of high-dimensional data sets. Vita is considerably faster than Boruta and thus more suitable for large data sets, but only Boruta can also be applied in low-dimensional settings. © The Author 2017. Published by Oxford University Press.
NASA Astrophysics Data System (ADS)
Kim, K.-h.; Oh, T.-s.; Park, K.-r.; Lee, J. H.; Ghim, Y.-c.
2017-11-01
One factor determining the reliability of measurements of electron temperature using a Thomson scattering (TS) system is transmittance of the optical bandpass filters in polychromators. We investigate the system performance as a function of electron temperature to determine reliable range of measurements for a given set of the optical bandpass filters. We show that such a reliability, i.e., both bias and random errors, can be obtained by building a forward model of the KSTAR TS system to generate synthetic TS data with the prescribed electron temperature and density profiles. The prescribed profiles are compared with the estimated ones to quantify both bias and random errors.
Weighted re-randomization tests for minimization with unbalanced allocation.
Han, Baoguang; Yu, Menggang; McEntegart, Damian
2013-01-01
Re-randomization test has been considered as a robust alternative to the traditional population model-based methods for analyzing randomized clinical trials. This is especially so when the clinical trials are randomized according to minimization, which is a popular covariate-adaptive randomization method for ensuring balance among prognostic factors. Among various re-randomization tests, fixed-entry-order re-randomization is advocated as an effective strategy when a temporal trend is suspected. Yet when the minimization is applied to trials with unequal allocation, fixed-entry-order re-randomization test is biased and thus compromised in power. We find that the bias is due to non-uniform re-allocation probabilities incurred by the re-randomization in this case. We therefore propose a weighted fixed-entry-order re-randomization test to overcome the bias. The performance of the new test was investigated in simulation studies that mimic the settings of a real clinical trial. The weighted re-randomization test was found to work well in the scenarios investigated including the presence of a strong temporal trend. Copyright © 2013 John Wiley & Sons, Ltd.
An agent-based model of dialect evolution in killer whales.
Filatova, Olga A; Miller, Patrick J O
2015-05-21
The killer whale is one of the few animal species with vocal dialects that arise from socially learned group-specific call repertoires. We describe a new agent-based model of killer whale populations and test a set of vocal-learning rules to assess which mechanisms may lead to the formation of dialect groupings observed in the wild. We tested a null model with genetic transmission and no learning, and ten models with learning rules that differ by template source (mother or matriline), variation type (random errors or innovations) and type of call change (no divergence from kin vs. divergence from kin). The null model without vocal learning did not produce the pattern of group-specific call repertoires we observe in nature. Learning from either mother alone or the entire matriline with calls changing by random errors produced a graded distribution of the call phenotype, without the discrete call types observed in nature. Introducing occasional innovation or random error proportional to matriline variance yielded more or less discrete and stable call types. A tendency to diverge from the calls of related matrilines provided fast divergence of loose call clusters. A pattern resembling the dialect diversity observed in the wild arose only when rules were applied in combinations and similar outputs could arise from different learning rules and their combinations. Our results emphasize the lack of information on quantitative features of wild killer whale dialects and reveal a set of testable questions that can draw insights into the cultural evolution of killer whale dialects. Copyright © 2015 Elsevier Ltd. All rights reserved.
Passenger Flow Analysis, 1978. Riverside Line, MBTA.
DOT National Transportation Integrated Search
1981-08-01
In order to complete a set of passenger flow estimates for use in a simulation model of a light rail line, a count of passenger movement was made at randomly selected stations in the underground section. Above-ground stations had been studied a year ...
Ghiglietti, Andrea; Scarale, Maria Giovanna; Miceli, Rosalba; Ieva, Francesca; Mariani, Luigi; Gavazzi, Cecilia; Paganoni, Anna Maria; Edefonti, Valeria
2018-03-22
Recently, response-adaptive designs have been proposed in randomized clinical trials to achieve ethical and/or cost advantages by using sequential accrual information collected during the trial to dynamically update the probabilities of treatment assignments. In this context, urn models-where the probability to assign patients to treatments is interpreted as the proportion of balls of different colors available in a virtual urn-have been used as response-adaptive randomization rules. We propose the use of Randomly Reinforced Urn (RRU) models in a simulation study based on a published randomized clinical trial on the efficacy of home enteral nutrition in cancer patients after major gastrointestinal surgery. We compare results with the RRU design with those previously published with the non-adaptive approach. We also provide a code written with the R software to implement the RRU design in practice. In detail, we simulate 10,000 trials based on the RRU model in three set-ups of different total sample sizes. We report information on the number of patients allocated to the inferior treatment and on the empirical power of the t-test for the treatment coefficient in the ANOVA model. We carry out a sensitivity analysis to assess the effect of different urn compositions. For each sample size, in approximately 75% of the simulation runs, the number of patients allocated to the inferior treatment by the RRU design is lower, as compared to the non-adaptive design. The empirical power of the t-test for the treatment effect is similar in the two designs.
van der Ploeg, Tjeerd; Nieboer, Daan; Steyerberg, Ewout W
2016-10-01
Prediction of medical outcomes may potentially benefit from using modern statistical modeling techniques. We aimed to externally validate modeling strategies for prediction of 6-month mortality of patients suffering from traumatic brain injury (TBI) with predictor sets of increasing complexity. We analyzed individual patient data from 15 different studies including 11,026 TBI patients. We consecutively considered a core set of predictors (age, motor score, and pupillary reactivity), an extended set with computed tomography scan characteristics, and a further extension with two laboratory measurements (glucose and hemoglobin). With each of these sets, we predicted 6-month mortality using default settings with five statistical modeling techniques: logistic regression (LR), classification and regression trees, random forests (RFs), support vector machines (SVM) and neural nets. For external validation, a model developed on one of the 15 data sets was applied to each of the 14 remaining sets. This process was repeated 15 times for a total of 630 validations. The area under the receiver operating characteristic curve (AUC) was used to assess the discriminative ability of the models. For the most complex predictor set, the LR models performed best (median validated AUC value, 0.757), followed by RF and support vector machine models (median validated AUC value, 0.735 and 0.732, respectively). With each predictor set, the classification and regression trees models showed poor performance (median validated AUC value, <0.7). The variability in performance across the studies was smallest for the RF- and LR-based models (inter quartile range for validated AUC values from 0.07 to 0.10). In the area of predicting mortality from TBI, nonlinear and nonadditive effects are not pronounced enough to make modern prediction methods beneficial. Copyright © 2016 Elsevier Inc. All rights reserved.
ERIC Educational Resources Information Center
Tak, Susanne; Plaisier, Marco; van Rooij, Iris
2008-01-01
To explain human performance on the "Traveling Salesperson" problem (TSP), MacGregor, Ormerod, and Chronicle (2000) proposed that humans construct solutions according to the steps described by their convex-hull algorithm. Focusing on tour length as the dependent variable, and using only random or semirandom point sets, the authors…
ERIC Educational Resources Information Center
Corsello, Maryann; Sharma, Anu; Jerabek, Angela
2015-01-01
Ninth grade is a pivotal year for students. Numerous studies find that academic performance in 9th grade often sets the student's trajectory throughout the high school years, as well as the probability of graduation. The Building Assets Reducing Risks (BARR) model is a comprehensive approach that addresses developmental, academic, and school…
Combined statistical analysis of landslide release and propagation
NASA Astrophysics Data System (ADS)
Mergili, Martin; Rohmaneo, Mohammad; Chu, Hone-Jay
2016-04-01
Statistical methods - often coupled with stochastic concepts - are commonly employed to relate areas affected by landslides with environmental layers, and to estimate spatial landslide probabilities by applying these relationships. However, such methods only concern the release of landslides, disregarding their motion. Conceptual models for mass flow routing are used for estimating landslide travel distances and possible impact areas. Automated approaches combining release and impact probabilities are rare. The present work attempts to fill this gap by a fully automated procedure combining statistical and stochastic elements, building on the open source GRASS GIS software: (1) The landslide inventory is subset into release and deposition zones. (2) We employ a traditional statistical approach to estimate the spatial release probability of landslides. (3) We back-calculate the probability distribution of the angle of reach of the observed landslides, employing the software tool r.randomwalk. One set of random walks is routed downslope from each pixel defined as release area. Each random walk stops when leaving the observed impact area of the landslide. (4) The cumulative probability function (cdf) derived in (3) is used as input to route a set of random walks downslope from each pixel in the study area through the DEM, assigning the probability gained from the cdf to each pixel along the path (impact probability). The impact probability of a pixel is defined as the average impact probability of all sets of random walks impacting a pixel. Further, the average release probabilities of the release pixels of all sets of random walks impacting a given pixel are stored along with the area of the possible release zone. (5) We compute the zonal release probability by increasing the release probability according to the size of the release zone - the larger the zone, the larger the probability that a landslide will originate from at least one pixel within this zone. We quantify this relationship by a set of empirical curves. (6) Finally, we multiply the zonal release probability with the impact probability in order to estimate the combined impact probability for each pixel. We demonstrate the model with a 167 km² study area in Taiwan, using an inventory of landslides triggered by the typhoon Morakot. Analyzing the model results leads us to a set of key conclusions: (i) The average composite impact probability over the entire study area corresponds well to the density of observed landside pixels. Therefore we conclude that the method is valid in general, even though the concept of the zonal release probability bears some conceptual issues that have to be kept in mind. (ii) The parameters used as predictors cannot fully explain the observed distribution of landslides. The size of the release zone influences the composite impact probability to a larger degree than the pixel-based release probability. (iii) The prediction rate increases considerably when excluding the largest, deep-seated, landslides from the analysis. We conclude that such landslides are mainly related to geological features hardly reflected in the predictor layers used.
Inverse-Probability-Weighted Estimation for Monotone and Nonmonotone Missing Data.
Sun, BaoLuo; Perkins, Neil J; Cole, Stephen R; Harel, Ofer; Mitchell, Emily M; Schisterman, Enrique F; Tchetgen Tchetgen, Eric J
2018-03-01
Missing data is a common occurrence in epidemiologic research. In this paper, 3 data sets with induced missing values from the Collaborative Perinatal Project, a multisite US study conducted from 1959 to 1974, are provided as examples of prototypical epidemiologic studies with missing data. Our goal was to estimate the association of maternal smoking behavior with spontaneous abortion while adjusting for numerous confounders. At the same time, we did not necessarily wish to evaluate the joint distribution among potentially unobserved covariates, which is seldom the subject of substantive scientific interest. The inverse probability weighting (IPW) approach preserves the semiparametric structure of the underlying model of substantive interest and clearly separates the model of substantive interest from the model used to account for the missing data. However, IPW often will not result in valid inference if the missing-data pattern is nonmonotone, even if the data are missing at random. We describe a recently proposed approach to modeling nonmonotone missing-data mechanisms under missingness at random to use in constructing the weights in IPW complete-case estimation, and we illustrate the approach using 3 data sets described in a companion article (Am J Epidemiol. 2018;187(3):568-575).
Inverse-Probability-Weighted Estimation for Monotone and Nonmonotone Missing Data
Sun, BaoLuo; Perkins, Neil J; Cole, Stephen R; Harel, Ofer; Mitchell, Emily M; Schisterman, Enrique F; Tchetgen Tchetgen, Eric J
2018-01-01
Abstract Missing data is a common occurrence in epidemiologic research. In this paper, 3 data sets with induced missing values from the Collaborative Perinatal Project, a multisite US study conducted from 1959 to 1974, are provided as examples of prototypical epidemiologic studies with missing data. Our goal was to estimate the association of maternal smoking behavior with spontaneous abortion while adjusting for numerous confounders. At the same time, we did not necessarily wish to evaluate the joint distribution among potentially unobserved covariates, which is seldom the subject of substantive scientific interest. The inverse probability weighting (IPW) approach preserves the semiparametric structure of the underlying model of substantive interest and clearly separates the model of substantive interest from the model used to account for the missing data. However, IPW often will not result in valid inference if the missing-data pattern is nonmonotone, even if the data are missing at random. We describe a recently proposed approach to modeling nonmonotone missing-data mechanisms under missingness at random to use in constructing the weights in IPW complete-case estimation, and we illustrate the approach using 3 data sets described in a companion article (Am J Epidemiol. 2018;187(3):568–575). PMID:29165557
Knot probabilities in random diagrams
NASA Astrophysics Data System (ADS)
Cantarella, Jason; Chapman, Harrison; Mastin, Matt
2016-10-01
We consider a natural model of random knotting—choose a knot diagram at random from the finite set of diagrams with n crossings. We tabulate diagrams with 10 and fewer crossings and classify the diagrams by knot type, allowing us to compute exact probabilities for knots in this model. As expected, most diagrams with 10 and fewer crossings are unknots (about 78% of the roughly 1.6 billion 10 crossing diagrams). For these crossing numbers, the unknot fraction is mostly explained by the prevalence of ‘tree-like’ diagrams which are unknots for any assignment of over/under information at crossings. The data shows a roughly linear relationship between the log of knot type probability and the log of the frequency rank of the knot type, analogous to Zipf’s law for word frequency. The complete tabulation and all knot frequencies are included as supplementary data.
A mixed-effects regression model for longitudinal multivariate ordinal data.
Liu, Li C; Hedeker, Donald
2006-03-01
A mixed-effects item response theory model that allows for three-level multivariate ordinal outcomes and accommodates multiple random subject effects is proposed for analysis of multivariate ordinal outcomes in longitudinal studies. This model allows for the estimation of different item factor loadings (item discrimination parameters) for the multiple outcomes. The covariates in the model do not have to follow the proportional odds assumption and can be at any level. Assuming either a probit or logistic response function, maximum marginal likelihood estimation is proposed utilizing multidimensional Gauss-Hermite quadrature for integration of the random effects. An iterative Fisher scoring solution, which provides standard errors for all model parameters, is used. An analysis of a longitudinal substance use data set, where four items of substance use behavior (cigarette use, alcohol use, marijuana use, and getting drunk or high) are repeatedly measured over time, is used to illustrate application of the proposed model.
NASA Astrophysics Data System (ADS)
López-Comino, José Ángel; Stich, Daniel; Ferreira, Ana M. G.; Morales, Jose
2015-09-01
Inversions for the full slip distribution of earthquakes provide detailed models of earthquake sources, but stability and non-uniqueness of the inversions is a major concern. The problem is underdetermined in any realistic setting, and significantly different slip distributions may translate to fairly similar seismograms. In such circumstances, inverting for a single best model may become overly dependent on the details of the procedure. Instead, we propose to perform extended fault inversion trough falsification. We generate a representative set of heterogeneous slipmaps, compute their forward predictions, and falsify inappropriate trial models that do not reproduce the data within a reasonable level of mismodelling. The remainder of surviving trial models forms our set of coequal solutions. The solution set may contain only members with similar slip distributions, or else uncover some fundamental ambiguity such as, for example, different patterns of main slip patches. For a feasibility study, we use teleseismic body wave recordings from the 2012 September 5 Nicoya, Costa Rica earthquake, although the inversion strategy can be applied to any type of seismic, geodetic or tsunami data for which we can handle the forward problem. We generate 10 000 pseudo-random, heterogeneous slip distributions assuming a von Karman autocorrelation function, keeping the rake angle, rupture velocity and slip velocity function fixed. The slip distribution of the 2012 Nicoya earthquake turns out to be relatively well constrained from 50 teleseismic waveforms. Two hundred fifty-two slip models with normalized L1-fit within 5 per cent from the global minimum from our solution set. They consistently show a single dominant slip patch around the hypocentre. Uncertainties are related to the details of the slip maximum, including the amount of peak slip (2-3.5 m), as well as the characteristics of peripheral slip below 1 m. Synthetic tests suggest that slip patterns such as Nicoya may be a fortunate case, while it may be more difficult to unambiguously reconstruct more distributed slip from teleseismic data.
An Xdata Architecture for Federated Graph Models and Multi-tier Asymmetric Computing
2014-01-01
Wikipedia, a scale-free random graph (kron), Akamai trace route data, Bitcoin transaction data, and a Twitter follower network. We present results for...3x (SSSP on a random graph) and nearly 300x (Akamai and Bitcoin ) over the CPU performance of a well-known and widely deployed CPU-based graph...provided better throughput for smaller frontiers such as roadmaps or the Bitcoin data set. In our work, we have focused on two-phase kernels, but it
Robust portfolio selection based on asymmetric measures of variability of stock returns
NASA Astrophysics Data System (ADS)
Chen, Wei; Tan, Shaohua
2009-10-01
This paper addresses a new uncertainty set--interval random uncertainty set for robust optimization. The form of interval random uncertainty set makes it suitable for capturing the downside and upside deviations of real-world data. These deviation measures capture distributional asymmetry and lead to better optimization results. We also apply our interval random chance-constrained programming to robust mean-variance portfolio selection under interval random uncertainty sets in the elements of mean vector and covariance matrix. Numerical experiments with real market data indicate that our approach results in better portfolio performance.
Localized motion in random matrix decomposition of complex financial systems
NASA Astrophysics Data System (ADS)
Jiang, Xiong-Fei; Zheng, Bo; Ren, Fei; Qiu, Tian
2017-04-01
With the random matrix theory, we decompose the multi-dimensional time series of complex financial systems into a set of orthogonal eigenmode functions, which are classified into the market mode, sector mode, and random mode. In particular, the localized motion generated by the business sectors, plays an important role in financial systems. Both the business sectors and their impact on the stock market are identified from the localized motion. We clarify that the localized motion induces different characteristics of the time correlations for the stock-market index and individual stocks. With a variation of a two-factor model, we reproduce the return-volatility correlations of the eigenmodes.
Metabolomics biomarkers to predict acamprosate treatment response in alcohol-dependent subjects.
Hinton, David J; Vázquez, Marely Santiago; Geske, Jennifer R; Hitschfeld, Mario J; Ho, Ada M C; Karpyak, Victor M; Biernacka, Joanna M; Choi, Doo-Sup
2017-05-31
Precision medicine for alcohol use disorder (AUD) allows optimal treatment of the right patient with the right drug at the right time. Here, we generated multivariable models incorporating clinical information and serum metabolite levels to predict acamprosate treatment response. The sample of 120 patients was randomly split into a training set (n = 80) and test set (n = 40) five independent times. Treatment response was defined as complete abstinence (no alcohol consumption during 3 months of acamprosate treatment) while nonresponse was defined as any alcohol consumption during this period. In each of the five training sets, we built a predictive model using a least absolute shrinkage and section operator (LASSO) penalized selection method and then evaluated the predictive performance of each model in the corresponding test set. The models predicted acamprosate treatment response with a mean sensitivity and specificity in the test sets of 0.83 and 0.31, respectively, suggesting our model performed well at predicting responders, but not non-responders (i.e. many non-responders were predicted to respond). Studies with larger sample sizes and additional biomarkers will expand the clinical utility of predictive algorithms for pharmaceutical response in AUD.
Janet, Jon Paul; Kulik, Heather J
2017-11-22
Machine learning (ML) of quantum mechanical properties shows promise for accelerating chemical discovery. For transition metal chemistry where accurate calculations are computationally costly and available training data sets are small, the molecular representation becomes a critical ingredient in ML model predictive accuracy. We introduce a series of revised autocorrelation functions (RACs) that encode relationships of the heuristic atomic properties (e.g., size, connectivity, and electronegativity) on a molecular graph. We alter the starting point, scope, and nature of the quantities evaluated in standard ACs to make these RACs amenable to inorganic chemistry. On an organic molecule set, we first demonstrate superior standard AC performance to other presently available topological descriptors for ML model training, with mean unsigned errors (MUEs) for atomization energies on set-aside test molecules as low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs on set-aside test molecules in spin-state splitting in comparison to 15-20× higher errors for feature sets that encode whole-molecule structural information. Systematic feature selection methods including univariate filtering, recursive feature elimination, and direct optimization (e.g., random forest and LASSO) are compared. Random-forest- or LASSO-selected subsets 4-5× smaller than the full RAC set produce sub- to 1 kcal/mol spin-splitting MUEs, with good transferability to metal-ligand bond length prediction (0.004-5 Å MUE) and redox potential on a smaller data set (0.2-0.3 eV MUE). Evaluation of feature selection results across property sets reveals the relative importance of local, electronic descriptors (e.g., electronegativity, atomic number) in spin-splitting and distal, steric effects in redox potential and bond lengths.
Pugliese, Laura; Woodriff, Molly; Crowley, Olga; Lam, Vivian; Sohn, Jeremy; Bradley, Scott
2016-03-16
Rising rates of smartphone ownership highlight opportunities for improved mobile application usage in clinical trials. While current methods call for device provisioning, the "bring your own device" (BYOD) model permits participants to use personal phones allowing for improved patient engagement and lowered operational costs. However, more evidence is needed to demonstrate the BYOD model's feasibility in research settings. To assess if CentrosHealth, a mobile application designed to support trial compliance, produces different outcomes in medication adherence and application engagement when distributed through study-provisioned devices compared to the BYOD model. 87 participants were randomly selected to use the mobile application or no intervention for a 28-day pilot study at a 2:1 randomization ratio (2 intervention: 1 control) and asked to consume a twice-daily probiotic supplement. The application users were further randomized into two groups: receiving the application on a personal "BYOD" or study-provided smartphone. In-depth interviews were performed in a randomly-selected subset of the intervention group (five BYOD and five study-provided smartphone users). The BYOD subgroup showed significantly greater engagement than study-provided phone users, as shown by higher application use frequency and duration over the study period. The BYOD subgroup also demonstrated a significant effect of engagement on medication adherence for number of application sessions (unstandardized regression coefficient beta=0.0006, p=0.02) and time spent therein (beta=0.00001, p=0.03). Study-provided phone users showed higher initial adherence rates, but greater decline (5.7%) than BYOD users (0.9%) over the study period. In-depth interviews revealed that participants preferred the BYOD model over using study-provided devices. Results indicate that the BYOD model is feasible in health research settings and improves participant experience, calling for further BYOD model validity assessment. Although group differences in medication adherence decline were insignificant, the greater trend of decline in provisioned device users warrants further investigation to determine if trends reach significance over time. Significantly higher application engagement rates and effect of engagement on medication adherence in the BYOD subgroup similarly imply that greater application engagement may correlate to better medication adherence over time.
Goedhart, Paul W; van der Voet, Hilko; Baldacchino, Ferdinando; Arpaia, Salvatore
2014-04-01
Genetic modification of plants may result in unintended effects causing potentially adverse effects on the environment. A comparative safety assessment is therefore required by authorities, such as the European Food Safety Authority, in which the genetically modified plant is compared with its conventional counterpart. Part of the environmental risk assessment is a comparative field experiment in which the effect on non-target organisms is compared. Statistical analysis of such trials come in two flavors: difference testing and equivalence testing. It is important to know the statistical properties of these, for example, the power to detect environmental change of a given magnitude, before the start of an experiment. Such prospective power analysis can best be studied by means of a statistical simulation model. This paper describes a general framework for simulating data typically encountered in environmental risk assessment of genetically modified plants. The simulation model, available as Supplementary Material, can be used to generate count data having different statistical distributions possibly with excess-zeros. In addition the model employs completely randomized or randomized block experiments, can be used to simulate single or multiple trials across environments, enables genotype by environment interaction by adding random variety effects, and finally includes repeated measures in time following a constant, linear or quadratic pattern in time possibly with some form of autocorrelation. The model also allows to add a set of reference varieties to the GM plants and its comparator to assess the natural variation which can then be used to set limits of concern for equivalence testing. The different count distributions are described in some detail and some examples of how to use the simulation model to study various aspects, including a prospective power analysis, are provided.
Goedhart, Paul W; van der Voet, Hilko; Baldacchino, Ferdinando; Arpaia, Salvatore
2014-01-01
Genetic modification of plants may result in unintended effects causing potentially adverse effects on the environment. A comparative safety assessment is therefore required by authorities, such as the European Food Safety Authority, in which the genetically modified plant is compared with its conventional counterpart. Part of the environmental risk assessment is a comparative field experiment in which the effect on non-target organisms is compared. Statistical analysis of such trials come in two flavors: difference testing and equivalence testing. It is important to know the statistical properties of these, for example, the power to detect environmental change of a given magnitude, before the start of an experiment. Such prospective power analysis can best be studied by means of a statistical simulation model. This paper describes a general framework for simulating data typically encountered in environmental risk assessment of genetically modified plants. The simulation model, available as Supplementary Material, can be used to generate count data having different statistical distributions possibly with excess-zeros. In addition the model employs completely randomized or randomized block experiments, can be used to simulate single or multiple trials across environments, enables genotype by environment interaction by adding random variety effects, and finally includes repeated measures in time following a constant, linear or quadratic pattern in time possibly with some form of autocorrelation. The model also allows to add a set of reference varieties to the GM plants and its comparator to assess the natural variation which can then be used to set limits of concern for equivalence testing. The different count distributions are described in some detail and some examples of how to use the simulation model to study various aspects, including a prospective power analysis, are provided. PMID:24834325
Diffusion Processes Satisfying a Conservation Law Constraint
Bakosi, J.; Ristorcelli, J. R.
2014-03-04
We investigate coupled stochastic differential equations governing N non-negative continuous random variables that satisfy a conservation principle. In various fields a conservation law requires that a set of fluctuating variables be non-negative and (if appropriately normalized) sum to one. As a result, any stochastic differential equation model to be realizable must not produce events outside of the allowed sample space. We develop a set of constraints on the drift and diffusion terms of such stochastic models to ensure that both the non-negativity and the unit-sum conservation law constraint are satisfied as the variables evolve in time. We investigate the consequencesmore » of the developed constraints on the Fokker-Planck equation, the associated system of stochastic differential equations, and the evolution equations of the first four moments of the probability density function. We show that random variables, satisfying a conservation law constraint, represented by stochastic diffusion processes, must have diffusion terms that are coupled and nonlinear. The set of constraints developed enables the development of statistical representations of fluctuating variables satisfying a conservation law. We exemplify the results with the bivariate beta process and the multivariate Wright-Fisher, Dirichlet, and Lochner’s generalized Dirichlet processes.« less
Diffusion Processes Satisfying a Conservation Law Constraint
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bakosi, J.; Ristorcelli, J. R.
We investigate coupled stochastic differential equations governing N non-negative continuous random variables that satisfy a conservation principle. In various fields a conservation law requires that a set of fluctuating variables be non-negative and (if appropriately normalized) sum to one. As a result, any stochastic differential equation model to be realizable must not produce events outside of the allowed sample space. We develop a set of constraints on the drift and diffusion terms of such stochastic models to ensure that both the non-negativity and the unit-sum conservation law constraint are satisfied as the variables evolve in time. We investigate the consequencesmore » of the developed constraints on the Fokker-Planck equation, the associated system of stochastic differential equations, and the evolution equations of the first four moments of the probability density function. We show that random variables, satisfying a conservation law constraint, represented by stochastic diffusion processes, must have diffusion terms that are coupled and nonlinear. The set of constraints developed enables the development of statistical representations of fluctuating variables satisfying a conservation law. We exemplify the results with the bivariate beta process and the multivariate Wright-Fisher, Dirichlet, and Lochner’s generalized Dirichlet processes.« less
Application of Ensemble Detection and Analysis to Modeling Uncertainty in Non Stationary Process
NASA Technical Reports Server (NTRS)
Racette, Paul
2010-01-01
Characterization of non stationary and nonlinear processes is a challenge in many engineering and scientific disciplines. Climate change modeling and projection, retrieving information from Doppler measurements of hydrometeors, and modeling calibration architectures and algorithms in microwave radiometers are example applications that can benefit from improvements in the modeling and analysis of non stationary processes. Analyses of measured signals have traditionally been limited to a single measurement series. Ensemble Detection is a technique whereby mixing calibrated noise produces an ensemble measurement set. The collection of ensemble data sets enables new methods for analyzing random signals and offers powerful new approaches to studying and analyzing non stationary processes. Derived information contained in the dynamic stochastic moments of a process will enable many novel applications.
2016-01-01
Background As more and more researchers are turning to big data for new opportunities of biomedical discoveries, machine learning models, as the backbone of big data analysis, are mentioned more often in biomedical journals. However, owing to the inherent complexity of machine learning methods, they are prone to misuse. Because of the flexibility in specifying machine learning models, the results are often insufficiently reported in research articles, hindering reliable assessment of model validity and consistent interpretation of model outputs. Objective To attain a set of guidelines on the use of machine learning predictive models within clinical settings to make sure the models are correctly applied and sufficiently reported so that true discoveries can be distinguished from random coincidence. Methods A multidisciplinary panel of machine learning experts, clinicians, and traditional statisticians were interviewed, using an iterative process in accordance with the Delphi method. Results The process produced a set of guidelines that consists of (1) a list of reporting items to be included in a research article and (2) a set of practical sequential steps for developing predictive models. Conclusions A set of guidelines was generated to enable correct application of machine learning models and consistent reporting of model specifications and results in biomedical research. We believe that such guidelines will accelerate the adoption of big data analysis, particularly with machine learning methods, in the biomedical research community. PMID:27986644
Baseline adjustments for binary data in repeated cross-sectional cluster randomized trials.
Nixon, R M; Thompson, S G
2003-09-15
Analysis of covariance models, which adjust for a baseline covariate, are often used to compare treatment groups in a controlled trial in which individuals are randomized. Such analysis adjusts for any baseline imbalance and usually increases the precision of the treatment effect estimate. We assess the value of such adjustments in the context of a cluster randomized trial with repeated cross-sectional design and a binary outcome. In such a design, a new sample of individuals is taken from the clusters at each measurement occasion, so that baseline adjustment has to be at the cluster level. Logistic regression models are used to analyse the data, with cluster level random effects to allow for different outcome probabilities in each cluster. We compare the estimated treatment effect and its precision in models that incorporate a covariate measuring the cluster level probabilities at baseline and those that do not. In two data sets, taken from a cluster randomized trial in the treatment of menorrhagia, the value of baseline adjustment is only evident when the number of subjects per cluster is large. We assess the generalizability of these findings by undertaking a simulation study, and find that increased precision of the treatment effect requires both large cluster sizes and substantial heterogeneity between clusters at baseline, but baseline imbalance arising by chance in a randomized study can always be effectively adjusted for. Copyright 2003 John Wiley & Sons, Ltd.
Steel, Mike
2012-10-01
Neutral macroevolutionary models, such as the Yule model, give rise to a probability distribution on the set of discrete rooted binary trees over a given leaf set. Such models can provide a signal as to the approximate location of the root when only the unrooted phylogenetic tree is known, and this signal becomes relatively more significant as the number of leaves grows. In this short note, we show that among models that treat all taxa equally, and are sampling consistent (i.e. the distribution on trees is not affected by taxa yet to be included), all such models, except one (the so-called PDA model), convey some information as to the location of the ancestral root in an unrooted tree. Copyright © 2012 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Blanc-Benon, Philippe; Lipkens, Bart; Dallois, Laurent; Hamilton, Mark F.; Blackstock, David T.
2002-01-01
Sonic boom propagation can be affected by atmospheric turbulence. It has been shown that turbulence affects the perceived loudness of sonic booms, mainly by changing its peak pressure and rise time. The models reported here describe the nonlinear propagation of sound through turbulence. Turbulence is modeled as a set of individual realizations of a random temperature or velocity field. In the first model, linear geometrical acoustics is used to trace rays through each realization of the turbulent field. A nonlinear transport equation is then derived along each eigenray connecting the source and receiver. The transport equation is solved by a Pestorius algorithm. In the second model, the KZK equation is modified to account for the effect of a random temperature field and it is then solved numerically. Results from numerical experiments that simulate the propagation of spark-produced N waves through turbulence are presented. It is observed that turbulence decreases, on average, the peak pressure of the N waves and increases the rise time. Nonlinear distortion is less when turbulence is present than without it. The effects of random vector fields are stronger than those of random temperature fields. The location of the caustics and the deformation of the wave front are also presented. These observations confirm the results from the model experiment in which spark-produced N waves are used to simulate sonic boom propagation through a turbulent atmosphere.
Blanc-Benon, Philippe; Lipkens, Bart; Dallois, Laurent; Hamilton, Mark F; Blackstock, David T
2002-01-01
Sonic boom propagation can be affected by atmospheric turbulence. It has been shown that turbulence affects the perceived loudness of sonic booms, mainly by changing its peak pressure and rise time. The models reported here describe the nonlinear propagation of sound through turbulence. Turbulence is modeled as a set of individual realizations of a random temperature or velocity field. In the first model, linear geometrical acoustics is used to trace rays through each realization of the turbulent field. A nonlinear transport equation is then derived along each eigenray connecting the source and receiver. The transport equation is solved by a Pestorius algorithm. In the second model, the KZK equation is modified to account for the effect of a random temperature field and it is then solved numerically. Results from numerical experiments that simulate the propagation of spark-produced N waves through turbulence are presented. It is observed that turbulence decreases, on average, the peak pressure of the N waves and increases the rise time. Nonlinear distortion is less when turbulence is present than without it. The effects of random vector fields are stronger than those of random temperature fields. The location of the caustics and the deformation of the wave front are also presented. These observations confirm the results from the model experiment in which spark-produced N waves are used to simulate sonic boom propagation through a turbulent atmosphere.
Beauvais, Francis
2013-04-01
The randomized controlled trial (RCT) is the 'gold standard' of modern clinical pharmacology. However, for many practitioners of homeopathy, blind RCTs are an inadequate research tool for testing complex therapies such as homeopathy. Classical probabilities used in biological sciences and in medicine are only a special case of the generalized theory of probability used in quantum physics. I describe homeopathy trials using a quantum-like statistical model, a model inspired by quantum physics and taking into consideration superposition of states, non-commuting observables, probability interferences, contextuality, etc. The negative effect of blinding on success of homeopathy trials and the 'smearing effect' ('specific' effects of homeopathy medicine occurring in the placebo group) are described by quantum-like probabilities without supplementary ad hoc hypotheses. The difference of positive outcome rates between placebo and homeopathy groups frequently vanish in centralized blind trials. The model proposed here suggests a way to circumvent such problems in masked homeopathy trials by incorporating in situ randomization/unblinding. In this quantum-like model of homeopathy clinical trials, success in open-label setting and failure with centralized blind RCTs emerge logically from the formalism. This model suggests that significant differences between placebo and homeopathy in blind RCTs would be found more frequently if in situ randomization/unblinding was used. Copyright © 2013. Published by Elsevier Ltd.
A feature-based developmental model of the infant brain in structural MRI.
Toews, Matthew; Wells, William M; Zöllei, Lilla
2012-01-01
In this paper, anatomical development is modeled as a collection of distinctive image patterns localized in space and time. A Bayesian posterior probability is defined over a random variable of subject age, conditioned on data in the form of scale-invariant image features. The model is automatically learned from a large set of images exhibiting significant variation, used to discover anatomical structure related to age and development, and fit to new images to predict age. The model is applied to a set of 230 infant structural MRIs of 92 subjects acquired at multiple sites over an age range of 8-590 days. Experiments demonstrate that the model can be used to identify age-related anatomical structure, and to predict the age of new subjects with an average error of 72 days.
Epidemic spreading on complex networks with community structures
Stegehuis, Clara; van der Hofstad, Remco; van Leeuwaarden, Johan S. H.
2016-01-01
Many real-world networks display a community structure. We study two random graph models that create a network with similar community structure as a given network. One model preserves the exact community structure of the original network, while the other model only preserves the set of communities and the vertex degrees. These models show that community structure is an important determinant of the behavior of percolation processes on networks, such as information diffusion or virus spreading: the community structure can both enforce as well as inhibit diffusion processes. Our models further show that it is the mesoscopic set of communities that matters. The exact internal structures of communities barely influence the behavior of percolation processes across networks. This insensitivity is likely due to the relative denseness of the communities. PMID:27440176
Stamovlasis, Dimitrios; Tsaparlis, Georgios
2003-07-01
The present study examines the role of limited human channel capacity from a science education perspective. A model of science problem solving has been previously validated by applying concepts and tools of complexity theory (the working memory, random walk method). The method correlated the subjects' rank-order achievement scores in organic-synthesis chemistry problems with the subjects' working memory capacity. In this work, we apply the same nonlinear approach to a different data set, taken from chemical-equilibrium problem solving. In contrast to the organic-synthesis problems, these problems are algorithmic, require numerical calculations, and have a complex logical structure. As a result, these problems cause deviations from the model, and affect the pattern observed with the nonlinear method. In addition to Baddeley's working memory capacity, the Pascual-Leone's mental (M-) capacity is examined by the same random-walk method. As the complexity of the problem increases, the fractal dimension of the working memory random walk demonstrates a sudden drop, while the fractal dimension of the M-capacity random walk decreases in a linear fashion. A review of the basic features of the two capacities and their relation is included. The method and findings have consequences for problem solving not only in chemistry and science education, but also in other disciplines.
Training set selection for the prediction of essential genes.
Cheng, Jian; Xu, Zhao; Wu, Wenwu; Zhao, Li; Li, Xiangchen; Liu, Yanlin; Tao, Shiheng
2014-01-01
Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poorly-studied or newly sequenced organisms remains challenging. In this study, a machine learning approach was applied reciprocally to predict the essential genes in 21 microorganisms. Results showed that training set selection greatly influenced predictive accuracy. We determined four criteria for training set selection: (1) essential genes in the selected training set should be reliable; (2) the growth conditions in which essential genes are defined should be consistent in training and prediction sets; (3) species used as training set should be closely related to the target organism; and (4) organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles. We then analyzed the performance of an incomplete training set and an integrated training set with multiple organisms. We found that the size of the training set should be at least 10% of the total genes to yield accurate predictions. Additionally, the integrated training sets exhibited remarkable increase in stability and accuracy compared with single sets. Finally, we compared the performance of the integrated training sets with the four criteria and with random selection. The results revealed that a rational selection of training sets based on our criteria yields better performance than random selection. Thus, our results provide empirical guidance on training set selection for the identification of essential genes on a genome-wide scale.
Gottfredson, Nisha C; Bauer, Daniel J; Baldwin, Scott A; Okiishi, John C
2014-10-01
This study demonstrates how to use a shared parameter mixture model (SPMM) in longitudinal psychotherapy studies to accommodate missingness that is due to a correlation between rate of improvement and termination of therapy. Traditional growth models assume that such a relationship does not exist (i.e., assume that data are missing at random) and produce biased results if this assumption is incorrect. We used longitudinal data from 4,676 patients enrolled in a naturalistic study of psychotherapy to compare results from a latent growth model and an SPMM. In this data set, estimates of the rate of improvement during therapy differed by 6.50%-6.66% across the two models, indicating that participants with steeper trajectories left psychotherapy earliest, thereby potentially biasing inference for the slope in the latent growth model. We conclude that reported estimates of change during therapy may be underestimated in naturalistic studies of therapy in which participants and their therapists determine the end of treatment. Because non-randomly missing data can also occur in randomized controlled trials or in observational studies of development, the utility of the SPMM extends beyond naturalistic psychotherapy data. PsycINFO Database Record (c) 2014 APA, all rights reserved.
Model and parametric uncertainty in source-based kinematic models of earthquake ground motion
Hartzell, Stephen; Frankel, Arthur; Liu, Pengcheng; Zeng, Yuehua; Rahman, Shariftur
2011-01-01
Four independent ground-motion simulation codes are used to model the strong ground motion for three earthquakes: 1994 Mw 6.7 Northridge, 1989 Mw 6.9 Loma Prieta, and 1999 Mw 7.5 Izmit. These 12 sets of synthetics are used to make estimates of the variability in ground-motion predictions. In addition, ground-motion predictions over a grid of sites are used to estimate parametric uncertainty for changes in rupture velocity. We find that the combined model uncertainty and random variability of the simulations is in the same range as the variability of regional empirical ground-motion data sets. The majority of the standard deviations lie between 0.5 and 0.7 natural-log units for response spectra and 0.5 and 0.8 for Fourier spectra. The estimate of model epistemic uncertainty, based on the different model predictions, lies between 0.2 and 0.4, which is about one-half of the estimates for the standard deviation of the combined model uncertainty and random variability. Parametric uncertainty, based on variation of just the average rupture velocity, is shown to be consistent in amplitude with previous estimates, showing percentage changes in ground motion from 50% to 300% when rupture velocity changes from 2.5 to 2.9 km/s. In addition, there is some evidence that mean biases can be reduced by averaging ground-motion estimates from different methods.
NASA Astrophysics Data System (ADS)
Sparaciari, Carlo; Paris, Matteo G. A.
2013-01-01
We address measurement schemes where certain observables Xk are chosen at random within a set of nondegenerate isospectral observables and then measured on repeated preparations of a physical system. Each observable has a probability zk to be measured, with ∑kzk=1, and the statistics of this generalized measurement is described by a positive operator-valued measure. This kind of scheme is referred to as quantum roulettes, since each observable Xk is chosen at random, e.g., according to the fluctuating value of an external parameter. Here we focus on quantum roulettes for qubits involving the measurements of Pauli matrices, and we explicitly evaluate their canonical Naimark extensions, i.e., their implementation as indirect measurements involving an interaction scheme with a probe system. We thus provide a concrete model to realize the roulette without destroying the signal state, which can be measured again after the measurement or can be transmitted. Finally, we apply our results to the description of Stern-Gerlach-like experiments on a two-level system.
Random catalytic reaction networks
NASA Astrophysics Data System (ADS)
Stadler, Peter F.; Fontana, Walter; Miller, John H.
1993-03-01
We study networks that are a generalization of replicator (or Lotka-Volterra) equations. They model the dynamics of a population of object types whose binary interactions determine the specific type of interaction product. Such a system always reduces its dimension to a subset that contains production pathways for all of its members. The network equation can be rewritten at a level of collectives in terms of two basic interaction patterns: replicator sets and cyclic transformation pathways among sets. Although the system contains well-known cases that exhibit very complicated dynamics, the generic behavior of randomly generated systems is found (numerically) to be extremely robust: convergence to a globally stable rest point. It is easy to tailor networks that display replicator interactions where the replicators are entire self-sustaining subsystems, rather than structureless units. A numerical scan of random systems highlights the special properties of elementary replicators: they reduce the effective interconnectedness of the system, resulting in enhanced competition, and strong correlations between the concentrations.
Wu, Wei; Chen, Albert Y C; Zhao, Liang; Corso, Jason J
2014-03-01
Detection and segmentation of a brain tumor such as glioblastoma multiforme (GBM) in magnetic resonance (MR) images are often challenging due to its intrinsically heterogeneous signal characteristics. A robust segmentation method for brain tumor MRI scans was developed and tested. Simple thresholds and statistical methods are unable to adequately segment the various elements of the GBM, such as local contrast enhancement, necrosis, and edema. Most voxel-based methods cannot achieve satisfactory results in larger data sets, and the methods based on generative or discriminative models have intrinsic limitations during application, such as small sample set learning and transfer. A new method was developed to overcome these challenges. Multimodal MR images are segmented into superpixels using algorithms to alleviate the sampling issue and to improve the sample representativeness. Next, features were extracted from the superpixels using multi-level Gabor wavelet filters. Based on the features, a support vector machine (SVM) model and an affinity metric model for tumors were trained to overcome the limitations of previous generative models. Based on the output of the SVM and spatial affinity models, conditional random fields theory was applied to segment the tumor in a maximum a posteriori fashion given the smoothness prior defined by our affinity model. Finally, labeling noise was removed using "structural knowledge" such as the symmetrical and continuous characteristics of the tumor in spatial domain. The system was evaluated with 20 GBM cases and the BraTS challenge data set. Dice coefficients were computed, and the results were highly consistent with those reported by Zikic et al. (MICCAI 2012, Lecture notes in computer science. vol 7512, pp 369-376, 2012). A brain tumor segmentation method using model-aware affinity demonstrates comparable performance with other state-of-the art algorithms.
Meinich Petersen, Sandra; Zoffmann, Vibeke; Kjærgaard, Jesper; Graff Stensballe, Lone; Graff Steensballe, Lone; Greisen, Gorm
2014-04-15
When a child participates in a clinical trial, informed consent has to be given by the parents. Parental motives for participation are complex, but the hope of getting a new and better treatment for the child is important. We wondered how parents react when their child is allocated to the control group of a randomized controlled trial, and how it will affect their future engagement in the trial. We included parents of newborns randomized to the control arm in the Danish Calmette study at Rigshospitalet in Copenhagen. The Calmette study is a randomized clinical trial investigating the non-specific effects of early BCG-vaccine to healthy neonates. Randomization is performed immediately after birth and parents are not blinded to the allocation. We set up a semi-structured focus group with six parents from four families. Afterwards we telephone-interviewed another 19 mothers to achieve saturation. Thematic analysis was used to identify themes across the data sets. The parents reported good understanding of the randomization process. Their most common reaction to allocation was disappointment, though relief was also seen. A model of reactions to being allocated to the control group was developed based on the participants' different positions along two continuities from 'Our participation in trial is not important' to 'Our participation in trial is important', and 'Vaccine not important to us' to 'Vaccine important to us'. Four very disappointed families had thought of getting the vaccine elsewhere, and one had actually had their child vaccinated. All parents involved in the focus group and the telephone interviews wanted to participate in the follow-ups planned for the Calmette study. This study identified an almost universal experience of disappointment among parents of newborns who were randomized to the control group, but also a broad expression of understanding and accepting the idea of randomization. The trial staff might use the model of reactions in understanding the parents' disappointment and in this way support their motives for participation. A generalized version might be applicable across randomized controlled trials at large. The Calmette study is registered in EudraCT (https://eudract.ema.europa.eu/) with trial number 2010-021979-85.
Mathematical models of cell factories: moving towards the core of industrial biotechnology.
Cvijovic, Marija; Bordel, Sergio; Nielsen, Jens
2011-09-01
Industrial biotechnology involves the utilization of cell factories for the production of fuels and chemicals. Traditionally, the development of highly productive microbial strains has relied on random mutagenesis and screening. The development of predictive mathematical models provides a new paradigm for the rational design of cell factories. Instead of selecting among a set of strains resulting from random mutagenesis, mathematical models allow the researchers to predict in silico the outcomes of different genetic manipulations and engineer new strains by performing gene deletions or additions leading to a higher productivity of the desired chemicals. In this review we aim to summarize the main modelling approaches of biological processes and illustrate the particular applications that they have found in the field of industrial microbiology. © 2010 The Authors. Journal compilation © 2010 Society for Applied Microbiology and Blackwell Publishing Ltd.
Dishman, Rod K; Vandenberg, Robert J; Motl, Robert W; Wilson, Mark G; DeJoy, David M
2010-08-01
The effectiveness of an intervention depends on its dose and on moderators of dose, which usually are not studied. The purpose of the study is to determine whether goal setting and theory-based moderators of goal setting had dose relations with increases in goal-related physical activity during a successful workplace intervention. A group-randomized 12-week intervention that included personal goal setting was implemented in fall 2005, with a multiracial/ethnic sample of employees at 16 geographically diverse worksites. Here, we examined dose-related variables in the cohort of participants (N = 664) from the 8 worksites randomized to the intervention. Participants in the intervention exceeded 9000 daily pedometer steps and 300 weekly minutes of moderate-to-vigorous physical activity (MVPA) during the last 6 weeks of the study, which approximated or exceeded current public health guidelines. Linear growth modeling indicated that participants who set higher goals and sustained higher levels of self-efficacy, commitment and intention about attaining their goals had greater increases in pedometer steps and MVPA. The relation between change in participants' satisfaction with current physical activity and increases in physical activity was mediated by increases in self-set goals. The results show a dose relation of increased physical activity with changes in goal setting, satisfaction, self-efficacy, commitment and intention, consistent with goal-setting theory.
Kahan, Brennan C
2016-12-13
Patient recruitment in clinical trials is often challenging, and as a result, many trials are stopped early due to insufficient recruitment. The re-randomization design allows patients to be re-enrolled and re-randomized for each new treatment episode that they experience. Because it allows multiple enrollments for each patient, this design has been proposed as a way to increase the recruitment rate in clinical trials. However, it is unknown to what extent recruitment could be increased in practice. We modelled the expected recruitment rate for parallel-group and re-randomization trials in different settings based on estimates from real trials and datasets. We considered three clinical areas: in vitro fertilization, severe asthma exacerbations, and acute sickle cell pain crises. We compared the two designs in terms of the expected time to complete recruitment, and the sample size recruited over a fixed recruitment period. Across the different scenarios we considered, we estimated that re-randomization could reduce the expected time to complete recruitment by between 4 and 22 months (relative reductions of 19% and 45%), or increase the sample size recruited over a fixed recruitment period by between 29% and 171%. Re-randomization can increase recruitment most for trials with a short follow-up period, a long trial recruitment duration, and patients with high rates of treatment episodes. Re-randomization has the potential to increase the recruitment rate in certain settings, and could lead to quicker and more efficient trials in these scenarios.
Kaitaniemi, Pekka
2008-04-09
Allometric equations are widely used in many branches of biological science. The potential information content of the normalization constant b in allometric equations of the form Y = bX(a) has, however, remained largely neglected. To demonstrate the potential for utilizing this information, I generated a large number of artificial datasets that resembled those that are frequently encountered in biological studies, i.e., relatively small samples including measurement error or uncontrolled variation. The value of X was allowed to vary randomly within the limits describing different data ranges, and a was set to a fixed theoretical value. The constant b was set to a range of values describing the effect of a continuous environmental variable. In addition, a normally distributed random error was added to the values of both X and Y. Two different approaches were then used to model the data. The traditional approach estimated both a and b using a regression model, whereas an alternative approach set the exponent a at its theoretical value and only estimated the value of b. Both approaches produced virtually the same model fit with less than 0.3% difference in the coefficient of determination. Only the alternative approach was able to precisely reproduce the effect of the environmental variable, which was largely lost among noise variation when using the traditional approach. The results show how the value of b can be used as a source of valuable biological information if an appropriate regression model is selected.
Time series, correlation matrices and random matrix models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vinayak; Seligman, Thomas H.
2014-01-08
In this set of five lectures the authors have presented techniques to analyze open classical and quantum systems using correlation matrices. For diverse reasons we shall see that random matrices play an important role to describe a null hypothesis or a minimum information hypothesis for the description of a quantum system or subsystem. In the former case various forms of correlation matrices of time series associated with the classical observables of some system. The fact that such series are necessarily finite, inevitably introduces noise and this finite time influence lead to a random or stochastic component in these time series.more » By consequence random correlation matrices have a random component, and corresponding ensembles are used. In the latter we use random matrices to describe high temperature environment or uncontrolled perturbations, ensembles of differing chaotic systems etc. The common theme of the lectures is thus the importance of random matrix theory in a wide range of fields in and around physics.« less
Dietary interventions to prevent and manage diabetes in worksite settings: a meta-analysis
Shrestha, Archana; Karmacharya, Biraj Man; Khudyakov, Polyna; Weber, Mary Beth; Spiegelman, Donna
2017-01-01
Objectives: The translation of lifestyle intervention to improve glucose tolerance into the workplace has been rare. The objective of this meta-analysis is to summarize the evidence for the effectiveness of dietary interventions in worksite settings on lowering blood sugar levels. Methods: We searched for studies in PubMed, Embase, Econlit, Ovid, Cochrane, Web of Science, and Cumulative Index to Nursing and Allied Health Literature. Search terms were as follows: (1) Exposure-based: nutrition/diet/dietary intervention/health promotion/primary prevention/health behavior/health education/food /program evaluation; (2) Outcome-based: diabetes/hyperglycemia/glucose/HbA1c/glycated hemoglobin; and (3) Setting-based: workplace/worksite/occupational/industry/job/employee. We manually searched review articles and reference lists of articles identified from 1969 to December 2016. We tested for between-studies heterogeneity and calculated the pooled effect sizes for changes in HbA1c (%) and fasting glucose (mg/dl) using random effect models for meta-analysis in 2016. Results: A total of 17 articles out of 1663 initially selected articles were included in the meta-analysis. With a random-effects model, worksite dietary interventions led to a pooled -0.18% (95% CI, -0.29 to -0.06; P<0.001) difference in HbA1c. With the random-effects model, the interventions resulted in 2.60 mg/dl lower fasting glucose with borderline significance (95% CI: -5.27 to 0.08, P=0.06). In the multivariate meta-regression model, the interventions with high percent of female participants and that used the intervention directly delivered to individuals, rather the environment changes, were associated with more effective interventions. Conclusion: Workplace dietary interventions can improve HbA1c. The effects were larger for the interventions with greater number of female participants and with individual-level interventions. PMID:29187673
Subjective randomness as statistical inference.
Griffiths, Thomas L; Daniels, Dylan; Austerweil, Joseph L; Tenenbaum, Joshua B
2018-06-01
Some events seem more random than others. For example, when tossing a coin, a sequence of eight heads in a row does not seem very random. Where do these intuitions about randomness come from? We argue that subjective randomness can be understood as the result of a statistical inference assessing the evidence that an event provides for having been produced by a random generating process. We show how this account provides a link to previous work relating randomness to algorithmic complexity, in which random events are those that cannot be described by short computer programs. Algorithmic complexity is both incomputable and too general to capture the regularities that people can recognize, but viewing randomness as statistical inference provides two paths to addressing these problems: considering regularities generated by simpler computing machines, and restricting the set of probability distributions that characterize regularity. Building on previous work exploring these different routes to a more restricted notion of randomness, we define strong quantitative models of human randomness judgments that apply not just to binary sequences - which have been the focus of much of the previous work on subjective randomness - but also to binary matrices and spatial clustering. Copyright © 2018 Elsevier Inc. All rights reserved.
The variability of software scoring of the CDMAM phantom associated with a limited number of images
NASA Astrophysics Data System (ADS)
Yang, Chang-Ying J.; Van Metter, Richard
2007-03-01
Software scoring approaches provide an attractive alternative to human evaluation of CDMAM images from digital mammography systems, particularly for annual quality control testing as recommended by the European Protocol for the Quality Control of the Physical and Technical Aspects of Mammography Screening (EPQCM). Methods for correlating CDCOM-based results with human observer performance have been proposed. A common feature of all methods is the use of a small number (at most eight) of CDMAM images to evaluate the system. This study focuses on the potential variability in the estimated system performance that is associated with these methods. Sets of 36 CDMAM images were acquired under carefully controlled conditions from three different digital mammography systems. The threshold visibility thickness (TVT) for each disk diameter was determined using previously reported post-analysis methods from the CDCOM scorings for a randomly selected group of eight images for one measurement trial. This random selection process was repeated 3000 times to estimate the variability in the resulting TVT values for each disk diameter. The results from using different post-analysis methods, different random selection strategies and different digital systems were compared. Additional variability of the 0.1 mm disk diameter was explored by comparing the results from two different image data sets acquired under the same conditions from the same system. The magnitude and the type of error estimated for experimental data was explained through modeling. The modeled results also suggest a limitation in the current phantom design for the 0.1 mm diameter disks. Through modeling, it was also found that, because of the binomial statistic nature of the CDMAM test, the true variability of the test could be underestimated by the commonly used method of random re-sampling.
Causal mediation analysis for longitudinal data with exogenous exposure.
Bind, M-A C; Vanderweele, T J; Coull, B A; Schwartz, J D
2016-01-01
Mediation analysis is a valuable approach to examine pathways in epidemiological research. Prospective cohort studies are often conducted to study biological mechanisms and often collect longitudinal measurements on each participant. Mediation formulae for longitudinal data have been developed. Here, we formalize the natural direct and indirect effects using a causal framework with potential outcomes that allows for an interaction between the exposure and the mediator. To allow different types of longitudinal measures of the mediator and outcome, we assume two generalized mixed-effects models for both the mediator and the outcome. The model for the mediator has subject-specific random intercepts and random exposure slopes for each cluster, and the outcome model has random intercepts and random slopes for the exposure, the mediator, and their interaction. We also expand our approach to settings with multiple mediators and derive the mediated effects, jointly through all mediators. Our method requires the absence of time-varying confounding with respect to the exposure and the mediator. This assumption is achieved in settings with exogenous exposure and mediator, especially when exposure and mediator are not affected by variables measured at earlier time points. We apply the methodology to data from the Normative Aging Study and estimate the direct and indirect effects, via DNA methylation, of air pollution, and temperature on intercellular adhesion molecule 1 (ICAM-1) protein levels. Our results suggest that air pollution and temperature have a direct effect on ICAM-1 protein levels (i.e. not through a change in ICAM-1 DNA methylation) and that temperature has an indirect effect via a change in ICAM-1 DNA methylation. © The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Applications of Derandomization Theory in Coding
NASA Astrophysics Data System (ADS)
Cheraghchi, Mahdi
2011-07-01
Randomized techniques play a fundamental role in theoretical computer science and discrete mathematics, in particular for the design of efficient algorithms and construction of combinatorial objects. The basic goal in derandomization theory is to eliminate or reduce the need for randomness in such randomized constructions. In this thesis, we explore some applications of the fundamental notions in derandomization theory to problems outside the core of theoretical computer science, and in particular, certain problems related to coding theory. First, we consider the wiretap channel problem which involves a communication system in which an intruder can eavesdrop a limited portion of the transmissions, and construct efficient and information-theoretically optimal communication protocols for this model. Then we consider the combinatorial group testing problem. In this classical problem, one aims to determine a set of defective items within a large population by asking a number of queries, where each query reveals whether a defective item is present within a specified group of items. We use randomness condensers to explicitly construct optimal, or nearly optimal, group testing schemes for a setting where the query outcomes can be highly unreliable, as well as the threshold model where a query returns positive if the number of defectives pass a certain threshold. Finally, we design ensembles of error-correcting codes that achieve the information-theoretic capacity of a large class of communication channels, and then use the obtained ensembles for construction of explicit capacity achieving codes. [This is a shortened version of the actual abstract in the thesis.
Improving ensemble decision tree performance using Adaboost and Bagging
NASA Astrophysics Data System (ADS)
Hasan, Md. Rajib; Siraj, Fadzilah; Sainin, Mohd Shamrie
2015-12-01
Ensemble classifier systems are considered as one of the most promising in medical data classification and the performance of deceision tree classifier can be increased by the ensemble method as it is proven to be better than single classifiers. However, in a ensemble settings the performance depends on the selection of suitable base classifier. This research employed two prominent esemble s namely Adaboost and Bagging with base classifiers such as Random Forest, Random Tree, j48, j48grafts and Logistic Model Regression (LMT) that have been selected independently. The empirical study shows that the performance varries when different base classifiers are selected and even some places overfitting issue also been noted. The evidence shows that ensemble decision tree classfiers using Adaboost and Bagging improves the performance of selected medical data sets.
Loophole-free Bell test using electron spins in diamond: second experiment and additional analysis
Hensen, B.; Kalb, N.; Blok, M. S.; Dréau, A. E.; Reiserer, A.; Vermeulen, R. F. L.; Schouten, R. N.; Markham, M.; Twitchen, D. J.; Goodenough, K.; Elkouss, D.; Wehner, S.; Taminiau, T. H.; Hanson, R.
2016-01-01
The recently reported violation of a Bell inequality using entangled electronic spins in diamonds (Hensen et al., Nature 526, 682–686) provided the first loophole-free evidence against local-realist theories of nature. Here we report on data from a second Bell experiment using the same experimental setup with minor modifications. We find a violation of the CHSH-Bell inequality of 2.35 ± 0.18, in agreement with the first run, yielding an overall value of S = 2.38 ± 0.14. We calculate the resulting P-values of the second experiment and of the combined Bell tests. We provide an additional analysis of the distribution of settings choices recorded during the two tests, finding that the observed distributions are consistent with uniform settings for both tests. Finally, we analytically study the effect of particular models of random number generator (RNG) imperfection on our hypothesis test. We find that the winning probability per trial in the CHSH game can be bounded knowing only the mean of the RNG bias. This implies that our experimental result is robust for any model underlying the estimated average RNG bias, for random bits produced up to 690 ns too early by the random number generator. PMID:27509823
Clark, Nicholas J; Wells, Konstans; Lindberg, Oscar
2018-05-16
Inferring interactions between co-occurring species is key to identify processes governing community assembly. Incorporating interspecific interactions in predictive models is common in ecology, yet most methods do not adequately account for indirect interactions (where an interaction between two species is masked by their shared interactions with a third) and assume interactions do not vary along environmental gradients. Markov random fields (MRF) overcome these limitations by estimating interspecific interactions, while controlling for indirect interactions, from multispecies occurrence data. We illustrate the utility of MRFs for ecologists interested in interspecific interactions, and demonstrate how covariates can be included (a set of models known as Conditional Random Fields, CRF) to infer how interactions vary along environmental gradients. We apply CRFs to two data sets of presence-absence data. The first illustrates how blood parasite (Haemoproteus, Plasmodium, and nematode microfilaria spp.) co-infection probabilities covary with relative abundance of their avian hosts. The second shows that co-occurrences between mosquito larvae and predatory insects vary along water temperature gradients. Other applications are discussed, including the potential to identify replacement or shifting impacts of highly connected species along climate or land-use gradients. We provide tools for building CRFs and plotting/interpreting results as an R package. © 2018 by the Ecological Society of America.
Vehicle track segmentation using higher order random fields
Quach, Tu -Thach
2017-01-09
Here, we present an approach to segment vehicle tracks in coherent change detection images, a product of combining two synthetic aperture radar images taken at different times. The approach uses multiscale higher order random field models to capture track statistics, such as curvatures and their parallel nature, that are not currently utilized in existing methods. These statistics are encoded as 3-by-3 patterns at different scales. The model can complete disconnected tracks often caused by sensor noise and various environmental effects. Coupling the model with a simple classifier, our approach is effective at segmenting salient tracks. We improve the F-measure onmore » a standard vehicle track data set to 0.963, up from 0.897 obtained by the current state-of-the-art method.« less
Vehicle track segmentation using higher order random fields
DOE Office of Scientific and Technical Information (OSTI.GOV)
Quach, Tu -Thach
Here, we present an approach to segment vehicle tracks in coherent change detection images, a product of combining two synthetic aperture radar images taken at different times. The approach uses multiscale higher order random field models to capture track statistics, such as curvatures and their parallel nature, that are not currently utilized in existing methods. These statistics are encoded as 3-by-3 patterns at different scales. The model can complete disconnected tracks often caused by sensor noise and various environmental effects. Coupling the model with a simple classifier, our approach is effective at segmenting salient tracks. We improve the F-measure onmore » a standard vehicle track data set to 0.963, up from 0.897 obtained by the current state-of-the-art method.« less
Random walks with shape prior for cochlea segmentation in ex vivo μCT.
Ruiz Pujadas, Esmeralda; Kjer, Hans Martin; Piella, Gemma; Ceresa, Mario; González Ballester, Miguel Angel
2016-09-01
Cochlear implantation is a safe and effective surgical procedure to restore hearing in deaf patients. However, the level of restoration achieved may vary due to differences in anatomy, implant type and surgical access. In order to reduce the variability of the surgical outcomes, we previously proposed the use of a high-resolution model built from [Formula: see text] images and then adapted to patient-specific clinical CT scans. As the accuracy of the model is dependent on the precision of the original segmentation, it is extremely important to have accurate [Formula: see text] segmentation algorithms. We propose a new framework for cochlea segmentation in ex vivo [Formula: see text] images using random walks where a distance-based shape prior is combined with a region term estimated by a Gaussian mixture model. The prior is also weighted by a confidence map to adjust its influence according to the strength of the image contour. Random walks is performed iteratively, and the prior mask is aligned in every iteration. We tested the proposed approach in ten [Formula: see text] data sets and compared it with other random walks-based segmentation techniques such as guided random walks (Eslami et al. in Med Image Anal 17(2):236-253, 2013) and constrained random walks (Li et al. in Advances in image and video technology. Springer, Berlin, pp 215-226, 2012). Our approach demonstrated higher accuracy results due to the probability density model constituted by the region term and shape prior information weighed by a confidence map. The weighted combination of the distance-based shape prior with a region term into random walks provides accurate segmentations of the cochlea. The experiments suggest that the proposed approach is robust for cochlea segmentation.
Chen, Xuewu; Wei, Ming; Wu, Jingxian; Hou, Xianyao
2014-01-01
Most traditional mode choice models are based on the principle of random utility maximization derived from econometric theory. Alternatively, mode choice modeling can be regarded as a pattern recognition problem reflected from the explanatory variables of determining the choices between alternatives. The paper applies the knowledge discovery technique of rough sets theory to model travel mode choices incorporating household and individual sociodemographics and travel information, and to identify the significance of each attribute. The study uses the detailed travel diary survey data of Changxing county which contains information on both household and individual travel behaviors for model estimation and evaluation. The knowledge is presented in the form of easily understood IF-THEN statements or rules which reveal how each attribute influences mode choice behavior. These rules are then used to predict travel mode choices from information held about previously unseen individuals and the classification performance is assessed. The rough sets model shows high robustness and good predictive ability. The most significant condition attributes identified to determine travel mode choices are gender, distance, household annual income, and occupation. Comparative evaluation with the MNL model also proves that the rough sets model gives superior prediction accuracy and coverage on travel mode choice modeling. PMID:25431585
CADASTER QSPR Models for Predictions of Melting and Boiling Points of Perfluorinated Chemicals.
Bhhatarai, Barun; Teetz, Wolfram; Liu, Tao; Öberg, Tomas; Jeliazkova, Nina; Kochev, Nikolay; Pukalov, Ognyan; Tetko, Igor V; Kovarich, Simona; Papa, Ester; Gramatica, Paola
2011-03-14
Quantitative structure property relationship (QSPR) studies on per- and polyfluorinated chemicals (PFCs) on melting point (MP) and boiling point (BP) are presented. The training and prediction chemicals used for developing and validating the models were selected from Syracuse PhysProp database and literatures. The available experimental data sets were split in two different ways: a) random selection on response value, and b) structural similarity verified by self-organizing-map (SOM), in order to propose reliable predictive models, developed only on the training sets and externally verified on the prediction sets. Individual linear and non-linear approaches based models developed by different CADASTER partners on 0D-2D Dragon descriptors, E-state descriptors and fragment based descriptors as well as consensus model and their predictions are presented. In addition, the predictive performance of the developed models was verified on a blind external validation set (EV-set) prepared using PERFORCE database on 15 MP and 25 BP data respectively. This database contains only long chain perfluoro-alkylated chemicals, particularly monitored by regulatory agencies like US-EPA and EU-REACH. QSPR models with internal and external validation on two different external prediction/validation sets and study of applicability-domain highlighting the robustness and high accuracy of the models are discussed. Finally, MPs for additional 303 PFCs and BPs for 271 PFCs were predicted for which experimental measurements are unknown. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Fast Constrained Spectral Clustering and Cluster Ensemble with Random Projection
Liu, Wenfen
2017-01-01
Constrained spectral clustering (CSC) method can greatly improve the clustering accuracy with the incorporation of constraint information into spectral clustering and thus has been paid academic attention widely. In this paper, we propose a fast CSC algorithm via encoding landmark-based graph construction into a new CSC model and applying random sampling to decrease the data size after spectral embedding. Compared with the original model, the new algorithm has the similar results with the increase of its model size asymptotically; compared with the most efficient CSC algorithm known, the new algorithm runs faster and has a wider range of suitable data sets. Meanwhile, a scalable semisupervised cluster ensemble algorithm is also proposed via the combination of our fast CSC algorithm and dimensionality reduction with random projection in the process of spectral ensemble clustering. We demonstrate by presenting theoretical analysis and empirical results that the new cluster ensemble algorithm has advantages in terms of efficiency and effectiveness. Furthermore, the approximate preservation of random projection in clustering accuracy proved in the stage of consensus clustering is also suitable for the weighted k-means clustering and thus gives the theoretical guarantee to this special kind of k-means clustering where each point has its corresponding weight. PMID:29312447
Groves, Benjamin; Kuchina, Anna; Rosenberg, Alexander B.; Jojic, Nebojsa; Fields, Stanley; Seelig, Georg
2017-01-01
Our ability to predict protein expression from DNA sequence alone remains poor, reflecting our limited understanding of cis-regulatory grammar and hampering the design of engineered genes for synthetic biology applications. Here, we generate a model that predicts the protein expression of the 5′ untranslated region (UTR) of mRNAs in the yeast Saccharomyces cerevisiae. We constructed a library of half a million 50-nucleotide-long random 5′ UTRs and assayed their activity in a massively parallel growth selection experiment. The resulting data allow us to quantify the impact on protein expression of Kozak sequence composition, upstream open reading frames (uORFs), and secondary structure. We trained a convolutional neural network (CNN) on the random library and showed that it performs well at predicting the protein expression of both a held-out set of the random 5′ UTRs as well as native S. cerevisiae 5′ UTRs. The model additionally was used to computationally evolve highly active 5′ UTRs. We confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences, demonstrating the predictive power of this model. PMID:29097404
Jung, Ho-Won; El Emam, Khaled
2014-05-29
A linear programming (LP) model was proposed to create de-identified data sets that maximally include spatial detail (e.g., geocodes such as ZIP or postal codes, census blocks, and locations on maps) while complying with the HIPAA Privacy Rule's Expert Determination method, i.e., ensuring that the risk of re-identification is very small. The LP model determines the transition probability from an original location of a patient to a new randomized location. However, it has a limitation for the cases of areas with a small population (e.g., median of 10 people in a ZIP code). We extend the previous LP model to accommodate the cases of a smaller population in some locations, while creating de-identified patient spatial data sets which ensure the risk of re-identification is very small. Our LP model was applied to a data set of 11,740 postal codes in the City of Ottawa, Canada. On this data set we demonstrated the limitations of the previous LP model, in that it produces improbable results, and showed how our extensions to deal with small areas allows the de-identification of the whole data set. The LP model described in this study can be used to de-identify geospatial information for areas with small populations with minimal distortion to postal codes. Our LP model can be extended to include other information, such as age and gender.
Role of Statistical Random-Effects Linear Models in Personalized Medicine.
Diaz, Francisco J; Yeh, Hung-Wen; de Leon, Jose
2012-03-01
Some empirical studies and recent developments in pharmacokinetic theory suggest that statistical random-effects linear models are valuable tools that allow describing simultaneously patient populations as a whole and patients as individuals. This remarkable characteristic indicates that these models may be useful in the development of personalized medicine, which aims at finding treatment regimes that are appropriate for particular patients, not just appropriate for the average patient. In fact, published developments show that random-effects linear models may provide a solid theoretical framework for drug dosage individualization in chronic diseases. In particular, individualized dosages computed with these models by means of an empirical Bayesian approach may produce better results than dosages computed with some methods routinely used in therapeutic drug monitoring. This is further supported by published empirical and theoretical findings that show that random effects linear models may provide accurate representations of phase III and IV steady-state pharmacokinetic data, and may be useful for dosage computations. These models have applications in the design of clinical algorithms for drug dosage individualization in chronic diseases; in the computation of dose correction factors; computation of the minimum number of blood samples from a patient that are necessary for calculating an optimal individualized drug dosage in therapeutic drug monitoring; measure of the clinical importance of clinical, demographic, environmental or genetic covariates; study of drug-drug interactions in clinical settings; the implementation of computational tools for web-site-based evidence farming; design of pharmacogenomic studies; and in the development of a pharmacological theory of dosage individualization.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zentner, I.; Ferré, G., E-mail: gregoire.ferre@ponts.org; Poirion, F.
2016-06-01
In this paper, a new method for the identification and simulation of non-Gaussian and non-stationary stochastic fields given a database is proposed. It is based on two successive biorthogonal decompositions aiming at representing spatio–temporal stochastic fields. The proposed double expansion allows to build the model even in the case of large-size problems by separating the time, space and random parts of the field. A Gaussian kernel estimator is used to simulate the high dimensional set of random variables appearing in the decomposition. The capability of the method to reproduce the non-stationary and non-Gaussian features of random phenomena is illustrated bymore » applications to earthquakes (seismic ground motion) and sea states (wave heights).« less
Ren, Anna N; Neher, Robert E; Bell, Tyler; Grimm, James
2018-06-01
Preoperative planning is important to achieve successful implantation in primary total knee arthroplasty (TKA). However, traditional TKA templating techniques are not accurate enough to predict the component size to a very close range. With the goal of developing a general predictive statistical model using patient demographic information, ordinal logistic regression was applied to build a proportional odds model to predict the tibia component size. The study retrospectively collected the data of 1992 primary Persona Knee System TKA procedures. Of them, 199 procedures were randomly selected as testing data and the rest of the data were randomly partitioned between model training data and model evaluation data with a ratio of 7:3. Different models were trained and evaluated on the training and validation data sets after data exploration. The final model had patient gender, age, weight, and height as independent variables and predicted the tibia size within 1 size difference 96% of the time on the validation data, 94% of the time on the testing data, and 92% on a prospective cadaver data set. The study results indicated the statistical model built by ordinal logistic regression can increase the accuracy of tibia sizing information for Persona Knee preoperative templating. This research shows statistical modeling may be used with radiographs to dramatically enhance the templating accuracy, efficiency, and quality. In general, this methodology can be applied to other TKA products when the data are applicable. Copyright © 2018 Elsevier Inc. All rights reserved.
A stochastic Markov chain model to describe lung cancer growth and metastasis.
Newton, Paul K; Mason, Jeremy; Bethel, Kelly; Bazhenova, Lyudmila A; Nieva, Jorge; Kuhn, Peter
2012-01-01
A stochastic Markov chain model for metastatic progression is developed for primary lung cancer based on a network construction of metastatic sites with dynamics modeled as an ensemble of random walkers on the network. We calculate a transition matrix, with entries (transition probabilities) interpreted as random variables, and use it to construct a circular bi-directional network of primary and metastatic locations based on postmortem tissue analysis of 3827 autopsies on untreated patients documenting all primary tumor locations and metastatic sites from this population. The resulting 50 potential metastatic sites are connected by directed edges with distributed weightings, where the site connections and weightings are obtained by calculating the entries of an ensemble of transition matrices so that the steady-state distribution obtained from the long-time limit of the Markov chain dynamical system corresponds to the ensemble metastatic distribution obtained from the autopsy data set. We condition our search for a transition matrix on an initial distribution of metastatic tumors obtained from the data set. Through an iterative numerical search procedure, we adjust the entries of a sequence of approximations until a transition matrix with the correct steady-state is found (up to a numerical threshold). Since this constrained linear optimization problem is underdetermined, we characterize the statistical variance of the ensemble of transition matrices calculated using the means and variances of their singular value distributions as a diagnostic tool. We interpret the ensemble averaged transition probabilities as (approximately) normally distributed random variables. The model allows us to simulate and quantify disease progression pathways and timescales of progression from the lung position to other sites and we highlight several key findings based on the model.
Robust reliable sampled-data control for switched systems with application to flight control
NASA Astrophysics Data System (ADS)
Sakthivel, R.; Joby, Maya; Shi, P.; Mathiyalagan, K.
2016-11-01
This paper addresses the robust reliable stabilisation problem for a class of uncertain switched systems with random delays and norm bounded uncertainties. The main aim of this paper is to obtain the reliable robust sampled-data control design which involves random time delay with an appropriate gain control matrix for achieving the robust exponential stabilisation for uncertain switched system against actuator failures. In particular, the involved delays are assumed to be randomly time-varying which obeys certain mutually uncorrelated Bernoulli distributed white noise sequences. By constructing an appropriate Lyapunov-Krasovskii functional (LKF) and employing an average-dwell time approach, a new set of criteria is derived for ensuring the robust exponential stability of the closed-loop switched system. More precisely, the Schur complement and Jensen's integral inequality are used in derivation of stabilisation criteria. By considering the relationship among the random time-varying delay and its lower and upper bounds, a new set of sufficient condition is established for the existence of reliable robust sampled-data control in terms of solution to linear matrix inequalities (LMIs). Finally, an illustrative example based on the F-18 aircraft model is provided to show the effectiveness of the proposed design procedures.
A mixture model for bovine abortion and foetal survival.
Hanson, Timothy; Bedrick, Edward J; Johnson, Wesley O; Thurmond, Mark C
2003-05-30
The effect of spontaneous abortion on the dairy industry is substantial, costing the industry on the order of US dollars 200 million per year in California alone. We analyse data from a cohort study of nine dairy herds in Central California. A key feature of the analysis is the observation that only a relatively small proportion of cows will abort (around 10;15 per cent), so that it is inappropriate to analyse the time-to-abortion (TTA) data as if it were standard censored survival data, with cows that fail to abort by the end of the study treated as censored observations. We thus broaden the scope to consider the analysis of foetal lifetime distribution (FLD) data for the cows, with the dual goals of characterizing the effects of various risk factors on (i). the likelihood of abortion and, conditional on abortion status, on (ii). the risk of early versus late abortion. A single model is developed to accomplish both goals with two sets of specific herd effects modelled as random effects. Because multimodal foetal hazard functions are expected for the TTA data, both a parametric mixture model and a non-parametric model are developed. Furthermore, the two sets of analyses are linked because of anticipated dependence between the random herd effects. All modelling and inferences are accomplished using modern Bayesian methods. Copyright 2003 John Wiley & Sons, Ltd.
The valuation of the EQ-5D in Portugal.
Ferreira, Lara N; Ferreira, Pedro L; Pereira, Luis N; Oppe, Mark
2014-03-01
The EQ-5D is a preference-based measure widely used in cost-utility analysis (CUA). Several countries have conducted surveys to derive value sets, but this was not the case for Portugal. The purpose of this study was to estimate a value set for the EQ-5D for Portugal using the time trade-off (TTO). A representative sample of the Portuguese general population (n = 450) stratified by age and gender valued 24 health states. Face-to-face interviews were conducted by trained interviewers. Each respondent ranked and valued seven health states using the TTO. Several models were estimated at both the individual and aggregated levels to predict health state valuations. Alternative functional forms were considered to account for the skewed distribution of these valuations. The models were analyzed in terms of their coefficients, overall fit and the ability for predicting the TTO values. Random effects models were estimated using generalized least squares and were robust across model specification. The results are generally consistent with other value sets. This research provides the Portuguese EQ-5D value set based on the preferences of the Portuguese general population as measured by the TTO. This value set is recommended for use in CUA conducted in Portugal.
Protein Loop Structure Prediction Using Conformational Space Annealing.
Heo, Seungryong; Lee, Juyong; Joo, Keehyoung; Shin, Hang-Cheol; Lee, Jooyoung
2017-05-22
We have developed a protein loop structure prediction method by combining a new energy function, which we call E PLM (energy for protein loop modeling), with the conformational space annealing (CSA) global optimization algorithm. The energy function includes stereochemistry, dynamic fragment assembly, distance-scaled finite ideal gas reference (DFIRE), and generalized orientation- and distance-dependent terms. For the conformational search of loop structures, we used the CSA algorithm, which has been quite successful in dealing with various hard global optimization problems. We assessed the performance of E PLM with two widely used loop-decoy sets, Jacobson and RAPPER, and compared the results against the DFIRE potential. The accuracy of model selection from a pool of loop decoys as well as de novo loop modeling starting from randomly generated structures was examined separately. For the selection of a nativelike structure from a decoy set, E PLM was more accurate than DFIRE in the case of the Jacobson set and had similar accuracy in the case of the RAPPER set. In terms of sampling more nativelike loop structures, E PLM outperformed E DFIRE for both decoy sets. This new approach equipped with E PLM and CSA can serve as the state-of-the-art de novo loop modeling method.
Turbulence hierarchy in a random fibre laser
González, Iván R. Roa; Lima, Bismarck C.; Pincheira, Pablo I. R.; Brum, Arthur A.; Macêdo, Antônio M. S.; Vasconcelos, Giovani L.; de S. Menezes, Leonardo; Raposo, Ernesto P.; Gomes, Anderson S. L.; Kashyap, Raman
2017-01-01
Turbulence is a challenging feature common to a wide range of complex phenomena. Random fibre lasers are a special class of lasers in which the feedback arises from multiple scattering in a one-dimensional disordered cavity-less medium. Here we report on statistical signatures of turbulence in the distribution of intensity fluctuations in a continuous-wave-pumped erbium-based random fibre laser, with random Bragg grating scatterers. The distribution of intensity fluctuations in an extensive data set exhibits three qualitatively distinct behaviours: a Gaussian regime below threshold, a mixture of two distributions with exponentially decaying tails near the threshold and a mixture of distributions with stretched-exponential tails above threshold. All distributions are well described by a hierarchical stochastic model that incorporates Kolmogorov’s theory of turbulence, which includes energy cascade and the intermittence phenomenon. Our findings have implications for explaining the remarkably challenging turbulent behaviour in photonics, using a random fibre laser as the experimental platform. PMID:28561064
Quantifying randomness in real networks
NASA Astrophysics Data System (ADS)
Orsini, Chiara; Dankulov, Marija M.; Colomer-de-Simón, Pol; Jamakovic, Almerima; Mahadevan, Priya; Vahdat, Amin; Bassler, Kevin E.; Toroczkai, Zoltán; Boguñá, Marián; Caldarelli, Guido; Fortunato, Santo; Krioukov, Dmitri
2015-10-01
Represented as graphs, real networks are intricate combinations of order and disorder. Fixing some of the structural properties of network models to their values observed in real networks, many other properties appear as statistical consequences of these fixed observables, plus randomness in other respects. Here we employ the dk-series, a complete set of basic characteristics of the network structure, to study the statistical dependencies between different network properties. We consider six real networks--the Internet, US airport network, human protein interactions, technosocial web of trust, English word network, and an fMRI map of the human brain--and find that many important local and global structural properties of these networks are closely reproduced by dk-random graphs whose degree distributions, degree correlations and clustering are as in the corresponding real network. We discuss important conceptual, methodological, and practical implications of this evaluation of network randomness, and release software to generate dk-random graphs.
Nguyen, Thanh-Tung; Huang, Joshua; Wu, Qingyao; Nguyen, Thuy; Li, Mark
2015-01-01
Single-nucleotide polymorphisms (SNPs) selection and identification are the most important tasks in Genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of SNPs in the data is irrelevant to the disease. Advanced machine learning methods have been successfully used in Genome-wide association studies (GWAS) for identification of genetic variants that have relatively big effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite of performing well in terms of prediction accuracy in some data sets with moderate size, RF still suffers from working in GWAS for selecting informative SNPs and building accurate prediction models. In this paper, we propose to use a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection for GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs in two groups. The informative SNPs group is further divided into two sub-groups: highly informative and weak informative SNPs. When sampling the SNP subspace for building trees for the forest, only those SNPs from the two sub-groups are taken into account. The feature subspaces always contain highly informative SNPs when used to split a node at a tree. This approach enables one to generate more accurate trees with a lower prediction error, meanwhile possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of Genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprised of 408,803 SNPs and Alzheimer case-control data comprised of 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing the-state-of-the-art random forests. The top 25 SNPs in Parkinson data set were identified by the proposed model including four interesting genes associated with neurological disorders. The presented approach has shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail. The new RF works well for the data where the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experiment results demonstrated the effectiveness of the proposed RF model that outperformed the state-of-the-art RFs, including Breiman's RF, GRRF and wsRF methods.
Inquiry in the Physical Geology Classroom: Supporting Students' Conceptual Model Development
ERIC Educational Resources Information Center
Miller, Heather R.; McNeal, Karen S.; Herbert, Bruce E.
2010-01-01
This study characterizes the impact of an inquiry-based learning (IBL) module versus a traditionally structured laboratory exercise. Laboratory sections were randomized into experimental and control groups. The experimental group was taught using IBL pedagogical techniques and included manipulation of large-scale data-sets, use of multiple…
Ludescher, Josef; Bunde, Armin
2014-12-01
We consider representative financial records (stocks and indices) on time scales between one minute and one day, as well as historical monthly data sets, and show that the distribution P(Q)(r) of the interoccurrence times r between losses below a negative threshold -Q, for fixed mean interoccurrence times R(Q) in multiples of the corresponding time resolutions, can be described on all time scales by the same q exponentials, P(Q)(r)∝1/{[1+(q-1)βr](1/(q-1))}. We propose that the asset- and time-scale-independent analytic form of P(Q)(r) can be regarded as an additional stylized fact of the financial markets and represents a nontrivial test for market models. We analyze the distribution P(Q)(r) as well as the autocorrelation C(Q)(s) of the interoccurrence times for three market models: (i) multiplicative random cascades, (ii) multifractal random walks, and (iii) the generalized autoregressive conditional heteroskedasticity [GARCH(1,1)] model. We find that only one of the considered models, the multifractal random walk model, approximately reproduces the q-exponential form of P(Q)(r) and the power-law decay of C(Q)(s).
Randomizing growing networks with a time-respecting null model
NASA Astrophysics Data System (ADS)
Ren, Zhuo-Ming; Mariani, Manuel Sebastian; Zhang, Yi-Cheng; Medo, Matúš
2018-05-01
Complex networks are often used to represent systems that are not static but grow with time: People make new friendships, new papers are published and refer to the existing ones, and so forth. To assess the statistical significance of measurements made on such networks, we propose a randomization methodology—a time-respecting null model—that preserves both the network's degree sequence and the time evolution of individual nodes' degree values. By preserving the temporal linking patterns of the analyzed system, the proposed model is able to factor out the effect of the system's temporal patterns on its structure. We apply the model to the citation network of Physical Review scholarly papers and the citation network of US movies. The model reveals that the two data sets are strikingly different with respect to their degree-degree correlations, and we discuss the important implications of this finding on the information provided by paradigmatic node centrality metrics such as indegree and Google's PageRank. The randomization methodology proposed here can be used to assess the significance of any structural property in growing networks, which could bring new insights into the problems where null models play a critical role, such as the detection of communities and network motifs.
NASA Astrophysics Data System (ADS)
Ludescher, Josef; Bunde, Armin
2014-12-01
We consider representative financial records (stocks and indices) on time scales between one minute and one day, as well as historical monthly data sets, and show that the distribution PQ(r ) of the interoccurrence times r between losses below a negative threshold -Q , for fixed mean interoccurrence times RQ in multiples of the corresponding time resolutions, can be described on all time scales by the same q exponentials, PQ(r ) ∝1 /{[1+(q -1 ) β r ] 1 /(q -1 )} . We propose that the asset- and time-scale-independent analytic form of PQ(r ) can be regarded as an additional stylized fact of the financial markets and represents a nontrivial test for market models. We analyze the distribution PQ(r ) as well as the autocorrelation CQ(s ) of the interoccurrence times for three market models: (i) multiplicative random cascades, (ii) multifractal random walks, and (iii) the generalized autoregressive conditional heteroskedasticity [GARCH(1,1)] model. We find that only one of the considered models, the multifractal random walk model, approximately reproduces the q -exponential form of PQ(r ) and the power-law decay of CQ(s ) .
Modeling energetic and theoretical costs of thermoregulatory strategy.
Alford, John G; Lutterschmidt, William I
2012-01-01
Poikilothermic ectotherms have evolved behaviours that help them maintain or regulate their body temperature (T (b)) around a preferred or 'set point' temperature (T (set)). Thermoregulatory behaviors may range from body positioning to optimize heat gain to shuttling among preferred microhabitats to find appropriate environmental temperatures. We have modelled movement patterns between an active and non-active shuttling behaviour within a habitat (as a biased random walk) to investigate the potential cost of two thermoregulatory strategies. Generally, small-bodied ectotherms actively thermoregulate while large-bodied ectotherms may passively thermoconform to their environment. We were interested in the potential energetic cost for a large-bodied ectotherm if it were forced to actively thermoregulate rather than thermoconform. We therefore modelled movements and the resulting and comparative energetic costs in precisely maintaining a T (set) for a small-bodied versus large-bodied ectotherm to study and evaluate the thermoregulatory strategy.
School system evaluation by value added analysis under endogeneity.
Manzi, Jorge; San Martín, Ernesto; Van Bellegem, Sébastien
2014-01-01
Value added is a common tool in educational research on effectiveness. It is often modeled as a (prediction of a) random effect in a specific hierarchical linear model. This paper shows that this modeling strategy is not valid when endogeneity is present. Endogeneity stems, for instance, from a correlation between the random effect in the hierarchical model and some of its covariates. This paper shows that this phenomenon is far from exceptional and can even be a generic problem when the covariates contain the prior score attainments, a typical situation in value added modeling. Starting from a general, model-free definition of value added, the paper derives an explicit expression of the value added in an endogeneous hierarchical linear Gaussian model. Inference on value added is proposed using an instrumental variable approach. The impact of endogeneity on the value added and the estimated value added is calculated accurately. This is also illustrated on a large data set of individual scores of about 200,000 students in Chile.
Ahn, Jaeil; Morita, Satoshi; Wang, Wenyi; Yuan, Ying
2017-01-01
Analyzing longitudinal dyadic data is a challenging task due to the complicated correlations from repeated measurements and within-dyad interdependence, as well as potentially informative (or non-ignorable) missing data. We propose a dyadic shared-parameter model to analyze longitudinal dyadic data with ordinal outcomes and informative intermittent missing data and dropouts. We model the longitudinal measurement process using a proportional odds model, which accommodates the within-dyad interdependence using the concept of the actor-partner interdependence effects, as well as dyad-specific random effects. We model informative dropouts and intermittent missing data using a transition model, which shares the same set of random effects as the longitudinal measurement model. We evaluate the performance of the proposed method through extensive simulation studies. As our approach relies on some untestable assumptions on the missing data mechanism, we perform sensitivity analyses to evaluate how the analysis results change when the missing data mechanism is misspecified. We demonstrate our method using a longitudinal dyadic study of metastatic breast cancer.
Comparison of Oral Reading Errors between Contextual Sentences and Random Words among Schoolchildren
ERIC Educational Resources Information Center
Khalid, Nursyairah Mohd; Buari, Noor Halilah; Chen, Ai-Hong
2017-01-01
This paper compares the oral reading errors between the contextual sentences and random words among schoolchildren. Two sets of reading materials were developed to test the oral reading errors in 30 schoolchildren (10.00±1.44 years). Set A was comprised contextual sentences while Set B encompassed random words. The schoolchildren were asked to…
IMAGINE: Interstellar MAGnetic field INference Engine
NASA Astrophysics Data System (ADS)
Steininger, Theo
2018-03-01
IMAGINE (Interstellar MAGnetic field INference Engine) performs inference on generic parametric models of the Galaxy. The modular open source framework uses highly optimized tools and technology such as the MultiNest sampler (ascl:1109.006) and the information field theory framework NIFTy (ascl:1302.013) to create an instance of the Milky Way based on a set of parameters for physical observables, using Bayesian statistics to judge the mismatch between measured data and model prediction. The flexibility of the IMAGINE framework allows for simple refitting for newly available data sets and makes state-of-the-art Bayesian methods easily accessible particularly for random components of the Galactic magnetic field.
NASA Astrophysics Data System (ADS)
Laws, Priscilla W.
2004-05-01
The Workshop Physics Activity Guide is a set of student workbooks designed to serve as the foundation for a two-semester calculus-based introductory physics course. It consists of 28 units that interweave text materials with activities that include prediction, qualitative observation, explanation, equation derivation, mathematical modeling, quantitative experiments, and problem solving. Students use a powerful set of computer tools to record, display, and analyze data, as well as to develop mathematical models of physical phenomena. The design of many of the activities is based on the outcomes of physics education research.
Unique effects of setting goals on behavior change: Systematic review and meta-analysis.
Epton, Tracy; Currie, Sinead; Armitage, Christopher J
2017-12-01
Goal setting is a common feature of behavior change interventions, but it is unclear when goal setting is optimally effective. The aims of this systematic review and meta-analysis were to evaluate: (a) the unique effects of goal setting on behavior change, and (b) under what circumstances and for whom goal setting works best. Four databases were searched for articles that assessed the unique effects of goal setting on behavior change using randomized controlled trials. One-hundred and 41 papers were identified from which 384 effect sizes (N = 16,523) were extracted and analyzed. A moderator analysis of sample characteristics, intervention characteristics, inclusion of other behavior change techniques, study design and delivery, quality of study, outcome measures, and behavior targeted was conducted. A random effects model indicated a small positive unique effect of goal setting across a range of behaviors, d = .34 (CI [.28, .41]). Moderator analyses indicated that goal setting was particularly effective if the goal was: (a) difficult, (b) set publicly, and (c) was a group goal. There was weaker evidence that goal setting was more effective when paired with external monitoring of the behavior/outcome by others without feedback and delivered face-to-face. Goal setting is an effective behavior change technique that has the potential to be considered a fundamental component of successful interventions. The present review adds novel insights into the means by which goal setting might be augmented to maximize behavior change and sets the agenda for future programs of research. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
A Feature-based Developmental Model of the Infant Brain in Structural MRI
Toews, Matthew; Wells, William M.; Zöllei, Lilla
2014-01-01
In this paper, anatomical development is modeled as a collection of distinctive image patterns localized in space and time. A Bayesian posterior probability is defined over a random variable of subject age, conditioned on data in the form of scale-invariant image features. The model is automatically learned from a large set of images exhibiting significant variation, used to discover anatomical structure related to age and development, and fit to new images to predict age. The model is applied to a set of 230 infant structural MRIs of 92 subjects acquired at multiple sites over an age range of 8-590 days. Experiments demonstrate that the model can be used to identify age-related anatomical structure, and to predict the age of new subjects with an average error of 72 days. PMID:23286050
Evolving optimised decision rules for intrusion detection using particle swarm paradigm
NASA Astrophysics Data System (ADS)
Sivatha Sindhu, Siva S.; Geetha, S.; Kannan, A.
2012-12-01
The aim of this article is to construct a practical intrusion detection system (IDS) that properly analyses the statistics of network traffic pattern and classify them as normal or anomalous class. The objective of this article is to prove that the choice of effective network traffic features and a proficient machine-learning paradigm enhances the detection accuracy of IDS. In this article, a rule-based approach with a family of six decision tree classifiers, namely Decision Stump, C4.5, Naive Baye's Tree, Random Forest, Random Tree and Representative Tree model to perform the detection of anomalous network pattern is introduced. In particular, the proposed swarm optimisation-based approach selects instances that compose training set and optimised decision tree operate over this trained set producing classification rules with improved coverage, classification capability and generalisation ability. Experiment with the Knowledge Discovery and Data mining (KDD) data set which have information on traffic pattern, during normal and intrusive behaviour shows that the proposed algorithm produces optimised decision rules and outperforms other machine-learning algorithm.
Bagging Voronoi classifiers for clustering spatial functional data
NASA Astrophysics Data System (ADS)
Secchi, Piercesare; Vantini, Simone; Vitelli, Valeria
2013-06-01
We propose a bagging strategy based on random Voronoi tessellations for the exploration of geo-referenced functional data, suitable for different purposes (e.g., classification, regression, dimensional reduction, …). Urged by an application to environmental data contained in the Surface Solar Energy database, we focus in particular on the problem of clustering functional data indexed by the sites of a spatial finite lattice. We thus illustrate our strategy by implementing a specific algorithm whose rationale is to (i) replace the original data set with a reduced one, composed by local representatives of neighborhoods covering the entire investigated area; (ii) analyze the local representatives; (iii) repeat the previous analysis many times for different reduced data sets associated to randomly generated different sets of neighborhoods, thus obtaining many different weak formulations of the analysis; (iv) finally, bag together the weak analyses to obtain a conclusive strong analysis. Through an extensive simulation study, we show that this new procedure - which does not require an explicit model for spatial dependence - is statistically and computationally efficient.
Qin, Zijian; Wang, Maolin; Yan, Aixia
2017-07-01
In this study, quantitative structure-activity relationship (QSAR) models using various descriptor sets and training/test set selection methods were explored to predict the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors by using a multiple linear regression (MLR) and a support vector machine (SVM) method. 512 HCV NS3/4A protease inhibitors and their IC 50 values which were determined by the same FRET assay were collected from the reported literature to build a dataset. All the inhibitors were represented with selected nine global and 12 2D property-weighted autocorrelation descriptors calculated from the program CORINA Symphony. The dataset was divided into a training set and a test set by a random and a Kohonen's self-organizing map (SOM) method. The correlation coefficients (r 2 ) of training sets and test sets were 0.75 and 0.72 for the best MLR model, 0.87 and 0.85 for the best SVM model, respectively. In addition, a series of sub-dataset models were also developed. The performances of all the best sub-dataset models were better than those of the whole dataset models. We believe that the combination of the best sub- and whole dataset SVM models can be used as reliable lead designing tools for new NS3/4A protease inhibitors scaffolds in a drug discovery pipeline. Copyright © 2017 Elsevier Ltd. All rights reserved.
Harnessing the Bethe free energy†
Bapst, Victor
2016-01-01
ABSTRACT A wide class of problems in combinatorics, computer science and physics can be described along the following lines. There are a large number of variables ranging over a finite domain that interact through constraints that each bind a few variables and either encourage or discourage certain value combinations. Examples include the k‐SAT problem or the Ising model. Such models naturally induce a Gibbs measure on the set of assignments, which is characterised by its partition function. The present paper deals with the partition function of problems where the interactions between variables and constraints are induced by a sparse random (hyper)graph. According to physics predictions, a generic recipe called the “replica symmetric cavity method” yields the correct value of the partition function if the underlying model enjoys certain properties [Krzkala et al., PNAS (2007) 10318–10323]. Guided by this conjecture, we prove general sufficient conditions for the success of the cavity method. The proofs are based on a “regularity lemma” for probability measures on sets of the form Ωn for a finite Ω and a large n that may be of independent interest. © 2016 Wiley Periodicals, Inc. Random Struct. Alg., 49, 694–741, 2016 PMID:28035178
An adaptive random search for short term generation scheduling with network constraints.
Marmolejo, J A; Velasco, Jonás; Selley, Héctor J
2017-01-01
This paper presents an adaptive random search approach to address a short term generation scheduling with network constraints, which determines the startup and shutdown schedules of thermal units over a given planning horizon. In this model, we consider the transmission network through capacity limits and line losses. The mathematical model is stated in the form of a Mixed Integer Non Linear Problem with binary variables. The proposed heuristic is a population-based method that generates a set of new potential solutions via a random search strategy. The random search is based on the Markov Chain Monte Carlo method. The main key of the proposed method is that the noise level of the random search is adaptively controlled in order to exploring and exploiting the entire search space. In order to improve the solutions, we consider coupling a local search into random search process. Several test systems are presented to evaluate the performance of the proposed heuristic. We use a commercial optimizer to compare the quality of the solutions provided by the proposed method. The solution of the proposed algorithm showed a significant reduction in computational effort with respect to the full-scale outer approximation commercial solver. Numerical results show the potential and robustness of our approach.
Identifying novel sequence variants of RNA 3D motifs
Zirbel, Craig L.; Roll, James; Sweeney, Blake A.; Petrov, Anton I.; Pirrung, Meg; Leontis, Neocles B.
2015-01-01
Predicting RNA 3D structure from sequence is a major challenge in biophysics. An important sub-goal is accurately identifying recurrent 3D motifs from RNA internal and hairpin loop sequences extracted from secondary structure (2D) diagrams. We have developed and validated new probabilistic models for 3D motif sequences based on hybrid Stochastic Context-Free Grammars and Markov Random Fields (SCFG/MRF). The SCFG/MRF models are constructed using atomic-resolution RNA 3D structures. To parameterize each model, we use all instances of each motif found in the RNA 3D Motif Atlas and annotations of pairwise nucleotide interactions generated by the FR3D software. Isostericity relations between non-Watson–Crick basepairs are used in scoring sequence variants. SCFG techniques model nested pairs and insertions, while MRF ideas handle crossing interactions and base triples. We use test sets of randomly-generated sequences to set acceptance and rejection thresholds for each motif group and thus control the false positive rate. Validation was carried out by comparing results for four motif groups to RMDetect. The software developed for sequence scoring (JAR3D) is structured to automatically incorporate new motifs as they accumulate in the RNA 3D Motif Atlas when new structures are solved and is available free for download. PMID:26130723
Hsu, Jia-Lien; Hung, Ping-Cheng; Lin, Hung-Yen; Hsieh, Chung-Ho
2015-04-01
Breast cancer is one of the most common cause of cancer mortality. Early detection through mammography screening could significantly reduce mortality from breast cancer. However, most of screening methods may consume large amount of resources. We propose a computational model, which is solely based on personal health information, for breast cancer risk assessment. Our model can be served as a pre-screening program in the low-cost setting. In our study, the data set, consisting of 3976 records, is collected from Taipei City Hospital starting from 2008.1.1 to 2008.12.31. Based on the dataset, we first apply the sampling techniques and dimension reduction method to preprocess the testing data. Then, we construct various kinds of classifiers (including basic classifiers, ensemble methods, and cost-sensitive methods) to predict the risk. The cost-sensitive method with random forest classifier is able to achieve recall (or sensitivity) as 100 %. At the recall of 100 %, the precision (positive predictive value, PPV), and specificity of cost-sensitive method with random forest classifier was 2.9 % and 14.87 %, respectively. In our study, we build a breast cancer risk assessment model by using the data mining techniques. Our model has the potential to be served as an assisting tool in the breast cancer screening.
Randomized Prediction Games for Adversarial Machine Learning.
Rota Bulo, Samuel; Biggio, Battista; Pillai, Ignazio; Pelillo, Marcello; Roli, Fabio
In spam and malware detection, attackers exploit randomization to obfuscate malicious data and increase their chances of evading detection at test time, e.g., malware code is typically obfuscated using random strings or byte sequences to hide known exploits. Interestingly, randomization has also been proposed to improve security of learning algorithms against evasion attacks, as it results in hiding information about the classifier to the attacker. Recent work has proposed game-theoretical formulations to learn secure classifiers, by simulating different evasion attacks and modifying the classification function accordingly. However, both the classification function and the simulated data manipulations have been modeled in a deterministic manner, without accounting for any form of randomization. In this paper, we overcome this limitation by proposing a randomized prediction game, namely, a noncooperative game-theoretic formulation in which the classifier and the attacker make randomized strategy selections according to some probability distribution defined over the respective strategy set. We show that our approach allows one to improve the tradeoff between attack detection and false alarms with respect to the state-of-the-art secure classifiers, even against attacks that are different from those hypothesized during design, on application examples including handwritten digit recognition, spam, and malware detection.In spam and malware detection, attackers exploit randomization to obfuscate malicious data and increase their chances of evading detection at test time, e.g., malware code is typically obfuscated using random strings or byte sequences to hide known exploits. Interestingly, randomization has also been proposed to improve security of learning algorithms against evasion attacks, as it results in hiding information about the classifier to the attacker. Recent work has proposed game-theoretical formulations to learn secure classifiers, by simulating different evasion attacks and modifying the classification function accordingly. However, both the classification function and the simulated data manipulations have been modeled in a deterministic manner, without accounting for any form of randomization. In this paper, we overcome this limitation by proposing a randomized prediction game, namely, a noncooperative game-theoretic formulation in which the classifier and the attacker make randomized strategy selections according to some probability distribution defined over the respective strategy set. We show that our approach allows one to improve the tradeoff between attack detection and false alarms with respect to the state-of-the-art secure classifiers, even against attacks that are different from those hypothesized during design, on application examples including handwritten digit recognition, spam, and malware detection.
NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents.
Liu, Sophia S; Hockenberry, Adam J; Lancichinetti, Andrea; Jewett, Michael C; Amaral, Luís A N
2016-11-01
The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. In order to accurately identify motifs and other genome-scale patterns of interest, it is essential to be able to generate accurate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, which we have developed into a python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
Random sphere packing model of heterogeneous propellants
NASA Astrophysics Data System (ADS)
Kochevets, Sergei Victorovich
It is well recognized that combustion of heterogeneous propellants is strongly dependent on the propellant morphology. Recent developments in computing systems make it possible to start three-dimensional modeling of heterogeneous propellant combustion. A key component of such large scale computations is a realistic model of industrial propellants which retains the true morphology---a goal never achieved before. The research presented develops the Random Sphere Packing Model of heterogeneous propellants and generates numerical samples of actual industrial propellants. This is done by developing a sphere packing algorithm which randomly packs a large number of spheres with a polydisperse size distribution within a rectangular domain. First, the packing code is developed, optimized for performance, and parallelized using the OpenMP shared memory architecture. Second, the morphology and packing fraction of two simple cases of unimodal and bimodal packs are investigated computationally and analytically. It is shown that both the Loose Random Packing and Dense Random Packing limits are not well defined and the growth rate of the spheres is identified as the key parameter controlling the efficiency of the packing. For a properly chosen growth rate, computational results are found to be in excellent agreement with experimental data. Third, two strategies are developed to define numerical samples of polydisperse heterogeneous propellants: the Deterministic Strategy and the Random Selection Strategy. Using these strategies, numerical samples of industrial propellants are generated. The packing fraction is investigated and it is shown that the experimental values of the packing fraction can be achieved computationally. It is strongly believed that this Random Sphere Packing Model of propellants is a major step forward in the realistic computational modeling of heterogeneous propellant of combustion. In addition, a method of analysis of the morphology of heterogeneous propellants is developed which uses the concept of multi-point correlation functions. A set of intrinsic length scales of local density fluctuations in random heterogeneous propellants is identified by performing a Monte-Carlo study of the correlation functions. This method of analysis shows great promise for understanding the origins of the combustion instability of heterogeneous propellants, and is believed to become a valuable tool for the development of safe and reliable rocket engines.
Free Vibration of Uncertain Unsymmetrically Laminated Beams
NASA Technical Reports Server (NTRS)
Kapania, Rakesh K.; Goyal, Vijay K.
2001-01-01
Monte Carlo Simulation and Stochastic FEA are used to predict randomness in the free vibration response of thin unsymmetrically laminated beams. For the present study, it is assumed that randomness in the response is only caused by uncertainties in the ply orientations. The ply orientations may become random or uncertain during the manufacturing process. A new 16-dof beam element, based on the first-order shear deformation beam theory, is used to study the stochastic nature of the natural frequencies. Using variational principles, the element stiffness matrix and mass matrix are obtained through analytical integration. Using a random sequence a large data set is generated, containing possible random ply-orientations. This data is assumed to be symmetric. The stochastic-based finite element model for free vibrations predicts the relation between the randomness in fundamental natural frequencies and the randomness in ply-orientation. The sensitivity derivatives are calculated numerically through an exact formulation. The squared fundamental natural frequencies are expressed in terms of deterministic and probabilistic quantities, allowing to determine how sensitive they are to variations in ply angles. The predicted mean-valued fundamental natural frequency squared and the variance of the present model are in good agreement with Monte Carlo Simulation. Results, also, show that variations between plus or minus 5 degrees in ply-angles can affect free vibration response of unsymmetrically and symmetrically laminated beams.
Jones, Andrew M; Lomas, James; Moore, Peter T; Rice, Nigel
2016-10-01
We conduct a quasi-Monte-Carlo comparison of the recent developments in parametric and semiparametric regression methods for healthcare costs, both against each other and against standard practice. The population of English National Health Service hospital in-patient episodes for the financial year 2007-2008 (summed for each patient) is randomly divided into two equally sized subpopulations to form an estimation set and a validation set. Evaluating out-of-sample using the validation set, a conditional density approximation estimator shows considerable promise in forecasting conditional means, performing best for accuracy of forecasting and among the best four for bias and goodness of fit. The best performing model for bias is linear regression with square-root-transformed dependent variables, whereas a generalized linear model with square-root link function and Poisson distribution performs best in terms of goodness of fit. Commonly used models utilizing a log-link are shown to perform badly relative to other models considered in our comparison.
Characterization of impulse noise and analysis of its effect upon correlation receivers
NASA Technical Reports Server (NTRS)
Houts, R. C.; Moore, J. D.
1971-01-01
A noise model is formulated to describe the impulse noise in many digital systems. A simplified model, which assumes that each noise burst contains a randomly weighted version of the same basic waveform, is used to derive the performance equations for a correlation receiver. The expected number of bit errors per noise burst is expressed as a function of the average signal energy, signal-set correlation coefficient, bit time, noise-weighting-factor variance and probability density function, and a time range function which depends on the crosscorrelation of the signal-set basis functions and the noise waveform. A procedure is established for extending the results for the simplified noise model to the general model. Unlike the performance results for Gaussian noise, it is shown that for impulse noise the error performance is affected by the choice of signal-set basis functions and that Orthogonal signaling is not equivalent to On-Off signaling with the same average energy.
Pozzobon, Victor; Perre, Patrick
2018-01-21
This work provides a model and the associated set of parameters allowing for microalgae population growth computation under intermittent lightning. Han's model is coupled with a simple microalgae growth model to yield a relationship between illumination and population growth. The model parameters were obtained by fitting a dataset available in literature using Particle Swarm Optimization method. In their work, authors grew microalgae in excess of nutrients under flashing conditions. Light/dark cycles used for these experimentations are quite close to those found in photobioreactor, i.e. ranging from several seconds to one minute. In this work, in addition to producing the set of parameters, Particle Swarm Optimization robustness was assessed. To do so, two different swarm initialization techniques were used, i.e. uniform and random distribution throughout the search-space. Both yielded the same results. In addition, swarm distribution analysis reveals that the swarm converges to a unique minimum. Thus, the produced set of parameters can be trustfully used to link light intensity to population growth rate. Furthermore, the set is capable to describe photodamages effects on population growth. Hence, accounting for light overexposure effect on algal growth. Copyright © 2017 Elsevier Ltd. All rights reserved.
A Comparison of Fuzzy Models in Similarity Assessment of Misregistered Area Class Maps
NASA Astrophysics Data System (ADS)
Brown, Scott
Spatial uncertainty refers to unknown error and vagueness in geographic data. It is relevant to land change and urban growth modelers, soil and biome scientists, geological surveyors and others, who must assess thematic maps for similarity, or categorical agreement. In this paper I build upon prior map comparison research, testing the effectiveness of similarity measures on misregistered data. Though several methods compare uncertain thematic maps, few methods have been tested on misregistration. My objective is to test five map comparison methods for sensitivity to misregistration, including sub-pixel errors in both position and rotation. Methods included four fuzzy categorical models: fuzzy kappa's model, fuzzy inference, cell aggregation, and the epsilon band. The fifth method used conventional crisp classification. I applied these methods to a case study map and simulated data in two sets: a test set with misregistration error, and a control set with equivalent uniform random error. For all five methods, I used raw accuracy or the kappa statistic to measure similarity. Rough-set epsilon bands report the most similarity increase in test maps relative to control data. Conversely, the fuzzy inference model reports a decrease in test map similarity.
ERIC Educational Resources Information Center
Felce, David; Perry, Jonathan
2004-01-01
Background: The aims were to: (i) explore the association between age and size of setting and staffing per resident; and (ii) report resident and setting characteristics, and indicators of service process and resident activity for a national random sample of staffed housing provision. Methods: Sixty settings were selected randomly from those…
Stanley, Clayton; Byrne, Michael D
2016-12-01
The growth of social media and user-created content on online sites provides unique opportunities to study models of human declarative memory. By framing the task of choosing a hashtag for a tweet and tagging a post on Stack Overflow as a declarative memory retrieval problem, 2 cognitively plausible declarative memory models were applied to millions of posts and tweets and evaluated on how accurately they predict a user's chosen tags. An ACT-R based Bayesian model and a random permutation vector-based model were tested on the large data sets. The results show that past user behavior of tag use is a strong predictor of future behavior. Furthermore, past behavior was successfully incorporated into the random permutation model that previously used only context. Also, ACT-R's attentional weight term was linked to an entropy-weighting natural language processing method used to attenuate high-frequency words (e.g., articles and prepositions). Word order was not found to be a strong predictor of tag use, and the random permutation model performed comparably to the Bayesian model without including word order. This shows that the strength of the random permutation model is not in the ability to represent word order, but rather in the way in which context information is successfully compressed. The results of the large-scale exploration show how the architecture of the 2 memory models can be modified to significantly improve accuracy, and may suggest task-independent general modifications that can help improve model fit to human data in a much wider range of domains. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Schroeter, Timon Sebastian; Schwaighofer, Anton; Mika, Sebastian; Ter Laak, Antonius; Suelzle, Detlev; Ganzer, Ursula; Heinrich, Nikolaus; Müller, Klaus-Robert
2007-12-01
We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in how far the individual error bars can faithfully represent the actual prediction error.
Schroeter, Timon Sebastian; Schwaighofer, Anton; Mika, Sebastian; Ter Laak, Antonius; Suelzle, Detlev; Ganzer, Ursula; Heinrich, Nikolaus; Müller, Klaus-Robert
2007-09-01
We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in how far the individual error bars can faithfully represent the actual prediction error.
NASA Astrophysics Data System (ADS)
Schroeter, Timon Sebastian; Schwaighofer, Anton; Mika, Sebastian; Ter Laak, Antonius; Suelzle, Detlev; Ganzer, Ursula; Heinrich, Nikolaus; Müller, Klaus-Robert
2007-12-01
We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in how far the individual error bars can faithfully represent the actual prediction error.
NASA Astrophysics Data System (ADS)
Schroeter, Timon Sebastian; Schwaighofer, Anton; Mika, Sebastian; Ter Laak, Antonius; Suelzle, Detlev; Ganzer, Ursula; Heinrich, Nikolaus; Müller, Klaus-Robert
2007-09-01
We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in how far the individual error bars can faithfully represent the actual prediction error.
On set-valued functionals: Multivariate risk measures and Aumann integrals
NASA Astrophysics Data System (ADS)
Ararat, Cagin
In this dissertation, multivariate risk measures for random vectors and Aumann integrals of set-valued functions are studied. Both are set-valued functionals with values in a complete lattice of subsets of Rm. Multivariate risk measures are considered in a general d-asset financial market with trading opportunities in discrete time. Specifically, the following features of the market are incorporated in the evaluation of multivariate risk: convex transaction costs modeled by solvency regions, intermediate trading constraints modeled by convex random sets, and the requirement of liquidation into the first m ≤ d of the assets. It is assumed that the investor has a "pure" multivariate risk measure R on the space of m-dimensional random vectors which represents her risk attitude towards the assets but does not take into account the frictions of the market. Then, the investor with a d-dimensional position minimizes the set-valued functional R over all m-dimensional positions that she can reach by trading in the market subject to the frictions described above. The resulting functional Rmar on the space of d-dimensional random vectors is another multivariate risk measure, called the market-extension of R. A dual representation for R mar that decomposes the effects of R and the frictions of the market is proved. Next, multivariate risk measures are studied in a utility-based framework. It is assumed that the investor has a complete risk preference towards each individual asset, which can be represented by a von Neumann-Morgenstern utility function. Then, an incomplete preference is considered for multivariate positions which is represented by the vector of the individual utility functions. Under this structure, multivariate shortfall and divergence risk measures are defined as the optimal values of set minimization problems. The dual relationship between the two classes of multivariate risk measures is constructed via a recent Lagrange duality for set optimization. In particular, it is shown that a shortfall risk measure can be written as an intersection over a family of divergence risk measures indexed by a scalarization parameter. Examples include the multivariate versions of the entropic risk measure and the average value at risk. In the second part, Aumann integrals of set-valued functions on a measurable space are viewed as set-valued functionals and a Daniell-Stone type characterization theorem is proved for such functionals. More precisely, it is shown that a functional that maps measurable set-valued functions into a certain complete lattice of subsets of Rm can be written as the Aumann integral with respect to a measure if and only if the functional is (1) additive and (2) positively homogeneous, (3) it preserves decreasing limits, (4) it maps halfspace-valued functions to halfspaces, and (5) it maps shifted cone-valued functions to shifted cones. While the first three properties already exist in the classical Daniell-Stone theorem for the Lebesgue integral, the last two properties are peculiar to the set-valued framework and they suffice to complement the first three properties to identify a set-valued functional as the Aumann integral with respect to a measure.
Quantum walks with tuneable self-avoidance in one dimension
Camilleri, Elizabeth; Rohde, Peter P.; Twamley, Jason
2014-01-01
Quantum walks exhibit many unique characteristics compared to classical random walks. In the classical setting, self-avoiding random walks have been studied as a variation on the usual classical random walk. Here the walker has memory of its previous locations and preferentially avoids stepping back to locations where it has previously resided. Classical self-avoiding random walks have found numerous algorithmic applications, most notably in the modelling of protein folding. We consider the analogous problem in the quantum setting – a quantum walk in one dimension with tunable levels of self-avoidance. We complement a quantum walk with a memory register that records where the walker has previously resided. The walker is then able to avoid returning back to previously visited sites or apply more general memory conditioned operations to control the walk. We characterise this walk by examining the variance of the walker's distribution against time, the standard metric for quantifying how quantum or classical a walk is. We parameterise the strength of the memory recording and the strength of the memory back-action on the walker, and investigate their effect on the dynamics of the walk. We find that by manipulating these parameters, which dictate the degree of self-avoidance, the walk can be made to reproduce ideal quantum or classical random walk statistics, or a plethora of more elaborate diffusive phenomena. In some parameter regimes we observe a close correspondence between classical self-avoiding random walks and the quantum self-avoiding walk. PMID:24762398
Cascade phenomenon against subsequent failures in complex networks
NASA Astrophysics Data System (ADS)
Jiang, Zhong-Yuan; Liu, Zhi-Quan; He, Xuan; Ma, Jian-Feng
2018-06-01
Cascade phenomenon may lead to catastrophic disasters which extremely imperil the network safety or security in various complex systems such as communication networks, power grids, social networks and so on. In some flow-based networks, the load of failed nodes can be redistributed locally to their neighboring nodes to maximally preserve the traffic oscillations or large-scale cascading failures. However, in such local flow redistribution model, a small set of key nodes attacked subsequently can result in network collapse. Then it is a critical problem to effectively find the set of key nodes in the network. To our best knowledge, this work is the first to study this problem comprehensively. We first introduce the extra capacity for every node to put up with flow fluctuations from neighbors, and two extra capacity distributions including degree based distribution and average distribution are employed. Four heuristic key nodes discovering methods including High-Degree-First (HDF), Low-Degree-First (LDF), Random and Greedy Algorithms (GA) are presented. Extensive simulations are realized in both scale-free networks and random networks. The results show that the greedy algorithm can efficiently find the set of key nodes in both scale-free and random networks. Our work studies network robustness against cascading failures from a very novel perspective, and methods and results are very useful for network robustness evaluations and protections.
Strong adsorption of random heteropolymers on protein surfaces
NASA Astrophysics Data System (ADS)
Nguyen, Trung; Qiao, Baofu; Panganiban, Brian; Delre, Christopher; Xu, Ting; Olvera de La Cruz, Monica
Rational design of copolymers for stablizing proteins' functionalities in unfavorable solvents and delivering nanoparticles through organic membranes demands a thorough understanding of how the proteins and colloids are encapsulated by a given type of copolymers. Random heteropolymers (RHPs), a special family of copolymers with random segment order, have long been recognized as a promising coating materials due to their biomimetic behaviors while allowing for much flexibility in the synthesis procedure. Of practical importance is the ability to predict the conditions under which a given family of random heteropolymers would provide optimal encapsulatio. Here we investigate the key factors that govern the adsorption of RHPs on the surface of a model protein. Using coarse-grained molecular simulation we identify the conditions under which the model protein is fully covered by the polymers. We have examined the nanometer-level details of the adsorbed polymer chains and found a clear connection between the surface coverage and adsorption strength, solvent selectivity and the volume fraction of adsorbing monomers. The results in this work set the stage for further investigation on engineering biomimetic RHPs for stabilizing and delivering functional proteins across multiple media.
Model risk for European-style stock index options.
Gençay, Ramazan; Gibson, Rajna
2007-01-01
In empirical modeling, there have been two strands for pricing in the options literature, namely the parametric and nonparametric models. Often, the support for the nonparametric methods is based on a benchmark such as the Black-Scholes (BS) model with constant volatility. In this paper, we study the stochastic volatility (SV) and stochastic volatility random jump (SVJ) models as parametric benchmarks against feedforward neural network (FNN) models, a class of neural network models. Our choice for FNN models is due to their well-studied universal approximation properties of an unknown function and its partial derivatives. Since the partial derivatives of an option pricing formula are risk pricing tools, an accurate estimation of the unknown option pricing function is essential for pricing and hedging. Our findings indicate that FNN models offer themselves as robust option pricing tools, over their sophisticated parametric counterparts in predictive settings. There are two routes to explain the superiority of FNN models over the parametric models in forecast settings. These are nonnormality of return distributions and adaptive learning.
Self-Organization of Temporal Structures — A Possible Solution for the Intervention Problem
NASA Astrophysics Data System (ADS)
von Lucadou, Walter
2006-10-01
The paper presents an experiment that is a conceptual replication of two earlier experiments which demonstrate entanglement correlations between a quantum physical random process and certain psychological variables of human observers. In the present study button-pushes were used as psychological variables. The button-pushes were performed by the subject with his or her left or right hand in order to "control" (according to the instruction) a random process that could be observed on a display. Each button-push started the next random event which, however, in reality, was independent of the button-pushes. The study consists of three independent sets of data (n = 386) that were gained with almost fee same apparatus in three different experimental situations. The first data set serves as reference. It was an automatic control-run without subjects. The second set was produced mainly by subjects who asked for taking part in a para-psychological experiment and who visited the "Parapsychological Counseling Office" in Freiburg especially for this purpose. Most of them were highly motivated persons who wanted to test their "psi ability". In this case the number of runs could be selected by the subjects before the experimental session. The third set of data (of the same size) was collected during two public exhibitions (at Basel and at Freiburg) where the visitors had the opportunity to participate in a "PK experiment". In this case the number of trials and runs was fixed in advance, but the duration of the experiment was dependent of the speed of button-pushes. The results corroborate the previous studies. The specific way how the subjects pushed the buttons is highly significantly correlated with the independent random process. This correlation shows up for the momentarily generated random events as well as for the previous and the later runs during the experimental session. In a strict sense, only the correlations with the future random events can be interpreted as non-local correlations. The structure of the data, however, allows the conclusion, that all observed correlations can be considered as entanglement-correlations. The number of entanglement-correlations was significantly higher for the highly motivated group (data set 2) than for the unselected group of the exhibition participants (data set 3). The latter, however, where not completely unsuccessful: A subgroup who showed "innovative" behavior also showed significant entanglement-correlations. It could further be shown, that the structure of the matrix of entanglement-correlations is not stable in time and changes if the experiment is repeated. In comparison with previous correlation-experiments, no decline of the effect size was observed. These results are in agreement with the predictions of the "Weak Quantum Theory (WQT)" and the "Model of Pragmatic Information (MPI)". These models interpret the measured correlations as entanglement-correlations within a self-organizing, organizationally closed, psycho-physical system that exist during a certain time-interval (as long as the system is active). The entanglement-correlations cannot be considered as a causal influence (in the sense of a PK-Influence) and thus are called "micro-synchronicity". After a short introduction (1.), the question is discussed how non-local correlations can be created in psycho-physical systems (2.). In chapter (3.) the description of the experimental setting is given and the apparatus (4.) and randomness test of the random event generator (5.) are described. Additionally, an overview of the structure of the data is given (6.) and the analysis methods are described (7.). In chapter (8.) the experimental hypotheses are formulated and the results are reported (9.). After the discussion of the results (10.) the conclusions (11).) of the study are presented.
Fisher's geometric model predicts the effects of random mutations when tested in the wild.
Stearns, Frank W; Fenster, Charles B
2016-02-01
Fisher's geometric model of adaptation (FGM) has been the conceptual foundation for studies investigating the genetic basis of adaptation since the onset of the neo Darwinian synthesis. FGM describes adaptation as the movement of a genotype toward a fitness optimum due to beneficial mutations. To date, one prediction of FGM, the probability of improvement is related to the distance from the optimum, has only been tested in microorganisms under laboratory conditions. There is reason to believe that results might differ under natural conditions where more mutations likely affect fitness, and where environmental variance may obscure the expected pattern. We chemically induced mutations into a set of 19 Arabidopsis thaliana accessions from across the native range of A. thaliana and planted them alongside the premutated founder lines in two habitats in the mid-Atlantic region of the United States under field conditions. We show that FGM is able to predict the outcome of a set of random induced mutations on fitness in a set of A. thaliana accessions grown in the wild: mutations are more likely to be beneficial in relatively less fit genotypes. This finding suggests that FGM is an accurate approximation of the process of adaptation under more realistic ecological conditions. © 2016 The Author(s). Evolution © 2016 The Society for the Study of Evolution.
Heo, Moonseong; Litwin, Alain H; Blackstock, Oni; Kim, Namhee; Arnsten, Julia H
2017-02-01
We derived sample size formulae for detecting main effects in group-based randomized clinical trials with different levels of data hierarchy between experimental and control arms. Such designs are necessary when experimental interventions need to be administered to groups of subjects whereas control conditions need to be administered to individual subjects. This type of trial, often referred to as a partially nested or partially clustered design, has been implemented for management of chronic diseases such as diabetes and is beginning to emerge more commonly in wider clinical settings. Depending on the research setting, the level of hierarchy of data structure for the experimental arm can be three or two, whereas that for the control arm is two or one. Such different levels of data hierarchy assume correlation structures of outcomes that are different between arms, regardless of whether research settings require two or three level data structure for the experimental arm. Therefore, the different correlations should be taken into account for statistical modeling and for sample size determinations. To this end, we considered mixed-effects linear models with different correlation structures between experimental and control arms to theoretically derive and empirically validate the sample size formulae with simulation studies.
Assessing the accuracy and stability of variable selection ...
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used, or stepwise procedures are employed which iteratively add/remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating dataset consists of the good/poor condition of n=1365 stream survey sites from the 2008/2009 National Rivers and Stream Assessment, and a large set (p=212) of landscape features from the StreamCat dataset. Two types of RF models are compared: a full variable set model with all 212 predictors, and a reduced variable set model selected using a backwards elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors, and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substanti
Bi-dimensional null model analysis of presence-absence binary matrices.
Strona, Giovanni; Ulrich, Werner; Gotelli, Nicholas J
2018-01-01
Comparing the structure of presence/absence (i.e., binary) matrices with those of randomized counterparts is a common practice in ecology. However, differences in the randomization procedures (null models) can affect the results of the comparisons, leading matrix structural patterns to appear either "random" or not. Subjectivity in the choice of one particular null model over another makes it often advisable to compare the results obtained using several different approaches. Yet, available algorithms to randomize binary matrices differ substantially in respect to the constraints they impose on the discrepancy between observed and randomized row and column marginal totals, which complicates the interpretation of contrasting patterns. This calls for new strategies both to explore intermediate scenarios of restrictiveness in-between extreme constraint assumptions, and to properly synthesize the resulting information. Here we introduce a new modeling framework based on a flexible matrix randomization algorithm (named the "Tuning Peg" algorithm) that addresses both issues. The algorithm consists of a modified swap procedure in which the discrepancy between the row and column marginal totals of the target matrix and those of its randomized counterpart can be "tuned" in a continuous way by two parameters (controlling, respectively, row and column discrepancy). We show how combining the Tuning Peg with a wise random walk procedure makes it possible to explore the complete null space embraced by existing algorithms. This exploration allows researchers to visualize matrix structural patterns in an innovative bi-dimensional landscape of significance/effect size. We demonstrate the rational and potential of our approach with a set of simulated and real matrices, showing how the simultaneous investigation of a comprehensive and continuous portion of the null space can be extremely informative, and possibly key to resolving longstanding debates in the analysis of ecological matrices. © 2017 The Authors. Ecology, published by Wiley Periodicals, Inc., on behalf of the Ecological Society of America.
Spread of information and infection on finite random networks
NASA Astrophysics Data System (ADS)
Isham, Valerie; Kaczmarska, Joanna; Nekovee, Maziar
2011-04-01
The modeling of epidemic-like processes on random networks has received considerable attention in recent years. While these processes are inherently stochastic, most previous work has been focused on deterministic models that ignore important fluctuations that may persist even in the infinite network size limit. In a previous paper, for a class of epidemic and rumor processes, we derived approximate models for the full probability distribution of the final size of the epidemic, as opposed to only mean values. In this paper we examine via direct simulations the adequacy of the approximate model to describe stochastic epidemics and rumors on several random network topologies: homogeneous networks, Erdös-Rényi (ER) random graphs, Barabasi-Albert scale-free networks, and random geometric graphs. We find that the approximate model is reasonably accurate in predicting the probability of spread. However, the position of the threshold and the conditional mean of the final size for processes near the threshold are not well described by the approximate model even in the case of homogeneous networks. We attribute this failure to the presence of other structural properties beyond degree-degree correlations, and in particular clustering, which are present in any finite network but are not incorporated in the approximate model. In order to test this “hypothesis” we perform additional simulations on a set of ER random graphs where degree-degree correlations and clustering are separately and independently introduced using recently proposed algorithms from the literature. Our results show that even strong degree-degree correlations have only weak effects on the position of the threshold and the conditional mean of the final size. On the other hand, the introduction of clustering greatly affects both the position of the threshold and the conditional mean. Similar analysis for the Barabasi-Albert scale-free network confirms the significance of clustering on the dynamics of rumor spread. For this network, though, with its highly skewed degree distribution, the addition of positive correlation had a much stronger effect on the final size distribution than was found for the simple random graph.
Dynamic analysis of a pumped-storage hydropower plant with random power load
NASA Astrophysics Data System (ADS)
Zhang, Hao; Chen, Diyi; Xu, Beibei; Patelli, Edoardo; Tolo, Silvia
2018-02-01
This paper analyzes the dynamic response of a pumped-storage hydropower plant in generating mode. Considering the elastic water column effects in the penstock, a linearized reduced order dynamic model of the pumped-storage hydropower plant is used in this paper. As the power load is always random, a set of random generator electric power output is introduced to research the dynamic behaviors of the pumped-storage hydropower plant. Then, the influences of the PI gains on the dynamic characteristics of the pumped-storage hydropower plant with the random power load are analyzed. In addition, the effects of initial power load and PI parameters on the stability of the pumped-storage hydropower plant are studied in depth. All of the above results will provide theoretical guidance for the study and analysis of the pumped-storage hydropower plant.
Scattering from randomly oriented circular discs with application to vegetation
NASA Technical Reports Server (NTRS)
Karam, M. A.; Fung, A. K.
1984-01-01
A vegetation layer is modeled by a collection of randomly oriented circular discs over a half space. The backscattering coefficient from such a half space is computed using the radiative transfer theory. It is shown that significantly different results are obtained from this theory as compared with some earlier investigations using the same modeling approach but with restricted disc orientations. In particular, the backscattered cross polarized returns cannot have a fast increasing angular trend which is inconsistent with measurements. By setting the appropriate angle of orientation to zero the theory reduces to previously published results. Comparisons are shown with measurements taken from milo, corn and wheat and good agreements are obtained for both polarized and cross polarized returns.
Scattering from randomly oriented circular discs with application to vegetation
NASA Technical Reports Server (NTRS)
Karam, M. A.; Fung, A. K.
1983-01-01
A vegetation layer is modeled by a collection of randomly oriented circular discs over a half space. The backscattering coefficient from such a half space is computed using the radiative transfer theory. It is shown that significantly different results are obtained from this theory as compared with some earlier investigations using the same modeling approach but with restricted disc orientations. In particular, the backscattered cross-polarized returns cannot have a fast increasing angular trend which is inconsistent with measurements. By setting the appropriate angle of orientation to zero the theory reduces to previously published results. Comparisons are shown with measurements taken from milo, corn and wheat and good agreements are obtained for both polarized and cross-polarized returns.
Pu, Jie; Fang, Di; Wilson, Jeffrey R
2017-02-03
The analysis of correlated binary data is commonly addressed through the use of conditional models with random effects included in the systematic component as opposed to generalized estimating equations (GEE) models that addressed the random component. Since the joint distribution of the observations is usually unknown, the conditional distribution is a natural approach. Our objective was to compare the fit of different binary models for correlated data in Tabaco use. We advocate that the joint modeling of the mean and dispersion may be at times just as adequate. We assessed the ability of these models to account for the intraclass correlation. In so doing, we concentrated on fitting logistic regression models to address smoking behaviors. Frequentist and Bayes' hierarchical models were used to predict conditional probabilities, and the joint modeling (GLM and GAM) models were used to predict marginal probabilities. These models were fitted to National Longitudinal Study of Adolescent to Adult Health (Add Health) data for Tabaco use. We found that people were less likely to smoke if they had higher income, high school or higher education and religious. Individuals were more likely to smoke if they had abused drug or alcohol, spent more time on TV and video games, and been arrested. Moreover, individuals who drank alcohol early in life were more likely to be a regular smoker. Children who experienced mistreatment from their parents were more likely to use Tabaco regularly. The joint modeling of the mean and dispersion models offered a flexible and meaningful method of addressing the intraclass correlation. They do not require one to identify random effects nor distinguish from one level of the hierarchy to the other. Moreover, once one can identify the significant random effects, one can obtain similar results to the random coefficient models. We found that the set of marginal models accounting for extravariation through the additional dispersion submodel produced similar results with regards to inferences and predictions. Moreover, both marginal and conditional models demonstrated similar predictive power.
Bayesian Nonparametric Inference – Why and How
Müller, Peter; Mitra, Riten
2013-01-01
We review inference under models with nonparametric Bayesian (BNP) priors. The discussion follows a set of examples for some common inference problems. The examples are chosen to highlight problems that are challenging for standard parametric inference. We discuss inference for density estimation, clustering, regression and for mixed effects models with random effects distributions. While we focus on arguing for the need for the flexibility of BNP models, we also review some of the more commonly used BNP models, thus hopefully answering a bit of both questions, why and how to use BNP. PMID:24368932
The Impact of Marketing Actions on Relationship Quality in the Higher Education Sector in Jordan
ERIC Educational Resources Information Center
Al-Alak, Basheer A. M.
2006-01-01
This field/analytical study examined the marketing actions (antecedents) and performance (consequences) of relationship quality in a higher education setting. To analyze data collected from a random sample of 271 undergraduate students at AL-Zaytoonah Private University of Jordan, the linear structural relationship (LISREL) model was used to…
Drinking and Driving PSAs: A Content Analysis of Behavioral Influence Strategies.
ERIC Educational Resources Information Center
Slater, Michael D.
1999-01-01
Study randomly samples 66 drinking and driving television public service announcements that were then coded using a categorical and dimensional scheme. Data set reveals that informational/testimonial messages made up almost half of the total; positive appeals were the next most common, followed by empathy, fear, and modeling appeals. (Contains 34…
Adolescent Pregnancy in an Urban Environment: Issues, Programs, and Evaluation.
ERIC Educational Resources Information Center
Hardy, Janet B.; Zabin, Laurie Schwab
An in-depth discussion of national and local statistics regarding teenage and adolescent pregnancy and the developmental issues involved opens this analysis. Problems and adverse consequences of adolescent pregnancy in an urban setting are explored using a city-wide random sample of adolescent births. A model pregnancy and parenting program and…
Emotional Intelligence, Personality, and Task-Induced Stress
ERIC Educational Resources Information Center
Matthews, Gerald; Emo, Amanda K.; Funke, Gregory; Zeidner, Moshe; Roberts, Richard D.; Costa, Paul T.; Schulze, Ralf
2006-01-01
Emotional intelligence (EI) may predict stress responses and coping strategies in a variety of applied settings. This study compares EI and the personality factors of the Five Factor Model (FFM) as predictors of task-induced stress responses. Participants (N = 200) were randomly assigned to 1 of 4 task conditions, 3 of which were designed to be…
NASA Technical Reports Server (NTRS)
Duda, David P.; Minnis, Patrick
2009-01-01
Straightforward application of the Schmidt-Appleman contrail formation criteria to diagnose persistent contrail occurrence from numerical weather prediction data is hindered by significant bias errors in the upper tropospheric humidity. Logistic models of contrail occurrence have been proposed to overcome this problem, but basic questions remain about how random measurement error may affect their accuracy. A set of 5000 synthetic contrail observations is created to study the effects of random error in these probabilistic models. The simulated observations are based on distributions of temperature, humidity, and vertical velocity derived from Advanced Regional Prediction System (ARPS) weather analyses. The logistic models created from the simulated observations were evaluated using two common statistical measures of model accuracy, the percent correct (PC) and the Hanssen-Kuipers discriminant (HKD). To convert the probabilistic results of the logistic models into a dichotomous yes/no choice suitable for the statistical measures, two critical probability thresholds are considered. The HKD scores are higher when the climatological frequency of contrail occurrence is used as the critical threshold, while the PC scores are higher when the critical probability threshold is 0.5. For both thresholds, typical random errors in temperature, relative humidity, and vertical velocity are found to be small enough to allow for accurate logistic models of contrail occurrence. The accuracy of the models developed from synthetic data is over 85 percent for both the prediction of contrail occurrence and non-occurrence, although in practice, larger errors would be anticipated.
Dai, Junyi; Gunn, Rachel L; Gerst, Kyle R; Busemeyer, Jerome R; Finn, Peter R
2016-10-01
Previous studies have demonstrated that working memory capacity plays a central role in delay discounting in people with externalizing psychopathology. These studies used a hyperbolic discounting model, and its single parameter-a measure of delay discounting-was estimated using the standard method of searching for indifference points between intertemporal options. However, there are several problems with this approach. First, the deterministic perspective on delay discounting underlying the indifference point method might be inappropriate. Second, the estimation procedure using the R2 measure often leads to poor model fit. Third, when parameters are estimated using indifference points only, much of the information collected in a delay discounting decision task is wasted. To overcome these problems, this article proposes a random utility model of delay discounting. The proposed model has 2 parameters, 1 for delay discounting and 1 for choice variability. It was fit to choice data obtained from a recently published data set using both maximum-likelihood and Bayesian parameter estimation. As in previous studies, the delay discounting parameter was significantly associated with both externalizing problems and working memory capacity. Furthermore, choice variability was also found to be significantly associated with both variables. This finding suggests that randomness in decisions may be a mechanism by which externalizing problems and low working memory capacity are associated with poor decision making. The random utility model thus has the advantage of disclosing the role of choice variability, which had been masked by the traditional deterministic model. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Role of Statistical Random-Effects Linear Models in Personalized Medicine
Diaz, Francisco J; Yeh, Hung-Wen; de Leon, Jose
2012-01-01
Some empirical studies and recent developments in pharmacokinetic theory suggest that statistical random-effects linear models are valuable tools that allow describing simultaneously patient populations as a whole and patients as individuals. This remarkable characteristic indicates that these models may be useful in the development of personalized medicine, which aims at finding treatment regimes that are appropriate for particular patients, not just appropriate for the average patient. In fact, published developments show that random-effects linear models may provide a solid theoretical framework for drug dosage individualization in chronic diseases. In particular, individualized dosages computed with these models by means of an empirical Bayesian approach may produce better results than dosages computed with some methods routinely used in therapeutic drug monitoring. This is further supported by published empirical and theoretical findings that show that random effects linear models may provide accurate representations of phase III and IV steady-state pharmacokinetic data, and may be useful for dosage computations. These models have applications in the design of clinical algorithms for drug dosage individualization in chronic diseases; in the computation of dose correction factors; computation of the minimum number of blood samples from a patient that are necessary for calculating an optimal individualized drug dosage in therapeutic drug monitoring; measure of the clinical importance of clinical, demographic, environmental or genetic covariates; study of drug-drug interactions in clinical settings; the implementation of computational tools for web-site-based evidence farming; design of pharmacogenomic studies; and in the development of a pharmacological theory of dosage individualization. PMID:23467392
Kangovi, Shreya; Mitra, Nandita; Turr, Lindsey; Huo, Hairong; Grande, David; Long, Judith A.
2017-01-01
Upstream interventions – e.g. housing programs and community health worker interventions-address socioeconomic and behavioral factors that influence health outcomes across diseases. Studying these types of interventions in clinical trials raises a methodological challenge: how should researchers measure the effect of an upstream intervention in a sample of patients with different diseases? This paper addresses this question using an illustrative protocol of a randomized controlled trial of collaborative-goal setting versus goal-setting plus community health worker support among patients multiple chronic diseases: diabetes, obesity, hypertension and tobacco dependence. At study enrollment, patients met with their primary care providers to select one of their chronic diseases to focus on during the study, and to collaboratively set a goal for that disease. Patients randomly assigned to a community health worker also received six months of support to address socioeconomic and behavioral barriers to chronic disease control. The primary hypothesis was that there would be differences in patients’ selected chronic disease control as measured by HbA1c, body mass index, systolic blood pressure and cigarettes per day, between the goal-setting alone and community health worker support arms. To test this hypothesis, we will conduct a stratum specific multivariate analysis of variance which allows all patients (regardless of their selected chronic disease) to be included in a single model for the primary outcome. Population health researchers can use this approach to measure clinical outcomes across diseases. PMID:27965180
Naserkheil, Masoumeh; Miraie-Ashtiani, Seyed Reza; Nejati-Javaremi, Ardeshir; Son, Jihyun; Lee, Deukhwan
2016-12-01
The objective of this study was to estimate the genetic parameters of milk protein yields in Iranian Holstein dairy cattle. A total of 1,112,082 test-day milk protein yield records of 167,269 first lactation Holstein cows, calved from 1990 to 2010, were analyzed. Estimates of the variance components, heritability, and genetic correlations for milk protein yields were obtained using a random regression test-day model. Milking times, herd, age of recording, year, and month of recording were included as fixed effects in the model. Additive genetic and permanent environmental random effects for the lactation curve were taken into account by applying orthogonal Legendre polynomials of the fourth order in the model. The lowest and highest additive genetic variances were estimated at the beginning and end of lactation, respectively. Permanent environmental variance was higher at both extremes. Residual variance was lowest at the middle of the lactation and contrarily, heritability increased during this period. Maximum heritability was found during the 12th lactation stage (0.213±0.007). Genetic, permanent, and phenotypic correlations among test-days decreased as the interval between consecutive test-days increased. A relatively large data set was used in this study; therefore, the estimated (co)variance components for random regression coefficients could be used for national genetic evaluation of dairy cattle in Iran.
Naserkheil, Masoumeh; Miraie-Ashtiani, Seyed Reza; Nejati-Javaremi, Ardeshir; Son, Jihyun; Lee, Deukhwan
2016-01-01
The objective of this study was to estimate the genetic parameters of milk protein yields in Iranian Holstein dairy cattle. A total of 1,112,082 test-day milk protein yield records of 167,269 first lactation Holstein cows, calved from 1990 to 2010, were analyzed. Estimates of the variance components, heritability, and genetic correlations for milk protein yields were obtained using a random regression test-day model. Milking times, herd, age of recording, year, and month of recording were included as fixed effects in the model. Additive genetic and permanent environmental random effects for the lactation curve were taken into account by applying orthogonal Legendre polynomials of the fourth order in the model. The lowest and highest additive genetic variances were estimated at the beginning and end of lactation, respectively. Permanent environmental variance was higher at both extremes. Residual variance was lowest at the middle of the lactation and contrarily, heritability increased during this period. Maximum heritability was found during the 12th lactation stage (0.213±0.007). Genetic, permanent, and phenotypic correlations among test-days decreased as the interval between consecutive test-days increased. A relatively large data set was used in this study; therefore, the estimated (co)variance components for random regression coefficients could be used for national genetic evaluation of dairy cattle in Iran. PMID:26954192
Modeling of Transionospheric Radio Propagation
1975-08-01
entitled RFMOD, contains the main elements of the scattering theory, the morphological model for ionospheric irregularity strength and other...phasor lies within an elemental area on the complex plain. To begin, we write E as the resultant of its long-term mean (E) and a zero-mean, randomly...totally defined by either of these sets of three parameters (i.e., the three real variances or the real R and the real and imaginary parts of B ). Most
LaRocco-Cockburn, Anna; Reed, Susan D.; Melville, Jennifer; Croicu, Carmen; Russo, Joan; Inspektor, Michal; Edmondson, Eddie; Katon, Wayne
2013-01-01
Background Women have higher rates of depression and often experience depression symptoms during critical reproductive periods, including adolescence, pregnancy, postpartum, and menopause. Collaborative care intervention models for mood disorders in patients receiving care in an OB-GYN clinic setting have not been evaluated. Study design and methodology for a randomized, controlled trial of collaborative care depression management versus usual care in OB-GYN clinics and the details of the adapted collaborative care intervention and model implementation are described in this paper. Methods Women over age 18 years with clinically significant symptoms of depression, as measured by a Patient Health Questionnaire-9 (PHQ-9) score ≥10 and a clinical diagnosis of major depression or dysthymia, were randomized to the study intervention or to usual care and were followed for 18 months. The primary outcome assessed was change over time in the SCL-20 depression scale between baseline and 12 months. Baseline Results 205 women were randomized: 57% white, 20% African American, 9% Asian or Pacific Islander, 7% Hispanic, and 6% Native American. Mean age was 39 years. 4.6% were pregnant and 7.5% were within 12 months postpartum. The majority were single, (52%), and 95% had at least the equivalent of a high school diploma. Almost all patients met DSM IV criteria for major depression (99%) and approximately 33% met criteria for dysthymia. Conclusions An OB-GYN collaborative care team including a social worker, psychiatrist and OB-GYN physician who met weekly and used an electronic tracking system for patients were essential elements of the proposed depression care treatment model described here. Further study of models that improve quality of depression care that are adapted to the unique OB-GYN setting are needed. PMID:23939510
Le, Hai-Ha; Subtil, Fabien; Cerou, Marc; Marchant, Ivanny; Al-Gobari, Muaamar; Fall, Mor; Mimouni, Yanis; Kassaï, Behrouz; Lindholm, Lars; Thijs, Lutgarde; Gueyffier, François
2017-11-01
To construct a sudden death risk score specifically for hypertension (HYSUD) patients with or without cardiovascular history. Data were collected from six randomized controlled trials of antihypertensive treatments with 8044 women and 17 604 men differing in age ranges and blood pressure eligibility criteria. In total, 345 sudden deaths (1.35%) occurred during a mean follow-up of 5.16 years. Risk factors of sudden death were examined using a multivariable Cox proportional hazards model adjusted on trials. The model was transformed to an integer system, with points added for each factor according to its association with sudden death risk. Antihypertensive treatment was not associated with a reduction of the sudden death risk and had no interaction with other factors, allowing model development on both treatment and placebo groups. A risk score of sudden death in 5 years was built with seven significant risk factors: age, sex, SBP, serum total cholesterol, cigarette smoking, diabetes, and history of myocardial infarction. In terms of discrimination performance, HYSUD model was adequate with areas under the receiver operating characteristic curve of 77.74% (confidence interval 95%, 74.13-81.35) for the derivation set, of 77.46% (74.09-80.83) for the validation set, and of 79.17% (75.94-82.40) for the whole population. Our work provides a simple risk-scoring system for sudden death prediction in hypertension, using individual data from six randomized controlled trials of antihypertensive treatments. HYSUD score could help assessing a hypertensive individual's risk of sudden death and optimizing preventive therapeutic strategies for these patients.
Newton, Paul K; Mason, Jeremy; Venkatappa, Neethi; Jochelson, Maxine S; Hurt, Brian; Nieva, Jorge; Comen, Elizabeth; Norton, Larry; Kuhn, Peter
2015-01-01
Background: Cancer cell migration patterns are critical for understanding metastases and clinical evolution. Breast cancer spreads from one organ system to another via hematogenous and lymphatic routes. Although patterns of spread may superficially seem random and unpredictable, we explored the possibility that this is not the case. Aims: Develop a Markov based model of breast cancer progression that has predictive capability. Methods: On the basis of a longitudinal data set of 446 breast cancer patients, we created a Markov chain model of metastasis that describes the probabilities of metastasis occurring at a given anatomic site together with the probability of spread to additional sites. Progression is modeled as a random walk on a directed graph, where nodes represent anatomical sites where tumors can develop. Results: We quantify how survival depends on the location of the first metastatic site for different patient subcategories. In addition, we classify metastatic sites as “sponges” or “spreaders” with implications regarding anatomical pathway prediction and long-term survival. As metastatic tumors to the bone (main spreader) are most prominent, we focus in more detail on differences between groups of patients who form subsequent metastases to the lung as compared with the liver. Conclusions: We have found that spatiotemporal patterns of metastatic spread in breast cancer are neither random nor unpredictable. Furthermore, the novel concept of classifying organ sites as sponges or spreaders may motivate experiments seeking a biological basis for these phenomena and allow us to quantify the potential consequences of therapeutic targeting of sites in the oligometastatic setting and shed light on organotropic aspects of the disease. PMID:28721371
NASA Astrophysics Data System (ADS)
Wang, Lei; Xiong, Chuang; Wang, Xiaojun; Li, Yunlong; Xu, Menghui
2018-04-01
Considering that multi-source uncertainties from inherent nature as well as the external environment are unavoidable and severely affect the controller performance, the dynamic safety assessment with high confidence is of great significance for scientists and engineers. In view of this, the uncertainty quantification analysis and time-variant reliability estimation corresponding to the closed-loop control problems are conducted in this study under a mixture of random, interval, and convex uncertainties. By combining the state-space transformation and the natural set expansion, the boundary laws of controlled response histories are first confirmed with specific implementation of random items. For nonlinear cases, the collocation set methodology and fourth Rounge-Kutta algorithm are introduced as well. Enlightened by the first-passage model in random process theory as well as by the static probabilistic reliability ideas, a new definition of the hybrid time-variant reliability measurement is provided for the vibration control systems and the related solution details are further expounded. Two engineering examples are eventually presented to demonstrate the validity and applicability of the methodology developed.
NASA Astrophysics Data System (ADS)
Sposini, Vittoria; Chechkin, Aleksei V.; Seno, Flavio; Pagnini, Gianni; Metzler, Ralf
2018-04-01
A considerable number of systems have recently been reported in which Brownian yet non-Gaussian dynamics was observed. These are processes characterised by a linear growth in time of the mean squared displacement, yet the probability density function of the particle displacement is distinctly non-Gaussian, and often of exponential (Laplace) shape. This apparently ubiquitous behaviour observed in very different physical systems has been interpreted as resulting from diffusion in inhomogeneous environments and mathematically represented through a variable, stochastic diffusion coefficient. Indeed different models describing a fluctuating diffusivity have been studied. Here we present a new view of the stochastic basis describing time-dependent random diffusivities within a broad spectrum of distributions. Concretely, our study is based on the very generic class of the generalised Gamma distribution. Two models for the particle spreading in such random diffusivity settings are studied. The first belongs to the class of generalised grey Brownian motion while the second follows from the idea of diffusing diffusivities. The two processes exhibit significant characteristics which reproduce experimental results from different biological and physical systems. We promote these two physical models for the description of stochastic particle motion in complex environments.
Pharmacophore modeling and virtual screening to identify potential RET kinase inhibitors.
Shih, Kuei-Chung; Shiau, Chung-Wai; Chen, Ting-Shou; Ko, Ching-Huai; Lin, Chih-Lung; Lin, Chun-Yuan; Hwang, Chrong-Shiong; Tang, Chuan-Yi; Chen, Wan-Ru; Huang, Jui-Wen
2011-08-01
Chemical features based 3D pharmacophore model for REarranged during Transfection (RET) tyrosine kinase were developed by using a training set of 26 structurally diverse known RET inhibitors. The best pharmacophore hypothesis, which identified inhibitors with an associated correlation coefficient of 0.90 between their experimental and estimated anti-RET values, contained one hydrogen-bond acceptor, one hydrogen-bond donor, one hydrophobic, and one ring aromatic features. The model was further validated by a testing set, Fischer's randomization test, and goodness of hit (GH) test. We applied this pharmacophore model to screen NCI database for potential RET inhibitors. The hits were docked to RET with GOLD and CDOCKER after filtering by Lipinski's rules. Ultimately, 24 molecules were selected as potential RET inhibitors for further investigation. Copyright © 2011 Elsevier Ltd. All rights reserved.
Modeling the rate of HIV testing from repeated binary data amidst potential never-testers.
Rice, John D; Johnson, Brent A; Strawderman, Robert L
2018-01-04
Many longitudinal studies with a binary outcome measure involve a fraction of subjects with a homogeneous response profile. In our motivating data set, a study on the rate of human immunodeficiency virus (HIV) self-testing in a population of men who have sex with men (MSM), a substantial proportion of the subjects did not self-test during the follow-up study. The observed data in this context consist of a binary sequence for each subject indicating whether or not that subject experienced any events between consecutive observation time points, so subjects who never self-tested were observed to have a response vector consisting entirely of zeros. Conventional longitudinal analysis is not equipped to handle questions regarding the rate of events (as opposed to the odds, as in the classical logistic regression model). With the exception of discrete mixture models, such methods are also not equipped to handle settings in which there may exist a group of subjects for whom no events will ever occur, i.e. a so-called "never-responder" group. In this article, we model the observed data assuming that events occur according to some unobserved continuous-time stochastic process. In particular, we consider the underlying subject-specific processes to be Poisson conditional on some unobserved frailty, leading to a natural focus on modeling event rates. Specifically, we propose to use the power variance function (PVF) family of frailty distributions, which contains both the gamma and inverse Gaussian distributions as special cases and allows for the existence of a class of subjects having zero frailty. We generalize a computational algorithm developed for a log-gamma random intercept model (Conaway, 1990. A random effects model for binary data. Biometrics46, 317-328) to compute the exact marginal likelihood, which is then maximized to obtain estimates of model parameters. We conduct simulation studies, exploring the performance of the proposed method in comparison with competitors. Applying the PVF as well as a Gaussian random intercept model and a corresponding discrete mixture model to our motivating data set, we conclude that the group assigned to receive follow-up messages via SMS was self-testing at a significantly lower rate than the control group, but that there is no evidence to support the existence of a group of never-testers. © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Li, Hang; Wang, Maolin; Gong, Ya-Nan; Yan, Aixia
2016-01-01
β-secretase (BACE1) is an aspartyl protease, which is considered as a novel vital target in Alzheimer`s disease therapy. We collected a data set of 294 BACE1 inhibitors, and built six classification models to discriminate active and weakly active inhibitors using Kohonen's Self-Organizing Map (SOM) method and Support Vector Machine (SVM) method. Each molecular descriptor was calculated using the program ADRIANA.Code. We adopted two different methods: random method and Self-Organizing Map method, for training/test set split. The descriptors were selected by F-score and stepwise linear regression analysis. The best SVM model Model2C has a good prediction performance on test set with prediction accuracy, sensitivity (SE), specificity (SP) and Matthews correlation coefficient (MCC) of 89.02%, 90%, 88%, 0.78, respectively. Model 1A is the best SOM model, whose accuracy and MCC of the test set were 94.57% and 0.98, respectively. The lone pair electronegativity and polarizability related descriptors importantly contributed to bioactivity of BACE1 inhibitor. The Extended-Connectivity Finger-Prints_4 (ECFP_4) analysis found some vitally key substructural features, which could be helpful for further drug design research. The SOM and SVM models built in this study can be obtained from the authors by email or other contacts.
Yang, Lanlin; Cai, Sufen; Zhang, Shuoping; Kong, Xiangyi; Gu, Yifan; Lu, Changfu; Dai, Jing; Gong, Fei; Lu, Guangxiu; Lin, Ge
2018-05-01
Does single cleavage-stage (Day 3) embryo transfer using a time-lapse (TL) hierarchical classification model achieve comparable ongoing pregnancy rates (OPR) to single blastocyst (Day 5) transfer by conventional morphological (CM) selection? Day 3 single embryo transfer (SET) with a hierarchical classification model had a significantly lower OPR compared with Day 5 SET with CM selection. Cleavage-stage SET is an alternative to blastocyst SET. Time-lapse imaging assists better embryo selection, based on studies of pregnancy outcomes when adding time-lapse imaging to CM selection at the cleavage or blastocyst stage. This single-centre, randomized, open-label, active-controlled, non-inferiority study included 600 women between October 2015 and April 2017. Eligible patients were Chinese females, aged ≤36 years, who were undergoing their first or second fresh IVF cycle using their own oocytes, and who had FSH levels ≤12 IU/mL on Day 3 of the cycle and 10 or more oocytes retrieved. Patients who had underlying uterine conditions, oocyte donation, recurrent pregnancy loss, abnormal oocytes or <6 normally fertilized embryos (2PN) were excluded from the study participation. Patients were randomized 1:1 to either the cleavage-stage SET with a time-lapse hierarchical classification model for selection (D3 + TL) or blastocyst SET with CM selection (D5 + CM). All normally fertilized zygotes were cultured in Primo Vision. The study was conducted at a tertiary IVF centre (CITIC-Xiangya) and OPR was the primary outcome. A total of 600 patients were randomized to the two groups, among which 585 (D3 + TL = 290, D5 + CM = 295) were included in the Modified-intention-to-treat (mITT) population and 517 (D3 + TL = 261, D5 + CM = 256) were included in the PP population. In the per protocol (PP) population, OPR was significantly lower in the D3 group (59.4%, 155/261) than in the D5 group (68.4%, 175/256) (difference: -9.0%, 95% CI: -17.1%, -0.7%, P = 0.03). Analysis in mITT population showed a marginally significant difference in the OPR between the D3 + TL and D5 + CM groups (56.6 versus 64.1%, difference: -7.5%, 95% CI: -15.4%, 0.4%, P = 0.06). The D3 + TL group resulted in a markedly lower implantation rate than the D5 + CM group (64.4 versus 77.0%; P = 0.002) in the PP analysis, however, the early miscarriage rate did not significantly differ between the two groups. The study lacked a direct comparison between time-lapse and CM selections at cleavage-stage SET and was statistically underpowered to detect non-inferiority. The subject's eligibility criteria favouring women with a good prognosis for IVF weakened the generalizability of the results. The OPR from Day 3 cleavage-stage SET using hierarchical classification time-lapse selection was significantly lower compared with that from Day 5 blastocyst SET using conventional morphology, yet it appeared to be clinically acceptable in women underwent IVF. This study is supported by grants from Ferring Pharmaceuticals and the Program for New Century Excellent Talents in University, China. ChiCTR-ICR-15006600. 16 June 2015. 1 October 2015.
Machine Learning for Treatment Assignment: Improving Individualized Risk Attribution
Weiss, Jeremy; Kuusisto, Finn; Boyd, Kendrick; Liu, Jie; Page, David
2015-01-01
Clinical studies model the average treatment effect (ATE), but apply this population-level effect to future individuals. Due to recent developments of machine learning algorithms with useful statistical guarantees, we argue instead for modeling the individualized treatment effect (ITE), which has better applicability to new patients. We compare ATE-estimation using randomized and observational analysis methods against ITE-estimation using machine learning, and describe how the ITE theoretically generalizes to new population distributions, whereas the ATE may not. On a synthetic data set of statin use and myocardial infarction (MI), we show that a learned ITE model improves true ITE estimation and outperforms the ATE. We additionally argue that ITE models should be learned with a consistent, nonparametric algorithm from unweighted examples and show experiments in favor of our argument using our synthetic data model and a real data set of D-penicillamine use for primary biliary cirrhosis. PMID:26958271
Machine Learning for Treatment Assignment: Improving Individualized Risk Attribution.
Weiss, Jeremy; Kuusisto, Finn; Boyd, Kendrick; Liu, Jie; Page, David
2015-01-01
Clinical studies model the average treatment effect (ATE), but apply this population-level effect to future individuals. Due to recent developments of machine learning algorithms with useful statistical guarantees, we argue instead for modeling the individualized treatment effect (ITE), which has better applicability to new patients. We compare ATE-estimation using randomized and observational analysis methods against ITE-estimation using machine learning, and describe how the ITE theoretically generalizes to new population distributions, whereas the ATE may not. On a synthetic data set of statin use and myocardial infarction (MI), we show that a learned ITE model improves true ITE estimation and outperforms the ATE. We additionally argue that ITE models should be learned with a consistent, nonparametric algorithm from unweighted examples and show experiments in favor of our argument using our synthetic data model and a real data set of D-penicillamine use for primary biliary cirrhosis.
Robustly Aligning a Shape Model and Its Application to Car Alignment of Unknown Pose.
Li, Yan; Gu, Leon; Kanade, Takeo
2011-09-01
Precisely localizing in an image a set of feature points that form a shape of an object, such as car or face, is called alignment. Previous shape alignment methods attempted to fit a whole shape model to the observed data, based on the assumption of Gaussian observation noise and the associated regularization process. However, such an approach, though able to deal with Gaussian noise in feature detection, turns out not to be robust or precise because it is vulnerable to gross feature detection errors or outliers resulting from partial occlusions or spurious features from the background or neighboring objects. We address this problem by adopting a randomized hypothesis-and-test approach. First, a Bayesian inference algorithm is developed to generate a shape-and-pose hypothesis of the object from a partial shape or a subset of feature points. For alignment, a large number of hypotheses are generated by randomly sampling subsets of feature points, and then evaluated to find the one that minimizes the shape prediction error. This method of randomized subset-based matching can effectively handle outliers and recover the correct object shape. We apply this approach on a challenging data set of over 5,000 different-posed car images, spanning a wide variety of car types, lighting, background scenes, and partial occlusions. Experimental results demonstrate favorable improvements over previous methods on both accuracy and robustness.
A Comparison of Two Balance Calibration Model Building Methods
NASA Technical Reports Server (NTRS)
DeLoach, Richard; Ulbrich, Norbert
2007-01-01
Simulated strain-gage balance calibration data is used to compare the accuracy of two balance calibration model building methods for different noise environments and calibration experiment designs. The first building method obtains a math model for the analysis of balance calibration data after applying a candidate math model search algorithm to the calibration data set. The second building method uses stepwise regression analysis in order to construct a model for the analysis. Four balance calibration data sets were simulated in order to compare the accuracy of the two math model building methods. The simulated data sets were prepared using the traditional One Factor At a Time (OFAT) technique and the Modern Design of Experiments (MDOE) approach. Random and systematic errors were introduced in the simulated calibration data sets in order to study their influence on the math model building methods. Residuals of the fitted calibration responses and other statistical metrics were compared in order to evaluate the calibration models developed with different combinations of noise environment, experiment design, and model building method. Overall, predicted math models and residuals of both math model building methods show very good agreement. Significant differences in model quality were attributable to noise environment, experiment design, and their interaction. Generally, the addition of systematic error significantly degraded the quality of calibration models developed from OFAT data by either method, but MDOE experiment designs were more robust with respect to the introduction of a systematic component of the unexplained variance.
Mesoscale model response to random, surface-based perturbations — A sea-breeze experiment
NASA Astrophysics Data System (ADS)
Garratt, J. R.; Pielke, R. A.; Miller, W. F.; Lee, T. J.
1990-09-01
The introduction into a mesoscale model of random (in space) variations in roughness length, or random (in space and time) surface perturbations of temperature and friction velocity, produces a measurable, but barely significant, response in the simulated flow dynamics of the lower atmosphere. The perturbations are an attempt to include the effects of sub-grid variability into the ensemble-mean parameterization schemes used in many numerical models. Their magnitude is set in our experiments by appeal to real-world observations of the spatial variations in roughness length and daytime surface temperature over the land on horizontal scales of one to several tens of kilometers. With sea-breeze simulations, comparisons of a number of realizations forced by roughness-length and surface-temperature perturbations with the standard simulation reveal no significant change in ensemble mean statistics, and only small changes in the sea-breeze vertical velocity. Changes in the updraft velocity for individual runs, of up to several cms-1 (compared to a mean of 14 cms-1), are directly the result of prefrontal temperature changes of 0.1 to 0.2K, produced by the random surface forcing. The correlation and magnitude of the changes are entirely consistent with a gravity-current interpretation of the sea breeze.
Zhuang, Xiaodong; Guo, Yue; Ni, Ao; Yang, Daya; Liao, Lizhen; Zhang, Shaozhao; Zhou, Huimin; Sun, Xiuting; Wang, Lichun; Wang, Xueqin; Liao, Xinxue
2018-06-04
An environment-wide association study (EWAS) may be useful to comprehensively test and validate associations between environmental factors and cardiovascular disease (CVD) in an unbiased manner. Data from National Health and Nutrition Examination Survey (1999-2014) were randomly 50:50 spilt into training set and testing set. CVD was ascertained by a self-reported diagnosis of myocardial infarction, coronary heart disease or stroke. We performed multiple linear regression analyses associating 203 environmental factors and 132 clinical phenotypes with CVD in training set (false discovery rate < 5%) and significant factors were validated in the testing set (P < 0.05). Random forest (RF) model was used for multicollinearity elimination and variable importance ranking. Discriminative power of factors for CVD was calculated by area under the receiver operating characteristic (AUROC). Overall, 43,568 participants with 4084 (9.4%) CVD were included. After adjusting for age, sex, race, body mass index, blood pressure and socio-economic level, we identified 5 environmental variables and 19 clinical phenotypes associated with CVD in training and testing dataset. Top five factors in RF importance ranking were: waist, glucose, uric acid, and red cell distribution width and glycated hemoglobin. AUROC of the RF model was 0.816 (top 5 factors) and 0.819 (full model). Sensitivity analyses reveal no specific moderators of the associations. Our systematic evaluation provides new knowledge on the complex array of environmental correlates of CVD. These identified correlates may serve as a complementary approach to CVD risk assessment. Our findings need to be probed in further observational and interventional studies. Copyright © 2018. Published by Elsevier Ltd.
2009-08-01
Bryant, R, Engel, CC (2004). A therapist-assisted internet self-help program for traumatic stress . Professional Psychology: Research and Practice, 35...Combat-Related PTSD in Military Primary Healthcare Settings: A Randomized Trial of “DESTRESS-PC” PRINCIPAL INVESTIGATOR: Charles Engel...Early Resilience Intervention for Combat-Related PTSD in Military Primary Healthcare Settings: A Randomized Trial of DESTRESS-PC 5b. GRANT NUMBER
McMillen, J Curtis; Narendorf, Sarah Carter; Robinson, Debra; Havlicek, Judy; Fedoravicius, Nicole; Bertram, Julie; McNelly, David
2015-01-01
Older youth in out-of-home care often live in restrictive settings and face psychiatric issues without sufficient family support. This paper reports on the development and piloting of a manualized treatment foster care program designed to step down older youth with high psychiatric needs from residential programs to treatment foster care homes. A team of researchers and agency partners set out to develop a treatment foster care model for older youth based on Multi-dimensional Treatment Foster Care (MTFC). After matching youth by mental health condition and determining for whom randomization would be allowed, 14 youth were randomized to treatment as usual or a treatment foster home intervention. Stakeholders were interviewed qualitatively at multiple time points. Quantitative measures assessed mental health symptoms, days in locked facilities, employment and educational outcomes. Development efforts led to substantial variations from the MTFC model and a new model, Treatment Foster Care for Older Youth was piloted. Feasibility monitoring suggested that it was difficult, but possible to recruit and randomize youth from and out of residential homes and that foster parents could be recruited to serve them. Qualitative data pointed to some qualified clinical successes. Stakeholders viewed two team roles - that of psychiatric nurse and skills coaches - very highly. However, results also suggested that foster parents and some staff did not tolerate the intervention well and struggled to address the emotion dysregulation issues of the young people they served. Quantitative data demonstrated that the intervention was not keeping youth out of locked facilities. The intervention needed further refinement prior to a broader trial. Intervention development work continued until components were developed to help address emotion regulation problems among fostered youth. Psychiatric nurses and skills coaches who work with youth in community settings hold promise as important supports for older youth with psychiatric needs.
NASA Astrophysics Data System (ADS)
Hirakawa, E. T.; Pitarka, A.; Mellors, R. J.
2015-12-01
Evan Hirakawa, Arben Pitarka, and Robert Mellors One challenging task in explosion seismology is development of physical models for explaining the generation of S-waves during underground explosions. Pitarka et al. (2015) used finite difference simulations of SPE-3 (part of Source Physics Experiment, SPE, an ongoing series of underground chemical explosions at the Nevada National Security Site) and found that while a large component of shear motion was generated directly at the source, additional scattering from heterogeneous velocity structure and topography are necessary to better match the data. Large-scale features in the velocity model used in the SPE simulations are well constrained, however, small-scale heterogeneity is poorly constrained. In our study we used a stochastic representation of small-scale variability in order to produce additional high-frequency scattering. Two methods for generating the distributions of random scatterers are tested. The first is done in the spatial domain by essentially smoothing a set of random numbers over an ellipsoidal volume using a Gaussian weighting function. The second method consists of filtering a set of random numbers in the wavenumber domain to obtain a set of heterogeneities with a desired statistical distribution (Frankel and Clayton, 1986). This method is capable of generating distributions with either Gaussian or von Karman autocorrelation functions. The key parameters that affect scattering are the correlation length, the standard deviation of velocity for the heterogeneities, and the Hurst exponent, which is only present in the von Karman media. Overall, we find that shorter correlation lengths as well as higher standard deviations result in increased tangential motion in the frequency band of interest (0 - 10 Hz). This occurs partially through S-wave refraction, but mostly by P-S and Rg-S waves conversions. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344
Fang, Yun; Wu, Hulin; Zhu, Li-Xing
2011-07-01
We propose a two-stage estimation method for random coefficient ordinary differential equation (ODE) models. A maximum pseudo-likelihood estimator (MPLE) is derived based on a mixed-effects modeling approach and its asymptotic properties for population parameters are established. The proposed method does not require repeatedly solving ODEs, and is computationally efficient although it does pay a price with the loss of some estimation efficiency. However, the method does offer an alternative approach when the exact likelihood approach fails due to model complexity and high-dimensional parameter space, and it can also serve as a method to obtain the starting estimates for more accurate estimation methods. In addition, the proposed method does not need to specify the initial values of state variables and preserves all the advantages of the mixed-effects modeling approach. The finite sample properties of the proposed estimator are studied via Monte Carlo simulations and the methodology is also illustrated with application to an AIDS clinical data set.
Failure of self-consistency in the discrete resource model of visual working memory.
Bays, Paul M
2018-06-03
The discrete resource model of working memory proposes that each individual has a fixed upper limit on the number of items they can store at one time, due to division of memory into a few independent "slots". According to this model, responses on short-term memory tasks consist of a mixture of noisy recall (when the tested item is in memory) and random guessing (when the item is not in memory). This provides two opportunities to estimate capacity for each observer: first, based on their frequency of random guesses, and second, based on the set size at which the variability of stored items reaches a plateau. The discrete resource model makes the simple prediction that these two estimates will coincide. Data from eight published visual working memory experiments provide strong evidence against such a correspondence. These results present a challenge for discrete models of working memory that impose a fixed capacity limit. Copyright © 2018 The Author. Published by Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Vile, Douglas J.
In radiation therapy, interfraction organ motion introduces a level of geometric uncertainty into the planning process. Plans, which are typically based upon a single instance of anatomy, must be robust against daily anatomical variations. For this problem, a model of the magnitude, direction, and likelihood of deformation is useful. In this thesis, principal component analysis (PCA) is used to statistically model the 3D organ motion for 19 prostate cancer patients, each with 8-13 fractional computed tomography (CT) images. Deformable image registration and the resultant displacement vector fields (DVFs) are used to quantify the interfraction systematic and random motion. By applying the PCA technique to the random DVFs, principal modes of random tissue deformation were determined for each patient, and a method for sampling synthetic random DVFs was developed. The PCA model was then extended to describe the principal modes of systematic and random organ motion for the population of patients. A leave-one-out study tested both the systematic and random motion model's ability to represent PCA training set DVFs. The random and systematic DVF PCA models allowed the reconstruction of these data with absolute mean errors between 0.5-0.9 mm and 1-2 mm, respectively. To the best of the author's knowledge, this study is the first successful effort to build a fully 3D statistical PCA model of systematic tissue deformation in a population of patients. By sampling synthetic systematic and random errors, organ occupancy maps were created for bony and prostate-centroid patient setup processes. By thresholding these maps, PCA-based planning target volume (PTV) was created and tested against conventional margin recipes (van Herk for bony alignment and 5 mm fixed [3 mm posterior] margin for centroid alignment) in a virtual clinical trial for low-risk prostate cancer. Deformably accumulated delivered dose served as a surrogate for clinical outcome. For the bony landmark setup subtrial, the PCA PTV significantly (p<0.05) reduced D30, D20, and D5 to bladder and D50 to rectum, while increasing rectal D20 and D5. For the centroid-aligned setup, the PCA PTV significantly reduced all bladder DVH metrics and trended to lower rectal toxicity metrics. All PTVs covered the prostate with the prescription dose.
An Approach for Dynamic Optimization of Prevention Program Implementation in Stochastic Environments
NASA Astrophysics Data System (ADS)
Kang, Yuncheol; Prabhu, Vittal
The science of preventing youth problems has significantly advanced in developing evidence-based prevention program (EBP) by using randomized clinical trials. Effective EBP can reduce delinquency, aggression, violence, bullying and substance abuse among youth. Unfortunately the outcomes of EBP implemented in natural settings usually tend to be lower than in clinical trials, which has motivated the need to study EBP implementations. In this paper we propose to model EBP implementations in natural settings as stochastic dynamic processes. Specifically, we propose Markov Decision Process (MDP) for modeling and dynamic optimization of such EBP implementations. We illustrate these concepts using simple numerical examples and discuss potential challenges in using such approaches in practice.
Luo, Wei; Phung, Dinh; Tran, Truyen; Gupta, Sunil; Rana, Santu; Karmakar, Chandan; Shilton, Alistair; Yearwood, John; Dimitrova, Nevenka; Ho, Tu Bao; Venkatesh, Svetha; Berk, Michael
2016-12-16
As more and more researchers are turning to big data for new opportunities of biomedical discoveries, machine learning models, as the backbone of big data analysis, are mentioned more often in biomedical journals. However, owing to the inherent complexity of machine learning methods, they are prone to misuse. Because of the flexibility in specifying machine learning models, the results are often insufficiently reported in research articles, hindering reliable assessment of model validity and consistent interpretation of model outputs. To attain a set of guidelines on the use of machine learning predictive models within clinical settings to make sure the models are correctly applied and sufficiently reported so that true discoveries can be distinguished from random coincidence. A multidisciplinary panel of machine learning experts, clinicians, and traditional statisticians were interviewed, using an iterative process in accordance with the Delphi method. The process produced a set of guidelines that consists of (1) a list of reporting items to be included in a research article and (2) a set of practical sequential steps for developing predictive models. A set of guidelines was generated to enable correct application of machine learning models and consistent reporting of model specifications and results in biomedical research. We believe that such guidelines will accelerate the adoption of big data analysis, particularly with machine learning methods, in the biomedical research community. ©Wei Luo, Dinh Phung, Truyen Tran, Sunil Gupta, Santu Rana, Chandan Karmakar, Alistair Shilton, John Yearwood, Nevenka Dimitrova, Tu Bao Ho, Svetha Venkatesh, Michael Berk. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 16.12.2016.
Royle, J. Andrew; Chandler, Richard B.; Yackulic, Charles; Nichols, James D.
2012-01-01
1. Understanding the factors affecting species occurrence is a pre-eminent focus of applied ecological research. However, direct information about species occurrence is lacking for many species. Instead, researchers sometimes have to rely on so-called presence-only data (i.e. when no direct information about absences is available), which often results from opportunistic, unstructured sampling. MAXENT is a widely used software program designed to model and map species distribution using presence-only data. 2. We provide a critical review of MAXENT as applied to species distribution modelling and discuss how it can lead to inferential errors. A chief concern is that MAXENT produces a number of poorly defined indices that are not directly related to the actual parameter of interest – the probability of occurrence (ψ). This focus on an index was motivated by the belief that it is not possible to estimate ψ from presence-only data; however, we demonstrate that ψ is identifiable using conventional likelihood methods under the assumptions of random sampling and constant probability of species detection. 3. The model is implemented in a convenient r package which we use to apply the model to simulated data and data from the North American Breeding Bird Survey. We demonstrate that MAXENT produces extreme under-predictions when compared to estimates produced by logistic regression which uses the full (presence/absence) data set. We note that MAXENT predictions are extremely sensitive to specification of the background prevalence, which is not objectively estimated using the MAXENT method. 4. As with MAXENT, formal model-based inference requires a random sample of presence locations. Many presence-only data sets, such as those based on museum records and herbarium collections, may not satisfy this assumption. However, when sampling is random, we believe that inference should be based on formal methods that facilitate inference about interpretable ecological quantities instead of vaguely defined indices.
Kilicoglu, Halil; Shin, Dongwook; Rindflesch, Thomas C.
2014-01-01
Gene regulatory networks are a crucial aspect of systems biology in describing molecular mechanisms of the cell. Various computational models rely on random gene selection to infer such networks from microarray data. While incorporation of prior knowledge into data analysis has been deemed important, in practice, it has generally been limited to referencing genes in probe sets and using curated knowledge bases. We investigate the impact of augmenting microarray data with semantic relations automatically extracted from the literature, with the view that relations encoding gene/protein interactions eliminate the need for random selection of components in non-exhaustive approaches, producing a more accurate model of cellular behavior. A genetic algorithm is then used to optimize the strength of interactions using microarray data and an artificial neural network fitness function. The result is a directed and weighted network providing the individual contribution of each gene to its target. For testing, we used invasive ductile carcinoma of the breast to query the literature and a microarray set containing gene expression changes in these cells over several time points. Our model demonstrates significantly better fitness than the state-of-the-art model, which relies on an initial random selection of genes. Comparison to the component pathways of the KEGG Pathways in Cancer map reveals that the resulting networks contain both known and novel relationships. The p53 pathway results were manually validated in the literature. 60% of non-KEGG relationships were supported (74% for highly weighted interactions). The method was then applied to yeast data and our model again outperformed the comparison model. Our results demonstrate the advantage of combining gene interactions extracted from the literature in the form of semantic relations with microarray analysis in generating contribution-weighted gene regulatory networks. This methodology can make a significant contribution to understanding the complex interactions involved in cellular behavior and molecular physiology. PMID:24921649
Chen, Guocai; Cairelli, Michael J; Kilicoglu, Halil; Shin, Dongwook; Rindflesch, Thomas C
2014-06-01
Gene regulatory networks are a crucial aspect of systems biology in describing molecular mechanisms of the cell. Various computational models rely on random gene selection to infer such networks from microarray data. While incorporation of prior knowledge into data analysis has been deemed important, in practice, it has generally been limited to referencing genes in probe sets and using curated knowledge bases. We investigate the impact of augmenting microarray data with semantic relations automatically extracted from the literature, with the view that relations encoding gene/protein interactions eliminate the need for random selection of components in non-exhaustive approaches, producing a more accurate model of cellular behavior. A genetic algorithm is then used to optimize the strength of interactions using microarray data and an artificial neural network fitness function. The result is a directed and weighted network providing the individual contribution of each gene to its target. For testing, we used invasive ductile carcinoma of the breast to query the literature and a microarray set containing gene expression changes in these cells over several time points. Our model demonstrates significantly better fitness than the state-of-the-art model, which relies on an initial random selection of genes. Comparison to the component pathways of the KEGG Pathways in Cancer map reveals that the resulting networks contain both known and novel relationships. The p53 pathway results were manually validated in the literature. 60% of non-KEGG relationships were supported (74% for highly weighted interactions). The method was then applied to yeast data and our model again outperformed the comparison model. Our results demonstrate the advantage of combining gene interactions extracted from the literature in the form of semantic relations with microarray analysis in generating contribution-weighted gene regulatory networks. This methodology can make a significant contribution to understanding the complex interactions involved in cellular behavior and molecular physiology.
Li, Changyang; Wang, Xiuying; Eberl, Stefan; Fulham, Michael; Yin, Yong; Dagan Feng, David
2015-01-01
Automated and general medical image segmentation can be challenging because the foreground and the background may have complicated and overlapping density distributions in medical imaging. Conventional region-based level set algorithms often assume piecewise constant or piecewise smooth for segments, which are implausible for general medical image segmentation. Furthermore, low contrast and noise make identification of the boundaries between foreground and background difficult for edge-based level set algorithms. Thus, to address these problems, we suggest a supervised variational level set segmentation model to harness the statistical region energy functional with a weighted probability approximation. Our approach models the region density distributions by using the mixture-of-mixtures Gaussian model to better approximate real intensity distributions and distinguish statistical intensity differences between foreground and background. The region-based statistical model in our algorithm can intuitively provide better performance on noisy images. We constructed a weighted probability map on graphs to incorporate spatial indications from user input with a contextual constraint based on the minimization of contextual graphs energy functional. We measured the performance of our approach on ten noisy synthetic images and 58 medical datasets with heterogeneous intensities and ill-defined boundaries and compared our technique to the Chan-Vese region-based level set model, the geodesic active contour model with distance regularization, and the random walker model. Our method consistently achieved the highest Dice similarity coefficient when compared to the other methods.
Query construction, entropy, and generalization in neural-network models
NASA Astrophysics Data System (ADS)
Sollich, Peter
1994-05-01
We study query construction algorithms, which aim at improving the generalization ability of systems that learn from examples by choosing optimal, nonredundant training sets. We set up a general probabilistic framework for deriving such algorithms from the requirement of optimizing a suitable objective function; specifically, we consider the objective functions entropy (or information gain) and generalization error. For two learning scenarios, the high-low game and the linear perceptron, we evaluate the generalization performance obtained by applying the corresponding query construction algorithms and compare it to training on random examples. We find qualitative differences between the two scenarios due to the different structure of the underlying rules (nonlinear and ``noninvertible'' versus linear); in particular, for the linear perceptron, random examples lead to the same generalization ability as a sequence of queries in the limit of an infinite number of examples. We also investigate learning algorithms which are ill matched to the learning environment and find that, in this case, minimum entropy queries can in fact yield a lower generalization ability than random examples. Finally, we study the efficiency of single queries and its dependence on the learning history, i.e., on whether the previous training examples were generated randomly or by querying, and the difference between globally and locally optimal query construction.
NASA Astrophysics Data System (ADS)
Cisneros, Sophia
2013-04-01
We present a new, heuristic, two-parameter model for predicting the rotation curves of disc galaxies. The model is tested on (22) randomly chosen galaxies, represented in 35 data sets. This Lorentz Convolution [LC] model is derived from a non-linear, relativistic solution of a Kerr-type wave equation, where small changes in the photon's frequencies, resulting from the curved space time, are convolved into a sequence of Lorentz transformations. The LC model is parametrized with only the diffuse, luminous stellar and gaseous masses reported with each data set of observations used. The LC model predicts observed rotation curves across a wide range of disk galaxies. The LC model was constructed to occupy the same place in the explanation of rotation curves that Dark Matter does, so that a simple investigation of the relation between luminous and dark matter might be made, via by a parameter (a). We find the parameter (a) to demonstrate interesting structure. We compare the new model prediction to both the NFW model and MOND fits when available.
GrowYourIC: A Step Toward a Coherent Model of the Earth's Inner Core Seismic Structure
NASA Astrophysics Data System (ADS)
Lasbleis, Marine; Waszek, Lauren; Day, Elizabeth A.
2017-11-01
A complex inner core structure has been well established from seismic studies, showing radial and lateral heterogeneities at various length scales. Yet no geodynamic model is able to explain all the features observed. One of the main limits for this is the lack of tools to compare seismic observations and numerical models successfully. We use here a new Python tool called GrowYourIC to compare models of inner core structure. We calculate properties of geodynamic models of the inner core along seismic raypaths, for random or user-specified data sets. We test kinematic models which simulate fast lateral translation, superrotation, and differential growth. We explore first the influence on a real inner core data set, which has a sparse coverage of the inner core boundary. Such a data set is however able to successfully constrain the hemispherical boundaries due to a good sampling of latitudes. Combining translation and rotation could explain some of the features of the boundaries separating the inner core hemispheres. The depth shift of the boundaries, observed by some authors, seems unlikely to be modeled by a fast translation but could be produced by slow translation associated with superrotation.
Efficient 3D porous microstructure reconstruction via Gaussian random field and hybrid optimization.
Jiang, Z; Chen, W; Burkhart, C
2013-11-01
Obtaining an accurate three-dimensional (3D) structure of a porous microstructure is important for assessing the material properties based on finite element analysis. Whereas directly obtaining 3D images of the microstructure is impractical under many circumstances, two sets of methods have been developed in literature to generate (reconstruct) 3D microstructure from its 2D images: one characterizes the microstructure based on certain statistical descriptors, typically two-point correlation function and cluster correlation function, and then performs an optimization process to build a 3D structure that matches those statistical descriptors; the other method models the microstructure using stochastic models like a Gaussian random field and generates a 3D structure directly from the function. The former obtains a relatively accurate 3D microstructure, but computationally the optimization process can be very intensive, especially for problems with large image size; the latter generates a 3D microstructure quickly but sacrifices the accuracy due to issues in numerical implementations. A hybrid optimization approach of modelling the 3D porous microstructure of random isotropic two-phase materials is proposed in this paper, which combines the two sets of methods and hence maintains the accuracy of the correlation-based method with improved efficiency. The proposed technique is verified for 3D reconstructions based on silica polymer composite images with different volume fractions. A comparison of the reconstructed microstructures and the optimization histories for both the original correlation-based method and our hybrid approach demonstrates the improved efficiency of the approach. © 2013 The Authors Journal of Microscopy © 2013 Royal Microscopical Society.
González-Recio, O; Jiménez-Montero, J A; Alenda, R
2013-01-01
In the next few years, with the advent of high-density single nucleotide polymorphism (SNP) arrays and genome sequencing, genomic evaluation methods will need to deal with a large number of genetic variants and an increasing sample size. The boosting algorithm is a machine-learning technique that may alleviate the drawbacks of dealing with such large data sets. This algorithm combines different predictors in a sequential manner with some shrinkage on them; each predictor is applied consecutively to the residuals from the committee formed by the previous ones to form a final prediction based on a subset of covariates. Here, a detailed description is provided and examples using a toy data set are included. A modification of the algorithm called "random boosting" was proposed to increase predictive ability and decrease computation time of genome-assisted evaluation in large data sets. Random boosting uses a random selection of markers to add a subsequent weak learner to the predictive model. These modifications were applied to a real data set composed of 1,797 bulls genotyped for 39,714 SNP. Deregressed proofs of 4 yield traits and 1 type trait from January 2009 routine evaluations were used as dependent variables. A 2-fold cross-validation scenario was implemented. Sires born before 2005 were used as a training sample (1,576 and 1,562 for production and type traits, respectively), whereas younger sires were used as a testing sample to evaluate predictive ability of the algorithm on yet-to-be-observed phenotypes. Comparison with the original algorithm was provided. The predictive ability of the algorithm was measured as Pearson correlations between observed and predicted responses. Further, estimated bias was computed as the average difference between observed and predicted phenotypes. The results showed that the modification of the original boosting algorithm could be run in 1% of the time used with the original algorithm and with negligible differences in accuracy and bias. This modification may be used to speed the calculus of genome-assisted evaluation in large data sets such us those obtained from consortiums. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Yiu, Sean; Farewell, Vernon T; Tom, Brian D M
2017-08-01
Many psoriatic arthritis patients do not progress to permanent joint damage in any of the 28 hand joints, even under prolonged follow-up. This has led several researchers to fit models that estimate the proportion of stayers (those who do not have the propensity to experience the event of interest) and to characterize the rate of developing damaged joints in the movers (those who have the propensity to experience the event of interest). However, when fitted to the same data, the paper demonstrates that the choice of model for the movers can lead to widely varying conclusions on a stayer population, thus implying that, if interest lies in a stayer population, a single analysis should not generally be adopted. The aim of the paper is to provide greater understanding regarding estimation of a stayer population by comparing the inferences, performance and features of multiple fitted models to real and simulated data sets. The models for the movers are based on Poisson processes with patient level random effects and/or dynamic covariates, which are used to induce within-patient correlation, and observation level random effects are used to account for time varying unobserved heterogeneity. The gamma, inverse Gaussian and compound Poisson distributions are considered for the random effects.
Petrinco, Michele; Pagano, Eva; Desideri, Alessandro; Bigi, Riccardo; Ghidina, Marco; Ferrando, Alberto; Cortigiani, Lauro; Merletti, Franco; Gregori, Dario
2009-01-01
Several methodological problems arise when health outcomes and resource utilization are collected at different sites. To avoid misleading conclusions in multi-center economic evaluations the center effect needs to be taken into adequate consideration. The aim of this article is to compare several models, which make use of a different amount of information about the enrolling center. To model the association of total medical costs with the levels of two sets of covariates, one at patient and one at center level, we considered four statistical models, based on the Gamma model in the class of the Generalized Linear Models with a log link, which use different amount of information on the enrolling centers. Models were applied to Cost of Strategies after Myocardial Infarction data, an international randomized trial on costs of uncomplicated acute myocardial infarction (AMI). The simple center effect adjustment based on a single random effect results in a more conservative estimation of the parameters as compared with approaches which make use of deeper information on the centers characteristics. This study shows, with reference to a real multicenter trial, that center information cannot be neglected and should be collected and inserted in the analysis, better in combination with one or more random effect, taking into account in this way also the heterogeneity among centers because of unobserved centers characteristics.
A characterization of linearly repetitive cut and project sets
NASA Astrophysics Data System (ADS)
Haynes, Alan; Koivusalo, Henna; Walton, James
2018-02-01
For the development of a mathematical theory which can be used to rigorously investigate physical properties of quasicrystals, it is necessary to understand regularity of patterns in special classes of aperiodic point sets in Euclidean space. In one dimension, prototypical mathematical models for quasicrystals are provided by Sturmian sequences and by point sets generated by substitution rules. Regularity properties of such sets are well understood, thanks mostly to well known results by Morse and Hedlund, and physicists have used this understanding to study one dimensional random Schrödinger operators and lattice gas models. A key fact which plays an important role in these problems is the existence of a subadditive ergodic theorem, which is guaranteed when the corresponding point set is linearly repetitive. In this paper we extend the one-dimensional model to cut and project sets, which generalize Sturmian sequences in higher dimensions, and which are frequently used in mathematical and physical literature as models for higher dimensional quasicrystals. By using a combination of algebraic, geometric, and dynamical techniques, together with input from higher dimensional Diophantine approximation, we give a complete characterization of all linearly repetitive cut and project sets with cubical windows. We also prove that these are precisely the collection of such sets which satisfy subadditive ergodic theorems. The results are explicit enough to allow us to apply them to known classical models, and to construct linearly repetitive cut and project sets in all pairs of dimensions and codimensions in which they exist. Research supported by EPSRC grants EP/L001462, EP/J00149X, EP/M023540. HK also gratefully acknowledges the support of the Osk. Huttunen foundation.
Exploiting Data Missingness in Bayesian Network Modeling
NASA Astrophysics Data System (ADS)
Rodrigues de Morais, Sérgio; Aussem, Alex
This paper proposes a framework built on the use of Bayesian networks (BN) for representing statistical dependencies between the existing random variables and additional dummy boolean variables, which represent the presence/absence of the respective random variable value. We show how augmenting the BN with these additional variables helps pinpoint the mechanism through which missing data contributes to the classification task. The missing data mechanism is thus explicitly taken into account to predict the class variable using the data at hand. Extensive experiments on synthetic and real-world incomplete data sets reveals that the missingness information improves classification accuracy.
Peñagaricano, F; Urioste, J I; Naya, H; de los Campos, G; Gianola, D
2011-04-01
Black skin spots are associated with pigmented fibres in wool, an important quality fault. Our objective was to assess alternative models for genetic analysis of presence (BINBS) and number (NUMBS) of black spots in Corriedale sheep. During 2002-08, 5624 records from 2839 animals in two flocks, aged 1 through 6 years, were taken at shearing. Four models were considered: linear and probit for BINBS and linear and Poisson for NUMBS. All models included flock-year and age as fixed effects and animal and permanent environmental as random effects. Models were fitted to the whole data set and were also compared based on their predictive ability in cross-validation. Estimates of heritability ranged from 0.154 to 0.230 for BINBS and 0.269 to 0.474 for NUMBS. For BINBS, the probit model fitted slightly better to the data than the linear model. Predictions of random effects from these models were highly correlated, and both models exhibited similar predictive ability. For NUMBS, the Poisson model, with a residual term to account for overdispersion, performed better than the linear model in goodness of fit and predictive ability. Predictions of random effects from the Poisson model were more strongly correlated with those from BINBS models than those from the linear model. Overall, the use of probit or linear models for BINBS and of a Poisson model with a residual for NUMBS seems a reasonable choice for genetic selection purposes in Corriedale sheep. © 2010 Blackwell Verlag GmbH.
Correcting Evaluation Bias of Relational Classifiers with Network Cross Validation
2010-01-01
classi- fication algorithms: simple random resampling (RRS), equal-instance random resampling (ERS), and network cross-validation ( NCV ). The first two... NCV procedure that eliminates overlap between test sets altogether. The procedure samples for k disjoint test sets that will be used for evaluation...propLabeled ∗ S) nodes from train Pool in f erenceSet =network − trainSet F = F ∪ < trainSet, test Set, in f erenceSet > end for output: F NCV addresses
2010-01-01
Background In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. Results The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. Conclusions Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements. PMID:20205909
Nuel, Gregory; Regad, Leslie; Martin, Juliette; Camproux, Anne-Claude
2010-01-26
In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements.
Kalderstam, Jonas; Edén, Patrik; Bendahl, Pär-Ola; Strand, Carina; Fernö, Mårten; Ohlsson, Mattias
2013-06-01
The concordance index (c-index) is the standard way of evaluating the performance of prognostic models in the presence of censored data. Constructing prognostic models using artificial neural networks (ANNs) is commonly done by training on error functions which are modified versions of the c-index. Our objective was to demonstrate the capability of training directly on the c-index and to evaluate our approach compared to the Cox proportional hazards model. We constructed a prognostic model using an ensemble of ANNs which were trained using a genetic algorithm. The individual networks were trained on a non-linear artificial data set divided into a training and test set both of size 2000, where 50% of the data was censored. The ANNs were also trained on a data set consisting of 4042 patients treated for breast cancer spread over five different medical studies, 2/3 used for training and 1/3 used as a test set. A Cox model was also constructed on the same data in both cases. The two models' c-indices on the test sets were then compared. The ranking performance of the models is additionally presented visually using modified scatter plots. Cross validation on the cancer training set did not indicate any non-linear effects between the covariates. An ensemble of 30 ANNs with one hidden neuron was therefore used. The ANN model had almost the same c-index score as the Cox model (c-index=0.70 and 0.71, respectively) on the cancer test set. Both models identified similarly sized low risk groups with at most 10% false positives, 49 for the ANN model and 60 for the Cox model, but repeated bootstrap runs indicate that the difference was not significant. A significant difference could however be seen when applied on the non-linear synthetic data set. In that case the ANN ensemble managed to achieve a c-index score of 0.90 whereas the Cox model failed to distinguish itself from the random case (c-index=0.49). We have found empirical evidence that ensembles of ANN models can be optimized directly on the c-index. Comparison with a Cox model indicates that near identical performance is achieved on a real cancer data set while on a non-linear data set the ANN model is clearly superior. Copyright © 2013 Elsevier B.V. All rights reserved.
Liu, Xian; Engel, Charles C
2012-12-20
Researchers often encounter longitudinal health data characterized with three or more ordinal or nominal categories. Random-effects multinomial logit models are generally applied to account for potential lack of independence inherent in such clustered data. When parameter estimates are used to describe longitudinal processes, however, random effects, both between and within individuals, need to be retransformed for correctly predicting outcome probabilities. This study attempts to go beyond existing work by developing a retransformation method that derives longitudinal growth trajectories of unbiased health probabilities. We estimated variances of the predicted probabilities by using the delta method. Additionally, we transformed the covariates' regression coefficients on the multinomial logit function, not substantively meaningful, to the conditional effects on the predicted probabilities. The empirical illustration uses the longitudinal data from the Asset and Health Dynamics among the Oldest Old. Our analysis compared three sets of the predicted probabilities of three health states at six time points, obtained from, respectively, the retransformation method, the best linear unbiased prediction, and the fixed-effects approach. The results demonstrate that neglect of retransforming random errors in the random-effects multinomial logit model results in severely biased longitudinal trajectories of health probabilities as well as overestimated effects of covariates on the probabilities. Copyright © 2012 John Wiley & Sons, Ltd.
Parametric Model Based On Imputations Techniques for Partly Interval Censored Data
NASA Astrophysics Data System (ADS)
Zyoud, Abdallah; Elfaki, F. A. M.; Hrairi, Meftah
2017-12-01
The term ‘survival analysis’ has been used in a broad sense to describe collection of statistical procedures for data analysis. In this case, outcome variable of interest is time until an event occurs where the time to failure of a specific experimental unit might be censored which can be right, left, interval, and Partly Interval Censored data (PIC). In this paper, analysis of this model was conducted based on parametric Cox model via PIC data. Moreover, several imputation techniques were used, which are: midpoint, left & right point, random, mean, and median. Maximum likelihood estimate was considered to obtain the estimated survival function. These estimations were then compared with the existing model, such as: Turnbull and Cox model based on clinical trial data (breast cancer data), for which it showed the validity of the proposed model. Result of data set indicated that the parametric of Cox model proved to be more superior in terms of estimation of survival functions, likelihood ratio tests, and their P-values. Moreover, based on imputation techniques; the midpoint, random, mean, and median showed better results with respect to the estimation of survival function.
Specific surface area of overlapping spheres in the presence of obstructions
NASA Astrophysics Data System (ADS)
Jenkins, D. R.
2013-02-01
This study considers the random placement of uniform sized spheres, which may overlap, in the presence of another set of randomly placed (hard) spheres, which do not overlap. The overlapping spheres do not intersect the hard spheres. It is shown that the specific surface area of the collection of overlapping spheres is affected by the hard spheres, such that there is a minimum in the specific surface area as a function of the relative size of the two sets of spheres. The occurrence of the minimum is explained in terms of the break-up of pore connectivity. The configuration can be considered to be a simple model of the structure of a porous composite material. In particular, the overlapping particles represent voids while the hard particles represent fillers. Example materials are pervious concrete, metallurgical coke, ice cream, and polymer composites. We also show how the material properties of such composites are affected by the void structure.
Specific surface area of overlapping spheres in the presence of obstructions.
Jenkins, D R
2013-02-21
This study considers the random placement of uniform sized spheres, which may overlap, in the presence of another set of randomly placed (hard) spheres, which do not overlap. The overlapping spheres do not intersect the hard spheres. It is shown that the specific surface area of the collection of overlapping spheres is affected by the hard spheres, such that there is a minimum in the specific surface area as a function of the relative size of the two sets of spheres. The occurrence of the minimum is explained in terms of the break-up of pore connectivity. The configuration can be considered to be a simple model of the structure of a porous composite material. In particular, the overlapping particles represent voids while the hard particles represent fillers. Example materials are pervious concrete, metallurgical coke, ice cream, and polymer composites. We also show how the material properties of such composites are affected by the void structure.
Conditional Random Fields for Activity Recognition
2008-04-01
final match. The final is never used as a training or hold out set. Table 4.1 lists the roles of the CMDragons’07 robot soccer team. The role of Goalie ...is not included because the goalie never changes roles. The classification task, which we formalize below, is to recognize robot roles from the avail...process and pull out the key information from the sensor data. Furthermore, as conditional models, CRFs do not waste modeling effort on the observations
Decorrelation of the true and estimated classifier errors in high-dimensional settings.
Hanczar, Blaise; Hua, Jianping; Dougherty, Edward R
2007-01-01
The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity which refers to the precision of error estimation is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated, so that natural questions arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three error estimators commonly used (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known-feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the better correlation between the latter two showing no general trend, but differing for different models.
Chen, Henry W; Du, Jingcheng; Song, Hsing-Yi; Liu, Xiangyu; Jiang, Guoqian
2018-01-01
Background Today, there is an increasing need to centralize and standardize electronic health data within clinical research as the volume of data continues to balloon. Domain-specific common data elements (CDEs) are emerging as a standard approach to clinical research data capturing and reporting. Recent efforts to standardize clinical study CDEs have been of great benefit in facilitating data integration and data sharing. The importance of the temporal dimension of clinical research studies has been well recognized; however, very few studies have focused on the formal representation of temporal constraints and temporal relationships within clinical research data in the biomedical research community. In particular, temporal information can be extremely powerful to enable high-quality cancer research. Objective The objective of the study was to develop and evaluate an ontological approach to represent the temporal aspects of cancer study CDEs. Methods We used CDEs recorded in the National Cancer Institute (NCI) Cancer Data Standards Repository (caDSR) and created a CDE parser to extract time-relevant CDEs from the caDSR. Using the Web Ontology Language (OWL)–based Time Event Ontology (TEO), we manually derived representative patterns to semantically model the temporal components of the CDEs using an observing set of randomly selected time-related CDEs (n=600) to create a set of TEO ontological representation patterns. In evaluating TEO’s ability to represent the temporal components of the CDEs, this set of representation patterns was tested against two test sets of randomly selected time-related CDEs (n=425). Results It was found that 94.2% (801/850) of the CDEs in the test sets could be represented by the TEO representation patterns. Conclusions In conclusion, TEO is a good ontological model for representing the temporal components of the CDEs recorded in caDSR. Our representative model can harness the Semantic Web reasoning and inferencing functionalities and present a means for temporal CDEs to be machine-readable, streamlining meaningful searches. PMID:29472179
ERIC Educational Resources Information Center
Thompson, Bruce
The relationship between analysis of variance (ANOVA) methods and their analogs (analysis of covariance and multiple analyses of variance and covariance--collectively referred to as OVA methods) and the more general analytic case is explored. A small heuristic data set is used, with a hypothetical sample of 20 subjects, randomly assigned to five…
EvoluZion: A Computer Simulator for Teaching Genetic and Evolutionary Concepts
ERIC Educational Resources Information Center
Zurita, Adolfo R.
2017-01-01
EvoluZion is a forward-in-time genetic simulator developed in Java and designed to perform real time simulations on the evolutionary history of virtual organisms. These model organisms harbour a set of 13 genes that codify an equal number of phenotypic features. These genes change randomly during replication, and mutant genes can have null,…
ERIC Educational Resources Information Center
Chiu, Angela Wai Mon
2010-01-01
The current study used a programmatic dissemination model as a guiding framework for testing an evidence-supported treatment (EST) for child anxiety disorders in the school setting. The main goal of the project was to conduct the first of a planned series of partial-effectiveness tests (group-design randomized controlled trials) evaluating the…
Circular distributions based on nonnegative trigonometric sums.
Fernández-Durán, J J
2004-06-01
A new family of distributions for circular random variables is proposed. It is based on nonnegative trigonometric sums and can be used to model data sets which present skewness and/or multimodality. In this family of distributions, the trigonometric moments are easily expressed in terms of the parameters of the distribution. The proposed family is applied to two data sets, one related with the directions taken by ants and the other with the directions taken by turtles, to compare their goodness of fit versus common distributions used in the literature.
Quantitative prediction of oral cancer risk in patients with oral leukoplakia.
Liu, Yao; Li, Yicheng; Fu, Yue; Liu, Tong; Liu, Xiaoyong; Zhang, Xinyan; Fu, Jie; Guan, Xiaobing; Chen, Tong; Chen, Xiaoxin; Sun, Zheng
2017-07-11
Exfoliative cytology has been widely used for early diagnosis of oral squamous cell carcinoma. We have developed an oral cancer risk index using DNA index value to quantitatively assess cancer risk in patients with oral leukoplakia, but with limited success. In order to improve the performance of the risk index, we collected exfoliative cytology, histopathology, and clinical follow-up data from two independent cohorts of normal, leukoplakia and cancer subjects (training set and validation set). Peaks were defined on the basis of first derivatives with positives, and modern machine learning techniques were utilized to build statistical prediction models on the reconstructed data. Random forest was found to be the best model with high sensitivity (100%) and specificity (99.2%). Using the Peaks-Random Forest model, we constructed an index (OCRI2) as a quantitative measurement of cancer risk. Among 11 leukoplakia patients with an OCRI2 over 0.5, 4 (36.4%) developed cancer during follow-up (23 ± 20 months), whereas 3 (5.3%) of 57 leukoplakia patients with an OCRI2 less than 0.5 developed cancer (32 ± 31 months). OCRI2 is better than other methods in predicting oral squamous cell carcinoma during follow-up. In conclusion, we have developed an exfoliative cytology-based method for quantitative prediction of cancer risk in patients with oral leukoplakia.
Andridge, Rebecca. R.
2011-01-01
In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller ICCs lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared. PMID:21259309
Evolution of the concentration PDF in random environments modeled by global random walk
NASA Astrophysics Data System (ADS)
Suciu, Nicolae; Vamos, Calin; Attinger, Sabine; Knabner, Peter
2013-04-01
The evolution of the probability density function (PDF) of concentrations of chemical species transported in random environments is often modeled by ensembles of notional particles. The particles move in physical space along stochastic-Lagrangian trajectories governed by Ito equations, with drift coefficients given by the local values of the resolved velocity field and diffusion coefficients obtained by stochastic or space-filtering upscaling procedures. A general model for the sub-grid mixing also can be formulated as a system of Ito equations solving for trajectories in the composition space. The PDF is finally estimated by the number of particles in space-concentration control volumes. In spite of their efficiency, Lagrangian approaches suffer from two severe limitations. Since the particle trajectories are constructed sequentially, the demanded computing resources increase linearly with the number of particles. Moreover, the need to gather particles at the center of computational cells to perform the mixing step and to estimate statistical parameters, as well as the interpolation of various terms to particle positions, inevitably produce numerical diffusion in either particle-mesh or grid-free particle methods. To overcome these limitations, we introduce a global random walk method to solve the system of Ito equations in physical and composition spaces, which models the evolution of the random concentration's PDF. The algorithm consists of a superposition on a regular lattice of many weak Euler schemes for the set of Ito equations. Since all particles starting from a site of the space-concentration lattice are spread in a single numerical procedure, one obtains PDF estimates at the lattice sites at computational costs comparable with those for solving the system of Ito equations associated to a single particle. The new method avoids the limitations concerning the number of particles in Lagrangian approaches, completely removes the numerical diffusion, and speeds up the computation by orders of magnitude. The approach is illustrated for the transport of passive scalars in heterogeneous aquifers, with hydraulic conductivity modeled as a random field.
Toropova, Alla P; Schultz, Terry W; Toropov, Andrey A
2016-03-01
Data on toxicity toward Tetrahymena pyriformis is indicator of applicability of a substance in ecologic and pharmaceutical aspects. Quantitative structure-activity relationships (QSARs) between the molecular structure of benzene derivatives and toxicity toward T. pyriformis (expressed as the negative logarithms of the population growth inhibition dose, mmol/L) are established. The available data were randomly distributed three times into the visible training and calibration sets, and invisible validation sets. The statistical characteristics for the validation set are the following: r(2)=0.8179 and s=0.338 (first distribution); r(2)=0.8682 and s=0.341 (second distribution); r(2)=0.8435 and s=0.323 (third distribution). These models are built up using only information on the molecular structure: no data on physicochemical parameters, 3D features of the molecular structure and quantum mechanics descriptors are involved in the modeling process. Copyright © 2016 Elsevier B.V. All rights reserved.
NASA Technical Reports Server (NTRS)
Muravyov, Alexander A.
1999-01-01
In this paper, a method for obtaining nonlinear stiffness coefficients in modal coordinates for geometrically nonlinear finite-element models is developed. The method requires application of a finite-element program with a geometrically non- linear static capability. The MSC/NASTRAN code is employed for this purpose. The equations of motion of a MDOF system are formulated in modal coordinates. A set of linear eigenvectors is used to approximate the solution of the nonlinear problem. The random vibration problem of the MDOF nonlinear system is then considered. The solutions obtained by application of two different versions of a stochastic linearization technique are compared with linear and exact (analytical) solutions in terms of root-mean-square (RMS) displacements and strains for a beam structure.
Droplet localization in the random XXZ model and its manifestations
NASA Astrophysics Data System (ADS)
Elgart, A.; Klein, A.; Stolz, G.
2018-01-01
We examine many-body localization properties for the eigenstates that lie in the droplet sector of the random-field spin- \\frac 1 2 XXZ chain. These states satisfy a basic single cluster localization property (SCLP), derived in Elgart et al (2018 J. Funct. Anal. (in press)). This leads to many consequences, including dynamical exponential clustering, non-spreading of information under the time evolution, and a zero velocity Lieb-Robinson bound. Since SCLP is only applicable to the droplet sector, our definitions and proofs do not rely on knowledge of the spectral and dynamical characteristics of the model outside this regime. Rather, to allow for a possible mobility transition, we adapt the notion of restricting the Hamiltonian to an energy window from the single particle setting to the many body context.
Gene expression models for prediction of longitudinal dispersion coefficient in streams
NASA Astrophysics Data System (ADS)
Sattar, Ahmed M. A.; Gharabaghi, Bahram
2015-05-01
Longitudinal dispersion is the key hydrologic process that governs transport of pollutants in natural streams. It is critical for spill action centers to be able to predict the pollutant travel time and break-through curves accurately following accidental spills in urban streams. This study presents a novel gene expression model for longitudinal dispersion developed using 150 published data sets of geometric and hydraulic parameters in natural streams in the United States, Canada, Europe, and New Zealand. The training and testing of the model were accomplished using randomly-selected 67% (100 data sets) and 33% (50 data sets) of the data sets, respectively. Gene expression programming (GEP) is used to develop empirical relations between the longitudinal dispersion coefficient and various control variables, including the Froude number which reflects the effect of reach slope, aspect ratio, and the bed material roughness on the dispersion coefficient. Two GEP models have been developed, and the prediction uncertainties of the developed GEP models are quantified and compared with those of existing models, showing improved prediction accuracy in favor of GEP models. Finally, a parametric analysis is performed for further verification of the developed GEP models. The main reason for the higher accuracy of the GEP models compared to the existing regression models is that exponents of the key variables (aspect ratio and bed material roughness) are not constants but a function of the Froude number. The proposed relations are both simple and accurate and can be effectively used to predict the longitudinal dispersion coefficients in natural streams.
Testing and Analysis of Sensor Ports
NASA Technical Reports Server (NTRS)
Zhang, M.; Frendi, A.; Thompson, W.; Casiano, M. J.
2016-01-01
This Technical Publication summarizes the work focused on the testing and analysis of sensor ports. The tasks under this contract were divided into three areas: (1) Development of an Analytical Model, (2) Conducting a Set of Experiments, and (3) Obtaining Computational Solutions. Results from the experiment using both short and long sensor ports were obtained using harmonic, random, and frequency sweep plane acoustic waves. An amplification factor of the pressure signal between the port inlet and the back of the port is obtained and compared to models. Comparisons of model and experimental results showed very good agreement.
Vassallo, Rebecca; Durrant, Gabriele B; Smith, Peter W F; Goldstein, Harvey
2015-01-01
The paper investigates two different multilevel approaches, the multilevel cross-classified and the multiple-membership models, for the analysis of interviewer effects on wave non-response in longitudinal surveys. The models proposed incorporate both interviewer and area effects to account for the non-hierarchical structure, the influence of potentially more than one interviewer across waves and possible confounding of area and interviewer effects arising from the non-random allocation of interviewers across areas. The methods are compared by using a data set: the UK Family and Children Survey. PMID:25598587
MIP models for connected facility location: A theoretical and computational study☆
Gollowitzer, Stefan; Ljubić, Ivana
2011-01-01
This article comprises the first theoretical and computational study on mixed integer programming (MIP) models for the connected facility location problem (ConFL). ConFL combines facility location and Steiner trees: given a set of customers, a set of potential facility locations and some inter-connection nodes, ConFL searches for the minimum-cost way of assigning each customer to exactly one open facility, and connecting the open facilities via a Steiner tree. The costs needed for building the Steiner tree, facility opening costs and the assignment costs need to be minimized. We model ConFL using seven compact and three mixed integer programming formulations of exponential size. We also show how to transform ConFL into the Steiner arborescence problem. A full hierarchy between the models is provided. For two exponential size models we develop a branch-and-cut algorithm. An extensive computational study is based on two benchmark sets of randomly generated instances with up to 1300 nodes and 115,000 edges. We empirically compare the presented models with respect to the quality of obtained bounds and the corresponding running time. We report optimal values for all but 16 instances for which the obtained gaps are below 0.6%. PMID:25009366
Randomization Methods in Emergency Setting Trials: A Descriptive Review
ERIC Educational Resources Information Center
Corbett, Mark Stephen; Moe-Byrne, Thirimon; Oddie, Sam; McGuire, William
2016-01-01
Background: Quasi-randomization might expedite recruitment into trials in emergency care settings but may also introduce selection bias. Methods: We searched the Cochrane Library and other databases for systematic reviews of interventions in emergency medicine or urgent care settings. We assessed selection bias (baseline imbalances) in prognostic…
NASA Technical Reports Server (NTRS)
Chambers, Jeffrey A.
1994-01-01
Finite element analysis is regularly used during the engineering cycle of mechanical systems to predict the response to static, thermal, and dynamic loads. The finite element model (FEM) used to represent the system is often correlated with physical test results to determine the validity of analytical results provided. Results from dynamic testing provide one means for performing this correlation. One of the most common methods of measuring accuracy is by classical modal testing, whereby vibratory mode shapes are compared to mode shapes provided by finite element analysis. The degree of correlation between the test and analytical mode shapes can be shown mathematically using the cross orthogonality check. A great deal of time and effort can be exhausted in generating the set of test acquired mode shapes needed for the cross orthogonality check. In most situations response data from vibration tests are digitally processed to generate the mode shapes from a combination of modal parameters, forcing functions, and recorded response data. An alternate method is proposed in which the same correlation of analytical and test acquired mode shapes can be achieved without conducting the modal survey. Instead a procedure is detailed in which a minimum of test information, specifically the acceleration response data from a random vibration test, is used to generate a set of equivalent local accelerations to be applied to the reduced analytical model at discrete points corresponding to the test measurement locations. The static solution of the analytical model then produces a set of deformations that once normalized can be used to represent the test acquired mode shapes in the cross orthogonality relation. The method proposed has been shown to provide accurate results for both a simple analytical model as well as a complex space flight structure.
NASA Astrophysics Data System (ADS)
Banerjee, Priyanka; Preissner, Robert
2018-04-01
Taste of a chemical compounds present in food stimulates us to take in nutrients and avoid poisons. However, the perception of taste greatly depends on the genetic as well as evolutionary perspectives. The aim of this work was the development and validation of a machine learning model based on molecular fingerprints to discriminate between sweet and bitter taste of molecules. BitterSweetForest is the first open access model based on KNIME workflow that provides platform for prediction of bitter and sweet taste of chemical compounds using molecular fingerprints and Random Forest based classifier. The constructed model yielded an accuracy of 95% and an AUC of 0.98 in cross-validation. In independent test set, BitterSweetForest achieved an accuracy of 96 % and an AUC of 0.98 for bitter and sweet taste prediction. The constructed model was further applied to predict the bitter and sweet taste of natural compounds, approved drugs as well as on an acute toxicity compound data set. BitterSweetForest suggests 70% of the natural product space, as bitter and 10 % of the natural product space as sweet with confidence score of 0.60 and above. 77 % of the approved drug set was predicted as bitter and 2% as sweet with a confidence scores of 0.75 and above. Similarly, 75% of the total compounds from acute oral toxicity class were predicted only as bitter with a minimum confidence score of 0.75, revealing toxic compounds are mostly bitter. Furthermore, we applied a Bayesian based feature analysis method to discriminate the most occurring chemical features between sweet and bitter compounds from the feature space of a circular fingerprint.
Banerjee, Priyanka; Preissner, Robert
2018-01-01
Taste of a chemical compound present in food stimulates us to take in nutrients and avoid poisons. However, the perception of taste greatly depends on the genetic as well as evolutionary perspectives. The aim of this work was the development and validation of a machine learning model based on molecular fingerprints to discriminate between sweet and bitter taste of molecules. BitterSweetForest is the first open access model based on KNIME workflow that provides platform for prediction of bitter and sweet taste of chemical compounds using molecular fingerprints and Random Forest based classifier. The constructed model yielded an accuracy of 95% and an AUC of 0.98 in cross-validation. In independent test set, BitterSweetForest achieved an accuracy of 96% and an AUC of 0.98 for bitter and sweet taste prediction. The constructed model was further applied to predict the bitter and sweet taste of natural compounds, approved drugs as well as on an acute toxicity compound data set. BitterSweetForest suggests 70% of the natural product space, as bitter and 10% of the natural product space as sweet with confidence score of 0.60 and above. 77% of the approved drug set was predicted as bitter and 2% as sweet with a confidence score of 0.75 and above. Similarly, 75% of the total compounds from acute oral toxicity class were predicted only as bitter with a minimum confidence score of 0.75, revealing toxic compounds are mostly bitter. Furthermore, we applied a Bayesian based feature analysis method to discriminate the most occurring chemical features between sweet and bitter compounds using the feature space of a circular fingerprint. PMID:29696137
A Dirichlet-Multinomial Bayes Classifier for Disease Diagnosis with Microbial Compositions.
Gao, Xiang; Lin, Huaiying; Dong, Qunfeng
2017-01-01
Dysbiosis of microbial communities is associated with various human diseases, raising the possibility of using microbial compositions as biomarkers for disease diagnosis. We have developed a Bayes classifier by modeling microbial compositions with Dirichlet-multinomial distributions, which are widely used to model multicategorical count data with extra variation. The parameters of the Dirichlet-multinomial distributions are estimated from training microbiome data sets based on maximum likelihood. The posterior probability of a microbiome sample belonging to a disease or healthy category is calculated based on Bayes' theorem, using the likelihood values computed from the estimated Dirichlet-multinomial distribution, as well as a prior probability estimated from the training microbiome data set or previously published information on disease prevalence. When tested on real-world microbiome data sets, our method, called DMBC (for Dirichlet-multinomial Bayes classifier), shows better classification accuracy than the only existing Bayesian microbiome classifier based on a Dirichlet-multinomial mixture model and the popular random forest method. The advantage of DMBC is its built-in automatic feature selection, capable of identifying a subset of microbial taxa with the best classification accuracy between different classes of samples based on cross-validation. This unique ability enables DMBC to maintain and even improve its accuracy at modeling species-level taxa. The R package for DMBC is freely available at https://github.com/qunfengdong/DMBC. IMPORTANCE By incorporating prior information on disease prevalence, Bayes classifiers have the potential to estimate disease probability better than other common machine-learning methods. Thus, it is important to develop Bayes classifiers specifically tailored for microbiome data. Our method shows higher classification accuracy than the only existing Bayesian classifier and the popular random forest method, and thus provides an alternative option for using microbial compositions for disease diagnosis.
Wickenberg-Bolin, Ulrika; Göransson, Hanna; Fryknäs, Mårten; Gustafsson, Mats G; Isaksson, Anders
2006-03-13
Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval of the error rate using repeated design and test sets selected from available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore different methods for small sample performance estimation such as a recently proposed procedure called Repeated Random Sampling (RSS) is also expected to result in heavily biased estimates, which in turn translates into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT). Our simulations reveal that repeated designs and tests based on resampling in a fixed bag of samples yield a biased variance estimate. We also demonstrate that it is possible to obtain an improved variance estimate by means of a procedure that explicitly models how this bias depends on the number of samples used for testing. For the special case of repeated designs and tests using new samples for each design and test, we present an exact analytical expression for how the expected value of the bias decreases with the size of the test set. We show that via modeling and subsequent reduction of the small sample bias, it is possible to obtain an improved estimate of the variance of classifier performance between design sets. However, the uncertainty of the variance estimate is large in the simulations performed indicating that the method in its present form cannot be directly applied to small data sets.
Bayesian Regression with Network Prior: Optimal Bayesian Filtering Perspective
Qian, Xiaoning; Dougherty, Edward R.
2017-01-01
The recently introduced intrinsically Bayesian robust filter (IBRF) provides fully optimal filtering relative to a prior distribution over an uncertainty class ofjoint random process models, whereas formerly the theory was limited to model-constrained Bayesian robust filters, for which optimization was limited to the filters that are optimal for models in the uncertainty class. This paper extends the IBRF theory to the situation where there are both a prior on the uncertainty class and sample data. The result is optimal Bayesian filtering (OBF), where optimality is relative to the posterior distribution derived from the prior and the data. The IBRF theories for effective characteristics and canonical expansions extend to the OBF setting. A salient focus of the present work is to demonstrate the advantages of Bayesian regression within the OBF setting over the classical Bayesian approach in the context otlinear Gaussian models. PMID:28824268
Folding and stability of helical bundle proteins from coarse-grained models.
Kapoor, Abhijeet; Travesset, Alex
2013-07-01
We develop a coarse-grained model where solvent is considered implicitly, electrostatics are included as short-range interactions, and side-chains are coarse-grained to a single bead. The model depends on three main parameters: hydrophobic, electrostatic, and side-chain hydrogen bond strength. The parameters are determined by considering three level of approximations and characterizing the folding for three selected proteins (training set). Nine additional proteins (containing up to 126 residues) as well as mutated versions (test set) are folded with the given parameters. In all folding simulations, the initial state is a random coil configuration. Besides the native state, some proteins fold into an additional state differing in the topology (structure of the helical bundle). We discuss the stability of the native states, and compare the dynamics of our model to all atom molecular dynamics simulations as well as some general properties on the interactions governing folding dynamics. Copyright © 2013 Wiley Periodicals, Inc.
Development and validation of an all-cause mortality risk score in type 2 diabetes.
Yang, Xilin; So, Wing Yee; Tong, Peter C Y; Ma, Ronald C W; Kong, Alice P S; Lam, Christopher W K; Ho, Chung Shun; Cockram, Clive S; Ko, Gary T C; Chow, Chun-Chung; Wong, Vivian C W; Chan, Juliana C N
2008-03-10
Diabetes reduces life expectancy by 10 to 12 years, but whether death can be predicted in type 2 diabetes mellitus remains uncertain. A prospective cohort of 7583 type 2 diabetic patients enrolled since 1995 were censored on July 30, 2005, or after 6 years of follow-up, whichever came first. A restricted cubic spline model was used to check data linearity and to develop linear-transforming formulas. Data were randomly assigned to a training data set and to a test data set. A Cox model was used to develop risk scores in the test data set. Calibration and discrimination were assessed in the test data set. A total of 619 patients died during a median follow-up period of 5.51 years, resulting in a mortality rate of 18.69 per 1000 person-years. Age, sex, peripheral arterial disease, cancer history, insulin use, blood hemoglobin levels, linear-transformed body mass index, random spot urinary albumin-creatinine ratio, and estimated glomerular filtration rate at enrollment were predictors of all-cause death. A risk score for all-cause mortality was developed using these predictors. The predicted and observed death rates in the test data set were similar (P > .70). The area under the receiver operating characteristic curve was 0.85 for 5 years of follow-up. Using the risk score in ranking cause-specific deaths, the area under the receiver operating characteristic curve was 0.95 for genitourinary death, 0.85 for circulatory death, 0.85 for respiratory death, and 0.71 for neoplasm death. Death in type 2 diabetes mellitus can be predicted using a risk score consisting of commonly measured clinical and biochemical variables. Further validation is needed before clinical use.
Djurfeldt, Mikael
2012-07-01
The connection-set algebra (CSA) is a novel and general formalism for the description of connectivity in neuronal network models, from small-scale to large-scale structure. The algebra provides operators to form more complex sets of connections from simpler ones and also provides parameterization of such sets. CSA is expressive enough to describe a wide range of connection patterns, including multiple types of random and/or geometrically dependent connectivity, and can serve as a concise notation for network structure in scientific writing. CSA implementations allow for scalable and efficient representation of connectivity in parallel neuronal network simulators and could even allow for avoiding explicit representation of connections in computer memory. The expressiveness of CSA makes prototyping of network structure easy. A C+ + version of the algebra has been implemented and used in a large-scale neuronal network simulation (Djurfeldt et al., IBM J Res Dev 52(1/2):31-42, 2008b) and an implementation in Python has been publicly released.
Testing statistical self-similarity in the topology of river networks
Troutman, Brent M.; Mantilla, Ricardo; Gupta, Vijay K.
2010-01-01
Recent work has demonstrated that the topological properties of real river networks deviate significantly from predictions of Shreve's random model. At the same time the property of mean self-similarity postulated by Tokunaga's model is well supported by data. Recently, a new class of network model called random self-similar networks (RSN) that combines self-similarity and randomness has been introduced to replicate important topological features observed in real river networks. We investigate if the hypothesis of statistical self-similarity in the RSN model is supported by data on a set of 30 basins located across the continental United States that encompass a wide range of hydroclimatic variability. We demonstrate that the generators of the RSN model obey a geometric distribution, and self-similarity holds in a statistical sense in 26 of these 30 basins. The parameters describing the distribution of interior and exterior generators are tested to be statistically different and the difference is shown to produce the well-known Hack's law. The inter-basin variability of RSN parameters is found to be statistically significant. We also test generator dependence on two climatic indices, mean annual precipitation and radiative index of dryness. Some indication of climatic influence on the generators is detected, but this influence is not statistically significant with the sample size available. Finally, two key applications of the RSN model to hydrology and geomorphology are briefly discussed.
Revell, Andrew D; Wang, Dechao; Perez-Elias, Maria-Jesus; Wood, Robin; Cogill, Dolphina; Tempelman, Hugo; Hamers, Raph L; Reiss, Peter; van Sighem, Ard I; Rehm, Catherine A; Pozniak, Anton; Montaner, Julio S G; Lane, H Clifford; Larder, Brendan A
2018-06-08
Optimizing antiretroviral drug combination on an individual basis can be challenging, particularly in settings with limited access to drugs and genotypic resistance testing. Here we describe our latest computational models to predict treatment responses, with or without a genotype, and compare their predictive accuracy with that of genotyping. Random forest models were trained to predict the probability of virological response to a new therapy introduced following virological failure using up to 50 000 treatment change episodes (TCEs) without a genotype and 18 000 TCEs including genotypes. Independent data sets were used to evaluate the models. This study tested the effects on model accuracy of relaxing the baseline data timing windows, the use of a new filter to exclude probable non-adherent cases and the addition of maraviroc, tipranavir and elvitegravir to the system. The no-genotype models achieved area under the receiver operator characteristic curve (AUC) values of 0.82 and 0.81 using the standard and relaxed baseline data windows, respectively. The genotype models achieved AUC values of 0.86 with the new non-adherence filter and 0.84 without. Both sets of models were significantly more accurate than genotyping with rules-based interpretation, which achieved AUC values of only 0.55-0.63, and were marginally more accurate than previous models. The models were able to identify alternative regimens that were predicted to be effective for the vast majority of cases in which the new regimen prescribed in the clinic failed. These latest global models predict treatment responses accurately even without a genotype and have the potential to help optimize therapy, particularly in resource-limited settings.
Data-driven train set crash dynamics simulation
NASA Astrophysics Data System (ADS)
Tang, Zhao; Zhu, Yunrui; Nie, Yinyu; Guo, Shihui; Liu, Fengjia; Chang, Jian; Zhang, Jianjun
2017-02-01
Traditional finite element (FE) methods are arguably expensive in computation/simulation of the train crash. High computational cost limits their direct applications in investigating dynamic behaviours of an entire train set for crashworthiness design and structural optimisation. On the contrary, multi-body modelling is widely used because of its low computational cost with the trade-off in accuracy. In this study, a data-driven train crash modelling method is proposed to improve the performance of a multi-body dynamics simulation of train set crash without increasing the computational burden. This is achieved by the parallel random forest algorithm, which is a machine learning approach that extracts useful patterns of force-displacement curves and predicts a force-displacement relation in a given collision condition from a collection of offline FE simulation data on various collision conditions, namely different crash velocities in our analysis. Using the FE simulation results as a benchmark, we compared our method with traditional multi-body modelling methods and the result shows that our data-driven method improves the accuracy over traditional multi-body models in train crash simulation and runs at the same level of efficiency.
Covariate Selection for Multilevel Models with Missing Data
Marino, Miguel; Buxton, Orfeu M.; Li, Yi
2017-01-01
Missing covariate data hampers variable selection in multilevel regression settings. Current variable selection techniques for multiply-imputed data commonly address missingness in the predictors through list-wise deletion and stepwise-selection methods which are problematic. Moreover, most variable selection methods are developed for independent linear regression models and do not accommodate multilevel mixed effects regression models with incomplete covariate data. We develop a novel methodology that is able to perform covariate selection across multiply-imputed data for multilevel random effects models when missing data is present. Specifically, we propose to stack the multiply-imputed data sets from a multiple imputation procedure and to apply a group variable selection procedure through group lasso regularization to assess the overall impact of each predictor on the outcome across the imputed data sets. Simulations confirm the advantageous performance of the proposed method compared with the competing methods. We applied the method to reanalyze the Healthy Directions-Small Business cancer prevention study, which evaluated a behavioral intervention program targeting multiple risk-related behaviors in a working-class, multi-ethnic population. PMID:28239457
Genetic analyses of stillbirth in relation to litter size using random regression models.
Chen, C Y; Misztal, I; Tsuruta, S; Herring, W O; Holl, J; Culbertson, M
2010-12-01
Estimates of genetic parameters for number of stillborns (NSB) in relation to litter size (LS) were obtained with random regression models (RRM). Data were collected from 4 purebred Duroc nucleus farms between 2004 and 2008. Two data sets with 6,575 litters for the first parity (P1) and 6,259 litters for the second to fifth parity (P2-5) with a total of 8,217 and 5,066 animals in the pedigree were analyzed separately. Number of stillborns was studied as a trait on sow level. Fixed effects were contemporary groups (farm-year-season) and fixed cubic regression coefficients on LS with Legendre polynomials. Models for P2-5 included the fixed effect of parity. Random effects were additive genetic effects for both data sets with permanent environmental effects included for P2-5. Random effects modeled with Legendre polynomials (RRM-L), linear splines (RRM-S), and degree 0 B-splines (RRM-BS) with regressions on LS were used. For P1, the order of polynomial, the number of knots, and the number of intervals used for respective models were quadratic, 3, and 3, respectively. For P2-5, the same parameters were linear, 2, and 2, respectively. Heterogeneous residual variances were considered in the models. For P1, estimates of heritability were 12 to 15%, 5 to 6%, and 6 to 7% in LS 5, 9, and 13, respectively. For P2-5, estimates were 15 to 17%, 4 to 5%, and 4 to 6% in LS 6, 9, and 12, respectively. For P1, average estimates of genetic correlations between LS 5 to 9, 5 to 13, and 9 to 13 were 0.53, -0.29, and 0.65, respectively. For P2-5, same estimates averaged for RRM-L and RRM-S were 0.75, -0.21, and 0.50, respectively. For RRM-BS with 2 intervals, the correlation was 0.66 between LS 5 to 7 and 8 to 13. Parameters obtained by 3 RRM revealed the nonlinear relationship between additive genetic effect of NSB and the environmental deviation of LS. The negative correlations between the 2 extreme LS might possibly indicate different genetic bases on incidence of stillbirth.
A stochastic hybrid systems based framework for modeling dependent failure processes
Fan, Mengfei; Zeng, Zhiguo; Zio, Enrico; Kang, Rui; Chen, Ying
2017-01-01
In this paper, we develop a framework to model and analyze systems that are subject to dependent, competing degradation processes and random shocks. The degradation processes are described by stochastic differential equations, whereas transitions between the system discrete states are triggered by random shocks. The modeling is, then, based on Stochastic Hybrid Systems (SHS), whose state space is comprised of a continuous state determined by stochastic differential equations and a discrete state driven by stochastic transitions and reset maps. A set of differential equations are derived to characterize the conditional moments of the state variables. System reliability and its lower bounds are estimated from these conditional moments, using the First Order Second Moment (FOSM) method and Markov inequality, respectively. The developed framework is applied to model three dependent failure processes from literature and a comparison is made to Monte Carlo simulations. The results demonstrate that the developed framework is able to yield an accurate estimation of reliability with less computational costs compared to traditional Monte Carlo-based methods. PMID:28231313
A stochastic hybrid systems based framework for modeling dependent failure processes.
Fan, Mengfei; Zeng, Zhiguo; Zio, Enrico; Kang, Rui; Chen, Ying
2017-01-01
In this paper, we develop a framework to model and analyze systems that are subject to dependent, competing degradation processes and random shocks. The degradation processes are described by stochastic differential equations, whereas transitions between the system discrete states are triggered by random shocks. The modeling is, then, based on Stochastic Hybrid Systems (SHS), whose state space is comprised of a continuous state determined by stochastic differential equations and a discrete state driven by stochastic transitions and reset maps. A set of differential equations are derived to characterize the conditional moments of the state variables. System reliability and its lower bounds are estimated from these conditional moments, using the First Order Second Moment (FOSM) method and Markov inequality, respectively. The developed framework is applied to model three dependent failure processes from literature and a comparison is made to Monte Carlo simulations. The results demonstrate that the developed framework is able to yield an accurate estimation of reliability with less computational costs compared to traditional Monte Carlo-based methods.
Multivariate Longitudinal Analysis with Bivariate Correlation Test.
Adjakossa, Eric Houngla; Sadissou, Ibrahim; Hounkonnou, Mahouton Norbert; Nuel, Gregory
2016-01-01
In the context of multivariate multilevel data analysis, this paper focuses on the multivariate linear mixed-effects model, including all the correlations between the random effects when the dimensional residual terms are assumed uncorrelated. Using the EM algorithm, we suggest more general expressions of the model's parameters estimators. These estimators can be used in the framework of the multivariate longitudinal data analysis as well as in the more general context of the analysis of multivariate multilevel data. By using a likelihood ratio test, we test the significance of the correlations between the random effects of two dependent variables of the model, in order to investigate whether or not it is useful to model these dependent variables jointly. Simulation studies are done to assess both the parameter recovery performance of the EM estimators and the power of the test. Using two empirical data sets which are of longitudinal multivariate type and multivariate multilevel type, respectively, the usefulness of the test is illustrated.
Hanks, Ephraim M.; Schliep, Erin M.; Hooten, Mevin B.; Hoeting, Jennifer A.
2015-01-01
In spatial generalized linear mixed models (SGLMMs), covariates that are spatially smooth are often collinear with spatially smooth random effects. This phenomenon is known as spatial confounding and has been studied primarily in the case where the spatial support of the process being studied is discrete (e.g., areal spatial data). In this case, the most common approach suggested is restricted spatial regression (RSR) in which the spatial random effects are constrained to be orthogonal to the fixed effects. We consider spatial confounding and RSR in the geostatistical (continuous spatial support) setting. We show that RSR provides computational benefits relative to the confounded SGLMM, but that Bayesian credible intervals under RSR can be inappropriately narrow under model misspecification. We propose a posterior predictive approach to alleviating this potential problem and discuss the appropriateness of RSR in a variety of situations. We illustrate RSR and SGLMM approaches through simulation studies and an analysis of malaria frequencies in The Gambia, Africa.
Tobacco Town: Computational Modeling of Policy Options to Reduce Tobacco Retailer Density.
Luke, Douglas A; Hammond, Ross A; Combs, Todd; Sorg, Amy; Kasman, Matt; Mack-Crane, Austen; Ribisl, Kurt M; Henriksen, Lisa
2017-05-01
To identify the behavioral mechanisms and effects of tobacco control policies designed to reduce tobacco retailer density. We developed the Tobacco Town agent-based simulation model to examine 4 types of retailer reduction policies: (1) random retailer reduction, (2) restriction by type of retailer, (3) limiting proximity of retailers to schools, and (4) limiting proximity of retailers to each other. The model examined the effects of these policies alone and in combination across 4 different types of towns, defined by 2 levels of population density (urban vs suburban) and 2 levels of income (higher vs lower). Model results indicated that reduction of retailer density has the potential to decrease accessibility of tobacco products by driving up search and purchase costs. Policy effects varied by town type: proximity policies worked better in dense, urban towns whereas retailer type and random retailer reduction worked better in less-dense, suburban settings. Comprehensive retailer density reduction policies have excellent potential to reduce the public health burden of tobacco use in communities.
Mathematics of gravitational lensing: multiple imaging and magnification
NASA Astrophysics Data System (ADS)
Petters, A. O.; Werner, M. C.
2010-09-01
The mathematical theory of gravitational lensing has revealed many generic and global properties. Beginning with multiple imaging, we review Morse-theoretic image counting formulas and lower bound results, and complex-algebraic upper bounds in the case of single and multiple lens planes. We discuss recent advances in the mathematics of stochastic lensing, discussing a general formula for the global expected number of minimum lensed images as well as asymptotic formulas for the probability densities of the microlensing random time delay functions, random lensing maps, and random shear, and an asymptotic expression for the global expected number of micro-minima. Multiple imaging in optical geometry and a spacetime setting are treated. We review global magnification relation results for model-dependent scenarios and cover recent developments on universal local magnification relations for higher order caustics.
Predicting network modules of cell cycle regulators using relative protein abundance statistics.
Oguz, Cihan; Watson, Layne T; Baumann, William T; Tyson, John J
2017-02-28
Parameter estimation in systems biology is typically done by enforcing experimental observations through an objective function as the parameter space of a model is explored by numerical simulations. Past studies have shown that one usually finds a set of "feasible" parameter vectors that fit the available experimental data equally well, and that these alternative vectors can make different predictions under novel experimental conditions. In this study, we characterize the feasible region of a complex model of the budding yeast cell cycle under a large set of discrete experimental constraints in order to test whether the statistical features of relative protein abundance predictions are influenced by the topology of the cell cycle regulatory network. Using differential evolution, we generate an ensemble of feasible parameter vectors that reproduce the phenotypes (viable or inviable) of wild-type yeast cells and 110 mutant strains. We use this ensemble to predict the phenotypes of 129 mutant strains for which experimental data is not available. We identify 86 novel mutants that are predicted to be viable and then rank the cell cycle proteins in terms of their contributions to cumulative variability of relative protein abundance predictions. Proteins involved in "regulation of cell size" and "regulation of G1/S transition" contribute most to predictive variability, whereas proteins involved in "positive regulation of transcription involved in exit from mitosis," "mitotic spindle assembly checkpoint" and "negative regulation of cyclin-dependent protein kinase by cyclin degradation" contribute the least. These results suggest that the statistics of these predictions may be generating patterns specific to individual network modules (START, S/G2/M, and EXIT). To test this hypothesis, we develop random forest models for predicting the network modules of cell cycle regulators using relative abundance statistics as model inputs. Predictive performance is assessed by the areas under receiver operating characteristics curves (AUC). Our models generate an AUC range of 0.83-0.87 as opposed to randomized models with AUC values around 0.50. By using differential evolution and random forest modeling, we show that the model prediction statistics generate distinct network module-specific patterns within the cell cycle network.
A model for incomplete longitudinal multivariate ordinal data.
Liu, Li C
2008-12-30
In studies where multiple outcome items are repeatedly measured over time, missing data often occur. A longitudinal item response theory model is proposed for analysis of multivariate ordinal outcomes that are repeatedly measured. Under the MAR assumption, this model accommodates missing data at any level (missing item at any time point and/or missing time point). It allows for multiple random subject effects and the estimation of item discrimination parameters for the multiple outcome items. The covariates in the model can be at any level. Assuming either a probit or logistic response function, maximum marginal likelihood estimation is described utilizing multidimensional Gauss-Hermite quadrature for integration of the random effects. An iterative Fisher-scoring solution, which provides standard errors for all model parameters, is used. A data set from a longitudinal prevention study is used to motivate the application of the proposed model. In this study, multiple ordinal items of health behavior are repeatedly measured over time. Because of a planned missing design, subjects answered only two-third of all items at a given point. Copyright 2008 John Wiley & Sons, Ltd.
Modeling stock price dynamics by continuum percolation system and relevant complex systems analysis
NASA Astrophysics Data System (ADS)
Xiao, Di; Wang, Jun
2012-10-01
The continuum percolation system is developed to model a random stock price process in this work. Recent empirical research has demonstrated various statistical features of stock price changes, the financial model aiming at understanding price fluctuations needs to define a mechanism for the formation of the price, in an attempt to reproduce and explain this set of empirical facts. The continuum percolation model is usually referred to as a random coverage process or a Boolean model, the local interaction or influence among traders is constructed by the continuum percolation, and a cluster of continuum percolation is applied to define the cluster of traders sharing the same opinion about the market. We investigate and analyze the statistical behaviors of normalized returns of the price model by some analysis methods, including power-law tail distribution analysis, chaotic behavior analysis and Zipf analysis. Moreover, we consider the daily returns of Shanghai Stock Exchange Composite Index from January 1997 to July 2011, and the comparisons of return behaviors between the actual data and the simulation data are exhibited.
Password-only authenticated three-party key exchange with provable security in the standard model.
Nam, Junghyun; Choo, Kim-Kwang Raymond; Kim, Junghwan; Kang, Hyun-Kyu; Kim, Jinsoo; Paik, Juryon; Won, Dongho
2014-01-01
Protocols for password-only authenticated key exchange (PAKE) in the three-party setting allow two clients registered with the same authentication server to derive a common secret key from their individual password shared with the server. Existing three-party PAKE protocols were proven secure under the assumption of the existence of random oracles or in a model that does not consider insider attacks. Therefore, these protocols may turn out to be insecure when the random oracle is instantiated with a particular hash function or an insider attack is mounted against the partner client. The contribution of this paper is to present the first three-party PAKE protocol whose security is proven without any idealized assumptions in a model that captures insider attacks. The proof model we use is a variant of the indistinguishability-based model of Bellare, Pointcheval, and Rogaway (2000), which is one of the most widely accepted models for security analysis of password-based key exchange protocols. We demonstrated that our protocol achieves not only the typical indistinguishability-based security of session keys but also the password security against undetectable online dictionary attacks.
ERIC Educational Resources Information Center
Henry, James A.; Thielman, Emily J.; Zaugg, Tara L.; Kaelin, Christine; Schmidt, Caroline J.; Griest, Susan; McMillan, Garnett P.; Myers, Paula; Rivera, Izel; Baldwin, Robert; Carlson, Kathleen
2017-01-01
Purpose: This randomized controlled trial evaluated, within clinical settings, the effectiveness of coping skills education that is provided with progressive tinnitus management (PTM). Method: At 2 Veterans Affairs medical centers, N = 300 veterans were randomized to either PTM intervention or 6-month wait-list control. The PTM intervention…
The influence of negative training set size on machine learning-based virtual screening.
Kurczab, Rafał; Smusz, Sabina; Bojarski, Andrzej J
2014-01-01
The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods. The impact of this rather neglected aspect of machine learning methods application was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluating parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of dynamics of those variations let us recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, Ibk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set. In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
The influence of negative training set size on machine learning-based virtual screening
2014-01-01
Background The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods. Results The impact of this rather neglected aspect of machine learning methods application was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluating parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of dynamics of those variations let us recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, Ibk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set. Conclusions In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening. PMID:24976867
Coppack, Russell J; Kristensen, Jakob; Karageorghis, Costas I
2012-11-01
To examine the effects of a goal setting intervention on self-efficacy, treatment efficacy, adherence and treatment outcome in patients undergoing low back pain rehabilitation. A mixed-model 2 (time) × 3 (group) randomized controlled trial. A residential rehabilitation centre for military personnel. UK military personnel volunteers (N = 48); mean age was 32.9 (SD 7.9) with a diagnosis of non-specific low back pain. Subjects were randomly assigned to either a goal setting experimental group (Exp, n = 16), therapist-led exercise therapy group (C1, n = 16) or non-therapist-led exercise therapy group (C2, n = 16). Treatment duration for all groups was three weeks. Self-efficacy, treatment efficacy and treatment outcome were recorded before and after the treatment period. Adherence was rated during regularly scheduled treatment sessions using the Sports Injury Rehabilitation Adherence Scale (SIRAS). The Biering-Sørensen test was used as the primary measure of treatment outcome. ANCOVA results showed that adherence scores were significantly higher in the experimental group (13.70 ± 1.58) compared with C2 (11.74 ± 1.35), (P < 0.025). There was no significant difference for adherence between the experimental group and C1 (P = 0.13). Self-efficacy was significantly higher in the experimental group compared to both C1 and C2 (P < 0.05), whereas no significant difference was found for treatment efficacy. Treatment outcome did not differ significantly between the experimental and two control groups. The findings provide partial support for the use of goal setting to enhance adherence in clinical rehabilitation.
CHRR: coordinate hit-and-run with rounding for uniform sampling of constraint-based models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Haraldsdóttir, Hulda S.; Cousins, Ben; Thiele, Ines
In constraint-based metabolic modelling, physical and biochemical constraints define a polyhedral convex set of feasible flux vectors. Uniform sampling of this set provides an unbiased characterization of the metabolic capabilities of a biochemical network. However, reliable uniform sampling of genome-scale biochemical networks is challenging due to their high dimensionality and inherent anisotropy. Here, we present an implementation of a new sampling algorithm, coordinate hit-and-run with rounding (CHRR). This algorithm is based on the provably efficient hit-and-run random walk and crucially uses a preprocessing step to round the anisotropic flux set. CHRR provably converges to a uniform stationary sampling distribution. Wemore » apply it to metabolic networks of increasing dimensionality. We show that it converges several times faster than a popular artificial centering hit-and-run algorithm, enabling reliable and tractable sampling of genome-scale biochemical networks.« less
CHRR: coordinate hit-and-run with rounding for uniform sampling of constraint-based models
Haraldsdóttir, Hulda S.; Cousins, Ben; Thiele, Ines; ...
2017-01-31
In constraint-based metabolic modelling, physical and biochemical constraints define a polyhedral convex set of feasible flux vectors. Uniform sampling of this set provides an unbiased characterization of the metabolic capabilities of a biochemical network. However, reliable uniform sampling of genome-scale biochemical networks is challenging due to their high dimensionality and inherent anisotropy. Here, we present an implementation of a new sampling algorithm, coordinate hit-and-run with rounding (CHRR). This algorithm is based on the provably efficient hit-and-run random walk and crucially uses a preprocessing step to round the anisotropic flux set. CHRR provably converges to a uniform stationary sampling distribution. Wemore » apply it to metabolic networks of increasing dimensionality. We show that it converges several times faster than a popular artificial centering hit-and-run algorithm, enabling reliable and tractable sampling of genome-scale biochemical networks.« less
Data-driven Modeling of Metal-oxide Sensors with Dynamic Bayesian Networks
NASA Astrophysics Data System (ADS)
Gosangi, Rakesh; Gutierrez-Osuna, Ricardo
2011-09-01
We present a data-driven probabilistic framework to model the transient response of MOX sensors modulated with a sequence of voltage steps. Analytical models of MOX sensors are usually built based on the physico-chemical properties of the sensing materials. Although building these models provides an insight into the sensor behavior, they also require a thorough understanding of the underlying operating principles. Here we propose a data-driven approach to characterize the dynamical relationship between sensor inputs and outputs. Namely, we use dynamic Bayesian networks (DBNs), probabilistic models that represent temporal relations between a set of random variables. We identify a set of control variables that influence the sensor responses, create a graphical representation that captures the causal relations between these variables, and finally train the model with experimental data. We validated the approach on experimental data in terms of predictive accuracy and classification performance. Our results show that DBNs can accurately predict the dynamic response of MOX sensors, as well as capture the discriminatory information present in the sensor transients.
Empirical likelihood inference in randomized clinical trials.
Zhang, Biao
2017-01-01
In individually randomized controlled trials, in addition to the primary outcome, information is often available on a number of covariates prior to randomization. This information is frequently utilized to undertake adjustment for baseline characteristics in order to increase precision of the estimation of average treatment effects; such adjustment is usually performed via covariate adjustment in outcome regression models. Although the use of covariate adjustment is widely seen as desirable for making treatment effect estimates more precise and the corresponding hypothesis tests more powerful, there are considerable concerns that objective inference in randomized clinical trials can potentially be compromised. In this paper, we study an empirical likelihood approach to covariate adjustment and propose two unbiased estimating functions that automatically decouple evaluation of average treatment effects from regression modeling of covariate-outcome relationships. The resulting empirical likelihood estimator of the average treatment effect is as efficient as the existing efficient adjusted estimators 1 when separate treatment-specific working regression models are correctly specified, yet are at least as efficient as the existing efficient adjusted estimators 1 for any given treatment-specific working regression models whether or not they coincide with the true treatment-specific covariate-outcome relationships. We present a simulation study to compare the finite sample performance of various methods along with some results on analysis of a data set from an HIV clinical trial. The simulation results indicate that the proposed empirical likelihood approach is more efficient and powerful than its competitors when the working covariate-outcome relationships by treatment status are misspecified.
Leveraging prognostic baseline variables to gain precision in randomized trials
Colantuoni, Elizabeth; Rosenblum, Michael
2015-01-01
We focus on estimating the average treatment effect in a randomized trial. If baseline variables are correlated with the outcome, then appropriately adjusting for these variables can improve precision. An example is the analysis of covariance (ANCOVA) estimator, which applies when the outcome is continuous, the quantity of interest is the difference in mean outcomes comparing treatment versus control, and a linear model with only main effects is used. ANCOVA is guaranteed to be at least as precise as the standard unadjusted estimator, asymptotically, under no parametric model assumptions and also is locally semiparametric efficient. Recently, several estimators have been developed that extend these desirable properties to more general settings that allow any real-valued outcome (e.g., binary or count), contrasts other than the difference in mean outcomes (such as the relative risk), and estimators based on a large class of generalized linear models (including logistic regression). To the best of our knowledge, we give the first simulation study in the context of randomized trials that compares these estimators. Furthermore, our simulations are not based on parametric models; instead, our simulations are based on resampling data from completed randomized trials in stroke and HIV in order to assess estimator performance in realistic scenarios. We provide practical guidance on when these estimators are likely to provide substantial precision gains and describe a quick assessment method that allows clinical investigators to determine whether these estimators could be useful in their specific trial contexts. PMID:25872751
Austin, Peter C.; van Klaveren, David; Vergouwe, Yvonne; Nieboer, Daan; Lee, Douglas S.; Steyerberg, Ewout W.
2017-01-01
Objective Validation of clinical prediction models traditionally refers to the assessment of model performance in new patients. We studied different approaches to geographic and temporal validation in the setting of multicenter data from two time periods. Study Design and Setting We illustrated different analytic methods for validation using a sample of 14,857 patients hospitalized with heart failure at 90 hospitals in two distinct time periods. Bootstrap resampling was used to assess internal validity. Meta-analytic methods were used to assess geographic transportability. Each hospital was used once as a validation sample, with the remaining hospitals used for model derivation. Hospital-specific estimates of discrimination (c-statistic) and calibration (calibration intercepts and slopes) were pooled using random effects meta-analysis methods. I2 statistics and prediction interval width quantified geographic transportability. Temporal transportability was assessed using patients from the earlier period for model derivation and patients from the later period for model validation. Results Estimates of reproducibility, pooled hospital-specific performance, and temporal transportability were on average very similar, with c-statistics of 0.75. Between-hospital variation was moderate according to I2 statistics and prediction intervals for c-statistics. Conclusion This study illustrates how performance of prediction models can be assessed in settings with multicenter data at different time periods. PMID:27262237
2012-01-01
Background With the current focus on personalized medicine, patient/subject level inference is often of key interest in translational research. As a result, random effects models (REM) are becoming popular for patient level inference. However, for very large data sets that are characterized by large sample size, it can be difficult to fit REM using commonly available statistical software such as SAS since they require inordinate amounts of computer time and memory allocations beyond what are available preventing model convergence. For example, in a retrospective cohort study of over 800,000 Veterans with type 2 diabetes with longitudinal data over 5 years, fitting REM via generalized linear mixed modeling using currently available standard procedures in SAS (e.g. PROC GLIMMIX) was very difficult and same problems exist in Stata’s gllamm or R’s lme packages. Thus, this study proposes and assesses the performance of a meta regression approach and makes comparison with methods based on sampling of the full data. Data We use both simulated and real data from a national cohort of Veterans with type 2 diabetes (n=890,394) which was created by linking multiple patient and administrative files resulting in a cohort with longitudinal data collected over 5 years. Methods and results The outcome of interest was mean annual HbA1c measured over a 5 years period. Using this outcome, we compared parameter estimates from the proposed random effects meta regression (REMR) with estimates based on simple random sampling and VISN (Veterans Integrated Service Networks) based stratified sampling of the full data. Our results indicate that REMR provides parameter estimates that are less likely to be biased with tighter confidence intervals when the VISN level estimates are homogenous. Conclusion When the interest is to fit REM in repeated measures data with very large sample size, REMR can be used as a good alternative. It leads to reasonable inference for both Gaussian and non-Gaussian responses if parameter estimates are homogeneous across VISNs. PMID:23095325
Rochefort, Christian M; Buckeridge, David L; Tanguay, Andréanne; Biron, Alain; D'Aragon, Frédérick; Wang, Shengrui; Gallix, Benoit; Valiquette, Louis; Audet, Li-Anne; Lee, Todd C; Jayaraman, Dev; Petrucci, Bruno; Lefebvre, Patricia
2017-02-16
Adverse events (AEs) in acute care hospitals are frequent and associated with significant morbidity, mortality, and costs. Measuring AEs is necessary for quality improvement and benchmarking purposes, but current detection methods lack in accuracy, efficiency, and generalizability. The growing availability of electronic health records (EHR) and the development of natural language processing techniques for encoding narrative data offer an opportunity to develop potentially better methods. The purpose of this study is to determine the accuracy and generalizability of using automated methods for detecting three high-incidence and high-impact AEs from EHR data: a) hospital-acquired pneumonia, b) ventilator-associated event and, c) central line-associated bloodstream infection. This validation study will be conducted among medical, surgical and ICU patients admitted between 2013 and 2016 to the Centre hospitalier universitaire de Sherbrooke (CHUS) and the McGill University Health Centre (MUHC), which has both French and English sites. A random 60% sample of CHUS patients will be used for model development purposes (cohort 1, development set). Using a random sample of these patients, a reference standard assessment of their medical chart will be performed. Multivariate logistic regression and the area under the curve (AUC) will be employed to iteratively develop and optimize three automated AE detection models (i.e., one per AE of interest) using EHR data from the CHUS. These models will then be validated on a random sample of the remaining 40% of CHUS patients (cohort 1, internal validation set) using chart review to assess accuracy. The most accurate models developed and validated at the CHUS will then be applied to EHR data from a random sample of patients admitted to the MUHC French site (cohort 2) and English site (cohort 3)-a critical requirement given the use of narrative data -, and accuracy will be assessed using chart review. Generalizability will be determined by comparing AUCs from cohorts 2 and 3 to those from cohort 1. This study will likely produce more accurate and efficient measures of AEs. These measures could be used to assess the incidence rates of AEs, evaluate the success of preventive interventions, or benchmark performance across hospitals.
Frequentist Model Averaging in Structural Equation Modelling.
Jin, Shaobo; Ankargren, Sebastian
2018-06-04
Model selection from a set of candidate models plays an important role in many structural equation modelling applications. However, traditional model selection methods introduce extra randomness that is not accounted for by post-model selection inference. In the current study, we propose a model averaging technique within the frequentist statistical framework. Instead of selecting an optimal model, the contributions of all candidate models are acknowledged. Valid confidence intervals and a [Formula: see text] test statistic are proposed. A simulation study shows that the proposed method is able to produce a robust mean-squared error, a better coverage probability, and a better goodness-of-fit test compared to model selection. It is an interesting compromise between model selection and the full model.
Silkworm cocoons inspire models for random fiber and particulate composites
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen Fujia; Porter, David; Vollrath, Fritz
The bioengineering design principles evolved in silkworm cocoons make them ideal natural prototypes and models for structural composites. Cocoons depend for their stiffness and strength on the connectivity of bonding between their constituent materials of silk fibers and sericin binder. Strain-activated mechanisms for loss of bonding connectivity in cocoons can be translated directly into a surprisingly simple yet universal set of physically realistic as well as predictive quantitative structure-property relations for a wide range of technologically important fiber and particulate composite materials.
Silkworm cocoons inspire models for random fiber and particulate composites
NASA Astrophysics Data System (ADS)
Chen, Fujia; Porter, David; Vollrath, Fritz
2010-10-01
The bioengineering design principles evolved in silkworm cocoons make them ideal natural prototypes and models for structural composites. Cocoons depend for their stiffness and strength on the connectivity of bonding between their constituent materials of silk fibers and sericin binder. Strain-activated mechanisms for loss of bonding connectivity in cocoons can be translated directly into a surprisingly simple yet universal set of physically realistic as well as predictive quantitative structure-property relations for a wide range of technologically important fiber and particulate composite materials.
Eroglu, Duygu Yilmaz; Ozmutlu, H Cenk
2014-01-01
We developed mixed integer programming (MIP) models and hybrid genetic-local search algorithms for the scheduling problem of unrelated parallel machines with job sequence and machine-dependent setup times and with job splitting property. The first contribution of this paper is to introduce novel algorithms which make splitting and scheduling simultaneously with variable number of subjobs. We proposed simple chromosome structure which is constituted by random key numbers in hybrid genetic-local search algorithm (GAspLA). Random key numbers are used frequently in genetic algorithms, but it creates additional difficulty when hybrid factors in local search are implemented. We developed algorithms that satisfy the adaptation of results of local search into the genetic algorithms with minimum relocation operation of genes' random key numbers. This is the second contribution of the paper. The third contribution of this paper is three developed new MIP models which are making splitting and scheduling simultaneously. The fourth contribution of this paper is implementation of the GAspLAMIP. This implementation let us verify the optimality of GAspLA for the studied combinations. The proposed methods are tested on a set of problems taken from the literature and the results validate the effectiveness of the proposed algorithms.
Knowledge diffusion of dynamical network in terms of interaction frequency.
Liu, Jian-Guo; Zhou, Qing; Guo, Qiang; Yang, Zhen-Hua; Xie, Fei; Han, Jing-Ti
2017-09-07
In this paper, we present a knowledge diffusion (SKD) model for dynamic networks by taking into account the interaction frequency which always used to measure the social closeness. A set of agents, which are initially interconnected to form a random network, either exchange knowledge with their neighbors or move toward a new location through an edge-rewiring procedure. The activity of knowledge exchange between agents is determined by a knowledge transfer rule that the target node would preferentially select one neighbor node to transfer knowledge with probability p according to their interaction frequency instead of the knowledge distance, otherwise, the target node would build a new link with its second-order neighbor preferentially or select one node in the system randomly with probability 1 - p. The simulation results show that, comparing with the Null model defined by the random selection mechanism and the traditional knowledge diffusion (TKD) model driven by knowledge distance, the knowledge would spread more fast based on SKD driven by interaction frequency. In particular, the network structure of SKD would evolve as an assortative one, which is a fundamental feature of social networks. This work would be helpful for deeply understanding the coevolution of the knowledge diffusion and network structure.
Technology diffusion in hospitals: a log odds random effects regression model.
Blank, Jos L T; Valdmanis, Vivian G
2015-01-01
This study identifies the factors that affect the diffusion of hospital innovations. We apply a log odds random effects regression model on hospital micro data. We introduce the concept of clustering innovations and the application of a log odds random effects regression model to describe the diffusion of technologies. We distinguish a number of determinants, such as service, physician, and environmental, financial and organizational characteristics of the 60 Dutch hospitals in our sample. On the basis of this data set on Dutch general hospitals over the period 1995-2002, we conclude that there is a relation between a number of determinants and the diffusion of innovations underlining conclusions from earlier research. Positive effects were found on the basis of the size of the hospitals, competition and a hospital's commitment to innovation. It appears that if a policy is developed to further diffuse innovations, the external effects of demand and market competition need to be examined, which would de facto lead to an efficient use of technology. For the individual hospital, instituting an innovations office appears to be the most prudent course of action. © 2013 The Authors. International Journal of Health Planning and Management published by John Wiley & Sons, Ltd.
Introducing two Random Forest based methods for cloud detection in remote sensing images
NASA Astrophysics Data System (ADS)
Ghasemian, Nafiseh; Akhoondzadeh, Mehdi
2018-07-01
Cloud detection is a necessary phase in satellite images processing to retrieve the atmospheric and lithospheric parameters. Currently, some cloud detection methods based on Random Forest (RF) model have been proposed but they do not consider both spectral and textural characteristics of the image. Furthermore, they have not been tested in the presence of snow/ice. In this paper, we introduce two RF based algorithms, Feature Level Fusion Random Forest (FLFRF) and Decision Level Fusion Random Forest (DLFRF) to incorporate visible, infrared (IR) and thermal spectral and textural features (FLFRF) including Gray Level Co-occurrence Matrix (GLCM) and Robust Extended Local Binary Pattern (RELBP_CI) or visible, IR and thermal classifiers (DLFRF) for highly accurate cloud detection on remote sensing images. FLFRF first fuses visible, IR and thermal features. Thereafter, it uses the RF model to classify pixels to cloud, snow/ice and background or thick cloud, thin cloud and background. DLFRF considers visible, IR and thermal features (both spectral and textural) separately and inserts each set of features to RF model. Then, it holds vote matrix of each run of the model. Finally, it fuses the classifiers using the majority vote method. To demonstrate the effectiveness of the proposed algorithms, 10 Terra MODIS and 15 Landsat 8 OLI/TIRS images with different spatial resolutions are used in this paper. Quantitative analyses are based on manually selected ground truth data. Results show that after adding RELBP_CI to input feature set cloud detection accuracy improves. Also, the average cloud kappa values of FLFRF and DLFRF on MODIS images (1 and 0.99) are higher than other machine learning methods, Linear Discriminate Analysis (LDA), Classification And Regression Tree (CART), K Nearest Neighbor (KNN) and Support Vector Machine (SVM) (0.96). The average snow/ice kappa values of FLFRF and DLFRF on MODIS images (1 and 0.85) are higher than other traditional methods. The quantitative values on Landsat 8 images show similar trend. Consequently, while SVM and K-nearest neighbor show overestimation in predicting cloud and snow/ice pixels, our Random Forest (RF) based models can achieve higher cloud, snow/ice kappa values on MODIS and thin cloud, thick cloud and snow/ice kappa values on Landsat 8 images. Our algorithms predict both thin and thick cloud on Landsat 8 images while the existing cloud detection algorithm, Fmask cannot discriminate them. Compared to the state-of-the-art methods, our algorithms have acquired higher average cloud and snow/ice kappa values for different spatial resolutions.
Theory and generation of conditional, scalable sub-Gaussian random fields
NASA Astrophysics Data System (ADS)
Panzeri, M.; Riva, M.; Guadagnini, A.; Neuman, S. P.
2016-03-01
Many earth and environmental (as well as a host of other) variables, Y, and their spatial (or temporal) increments, ΔY, exhibit non-Gaussian statistical scaling. Previously we were able to capture key aspects of such non-Gaussian scaling by treating Y and/or ΔY as sub-Gaussian random fields (or processes). This however left unaddressed the empirical finding that whereas sample frequency distributions of Y tend to display relatively mild non-Gaussian peaks and tails, those of ΔY often reveal peaks that grow sharper and tails that become heavier with decreasing separation distance or lag. Recently we proposed a generalized sub-Gaussian model (GSG) which resolves this apparent inconsistency between the statistical scaling behaviors of observed variables and their increments. We presented an algorithm to generate unconditional random realizations of statistically isotropic or anisotropic GSG functions and illustrated it in two dimensions. Most importantly, we demonstrated the feasibility of estimating all parameters of a GSG model underlying a single realization of Y by analyzing jointly spatial moments of Y data and corresponding increments, ΔY. Here, we extend our GSG model to account for noisy measurements of Y at a discrete set of points in space (or time), present an algorithm to generate conditional realizations of corresponding isotropic or anisotropic random fields, introduce two approximate versions of this algorithm to reduce CPU time, and explore them on one and two-dimensional synthetic test cases.
Meta-analysis of two studies in the presence of heterogeneity with applications in rare diseases.
Friede, Tim; Röver, Christian; Wandel, Simon; Neuenschwander, Beat
2017-07-01
Random-effects meta-analyses are used to combine evidence of treatment effects from multiple studies. Since treatment effects may vary across trials due to differences in study characteristics, heterogeneity in treatment effects between studies must be accounted for to achieve valid inference. The standard model for random-effects meta-analysis assumes approximately normal effect estimates and a normal random-effects model. However, standard methods based on this model ignore the uncertainty in estimating the between-trial heterogeneity. In the special setting of only two studies and in the presence of heterogeneity, we investigate here alternatives such as the Hartung-Knapp-Sidik-Jonkman method (HKSJ), the modified Knapp-Hartung method (mKH, a variation of the HKSJ method) and Bayesian random-effects meta-analyses with priors covering plausible heterogeneity values; R code to reproduce the examples is presented in an appendix. The properties of these methods are assessed by applying them to five examples from various rare diseases and by a simulation study. Whereas the standard method based on normal quantiles has poor coverage, the HKSJ and mKH generally lead to very long, and therefore inconclusive, confidence intervals. The Bayesian intervals on the whole show satisfying properties and offer a reasonable compromise between these two extremes. © 2016 The Authors. Biometrical Journal published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
A random forest algorithm for nowcasting of intense precipitation events
NASA Astrophysics Data System (ADS)
Das, Saurabh; Chakraborty, Rohit; Maitra, Animesh
2017-09-01
Automatic nowcasting of convective initiation and thunderstorms has potential applications in several sectors including aviation planning and disaster management. In this paper, random forest based machine learning algorithm is tested for nowcasting of convective rain with a ground based radiometer. Brightness temperatures measured at 14 frequencies (7 frequencies in 22-31 GHz band and 7 frequencies in 51-58 GHz bands) are utilized as the inputs of the model. The lower frequency band is associated to the water vapor absorption whereas the upper frequency band relates to the oxygen absorption and hence, provide information on the temperature and humidity of the atmosphere. Synthetic minority over-sampling technique is used to balance the data set and 10-fold cross validation is used to assess the performance of the model. Results indicate that random forest algorithm with fixed alarm generation time of 30 min and 60 min performs quite well (probability of detection of all types of weather condition ∼90%) with low false alarms. It is, however, also observed that reducing the alarm generation time improves the threat score significantly and also decreases false alarms. The proposed model is found to be very sensitive to the boundary layer instability as indicated by the variable importance measure. The study shows the suitability of a random forest algorithm for nowcasting application utilizing a large number of input parameters from diverse sources and can be utilized in other forecasting problems.
Privacy-preserving outlier detection through random nonlinear data distortion.
Bhaduri, Kanishka; Stefanski, Mark D; Srivastava, Ashok N
2011-02-01
Consider a scenario in which the data owner has some private or sensitive data and wants a data miner to access them for studying important patterns without revealing the sensitive information. Privacy-preserving data mining aims to solve this problem by randomly transforming the data prior to their release to the data miners. Previous works only considered the case of linear data perturbations--additive, multiplicative, or a combination of both--for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy-preserving anomaly detection from sensitive data sets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that, for specific cases, it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. The experiments conducted on real-life data sets demonstrate the effectiveness of the approach.
NASA Astrophysics Data System (ADS)
Wang, Yu; Guo, Yanzhi; Kuang, Qifan; Pu, Xuemei; Ji, Yue; Zhang, Zhihang; Li, Menglong
2015-04-01
The assessment of binding affinity between ligands and the target proteins plays an essential role in drug discovery and design process. As an alternative to widely used scoring approaches, machine learning methods have also been proposed for fast prediction of the binding affinity with promising results, but most of them were developed as all-purpose models despite of the specific functions of different protein families, since proteins from different function families always have different structures and physicochemical features. In this study, we proposed a random forest method to predict the protein-ligand binding affinity based on a comprehensive feature set covering protein sequence, binding pocket, ligand structure and intermolecular interaction. Feature processing and compression was respectively implemented for different protein family datasets, which indicates that different features contribute to different models, so individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families of HIV-1 protease, trypsin and carbonic anhydrase respectively. As a comparison, two generic models including diverse protein families were also built. The evaluation results show that models on family-specific datasets have the superior performance to those on the generic datasets and the Pearson and Spearman correlation coefficients ( R p and Rs) on the test sets are 0.740, 0.874, 0.735 and 0.697, 0.853, 0.723 for HIV-1 protease, trypsin and carbonic anhydrase respectively. Comparisons with the other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way in predicting the affinity of one particular protein family.
2007-11-01
Florea, Anne-Laure Jousselme, Éloi Bossé ; DRDC Valcartier TR 2003-319 ; R & D pour la défense Canada – Valcartier ; novembre 2007. Contexte : Pour...12 3.3.2 Imprecise information . . . . . . . . . . . . . . . . . . . . . 13 3.3.3 Uncertain and imprecise information...information proposed by Philippe Smets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 5: The process of information modelling
Thomas P. Holmes; Kevin J. Boyle
2005-01-01
A hybrid stated-preference model is presented that combines the referendum contingent valuation response format with an experimentally designed set of attributes. A sequence of valuation questions is asked to a random sample in a mailout mail-back format. Econometric analysis shows greater discrimination between alternatives in the final choice in the sequence, and the...
Chemical name extraction based on automatic training data generation and rich feature set.
Yan, Su; Spangler, W Scott; Chen, Ying
2013-01-01
The automation of extracting chemical names from text has significant value to biomedical and life science research. A major barrier in this task is the difficulty of getting a sizable and good quality data to train a reliable entity extraction model. Another difficulty is the selection of informative features of chemical names, since comprehensive domain knowledge on chemistry nomenclature is required. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for the task of chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we are able to generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze the construction of chemical names based on the incomplete dictionary, and propose a series of new features, without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution shows better or comparable results in annotating real-world data with less human effort. Moreover, we report an interesting observation about the language for chemical names. That is, both the structural and semantic components of chemical names follow a Zipfian distribution, which resembles many natural languages.
Beccaria, Marco; Mellors, Theodore R; Petion, Jacky S; Rees, Christiaan A; Nasir, Mavra; Systrom, Hannah K; Sairistil, Jean W; Jean-Juste, Marc-Antoine; Rivera, Vanessa; Lavoile, Kerline; Severe, Patrice; Pape, Jean W; Wright, Peter F; Hill, Jane E
2018-02-01
Tuberculosis (TB) remains a global public health malady that claims almost 1.8 million lives annually. Diagnosis of TB represents perhaps one of the most challenging aspects of tuberculosis control. Gold standards for diagnosis of active TB (culture and nucleic acid amplification) are sputum-dependent, however, in up to a third of TB cases, an adequate biological sputum sample is not readily available. The analysis of exhaled breath, as an alternative to sputum-dependent tests, has the potential to provide a simple, fast, and non-invasive, and ready-available diagnostic service that could positively change TB detection. Human breath has been evaluated in the setting of active tuberculosis using thermal desorption-comprehensive two-dimensional gas chromatography-time of flight mass spectrometry methodology. From the entire spectrum of volatile metabolites in breath, three random forest machine learning models were applied leading to the generation of a panel of 46 breath features. The twenty-two common features within each random forest model used were selected as a set that could distinguish subjects with confirmed pulmonary M. tuberculosis infection and people with other pathologies than TB. Copyright © 2018 Elsevier B.V. All rights reserved.
In vivo growth of 60 non-screening detected lung cancers: a computed tomography study.
Mets, Onno M; Chung, Kaman; Zanen, Pieter; Scholten, Ernst T; Veldhuis, Wouter B; van Ginneken, Bram; Prokop, Mathias; Schaefer-Prokop, Cornelia M; de Jong, Pim A
2018-04-01
Current pulmonary nodule management guidelines are based on nodule volume doubling time, which assumes exponential growth behaviour. However, this is a theory that has never been validated in vivo in the routine-care target population. This study evaluates growth patterns of untreated solid and subsolid lung cancers of various histologies in a non-screening setting.Growth behaviour of pathology-proven lung cancers from two academic centres that were imaged at least three times before diagnosis (n=60) was analysed using dedicated software. Random-intercept random-slope mixed-models analysis was applied to test which growth pattern most accurately described lung cancer growth. Individual growth curves were plotted per pathology subgroup and nodule type.We confirmed that growth in both subsolid and solid lung cancers is best explained by an exponential model. However, subsolid lesions generally progress slower than solid ones. Baseline lesion volume was not related to growth, indicating that smaller lesions do not grow slower compared to larger ones.By showing that lung cancer conforms to exponential growth we provide the first experimental basis in the routine-care setting for the assumption made in volume doubling time analysis. Copyright ©ERS 2018.
Controlled recovery of phylogenetic communities from an evolutionary model using a network approach
NASA Astrophysics Data System (ADS)
Sousa, Arthur M. Y. R.; Vieira, André P.; Prado, Carmen P. C.; Andrade, Roberto F. S.
2016-04-01
This works reports the use of a complex network approach to produce a phylogenetic classification tree of a simple evolutionary model. This approach has already been used to treat proteomic data of actual extant organisms, but an investigation of its reliability to retrieve a traceable evolutionary history is missing. The used evolutionary model includes key ingredients for the emergence of groups of related organisms by differentiation through random mutations and population growth, but purposefully omits other realistic ingredients that are not strictly necessary to originate an evolutionary history. This choice causes the model to depend only on a small set of parameters, controlling the mutation probability and the population of different species. Our results indicate that for a set of parameter values, the phylogenetic classification produced by the used framework reproduces the actual evolutionary history with a very high average degree of accuracy. This includes parameter values where the species originated by the evolutionary dynamics have modular structures. In the more general context of community identification in complex networks, our model offers a simple setting for evaluating the effects, on the efficiency of community formation and identification, of the underlying dynamics generating the network itself.
Tsuruta, S; Misztal, I; Strandén, I
2001-05-01
Utility of the preconditioned conjugate gradient algorithm with a diagonal preconditioner for solving mixed-model equations in animal breeding applications was evaluated with 16 test problems. The problems included single- and multiple-trait analyses, with data on beef, dairy, and swine ranging from small examples to national data sets. Multiple-trait models considered low and high genetic correlations. Convergence was based on relative differences between left- and right-hand sides. The ordering of equations was fixed effects followed by random effects, with no special ordering within random effects. The preconditioned conjugate gradient program implemented with double precision converged for all models. However, when implemented in single precision, the preconditioned conjugate gradient algorithm did not converge for seven large models. The preconditioned conjugate gradient and successive overrelaxation algorithms were subsequently compared for 13 of the test problems. The preconditioned conjugate gradient algorithm was easy to implement with the iteration on data for general models. However, successive overrelaxation requires specific programming for each set of models. On average, the preconditioned conjugate gradient algorithm converged in three times fewer rounds of iteration than successive overrelaxation. With straightforward implementations, programs using the preconditioned conjugate gradient algorithm may be two or more times faster than those using successive overrelaxation. However, programs using the preconditioned conjugate gradient algorithm would use more memory than would comparable implementations using successive overrelaxation. Extensive optimization of either algorithm can influence rankings. The preconditioned conjugate gradient implemented with iteration on data, a diagonal preconditioner, and in double precision may be the algorithm of choice for solving mixed-model equations when sufficient memory is available and ease of implementation is essential.
Training a whole-book LSTM-based recognizer with an optimal training set
NASA Astrophysics Data System (ADS)
Soheili, Mohammad Reza; Yousefi, Mohammad Reza; Kabir, Ehsanollah; Stricker, Didier
2018-04-01
Despite the recent progress in OCR technologies, whole-book recognition, is still a challenging task, in particular in case of old and historical books, that the unknown font faces or low quality of paper and print contributes to the challenge. Therefore, pre-trained recognizers and generic methods do not usually perform up to required standards, and usually the performance degrades for larger scale recognition tasks, such as of a book. Such reportedly low error-rate methods turn out to require a great deal of manual correction. Generally, such methodologies do not make effective use of concepts such redundancy in whole-book recognition. In this work, we propose to train Long Short Term Memory (LSTM) networks on a minimal training set obtained from the book to be recognized. We show that clustering all the sub-words in the book, and using the sub-word cluster centers as the training set for the LSTM network, we can train models that outperform any identical network that is trained with randomly selected pages of the book. In our experiments, we also show that although the sub-word cluster centers are equivalent to about 8 pages of text for a 101- page book, a LSTM network trained on such a set performs competitively compared to an identical network that is trained on a set of 60 randomly selected pages of the book.
Double-Pulse Two-Micron IPDA Lidar Simulation for Airborne Carbon Dioxide Measurements
NASA Technical Reports Server (NTRS)
Refaat, Tamer F.; Singh, Upendra N.; Yu, Jirong; Petros, Mulugeta
2015-01-01
An advanced double-pulsed 2-micron integrated path differential absorption lidar has been developed at NASA Langley Research Center for measuring atmospheric carbon dioxide. The instrument utilizes a state-of-the-art 2-micron laser transmitter with tunable on-line wavelength and advanced receiver. Instrument modeling and airborne simulations are presented in this paper. Focusing on random errors, results demonstrate instrument capabilities of performing precise carbon dioxide differential optical depth measurement with less than 3% random error for single-shot operation from up to 11 km altitude. This study is useful for defining CO2 measurement weighting, instrument setting, validation and sensitivity trade-offs.
Generating synthetic wave climates for coastal modelling: a linear mixed modelling approach
NASA Astrophysics Data System (ADS)
Thomas, C.; Lark, R. M.
2013-12-01
Numerical coastline morphological evolution models require wave climate properties to drive morphological change through time. Wave climate properties (typically wave height, period and direction) may be temporally fixed, culled from real wave buoy data, or allowed to vary in some way defined by a Gaussian or other pdf. However, to examine sensitivity of coastline morphologies to wave climate change, it seems desirable to be able to modify wave climate time series from a current to some new state along a trajectory, but in a way consistent with, or initially conditioned by, the properties of existing data, or to generate fully synthetic data sets with realistic time series properties. For example, mean or significant wave height time series may have underlying periodicities, as revealed in numerous analyses of wave data. Our motivation is to develop a simple methodology to generate synthetic wave climate time series that can change in some stochastic way through time. We wish to use such time series in a coastline evolution model to test sensitivities of coastal landforms to changes in wave climate over decadal and centennial scales. We have worked initially on time series of significant wave height, based on data from a Waverider III buoy located off the coast of Yorkshire, England. The statistical framework for the simulation is the linear mixed model. The target variable, perhaps after transformation (Box-Cox), is modelled as a multivariate Gaussian, the mean modelled as a function of a fixed effect, and two random components, one of which is independently and identically distributed (iid) and the second of which is temporally correlated. The model was fitted to the data by likelihood methods. We considered the option of a periodic mean, the period either fixed (e.g. at 12 months) or estimated from the data. We considered two possible correlation structures for the second random effect. In one the correlation decays exponentially with time. In the second (spherical) model, it cuts off at a temporal range. Having fitted the model, multiple realisations were generated; the random effects were simulated by specifying a covariance matrix for the simulated values, with the estimated parameters. The Cholesky factorisation of the covariance matrix was computed and realizations of the random component of the model generated by pre-multiplying a vector of iid standard Gaussian variables by the lower triangular factor. The resulting random variate was added to the mean value computed from the fixed effects, and the result back-transformed to the original scale of the measurement. Realistic simulations result from approach described above. Background exploratory data analysis was undertaken on 20-day sets of 30-minute buoy data, selected from days 5-24 of months January, April, July, October, 2011, to elucidate daily to weekly variations, and to keep numerical analysis tractable computationally. Work remains to be undertaken to develop suitable models for synthetic directional data. We suggest that the general principles of the method will have applications in other geomorphological modelling endeavours requiring time series of stochastically variable environmental parameters.
Komócsi, András; Aradi, Dániel; Kehl, Dániel; Ungi, Imre; Thury, Attila; Pintér, Tünde; Di Nicolantonio, James J.; Tornyos, Adrienn
2014-01-01
Introduction Superior outcomes with transradial (TRPCI) versus transfemoral coronary intervention (TFPCI) in the setting of acute ST-segment elevation myocardial infarction (STEMI) have been suggested by earlier studies. However, this effect was not evident in randomized controlled trials (RCTs), suggesting a possible allocation bias in observational studies. Since important studies with heterogeneous results regarding mortality have been published recently, we aimed to perform an updated review and meta-analysis on the safety and efficacy of TRPCI compared to TFPCI in the setting of STEMI. Material and methods Electronic databases were searched for relevant studies from January 1993 to November 2012. Outcome parameters of RCTs were pooled with the DerSimonian-Laird random-effects model. Results Twelve RCTs involving 5,124 patients were identified. According to the pooled analysis, TRPCI was associated with a significant reduction in major bleeding (odds ratio (OR): 0.52 (95% confidence interval (CI) 0.38–0.71, p < 0.0001)). The risk of mortality and major adverse events was significantly lower after TRPCI (OR = 0.58 (95% CI: 0.43–0.79), p = 0.0005 and OR = 0.67 (95% CI: 0.52–0.86), p = 0.002 respectively). Conclusions Robust data from randomized clinical studies indicate that TRPCI reduces both ischemic and bleeding complications in STEMI. These findings support the preferential use of radial access for primary PCI. PMID:24904651
Komócsi, András; Aradi, Dániel; Kehl, Dániel; Ungi, Imre; Thury, Attila; Pintér, Tünde; Di Nicolantonio, James J; Tornyos, Adrienn; Vorobcsuk, András
2014-05-12
Superior outcomes with transradial (TRPCI) versus transfemoral coronary intervention (TFPCI) in the setting of acute ST-segment elevation myocardial infarction (STEMI) have been suggested by earlier studies. However, this effect was not evident in randomized controlled trials (RCTs), suggesting a possible allocation bias in observational studies. Since important studies with heterogeneous results regarding mortality have been published recently, we aimed to perform an updated review and meta-analysis on the safety and efficacy of TRPCI compared to TFPCI in the setting of STEMI. Electronic databases were searched for relevant studies from January 1993 to November 2012. Outcome parameters of RCTs were pooled with the DerSimonian-Laird random-effects model. Twelve RCTs involving 5,124 patients were identified. According to the pooled analysis, TRPCI was associated with a significant reduction in major bleeding (odds ratio (OR): 0.52 (95% confidence interval (CI) 0.38-0.71, p < 0.0001)). The risk of mortality and major adverse events was significantly lower after TRPCI (OR = 0.58 (95% CI: 0.43-0.79), p = 0.0005 and OR = 0.67 (95% CI: 0.52-0.86), p = 0.002 respectively). Robust data from randomized clinical studies indicate that TRPCI reduces both ischemic and bleeding complications in STEMI. These findings support the preferential use of radial access for primary PCI.
Dissecting random and systematic differences between noisy composite data sets.
Diederichs, Kay
2017-04-01
Composite data sets measured on different objects are usually affected by random errors, but may also be influenced by systematic (genuine) differences in the objects themselves, or the experimental conditions. If the individual measurements forming each data set are quantitative and approximately normally distributed, a correlation coefficient is often used to compare data sets. However, the relations between data sets are not obvious from the matrix of pairwise correlations since the numerical value of the correlation coefficient is lowered by both random and systematic differences between the data sets. This work presents a multidimensional scaling analysis of the pairwise correlation coefficients which places data sets into a unit sphere within low-dimensional space, at a position given by their CC* values [as defined by Karplus & Diederichs (2012), Science, 336, 1030-1033] in the radial direction and by their systematic differences in one or more angular directions. This dimensionality reduction can not only be used for classification purposes, but also to derive data-set relations on a continuous scale. Projecting the arrangement of data sets onto the subspace spanned by systematic differences (the surface of a unit sphere) allows, irrespective of the random-error levels, the identification of clusters of closely related data sets. The method gains power with increasing numbers of data sets. It is illustrated with an example from low signal-to-noise ratio image processing, and an application in macromolecular crystallography is shown, but the approach is completely general and thus should be widely applicable.
Randomized shortest-path problems: two related models.
Saerens, Marco; Achbany, Youssef; Fouss, François; Yen, Luh
2009-08-01
This letter addresses the problem of designing the transition probabilities of a finite Markov chain (the policy) in order to minimize the expected cost for reaching a destination node from a source node while maintaining a fixed level of entropy spread throughout the network (the exploration). It is motivated by the following scenario. Suppose you have to route agents through a network in some optimal way, for instance, by minimizing the total travel cost-nothing particular up to now-you could use a standard shortest-path algorithm. Suppose, however, that you want to avoid pure deterministic routing policies in order, for instance, to allow some continual exploration of the network, avoid congestion, or avoid complete predictability of your routing strategy. In other words, you want to introduce some randomness or unpredictability in the routing policy (i.e., the routing policy is randomized). This problem, which will be called the randomized shortest-path problem (RSP), is investigated in this work. The global level of randomness of the routing policy is quantified by the expected Shannon entropy spread throughout the network and is provided a priori by the designer. Then, necessary conditions to compute the optimal randomized policy-minimizing the expected routing cost-are derived. Iterating these necessary conditions, reminiscent of Bellman's value iteration equations, allows computing an optimal policy, that is, a set of transition probabilities in each node. Interestingly and surprisingly enough, this first model, while formulated in a totally different framework, is equivalent to Akamatsu's model ( 1996 ), appearing in transportation science, for a special choice of the entropy constraint. We therefore revisit Akamatsu's model by recasting it into a sum-over-paths statistical physics formalism allowing easy derivation of all the quantities of interest in an elegant, unified way. For instance, it is shown that the unique optimal policy can be obtained by solving a simple linear system of equations. This second model is therefore more convincing because of its computational efficiency and soundness. Finally, simulation results obtained on simple, illustrative examples show that the models behave as expected.
Soil property maps of Africa at 250 m resolution
NASA Astrophysics Data System (ADS)
Kempen, Bas; Hengl, Tomislav; Heuvelink, Gerard B. M.; Leenaars, Johan G. B.; Walsh, Markus G.; MacMillan, Robert A.; Mendes de Jesus, Jorge S.; Shepherd, Keith; Sila, Andrew; Desta, Lulseged T.; Tondoh, Jérôme E.
2015-04-01
Vast areas of arable land in sub-Saharan Africa suffer from low soil fertility and physical soil constraints, and significant amounts of nutrients are lost yearly due to unsustainable soil management practices. At the same time it is expected that agriculture in Africa must intensify to meet the growing demand for food and fiber the next decades. Protection and sustainable management of Africa's soil resources is crucial to achieve this. In this context, comprehensive, accurate and up-to-date soil information is an essential input to any agricultural or environmental management or policy and decision-making model. In Africa, detailed soil information has been fragmented and limited to specific zones of interest for decades. To help bridge the soil information gap in Africa, the Africa Soil Information Service (AfSIS) project was established in 2008. AfSIS builds on recent advances in digital soil mapping, infrared spectroscopy, remote sensing, (geo)statistics, and integrated soil fertility management to improve the way soils are evaluated, mapped, and monitored. Over the period 2008-2014, the AfSIS project has compiled two soil profile data sets (about 28,000 unique locations): the Africa Soil Profiles (legacy) database and the AfSIS Sentinel Site (new soil samples) database -- the two data sets represent the most comprehensive soil sample database of the African continent to date. In addition a large set of high-resolution environmental data layers (covariates) was assembled. The point data were used in the AfSIS project to generate a set of maps of key soil properties for the African continent at 250 m spatial resolution: sand, silt and clay fractions, bulk density, organic carbon, total nitrogen, pH, cation-exchange capacity, exchangeable bases (Ca, K, Mg, Na), exchangeable acidity, and Al content. These properties were mapped for six depth intervals up to 2 m: 0-5 cm, 5-15 cm, 15-30 cm, 30-60 cm, 60-100 cm, and 100-200 cm. Random forests modelling was used to relate the soil profile observations to a set covariates, that included global soil class and property maps, MODIS imagery and a DEM, in a 3D mapping framework. The model residuals were interpolated by 3D kriging, after which the kriging predictions were added to the random forests predictions to obtain the soil property predictions. The model predictions were validated with 5-fold cross-validation. The random forests models explained between 37% (exch. Na) and 85% (Al content) of the variation in the data. Results also show that globally predicted soil classes help improve continental scale mapping of the soil nutrients and are often among the most important predictors. We conclude that the first mapping results look promising. We used an automated modelling framework that enables re-computing the maps as new data becomes arrives, hereby gradually improving the maps. We showed that global maps of soil classes and properties produced with models that were predominantly calibrated on areas with plentiful observations can be used to improve the accuracy of predictions in regions with less plentiful data, such as Africa.
On the Estimation of Errors in Sparse Bathymetric Geophysical Data Sets
NASA Astrophysics Data System (ADS)
Jakobsson, M.; Calder, B.; Mayer, L.; Armstrong, A.
2001-05-01
There is a growing demand in the geophysical community for better regional representations of the world ocean's bathymetry. However, given the vastness of the oceans and the relative limited coverage of even the most modern mapping systems, it is likely that many of the older data sets will remain part of our cumulative database for several more decades. Therefore, regional bathymetrical compilations that are based on a mixture of historic and contemporary data sets will have to remain the standard. This raises the problem of assembling bathymetric compilations and utilizing data sets not only with a heterogeneous cover but also with a wide range of accuracies. In combining these data to regularly spaced grids of bathymetric values, which the majority of numerical procedures in earth sciences require, we are often forced to use a complex interpolation scheme due to the sparseness and irregularity of the input data points. Consequently, we are faced with the difficult task of assessing the confidence that we can assign to the final grid product, a task that is not usually addressed in most bathymetric compilations. We approach the problem of assessing the confidence via a direct-simulation Monte Carlo method. We start with a small subset of data from the International Bathymetric Chart of the Arctic Ocean (IBCAO) grid model [Jakobsson et al., 2000]. This grid is compiled from a mixture of data sources ranging from single beam soundings with available metadata to spot soundings with no available metadata, to digitized contours; the test dataset shows examples of all of these types. From this database, we assign a priori error variances based on available meta-data, and when this is not available, based on a worst-case scenario in an essentially heuristic manner. We then generate a number of synthetic datasets by randomly perturbing the base data using normally distributed random variates, scaled according to the predicted error model. These datasets are then re-gridded using the same methodology as the original product, generating a set of plausible grid models of the regional bathymetry that we can use for standard error estimates. Finally, we repeat the entire random estimation process and analyze each run's standard error grids in order to examine sampling bias and variance in the predictions. The final products of the estimation are a collection of standard error grids, which we combine with the source data density in order to create a grid that contains information about the bathymetry model's reliability. Jakobsson, M., Cherkis, N., Woodward, J., Coakley, B., and Macnab, R., 2000, A new grid of Arctic bathymetry: A significant resource for scientists and mapmakers, EOS Transactions, American Geophysical Union, v. 81, no. 9, p. 89, 93, 96.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bishop, Joseph E.; Emery, John M.; Battaile, Corbett C.
Two fundamental approximations in macroscale solid-mechanics modeling are (1) the assumption of scale separation in homogenization theory and (2) the use of a macroscopic plasticity material model that represents, in a mean sense, the multitude of inelastic processes occurring at the microscale. With the goal of quantifying the errors induced by these approximations on engineering quantities of interest, we perform a set of direct numerical simulations (DNS) in which polycrystalline microstructures are embedded throughout a macroscale structure. The largest simulations model over 50,000 grains. The microstructure is idealized using a randomly close-packed Voronoi tessellation in which each polyhedral Voronoi cellmore » represents a grain. An face centered cubic crystal-plasticity model is used to model the mechanical response of each grain. The overall grain structure is equiaxed, and each grain is randomly oriented with no overall texture. The detailed results from the DNS simulations are compared to results obtained from conventional macroscale simulations that use homogeneous isotropic plasticity models. The macroscale plasticity models are calibrated using a representative volume element of the idealized microstructure. Furthermore, we envision that DNS modeling will be used to gain new insights into the mechanics of material deformation and failure.« less
Lombardi, A M
2017-09-18
Stochastic models provide quantitative evaluations about the occurrence of earthquakes. A basic component of this type of models are the uncertainties in defining main features of an intrinsically random process. Even if, at a very basic level, any attempting to distinguish between types of uncertainty is questionable, an usual way to deal with this topic is to separate epistemic uncertainty, due to lack of knowledge, from aleatory variability, due to randomness. In the present study this problem is addressed in the narrow context of short-term modeling of earthquakes and, specifically, of ETAS modeling. By mean of an application of a specific version of the ETAS model to seismicity of Central Italy, recently struck by a sequence with a main event of Mw6.5, the aleatory and epistemic (parametric) uncertainty are separated and quantified. The main result of the paper is that the parametric uncertainty of the ETAS-type model, adopted here, is much lower than the aleatory variability in the process. This result points out two main aspects: an analyst has good chances to set the ETAS-type models, but he may retrospectively describe and forecast the earthquake occurrences with still limited precision and accuracy.
Use of Peritoneal Dialysis in AKI: A Systematic Review
Chionh, Chang Yin; Soni, Sachin S.; Finkelstein, Fredric O.; Ronco, Claudio
2013-01-01
Summary Background and objectives The role of peritoneal dialysis in the management of AKI is not well defined, although it remains frequently used, especially in low-resource settings. A systematic review was performed to describe outcomes in AKI treated with peritoneal dialysis and compare peritoneal dialysis with extracorporeal blood purification, such as continuous or intermittent hemodialysis. Design, setting, participants, & measurements MEDLINE, CINAHL, and Central Register of Controlled Trials were searched in July of 2012. Eligible studies selected were observational cohort or randomized adult population studies on peritoneal dialysis in the setting of AKI. The primary outcome of interest was all-cause mortality. Summary estimates of odds ratio were obtained using a random effects model. Results Of 982 citations, 24 studies (n=1556 patients) were identified. The overall methodological quality was low. Thirteen studies described patients (n=597) treated with peritoneal dialysis only; pooled mortality was 39.3%. In 11 studies (7 cohort studies and 4 randomized trials), patients received peritoneal dialysis (n=392, pooled mortality=58.0%) or extracorporeal blood purification (n=567, pooled mortality=56.1%). In the cohort studies, there was no difference in mortality between peritoneal dialysis and extracorporeal blood purification (odds ratio, 0.96; 95% confidence interval, 0.53 to 1.71). In four randomized trials, there was also no difference in mortality (odds ratio, 1.50; 95% confidence interval, 0.46 to 4.86); however, heterogeneity was significant (I2=73%, P=0.03). Conclusions There is currently no evidence to suggest significant differences in mortality between peritoneal dialysis and extracorporeal blood purification in AKI. There is a need for good-quality evidence in this important area. PMID:23833316
A framework for understanding cancer comparative effectiveness research data needs
Carpenter, William R; Meyer, Anne-Marie; Abernethy, Amy P.; Stürmer, Til; Kosorok, Michael R.
2012-01-01
Objective Randomized controlled trials remain the gold standard for evaluating cancer intervention efficacy. Randomized trials are not always feasible, practical, or timely, and often don’t adequately reflect patient heterogeneity and real-world clinical practice. Comparative effectiveness research can leverage secondary data to help fill knowledge gaps randomized trials leave unaddressed; however, comparative effectiveness research also faces shortcomings. The goal of this project was to develop a new model and inform an evolving framework articulating cancer comparative effectiveness research data needs. Study Design and Setting We examined prevalent models and conducted semi-structured discussions with 76 clinicians and comparative effectiveness research researchers affiliated with the Agency for Healthcare Research and Quality’s cancer comparative effectiveness research programs. Results A new model was iteratively developed, and presents cancer comparative effectiveness research and important measures in a patient-centered, longitudinal chronic care model better-reflecting contemporary cancer care in the context of the cancer care continuum, rather than a single-episode, acute-care perspective. Conclusion Immediately relevant for federally-funded comparative effectiveness research programs, the model informs an evolving framework articulating cancer comparative effectiveness research data needs, including evolutionary enhancements to registries and epidemiologic research data systems. We discuss elements of contemporary clinical practice, methodology improvements, and related needs affecting comparative effectiveness research’s ability to yield findings clinicians, policymakers, and stakeholders can confidently act on. PMID:23017633
DOE Office of Scientific and Technical Information (OSTI.GOV)
von Lilienfeld, O. Anatole; Ramakrishnan, Raghunathan; Rupp, Matthias
We introduce a fingerprint representation of molecules based on a Fourier series of atomic radial distribution functions. This fingerprint is unique (except for chirality), continuous, and differentiable with respect to atomic coordinates and nuclear charges. It is invariant with respect to translation, rotation, and nuclear permutation, and requires no preconceived knowledge about chemical bonding, topology, or electronic orbitals. As such, it meets many important criteria for a good molecular representation, suggesting its usefulness for machine learning models of molecular properties trained across chemical compound space. To assess the performance of this new descriptor, we have trained machine learning models ofmore » molecular enthalpies of atomization for training sets with up to 10 k organic molecules, drawn at random from a published set of 134 k organic molecules with an average atomization enthalpy of over 1770 kcal/mol. We validate the descriptor on all remaining molecules of the 134 k set. For a training set of 10 k molecules, the fingerprint descriptor achieves a mean absolute error of 8.0 kcal/mol. This is slightly worse than the performance attained using the Coulomb matrix, another popular alternative, reaching 6.2 kcal/mol for the same training and test sets. (c) 2015 Wiley Periodicals, Inc.« less
Beksinska, Mags E; Smit, Jenni; Greener, Ross; Todd, Catherine S; Lee, Mei-ling Ting; Maphumulo, Virginia; Hoffmann, Vivian
2015-02-01
In low-income settings, many women and girls face activity restrictions during menses, owing to lack of affordable menstrual products. The menstrual cup (MC) is a nonabsorbent reusable cup that collects menstrual blood. We assessed the acceptability and performance of the MPower® MC compared to pads or tampons among women in a low-resource setting. We conducted a randomized two-period crossover trial at one site in Durban, South Africa, between January and November 2013. Participants aged 18-45 years with regular menstrual cycles were eligible for inclusion if they had no intention of becoming pregnant, were using an effective contraceptive method, had water from the municipal system as their primary water source, and had no sexually transmitted infections. We used a computer-generated randomization sequence to assign participants to one of two sequences of menstrual product use, with allocation concealed only from the study investigators. Participants used each method over three menstrual cycles (total 6 months) and were interviewed at baseline and monthly follow-up visits. The product acceptability outcome compared product satisfaction question scores using an ordinal logistic regression model with individual random effects. This study is registered on the South African Clinical Trials database: number DOH-27-01134273. Of 124 women assessed, 110 were eligible and randomly assigned to selected menstrual products. One hundred and five women completed all follow-up visits. By comparison to pads/tampons (usual product used), the MC was rated significantly better for comfort, quality, menstrual blood collection, appearance, and preference. Both of these comparative outcome measures, along with likelihood of continued use, recommending the product, and future purchase, increased for the MC over time. MC acceptance in a population of novice users, many with limited experience with tampons, indicates that there is a pool of potential users in low-resource settings.
Bittel, Daniel C; Bittel, Adam J; Williams, Christine; Elazzazi, Ashraf
2017-05-01
Proper exercise form is critical for the safety and efficacy of therapeutic exercise. This research examines if a novel smartphone application, designed to monitor and provide real-time corrections during resistance training, can reduce performance errors and elicit a motor learning response. Forty-two participants aged 18 to 65 years were randomly assigned to treatment and control groups. Both groups were tested for the number of movement errors made during a 10-repetition set completed at baseline, immediately after, and 1 to 2 weeks after a single training session of knee extensions. The treatment group trained with real-time, smartphone-generated feedback, whereas the control subjects did not. Group performance (number of errors) was compared across test sets using a 2-factor mixed-model analysis of variance. No differences were observed between groups for age, sex, or resistance training experience. There was a significant interaction between test set and group. The treatment group demonstrated fewer errors on posttests 1 and 2 compared with pretest (P < 0.05). There was no reduction in the number of errors on any posttest for control subjects. Smartphone apps, such as the one used in this study, may enhance patient supervision, safety, and exercise efficacy across rehabilitation settings. A single training session with the app promoted motor learning and improved exercise performance.
Efficient Ab initio Modeling of Random Multicomponent Alloys
Jiang, Chao; Uberuaga, Blas P.
2016-03-08
Here, we present in this Letter a novel small set of ordered structures (SSOS) method that allows extremely efficient ab initio modeling of random multi-component alloys. Using inverse II-III spinel oxides and equiatomic quinary bcc (so-called high entropy) alloys as examples, we also demonstrate that a SSOS can achieve the same accuracy as a large supercell or a well-converged cluster expansion, but with significantly reduced computational cost. In particular, because of this efficiency, a large number of quinary alloy compositions can be quickly screened, leading to the identification of several new possible high entropy alloy chemistries. Furthermore, the SSOS methodmore » developed here can be broadly useful for the rapid computational design of multi-component materials, especially those with a large number of alloying elements, a challenging problem for other approaches.« less
Investigation of a protein complex network
NASA Astrophysics Data System (ADS)
Mashaghi, A. R.; Ramezanpour, A.; Karimipour, V.
2004-09-01
The budding yeast Saccharomyces cerevisiae is the first eukaryote whose genome has been completely sequenced. It is also the first eukaryotic cell whose proteome (the set of all proteins) and interactome (the network of all mutual interactions between proteins) has been analyzed. In this paper we study the structure of the yeast protein complex network in which weighted edges between complexes represent the number of shared proteins. It is found that the network of protein complexes is a small world network with scale free behavior for many of its distributions. However we find that there are no strong correlations between the weights and degrees of neighboring complexes. To reveal non-random features of the network we also compare it with a null model in which the complexes randomly select their proteins. Finally we propose a simple evolutionary model based on duplication and divergence of proteins.
NASA Astrophysics Data System (ADS)
Livan, Giacomo; Alfarano, Simone; Scalas, Enrico
2011-07-01
We study some properties of eigenvalue spectra of financial correlation matrices. In particular, we investigate the nature of the large eigenvalue bulks which are observed empirically, and which have often been regarded as a consequence of the supposedly large amount of noise contained in financial data. We challenge this common knowledge by acting on the empirical correlation matrices of two data sets with a filtering procedure which highlights some of the cluster structure they contain, and we analyze the consequences of such filtering on eigenvalue spectra. We show that empirically observed eigenvalue bulks emerge as superpositions of smaller structures, which in turn emerge as a consequence of cross correlations between stocks. We interpret and corroborate these findings in terms of factor models, and we compare empirical spectra to those predicted by random matrix theory for such models.
An Optimization-based Framework to Learn Conditional Random Fields for Multi-label Classification
Naeini, Mahdi Pakdaman; Batal, Iyad; Liu, Zitao; Hong, CharmGil; Hauskrecht, Milos
2015-01-01
This paper studies multi-label classification problem in which data instances are associated with multiple, possibly high-dimensional, label vectors. This problem is especially challenging when labels are dependent and one cannot decompose the problem into a set of independent classification problems. To address the problem and properly represent label dependencies we propose and study a pairwise conditional random Field (CRF) model. We develop a new approach for learning the structure and parameters of the CRF from data. The approach maximizes the pseudo likelihood of observed labels and relies on the fast proximal gradient descend for learning the structure and limited memory BFGS for learning the parameters of the model. Empirical results on several datasets show that our approach outperforms several multi-label classification baselines, including recently published state-of-the-art methods. PMID:25927015
Akkoç, Betül; Arslan, Ahmet; Kök, Hatice
2016-06-01
Gender is one of the intrinsic properties of identity, with performance enhancement reducing the cluster when a search is performed. Teeth have durable and resistant structure, and as such are important sources of identification in disasters (accident, fire, etc.). In this study, gender determination is accomplished by maxillary tooth plaster models of 40 people (20 males and 20 females). The images of tooth plaster models are taken with a lighting mechanism set-up. A gray level co-occurrence matrix of the image with segmentation is formed and classified via a Random Forest (RF) algorithm by extracting pertinent features of the matrix. Automatic gender determination has a 90% success rate, with an applicable system to determine gender from maxillary tooth plaster images. Copyright © 2016 Elsevier Ltd. All rights reserved.
Cushing, Christopher C; Walters, Ryan W; Hoffman, Lesa
2014-03-01
Aggregated N-of-1 randomized controlled trials (RCTs) combined with multilevel modeling represent a methodological advancement that may help bridge science and practice in pediatric psychology. The purpose of this article is to offer a primer for pediatric psychologists interested in conducting aggregated N-of-1 RCTs. An overview of N-of-1 RCT methodology is provided and 2 simulated data sets are analyzed to demonstrate the clinical and research potential of the methodology. The simulated data example demonstrates the utility of aggregated N-of-1 RCTs for understanding the clinical impact of an intervention for a given individual and the modeling of covariates to explain why an intervention worked for one patient and not another. Aggregated N-of-1 RCTs hold potential for improving the science and practice of pediatric psychology.
Economic lot sizing in a production system with random demand
NASA Astrophysics Data System (ADS)
Lee, Shine-Der; Yang, Chin-Ming; Lan, Shu-Chuan
2016-04-01
An extended economic production quantity model that copes with random demand is developed in this paper. A unique feature of the proposed study is the consideration of transient shortage during the production stage, which has not been explicitly analysed in existing literature. The considered costs include set-up cost for the batch production, inventory carrying cost during the production and depletion stages in one replenishment cycle, and shortage cost when demand cannot be satisfied from the shop floor immediately. Based on renewal reward process, a per-unit-time expected cost model is developed and analysed. Under some mild condition, it can be shown that the approximate cost function is convex. Computational experiments have demonstrated that the average reduction in total cost is significant when the proposed lot sizing policy is compared with those with deterministic demand.
A geometric theory for Lévy distributions
NASA Astrophysics Data System (ADS)
Eliazar, Iddo
2014-08-01
Lévy distributions are of prime importance in the physical sciences, and their universal emergence is commonly explained by the Generalized Central Limit Theorem (CLT). However, the Generalized CLT is a geometry-less probabilistic result, whereas physical processes usually take place in an embedding space whose spatial geometry is often of substantial significance. In this paper we introduce a model of random effects in random environments which, on the one hand, retains the underlying probabilistic structure of the Generalized CLT and, on the other hand, adds a general and versatile underlying geometric structure. Based on this model we obtain geometry-based counterparts of the Generalized CLT, thus establishing a geometric theory for Lévy distributions. The theory explains the universal emergence of Lévy distributions in physical settings which are well beyond the realm of the Generalized CLT.
A geometric theory for Lévy distributions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Eliazar, Iddo, E-mail: eliazar@post.tau.ac.il
2014-08-15
Lévy distributions are of prime importance in the physical sciences, and their universal emergence is commonly explained by the Generalized Central Limit Theorem (CLT). However, the Generalized CLT is a geometry-less probabilistic result, whereas physical processes usually take place in an embedding space whose spatial geometry is often of substantial significance. In this paper we introduce a model of random effects in random environments which, on the one hand, retains the underlying probabilistic structure of the Generalized CLT and, on the other hand, adds a general and versatile underlying geometric structure. Based on this model we obtain geometry-based counterparts ofmore » the Generalized CLT, thus establishing a geometric theory for Lévy distributions. The theory explains the universal emergence of Lévy distributions in physical settings which are well beyond the realm of the Generalized CLT.« less
Toxicity challenges in environmental chemicals: Prediction of ...
Physiologically based pharmacokinetic (PBPK) models bridge the gap between in vitro assays and in vivo effects by accounting for the adsorption, distribution, metabolism, and excretion of xenobiotics, which is especially useful in the assessment of human toxicity. Quantitative structure-activity relationships (QSAR) serve as a vital tool for the high-throughput prediction of chemical-specific PBPK parameters, such as the fraction of a chemical unbound by plasma protein (Fub). The presented work explores the merit of utilizing experimental pharmaceutical Fub data for the construction of a universal QSAR model, in order to compensate for the limited range of high-quality experimental Fub data for environmentally relevant chemicals, such as pollutants, pesticides, and consumer products. Independent QSAR models were constructed with three machine-learning algorithms, k nearest neighbors (kNN), random forest (RF), and support vector machine (SVM) regression, from a large pharmaceutical training set (~1000) and assessed with independent test sets of pharmaceuticals (~200) and environmentally relevant chemicals in the ToxCast program (~400). Small descriptor sets yielded the optimal balance of model complexity and performance, providing insight into the biochemical factors of plasma protein binding, while preventing over fitting to the training set. Overlaps in chemical space between pharmaceutical and environmental compounds were considered through applicability of do
Informing the Human Plasma Protein Binding of ...
The free fraction of a xenobiotic in plasma (Fub) is an important determinant of chemical adsorption, distribution, metabolism, elimination, and toxicity, yet experimental plasma protein binding data is scarce for environmentally relevant chemicals. The presented work explores the merit of utilizing available pharmaceutical data to predict Fub for environmentally relevant chemicals via machine learning techniques. Quantitative structure-activity relationship (QSAR) models were constructed with k nearest neighbors (kNN), support vector machines (SVM), and random forest (RF) machine learning algorithms from a training set of 1045 pharmaceuticals. The models were then evaluated with independent test sets of pharmaceuticals (200 compounds) and environmentally relevant ToxCast chemicals (406 total, in two groups of 238 and 168 compounds). The selection of a minimal feature set of 10-15 2D molecular descriptors allowed for both informative feature interpretation and practical applicability domain assessment via a bounded box of descriptor ranges and principal component analysis. The diverse pharmaceutical and environmental chemical sets exhibit similarities in terms of chemical space (99-82% overlap), as well as comparable bias and variance in constructed learning curves. All the models exhibit significant predictability with mean absolute errors (MAE) in the range of 0.10-0.18 Fub. The models performed best for highly bound chemicals (MAE 0.07-0.12), neutrals (MAE 0