de Paula, Lauro C. M.; Soares, Anderson S.; de Lima, Telma W.; Delbem, Alexandre C. B.; Coelho, Clarimar J.; Filho, Arlindo R. G.
2014-01-01
Several variable selection algorithms in multivariate calibration can be accelerated using Graphics Processing Units (GPU). Among these algorithms, the Firefly Algorithm (FA) is a recently proposed metaheuristic that may be used for variable selection. This paper presents a GPU-based FA (FA-MLR) with a multiobjective formulation for variable selection in multivariate calibration problems and compares it with some traditional sequential algorithms from the literature. The advantage of the proposed implementation is demonstrated in an example involving a relatively large number of variables. The results showed that the FA-MLR, in comparison with the traditional algorithms, is a more suitable choice and a relevant contribution to the variable selection problem. Additionally, the results also demonstrated that the FA-MLR executed on a GPU can be five times faster than its sequential implementation. PMID:25493625
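The abstract describes the GPU-accelerated firefly search only at a high level, so a compact illustration may help. Below is a minimal, CPU-only sketch of a binary firefly algorithm for variable selection with multiple linear regression, assuming a calibration/validation split and using RMSEP as a single fitness term; the paper's multiobjective GPU formulation is not reproduced, and all names and parameter values are illustrative.

```python
import numpy as np

def rmsep(X_cal, y_cal, X_val, y_val, mask):
    """Fitness of a variable subset: RMSEP of an MLR model built on the selected columns."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return np.inf
    A = np.column_stack([np.ones(len(X_cal)), X_cal[:, idx]])
    coef, *_ = np.linalg.lstsq(A, y_cal, rcond=None)
    pred = np.column_stack([np.ones(len(X_val)), X_val[:, idx]]) @ coef
    return float(np.sqrt(np.mean((y_val - pred) ** 2)))

def binary_firefly_select(X_cal, y_cal, X_val, y_val, n_fireflies=20, n_iter=50,
                          beta0=1.0, gamma=1.0, alpha=0.05, seed=None):
    """Binary firefly algorithm: each firefly is a 0/1 mask over the spectral variables."""
    rng = np.random.default_rng(seed)
    n_vars = X_cal.shape[1]
    pop = rng.random((n_fireflies, n_vars)) < 0.5
    fit = np.array([rmsep(X_cal, y_cal, X_val, y_val, m) for m in pop])
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if fit[j] < fit[i]:                          # firefly j is brighter
                    dist = np.sum(pop[i] ^ pop[j])           # Hamming distance between masks
                    beta = beta0 * np.exp(-gamma * dist / n_vars)
                    # move i towards j: copy each differing bit with probability beta,
                    # then apply random bit flips with rate alpha
                    copy = (pop[i] != pop[j]) & (rng.random(n_vars) < beta)
                    flip = rng.random(n_vars) < alpha
                    pop[i] = np.where(copy, pop[j], pop[i]) ^ flip
                    fit[i] = rmsep(X_cal, y_cal, X_val, y_val, pop[i])
        alpha *= 0.97                                        # cool the randomization
    best = int(np.argmin(fit))
    return np.flatnonzero(pop[best]), float(fit[best])
```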
Efficient Variable Selection Method for Exposure Variables on Binary Data
NASA Astrophysics Data System (ADS)
Ohno, Manabu; Tarumi, Tomoyuki
In this paper, we propose a new variable selection method for "robust" exposure variables. We define "robust" as the property that the same variables are selected from both the original data and perturbed data. Few studies have addressed effective methods for this kind of selection. The problem of selecting exposure variables is nearly the same as that of extracting correlation rules without robustness. [Brin 97] suggested that correlation rules can be extracted efficiently on binary data using the chi-squared statistic of a contingency table, which has a monotone property. However, the chi-squared value itself is not monotone, so as the dimension increases the method tends to judge a variable set as dependent even when it is completely independent, which makes it unsuitable for selecting robust exposure variables. To select robust independent variables, we assume an anti-monotone property for independence and apply the apriori algorithm. The apriori algorithm is one of the algorithms that find association rules in market-basket data; it exploits the anti-monotone property of the support defined for association rules. Independence does not strictly satisfy the anti-monotone property with respect to the AIC of the independence probability model, but the tendency towards anti-monotonicity is strong. Therefore, variables selected under the assumption of AIC anti-monotonicity are robust. Our method judges whether a given variable is an exposure variable for an independent variable by means of pairwise comparisons of the AIC. Our numerical experiments show that the method selects robust exposure variables efficiently and precisely.
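The AIC-based anti-monotone pruning is the part most easily shown in code. The sketch below is a toy illustration, not the authors' exact procedure: it treats a set of binary variables as independent when the independence log-linear model has a lower AIC than the saturated model of their contingency table, and it generates candidate k-sets apriori-style only from (k−1)-sets already judged independent.

```python
import numpy as np
from itertools import combinations

def aic_independent(data, cols):
    """True if the independence model of the binary columns `cols` has a lower
    AIC than the saturated model of their 2^k contingency table."""
    sub = data[:, list(cols)].astype(int)
    n, k = sub.shape
    codes = sub @ (1 << np.arange(k))                     # cell index of each record
    counts = np.bincount(codes, minlength=2 ** k).astype(float)
    p_cells = counts / n                                  # saturated-model cell probabilities
    marg = sub.mean(axis=0)                               # marginal P(X_j = 1)
    bits = (np.arange(2 ** k)[:, None] >> np.arange(k)) & 1
    p_indep = np.prod(np.where(bits == 1, marg, 1 - marg), axis=1)
    ll_sat = np.sum(counts[counts > 0] * np.log(p_cells[counts > 0]))
    ll_ind = np.sum(counts * np.log(np.clip(p_indep, 1e-12, None)))
    aic_sat = -2 * ll_sat + 2 * (2 ** k - 1)
    aic_ind = -2 * ll_ind + 2 * k
    return aic_ind <= aic_sat

def apriori_independent_sets(data, max_size=3):
    """Level-wise (apriori-style) enumeration of variable sets judged independent by AIC."""
    n_vars = data.shape[1]
    level = [frozenset([j]) for j in range(n_vars)]       # singletons seed the search
    independent = []
    for size in range(2, max_size + 1):
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        kept_prev = set(level)
        # apriori pruning: keep only candidates whose (size-1)-subsets were all kept
        candidates = [c for c in candidates
                      if all(frozenset(s) in kept_prev for s in combinations(c, size - 1))]
        level = [c for c in candidates if aic_independent(data, sorted(c))]
        independent.extend(level)
        if not level:
            break
    return independent
```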
Diversified models for portfolio selection based on uncertain semivariance
NASA Astrophysics Data System (ADS)
Chen, Lin; Peng, Jin; Zhang, Bo; Rosyida, Isnaini
2017-02-01
Since the financial markets are complex, the future security returns are sometimes represented mainly by experts' estimations due to a lack of historical data. This paper proposes a semivariance method for diversified portfolio selection, in which the security returns are given subject to experts' estimations and depicted as uncertain variables. In the paper, three properties of the semivariance of uncertain variables are verified. Based on the concept of semivariance of uncertain variables, two types of mean-semivariance diversified models for uncertain portfolio selection are proposed. Since the models are complex, a hybrid intelligent algorithm based on the 99-method and a genetic algorithm is designed to solve them. In this hybrid intelligent algorithm, the 99-method is applied to compute the expected value and semivariance of uncertain variables, and the genetic algorithm is employed to seek the best allocation plan for portfolio selection. Finally, several numerical examples are presented to illustrate the modelling idea and the effectiveness of the algorithm.
Variable screening via quantile partial correlation
Ma, Shujie; Tsai, Chih-Ling
2016-01-01
In quantile linear regression with ultra-high dimensional data, we propose an algorithm for screening all candidate variables and subsequently selecting relevant predictors. Specifically, we first employ quantile partial correlation for screening, and then we apply the extended Bayesian information criterion (EBIC) for best subset selection. Our proposed method can successfully select predictors when the variables are highly correlated, and it can also identify variables that make a contribution to the conditional quantiles but are marginally uncorrelated or only weakly correlated with the response. Theoretical results show that the proposed algorithm can yield the sure screening set. By controlling the false selection rate, model selection consistency can be achieved theoretically. In practice, we propose using EBIC for best subset selection so that the resulting model is screening consistent. Simulation studies demonstrate that the proposed algorithm performs well, and an empirical example is presented. PMID:28943683
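The first-stage screening can be sketched compactly. The snippet below is a simplified illustration rather than the authors' procedure: it ranks predictors by marginal quantile correlation using the quantile score ψ_τ(u) = τ − 1{u < 0}; the paper's quantile partial correlation additionally adjusts for the remaining covariates, and the EBIC subset-selection stage is omitted.

```python
import numpy as np

def quantile_correlation(y, x, tau=0.5):
    """Sample quantile correlation based on the score psi_tau(u) = tau - 1{u < 0}."""
    q = np.quantile(y, tau)
    psi = tau - (y < q).astype(float)
    num = np.mean(psi * (x - x.mean()))
    den = np.sqrt((tau - tau ** 2) * np.var(x))
    return num / den

def quantile_screen(X, y, tau=0.5, keep=50):
    """Rank predictors by |quantile correlation| with the response and keep the top ones."""
    scores = np.abs([quantile_correlation(y, X[:, j], tau) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:keep], scores
```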
NASA Astrophysics Data System (ADS)
Wang, Lijuan; Yan, Yong; Wang, Xue; Wang, Tao
2017-03-01
Input variable selection is an essential step in the development of data-driven models for environmental, biological and industrial applications. Through input variable selection, the irrelevant or redundant variables are eliminated and a suitable subset of variables is identified as the input of a model; at the same time, the complexity of the model structure is simplified and the computational efficiency is improved. This paper describes the procedures of input variable selection for data-driven models for the measurement of liquid mass flowrate and gas volume fraction under two-phase flow conditions using Coriolis flowmeters. Three advanced input variable selection methods, including partial mutual information (PMI), genetic algorithm-artificial neural network (GA-ANN) and tree-based iterative input selection (IIS), are applied in this study. Typical data-driven models incorporating support vector machine (SVM) are established individually based on the input candidates resulting from the selection methods. The validity of the selection outcomes is assessed through an output performance comparison of the SVM based data-driven models and sensitivity analysis. The validation and analysis results suggest that the input variables selected by the PMI algorithm provide more effective information for the models to measure liquid mass flowrate, while the IIS algorithm provides fewer but more effective variables for the models to predict gas volume fraction.
Tian, Xin; Xin, Mingyuan; Luo, Jian; Liu, Mingyao; Jiang, Zhenran
2017-02-01
The selection of relevant genes for breast cancer metastasis is critical for the treatment and prognosis of cancer patients. Although much effort has been devoted to gene selection procedures using different statistical analysis methods or computational techniques, the interpretation of the variables in the resulting survival models has been limited so far. This article proposes a new Random Forest (RF)-based algorithm to identify important variables highly related to breast cancer metastasis, based on the importance scores of two variable selection algorithms: the mean decrease Gini (MDG) criterion of Random Forest and the GeneRank algorithm with protein-protein interaction (PPI) information. The new gene selection algorithm is called PPIRF. The improved prediction accuracy fully illustrates the reliability and high interpretability of the gene list selected by the PPIRF approach.
Variables selection methods in near-infrared spectroscopy.
Xiaobo, Zou; Jiewen, Zhao; Povey, Malcolm J W; Holmes, Mel; Hanpin, Mao
2010-05-14
Near-infrared (NIR) spectroscopy has increasingly been adopted as an analytical tool in various fields, such as the petrochemical, pharmaceutical, environmental, clinical, agricultural, food and biomedical sectors, during the past 15 years. The NIR spectrum of a sample is typically measured by modern scanning instruments at hundreds of equally spaced wavelengths. The large number of spectral variables in most data sets encountered in NIR spectral chemometrics often renders the prediction of a dependent variable unreliable. Recently, considerable effort has been directed towards developing and evaluating different procedures that objectively identify variables which contribute useful information and/or eliminate variables containing mostly noise. This review focuses on variable selection methods in NIR spectroscopy. The selection methods include some classical approaches, such as the manual approach (knowledge-based selection) and "univariate" and "sequential" selection methods; sophisticated methods such as the successive projections algorithm (SPA) and uninformative variable elimination (UVE); elaborate search-based strategies such as simulated annealing (SA), artificial neural networks (ANN) and genetic algorithms (GAs); and interval-based algorithms such as interval partial least squares (iPLS), window PLS and iterative PLS. Wavelength selection with B-splines, Kalman filtering, Fisher's weights and Bayesian approaches is also mentioned. Finally, the websites of some variable selection software and toolboxes for non-commercial use are given. Copyright 2010 Elsevier B.V. All rights reserved.
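Of the interval-based approaches listed, interval PLS (iPLS) is simple enough to sketch. The snippet below, written with scikit-learn under the assumption of a matrix of spectra X and a reference property y, keeps the single equal-width spectral interval whose local PLS model gives the lowest cross-validated RMSE; combining or growing intervals is left out.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def ipls_select(X, y, n_intervals=20, max_components=5, cv=5):
    """Basic interval PLS: split the wavelength axis into equal-width intervals and
    return the interval whose local PLS model has the lowest cross-validated RMSE."""
    edges = np.linspace(0, X.shape[1], n_intervals + 1, dtype=int)
    best = None
    for lo, hi in zip(edges[:-1], edges[1:]):
        n_comp = max(1, min(max_components, hi - lo, X.shape[0] - 1))
        pls = PLSRegression(n_components=n_comp)
        pred = cross_val_predict(pls, X[:, lo:hi], y, cv=cv).ravel()
        rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
        if best is None or rmse < best[0]:
            best = (rmse, lo, hi)
    return best   # (cv_rmse, start_index, stop_index) of the selected interval
```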
A non-linear data mining parameter selection algorithm for continuous variables
Razavi, Marianne; Brady, Sean
2017-01-01
In this article, we propose a new data mining algorithm with which one can both capture the non-linearity in data and find the best subset model. To produce an enhanced subset of the original variables, a preferred selection method should add a supplementary level of regression analysis that captures complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method presented here has the potential to produce an optimal subset of variables, rendering the overall process of model selection more efficient. The algorithm introduces interpretable parameters by transforming the original inputs and also provides a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least squares regression framework. This new automatic variable transformation and model selection method can offer an optimal and stable model that minimizes the mean square error and variability, while combining all-possible-subset selection with the inclusion of variable transformations and interactions. Moreover, the method controls multicollinearity, leading to an optimal set of explanatory variables. PMID:29131829
Do bioclimate variables improve performance of climate envelope models?
Watling, James I.; Romañach, Stephanie S.; Bucklin, David N.; Speroterra, Carolina; Brandt, Laura A.; Pearlstine, Leonard G.; Mazzotti, Frank J.
2012-01-01
Climate envelope models are widely used to forecast potential effects of climate change on species distributions. A key issue in climate envelope modeling is the selection of predictor variables that most directly influence species. To determine whether model performance and spatial predictions were related to the selection of predictor variables, we compared models using bioclimate variables with models constructed from monthly climate data for twelve terrestrial vertebrate species in the southeastern USA using two different algorithms (random forests or generalized linear models), and two model selection techniques (using uncorrelated predictors or a subset of user-defined biologically relevant predictor variables). There were no differences in performance between models created with bioclimate or monthly variables, but one metric of model performance was significantly greater using the random forest algorithm compared with generalized linear models. Spatial predictions between maps using bioclimate and monthly variables were very consistent using the random forest algorithm with uncorrelated predictors, whereas we observed greater variability in predictions using generalized linear models.
Wang, Zhu; Ma, Shuangge; Wang, Ching-Yun
2015-09-01
In health services and outcome research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider maximum likelihood function plus a penalty including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinated descent algorithm. Furthermore, statistical properties including the standard error formulae are provided. A simulation study shows that the new algorithm not only has more accurate or at least comparable estimation, but also is more robust than the traditional stepwise variable selection. The proposed methods are applied to analyze the health care demand in Germany using the open-source R package mpath. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Tang, Rongnian; Chen, Xupeng; Li, Chuang
2018-05-01
Near-infrared spectroscopy is an efficient, low-cost technology that has potential as an accurate method for detecting the nitrogen content of natural rubber leaves. The successive projections algorithm (SPA) is a widely used variable selection method for multivariate calibration, which uses projection operations to select a variable subset with minimum multi-collinearity. However, due to the fluctuation of correlation between variables, high collinearity may still exist among non-adjacent variables of the subset obtained by the basic SPA. Based on an analysis of the correlation matrix of the spectral data, this paper proposes a correlation-based SPA (CB-SPA) that applies the successive projections algorithm within regions of consistent correlation. The results show that CB-SPA selects variable subsets with more valuable variables and less multi-collinearity. Meanwhile, models established on the CB-SPA subset outperform those built on the basic SPA subsets in predicting nitrogen content, in terms of both cross-validation and external prediction. Moreover, CB-SPA is more efficient: the time cost of its selection procedure is one-twelfth that of the basic SPA.
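Since CB-SPA builds on the basic successive projections algorithm, a compact sketch of that baseline may be useful for reference: starting from a given column, it repeatedly picks the column with the largest norm after projection onto the orthogonal complement of the columns already selected. The correlation-based regionalisation that defines CB-SPA is not shown.

```python
import numpy as np

def spa(X_cal, start_col, n_select):
    """Basic successive projections algorithm: greedily pick columns with minimum
    collinearity by maximizing the norm of their orthogonalized projections."""
    X = X_cal.astype(float).copy()
    selected = [start_col]
    for _ in range(n_select - 1):
        xk = X[:, selected[-1]]
        # project every column onto the orthogonal complement of the last selected column
        X = X - np.outer(xk, xk @ X) / (xk @ xk)
        X[:, selected] = 0.0                    # never re-select a column
        selected.append(int(np.argmax(np.linalg.norm(X, axis=0))))
    return selected
```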
NASA Astrophysics Data System (ADS)
Shams Esfand Abadi, Mohammad; AbbasZadeh Arani, Seyed Ali Asghar
2011-12-01
This paper extends the recently introduced variable step-size (VSS) approach to a family of adaptive filter algorithms. The method uses prior knowledge of the channel impulse response statistics; accordingly, the optimal step-size vector is obtained by minimizing the mean-square deviation (MSD). The presented algorithms are the VSS affine projection algorithm (VSS-APA), the VSS selective partial update NLMS (VSS-SPU-NLMS), the VSS-SPU-APA, and the VSS selective regressor APA (VSS-SR-APA). In the VSS-SPU algorithms the filter coefficients are partially updated, which reduces the computational complexity. In the VSS-SR-APA, the optimal selection of input regressors is performed during the adaptation. The presented algorithms feature good convergence speed, low steady-state mean square error (MSE), and low computational complexity. We demonstrate the good performance of the proposed algorithms through several simulations in a system identification scenario.
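As an illustration of the general idea (not the MSD-optimal step-size vector derived in the paper), the following is a minimal NLMS identifier with a simple error-driven variable step size, applied to a toy system identification problem; all parameter values are illustrative.

```python
import numpy as np

def vss_nlms(x, d, n_taps, mu_min=0.01, mu_max=1.0, alpha=0.97, gamma=1e-3, eps=1e-8):
    """NLMS adaptive filter with a simple error-energy-driven variable step size."""
    w = np.zeros(n_taps)
    mu = mu_max
    y_hat = np.zeros(len(d))
    for n in range(n_taps - 1, len(d)):
        u = x[n - n_taps + 1:n + 1][::-1]       # regressor, most recent sample first
        y_hat[n] = w @ u
        e = d[n] - y_hat[n]
        mu = float(np.clip(alpha * mu + gamma * e ** 2, mu_min, mu_max))  # VSS rule
        w += mu * e * u / (u @ u + eps)         # normalized LMS update
    return w, y_hat

# Toy system identification: recover an unknown FIR channel from noisy observations
rng = np.random.default_rng(0)
h = np.array([0.7, -0.4, 0.2, 0.1])
x = rng.standard_normal(5000)
d = np.convolve(x, h)[:len(x)] + 0.01 * rng.standard_normal(len(x))
w_est, _ = vss_nlms(x, d, n_taps=4)             # w_est should approach h
```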
NASA Astrophysics Data System (ADS)
Attia, Khalid A. M.; Nassar, Mohammed W. I.; El-Zeiny, Mohamed B.; Serag, Ahmed
2017-01-01
For the first time, a new variable selection method based on swarm intelligence, namely the firefly algorithm, is coupled with three different multivariate calibration models, namely concentration residual augmented classical least squares, artificial neural network and support vector regression, for UV spectral data. A comparative study between the firefly algorithm and the well-known genetic algorithm was carried out. The results revealed the superiority of this new, powerful algorithm over the well-known genetic algorithm. Moreover, different statistical tests were performed and no significant differences were found between the models regarding their predictive abilities. This ensures that simpler and faster models were obtained without any deterioration of the quality of the calibration.
Fast Solution in Sparse LDA for Binary Classification
NASA Technical Reports Server (NTRS)
Moghaddam, Baback
2010-01-01
An algorithm that performs sparse linear discriminant analysis (Sparse-LDA) finds near-optimal solutions in far less time than the prior art when specialized to binary classification (two classes). Sparse-LDA is a type of feature- or variable-selection problem with numerous applications in statistics, machine learning, computer vision, computational finance, operations research, and bioinformatics. Because of their combinatorial nature, feature- or variable-selection problems are NP-hard or computationally intractable in cases involving more than 30 variables or features. Therefore, one typically seeks approximate solutions by means of greedy search algorithms. The prior Sparse-LDA algorithm was a greedy algorithm that considered the best variable or feature to add to, or delete from, its subsets in order to maximally discriminate between multiple classes of data. The present algorithm is designed for the special but prevalent case of two-class or binary classification (e.g., 1 vs. 0, functioning vs. malfunctioning, or change vs. no change). It provides near-optimal solutions on large real-world datasets having hundreds or even thousands of variables or features (e.g., selecting the fewest wavelength bands in a hyperspectral sensor to do terrain classification) and does so in typical computation times of minutes, as compared to days or weeks for the prior art. Sparse LDA requires solving generalized eigenvalue problems for a large number of variable subsets (represented by the submatrices of the input within-class and between-class covariance matrices). In the general (full-rank) case, the amount of computation scales at least cubically with the number of variables, and thus the size of the problems that can be solved is limited accordingly. However, in binary classification, the principal eigenvalues can be found using a special analytic formula, without resorting to costly iterative techniques. The present algorithm exploits this analytic form along with the inherently sequential nature of greedy search itself. Together, this enables the use of highly efficient partitioned-matrix-inverse techniques that yield large speedups in both the forward-selection and backward-elimination stages of greedy algorithms in general.
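The key point, that the leading generalized eigenvalue has a closed form for two classes, can be illustrated compactly: for a candidate subset S the discriminant score reduces to d_S^T S_{w,S}^{-1} d_S, where d is the difference of the class means and S_w the pooled within-class scatter. The sketch below does greedy forward selection with that score; the partitioned-matrix-inverse speedups described in the report are omitted, and names are illustrative.

```python
import numpy as np

def two_class_lda_score(Sw, d, subset):
    """Closed-form leading generalized eigenvalue for a 2-class problem restricted to
    `subset`: d_S^T Sw_S^{-1} d_S (up to a constant factor)."""
    idx = np.array(subset)
    Sw_S = Sw[np.ix_(idx, idx)]
    d_S = d[idx]
    return float(d_S @ np.linalg.solve(Sw_S, d_S))

def greedy_sparse_lda(X, y, k):
    """Greedy forward selection of k variables for binary Sparse-LDA."""
    X0, X1 = X[y == 0], X[y == 1]
    d = X1.mean(axis=0) - X0.mean(axis=0)                 # between-class direction
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + \
         np.cov(X1, rowvar=False) * (len(X1) - 1)          # pooled within-class scatter
    Sw += 1e-6 * np.eye(X.shape[1])                        # ridge for numerical stability
    selected = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = [two_class_lda_score(Sw, d, selected + [j]) for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```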
NASA Astrophysics Data System (ADS)
Agjee, Na'eem Hoosen; Ismail, Riyad; Mutanga, Onisimo
2016-10-01
Water hyacinth plants (Eichhornia crassipes) are threatening freshwater ecosystems throughout Africa. The Neochetina spp. weevils are seen as an effective biocontrol solution that can combat the proliferation of this invasive alien plant. We aimed to determine whether multitemporal hyperspectral data could be utilized to detect the efficacy of the biocontrol agent. The random forest (RF) algorithm was used to classify variable infestation levels over 6 weeks using: (1) all the hyperspectral bands, (2) bands selected by the recursive feature elimination (RFE) algorithm, and (3) bands selected by the Boruta algorithm. Results showed that the RF model using all the bands successfully produced low classification errors (12.50% to 32.29%) for all 6 weeks. However, the RF model using Boruta-selected bands produced lower classification errors (8.33% to 15.62%) than the RF model using all the bands or the bands selected by the RFE algorithm (11.25% to 21.25%) for all 6 weeks, highlighting the utility of Boruta as an all-relevant band selection algorithm. The all-relevant bands selected by Boruta were: 352, 754, 770, 771, 775, 781, 782, 783, 786, and 789 nm. It was concluded that RF coupled with the Boruta band-selection algorithm can be utilized to undertake multitemporal monitoring of variable infestation levels on water hyacinth plants.
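A sketch of this band-selection workflow, assuming the third-party boruta package (BorutaPy) together with scikit-learn; the file names, shapes and parameter values are placeholders rather than the study's actual data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# X: (n_samples, n_bands) hyperspectral reflectance, y: infestation-level labels
X = np.load("hyperspectral_bands.npy")       # placeholder file names
y = np.load("infestation_levels.npy")

rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)
boruta = BorutaPy(rf, n_estimators="auto", random_state=42)
boruta.fit(X, y)                             # compares real bands against shuffled "shadow" bands

selected_bands = np.flatnonzero(boruta.support_)
rf_final = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf_final.fit(X[:, selected_bands], y)
print("all-relevant bands:", selected_bands, "OOB score:", rf_final.oob_score_)
```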
Fernandez-Lozano, C.; Canto, C.; Gestal, M.; Andrade-Garda, J. M.; Rabuñal, J. R.; Dorado, J.; Pazos, A.
2013-01-01
Given the background of the use of neural networks in problems of apple juice classification, this paper aims at implementing a newly developed method in the field of machine learning: the Support Vector Machine (SVM). Accordingly, a hybrid model that combines genetic algorithms and support vector machines is suggested, in such a way that, when using the SVM as the fitness function of the Genetic Algorithm (GA), the most representative variables for a specific classification problem can be selected. PMID:24453933
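A minimal sketch of such a hybrid wrapper, in which the fitness of a binary chromosome (a variable mask) is the cross-validated accuracy of an SVM trained on the selected variables; written with scikit-learn and NumPy, with all parameter values illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y):
    """Fitness of a chromosome = CV accuracy of an SVM on the selected variables."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="rbf", C=1.0), X[:, mask.astype(bool)], y, cv=5).mean()

def ga_select(X, y, pop_size=30, n_gen=40, p_mut=0.05, seed=None):
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_vars))
    for _ in range(n_gen):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        # tournament selection of parents
        parents = pop[[max(rng.choice(pop_size, 3, replace=False), key=lambda i: fit[i])
                       for _ in range(pop_size)]]
        # one-point crossover
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_vars)
            children[i, cut:], children[i + 1, cut:] = (parents[i + 1, cut:].copy(),
                                                        parents[i, cut:].copy())
        # bit-flip mutation and elitism
        children ^= (rng.random(children.shape) < p_mut).astype(children.dtype)
        children[0] = pop[int(np.argmax(fit))]
        pop = children
    fit = np.array([fitness(ind, X, y) for ind in pop])
    best = pop[int(np.argmax(fit))]
    return np.flatnonzero(best), float(fit.max())
```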
NASA Astrophysics Data System (ADS)
Duan, Fajie; Fu, Xiao; Jiang, Jiajia; Huang, Tingting; Ma, Ling; Zhang, Cong
2018-05-01
In this work, an automatic variable selection method for quantitative analysis of soil samples using laser-induced breakdown spectroscopy (LIBS) is proposed, which is based on full spectrum correction (FSC) and modified iterative predictor weighting-partial least squares (mIPW-PLS). The method features automatic selection without manual intervention. To illustrate the feasibility and effectiveness of the method, a comparison with the genetic algorithm (GA) and the successive projections algorithm (SPA) for the detection of different elements (copper, barium and chromium) in soil was implemented. The experimental results showed that all three methods could accomplish variable selection effectively, among which FSC-mIPW-PLS required a significantly shorter computation time (approximately 12 s for 40,000 initial variables) than the others. Moreover, improved quantification models were obtained with the variable selection approaches. The root mean square errors of prediction (RMSEP) of the models using the new method were 27.47 (copper), 37.15 (barium) and 39.70 (chromium) mg/kg, which showed prediction performance comparable to GA and SPA.
Variable selection and model choice in geoadditive regression models.
Kneib, Thomas; Hothorn, Torsten; Tutz, Gerhard
2009-06-01
Model choice and variable selection are issues of major concern in practical regression analyses, arising in many biometric applications such as habitat suitability analyses, where the aim is to identify the influence of potentially many environmental conditions on certain species. We describe regression models for breeding bird communities that facilitate both model choice and variable selection, by a boosting algorithm that works within a class of geoadditive regression models comprising spatial effects, nonparametric effects of continuous covariates, interaction surfaces, and varying coefficients. The major modeling components are penalized splines and their bivariate tensor product extensions. All smooth model terms are represented as the sum of a parametric component and a smooth component with one degree of freedom to obtain a fair comparison between the model terms. A generic representation of the geoadditive model allows us to devise a general boosting algorithm that automatically performs model choice and variable selection.
PCA-LBG-based algorithms for VQ codebook generation
NASA Astrophysics Data System (ADS)
Tsai, Jinn-Tsong; Yang, Po-Yuan
2015-04-01
Vector quantisation (VQ) codebooks are generated by combining principal component analysis (PCA) algorithms with Linde-Buzo-Gray (LBG) algorithms. All training vectors are grouped according to the projected values of the principal components. The PCA-LBG-based algorithms include (1) PCA-LBG-Median, which selects the median vector of each group, (2) PCA-LBG-Centroid, which adopts the centroid vector of each group, and (3) PCA-LBG-Random, which randomly selects a vector from each group. The LBG algorithm then refines a codebook starting from the initial codevectors supplied by the PCA step. The PCA performs an orthogonal transformation to convert a set of potentially correlated variables into a set of variables that are not linearly correlated. Because the orthogonal transformation efficiently distinguishes test image vectors, the proposed PCA-LBG-based algorithms are expected to outperform conventional algorithms in designing VQ codebooks. The experimental results confirm that the proposed PCA-LBG-based algorithms indeed obtain better results compared to existing methods reported in the literature.
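A compact sketch of the PCA-LBG-Centroid variant described above: training vectors are grouped by rank of their first-principal-component projection, the group centroids seed the codebook, and LBG (Lloyd) iterations refine it. scikit-learn's PCA and KMeans are used here for brevity; names and parameters are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_lbg_centroid(train_vectors, codebook_size, n_lbg_iter=25):
    """PCA-LBG-Centroid: PCA projection -> rank-based grouping -> centroid seeds -> LBG refinement."""
    proj = PCA(n_components=1).fit_transform(train_vectors).ravel()
    # group vectors into equal-count bins of the projected value
    order = np.argsort(proj)
    groups = np.array_split(order, codebook_size)
    init = np.vstack([train_vectors[g].mean(axis=0) for g in groups])
    # LBG refinement is equivalent to Lloyd/k-means iterations from the given seeds
    kmeans = KMeans(n_clusters=codebook_size, init=init, n_init=1, max_iter=n_lbg_iter)
    kmeans.fit(train_vectors)
    return kmeans.cluster_centers_            # the VQ codebook
```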
de Almeida, Valber Elias; de Araújo Gomes, Adriano; de Sousa Fernandes, David Douglas; Goicoechea, Héctor Casimiro; Galvão, Roberto Kawakami Harrop; Araújo, Mario Cesar Ugulino
2018-05-01
This paper proposes a new variable selection method for nonlinear multivariate calibration, combining the Successive Projections Algorithm for interval selection (iSPA) with the Kernel Partial Least Squares (Kernel-PLS) modelling technique. The proposed iSPA-Kernel-PLS algorithm is employed in a case study involving a Vis-NIR spectrometric dataset with complex nonlinear features. The analytical problem consists of determining Brix and sucrose content in samples from a sugar production system, on the basis of transflectance spectra. As compared to full-spectrum Kernel-PLS, the iSPA-Kernel-PLS models involve a smaller number of variables and display statistically significant superiority in terms of accuracy and/or bias in the predictions. Published by Elsevier B.V.
NASA Astrophysics Data System (ADS)
Yi, Jin; Li, Xinyu; Xiao, Mi; Xu, Junnan; Zhang, Lin
2017-01-01
Engineering design often involves different types of simulation, which results in expensive computational costs. Variable-fidelity approximation-based design optimization approaches can realize effective simulation and efficient optimization of the design space using approximation models with different levels of fidelity, and they have been widely used in different fields. The selection of sample points for variable-fidelity approximation models, called nested designs, is an essential foundation of such models. In this article a novel nested maximin Latin hypercube design is constructed based on successive local enumeration and a modified novel global harmony search algorithm. In the proposed nested designs, successive local enumeration is employed to select sample points for the low-fidelity model, whereas the modified novel global harmony search algorithm is employed to select sample points for the high-fidelity model. A comparative study with multiple criteria and an engineering application are employed to verify the efficiency of the proposed nested designs approach.
Improving permafrost distribution modelling using feature selection algorithms
NASA Astrophysics Data System (ADS)
Deluigi, Nicola; Lambiel, Christophe; Kanevski, Mikhail
2016-04-01
The availability of an increasing number of spatial data on the occurrence of mountain permafrost allows the employment of machine learning (ML) classification algorithms for modelling the distribution of the phenomenon. One of the major problems when dealing with high-dimensional datasets is the number of input features (variables) involved. Applying ML classification algorithms to this large number of variables leads to the risk of overfitting, with the consequence of poor generalization/prediction. For this reason, applying feature selection (FS) techniques helps simplify the set of factors required and improves the knowledge of the adopted features and their relation to the studied phenomenon. Moreover, removing irrelevant or redundant variables from the dataset effectively improves the quality of the ML prediction. This research deals with a comparative analysis of permafrost distribution models supported by FS variable importance assessment. The input dataset (dimension = 20-25, 10 m spatial resolution) was constructed using landcover maps, climate data and DEM-derived variables (altitude, aspect, slope, terrain curvature, solar radiation, etc.). It was completed with permafrost evidence (geophysical and thermal data and rock glacier inventories) that serves as training permafrost data. The FS algorithms used indicated which variables appeared less statistically important for permafrost presence/absence. Three different algorithms were compared: Information Gain (IG), Correlation-based Feature Selection (CFS) and Random Forest (RF). IG is a filter technique that evaluates the worth of a predictor by measuring the information gain with respect to permafrost presence/absence. CFS, in contrast, evaluates the worth of a subset of predictors by considering the individual predictive ability of each variable along with the degree of redundancy between them. Finally, RF is an ML algorithm that performs FS as part of its overall operation. It operates by constructing a large collection of decorrelated classification trees, and then predicts permafrost occurrence through a majority vote. With the so-called out-of-bag (OOB) error estimate, the classification of permafrost data can be validated and the contribution of each predictor can be assessed. The performance of the compared permafrost distribution models (computed on independent testing sets) increased with the application of FS algorithms to the original dataset, as irrelevant or redundant variables were removed. As a consequence, the process provided faster and more cost-effective predictors and a better understanding of the underlying structures residing in the permafrost data. Our work demonstrates the usefulness of a feature selection step prior to applying a machine learning algorithm. In fact, permafrost predictors could be ranked not only based on their heuristic and subjective importance (expert knowledge), but also based on their statistical relevance in relation to the permafrost distribution.
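Two of the three selectors compared above can be sketched directly with scikit-learn: an information-gain-style filter via mutual information, and the embedded Random Forest ranking validated with the out-of-bag score (CFS is omitted). File names and thresholds are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# X: (n_cells, n_predictors) DEM/climate/landcover features, y: permafrost presence (0/1)
X = np.load("permafrost_predictors.npy")      # placeholder inputs
y = np.load("permafrost_evidence.npy")

# Information-gain-style filter ranking
ig_scores = mutual_info_classif(X, y, random_state=0)
ig_rank = np.argsort(ig_scores)[::-1]

# Embedded ranking from a Random Forest, validated with the out-of-bag error
rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]
print("OOB accuracy:", rf.oob_score_)

# Keep only the predictors ranked in the top 10 by both criteria
keep = sorted(set(ig_rank[:10]) & set(rf_rank[:10]))
```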
Eyler, Lauren; Hubbard, Alan; Juillard, Catherine
2016-10-01
Low and middle-income countries (LMICs) and the world's poor bear a disproportionate share of the global burden of injury. Data regarding disparities in injury are vital to inform injury prevention and trauma systems strengthening interventions targeted towards vulnerable populations, but are limited in LMICs. We aim to facilitate injury disparities research by generating a standardized methodology for assessing economic status in resource-limited country trauma registries, where complex metrics such as income, expenditures, and wealth index are infeasible to assess. To address this need, we developed a cluster analysis-based algorithm for generating simple population-specific metrics of economic status using nationally representative Demographic and Health Surveys (DHS) household assets data. For a limited number of variables, g, our algorithm performs weighted k-medoids clustering of the population using all combinations of g asset variables and selects the combination of variables and number of clusters that maximizes average silhouette width (ASW). In simulated datasets containing both randomly distributed variables and "true" population clusters defined by correlated categorical variables, the algorithm selected the correct variable combination and appropriate cluster numbers unless variable correlation was very weak. When used with 2011 Cameroonian DHS data, our algorithm identified twenty economic clusters with an ASW of 0.80, indicating well-defined population clusters. This economic model for assessing health disparities will be used in the new Cameroonian six-hospital centralized trauma registry. By describing our standardized methodology and algorithm for generating economic clustering models, we aim to facilitate measurement of health disparities in other trauma registries in resource-limited countries. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
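A simplified sketch of the selection loop described above, assuming the third-party scikit-learn-extra package for k-medoids and a small pool of binary asset indicators; the survey weights used in the paper are omitted and all names are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids      # third-party scikit-learn-extra package

def best_economic_clustering(assets, g=4, k_range=range(2, 8)):
    """Search all g-variable combinations and cluster counts, keeping the
    combination/k that maximizes average silhouette width (ASW)."""
    n_vars = assets.shape[1]
    best = (-1.0, None, None)                   # (ASW, variable indices, labels)
    for cols in combinations(range(n_vars), g):
        X = assets[:, cols].astype(float)
        for k in k_range:
            labels = KMedoids(n_clusters=k, random_state=0).fit_predict(X)
            if len(np.unique(labels)) < 2:
                continue
            asw = silhouette_score(X, labels)
            if asw > best[0]:
                best = (asw, cols, labels)
    return best

# assets: (n_households, n_asset_indicators) binary DHS-style matrix (placeholder)
# asw, selected_vars, clusters = best_economic_clustering(assets, g=4)
```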
Clustering Words to Match Conditions: An Algorithm for Stimuli Selection in Factorial Designs
ERIC Educational Resources Information Center
Guasch, Marc; Haro, Juan; Boada, Roger
2017-01-01
With the increasing refinement of language processing models and the new discoveries about which variables can modulate these processes, stimuli selection for experiments with a factorial design is becoming a tough task. Selecting sets of words that differ in one variable, while matching these same words into dozens of other confounding variables…
An Ensemble Successive Project Algorithm for Liquor Detection Using Near Infrared Sensor.
Qu, Fangfang; Ren, Dong; Wang, Jihua; Zhang, Zhong; Lu, Na; Meng, Lei
2016-01-11
Spectral analysis based on near infrared (NIR) sensors is a powerful tool for complex information processing and high-precision recognition, and it has been widely applied to quality analysis and online inspection of agricultural products. This paper proposes a new method to address the instability of the successive projections algorithm (SPA) with small sample sizes, as well as the lack of association between the selected variables and the analyte. The proposed method is an evaluated bootstrap ensemble SPA (EBSPA) based on a variable evaluation index (EI) for variable selection, and it is applied to the quantitative prediction of alcohol concentration in liquor using an NIR sensor. In the experiment, the proposed EBSPA is combined with three kinds of modeling methods to test its performance. In addition, the proposed EBSPA combined with partial least squares is compared with other state-of-the-art variable selection methods. The results show that the proposed method can remedy the defects of SPA and that it has the best generalization performance and stability. Furthermore, the physical meaning of the variables selected from the near infrared sensor data is clear, which can effectively reduce the number of variables and improve prediction accuracy.
Žuvela, Petar; Liu, J Jay; Macur, Katarzyna; Bączek, Tomasz
2015-10-06
In this work, performance of five nature-inspired optimization algorithms, genetic algorithm (GA), particle swarm optimization (PSO), artificial bee colony (ABC), firefly algorithm (FA), and flower pollination algorithm (FPA), was compared in molecular descriptor selection for development of quantitative structure-retention relationship (QSRR) models for 83 peptides that originate from eight model proteins. The matrix with 423 descriptors was used as input, and QSRR models based on selected descriptors were built using partial least squares (PLS), whereas root mean square error of prediction (RMSEP) was used as a fitness function for their selection. Three performance criteria, prediction accuracy, computational cost, and the number of selected descriptors, were used to evaluate the developed QSRR models. The results show that all five variable selection methods outperform interval PLS (iPLS), sparse PLS (sPLS), and the full PLS model, whereas GA is superior because of its lowest computational cost and higher accuracy (RMSEP of 5.534%) with a smaller number of variables (nine descriptors). The GA-QSRR model was validated initially through Y-randomization. In addition, it was successfully validated with an external testing set out of 102 peptides originating from Bacillus subtilis proteomes (RMSEP of 22.030%). Its applicability domain was defined, from which it was evident that the developed GA-QSRR exhibited strong robustness. All the sources of the model's error were identified, thus allowing for further application of the developed methodology in proteomics.
Cao, Hongbao; Duan, Junbo; Lin, Dongdong; Shugart, Yin Yao; Calhoun, Vince; Wang, Yu-Ping
2014-11-15
Integrative analysis of multiple data types can take advantage of their complementary information and therefore may provide higher power to identify potential biomarkers that would be missed by individual data analysis. Due to the different natures of diverse data modalities, data integration is challenging. Here we address the data integration problem by developing a generalized sparse model (GSM) using weighting factors to integrate multi-modality data for biomarker selection. As an example, we applied the GSM model to a joint analysis of two types of schizophrenia data sets: 759,075 SNPs and 153,594 functional magnetic resonance imaging (fMRI) voxels in 208 subjects (92 cases/116 controls). To solve this small-sample-large-variable problem, we developed a novel sparse representation based variable selection (SRVS) algorithm, with the primary aim of identifying biomarkers associated with schizophrenia. To validate the effectiveness of the selected variables, we performed multivariate classification followed by a ten-fold cross validation. We compared our proposed SRVS algorithm with an earlier sparse model based variable selection algorithm for integrated analysis. In addition, we compared it with the traditional statistical methods for univariate data analysis (chi-squared test for SNP data and ANOVA for fMRI data). Results showed that our proposed SRVS method can identify novel biomarkers that show stronger capability in distinguishing schizophrenia patients from healthy controls. Moreover, better classification ratios were achieved using biomarkers from both types of data, suggesting the importance of integrative analysis. Copyright © 2014 Elsevier Inc. All rights reserved.
Ouyang, Qin; Zhao, Jiewen; Chen, Quansheng
2015-01-01
The non-sugar solids (NSS) content is one of the most important nutrition indicators of Chinese rice wine. This study proposes a rapid method for the measurement of NSS content in Chinese rice wine using near infrared (NIR) spectroscopy. We also systematically studied the efficient spectral variable selection algorithms required for the modeling. A new algorithm of synergy interval partial least squares with competitive adaptive reweighted sampling (Si-CARS-PLS) was proposed for modeling. The performance of the final model was evaluated using the root mean square error of calibration (RMSEC) and correlation coefficient (Rc) in the calibration set and, similarly, by the root mean square error of prediction (RMSEP) and correlation coefficient (Rp) in the prediction set. The optimum model by the Si-CARS-PLS algorithm was achieved when 7 PLS factors and 18 variables were included, and the results were as follows: Rc=0.95 and RMSEC=1.12 in the calibration set, Rp=0.95 and RMSEP=1.22 in the prediction set. In addition, the Si-CARS-PLS algorithm showed its superiority when compared with the commonly used algorithms in multivariate calibration. This work demonstrated that NIR spectroscopy combined with a suitable multivariate calibration algorithm has high potential for rapid measurement of the NSS content in Chinese rice wine. Copyright © 2015 Elsevier B.V. All rights reserved.
Bayesian block-diagonal variable selection and model averaging
Papaspiliopoulos, O.; Rossell, D.
2018-01-01
We propose a scalable algorithmic framework for exact Bayesian variable selection and model averaging in linear models under the assumption that the Gram matrix is block-diagonal, and as a heuristic for exploring the model space for general designs. In block-diagonal designs our approach returns the most probable model of any given size without resorting to numerical integration. The algorithm also provides a novel and efficient solution to the frequentist best subset selection problem for block-diagonal designs. Posterior probabilities for any number of models are obtained by evaluating a single one-dimensional integral, and other quantities of interest such as variable inclusion probabilities and model-averaged regression estimates are obtained by an adaptive, deterministic one-dimensional numerical integration. The overall computational cost scales linearly with the number of blocks, which can be processed in parallel, and exponentially with the block size, rendering it most adequate in situations where predictors are organized in many moderately-sized blocks. For general designs, we approximate the Gram matrix by a block-diagonal matrix using spectral clustering and propose an iterative algorithm that capitalizes on the block-diagonal algorithms to explore efficiently the model space. All methods proposed in this paper are implemented in the R library mombf. PMID:29861501
Hromadka, T.V.; Guymon, G.L.
1985-01-01
An algorithm is presented for the numerical solution of the Laplace equation boundary-value problem, which is assumed to apply to soil freezing or thawing. The Laplace equation is numerically approximated by the complex-variable boundary-element method. The algorithm aids in reducing integrated relative error by providing a true measure of modeling error along the solution domain boundary. This measure of error can be used to select locations for adding, removing, or relocating nodal points on the boundary or to provide bounds for the integrated relative error of unknown nodal variable values along the boundary.
NASA Astrophysics Data System (ADS)
Wang, Chun; Ji, Zhicheng; Wang, Yan
2017-07-01
In this paper, the multi-objective flexible job shop scheduling problem (MOFJSP) was studied with the objectives of minimizing makespan, total workload and critical workload. A variable neighborhood evolutionary algorithm (VNEA) was proposed to obtain a set of Pareto optimal solutions. First, two novel crowding operators defined in the decision space and the objective space were proposed, and they were used in mating selection and environmental selection, respectively. Then, two well-designed neighborhood structures were used in the local search; they take the problem characteristics into account and promote fast convergence. Finally, an extensive comparison was carried out with state-of-the-art methods specially presented for solving MOFJSP on well-known benchmark instances. The results show that the proposed VNEA is more effective than other algorithms in solving MOFJSP.
Application of the Trend Filtering Algorithm for Photometric Time Series Data
NASA Astrophysics Data System (ADS)
Gopalan, Giri; Plavchan, Peter; van Eyken, Julian; Ciardi, David; von Braun, Kaspar; Kane, Stephen R.
2016-08-01
Detecting transient light curves (e.g., transiting planets) requires high-precision data, and thus it is important to effectively filter systematic trends affecting ground-based wide-field surveys. We apply an implementation of the Trend Filtering Algorithm (TFA) to the 2MASS calibration catalog and selected Palomar Transient Factory (PTF) photometric time series data. TFA is successful at reducing the overall dispersion of light curves; however, it may over-filter intrinsic variables and increase “instantaneous” dispersion when the template set is not judiciously chosen. In an attempt to rectify these issues we modify the original TFA from the literature by including measurement uncertainties in its computation, including ancillary data correlated with noise, and algorithmically selecting a template set using clustering algorithms as suggested by various authors. This approach may be particularly useful for appropriately accounting for surveys with variable photometric precision and/or combined data sets. In summary, our contributions are to provide a MATLAB software implementation of TFA and a number of modifications tested on synthetic and real data, to summarize the performance of TFA and the various modifications on real ground-based data sets (2MASS and PTF), and to assess the efficacy of TFA and its modifications using synthetic light curve tests consisting of transiting and sinusoidal variables. While the transiting-variables test indicates that these modifications confer no advantage to transit detection, the sinusoidal-variables test indicates potential improvements in detection accuracy.
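The core TFA step, a linear fit of a set of template light curves to the target followed by subtraction of the fitted trend, can be sketched briefly. The version below adds the per-epoch uncertainty weighting mentioned above, while the template clustering and ancillary-data terms are not shown; names are illustrative.

```python
import numpy as np

def tfa_detrend(target, templates, sigma=None):
    """Trend Filtering Algorithm core: (weighted) least-squares fit of template
    light curves plus a constant term to the target, then subtraction.

    target    : (n_epochs,) magnitudes of the star being filtered
    templates : (n_epochs, n_templates) magnitudes of the template stars
    sigma     : (n_epochs,) per-epoch uncertainties (optional weighting)
    """
    A = np.column_stack([np.ones_like(target), templates])
    if sigma is not None:
        w = 1.0 / sigma                            # weight epochs by inverse uncertainty
        coef, *_ = np.linalg.lstsq(A * w[:, None], target * w, rcond=None)
    else:
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    trend = A @ coef
    return target - trend + np.median(target)      # detrended light curve, zero point restored
```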
Mao, Yong; Zhou, Xiao-Bo; Pi, Dao-Ying; Sun, You-Xian; Wong, Stephen T C
2005-10-01
In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables and small number of samples, as well as the non-linearity of the problem. It is difficult to obtain satisfactory results using conventional linear statistical methods. Recursive feature elimination based on support vector machines (SVM-RFE) is an effective algorithm for gene selection and cancer classification, which are integrated into a consistent framework. In this paper, we propose a new method to select the parameters of this algorithm implemented with Gaussian-kernel SVMs, as a better alternative to the common practice of picking the apparently best parameters, by using a genetic algorithm to search for a pair of optimal parameters. Fast implementation issues for this method are also discussed for pragmatic reasons. The proposed method was tested on two representative hereditary breast cancer and acute leukaemia datasets. The experimental results indicate that the proposed method performs well in selecting genes and achieves high classification accuracies with these genes.
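A scikit-learn sketch of SVM-RFE wrapped in a parameter grid search; for simplicity it ranks genes with a linear-kernel SVM (whose weights drive the elimination), whereas the paper works with Gaussian-kernel SVMs and a genetic search over the two kernel parameters. Data names are placeholders.

```python
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=50, step=0.1)),
    ("clf", SVC(kernel="rbf")),
])

param_grid = {
    "rfe__estimator__C": [0.1, 1, 10],     # parameter of the linear SVM used for ranking
    "clf__C": [1, 10, 100],                # parameters of the final Gaussian-kernel SVM
    "clf__gamma": ["scale", 0.01, 0.001],
}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
# search.fit(X_genes, y_labels)            # X_genes: expression matrix, y_labels: class labels
# selected = search.best_estimator_.named_steps["rfe"].support_
```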
Penalized regression procedures for variable selection in the potential outcomes framework
Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L.
2015-01-01
A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple ‘impute, then select’ class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems, and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data and imputation are drawn. A difference LASSO algorithm is defined, along with its multiple imputation analogues. The procedures are illustrated using a well-known right heart catheterization dataset. PMID:25628185
Recursive Branching Simulated Annealing Algorithm
NASA Technical Reports Server (NTRS)
Bolcar, Matthew; Smith, J. Scott; Aronstein, David
2012-01-01
This innovation is a variation of a simulated-annealing optimization algorithm that uses a recursive-branching structure to parallelize the search of a parameter space for the globally optimal solution to an objective. The algorithm has been demonstrated to be more effective at searching a parameter space than traditional simulated-annealing methods for a particular problem of interest, and it can readily be applied to a wide variety of optimization problems, including those with a parameter space having both discrete-value parameters (combinatorial) and continuous-variable parameters. It can take the place of a conventional simulated-annealing, Monte-Carlo, or random-walk algorithm. In a conventional simulated-annealing (SA) algorithm, a starting configuration is randomly selected within the parameter space. The algorithm randomly selects another configuration from the parameter space and evaluates the objective function for that configuration. If the objective function value is better than the previous value, the new configuration is adopted as the new point of interest in the parameter space. If the objective function value is worse than the previous value, the new configuration may be adopted, with a probability determined by a temperature parameter, used in analogy to annealing in metals. As the optimization continues, the region of the parameter space from which new configurations can be selected shrinks, and in conjunction with lowering the annealing temperature (and thus lowering the probability for adopting configurations in parameter space with worse objective functions), the algorithm can converge on the globally optimal configuration. The Recursive Branching Simulated Annealing (RBSA) algorithm shares some features with the SA algorithm, notably including the basic principles that a starting configuration is randomly selected from within the parameter space, the algorithm tests other configurations with the goal of finding the globally optimal solution, and the region from which new configurations can be selected shrinks as the search continues. The key difference between these algorithms is that in the SA algorithm, a single path, or trajectory, is taken in parameter space, from the starting point to the globally optimal solution, while in the RBSA algorithm, many trajectories are taken; by exploring multiple regions of the parameter space simultaneously, the algorithm has been shown to converge on the globally optimal solution about an order of magnitude faster than when using conventional algorithms. Novel features of the RBSA algorithm include: 1. More efficient searching of the parameter space due to the branching structure, in which multiple random configurations are generated and multiple promising regions of the parameter space are explored; 2. The implementation of a trust region for each parameter in the parameter space, which provides a natural way of enforcing upper- and lower-bound constraints on the parameters; and 3. The optional use of a constrained gradient-search optimization, performed on the continuous variables around each branch's configuration in parameter space to improve search efficiency by allowing for fast fine-tuning of the continuous variables within the trust region at that configuration point.
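A toy sketch of the branching idea described above: several random candidates are drawn inside a shrinking trust region around each surviving configuration, the best few branches are kept, and the region shrinks as the search "cools". The acceptance-probability schedule and the constrained gradient-search refinement of the NASA algorithm are not reproduced, and all names and parameters are illustrative.

```python
import numpy as np

def rbsa_minimize(objective, lower, upper, n_branches=4, n_candidates=8,
                  n_levels=30, shrink=0.85, seed=None):
    """Recursive-branching simulated-annealing-style search (simplified sketch)."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    radius = (upper - lower) / 2.0                          # initial trust region
    # start from several random configurations instead of a single SA trajectory
    branches = [lower + rng.random(len(lower)) * (upper - lower) for _ in range(n_branches)]
    for _ in range(n_levels):
        pool = []
        for x in branches:
            pool.append(x)
            for _ in range(n_candidates):
                cand = np.clip(x + rng.uniform(-radius, radius), lower, upper)
                pool.append(cand)                           # bound constraints enforced by clipping
        pool.sort(key=objective)                            # keep the best branches overall
        branches = pool[:n_branches]
        radius *= shrink                                    # shrink the trust region as the search "cools"
    best = min(branches, key=objective)
    return best, float(objective(best))

# Example: minimize a multimodal test function (Rastrigin)
f = lambda v: np.sum(v ** 2 - 10 * np.cos(2 * np.pi * v) + 10)
x_best, f_best = rbsa_minimize(f, lower=[-5.12] * 4, upper=[5.12] * 4, seed=1)
```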
Balabin, Roman M; Smirnov, Sergey V
2011-04-29
During the past several years, near-infrared (near-IR/NIR) spectroscopy has increasingly been adopted as an analytical tool in various fields, from the petroleum to the biomedical sector. The NIR spectrum (above 4000 cm⁻¹) of a sample is typically measured by modern instruments at a few hundred wavelengths. Recently, considerable effort has been directed towards developing procedures to identify variables (wavelengths) that contribute useful information. Variable selection (VS) or feature selection, also called frequency selection or wavelength selection, is a critical step in data analysis for vibrational spectroscopy (infrared, Raman, or NIRS). In this paper, we compare the performance of 16 different feature selection methods for the prediction of properties of biodiesel fuel, including density, viscosity, methanol content, and water concentration. The feature selection algorithms tested include stepwise multiple linear regression (MLR-step), interval partial least squares regression (iPLS), backward iPLS (BiPLS), forward iPLS (FiPLS), moving window partial least squares regression (MWPLS), (modified) changeable size moving window partial least squares (CSMWPLS/MCSMWPLSR), searching combination moving window partial least squares (SCMWPLS), successive projections algorithm (SPA), uninformative variable elimination (UVE, including UVE-SPA), simulated annealing (SA), back-propagation artificial neural networks (BP-ANN), Kohonen artificial neural network (K-ANN), and genetic algorithms (GAs, including GA-iPLS). Two linear techniques for calibration model building, namely multiple linear regression (MLR) and partial least squares regression/projection to latent structures (PLS/PLSR), are used for the evaluation of biofuel properties. A comparison with a non-linear calibration model, artificial neural networks (ANN-MLP), is also provided. A discussion of gasoline, ethanol-gasoline (bioethanol), and diesel fuel data is presented. The results of applying other spectroscopic techniques, such as Raman, ultraviolet-visible (UV-vis), or nuclear magnetic resonance (NMR) spectroscopy, can also be greatly improved by an appropriate choice of feature selection. Copyright © 2011 Elsevier B.V. All rights reserved.
Gui, Guan; Chen, Zhang-xin; Xu, Li; Wan, Qun; Huang, Jiyan; Adachi, Fumiyuki
2014-01-01
Channel estimation is one of the key technical issues in sparse frequency-selective fading multiple-input multiple-output (MIMO) communication systems using the orthogonal frequency division multiplexing (OFDM) scheme. To estimate sparse MIMO channels, sparse invariable step-size normalized least mean square (ISS-NLMS) algorithms have been applied to adaptive sparse channel estimation (ASCE). It is well known that the step size is a critical parameter which controls three aspects: algorithm stability, estimation performance, and computational cost. However, traditional methods are prone to estimation performance loss because an invariable step size cannot balance the three aspects simultaneously. In this paper, we propose two stable sparse variable step-size NLMS (VSS-NLMS) algorithms to improve the accuracy of MIMO channel estimators. First, ASCE is formulated in MIMO-OFDM systems. Second, different sparse penalties are introduced to the VSS-NLMS algorithm for ASCE. In addition, the difference between sparse ISS-NLMS and sparse VSS-NLMS algorithms is explained and their lower bounds are derived. Finally, to verify the effectiveness of the proposed algorithms for ASCE, several selected simulation results are shown to prove that the proposed sparse VSS-NLMS algorithms can achieve better estimation performance than the conventional methods in terms of mean square error (MSE) and bit error rate (BER) metrics.
Gui, Guan; Chen, Zhang-xin; Xu, Li; Wan, Qun; Huang, Jiyan; Adachi, Fumiyuki
2014-01-01
The channel estimation problem is one of the key technical issues in sparse frequency-selective fading multiple-input multiple-output (MIMO) communication systems using the orthogonal frequency division multiplexing (OFDM) scheme. To estimate sparse MIMO channels, sparse invariable step-size normalized least mean square (ISS-NLMS) algorithms have been applied to adaptive sparse channel estimation (ASCE). It is well known that the step size is a critical parameter that controls three aspects: algorithm stability, estimation performance, and computational cost. However, traditional methods are prone to estimation performance loss because an ISS cannot balance the three aspects simultaneously. In this paper, we propose two stable sparse variable step-size NLMS (VSS-NLMS) algorithms to improve the accuracy of MIMO channel estimators. First, ASCE is formulated in MIMO-OFDM systems. Second, different sparse penalties are introduced to the VSS-NLMS algorithm for ASCE. In addition, the difference between sparse ISS-NLMS algorithms and sparse VSS-NLMS ones is explained and their lower bounds are also derived. Finally, to verify the effectiveness of the proposed algorithms for ASCE, selected simulation results are presented to show that the proposed sparse VSS-NLMS algorithms can achieve better estimation performance than the conventional methods in terms of mean square error (MSE) and bit error rate (BER) metrics. PMID:25089286
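For readers unfamiliar with NLMS adaptation, the sketch below implements a plain NLMS channel estimator with a simple error-driven variable step-size heuristic; the update rule is a generic textbook-style choice, not the sparse VSS-NLMS algorithms proposed in the paper, and the channel length, noise level, and constants are assumptions.

# Hedged sketch: generic NLMS channel estimation with a heuristic variable step size.
import numpy as np

rng = np.random.default_rng(1)
N = 16                                   # channel length (assumed)
h_true = np.zeros(N); h_true[[2, 7, 11]] = [0.8, -0.5, 0.3]   # sparse channel

h_hat = np.zeros(N)
x_buf = np.zeros(N)                      # tapped delay line of the input
mu, eps = 0.5, 1e-6                      # initial step size and regularizer (assumed)

for n in range(5000):
    x = rng.normal()
    x_buf = np.roll(x_buf, 1); x_buf[0] = x
    d = h_true @ x_buf + 0.01 * rng.normal()           # noisy channel output
    e = d - h_hat @ x_buf                              # a-priori estimation error
    mu = 0.95 * mu + 0.05 * min(1.0, e * e)            # shrink step size as error drops
    h_hat += (mu / (x_buf @ x_buf + eps)) * e * x_buf  # NLMS update

print("MSE of channel estimate:", np.mean((h_true - h_hat) ** 2))

The point of the variable step size is visible here: a large mu speeds early convergence, while the shrinking mu reduces steady-state misadjustment, which is the trade-off the paper's VSS-NLMS algorithms address with sparse penalties added on top.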
Prediction of Baseflow Index of Catchments using Machine Learning Algorithms
NASA Astrophysics Data System (ADS)
Yadav, B.; Hatfield, K.
2017-12-01
We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment-scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elastic net, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on baseflow separated from the daily streamflow hydrograph using the HYSEP filter. The surrogate catchment attributes were compiled from multiple sources, including digital elevation models, soil, land use, and climate data, and other publicly available ancillary and geospatial data. 80% of the catchments were used to train the ML algorithms, and the remaining 20% were used as an independent test set to measure the generalization performance of the fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables, selected after careful evaluation of the bias-variance tradeoff, include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation. The most promising algorithms, exceeding an accuracy score (r-square) of 0.7 on test data, include support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Considering both the accuracy and the computational complexity of these algorithms, we identify the extremely randomized trees as the best performing algorithm for BFI prediction in ungauged basins.
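The tuning-plus-hold-out workflow described above can be sketched as follows with one of the tested learners (extremely randomized trees); the grid values, feature count, and synthetic data are assumptions for illustration only.

# Hedged sketch: grid-searched extremely randomized trees with a held-out test split.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 6))            # stand-ins for elevation, slope, sand fraction,
y = X @ rng.normal(size=6) + 0.2 * rng.normal(size=800)   # permeability, temperature, precipitation

# 80/20 split: train for tuning, held-out test for generalization performance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

grid = {"n_estimators": [200, 500], "max_features": [0.5, 1.0], "min_samples_leaf": [1, 5]}
search = GridSearchCV(ExtraTreesRegressor(random_state=0), grid, cv=10, scoring="r2")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("test r2:", search.best_estimator_.score(X_te, y_te))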
Zuhtuogullari, Kursat; Allahverdi, Novruz; Arikan, Nihat
2013-01-01
Systems with high-dimensional input spaces require long processing times and large amounts of memory. Most attribute selection algorithms suffer from limits on the number of input dimensions and from information storage problems. These problems are addressed by the developed feature reduction software, which uses a new modified selection mechanism that adds solution candidates from the middle region. The hybrid system software is constructed for reducing the input attributes of systems with a large number of input variables. The designed software also supports the roulette wheel selection mechanism. Linear order crossover is used as the recombination operator. In genetic algorithm based soft computing methods, locking onto local solutions is also a problem, which is eliminated by the developed software. Faster and more effective results are obtained in the test procedures. Twelve input variables of the urological system have been reduced to reducts (reduced input attribute sets) with seven, six, and five elements. It can be seen from the obtained results that the developed software with modified selection has advantages in terms of memory allocation, execution time, classification accuracy, sensitivity, and specificity when compared with other reduction algorithms on the urological test data.
Zuhtuogullari, Kursat; Allahverdi, Novruz; Arikan, Nihat
2013-01-01
Systems with high-dimensional input spaces require long processing times and large amounts of memory. Most attribute selection algorithms suffer from limits on the number of input dimensions and from information storage problems. These problems are addressed by the developed feature reduction software, which uses a new modified selection mechanism that adds solution candidates from the middle region. The hybrid system software is constructed for reducing the input attributes of systems with a large number of input variables. The designed software also supports the roulette wheel selection mechanism. Linear order crossover is used as the recombination operator. In genetic algorithm based soft computing methods, locking onto local solutions is also a problem, which is eliminated by the developed software. Faster and more effective results are obtained in the test procedures. Twelve input variables of the urological system have been reduced to reducts (reduced input attribute sets) with seven, six, and five elements. It can be seen from the obtained results that the developed software with modified selection has advantages in terms of memory allocation, execution time, classification accuracy, sensitivity, and specificity when compared with other reduction algorithms on the urological test data. PMID:23573172
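A minimal sketch of the roulette-wheel (fitness-proportionate) selection step mentioned above is shown below; it reflects only the textbook mechanism and does not reproduce the modified middle-region candidate scheme or the linear order crossover of the developed software.

# Hedged sketch: fitness-proportionate (roulette-wheel) parent selection.
import numpy as np

def roulette_wheel_select(fitness, n_parents, rng):
    """Pick n_parents individuals with probability proportional to fitness."""
    fitness = np.asarray(fitness, dtype=float)
    probs = fitness / fitness.sum()          # assumes non-negative fitness values
    return rng.choice(len(fitness), size=n_parents, p=probs)

rng = np.random.default_rng(42)
population_fitness = [0.9, 0.4, 0.1, 0.7, 0.3]   # illustrative fitness values
parents = roulette_wheel_select(population_fitness, n_parents=4, rng=rng)
print("selected parent indices:", parents)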
Yu, P.; Sun, J.; Wolz, R.; Stephenson, D.; Brewer, J.; Fox, N.C.; Cole, P.E.; Jack, C.R.; Hill, D.L.G.; Schwarz, A.J.
2014-01-01
Objective To evaluate the effect of computational algorithm, measurement variability and cut-point on hippocampal volume (HCV)-based patient selection for clinical trials in mild cognitive impairment (MCI). Methods We used normal control and amnestic MCI subjects from ADNI-1 as normative reference and screening cohorts. We evaluated the enrichment performance of four widely-used hippocampal segmentation algorithms (FreeSurfer, HMAPS, LEAP and NeuroQuant) in terms of two-year changes in MMSE, ADAS-Cog and CDR-SB. We modeled the effect of algorithm, test-retest variability and cut-point on sample size, screen fail rates and trial cost and duration. Results HCV-based patient selection yielded not only reduced sample sizes (by ~40–60%) but also lower trial costs (by ~30–40%) across a wide range of cut-points. Overall, the dependence on the cut-point value was similar for the three clinical instruments considered. Conclusion These results provide a guide to the choice of HCV cut-point for aMCI clinical trials, allowing an informed trade-off between statistical and practical considerations. PMID:24211008
A New Algorithm to Optimize Maximal Information Coefficient
Luo, Feng; Yuan, Zheming
2016-01-01
The maximal information coefficient (MIC) captures dependences between paired variables, including both functional and non-functional relationships. In this paper, we develop a new method, ChiMIC, to calculate MIC values. The ChiMIC algorithm uses the chi-square test to terminate grid optimization and thereby removes the maximal grid size restriction of the original ApproxMaxMI algorithm. Computational experiments show that the ChiMIC algorithm maintains the same MIC values for noiseless functional relationships, but gives much smaller MIC values for independent variables. For noisy functional relationships, the ChiMIC algorithm reaches the optimal partition much faster. Furthermore, the MCN values based on MIC calculated by ChiMIC capture the complexity of functional relationships better, and the statistical powers of MIC calculated by ChiMIC are higher than those calculated by ApproxMaxMI. Moreover, the computational costs of ChiMIC are much lower than those of ApproxMaxMI. We apply the MIC values to feature selection and obtain better classification accuracy using features selected by the MIC values from ChiMIC. PMID:27333001
Ronald E. McRoberts
2009-01-01
Nearest neighbors techniques have been shown to be useful for predicting multiple forest attributes from forest inventory and Landsat satellite image data. However, in regions lacking good digital land cover information, nearest neighbors selected to predict continuous variables such as tree volume must be selected without regard to relevant categorical variables such...
Algorithms for Discovery of Multiple Markov Boundaries
Statnikov, Alexander; Lytkin, Nikita I.; Lemeire, Jan; Aliferis, Constantin F.
2013-01-01
Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. PMID:25285052
Regularization Paths for Conditional Logistic Regression: The clogitL1 Package.
Reid, Stephen; Tibshirani, Rob
2014-07-01
We apply the cyclic coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso (ℓ1) and elastic net penalties. The sequential strong rules of Tibshirani, Bien, Hastie, Friedman, Taylor, Simon, and Tibshirani (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularization paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by.
Regularization Paths for Conditional Logistic Regression: The clogitL1 Package
Reid, Stephen; Tibshirani, Rob
2014-01-01
We apply the cyclic coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso (ℓ1) and elastic net penalties. The sequential strong rules of Tibshirani, Bien, Hastie, Friedman, Taylor, Simon, and Tibshirani (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularization paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by. PMID:26257587
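Since the cyclic coordinate descent with soft-thresholding that powers clogitL1 is easiest to see in the unconditional least-squares case, the following Python sketch shows that core update for a plain lasso; it is an illustration under assumed synthetic data, not the package's conditional-likelihood implementation.

# Hedged sketch: cyclic coordinate descent with soft-thresholding for a lasso on OLS.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, n * lam) / col_sq[j]
        # (a warm-started path would reuse beta for the next, smaller lambda)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.array([2.0, -1.5] + [0.0] * 8)
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=0.1), 2))

Warm starts along a decreasing lambda sequence, plus the sequential strong rules that skip coordinates unlikely to become active, are what make the packaged implementation fast.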
Wavelet neural networks: a practical guide.
Alexandridis, Antonios K; Zapranis, Achilleas D
2013-06-01
Wavelet networks (WNs) are a new class of networks which have been used with great success in a wide range of applications. However, a generally accepted framework for applying WNs is missing from the literature. In this study, we present a complete statistical model identification framework for applying WNs in various applications. The following subjects were thoroughly examined: the structure of a WN, training methods, initialization algorithms, variable significance and variable selection algorithms, model selection methods and, finally, methods to construct confidence and prediction intervals. In addition, the complexity of each algorithm is discussed. Our proposed framework was tested in two simulated cases, in one chaotic time series described by the Mackey-Glass equation and in three real datasets described by daily temperatures in Berlin, daily wind speeds in New York and breast cancer classification. Our results have shown that the proposed algorithms produce stable and robust results, indicating that our proposed framework can be applied in various applications. Copyright © 2013 Elsevier Ltd. All rights reserved.
Olivera, André Rodrigues; Roesler, Valter; Iochpe, Cirano; Schmidt, Maria Inês; Vigo, Álvaro; Barreto, Sandhi Maria; Duncan, Bruce Bartholow
2017-01-01
Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task. Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil. After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest. The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step. Most of the predictive models produced similar results, and demonstrated the feasibility of identifying individuals with the highest probability of having undiagnosed diabetes through easily obtained clinical data.
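Step (ii) above, a forward-selection wrapper with repeated cross-validation, can be sketched as follows; the learner, scoring metric, number of selected variables, and synthetic data are placeholders, not the study's actual settings.

# Hedged sketch: forward-selection wrapper with repeated stratified cross-validation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=500, n_features=27, n_informative=6, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),   # one of several candidate learners
    n_features_to_select=6,              # assumed target subset size
    direction="forward",
    scoring="roc_auc",
    cv=cv,
)
selector.fit(X, y)
print("selected variable indices:", selector.get_support(indices=True))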
Huang, Tao; Li, Xiao-yu; Xu, Meng-ling; Jin, Rui; Ku, Jing; Xu, Sen-miao; Wu, Zhen-zhong
2015-01-01
The quality of potato is directly related to its edible value and industrial value. Hollow heart of potato, a physiological disease occurring inside the tuber, is difficult to detect. This paper puts forward a non-destructive detection method using semi-transmission hyperspectral imaging with a support vector machine (SVM) to detect hollow heart of potato. Compared to reflection and transmission hyperspectral images, a semi-transmission hyperspectral image is clearer and contains the internal quality information of agricultural products. In this study, 224 potato samples (149 normal samples and 75 hollow samples) were selected as the research object, and a semi-transmission hyperspectral image acquisition system was constructed to acquire the hyperspectral images (390-1040 nm) of the potato samples; the average spectra of the regions of interest were then extracted for spectral characteristics analysis. Normalization was used to preprocess the original spectra, and a prediction model was developed based on SVM using all wavebands; the recognition accuracy on the test set was only 87.5%. In order to simplify the model, the competitive adaptive reweighted sampling algorithm (CARS) and the successive projections algorithm (SPA) were utilized to select important variables from all 520 spectral variables, and 8 variables were selected (454, 601, 639, 664, 748, 827, 874 and 936 nm). A recognition accuracy of 94.64% on the test set was obtained by using the 8 variables to develop the SVM model. Parameter optimization algorithms, including the artificial fish swarm algorithm (AFSA), the genetic algorithm (GA) and the grid search algorithm, were used to optimize the SVM model parameters: penalty parameter c and kernel parameter g. After comparative analysis, AFSA, a new bionic optimization algorithm based on the foraging behavior of fish swarms, was shown to give the optimal model parameters (c = 10.6591, g = 0.3497), and a recognition accuracy of 100% was obtained for the AFSA-SVM model. The results indicate that combining semi-transmission hyperspectral imaging technology with CARS-SPA and AFSA-SVM can accurately detect hollow heart of potato, and also provide technical support for rapid non-destructive detection of hollow heart of potato.
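The grid-search variant of the parameter optimization step can be sketched as below, tuning the SVM penalty parameter c (C) and RBF kernel parameter g (gamma) on a handful of selected wavelengths; the data are synthetic stand-ins, and AFSA/GA themselves are not reproduced.

# Hedged sketch: grid search over SVM penalty C and RBF gamma on selected wavelengths.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(224, 8))                    # 224 samples x 8 selected wavelengths (synthetic)
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)    # stand-in for normal vs hollow labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
grid = {"C": np.logspace(-2, 3, 12), "gamma": np.logspace(-3, 1, 9)}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X_tr, y_tr)

print("best (C, gamma):", search.best_params_)
print("test-set recognition rate:", search.score(X_te, y_te))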
Random forest feature selection approach for image segmentation
NASA Astrophysics Data System (ADS)
Lefkovits, László; Lefkovits, Szidónia; Emerich, Simina; Vaida, Mircea Florin
2017-03-01
In the field of image segmentation, discriminative models have shown promising performance. Generally, every such model begins with the extraction of numerous features from annotated images. Most authors create their discriminative model by using many features without any selection criteria. A more reliable model can be built by using a framework that selects the variables that are important from the point of view of the classification and eliminates the unimportant ones. In this article we present a framework for feature selection and data dimensionality reduction. The methodology is built around the random forest (RF) algorithm and its variable importance evaluation. In order to deal with datasets so large as to be practically unmanageable, we propose an algorithm based on RF that reduces the dimension of the database by eliminating irrelevant features. Furthermore, this framework is applied to optimize our discriminative model for brain tumor segmentation.
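A minimal version of importance-based reduction with a random forest might look like the sketch below; the 20% retention threshold and the synthetic data are assumptions, and the authors' full framework (iterative elimination, segmentation-specific features) is not reproduced.

# Hedged sketch: keep only the features a random forest ranks as most important.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=50, n_informative=8, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X, y)

importances = rf.feature_importances_
order = np.argsort(importances)[::-1]       # features ranked by importance
keep = order[: int(0.2 * X.shape[1])]       # retain the top 20% (assumed threshold)
X_reduced = X[:, keep]
print("kept feature indices:", np.sort(keep))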
A Selective Review of Group Selection in High-Dimensional Models
Huang, Jian; Breheny, Patrick; Ma, Shuangge
2013-01-01
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study. PMID:24174707
Data-driven process decomposition and robust online distributed modelling for large-scale processes
NASA Astrophysics Data System (ADS)
Shu, Zhang; Lijuan, Li; Lijuan, Yao; Shipin, Yang; Tao, Zou
2018-02-01
With the increasing attention paid to networked control, system decomposition and distributed models are of significant importance in the implementation of model-based control strategies. In this paper, a data-driven system decomposition and online distributed subsystem modelling algorithm is proposed for large-scale chemical processes. The key controlled variables are first partitioned into several clusters by the affinity propagation clustering algorithm. Each cluster can be regarded as a subsystem. Then the inputs of each subsystem are selected by offline canonical correlation analysis between all process variables and the subsystem's controlled variables. Process decomposition is thus realised after the screening of input and output variables. Once the system decomposition is finished, online subsystem modelling can be carried out by recursively renewing the samples block-wise. The proposed algorithm was applied to the Tennessee Eastman process and its validity was verified.
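The first step of that decomposition, clustering controlled variables with affinity propagation, can be sketched as follows; using the absolute correlation between variable trajectories as the similarity is our assumption for illustration, not necessarily the paper's exact affinity.

# Hedged sketch: group controlled variables into subsystems with affinity propagation.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
T, n_cv = 500, 12                        # time steps x controlled variables (synthetic)
base = rng.normal(size=(T, 3))
Y = np.column_stack([base[:, i % 3] + 0.1 * rng.normal(size=T) for i in range(n_cv)])

similarity = np.abs(np.corrcoef(Y, rowvar=False))   # variable-by-variable affinity (assumed)
ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(similarity)

for k in np.unique(labels):
    print(f"subsystem {k}: controlled variables {np.where(labels == k)[0].tolist()}")

Each resulting group would then get its own input set via canonical correlation analysis, as described above.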
Tighe, Patrick J.; Harle, Christopher A.; Hurley, Robert W.; Aytug, Haldun; Boezaart, Andre P.; Fillingim, Roger B.
2015-01-01
Background Given their ability to process highly dimensional datasets with hundreds of variables, machine learning algorithms may offer one solution to the vexing challenge of predicting postoperative pain. Methods Here, we report on the application of machine learning algorithms to predict postoperative pain outcomes in a retrospective cohort of 8071 surgical patients using 796 clinical variables. Five algorithms were compared in terms of their ability to forecast moderate to severe postoperative pain: Least Absolute Shrinkage and Selection Operator (LASSO), gradient-boosted decision tree, support vector machine, neural network, and k-nearest neighbor, with logistic regression included for baseline comparison. Results In forecasting moderate to severe postoperative pain for postoperative day (POD) 1, the LASSO algorithm, using all 796 variables, had the highest accuracy with an area under the receiver-operating curve (ROC) of 0.704. Next, the gradient-boosted decision tree had an ROC of 0.665 and the k-nearest neighbor algorithm had an ROC of 0.643. For POD 3, the LASSO algorithm, using all variables, again had the highest accuracy, with an ROC of 0.727. Logistic regression had a lower ROC of 0.5 for predicting pain outcomes on POD 1 and 3. Conclusions Machine learning algorithms, when combined with complex and heterogeneous data from electronic medical record systems, can forecast acute postoperative pain outcomes with accuracies similar to methods that rely only on variables specifically collected for pain outcome prediction. PMID:26031220
A bootstrap based Neyman-Pearson test for identifying variable importance.
Ditzler, Gregory; Polikar, Robi; Rosen, Gail
2015-04-01
Selection of the most informative features that lead to a small loss on future data is arguably one of the most important steps in classification, data analysis and model selection. Several feature selection (FS) algorithms are available; however, due to noise present in any data set, FS algorithms are typically accompanied by an appropriate cross-validation scheme. In this brief, we propose a statistical hypothesis test derived from the Neyman-Pearson lemma for determining if a feature is statistically relevant. The proposed approach can be applied as a wrapper to any FS algorithm, regardless of the FS criteria used by that algorithm, to determine whether a feature belongs in the relevant set. Perhaps more importantly, this procedure efficiently determines the number of relevant features given an initial starting point. We provide freely available software implementations of the proposed methodology.
Bayesian Group Bridge for Bi-level Variable Selection.
Mallick, Himel; Yi, Nengjun
2017-06-01
A Bayesian bi-level variable selection method (BAGB: Bayesian Analysis of Group Bridge) is developed for regularized regression and classification. This new development is motivated by grouped data, where generic variables can be divided into multiple groups, with variables in the same group being mechanistically related or statistically correlated. As an alternative to frequentist group variable selection methods, BAGB incorporates structural information among predictors through a group-wise shrinkage prior. Posterior computation proceeds via an efficient MCMC algorithm. In addition to the usual ease-of-interpretation of hierarchical linear models, the Bayesian formulation produces valid standard errors, a feature that is notably absent in the frequentist framework. Empirical evidence of the attractiveness of the method is illustrated by extensive Monte Carlo simulations and real data analysis. Finally, several extensions of this new approach are presented, providing a unified framework for bi-level variable selection in general models with flexible penalties.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tumuluru, Jaya Shankar; McCulloch, Richard Chet James
In this work a new hybrid genetic algorithm was developed which combines a rudimentary adaptive steepest ascent hill climbing algorithm with a sophisticated evolutionary algorithm in order to optimize complex multivariate design problems. By combining a highly stochastic algorithm (evolutionary) with a simple deterministic optimization algorithm (adaptive steepest ascent), computational resources are conserved and the solution converges rapidly when compared to either algorithm alone. In genetic algorithms, natural selection is mimicked by random events such as breeding and mutation. In the adaptive steepest ascent algorithm, each variable is perturbed by a small amount and the variable that caused the most improvement is incremented by a small step. If the direction of most benefit is exactly opposite the previous direction of most benefit, then the step size is reduced by a factor of 2; thus the step size adapts to the terrain. A graphical user interface was created in MATLAB to provide an interface between the hybrid genetic algorithm and the user. Additional features such as bounding the solution space and weighting the objective functions individually are also built into the interface. The algorithm developed was tested to optimize the functions developed for a wood pelleting process. Using process variables (such as feedstock moisture content, die speed, and preheating temperature), pellet properties were appropriately optimized. Specifically, variables were found which maximized unit density, bulk density, tapped density, and durability while minimizing pellet moisture content and specific energy consumption. The time and computational resources required for the optimization were dramatically decreased using the hybrid genetic algorithm when compared to MATLAB's native evolutionary optimization tool.
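The adaptive steepest-ascent component described above can be sketched as follows in Python (the hybrid GA and MATLAB interface are not reproduced); the objective, initial step size, and stopping rule are illustrative assumptions.

# Hedged sketch: coordinate-wise adaptive steepest-ascent hill climbing.
import numpy as np

def adaptive_steepest_ascent(f, x0, step=0.5, n_iter=200, min_step=1e-6):
    x = np.asarray(x0, dtype=float)
    last_dir = None
    for _ in range(n_iter):
        best_gain, best_dir = 0.0, None
        for j in range(len(x)):
            for sign in (+1.0, -1.0):
                trial = x.copy(); trial[j] += sign * step    # perturb one variable
                gain = f(trial) - f(x)
                if gain > best_gain:
                    best_gain, best_dir = gain, (j, sign)
        if best_dir is None:                                 # no improving coordinate
            step /= 2.0
        else:
            if last_dir is not None and best_dir[0] == last_dir[0] and best_dir[1] == -last_dir[1]:
                step /= 2.0                                  # direction reversed: halve the step
            x[best_dir[0]] += best_dir[1] * step
            last_dir = best_dir
        if step < min_step:
            break
    return x

# maximize a simple concave objective (stand-in for the pellet-quality objective)
print(adaptive_steepest_ascent(lambda v: -np.sum((v - 3.0) ** 2), x0=[0.0, 0.0]))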
Concave 1-norm group selection
Jiang, Dingfeng; Huang, Jian
2015-01-01
Grouping structures arise naturally in many high-dimensional problems. Incorporation of such information can improve model fitting and variable selection. Existing group selection methods, such as the group Lasso, require correct membership. However, in practice it can be difficult to correctly specify group membership of all variables. Thus, it is important to develop group selection methods that are robust against group mis-specification. Also, it is desirable to select groups as well as individual variables in many applications. We propose a class of concave 1-norm group penalties that is robust to grouping structure and can perform bi-level selection. A coordinate descent algorithm is developed to calculate solutions of the proposed group selection method. Theoretical convergence of the algorithm is proved under certain regularity conditions. Comparison with other methods suggests the proposed method is the most robust approach under membership mis-specification. Simulation studies and real data application indicate that the 1-norm concave group selection approach achieves better control of false discovery rates. An R package grppenalty implementing the proposed method is available at CRAN. PMID:25417206
Gomes, Adriano de Araújo; Alcaraz, Mirta Raquel; Goicoechea, Hector C; Araújo, Mario Cesar U
2014-02-06
In this work, the Successive Projections Algorithm is presented for interval selection in N-PLS for three-way data modeling. The proposed algorithm combines the noise-reduction properties of PLS with the possibility of discarding uninformative variables in SPA. In addition, the second-order advantage can be achieved by the residual bilinearization (RBL) procedure when an unexpected constituent is present in a test sample. For this purpose, SPA was modified in order to select intervals for use in trilinear PLS. The ability of the proposed algorithm, namely iSPA-N-PLS, was evaluated on one simulated and two experimental data sets, comparing the results to those obtained by N-PLS. In the simulated system, two analytes were quantitated in two test sets, with and without an unexpected constituent. In the first experimental system, the determination of four fluorophores (l-phenylalanine; l-3,4-dihydroxyphenylalanine; 1,4-dihydroxybenzene and l-tryptophan) was conducted with excitation-emission data matrices. In the second experimental system, quantitation of ofloxacin was performed in water samples containing two other uncalibrated quinolones (ciprofloxacin and danofloxacin) by high performance liquid chromatography with a UV-vis diode array detector. For comparison purposes, a GA coupled with N-PLS/RBL was also used in this work. In most of the studied cases iSPA-N-PLS proved to be a promising tool for the selection of variables in second-order calibration, generating models with smaller RMSEP when compared to both the global model using all of the sensors in two dimensions and GA-NPLS/RBL. Copyright © 2013 Elsevier B.V. All rights reserved.
Rahman, Anisur; Faqeerzada, Mohammad A; Cho, Byoung-Kwan
2018-03-14
Allicin and soluble solids content (SSC) in garlic are responsible for its pungent flavor and odor. However, current conventional methods such as high-pressure liquid chromatography and refractometry have critical drawbacks in that they are time-consuming, labor-intensive and destructive procedures. The present study aimed to predict allicin and SSC in garlic using hyperspectral imaging in combination with variable selection algorithms and calibration models. Hyperspectral images of 100 garlic cloves were acquired covering two spectral ranges, from which the mean spectra of each clove were extracted. The calibration models included partial least squares (PLS) and least squares-support vector machine (LS-SVM) regression, as well as different spectral pre-processing techniques, from which the highest performing spectral preprocessing technique and spectral range were selected. Then, variable selection methods, such as regression coefficients, variable importance in projection (VIP) and the successive projections algorithm (SPA), were evaluated for the selection of effective wavelengths (EWs). Furthermore, PLS and LS-SVM regression methods were applied to quantitatively predict the quality attributes of garlic using the selected EWs. Of the established models, the SPA-LS-SVM model obtained an Rpred2 of 0.90 and a standard error of prediction (SEP) of 1.01% for SSC prediction, whereas the VIP-LS-SVM model produced the best result, with an Rpred2 of 0.83 and an SEP of 0.19 mg g-1, for allicin prediction in the range 1000-1700 nm. Furthermore, chemical images of garlic were developed using the best predictive model to facilitate visualization of the spatial distributions of allicin and SSC. The present study clearly demonstrates that hyperspectral imaging combined with an appropriate chemometrics method can potentially be employed as a fast, non-invasive method to predict the allicin and SSC in garlic. © 2018 Society of Chemical Industry.
NASA Astrophysics Data System (ADS)
Attia, Khalid A. M.; Nassar, Mohammed W. I.; El-Zeiny, Mohamed B.; Serag, Ahmed
2016-03-01
Different chemometric models were applied for the quantitative analysis of amoxicillin (AMX) and flucloxacillin (FLX) in their binary mixtures, namely partial least squares (PLS), spectral residual augmented classical least squares (SRACLS), concentration residual augmented classical least squares (CRACLS) and artificial neural networks (ANNs). All methods were applied with and without a variable selection procedure (genetic algorithm, GA). The methods were used for the quantitative analysis of the drugs in laboratory-prepared mixtures and a real market sample by handling the UV spectral data. Robust and simpler models were obtained by applying the GA. The proposed methods were found to be rapid, simple and to require no preliminary separation steps.
Genetic Algorithms Applied to Multi-Objective Aerodynamic Shape Optimization
NASA Technical Reports Server (NTRS)
Holst, Terry L.
2004-01-01
A genetic algorithm approach suitable for solving multi-objective optimization problems is described and evaluated using a series of aerodynamic shape optimization problems. Several new features including two variations of a binning selection algorithm and a gene-space transformation procedure are included. The genetic algorithm is suitable for finding pareto optimal solutions in search spaces that are defined by any number of genes and that contain any number of local extrema. A new masking array capability is included allowing any gene or gene subset to be eliminated as decision variables from the design space. This allows determination of the effect of a single gene or gene subset on the pareto optimal solution. Results indicate that the genetic algorithm optimization approach is flexible in application and reliable. The binning selection algorithms generally provide pareto front quality enhancements and moderate convergence efficiency improvements for most of the problems solved.
Genetic Algorithms Applied to Multi-Objective Aerodynamic Shape Optimization
NASA Technical Reports Server (NTRS)
Holst, Terry L.
2005-01-01
A genetic algorithm approach suitable for solving multi-objective problems is described and evaluated using a series of aerodynamic shape optimization problems. Several new features including two variations of a binning selection algorithm and a gene-space transformation procedure are included. The genetic algorithm is suitable for finding Pareto optimal solutions in search spaces that are defined by any number of genes and that contain any number of local extrema. A new masking array capability is included allowing any gene or gene subset to be eliminated as decision variables from the design space. This allows determination of the effect of a single gene or gene subset on the Pareto optimal solution. Results indicate that the genetic algorithm optimization approach is flexible in application and reliable. The binning selection algorithms generally provide Pareto front quality enhancements and moderate convergence efficiency improvements for most of the problems solved.
Applications of information theory, genetic algorithms, and neural models to predict oil flow
NASA Astrophysics Data System (ADS)
Ludwig, Oswaldo; Nunes, Urbano; Araújo, Rui; Schnitman, Leizer; Lepikson, Herman Augusto
2009-07-01
This work introduces a new information-theoretic methodology for choosing variables and their time lags in a prediction setting, particularly when neural networks are used in non-linear modeling. The first contribution of this work is the Cross Entropy Function (XEF) proposed to select input variables and their lags in order to compose the input vector of black-box prediction models. The proposed XEF method is more appropriate than the usually applied Cross Correlation Function (XCF) when the relationship among the input and output signals comes from a non-linear dynamic system. The second contribution is a method that minimizes the Joint Conditional Entropy (JCE) between the input and output variables by means of a Genetic Algorithm (GA). The aim is to take into account the dependence among the input variables when selecting the most appropriate set of inputs for a prediction problem. In short, these methods can be used to assist the selection of input training data that have the necessary information to predict the target data. The proposed methods are applied to a petroleum engineering problem: predicting oil production. Experimental results obtained with a real-world dataset are presented, demonstrating the feasibility and effectiveness of the method.
Guo, Pi; Zeng, Fangfang; Hu, Xiaomin; Zhang, Dingmei; Zhu, Shuming; Deng, Yu; Hao, Yuantao
2015-01-01
Objectives In epidemiological studies, it is important to identify independent associations between collective exposures and a health outcome. The current stepwise selection technique ignores stochastic errors and suffers from a lack of stability. The alternative LASSO-penalized regression model can be applied to detect significant predictors from a pool of candidate variables. However, this technique is prone to false positives and tends to create excessive biases. It remains challenging to develop robust variable selection methods and enhance predictability. Material and methods Two improved algorithms denoted the two-stage hybrid and bootstrap ranking procedures, both using a LASSO-type penalty, were developed for epidemiological association analysis. The performance of the proposed procedures and other methods including conventional LASSO, Bolasso, stepwise and stability selection models were evaluated using intensive simulation. In addition, methods were compared by using an empirical analysis based on large-scale survey data of hepatitis B infection-relevant factors among Guangdong residents. Results The proposed procedures produced comparable or less biased selection results when compared to conventional variable selection models. In total, the two newly proposed procedures were stable with respect to various scenarios of simulation, demonstrating a higher power and a lower false positive rate during variable selection than the compared methods. In empirical analysis, the proposed procedures yielding a sparse set of hepatitis B infection-relevant factors gave the best predictive performance and showed that the procedures were able to select a more stringent set of factors. The individual history of hepatitis B vaccination, family and individual history of hepatitis B infection were associated with hepatitis B infection in the studied residents according to the proposed procedures. Conclusions The newly proposed procedures improve the identification of significant variables and enable us to derive a new insight into epidemiological association analysis. PMID:26214802
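A simplified flavour of the bootstrap ranking idea, refitting a cross-validated LASSO on bootstrap resamples and keeping variables that are selected in most resamples, is sketched below; the resample count, the 0.8 frequency threshold, and the synthetic data are assumptions, not the procedures' tuned settings.

# Hedged sketch: bootstrap selection frequencies from a cross-validated lasso.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)

n_boot = 50
counts = np.zeros(p)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)                 # bootstrap resample
    model = LassoCV(cv=5, random_state=0).fit(X[idx], y[idx])
    counts += (np.abs(model.coef_) > 1e-8)           # which variables entered the model

selection_freq = counts / n_boot
print("stable variables:", np.where(selection_freq >= 0.8)[0].tolist())

Requiring a variable to survive many resamples is what gives the procedure its stability relative to a single lasso fit.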
Multidisciplinary Optimization of a Transport Aircraft Wing using Particle Swarm Optimization
NASA Technical Reports Server (NTRS)
Sobieszczanski-Sobieski, Jaroslaw; Venter, Gerhard
2002-01-01
The purpose of this paper is to demonstrate the application of particle swarm optimization to a realistic multidisciplinary optimization test problem. The paper's new contributions to multidisciplinary optimization are the application of a new algorithm for dealing with the unique challenges associated with multidisciplinary optimization problems, and recommendations as to the utility of the algorithm in future multidisciplinary optimization applications. The selected example is a bi-level optimization problem that demonstrates severe numerical noise and has a combination of continuous and truly discrete design variables. The use of traditional gradient-based optimization algorithms is thus not practical. The numerical results presented indicate that the particle swarm optimization algorithm is able to reliably find the optimum design for the problem presented here. The algorithm is capable of dealing with the unique challenges posed by multidisciplinary optimization as well as the numerical noise and truly discrete variables present in the current example problem.
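For reference, a basic global-best particle swarm optimizer on a smooth test function is sketched below; the wing problem, discrete variables, and noise handling discussed in the paper are not reproduced, and all constants are conventional illustrative choices.

# Hedged sketch: global-best particle swarm optimization of a continuous function.
import numpy as np

def pso(f, dim, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0)):
    rng = np.random.default_rng(0)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))      # positions
    v = np.zeros_like(x)                                  # velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, f(gbest)

# minimize the sphere function as a stand-in objective
print(pso(lambda p: float(np.sum(p ** 2)), dim=4))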
NASA Astrophysics Data System (ADS)
Seo, Seung Beom; Kim, Young-Oh; Kim, Youngil; Eum, Hyung-Il
2018-04-01
When selecting a subset of climate change scenarios (GCM models), the priority is to ensure that the subset reflects the comprehensive range of possible model results for all variables concerned. Though many studies have attempted to improve the scenario selection, there is a lack of studies that discuss methods to ensure that the results from a subset of climate models contain the same range of uncertainty in hydrologic variables as when all models are considered. We applied the Katsavounidis-Kuo-Zhang (KKZ) algorithm to select a subset of climate change scenarios and demonstrated its ability to reduce the number of GCM models in an ensemble, while the ranges of multiple climate extremes indices were preserved. First, we analyzed the role of 27 ETCCDI climate extremes indices for scenario selection and selected the representative climate extreme indices. Before the selection of a subset, we excluded a few deficient GCM models that could not represent the observed climate regime. Subsequently, we discovered that a subset of GCM models selected by the KKZ algorithm with the representative climate extreme indices could not capture the full potential range of changes in hydrologic extremes (e.g., 3-day peak flow and 7-day low flow) in some regional case studies. However, the application of the KKZ algorithm with a different set of climate indices, which are correlated to the hydrologic extremes, enabled the overcoming of this limitation. Key climate indices, dependent on the hydrologic extremes to be projected, must therefore be determined prior to the selection of a subset of GCM models.
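A KKZ-style farthest-point selection over GCMs represented as vectors of climate indices might look like the sketch below; starting from the model with the largest norm and then repeatedly adding the model farthest from the current subset follows the usual KKZ recipe, while the index set, standardization, and subset size are assumptions.

# Hedged sketch: KKZ-style farthest-point subset selection of GCMs.
import numpy as np

def kkz_select(M, k):
    """M: (n_models, n_indices) matrix of standardized climate indices."""
    M = np.asarray(M, dtype=float)
    selected = [int(np.argmax(np.linalg.norm(M, axis=1)))]   # first pick: largest norm
    while len(selected) < k:
        # distance of every model to its nearest already-selected model
        d = np.min(np.linalg.norm(M[:, None, :] - M[None, selected, :], axis=2), axis=1)
        d[selected] = -np.inf                                # never re-pick a model
        selected.append(int(np.argmax(d)))                   # add the farthest model
    return selected

rng = np.random.default_rng(0)
indices = rng.normal(size=(27, 5))          # 27 GCMs x 5 representative indices (synthetic)
indices = (indices - indices.mean(0)) / indices.std(0)
print("selected GCM subset:", kkz_select(indices, k=6))

Because the subset spreads out over the index space, its spread in any quantity correlated with those indices (here, the hydrologic extremes) tends to be preserved, which is the point made above about choosing indices tied to the target hydrologic variables.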
Ning, Jing; Chen, Yong; Piao, Jin
2017-07-01
Publication bias occurs when the published research results are systematically unrepresentative of the population of studies that have been conducted, and is a potential threat to meaningful meta-analysis. The Copas selection model provides a flexible framework for correcting estimates and offers considerable insight into the publication bias. However, maximizing the observed likelihood under the Copas selection model is challenging because the observed data contain very little information on the latent variable. In this article, we study a Copas-like selection model and propose an expectation-maximization (EM) algorithm for estimation based on the full likelihood. Empirical simulation studies show that the EM algorithm and its associated inferential procedure performs well and avoids the non-convergence problem when maximizing the observed likelihood. © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Hao, Yong; Sun, Xu-Dong; Yang, Qiang
2012-12-01
A variable selection strategy combined with local linear embedding (LLE) was introduced for the analysis of complex samples by near infrared spectroscopy (NIRS). Three methods, namely Monte Carlo uninformative variable elimination (MCUVE), the successive projections algorithm (SPA) and MCUVE combined with SPA, were used for eliminating redundant spectral variables. Partial least squares regression (PLSR) and LLE-PLSR were used for modeling the complex samples. The results show that MCUVE can both extract effective informative variables and improve the precision of the models. Compared with PLSR models, LLE-PLSR models achieve more accurate analysis results. MCUVE combined with LLE-PLSR is an effective modeling method for NIRS quantitative analysis.
Deng, Bai-chuan; Yun, Yong-huan; Liang, Yi-zeng; Yi, Lun-zhao
2014-10-07
In this study, a new optimization algorithm called the Variable Iterative Space Shrinkage Approach (VISSA) that is based on the idea of model population analysis (MPA) is proposed for variable selection. Unlike most of the existing optimization methods for variable selection, VISSA statistically evaluates the performance of variable space in each step of optimization. Weighted binary matrix sampling (WBMS) is proposed to generate sub-models that span the variable subspace. Two rules are highlighted during the optimization procedure. First, the variable space shrinks in each step. Second, the new variable space outperforms the previous one. The second rule, which is rarely satisfied in most of the existing methods, is the core of the VISSA strategy. Compared with some promising variable selection methods such as competitive adaptive reweighted sampling (CARS), Monte Carlo uninformative variable elimination (MCUVE) and iteratively retaining informative variables (IRIV), VISSA showed better prediction ability for the calibration of NIR data. In addition, VISSA is user-friendly; only a few insensitive parameters are needed, and the program terminates automatically without any additional conditions. The Matlab codes for implementing VISSA are freely available on the website: https://sourceforge.net/projects/multivariateanalysis/files/VISSA/.
Efficient least angle regression for identification of linear-in-the-parameters models
Beach, Thomas H.; Rezgui, Yacine
2017-01-01
Least angle regression, as a promising model selection method, differentiates itself from conventional stepwise and stagewise methods, in that it is neither too greedy nor too slow. It is closely related to L1 norm optimization, which has the advantage of low prediction variance through sacrificing part of model bias property in order to enhance model generalization capability. In this paper, we propose an efficient least angle regression algorithm for model selection for a large class of linear-in-the-parameters models with the purpose of accelerating the model selection process. The entire algorithm works completely in a recursive manner, where the correlations between model terms and residuals, the evolving directions and other pertinent variables are derived explicitly and updated successively at every subset selection step. The model coefficients are only computed when the algorithm finishes. The direct involvement of matrix inversions is thereby relieved. A detailed computational complexity analysis indicates that the proposed algorithm possesses significant computational efficiency, compared with the original approach where the well-known efficient Cholesky decomposition is involved in solving least angle regression. Three artificial and real-world examples are employed to demonstrate the effectiveness, efficiency and numerical stability of the proposed algorithm. PMID:28293140
Tra, Viet; Kim, Jaeyoung; Kim, Jong-Myon
2017-01-01
This paper presents a novel method for diagnosing incipient bearing defects under variable operating speeds using convolutional neural networks (CNNs) trained via the stochastic diagonal Levenberg-Marquardt (S-DLM) algorithm. The CNNs utilize the spectral energy maps (SEMs) of the acoustic emission (AE) signals as inputs and automatically learn the optimal features, which yield the best discriminative models for diagnosing incipient bearing defects under variable operating speeds. The SEMs are two-dimensional maps that show the distribution of energy across different bands of the AE spectrum. It is hypothesized that the variation of a bearing's speed would not alter the overall shape of the AE spectrum; rather, it may only scale and translate it. Thus, at different speeds, the same defect would yield SEMs that are scaled and shifted versions of each other. This hypothesis is confirmed by the experimental results, where CNNs trained using the S-DLM algorithm yield significantly better diagnostic performance under variable operating speeds compared to existing methods. In this work, the performance of different training algorithms is also evaluated to select the best training algorithm for the CNNs. The proposed method is used to diagnose both single and compound defects at six different operating speeds. PMID:29211025
Shouval, Roni; Labopin, Myriam; Unger, Ron; Giebel, Sebastian; Ciceri, Fabio; Schmid, Christoph; Esteve, Jordi; Baron, Frederic; Gorin, Norbert Claude; Savani, Bipin; Shimoni, Avichai; Mohty, Mohamad; Nagler, Arnon
2016-01-01
Models for prediction of allogeneic hematopoietic stem cell transplantation (HSCT) related mortality partially account for transplant risk. Improving predictive accuracy requires understanding of prediction-limiting factors, such as the statistical methodology used, the number and quality of features collected, or simply the population size. Using an in-silico approach (i.e., iterative computerized simulations) based on machine learning (ML) algorithms, we set out to analyze these factors. A cohort of 25,923 adult acute leukemia patients from the European Society for Blood and Marrow Transplantation (EBMT) registry was analyzed. The predictive objective was non-relapse mortality (NRM) 100 days following HSCT. Thousands of prediction models were developed under varying conditions: increasing sample size, specific subpopulations and an increasing number of variables, which were selected and ranked by separate feature selection algorithms. Depending on the algorithm, predictive performance plateaued at a population size of 6,611-8,814 patients, reaching a maximal area under the receiver operating characteristic curve (AUC) of 0.67. AUCs of models developed on specific subpopulations ranged from 0.59 to 0.67 for patients in second complete remission and receiving reduced intensity conditioning, respectively. Only 3-5 variables were necessary to achieve near-maximal AUCs. The top 3 ranking variables, shared by all algorithms, were disease stage, donor type, and conditioning regimen. Our findings empirically demonstrate that, with regard to NRM prediction, few variables "carry the weight" and traditional HSCT data has been "worn out". "Breaking through" the predictive boundaries will likely require additional types of inputs.
Zhu, Hongyan; Chu, Bingquan; Fan, Yangyang; Tao, Xiaoya; Yin, Wenxin; He, Yong
2017-08-10
We investigated the feasibility and potentiality of determining firmness, soluble solids content (SSC), and pH in kiwifruits using hyperspectral imaging, combined with variable selection methods and calibration models. The images were acquired by a push-broom hyperspectral reflectance imaging system covering two spectral ranges. Weighted regression coefficients (BW), successive projections algorithm (SPA) and genetic algorithm-partial least square (GAPLS) were compared and evaluated for the selection of effective wavelengths. Moreover, multiple linear regression (MLR), partial least squares regression and least squares support vector machine (LS-SVM) were developed to predict quality attributes quantitatively using effective wavelengths. The established models, particularly SPA-MLR, SPA-LS-SVM and GAPLS-LS-SVM, performed well. The SPA-MLR models for firmness (R pre = 0.9812, RPD = 5.17) and SSC (R pre = 0.9523, RPD = 3.26) at 380-1023 nm showed excellent performance, whereas GAPLS-LS-SVM was the optimal model at 874-1734 nm for predicting pH (R pre = 0.9070, RPD = 2.60). Image processing algorithms were developed to transfer the predictive model in every pixel to generate prediction maps that visualize the spatial distribution of firmness and SSC. Hence, the results clearly demonstrated that hyperspectral imaging has the potential as a fast and non-invasive method to predict the quality attributes of kiwifruits.
Geng, Zhigeng; Wang, Sijian; Yu, Menggang; Monahan, Patrick O.; Champion, Victoria; Wahba, Grace
2017-01-01
Summary In many scientific and engineering applications, covariates are naturally grouped. When group structures are available among covariates, people are usually interested in identifying both important groups and important variables within the selected groups. Among existing successful group variable selection methods, some fail to conduct within-group selection. Others are able to conduct both group and within-group selection, but the corresponding objective functions are non-convex. Such non-convexity may require extra numerical effort. In this article, we propose a novel Log-Exp-Sum (LES) penalty for group variable selection. The LES penalty is strictly convex. It can identify important groups as well as select important variables within the group. We develop an efficient group-level coordinate descent algorithm to fit the model. We also derive non-asymptotic error bounds and asymptotic group selection consistency for our method in the high-dimensional setting where the number of covariates can be much larger than the sample size. Numerical results demonstrate the good performance of our method in both variable selection and prediction. We applied the proposed method to an American Cancer Society breast cancer survivor dataset. The findings are clinically meaningful and may help design intervention programs to improve the quality of life of breast cancer survivors. PMID:25257196
NASA Astrophysics Data System (ADS)
Hu, Chia-Chang; Lin, Hsuan-Yu; Chen, Yu-Fan; Wen, Jyh-Horng
2006-12-01
An adaptive minimum mean-square error (MMSE) array receiver based on the fuzzy-logic recursive least-squares (RLS) algorithm is developed for asynchronous DS-CDMA interference suppression in the presence of frequency-selective multipath fading. This receiver employs a fuzzy-logic control mechanism to perform a nonlinear mapping of the squared error and the squared-error variation into a forgetting factor. For real-time applicability, a computationally efficient version of the proposed receiver is derived based on the least-mean-square (LMS) algorithm using a fuzzy-inference-controlled step size. This receiver is capable of providing both fast convergence/tracking capability and small steady-state misadjustment as compared with conventional LMS- and RLS-based MMSE DS-CDMA receivers. Simulations show that the fuzzy-logic LMS and RLS algorithms outperform, respectively, other variable step-size LMS (VSS-LMS) and variable forgetting factor RLS (VFF-RLS) algorithms by at least 3 dB and 1.5 dB in bit-error-rate (BER) for multipath fading channels.
She, Ji; Wang, Fei; Zhou, Jianjiang
2016-01-01
Radar networks are proven to have numerous advantages over traditional monostatic and bistatic radar. With recent developments, radar networks have become an attractive platform due to their low probability of intercept (LPI) performance for target tracking. In this paper, a joint sensor selection and power allocation algorithm for multiple-target tracking in a radar network based on LPI is proposed. It is found that this algorithm can minimize the total transmitted power of a radar network on the basis of a predetermined mutual information (MI) threshold between the target impulse response and the reflected signal. The MI is required by the radar network system to estimate target parameters, and it can be calculated predictively with the estimation of target state. The optimization problem of sensor selection and power allocation, which contains two variables, is non-convex and it can be solved by separating power allocation problem from sensor selection problem. To be specific, the optimization problem of power allocation can be solved by using the bisection method for each sensor selection scheme. Also, the optimization problem of sensor selection can be solved by a lower complexity algorithm based on the allocated powers. According to the simulation results, it can be found that the proposed algorithm can effectively reduce the total transmitted power of a radar network, which can be conducive to improving LPI performance. PMID:28009819
An investigation of messy genetic algorithms
NASA Technical Reports Server (NTRS)
Goldberg, David E.; Deb, Kalyanmoy; Korb, Bradley
1990-01-01
Genetic algorithms (GAs) are search procedures based on the mechanics of natural selection and natural genetics. They combine the use of string codings or artificial chromosomes and populations with the selective and juxtapositional power of reproduction and recombination to motivate a surprisingly powerful search heuristic in many problems. Despite their empirical success, there has been a long-standing objection to the use of GAs in arbitrarily difficult problems, and a new approach was therefore developed. Results on a 30-bit, order-three deception problem were obtained using a new type of genetic algorithm called a messy genetic algorithm (mGA). Messy genetic algorithms combine the use of variable-length strings, a two-phase selection scheme, and messy genetic operators to effect a solution to the fixed-coding problem of standard simple GAs. The results of the study of mGAs in problems with nonuniform subfunction scale and size are presented. The mGA approach is summarized, covering both its operation and the theory of its use. Experiments on problems of varying scale, varying building-block size, and combined varying scale and size are presented.
VARIABLE SELECTION FOR REGRESSION MODELS WITH MISSING DATA
Garcia, Ramon I.; Ibrahim, Joseph G.; Zhu, Hongtu
2009-01-01
We consider the variable selection problem for a class of statistical models with missing data, including missing covariate and/or response data. We investigate the smoothly clipped absolute deviation penalty (SCAD) and adaptive LASSO and propose a unified model selection and estimation procedure for use in the presence of missing data. We develop a computationally attractive algorithm for simultaneously optimizing the penalized likelihood function and estimating the penalty parameters. Particularly, we propose to use a model selection criterion, called the ICQ statistic, for selecting the penalty parameters. We show that the variable selection procedure based on ICQ automatically and consistently selects the important covariates and leads to efficient estimates with oracle properties. The methodology is very general and can be applied to numerous situations involving missing data, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Simulations are given to demonstrate the methodology and examine the finite sample performance of the variable selection procedures. Melanoma data from a cancer clinical trial is presented to illustrate the proposed methodology. PMID:20336190
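A minimal sketch of the SCAD penalty mentioned above (its value and derivative for a coefficient, with the conventional concavity parameter a = 3.7); the EM machinery for missing data and the ICQ criterion are not reproduced here.

import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty value for coefficients beta."""
    b = np.abs(beta)
    linear = lam * b
    quadratic = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    constant = (a + 1) * lam**2 / 2
    return np.where(b <= lam, linear,
                    np.where(b <= a * lam, quadratic, constant))

def scad_derivative(beta, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |beta|."""
    b = np.abs(beta)
    return np.where(b <= lam, lam,
                    np.where(b <= a * lam, (a * lam - b) / (a - 1), 0.0))

# Small coefficients are penalized like the lasso, large ones are left nearly
# unpenalized -- the source of the oracle property mentioned above.
print(scad_penalty(np.array([0.1, 1.0, 5.0]), lam=0.5))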
Fan, Shu-Xiang; Huang, Wen-Qian; Li, Jiang-Bo; Guo, Zhi-Ming; Zhaq, Chun-Jiang
2014-10-01
In order to detect the soluble solids content (SSC) of apple conveniently and rapidly, a ring fiber probe and a portable spectrometer were applied to obtain the spectra of apple. Different wavelength variable selection methods, including uninformative variable elimination (UVE), competitive adaptive reweighted sampling (CARS) and genetic algorithm (GA), were proposed to select effective wavelength variables of the NIR spectra of the SSC in apple based on PLS. The back interval LS-SVM (BiLS-SVM) and GA were used to select effective wavelength variables based on LS-SVM. Selected wavelength variables and the full wavelength range were set as input variables of the PLS model and LS-SVM model, respectively. The results indicated that the PLS model built on 50 characteristic variables selected by GA-CARS from the full spectrum of 1512 wavelengths achieved the optimal performance. The correlation coefficient (Rp) and root mean square error of prediction (RMSEP) for the prediction set were 0.962 and 0.403°Brix, respectively, for SSC. The proposed GA-CARS method could effectively simplify the portable detection model of SSC in apple based on near infrared spectroscopy and enhance the predictive precision. The study can provide a reference for the development of a portable apple soluble solids content spectrometer.
NASA Astrophysics Data System (ADS)
Sirait, Kamson; Tulus; Budhiarti Nababan, Erna
2017-12-01
Clustering methods that have high accuracy and time efficiency are necessary for the filtering process. One method that has been widely known and applied in clustering is K-Means clustering. In its application, the determination of the initial cluster centers greatly affects the results of the K-Means algorithm. This research discusses the results of K-Means clustering with starting centroids determined randomly and by a KD-Tree method. On a dataset of 1000 student academic records used to classify potential dropouts, random initial centroid determination gives an SSE value of 952972 for the quality variable and 232.48 for the GPA variable, whereas initial centroid determination by KD-Tree gives an SSE value of 504302 for the quality variable and 214.37 for the GPA variable. The smaller SSE values indicate that K-Means clustering with KD-Tree initial centroid selection has better accuracy than K-Means clustering with random initial centroid selection.
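A minimal sketch of the experiment's flavour: seed K-Means once with purely random centroids and once with centroids derived from a KD-tree, then compare the SSE (inertia). The seeding rule below (take points with a small KD-tree nearest-neighbour radius, i.e. dense points, and spread them far apart) is one plausible reading of a KD-tree initialisation, not necessarily the authors' exact procedure, and the synthetic blobs stand in for the student dataset.

import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def kdtree_seeds(X, k, n_neighbors=10, dense_fraction=0.2):
    """Seed with dense points (small KD-tree k-NN radius) spread far apart."""
    tree = cKDTree(X)
    dist, _ = tree.query(X, k=n_neighbors + 1)    # neighbours incl. the point itself
    radius = dist[:, -1]                          # small radius = dense neighbourhood
    n_dense = max(k, int(dense_fraction * len(X)))
    dense = X[np.argsort(radius)[:n_dense]]       # densest candidate points
    seeds = [dense[0]]
    while len(seeds) < k:                         # greedily add the farthest candidate
        d = np.min(np.linalg.norm(dense[:, None, :] - np.array(seeds)[None, :, :],
                                  axis=2), axis=1)
        seeds.append(dense[np.argmax(d)])
    return np.array(seeds)

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
km_random = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X)
km_kdtree = KMeans(n_clusters=5, init=kdtree_seeds(X, 5), n_init=1).fit(X)
print("SSE (random init): %.0f   SSE (KD-tree init): %.0f"
      % (km_random.inertia_, km_kdtree.inertia_))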
Development of a Robust Identifier for NPPs Transients Combining ARIMA Model and EBP Algorithm
NASA Astrophysics Data System (ADS)
Moshkbar-Bakhshayesh, Khalil; Ghofrani, Mohammad B.
2014-08-01
This study introduces a novel identification method for recognition of nuclear power plant (NPP) transients by combining the autoregressive integrated moving-average (ARIMA) model and the neural network with the error backpropagation (EBP) learning algorithm. The proposed method consists of three steps. First, an EBP-based identifier is adopted to distinguish the plant normal states from the faulty ones. In the second step, ARIMA models use the integrated (I) process to convert non-stationary data of the selected variables into stationary ones. Subsequently, ARIMA processes, including autoregressive (AR), moving-average (MA), or autoregressive moving-average (ARMA), are used to forecast time series of the selected plant variables. In the third step, to identify the type of transient, the forecasted time series are fed to the modular identifier, which has been developed using the latest advances of the EBP learning algorithm. Bushehr nuclear power plant (BNPP) transients are probed to analyze the ability of the proposed identifier. Recognition of a transient is based on the similarity of its statistical properties to the reference one, rather than the values of input patterns. Greater robustness against noisy data and an improved balance between memorization and generalization are salient advantages of the proposed identifier. Reduction of false identification, sole dependency of identification on the sign of each output signal, selection of the plant variables for transient training independently of each other, and extendibility to identification of more transients without unfavorable effects are other merits of the proposed identifier.
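A minimal sketch of the second step only: differencing a non-stationary plant variable and forecasting it with an ARIMA model from statsmodels. The (1, 1, 1) order and the synthetic drifting series are illustrative assumptions; the forecasts would then be fed to the EBP-based modular identifier.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# A drifting, non-stationary series standing in for one selected plant variable.
signal = np.cumsum(0.1 + 0.5 * rng.standard_normal(300))

model = ARIMA(signal, order=(1, 1, 1))    # the "I" part differences the series once
fitted = model.fit()
forecast = fitted.forecast(steps=20)      # these values would feed the EBP identifier
print(forecast[:5])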
Adaptive distributed source coding.
Varodayan, David; Lin, Yao-Chung; Girod, Bernd
2012-05-01
We consider distributed source coding in the presence of hidden variables that parameterize the statistical dependence among sources. We derive the Slepian-Wolf bound and devise coding algorithms for a block-candidate model of this problem. The encoder sends, in addition to syndrome bits, a portion of the source to the decoder uncoded as doping bits. The decoder uses the sum-product algorithm to simultaneously recover the source symbols and the hidden statistical dependence variables. We also develop novel techniques based on density evolution (DE) to analyze the coding algorithms. We experimentally confirm that our DE analysis closely approximates practical performance. This result allows us to efficiently optimize parameters of the algorithms. In particular, we show that the system performs close to the Slepian-Wolf bound when an appropriate doping rate is selected. We then apply our coding and analysis techniques to a reduced-reference video quality monitoring system and show a bit rate saving of about 75% compared with fixed-length coding.
Miaw, Carolina Sheng Whei; Assis, Camila; Silva, Alessandro Rangel Carolino Sales; Cunha, Maria Luísa; Sena, Marcelo Martins; de Souza, Scheilla Vitorino Carvalho
2018-07-15
Grape, orange, peach and passion fruit nectars were formulated and adulterated by dilution with syrup, apple and cashew juices at 10 levels for each adulterant. Attenuated total reflectance Fourier transform mid infrared (ATR-FTIR) spectra were obtained. Partial least squares (PLS) multivariate calibration models allied to different variable selection methods, such as interval partial least squares (iPLS), ordered predictors selection (OPS) and genetic algorithm (GA), were used to quantify the main fruits. PLS improved by iPLS-OPS variable selection showed the highest predictive capacity to quantify the main fruit contents. The selected variables in the final models varied from 72 to 100; the root mean square errors of prediction were estimated from 0.5 to 2.6%; the correlation coefficients of prediction ranged from 0.948 to 0.990; and, the mean relative errors of prediction varied from 3.0 to 6.7%. All of the developed models were validated. Copyright © 2018 Elsevier Ltd. All rights reserved.
Cross-validation pitfalls when selecting and assessing regression and classification models.
Krstajic, Damjan; Buturovic, Ljubomir J; Leahy, David E; Thomas, Simon
2014-03-29
We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
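A minimal sketch of repeated nested cross-validation with scikit-learn, in the spirit of the protocol described above: an inner grid search tunes the hyper-parameter, an outer loop estimates prediction error, and the whole procedure is repeated over different splits so the variance due to the choice of folds can be reported. The ridge model, grid, and dataset are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

scores_per_repeat = []
for repeat in range(5):                                   # repetitions with new splits
    inner = KFold(n_splits=5, shuffle=True, random_state=repeat)
    outer = KFold(n_splits=5, shuffle=True, random_state=100 + repeat)
    model = GridSearchCV(Ridge(), param_grid, cv=inner,
                         scoring="neg_mean_squared_error")
    nested = cross_val_score(model, X, y, cv=outer, scoring="neg_mean_squared_error")
    scores_per_repeat.append(-nested.mean())

print("mean CV error: %.1f, spread across repeats: %.1f"
      % (np.mean(scores_per_repeat), np.std(scores_per_repeat)))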
Optimizing event selection with the random grid search
NASA Astrophysics Data System (ADS)
Bhat, Pushpalatha C.; Prosper, Harrison B.; Sekmen, Sezen; Stewart, Chip
2018-07-01
The random grid search (RGS) is a simple, but efficient, stochastic algorithm to find optimal cuts that was developed in the context of the search for the top quark at Fermilab in the mid-1990s. The algorithm, and associated code, have been enhanced recently with the introduction of two new cut types, one of which has been successfully used in searches for supersymmetry at the Large Hadron Collider. The RGS optimization algorithm is described along with the recent developments, which are illustrated with two examples from particle physics. One explores the optimization of the selection of vector boson fusion events in the four-lepton decay mode of the Higgs boson and the other optimizes SUSY searches using boosted objects and the razor variables.
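A minimal sketch of the random grid search idea: candidate cut points are drawn from the variable values of the signal events themselves, each candidate one-sided cut is scored on signal and background samples, and the best cut under a simple significance figure of merit is kept. The two-variable Gaussian toy data and the s/sqrt(s+b) criterion are illustrative assumptions, not the enhanced cut types discussed in the paper.

import numpy as np

rng = np.random.default_rng(1)
signal = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(2000, 2))
background = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(20000, 2))

best = (None, -np.inf)
for cut in signal[rng.choice(len(signal), size=500, replace=False)]:
    s = np.sum(np.all(signal > cut, axis=1))        # signal passing both one-sided cuts
    b = np.sum(np.all(background > cut, axis=1))    # background passing the same cuts
    fom = s / np.sqrt(s + b) if s + b > 0 else 0.0  # simple significance figure of merit
    if fom > best[1]:
        best = (cut, fom)

print("best cut:", best[0], "significance:", round(best[1], 2))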
NASA Astrophysics Data System (ADS)
Wei, Jun; Jiang, Guo-Qing; Liu, Xin
2017-09-01
This study proposed three algorithms that can potentially be used to provide sea surface temperature (SST) conditions for typhoon prediction models. Different from traditional data assimilation approaches, which provide prescribed initial/boundary conditions, our proposed algorithms aim to resolve a flow-dependent SST feedback between growing typhoons and oceans in the future time. Two of these algorithms are based on linear temperature equations (TE-based), and the other is based on an innovative technique involving machine learning (ML-based). The algorithms are then implemented into a Weather Research and Forecasting model for the simulation of typhoon to assess their effectiveness, and the results show significant improvement in simulated storm intensities by including ocean cooling feedback. The TE-based algorithm I considers wind-induced ocean vertical mixing and upwelling processes only, and thus obtained a synoptic and relatively smooth sea surface temperature cooling. The TE-based algorithm II incorporates not only typhoon winds but also ocean information, and thus resolves more cooling features. The ML-based algorithm is based on a neural network, consisting of multiple layers of input variables and neurons, and produces the best estimate of the cooling structure, in terms of its amplitude and position. Sensitivity analysis indicated that the typhoon-induced ocean cooling is a nonlinear process involving interactions of multiple atmospheric and oceanic variables. Therefore, with an appropriate selection of input variables and neuron sizes, the ML-based algorithm appears to be more efficient in prognosing the typhoon-induced ocean cooling and in predicting typhoon intensity than those algorithms based on linear regression methods.
Enhancing PC Cluster-Based Parallel Branch-and-Bound Algorithms for the Graph Coloring Problem
NASA Astrophysics Data System (ADS)
Taoka, Satoshi; Takafuji, Daisuke; Watanabe, Toshimasa
A branch-and-bound algorithm (BB for short) is the most general technique to deal with various combinatorial optimization problems. Even when it is used, computation time is likely to increase exponentially, so we consider parallelization to reduce it. It has been reported that the computation time of a parallel BB heavily depends upon node-variable selection strategies. In the case of a parallel BB, it is also necessary to prevent an increase in communication time, so it is important to decide how many and what kinds of nodes are to be transferred (called the sending-node selection strategy). In this paper, for the graph coloring problem, we propose some sending-node selection strategies for a parallel BB algorithm, adopting MPI for parallelization, and experimentally evaluate how these strategies affect the computation time of a parallel BB on a PC cluster network.
Cai, Jia; Tang, Yi
2018-02-01
Canonical correlation analysis (CCA) is a powerful statistical tool for detecting the linear relationship between two sets of multivariate variables. A kernel generalization of it, namely kernel CCA, has been proposed to describe nonlinear relationships between two variables. Although kernel CCA can achieve dimensionality reduction for high-dimensional feature selection problems, it also suffers from the so-called over-fitting phenomenon. In this paper, we consider a new kernel CCA algorithm via the randomized Kaczmarz method. The main contributions of the paper are: (1) a new kernel CCA algorithm is developed, (2) theoretical convergence of the proposed algorithm is addressed by means of the scaled condition number, (3) a lower bound on the minimum number of iterations is presented. We test on both a synthetic dataset and several real-world datasets in cross-language document retrieval and content-based image retrieval to demonstrate the effectiveness of the proposed algorithm. Numerical results imply the performance and efficiency of the new algorithm, which is competitive with several state-of-the-art kernel CCA methods. Copyright © 2017 Elsevier Ltd. All rights reserved.
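A minimal sketch of the randomized Kaczmarz iteration on which the proposed kernel CCA algorithm builds, applied here to a plain consistent linear system rather than to the kernel CCA problem itself; rows are sampled with probability proportional to their squared norm, the setting in which convergence is governed by the scaled condition number.

import numpy as np

def randomized_kaczmarz(A, b, n_iter=20000, seed=0):
    """Solve A x = b by projecting onto one randomly chosen row per step."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    row_norms = np.sum(A**2, axis=1)
    probs = row_norms / row_norms.sum()   # rows sampled proportionally to squared norm
    for _ in range(n_iter):
        i = rng.choice(m, p=probs)
        x += (b[i] - A[i] @ x) / row_norms[i] * A[i]
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))
x_true = rng.standard_normal(50)
b = A @ x_true                            # consistent overdetermined system
x_hat = randomized_kaczmarz(A, b)
print("max recovery error:", np.max(np.abs(x_hat - x_true)))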
Lin, Fen-Fang; Wang, Ke; Yang, Ning; Yan, Shi-Guang; Zheng, Xin-Yu
2012-02-01
In this paper, to precisely obtain the spatial distribution characteristics of regional soil quality, the main factors affecting soil quality, such as soil type, land use pattern, lithology type, topography, road, and industry type, were used; mutual information theory was adopted to select the main environmental factors, and the decision tree algorithm See 5.0 was applied to predict the grade of regional soil quality. The main factors affecting regional soil quality were soil type, land use, lithology type, distance to town, distance to water area, altitude, distance to road, and distance to industrial land. The prediction accuracy of the decision tree model with the variables selected by mutual information was clearly higher than that of the model with all variables, and for the former model the prediction accuracy of both the decision tree and the decision rules was higher than 80%. Based on the continuous and categorical data, the method of mutual information theory integrated with a decision tree could not only reduce the number of input parameters for the decision tree algorithm, but also predict and assess regional soil quality effectively.
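A minimal sketch of the two-step pipeline described above: rank candidate environmental factors by mutual information with the soil-quality grade, keep the top-ranked ones, and train a decision tree on the reduced set. scikit-learn's tree and a synthetic dataset stand in for the See 5.0 software and the real survey data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)

mi = mutual_info_classif(X, y, random_state=0)
keep = np.argsort(mi)[::-1][:8]                      # top-8 factors by mutual information

tree = DecisionTreeClassifier(random_state=0)
acc_all = cross_val_score(tree, X, y, cv=5).mean()
acc_sel = cross_val_score(tree, X[:, keep], y, cv=5).mean()
print("accuracy with all variables: %.2f, with MI-selected variables: %.2f"
      % (acc_all, acc_sel))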
TRANPLAN and GIS support for agencies in Alabama
DOT National Transportation Integrated Search
2001-08-06
Travel demand models are computerized programs intended to forecast future roadway traffic volumes for a community based on selected socioeconomic variables and travel behavior algorithms. Software to operate these travel demand models is currently a...
Linear and nonlinear pattern selection in Rayleigh-Benard stability problems
NASA Technical Reports Server (NTRS)
Davis, Sanford S.
1993-01-01
A new algorithm is introduced to compute finite-amplitude states using primitive variables for Rayleigh-Benard convection on relatively coarse meshes. The algorithm is based on a finite-difference matrix-splitting approach that separates all physical and dimensional effects into one-dimensional subsets. The nonlinear pattern selection process for steady convection in an air-filled square cavity with insulated side walls is investigated for Rayleigh numbers up to 20,000. The internalization of disturbances that evolve into coherent patterns is investigated and transient solutions from linear perturbation theory are compared with and contrasted to the full numerical simulations.
Designing basin-customized combined drought indices via feature extraction
NASA Astrophysics Data System (ADS)
Zaniolo, Marta; Giuliani, Matteo; Castelletti, Andrea
2017-04-01
The socio-economic costs of drought are progressively increasing worldwide due to the ongoing alteration of hydro-meteorological regimes induced by climate change. Although drought management is largely studied in the literature, most traditional drought indexes fail in detecting critical events in highly regulated systems, which generally rely on ad-hoc formulations and cannot be generalized to different contexts. In this study, we contribute a novel framework for the design of a basin-customized drought index. This index represents a surrogate of the state of the basin and is computed by combining the available information about the water in the system to reproduce a representative target variable for the drought condition of the basin (e.g., water deficit). To select the relevant variables and how to combine them, we use an advanced feature extraction algorithm called Wrapper for Quasi Equally Informative Subset Selection (W-QEISS). The W-QEISS algorithm relies on a multi-objective evolutionary algorithm to find Pareto-efficient subsets of variables by maximizing the wrapper accuracy, minimizing the number of selected variables (cardinality) and optimizing relevance and redundancy of the subset. The accuracy objective is evaluated through the calibration of a pre-defined model (i.e., an extreme learning machine) of the water deficit for each candidate subset of variables, with the index selected from the resulting solutions identifying a suitable compromise between accuracy, cardinality, relevance, and redundancy. The proposed methodology is tested in the case study of Lake Como in northern Italy, a regulated lake mainly operated for irrigation supply to four downstream agricultural districts. In the absence of an institutional drought monitoring system, we constructed the combined index using all the hydrological variables from the existing monitoring system as well as the most common drought indicators at multiple time aggregations. The soil moisture deficit in the root zone computed by a distributed-parameter water balance model of the agricultural districts is used as the target variable. Numerical results show that our framework succeeds in constructing a combined drought index that reproduces the soil moisture deficit. Moreover, this index provides valuable information for supporting appropriate drought management strategies, including the possibility of directly informing the lake operations about the drought conditions and improving the overall reliability of the irrigation supply system.
An Empirical Modeling of Runoff in Small Watersheds Using LiDAR Data
NASA Astrophysics Data System (ADS)
Lopatin, J.; Hernández, J.; Galleguillos, M.; Mancilla, G.
2013-12-01
Hydrological models allow the simulation of natural water processes as well as the quantification and prediction of the effects of human impacts on runoff behavior. However, obtaining the information needed to apply these models can be costly in both time and resources, especially in large and difficult-to-access areas. The objective of this research was to integrate LiDAR data into the hydrological modeling of runoff in small watersheds, using derived hydrologic, vegetation, and topographic variables. The study area includes 10 small forested headwater watersheds, between 2 and 16 ha, located in the south-central coastal range of Chile. In each of them, instantaneous rainfall and runoff flow were measured for a total of 15 rainfall events between August 2012 and July 2013, yielding a total of 79 observations. In March 2011 a Harrier 54/G4 Dual System was used to obtain a discrete-pulse LiDAR point cloud with an average of 4.64 points per square meter. A Digital Terrain Model (DTM) of 1 meter resolution was obtained from the point cloud, and 55 topographic variables were subsequently derived, such as physical watershed parameters and morphometric features. At the same time, 30 vegetation descriptive variables were obtained directly from the point cloud and from a Digital Canopy Model (DCM). The classification and regression "Random Forest" (RF) algorithm was used to select the most important variables in predicting water height (liters), and the "Partial Least Squares Path Modeling" (PLS-PM) algorithm was used to fit a model using the selected set of variables. Four latent variables were selected (outer model), related to climate, topography, vegetation, and runoff, and each was assigned a group of the predictor variables selected by RF (inner model). The coefficient of determination (R2) and Goodness-of-Fit (GoF) of the final model were obtained. The best results were found when modeling using only the upper 50th percentile of rainfall events. The best variables selected by the RF algorithm were three topographic variables and three vegetation-related ones. We obtained an R2 of 0.82 and a GoF of 0.87 with a 95% confidence interval. This study shows that it is possible to predict the water harvested during a rainstorm event in a forest environment using only LiDAR data. However, this type of methodology does not give good results for flows produced by low-magnitude rainfall events, as these are more influenced by the initial conditions of soil, vegetation, and climate, which make their behavior slower and more erratic.
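A minimal sketch of the variable-screening step: rank LiDAR-derived predictors by random-forest importance and keep the most important few before fitting the final runoff model. The PLS path-modelling (PLS-PM) stage is not reproduced; an ordinary least-squares fit on the selected predictors stands in for it, and the synthetic data only mimics the 79-observation, ~85-predictor dimensions above.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 79 observations and 85 candidate predictors, mimicking the dimensions above.
X, y = make_regression(n_samples=79, n_features=85, n_informative=6,
                       noise=5.0, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:6]      # six most important variables

r2 = cross_val_score(LinearRegression(), X[:, top], y, cv=5, scoring="r2").mean()
print("selected predictors:", sorted(top.tolist()), " cross-validated R2: %.2f" % r2)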
Zhang, Yiyan; Xin, Yi; Li, Qin; Ma, Jianshe; Li, Shuai; Lv, Xiaodan; Lv, Weiqi
2017-11-02
Various kinds of data mining algorithms are continuously being proposed with the development of related disciplines. These algorithms differ in their applicable scopes and performance. Hence, finding a suitable algorithm for a dataset is becoming an important concern for biomedical researchers seeking to solve practical problems promptly. In this paper, seven sophisticated and actively used algorithms, namely, C4.5, support vector machine, AdaBoost, k-nearest neighbor, naïve Bayes, random forest, and logistic regression, were selected as the research objects. The seven algorithms were applied to the 12 top-click UCI public datasets with the task of classification, and their performances were compared through induction and analysis. The sample size, number of attributes, number of missing values, sample size of each class, correlation coefficients between variables, class entropy of the task variable, and the ratio of the sample size of the largest class to that of the smallest class were calculated to characterize the 12 research datasets. The two ensemble algorithms reach high classification accuracy on most datasets. Moreover, random forest performs better than AdaBoost on unbalanced datasets of the multi-class task. Simple algorithms, such as naïve Bayes and logistic regression, are suitable for small datasets with high correlation between the task and the other non-task attribute variables. The k-nearest neighbor and C4.5 decision tree algorithms perform well on binary- and multi-class task datasets. Support vector machine is more adept at balanced small datasets of the binary-class task. No algorithm can maintain the best performance on all datasets. The applicability of the seven data mining algorithms to datasets with different characteristics was summarized to provide a reference for biomedical researchers or beginners in different fields.
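A minimal sketch of the comparison protocol: run the seven listed classifiers on the same dataset with identical cross-validation folds and tabulate their accuracies. One public scikit-learn dataset stands in for the 12 UCI datasets, and scikit-learn's entropy-based tree stands in for C4.5.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "C4.5-like tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=5000),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=cv).mean()   # same folds for every model
    print("%-20s accuracy = %.3f" % (name, acc))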
The Performance of Short-Term Heart Rate Variability in the Detection of Congestive Heart Failure
Barros, Allan Kardec; Ohnishi, Noboru
2016-01-01
Congestive heart failure (CHF) is a cardiac disease associated with a decreasing capacity of the cardiac output. It has been shown that CHF is the main cause of cardiac death around the world. Some works proposed to discriminate CHF subjects from healthy subjects using either the electrocardiogram (ECG) or heart rate variability (HRV) from long-term recordings. In this work, we propose an alternative framework to discriminate CHF from healthy subjects by using HRV short-term intervals based on 256 continuous RR samples. Our framework uses a matching pursuit algorithm based on Gabor functions. From the selected Gabor functions, we derived a set of features that are input into a hybrid framework which uses a genetic algorithm and a k-nearest neighbour classifier to select a subset of features that has the best classification performance. The performance of the framework is analyzed using both the Fantasia and the CHF databases from the Physionet archives, which are composed of 40 healthy volunteers and 29 subjects, respectively. From a set of 16 nonstandard features, the proposed framework reaches an overall accuracy of 100% with five features. Our results suggest that hybrid frameworks that combine genetic algorithms with classifier algorithms outperform well-known classifier methods. PMID:27891509
Peltola, Tomi; Marttinen, Pekka; Vehtari, Aki
2012-01-01
High-dimensional datasets with large amounts of redundant information are nowadays available for hypothesis-free exploration of scientific questions. A particular case is genome-wide association analysis, where variations in the genome are searched for effects on disease or other traits. Bayesian variable selection has been demonstrated as a possible analysis approach, which can account for the multifactorial nature of the genetic effects in a linear regression model. Yet, the computation presents a challenge and application to large-scale data is not routine. Here, we study aspects of the computation using the Metropolis-Hastings algorithm for the variable selection: finite adaptation of the proposal distributions, multistep moves for changing the inclusion state of multiple variables in a single proposal and multistep move size adaptation. We also experiment with a delayed rejection step for the multistep moves. Results on simulated and real data show increase in the sampling efficiency. We also demonstrate that with application specific proposals, the approach can overcome a specific mixing problem in real data with 3822 individuals and 1,051,811 single nucleotide polymorphisms and uncover a variant pair with synergistic effect on the studied trait. Moreover, we illustrate multimodality in the real dataset related to a restrictive prior distribution on the genetic effect sizes and advocate a more flexible alternative. PMID:23166669
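A minimal sketch of Metropolis-Hastings over variable-inclusion indicators for a linear model: each proposal flips the inclusion state of one randomly chosen variable, and a BIC-based approximation stands in for the marginal likelihood. The finite adaptation, multistep moves and delayed rejection studied in the paper are omitted, and the simulated data are an assumption.

import numpy as np

def log_score(X, y, gamma):
    """Approximate log model evidence via -BIC/2 for the selected columns."""
    n = len(y)
    Xs = X[:, gamma.astype(bool)]
    k = Xs.shape[1]
    if k == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
    return -0.5 * (n * np.log(rss / n) + k * np.log(n))

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)

gamma = np.zeros(p)                         # start from the empty model
current = log_score(X, y, gamma)
inclusion_counts = np.zeros(p)
for it in range(5000):
    proposal = gamma.copy()
    j = rng.integers(p)
    proposal[j] = 1 - proposal[j]           # flip one inclusion indicator
    cand = log_score(X, y, proposal)
    if np.log(rng.random()) < cand - current:   # symmetric proposal: simple MH ratio
        gamma, current = proposal, cand
    inclusion_counts += gamma

print("most frequently included variables:",
      np.argsort(inclusion_counts)[::-1][:5])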
An improved partial least-squares regression method for Raman spectroscopy
NASA Astrophysics Data System (ADS)
Momenpour Tehran Monfared, Ali; Anis, Hanan
2017-10-01
It is known that the performance of partial least-squares (PLS) regression analysis can be improved using the backward variable selection method (BVSPLS). In this paper, we further improve BVSPLS based on a novel selection mechanism. The proposed method is based on sorting the weighted regression coefficients, and then the importance of each variable in the sorted list is evaluated using the root mean square error of prediction (RMSEP) criterion in each iteration step. Our improved BVSPLS (IBVSPLS) method has been applied to leukemia and heparin data sets and led to an improvement in the limit of detection of Raman biosensing ranging from 10% to 43% compared to PLS. Our IBVSPLS was also compared to the jack-knifing (simpler) and genetic algorithm (more complex) methods. Our method was consistently better than the jack-knifing method and showed either a similar or a better performance compared to the genetic algorithm.
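A minimal sketch of backward variable selection around PLS regression: starting from all variables, candidates are ranked by the magnitude of their PLS regression coefficients and the least influential one is dropped as long as the cross-validated RMSEP keeps improving. This follows the general BVSPLS idea; the exact sorting and weighting of the proposed IBVSPLS, and the Raman data, are not reproduced here.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsep(X, y, n_components=5):
    """Cross-validated root mean square error of prediction for a PLS model."""
    pls = PLSRegression(n_components=min(n_components, X.shape[1]))
    pred = cross_val_predict(pls, X, y, cv=5)
    return np.sqrt(np.mean((y - pred.ravel()) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 60))
y = X[:, :5] @ np.array([2.0, -1.0, 1.5, 0.5, -2.0]) + 0.1 * rng.standard_normal(120)

active = list(range(X.shape[1]))
best_err = rmsep(X[:, active], y)
improved = True
while improved and len(active) > 5:
    coefs = PLSRegression(n_components=5).fit(X[:, active], y).coef_.ravel()
    drop = active[int(np.argmin(np.abs(coefs)))]       # least influential variable
    trial = [j for j in active if j != drop]
    err = rmsep(X[:, trial], y)
    improved = err <= best_err
    if improved:
        active, best_err = trial, err

print("%d variables retained, RMSEP = %.3f" % (len(active), best_err))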
Genetic Algorithm for Initial Orbit Determination with Too Short Arc
NASA Astrophysics Data System (ADS)
Li, Xin-ran; Wang, Xin
2017-01-01
A huge quantity of too-short-arc (TSA) observational data have been obtained in sky surveys of space objects. However, reasonable results for the TSAs can hardly be obtained with the classical methods of initial orbit determination (IOD). In this paper, the IOD is reduced to a two-stage hierarchical optimization problem containing three variables for each stage. Using the genetic algorithm, a new method of the IOD for TSAs is established, through the selections of the optimized variables and the corresponding genetic operators for specific problems. Numerical experiments based on the real measurements show that the method can provide valid initial values for the follow-up work.
Genetic Algorithm for Initial Orbit Determination with Too Short Arc
NASA Astrophysics Data System (ADS)
Li, X. R.; Wang, X.
2016-01-01
The sky surveys of space objects have obtained a huge quantity of too-short-arc (TSA) observation data. However, the classical method of initial orbit determination (IOD) can hardly get reasonable results for the TSAs. The IOD is reduced to a two-stage hierarchical optimization problem containing three variables for each stage. Using the genetic algorithm, a new method of the IOD for TSAs is established, through the selection of optimizing variables as well as the corresponding genetic operator for specific problems. Numerical experiments based on the real measurements show that the method can provide valid initial values for the follow-up work.
Can Geostatistical Models Represent Nature's Variability? An Analysis Using Flume Experiments
NASA Astrophysics Data System (ADS)
Scheidt, C.; Fernandes, A. M.; Paola, C.; Caers, J.
2015-12-01
The lack of understanding in the Earth's geological and physical processes governing sediment deposition render subsurface modeling subject to large uncertainty. Geostatistics is often used to model uncertainty because of its capability to stochastically generate spatially varying realizations of the subsurface. These methods can generate a range of realizations of a given pattern - but how representative are these of the full natural variability? And how can we identify the minimum set of images that represent this natural variability? Here we use this minimum set to define the geostatistical prior model: a set of training images that represent the range of patterns generated by autogenic variability in the sedimentary environment under study. The proper definition of the prior model is essential in capturing the variability of the depositional patterns. This work starts with a set of overhead images from an experimental basin that showed ongoing autogenic variability. We use the images to analyze the essential characteristics of this suite of patterns. In particular, our goal is to define a prior model (a minimal set of selected training images) such that geostatistical algorithms, when applied to this set, can reproduce the full measured variability. A necessary prerequisite is to define a measure of variability. In this study, we measure variability using a dissimilarity distance between the images. The distance indicates whether two snapshots contain similar depositional patterns. To reproduce the variability in the images, we apply an MPS algorithm to the set of selected snapshots of the sedimentary basin that serve as training images. The training images are chosen from among the initial set by using the distance measure to ensure that only dissimilar images are chosen. Preliminary investigations show that MPS can reproduce fairly accurately the natural variability of the experimental depositional system. Furthermore, the selected training images provide process information. They fall into three basic patterns: a channelized end member, a sheet flow end member, and one intermediate case. These represent the continuum between autogenic bypass or erosion, and net deposition.
NASA Astrophysics Data System (ADS)
Sheykhizadeh, Saheleh; Naseri, Abdolhossein
2018-04-01
Variable selection plays a key role in classification and multivariate calibration. Variable selection methods are aimed at choosing a set of variables, from a large pool of available predictors, relevant to the estimation of analyte concentrations, or to achieve better classification results. Many variable selection techniques have now been introduced, among which those based on the methodologies of swarm intelligence optimization have received particular attention over the last few decades, since they are mainly inspired by nature. In this work, a simple and new variable selection algorithm is proposed based on the invasive weed optimization (IWO) concept. IWO is a bio-inspired metaheuristic mimicking the weeds' ecological behavior in colonizing and finding an appropriate place for growth and reproduction; it has been shown to be very adaptive and powerful to environmental changes. In this paper, the first application of IWO, as a very simple and powerful method, to variable selection is reported using different experimental datasets including FTIR and NIR data, so as to undertake classification and multivariate calibration tasks. Accordingly, invasive weed optimization - linear discriminant analysis (IWO-LDA) and invasive weed optimization - partial least squares (IWO-PLS) are introduced for multivariate classification and calibration, respectively.
Sheykhizadeh, Saheleh; Naseri, Abdolhossein
2018-04-05
Variable selection plays a key role in classification and multivariate calibration. Variable selection methods are aimed at choosing a set of variables, from a large pool of available predictors, relevant to the estimation of analyte concentrations, or to achieve better classification results. Many variable selection techniques have now been introduced, among which those based on the methodologies of swarm intelligence optimization have received particular attention over the last few decades, since they are mainly inspired by nature. In this work, a simple and new variable selection algorithm is proposed based on the invasive weed optimization (IWO) concept. IWO is a bio-inspired metaheuristic mimicking the weeds' ecological behavior in colonizing and finding an appropriate place for growth and reproduction; it has been shown to be very adaptive and powerful to environmental changes. In this paper, the first application of IWO, as a very simple and powerful method, to variable selection is reported using different experimental datasets including FTIR and NIR data, so as to undertake classification and multivariate calibration tasks. Accordingly, invasive weed optimization - linear discriminant analysis (IWO-LDA) and invasive weed optimization - partial least squares (IWO-PLS) are introduced for multivariate classification and calibration, respectively. Copyright © 2018 Elsevier B.V. All rights reserved.
Optimizing event selection with the random grid search
Bhat, Pushpalatha C.; Prosper, Harrison B.; Sekmen, Sezen; ...
2018-02-27
In this paper, the random grid search (RGS) is a simple, but efficient, stochastic algorithm to find optimal cuts that was developed in the context of the search for the top quark at Fermilab in the mid-1990s. The algorithm, and associated code, have been enhanced recently with the introduction of two new cut types, one of which has been successfully used in searches for supersymmetry at the Large Hadron Collider. The RGS optimization algorithm is described along with the recent developments, which are illustrated with two examples from particle physics. One explores the optimization of the selection of vector boson fusion events in the four-lepton decay mode of the Higgs boson and the other optimizes SUSY searches using boosted objects and the razor variables.
Genetic Algorithms and Classification Trees in Feature Discovery: Diabetes and the NHANES database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Heredia-Langner, Alejandro; Jarman, Kristin H.; Amidan, Brett G.
2013-09-01
This paper presents a feature selection methodology that can be applied to datasets containing a mixture of continuous and categorical variables. Using a Genetic Algorithm (GA), this method explores a dataset and selects a small set of features relevant for the prediction of a binary (1/0) response. Binary classification trees and an objective function based on conditional probabilities are used to measure the fitness of a given subset of features. The method is applied to health data in order to find factors useful for the prediction of diabetes. Results show that our algorithm is capable of narrowing down the set of predictors to around 8 factors that can be validated using reputable medical and public health resources.
An Artificial Bee Colony Algorithm for Uncertain Portfolio Selection
Chen, Wei
2014-01-01
Portfolio selection is an important issue for researchers and practitioners. In this paper, under the assumption that security returns are given by experts' evaluations rather than historical data, we discuss the portfolio adjusting problem which takes transaction costs and diversification degree of portfolio into consideration. Uncertain variables are employed to describe the security returns. In the proposed mean-variance-entropy model, the uncertain mean value of the return is used to measure investment return, the uncertain variance of the return is used to measure investment risk, and the entropy is used to measure diversification degree of portfolio. In order to solve the proposed model, a modified artificial bee colony (ABC) algorithm is designed. Finally, a numerical example is given to illustrate the modelling idea and the effectiveness of the proposed algorithm. PMID:25089292
Optimizing Event Selection with the Random Grid Search
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bhat, Pushpalatha C.; Prosper, Harrison B.; Sekmen, Sezen
2017-06-29
The random grid search (RGS) is a simple, but efficient, stochastic algorithm to find optimal cuts that was developed in the context of the search for the top quark at Fermilab in the mid-1990s. The algorithm, and associated code, have been enhanced recently with the introduction of two new cut types, one of which has been successfully used in searches for supersymmetry at the Large Hadron Collider. The RGS optimization algorithm is described along with the recent developments, which are illustrated with two examples from particle physics. One explores the optimization of the selection of vector boson fusion events in the four-lepton decay mode of the Higgs boson and the other optimizes SUSY searches using boosted objects and the razor variables.
Optimizing event selection with the random grid search
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bhat, Pushpalatha C.; Prosper, Harrison B.; Sekmen, Sezen
In this paper, the random grid search (RGS) is a simple, but efficient, stochastic algorithm to find optimal cuts that was developed in the context of the search for the top quark at Fermilab in the mid-1990s. The algorithm, and associated code, have been enhanced recently with the introduction of two new cut types, one of which has been successfully used in searches for supersymmetry at the Large Hadron Collider. The RGS optimization algorithm is described along with the recent developments, which are illustrated with two examples from particle physics. One explores the optimization of the selection of vector boson fusion events in the four-lepton decay mode of the Higgs boson and the other optimizes SUSY searches using boosted objects and the razor variables.
An artificial bee colony algorithm for uncertain portfolio selection.
Chen, Wei
2014-01-01
Portfolio selection is an important issue for researchers and practitioners. In this paper, under the assumption that security returns are given by experts' evaluations rather than historical data, we discuss the portfolio adjusting problem which takes transaction costs and diversification degree of portfolio into consideration. Uncertain variables are employed to describe the security returns. In the proposed mean-variance-entropy model, the uncertain mean value of the return is used to measure investment return, the uncertain variance of the return is used to measure investment risk, and the entropy is used to measure diversification degree of portfolio. In order to solve the proposed model, a modified artificial bee colony (ABC) algorithm is designed. Finally, a numerical example is given to illustrate the modelling idea and the effectiveness of the proposed algorithm.
Lafuente, Victoria; Herrera, Luis J; Pérez, María del Mar; Val, Jesús; Negueruela, Ignacio
2015-08-15
In this work, near infrared spectroscopy (NIR) and an acoustic measure (AWETA) (two non-destructive methods) were applied to Prunus persica fruit 'Calrico' (n = 260) to predict Magness-Taylor (MT) firmness. Separate and combined use of these measures was evaluated and compared using partial least squares (PLS) and least squares support vector machine (LS-SVM) regression methods. Also, a mutual-information-based variable selection method, seeking to find the most significant variables to produce optimal accuracy of the regression models, was applied to a joint set of variables (NIR wavelengths and the AWETA measure). The newly proposed combined NIR-AWETA model gave good values of the determination coefficient (R(2)) for the PLS and LS-SVM methods (0.77 and 0.78, respectively), improving the reliability of MT firmness prediction in comparison with separate NIR and AWETA predictions. The three variables selected by the variable selection method (the AWETA measure plus NIR wavelengths 675 and 697 nm) achieved R(2) values of 0.76 and 0.77 for PLS and LS-SVM, respectively. These results indicated that the proposed mutual-information-based variable selection algorithm was a powerful tool for the selection of the most relevant variables. © 2014 Society of Chemical Industry.
Gene selection heuristic algorithm for nutrigenomics studies.
Valour, D; Hue, I; Grimard, B; Valour, B
2013-07-15
Large datasets from -omics studies need to be deeply investigated. The aim of this paper is to provide a new method (the LEM method) for the search of transcriptome and metabolome connections. The heuristic algorithm described here extends the classical canonical correlation analysis (CCA) to a high number of variables (without regularization) and combines good conditioning and fast computation in "R." Reduced CCA models are summarized in PageRank matrices, the product of which gives a stochastic matrix that summarizes the self-avoiding walk covered by the algorithm. Then, a homogeneous Markov process applied to this stochastic matrix yields converged probabilities of interconnection between genes, providing a selection of disjoint subsets of genes. This is an alternative to regularized generalized CCA for the determination of blocks within the structure matrix. Each gene subset is thus linked to the whole metabolic or clinical dataset that represents the biological phenotype of interest. Moreover, this selection process reaches the aim of biologists, who often need small sets of genes for further validation or extended phenotyping. The algorithm is shown to work efficiently on three published datasets, resulting in meaningfully broadened gene networks.
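A minimal sketch of the final convergence step: a homogeneous Markov chain (power iteration) applied to a row-stochastic matrix of gene interconnection scores until the probabilities stabilise. The 4x4 matrix is a toy stand-in for the product of PageRank matrices produced by the reduced CCA models.

import numpy as np

P = np.array([[0.70, 0.20, 0.05, 0.05],
              [0.10, 0.60, 0.20, 0.10],
              [0.05, 0.25, 0.60, 0.10],
              [0.05, 0.05, 0.10, 0.80]])      # rows sum to 1 (stochastic matrix)

pi = np.full(4, 0.25)                          # uniform starting distribution
for _ in range(1000):
    new_pi = pi @ P                            # one Markov step
    if np.max(np.abs(new_pi - pi)) < 1e-12:    # converged stationary distribution
        break
    pi = new_pi

print("stationary interconnection probabilities:", np.round(pi, 3))
# Genes whose stationary probabilities concentrate together would form one of
# the disjoint subsets passed on for biological validation.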
Cheng, Jun-Hu; Sun, Da-Wen; Pu, Hongbin
2016-04-15
The potential use of feature wavelengths for predicting drip loss in grass carp fish, as affected by being frozen at -20°C for 24 h and thawed at 4°C for 1, 2, 4, and 6 days, was investigated. Hyperspectral images of frozen-thawed fish were obtained and their corresponding spectra were extracted. Least-squares support vector machine and multiple linear regression (MLR) models were established using five key wavelengths, selected by combining a genetic algorithm and successive projections algorithm, and this showed satisfactory performance in drip loss prediction. The MLR model with a determination coefficient of prediction (R(2)P) of 0.9258, and lower root mean square error estimated by a prediction (RMSEP) of 1.12%, was applied to transfer each pixel of the image and generate the distribution maps of exudation changes. The results confirmed that it is feasible to identify the feature wavelengths using variable selection methods and chemometric analysis for developing on-line multispectral imaging. Copyright © 2015 Elsevier Ltd. All rights reserved.
Liu, Xiang; Peng, Yingwei; Tu, Dongsheng; Liang, Hua
2012-10-30
Survival data with a sizable cure fraction are commonly encountered in cancer research. The semiparametric proportional hazards cure model has been recently used to analyze such data. As seen in the analysis of data from a breast cancer study, a variable selection approach is needed to identify important factors in predicting the cure status and risk of breast cancer recurrence. However, no specific variable selection method for the cure model is available. In this paper, we present a variable selection approach with penalized likelihood for the cure model. The estimation can be implemented easily by combining the computational methods for penalized logistic regression and the penalized Cox proportional hazards models with the expectation-maximization algorithm. We illustrate the proposed approach on data from a breast cancer study. We conducted Monte Carlo simulations to evaluate the performance of the proposed method. We used and compared different penalty functions in the simulation studies. Copyright © 2012 John Wiley & Sons, Ltd.
Passos, Ives Cavalcante; Mwangi, Benson; Cao, Bo; Hamilton, Jane E; Wu, Mon-Ju; Zhang, Xiang Yang; Zunta-Soares, Giovana B; Quevedo, Joao; Kauer-Sant'Anna, Marcia; Kapczinski, Flávio; Soares, Jair C
2016-03-15
A growing body of evidence has put forward clinical risk factors associated with patients with mood disorders that attempt suicide. However, what is not known is how to integrate clinical variables into a clinically useful tool in order to estimate the probability of an individual patient attempting suicide. A total of 144 patients with mood disorders were included. Clinical variables associated with suicide attempts among patients with mood disorders and demographic variables were used to 'train' a machine learning algorithm. The resulting algorithm was utilized in identifying novel or 'unseen' individual subjects as either suicide attempters or non-attempters. Three machine learning algorithms were implemented and evaluated. All algorithms distinguished individual suicide attempters from non-attempters with prediction accuracy ranging between 65% and 72% (p<0.05). In particular, the relevance vector machine (RVM) algorithm correctly predicted 103 out of 144 subjects translating into 72% accuracy (72.1% sensitivity and 71.3% specificity) and an area under the curve of 0.77 (p<0.0001). The most relevant predictor variables in distinguishing attempters from non-attempters included previous hospitalizations for depression, a history of psychosis, cocaine dependence and post-traumatic stress disorder (PTSD) comorbidity. Risk for suicide attempt among patients with mood disorders can be estimated at an individual subject level by incorporating both demographic and clinical variables. Future studies should examine the performance of this model in other populations and its subsequent utility in facilitating selection of interventions to prevent suicide. Copyright © 2015 Elsevier B.V. All rights reserved.
Passos, Ives Cavalcante; Mwangi, Benson; Cao, Bo; Hamilton, Jane E; Wu, Mon-Ju; Zhang, Xiang Yang; Zunta-Soares, Giovana B.; Quevedo, Joao; Kauer-Sant'Anna, Marcia; Kapczinski, Flávio; Soares, Jair C.
2016-01-01
Objective A growing body of evidence has put forward clinical risk factors associated with patients with mood disorders that attempt suicide. However, what is not known is how to integrate clinical variables into a clinically useful tool in order to estimate the probability of an individual patient attempting suicide. Method A total of 144 patients with mood disorders were included. Clinical variables associated with suicide attempts among patients with mood disorders and demographic variables were used to ‘train’ a machine learning algorithm. The resulting algorithm was utilized in identifying novel or ‘unseen’ individual subjects as either suicide attempters or non-attempters. Three machine learning algorithms were implemented and evaluated. Results All algorithms distinguished individual suicide attempters from non-attempters with prediction accuracy ranging between 65%-72% (p<0.05). In particular, the relevance vector machine (RVM) algorithm correctly predicted 103 out of 144 subjects translating into 72% accuracy (72.1% sensitivity and 71.3% specificity) and an area under the curve of 0.77 (p<0.0001). The most relevant predictor variables in distinguishing attempters from non-attempters included previous hospitalizations for depression, a history of psychosis, cocaine dependence and post-traumatic stress disorder (PTSD) comorbidity. Conclusion Risk for suicide attempt among patients with mood disorders can be estimated at an individual subject level by incorporating both demographic and clinical variables. Future studies should examine the performance of this model in other populations and its subsequent utility in facilitating selection of interventions to prevent suicide. PMID:26773901
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Kevin J.; Wright, Bob W.; Jarman, Kristin H.
2003-05-09
A rapid retention time alignment algorithm was developed as a preprocessing utility to be used prior to chemometric analysis of large datasets of diesel fuel gas chromatographic profiles. Retention time variation from chromatogram to chromatogram has been a significant impediment against the use of chemometric techniques in the analysis of chromatographic data due to the inability of current multivariate techniques to correctly model information that shifts from variable to variable within a dataset. The algorithm developed is shown to increase the efficacy of pattern recognition methods applied to a set of diesel fuel chromatograms by retaining chemical selectivity while reducing chromatogram-to-chromatogram retention time variations, and to do so on a time scale that makes analysis of large sets of chromatographic data practical.
Generic Feature Selection with Short Fat Data
Clarke, B.; Chu, J.-H.
2014-01-01
Consider a regression problem in which there are many more explanatory variables than data points, i.e., p ≫ n. Essentially, without reducing the number of variables inference is impossible. So, we group the p explanatory variables into blocks by clustering, evaluate statistics on the blocks and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of n, p, classes of statistics, clustering algorithms, penalty terms, and data types. When n is not large, the discrimination over number of statistics is weak, but computations suggest regressing on approximately [n/K] statistics where K is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an Lq norm with high enough q. PMID:25346546
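A minimal sketch of the block-then-regress recipe for p ≫ n data: cluster the explanatory variables into K blocks, summarise each block by a statistic (here the block mean), and fit a penalized regression of the response on the roughly [n/K]-dimensional block statistics. K-means on the transposed data matrix and the lasso are one concrete choice among the clustering algorithms, statistics, and penalties compared in the paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, K = 60, 1000, 20                       # short fat data: p >> n
X = rng.standard_normal((n, p))
y = X[:, :10].sum(axis=1) + 0.5 * rng.standard_normal(n)

# Cluster the columns (variables), not the rows, by running K-means on X.T.
blocks = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X.T)
block_means = np.column_stack([X[:, blocks == k].mean(axis=1) for k in range(K)])

model = LassoCV(cv=5).fit(block_means, y)    # penalized regression on block statistics
print("non-zero block coefficients:", np.flatnonzero(model.coef_))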
NASA Astrophysics Data System (ADS)
Yang, Ruijie; Dai, Jianrong; Yang, Yong; Hu, Yimin
2006-08-01
The purpose of this study is to extend an algorithm proposed for beam orientation optimization in classical conformal radiotherapy to intensity-modulated radiation therapy (IMRT) and to evaluate the algorithm's performance in IMRT scenarios. In addition, the effect of the candidate pool of beam orientations, in terms of beam orientation resolution and starting orientation, on the optimized beam configuration, plan quality and optimization time is also explored. The algorithm is based on the technique of mixed integer linear programming in which binary and positive float variables are employed to represent candidates for beam orientation and beamlet weights in beam intensity maps. Both beam orientations and beam intensity maps are simultaneously optimized in the algorithm with a deterministic method. Several different clinical cases were used to test the algorithm and the results show that both target coverage and critical structures sparing were significantly improved for the plans with optimized beam orientations compared to those with equi-spaced beam orientations. The calculation time was less than an hour for the cases with 36 binary variables on a PC with a Pentium IV 2.66 GHz processor. It is also found that decreasing beam orientation resolution to 10° greatly reduced the size of the candidate pool of beam orientations without significant influence on the optimized beam configuration and plan quality, while selecting different starting orientations had large influence. Our study demonstrates that the algorithm can be applied to IMRT scenarios, and better beam orientation configurations can be obtained using this algorithm. Furthermore, the optimization efficiency can be greatly increased through proper selection of beam orientation resolution and starting beam orientation while guaranteeing the optimized beam configurations and plan quality.
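A minimal sketch of a mixed integer linear programming formulation in the spirit of the algorithm described above: binary variables select beam orientations, continuous variables carry the beam weights, and linear constraints couple them. The toy one-target/one-organ dose coefficients, the big-M linking constraints, and the PuLP/CBC solver are illustrative assumptions; the clinical algorithm optimizes full beamlet intensity maps and richer dose objectives.

import numpy as np
import pulp

n_beams, n_select, big_m = 8, 3, 10.0
rng = np.random.default_rng(0)
dose_target = rng.uniform(0.5, 1.0, n_beams)   # target dose per unit beam weight (toy)
dose_oar = rng.uniform(0.0, 0.6, n_beams)      # organ-at-risk dose per unit beam weight

prob = pulp.LpProblem("beam_orientation", pulp.LpMinimize)
use = [pulp.LpVariable("use_%d" % i, cat="Binary") for i in range(n_beams)]
w = [pulp.LpVariable("w_%d" % i, lowBound=0) for i in range(n_beams)]

prob += pulp.lpSum(dose_oar[i] * w[i] for i in range(n_beams))           # spare the OAR
prob += pulp.lpSum(dose_target[i] * w[i] for i in range(n_beams)) >= 10  # target coverage
prob += pulp.lpSum(use) == n_select                                      # number of beams
for i in range(n_beams):
    prob += w[i] <= big_m * use[i]             # a beam carries weight only if selected

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("selected orientations:", [i for i in range(n_beams) if use[i].value() > 0.5])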
Individual treatment selection for patients with posttraumatic stress disorder.
Deisenhofer, Anne-Katharina; Delgadillo, Jaime; Rubel, Julian A; Böhnke, Jan R; Zimmermann, Dirk; Schwartz, Brian; Lutz, Wolfgang
2018-04-16
Trauma-focused cognitive behavioral therapy (Tf-CBT) and eye movement desensitization and reprocessing (EMDR) are two highly effective treatment options for posttraumatic stress disorder (PTSD). Yet, on an individual level, PTSD patients vary substantially in treatment response. The aim of the paper is to test the application of a treatment selection method based on a personalized advantage index (PAI). The study used clinical data for patients accessing treatment for PTSD in a primary care mental health service in the north of England. PTSD patients received either EMDR (N = 75) or Tf-CBT (N = 242). The Patient Health Questionnaire (PHQ-9) was used as an outcome measure for depressive symptoms associated with PTSD. Variables predicting differential treatment response were identified using an automated variable selection approach (genetic algorithm) and afterwards included in regression models, allowing the calculation of each patient's PAI. Age, employment status, gender, and functional impairment were identified as relevant variables for Tf-CBT. For EMDR, baseline depressive symptoms as well as prescribed antidepressant medication were selected as predictor variables. Fifty-six percent of the patients (n = 125) had a PAI equal or higher than one standard deviation. From those patients, 62 (50%) did not receive their model-predicted treatment and could have benefited from a treatment assignment based on the PAI. Using a PAI-based algorithm has the potential to improve clinical decision making and to enhance individual patient outcomes, although further replication is necessary before such an approach can be implemented in prospective studies. © 2018 Wiley Periodicals, Inc.
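A minimal sketch of the personalized advantage index (PAI) computation: fit one outcome model per treatment arm on that arm's patients, predict both counterfactual outcomes for every patient, and take the difference. The synthetic data, plain linear models, and treatment coding are illustrative assumptions; the paper first selects predictors with a genetic algorithm.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
X = rng.standard_normal((n, 4))                    # e.g. age, impairment, baseline score, ...
treatment = rng.integers(0, 2, size=n)             # 0 = Tf-CBT, 1 = EMDR (assumed coding)
# Simulated post-treatment symptom score where variable 0 moderates the effect.
y = 10 + X[:, 1] - 2 * treatment * X[:, 0] + rng.standard_normal(n)

model_cbt = LinearRegression().fit(X[treatment == 0], y[treatment == 0])
model_emdr = LinearRegression().fit(X[treatment == 1], y[treatment == 1])

# PAI > 0 here means a lower (better) predicted symptom score under EMDR than Tf-CBT.
pai = model_cbt.predict(X) - model_emdr.predict(X)
print("share of patients predicted to do better under EMDR:", np.mean(pai > 0).round(2))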
A survey of variable selection methods in two Chinese epidemiology journals
2010-01-01
Background Although much has been written on developing better procedures for variable selection, there is little research on how it is practiced in actual studies. This review surveys the variable selection methods reported in two high-ranking Chinese epidemiology journals. Methods Articles published in 2004, 2006, and 2008 in the Chinese Journal of Epidemiology and the Chinese Journal of Preventive Medicine were reviewed. Five categories of methods were identified whereby variables were selected using: A - bivariate analyses; B - multivariable analysis; e.g. stepwise or individual significance testing of model coefficients; C - first bivariate analyses, followed by multivariable analysis; D - bivariate analyses or multivariable analysis; and E - other criteria like prior knowledge or personal judgment. Results Among the 287 articles that reported using variable selection methods, 6%, 26%, 30%, 21%, and 17% were in categories A through E, respectively. One hundred sixty-three studies selected variables using bivariate analyses, 80% (130/163) via multiple significance testing at the 5% alpha-level. Of the 219 multivariable analyses, 97 (44%) used stepwise procedures, 89 (41%) tested individual regression coefficients, but 33 (15%) did not mention how variables were selected. Sixty percent (58/97) of the stepwise routines also did not specify the algorithm and/or significance levels. Conclusions The variable selection methods reported in the two journals were limited in variety, and details were often missing. Many studies still relied on problematic techniques like stepwise procedures and/or multiple testing of bivariate associations at the 0.05 alpha-level. These deficiencies should be rectified to safeguard the scientific validity of articles published in Chinese epidemiology journals. PMID:20920252
On the Latent Variable Interpretation in Sum-Product Networks.
Peharz, Robert; Gens, Robert; Pernkopf, Franz; Domingos, Pedro
2017-10-01
One of the central themes in Sum-Product networks (SPNs) is the interpretation of sum nodes as marginalized latent variables (LVs). This interpretation yields an increased syntactic or semantic structure, allows the application of the EM algorithm and enables MPE inference to be performed efficiently. In the literature, the LV interpretation was justified by explicitly introducing the indicator variables corresponding to the LVs' states. However, as pointed out in this paper, this approach is in conflict with the completeness condition in SPNs and does not fully specify the probabilistic model. We propose a remedy for this problem by modifying the original approach for introducing the LVs, which we call SPN augmentation. We discuss conditional independencies in augmented SPNs, formally establish the probabilistic interpretation of the sum-weights and give an interpretation of augmented SPNs as Bayesian networks. Based on these results, we find a sound derivation of the EM algorithm for SPNs. Furthermore, the Viterbi-style algorithm for MPE proposed in the literature was never proven to be correct. We show that this is indeed a correct algorithm, when applied to selective SPNs, and in particular when applied to augmented SPNs. Our theoretical results are confirmed in experiments on synthetic data and 103 real-world datasets.
An Update on Statistical Boosting in Biomedicine.
Mayr, Andreas; Hofner, Benjamin; Waldmann, Elisabeth; Hepp, Tobias; Meyer, Sebastian; Gefeller, Olaf
2017-01-01
Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine learning approach with classical statistical modelling, offering various practical advantages like automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments on statistical boosting regarding variable selection, functional regression, and advanced time-to-event modelling. Additionally, we provide a short overview on relevant applications of statistical boosting in biomedicine.
NASA Astrophysics Data System (ADS)
Brzęczek, Mateusz; Bartela, Łukasz
2013-12-01
This paper presents the parameters of the reference oxy combustion block operating with supercritical steam parameters, equipped with an air separation unit and a carbon dioxide capture and compression installation. The possibility to recover the heat in the analyzed power plant is discussed. The decision variables and the thermodynamic functions for the optimization algorithm were identified. The principles of operation of genetic algorithm and methodology of conducted calculations are presented. The sensitivity analysis was performed for the best solutions to determine the effects of the selected variables on the power and efficiency of the unit. Optimization of the heat recovery from the air separation unit, flue gas condition and CO2 capture and compression installation using genetic algorithm was designed to replace the low-pressure section of the regenerative water heaters of steam cycle in analyzed unit. The result was to increase the power and efficiency of the entire power plant.
Mujalli, Randa Oqab; de Oña, Juan
2011-10-01
This study describes a method for reducing the number of variables frequently considered in modeling the severity of traffic accidents. The method's efficiency is assessed by constructing Bayesian networks (BN). It is based on a two stage selection process. Several variable selection algorithms, commonly used in data mining, are applied in order to select subsets of variables. BNs are built using the selected subsets and their performance is compared with the original BN (with all the variables) using five indicators. The BNs that improve the indicators' values are further analyzed for identifying the most significant variables (accident type, age, atmospheric factors, gender, lighting, number of injured, and occupant involved). A new BN is built using these variables, where the results of the indicators indicate, in most of the cases, a statistically significant improvement with respect to the original BN. It is possible to reduce the number of variables used to model traffic accidents injury severity through BNs without reducing the performance of the model. The study provides the safety analysts a methodology that could be used to minimize the number of variables used in order to determine efficiently the injury severity of traffic accidents without reducing the performance of the model. Copyright © 2011 Elsevier Ltd. All rights reserved.
A Novel Multiobjective Evolutionary Algorithm Based on Regression Analysis
Song, Zhiming; Wang, Maocai; Dai, Guangming; Vasile, Massimiliano
2015-01-01
As is known, the Pareto set of a continuous multiobjective optimization problem with m objective functions is a piecewise continuous (m − 1)-dimensional manifold in the decision space under some mild conditions. However, how to utilize the regularity to design multiobjective optimization algorithms has become the research focus. In this paper, based on this regularity, a model-based multiobjective evolutionary algorithm with regression analysis (MMEA-RA) is put forward to solve continuous multiobjective optimization problems with variable linkages. In the algorithm, the optimization problem is modelled as a promising area in the decision space by a probability distribution, and the centroid of the probability distribution is (m − 1)-dimensional piecewise continuous manifold. The least squares method is used to construct such a model. A selection strategy based on the nondominated sorting is used to choose the individuals to the next generation. The new algorithm is tested and compared with NSGA-II and RM-MEDA. The result shows that MMEA-RA outperforms RM-MEDA and NSGA-II on the test instances with variable linkages. At the same time, MMEA-RA has higher efficiency than the other two algorithms. A few shortcomings of MMEA-RA have also been identified and discussed in this paper. PMID:25874246
Evaluation of variable selection methods for random forests and omics data sets.
Degenhardt, Frauke; Seifert, Stephan; Szymczak, Silke
2017-10-16
Machine learning methods, and in particular random forests, are promising approaches for prediction based on high dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the objective is the identification of involved variables to find active networks and pathways, approaches that aim to select all relevant variables should be preferred. We evaluated several variable selection procedures based on simulated data as well as publicly available experimental methylation and gene expression data. Our comparison included the Boruta algorithm, the Vita method, recurrent relative variable importance, a permutation approach and its parametric variant (Altmann) as well as recursive feature elimination (RFE). In our simulation studies, Boruta was the most powerful approach, followed closely by the Vita method. Both approaches demonstrated similar stability in variable selection, while Vita was the most robust approach under a pure null model without any predictor variables related to the outcome. In the analysis of the different experimental data sets, Vita demonstrated slightly better stability in variable selection and was less computationally intensive than Boruta. In conclusion, we recommend the Boruta and Vita approaches for the analysis of high-dimensional data sets. Vita is considerably faster than Boruta and thus more suitable for large data sets, but only Boruta can also be applied in low-dimensional settings. © The Author 2017. Published by Oxford University Press.
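The core idea behind Boruta-style all-relevant selection, comparing real-feature importances against permuted "shadow" copies, can be sketched in a few lines. This is a simplified single-pass illustration, not the Boruta package itself; the full algorithm iterates and applies statistical testing.

```python
# Single-iteration sketch of the shadow-feature idea underlying Boruta.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Shadow features: each column shuffled independently, destroying any signal.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

X_aug = np.hstack([X, X_shadow])
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)

imp = rf.feature_importances_
real_imp, shadow_imp = imp[:X.shape[1]], imp[X.shape[1]:]
threshold = shadow_imp.max()                   # best importance achievable by pure noise
selected = np.where(real_imp > threshold)[0]
print("tentatively relevant features:", selected)
```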
Multiple-input multiple-output causal strategies for gene selection.
Bontempi, Gianluca; Haibe-Kains, Benjamin; Desmedt, Christine; Sotiriou, Christos; Quackenbush, John
2011-11-25
Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. Although these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes. The latter is essentially related to the fact that high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting. We show in a synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition we show, in a meta-analysis study of six publicly available breast cancer microarray datasets, that the improvement also occurs in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection. Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.
Genetic-evolution-based optimization methods for engineering design
NASA Technical Reports Server (NTRS)
Rao, S. S.; Pan, T. S.; Dhingra, A. K.; Venkayya, V. B.; Kumar, V.
1990-01-01
This paper presents the applicability of a biological model, based on genetic evolution, for engineering design optimization. Algorithms embodying the ideas of reproduction, crossover, and mutation are developed and applied to solve different types of structural optimization problems. Both continuous and discrete variable optimization problems are solved. A two-bay truss for maximum fundamental frequency is considered to demonstrate the continuous variable case. The selection of locations of actuators in an actively controlled structure, for minimum energy dissipation, is considered to illustrate the discrete variable case.
NASA Technical Reports Server (NTRS)
Arduini, R. F.; Aherron, R. M.; Samms, R. W.
1984-01-01
A computational model of the deterministic and stochastic processes involved in multispectral remote sensing was designed to evaluate the performance of sensor systems and data processing algorithms for spectral feature classification. Accuracy in distinguishing between categories of surfaces or between specific types is developed as a means to compare sensor systems and data processing algorithms. The model allows studies to be made of the effects of variability of the atmosphere and of surface reflectance, as well as the effects of channel selection and sensor noise. Examples of these effects are shown.
Monte Carlo PDF method for turbulent reacting flow in a jet-stirred reactor
NASA Astrophysics Data System (ADS)
Roekaerts, D.
1992-01-01
A stochastic algorithm for the solution of the modeled scalar probability density function (PDF) transport equation for single-phase turbulent reacting flow is described. Cylindrical symmetry is assumed. The PDF is represented by ensembles of N representative values of the thermochemical variables in each cell of a nonuniform finite-difference grid and operations on these elements representing convection, diffusion, mixing and reaction are derived. A simplified model and solution algorithm which neglects the influence of turbulent fluctuations on mean reaction rates is also described. Both algorithms are applied to a selectivity problem in a real reactor.
Firefly as a novel swarm intelligence variable selection method in spectroscopy.
Goodarzi, Mohammad; dos Santos Coelho, Leandro
2014-12-10
A critical step in multivariate calibration is wavelength selection, which is used to build models with better prediction performance when applied to spectral data. Up to now, many feature selection techniques have been developed. Among all different types of feature selection techniques, those based on swarm intelligence optimization methodologies are more interesting since they are usually simulated based on animal and insect life behavior to, e.g., find the shortest path between a food source and their nests. This decision is made by a crowd, leading to a more robust model with less falling in local minima during the optimization cycle. This paper represents a novel feature selection approach to the selection of spectroscopic data, leading to more robust calibration models. The performance of the firefly algorithm, a swarm intelligence paradigm, was evaluated and compared with genetic algorithm and particle swarm optimization. All three techniques were coupled with partial least squares (PLS) and applied to three spectroscopic data sets. They demonstrate improved prediction results in comparison to when only a PLS model was built using all wavelengths. Results show that firefly algorithm as a novel swarm paradigm leads to a lower number of selected wavelengths while the prediction performance of built PLS stays the same. Copyright © 2014. Published by Elsevier B.V.
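A bare-bones continuous firefly algorithm is sketched below as a generic minimiser on a toy objective. It is not the paper's PLS-coupled wavelength selector; for wavelength selection the positions would be binarised into include/exclude masks and the objective replaced by a cross-validated PLS prediction error.

```python
# Minimal firefly algorithm: brighter (lower-cost) fireflies attract dimmer ones.
import numpy as np

def firefly_minimize(cost, dim, n_fireflies=20, n_iter=100,
                     alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, (n_fireflies, dim))         # firefly positions
    f = np.array([cost(x) for x in X])                 # brightness = -cost
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if f[j] < f[i]:                        # j is brighter, so i moves towards j
                    r2 = np.sum((X[i] - X[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2) # attractiveness decays with distance
                    X[i] += beta * (X[j] - X[i]) + alpha * rng.normal(size=dim)
                    f[i] = cost(X[i])
        alpha *= 0.97                                  # slowly damp the random walk
    best = np.argmin(f)
    return X[best], f[best]

x_best, f_best = firefly_minimize(lambda x: np.sum(x ** 2), dim=5)
print("best cost:", round(f_best, 4))
```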
Boonjing, Veera; Intakosum, Sarun
2016-01-01
This study investigated the use of Artificial Neural Network (ANN) and Genetic Algorithm (GA) for prediction of Thailand's SET50 index trend. ANN is a widely accepted machine learning method that uses past data to predict future trend, while GA is an algorithm that can find better subsets of input variables for importing into ANN, hence enabling more accurate prediction by its efficient feature selection. The imported data were chosen technical indicators highly regarded by stock analysts, each represented by 4 input variables that were based on past time spans of 4 different lengths: 3-, 5-, 10-, and 15-day spans before the day of prediction. This import undertaking generated a big set of diverse input variables with an exponentially higher number of possible subsets that GA culled down to a manageable number of more effective ones. SET50 index data of the past 6 years, from 2009 to 2014, were used to evaluate this hybrid intelligence prediction accuracy, and the hybrid's prediction results were found to be more accurate than those made by a method using only one input variable for one fixed length of past time span. PMID:27974883
Inthachot, Montri; Boonjing, Veera; Intakosum, Sarun
2016-01-01
This study investigated the use of Artificial Neural Network (ANN) and Genetic Algorithm (GA) for prediction of Thailand's SET50 index trend. ANN is a widely accepted machine learning method that uses past data to predict future trend, while GA is an algorithm that can find better subsets of input variables for importing into ANN, hence enabling more accurate prediction by its efficient feature selection. The imported data were chosen technical indicators highly regarded by stock analysts, each represented by 4 input variables that were based on past time spans of 4 different lengths: 3-, 5-, 10-, and 15-day spans before the day of prediction. This import undertaking generated a big set of diverse input variables with an exponentially higher number of possible subsets that GA culled down to a manageable number of more effective ones. SET50 index data of the past 6 years, from 2009 to 2014, were used to evaluate this hybrid intelligence prediction accuracy, and the hybrid's prediction results were found to be more accurate than those made by a method using only one input variable for one fixed length of past time span.
Spectral unmixing of urban land cover using a generic library approach
NASA Astrophysics Data System (ADS)
Degerickx, Jeroen; Lordache, Marian-Daniel; Okujeni, Akpona; Hermy, Martin; van der Linden, Sebastian; Somers, Ben
2016-10-01
Remote sensing based land cover classification in urban areas generally requires the use of subpixel classification algorithms to take into account the high spatial heterogeneity. These spectral unmixing techniques often rely on spectral libraries, i.e. collections of pure material spectra (endmembers, EM), which ideally cover the large EM variability typically present in urban scenes. Despite the advent of several (semi-) automated EM detection algorithms, the collection of such image-specific libraries remains a tedious and time-consuming task. As an alternative, we suggest the use of a generic urban EM library, containing material spectra under varying conditions, acquired from different locations and sensors. This approach requires an efficient EM selection technique, capable of only selecting those spectra relevant for a specific image. In this paper, we evaluate and compare the potential of different existing library pruning algorithms (Iterative Endmember Selection and MUSIC) using simulated hyperspectral (APEX) data of the Brussels metropolitan area. In addition, we develop a new hybrid EM selection method which is shown to be highly efficient in dealing with both image-specific and generic libraries, subsequently yielding more robust land cover classification results compared to existing methods. Future research will include further optimization of the proposed algorithm and additional tests on both simulated and real hyperspectral data.
Accelerating rejection-based simulation of biochemical reactions with bounded acceptance probability
NASA Astrophysics Data System (ADS)
Thanh, Vo Hong; Priami, Corrado; Zunino, Roberto
2016-06-01
Stochastic simulation of large biochemical reaction networks is often computationally expensive due to the disparate reaction rates and high variability of population of chemical species. An approach to accelerate the simulation is to allow multiple reaction firings before performing update by assuming that reaction propensities are changing by a negligible amount during a time interval. Species with small population in the firings of fast reactions significantly affect both performance and accuracy of this simulation approach. It is even worse when these small population species are involved in a large number of reactions. We present in this paper a new approximate algorithm to cope with this problem. It is based on bounding the acceptance probability of a reaction selected by the exact rejection-based simulation algorithm, which employs propensity bounds of reactions and the rejection-based mechanism to select next reaction firings. The reaction is ensured to be selected to fire with an acceptance rate greater than a predefined probability, in which the selection becomes exact if the probability is set to one. Our new algorithm improves the computational cost for selecting the next reaction firing and reduces the cost of updating the propensities of reactions.
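The rejection-based selection mechanism referred to above can be illustrated with a toy two-reaction system. This is a simplified sketch of an RSSA-style selection step with made-up propensity bounds, not the authors' bounded-acceptance variant.

```python
# Sketch of rejection-based selection of the next reaction firing.
# Each reaction has propensity bounds [a_lo, a_hi] derived from bounds on
# species populations; exact propensities are evaluated only when needed.
import numpy as np

rng = np.random.default_rng(0)

def exact_propensity(j, state):
    # toy system: R1: A -> B (rate 1.0), R2: A + B -> C (rate 0.005)
    a, b = state
    return [1.0 * a, 0.005 * a * b][j]

def select_next_reaction(state, a_lo, a_hi):
    """Return (reaction index, time increment) using rejection sampling."""
    a0_hi = a_hi.sum()
    t = 0.0
    while True:
        t += rng.exponential(1.0 / a0_hi)            # trial waiting times accumulate
        j = rng.choice(len(a_hi), p=a_hi / a0_hi)    # candidate drawn from upper bounds
        u = rng.random()
        if u <= a_lo[j] / a_hi[j]:                   # accept without exact evaluation
            return j, t
        if u <= exact_propensity(j, state) / a_hi[j]:
            return j, t                              # accept after exact evaluation
        # otherwise reject the candidate and retry

state = np.array([1000, 50])
a_lo = np.array([900.0, 0.005 * 900 * 40])           # illustrative propensity bounds
a_hi = np.array([1100.0, 0.005 * 1100 * 60])
j, dt = select_next_reaction(state, a_lo, a_hi)
print("fire reaction", j, "after", round(dt, 5), "time units")
```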
NASA Astrophysics Data System (ADS)
Morello, Giuseppe; Morris, P. W.; Van Dyk, S. D.; Marston, A. P.; Mauerhan, J. C.
2018-01-01
We have investigated and applied machine-learning algorithms for infrared colour selection of Galactic Wolf-Rayet (WR) candidates. Objects taken from the Spitzer Galactic Legacy Infrared Midplane Survey Extraordinaire (GLIMPSE) catalogue of the infrared objects in the Galactic plane can be classified into different stellar populations based on the colours inferred from their broad-band photometric magnitudes [J, H and Ks from 2 Micron All Sky Survey (2MASS), and the four Spitzer/IRAC bands]. The algorithms tested in this pilot study are variants of the k-nearest neighbours approach, which is ideal for exploratory studies of classification problems where interrelations between variables and classes are complicated. The aims of this study are (1) to provide an automated tool to select reliable WR candidates and potentially other classes of objects, (2) to measure the efficiency of infrared colour selection at performing these tasks and (3) to lay the groundwork for statistically inferring the total number of WR stars in our Galaxy. We report the performance results obtained over a set of known objects and selected candidates for which we have carried out follow-up spectroscopic observations, and confirm the discovery of four new WR stars.
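A minimal version of the k-nearest-neighbours colour classification described above could look like the sketch below. Synthetic colour features stand in for the 2MASS/IRAC photometry, and the class balance and decision threshold are invented; this is not the authors' tuned pipeline.

```python
# k-NN classification of objects from broad-band colours (toy data).
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Pretend colours (e.g. J-Ks, Ks-[3.6], [3.6]-[4.5]) for two classes.
n = 400
X_wr = rng.normal(loc=[1.8, 0.6, 0.3], scale=0.3, size=(n // 4, 3))
X_other = rng.normal(loc=[0.8, 0.2, 0.1], scale=0.3, size=(3 * n // 4, 3))
X = np.vstack([X_wr, X_other])
y = np.array([1] * (n // 4) + [0] * (3 * n // 4))    # 1 = WR candidate

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print("cross-validated F1:", round(scores.mean(), 3))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model.fit(X_tr, y_tr)
print("held-out accuracy:", round(model.score(X_te, y_te), 3))
```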
A firefly algorithm for optimum design of new-generation beams
NASA Astrophysics Data System (ADS)
Erdal, F.
2017-06-01
This research addresses the minimum weight design of new-generation steel beams with sinusoidal openings using a metaheuristic search technique, namely the firefly method. The proposed algorithm is also used to compare the optimum design results of sinusoidal web-expanded beams with steel castellated and cellular beams. Optimum design problems of all beams are formulated according to the design limitations stipulated by the Steel Construction Institute. The design methods adopted in these publications are consistent with BS 5950 specifications. The formulation of the design problem considering the above-mentioned limitations turns out to be a discrete programming problem. The design algorithms based on the technique select the optimum universal beam sections, dimensional properties of sinusoidal, hexagonal and circular holes, and the total number of openings along the beam as design variables. Furthermore, this selection is also carried out such that the behavioural limitations are satisfied. Numerical examples are presented, where the suggested algorithm is implemented to achieve the minimum weight design of these beams subjected to loading combinations.
Ice_Sheets_CCI: Essential Climate Variables for the Greenland Ice Sheet
NASA Astrophysics Data System (ADS)
Forsberg, R.; Sørensen, L. S.; Khan, A.; Aas, C.; Evansberget, D.; Adalsteinsdottir, G.; Mottram, R.; Andersen, S. B.; Ahlstrøm, A.; Dall, J.; Kusk, A.; Merryman, J.; Hvidberg, C.; Khvorostovsky, K.; Nagler, T.; Rott, H.; Scharrer, M.; Shepard, A.; Ticconi, F.; Engdahl, M.
2012-04-01
As part of the ESA Climate Change Initiative (www.esa-cci.org) a long-term project "ice_sheets_cci" started January 1, 2012, in addition to the existing 11 projects already generating Essential Climate Variables (ECV) for the Global Climate Observing System (GCOS). The "ice_sheets_cci" goal is to generate a consistent, long-term and timely set of key climate parameters for the Greenland ice sheet, to maximize the impact of European satellite data on climate research, from missions such as ERS, Envisat and the future Sentinel satellites. The climate parameters to be provided, at first in a research context, and in the longer perspective by a routine production system, would be grids of Greenland ice sheet elevation changes from radar altimetry, ice velocity from repeat-pass SAR data, as well as time series of marine-terminating glacier calving front locations and grounding lines for floating-front glaciers. The ice_sheets_cci project will involve a broad interaction of the relevant cryosphere and climate communities, first through user consultations and specifications, and later in 2012 optional participation in "best" algorithm selection activities, where prototype climate parameter variables for selected regions and time frames will be produced and validated using an objective set of criteria ("Round-Robin intercomparison"). This comparative algorithm selection activity will be completely open, and we invite all interested scientific groups with relevant experience to participate. The results of the "Round Robin" exercise will form the algorithmic basis for the future ECV production system. First prototype results will be generated and validated by early 2014. The poster will show the planned outline of the project and some early prototype results.
A Distributed and Energy-Efficient Algorithm for Event K-Coverage in Underwater Sensor Networks
Jiang, Peng; Xu, Yiming; Liu, Jun
2017-01-01
For event dynamic K-coverage algorithms, each management node selects its assistant node by using a greedy algorithm without considering the residual energy and situations in which a node is selected by several events. This approach affects network energy consumption and balance. Therefore, this study proposes a distributed and energy-efficient event K-coverage algorithm (DEEKA). After the network achieves 1-coverage, the nodes that detect the same event compete for the event management node with the number of candidate nodes and the average residual energy, as well as the distance to the event. Second, each management node estimates the probability of its neighbor nodes’ being selected by the event it manages with the distance level, the residual energy level, and the number of dynamic coverage event of these nodes. Third, each management node establishes an optimization model that uses expectation energy consumption and the residual energy variance of its neighbor nodes and detects the performance of the events it manages as targets. Finally, each management node uses a constrained non-dominated sorting genetic algorithm (NSGA-II) to obtain the Pareto set of the model and the best strategy via technique for order preference by similarity to an ideal solution (TOPSIS). The algorithm first considers the effect of harsh underwater environments on information collection and transmission. It also considers the residual energy of a node and a situation in which the node is selected by several other events. Simulation results show that, unlike the on-demand variable sensing K-coverage algorithm, DEEKA balances and reduces network energy consumption, thereby prolonging the network’s best service quality and lifetime. PMID:28106837
A Distributed and Energy-Efficient Algorithm for Event K-Coverage in Underwater Sensor Networks.
Jiang, Peng; Xu, Yiming; Liu, Jun
2017-01-19
For event dynamic K-coverage algorithms, each management node selects its assistant node by using a greedy algorithm without considering the residual energy and situations in which a node is selected by several events. This approach affects network energy consumption and balance. Therefore, this study proposes a distributed and energy-efficient event K-coverage algorithm (DEEKA). After the network achieves 1-coverage, the nodes that detect the same event compete for the event management node with the number of candidate nodes and the average residual energy, as well as the distance to the event. Second, each management node estimates the probability of its neighbor nodes' being selected by the event it manages with the distance level, the residual energy level, and the number of dynamic coverage event of these nodes. Third, each management node establishes an optimization model that uses expectation energy consumption and the residual energy variance of its neighbor nodes and detects the performance of the events it manages as targets. Finally, each management node uses a constrained non-dominated sorting genetic algorithm (NSGA-II) to obtain the Pareto set of the model and the best strategy via technique for order preference by similarity to an ideal solution (TOPSIS). The algorithm first considers the effect of harsh underwater environments on information collection and transmission. It also considers the residual energy of a node and a situation in which the node is selected by several other events. Simulation results show that, unlike the on-demand variable sensing K-coverage algorithm, DEEKA balances and reduces network energy consumption, thereby prolonging the network's best service quality and lifetime.
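The TOPSIS step used by DEEKA to pick a single strategy from the NSGA-II Pareto set is a small, generic calculation. A standalone sketch follows; the criteria values and weights are invented for illustration.

```python
# TOPSIS: rank alternatives by closeness to the ideal solution.
import numpy as np

def topsis(M, weights, benefit):
    """M: alternatives x criteria, benefit[i]=True if higher is better."""
    N = M / np.linalg.norm(M, axis=0)                # vector-normalise each criterion
    V = N * weights
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_plus = np.linalg.norm(V - ideal, axis=1)
    d_minus = np.linalg.norm(V - anti, axis=1)
    return d_minus / (d_plus + d_minus)              # closeness coefficient in [0, 1]

# Toy Pareto set: columns = expected energy use (min), energy variance (min),
# detection performance (max).
M = np.array([[3.2, 0.40, 0.91],
              [2.8, 0.55, 0.88],
              [3.6, 0.30, 0.94]])
scores = topsis(M, weights=np.array([0.4, 0.3, 0.3]),
                benefit=np.array([False, False, True]))
print("best strategy:", int(np.argmax(scores)), "scores:", scores.round(3))
```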
Comparison Between Supervised and Unsupervised Classifications of Neuronal Cell Types: A Case Study
Guerra, Luis; McGarry, Laura M; Robles, Víctor; Bielza, Concha; Larrañaga, Pedro; Yuste, Rafael
2011-01-01
In the study of neural circuits, it becomes essential to discern the different neuronal cell types that build the circuit. Traditionally, neuronal cell types have been classified using qualitative descriptors. More recently, several attempts have been made to classify neurons quantitatively, using unsupervised clustering methods. While useful, these algorithms do not take advantage of previous information known to the investigator, which could improve the classification task. For neocortical GABAergic interneurons, the problem to discern among different cell types is particularly difficult and better methods are needed to perform objective classifications. Here we explore the use of supervised classification algorithms to classify neurons based on their morphological features, using a database of 128 pyramidal cells and 199 interneurons from mouse neocortex. To evaluate the performance of different algorithms we used, as a “benchmark,” the test to automatically distinguish between pyramidal cells and interneurons, defining “ground truth” by the presence or absence of an apical dendrite. We compared hierarchical clustering with a battery of different supervised classification algorithms, finding that supervised classifications outperformed hierarchical clustering. In addition, the selection of subsets of distinguishing features enhanced the classification accuracy for both sets of algorithms. The analysis of selected variables indicates that dendritic features were most useful to distinguish pyramidal cells from interneurons when compared with somatic and axonal morphological variables. We conclude that supervised classification algorithms are better matched to the general problem of distinguishing neuronal cell types when some information on these cell groups, in our case being pyramidal or interneuron, is known a priori. As a spin-off of this methodological study, we provide several methods to automatically distinguish neocortical pyramidal cells from interneurons, based on their morphologies. © 2010 Wiley Periodicals, Inc. Develop Neurobiol 71: 71–82, 2011 PMID:21154911
NASA Astrophysics Data System (ADS)
Cheng, Jun-Hu; Jin, Huali; Liu, Zhiwei
2018-01-01
The feasibility of developing a multispectral imaging method using important wavelengths from hyperspectral images selected by genetic algorithm (GA), successive projection algorithm (SPA) and regression coefficient (RC) methods for modeling and predicting protein content in peanut kernel was investigated for the first time. A partial least squares regression (PLSR) calibration model was established between the spectral data from the selected optimal wavelengths and the reference measured protein content, which ranged from 23.46% to 28.43%. The RC-PLSR model established using eight key wavelengths (1153, 1567, 1972, 2143, 2288, 2339, 2389 and 2446 nm) showed the best predictive results, with a coefficient of determination of prediction (R2P) of 0.901, a root mean square error of prediction (RMSEP) of 0.108 and a residual predictive deviation (RPD) of 2.32. Based on the obtained best model and image processing algorithms, the distribution maps of protein content were generated. The overall results of this study indicated that developing a rapid and online multispectral imaging system using the feature wavelengths and PLSR analysis is feasible and promising for determination of the protein content in peanut kernels.
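The final modelling step, a PLS regression on the selected wavelengths evaluated by R2P, RMSEP and RPD, is a standard calculation. The sketch below uses synthetic spectra and a hand-picked wavelength subset standing in for the RC/GA/SPA selection; it is not the paper's data or model.

```python
# PLS regression on a subset of "selected" wavelengths, with R2P/RMSEP/RPD.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 120, 200
X = rng.normal(size=(n_samples, n_wavelengths))
true_bands = [10, 55, 120, 150]                       # pretend informative wavelengths
y = X[:, true_bands] @ np.array([2.0, -1.0, 1.5, 0.5]) + rng.normal(scale=0.3, size=n_samples)

selected = true_bands + [30, 90, 170, 190]            # stand-in for the selected wavelengths
X_tr, X_te, y_tr, y_te = train_test_split(X[:, selected], y, random_state=0)

pls = PLSRegression(n_components=4).fit(X_tr, y_tr)
y_hat = pls.predict(X_te).ravel()
rmsep = mean_squared_error(y_te, y_hat) ** 0.5
rpd = y_te.std(ddof=1) / rmsep                        # residual predictive deviation
print(f"R2P={r2_score(y_te, y_hat):.3f}  RMSEP={rmsep:.3f}  RPD={rpd:.2f}")
```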
NASA Technical Reports Server (NTRS)
Reid, G. F.
1976-01-01
A technique is presented for determining state variable feedback gains that will place both the poles and zeros of a selected transfer function of a dual-input control system at pre-determined locations in the s-plane. Leverrier's algorithm is used to determine the numerator and denominator coefficients of the closed-loop transfer function as functions of the feedback gains. The values of gain that match these coefficients to those of a pre-selected model are found by solving two systems of linear simultaneous equations. The algorithm has been used in a computer simulation of the CH-47 helicopter to control longitudinal dynamics.
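Leverrier's algorithm (the Faddeev-LeVerrier recursion) produces the characteristic-polynomial coefficients and, through the same recursion, the resolvent matrices that give transfer-function numerator coefficients directly from the state matrices. The sketch below uses a generic second-order single-input single-output example, not the CH-47 model.

```python
# Faddeev-LeVerrier recursion: numerator and denominator coefficients of
# C (sI - A)^-1 B for a single-input single-output channel.
import numpy as np

def leverrier(A, B, C):
    n = A.shape[0]
    M = np.eye(n)                       # M_1 = I
    den = np.zeros(n + 1)               # den[0] s^n + den[1] s^(n-1) + ... + den[n]
    den[0] = 1.0
    num = np.zeros(n)                   # num[0] s^(n-1) + ... + num[n-1]
    for k in range(1, n + 1):
        num[k - 1] = C @ M @ B          # coefficient of s^(n-k) in the numerator
        den[k] = -np.trace(A @ M) / k   # characteristic polynomial coefficient c_{n-k}
        M = A @ M + den[k] * np.eye(n)  # next resolvent matrix M_{k+1}
    return num, den

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([0.0, 1.0])
C = np.array([1.0, 0.0])
num, den = leverrier(A, B, C)
print("numerator coeffs:", num)         # expect [0, 1]  ->  1 / (s^2 + 3s + 2)
print("denominator coeffs:", den)       # expect [1, 3, 2]
```

Matching these coefficients to those of a pre-selected model, as the abstract describes, then reduces to solving linear systems in the unknown feedback gains.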
Dipnall, Joanna F.
2016-01-01
Background Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. Methods The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. Results After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers were selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin with Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). Conclusion The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin. PMID:26848571
Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny
2016-01-01
Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers were selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin with Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.
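A schematic version of the boosting-then-logistic-regression step might look like the sketch below. The data are synthetic, and the published study additionally handles multiple imputation and complex survey weights, which are omitted here.

```python
# Step 1: boosted trees rank candidate biomarkers; step 2: a plain logistic
# regression is refitted on the top-ranked ones for interpretable odds ratios.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=67, n_informative=5,
                           random_state=0)

boost = GradientBoostingClassifier(random_state=0).fit(X, y)
ranking = np.argsort(boost.feature_importances_)[::-1]
top = ranking[:3]                                     # keep a small candidate set
print("top-ranked biomarkers (column indices):", top)

logit = LogisticRegression(max_iter=1000).fit(X[:, top], y)
odds_ratios = np.exp(logit.coef_).ravel()
print("odds ratios for retained biomarkers:", odds_ratios.round(2))
```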
Ließ, Mareike; Schmidt, Johannes; Glaser, Bruno
2016-01-01
Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: Different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms: random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms, including the model tuning and predictor selection, were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged between 0.2 to 17.7 kg m-2, displaying a huge variability with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models' predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward compared to backward selection procedures. Choosing predictors due to their individual performance was vanquished by the two procedures which accounted for predictor interaction.
Constrained Stochastic Extended Redundancy Analysis.
DeSarbo, Wayne S; Hwang, Heungsun; Stadler Blank, Ashley; Kappe, Eelco
2015-06-01
We devise a new statistical methodology called constrained stochastic extended redundancy analysis (CSERA) to examine the comparative impact of various conceptual factors, or drivers, as well as the specific predictor variables that contribute to each driver on designated dependent variable(s). The technical details of the proposed methodology, the maximum likelihood estimation algorithm, and model selection heuristics are discussed. A sports marketing consumer psychology application is provided in a Major League Baseball (MLB) context where the effects of six conceptual drivers of game attendance and their defining predictor variables are estimated. Results compare favorably to those obtained using traditional extended redundancy analysis (ERA).
NASA Astrophysics Data System (ADS)
Vathsala, H.; Koolagudi, Shashidhar G.
2017-01-01
In this paper we discuss a data mining application for predicting peninsular Indian summer monsoon rainfall, and propose an algorithm that combines data mining and statistical techniques. We select likely predictors based on association rules that have the highest confidence levels. We then cluster the selected predictors to reduce their dimensions and use cluster membership values for classification. We derive the predictors from local conditions in southern India, including mean sea level pressure, wind speed, and maximum and minimum temperatures. The global condition variables include southern oscillation and Indian Ocean dipole conditions. The algorithm predicts rainfall in five categories: Flood, Excess, Normal, Deficit and Drought. We use closed itemset mining, cluster membership calculations and a multilayer perceptron function in the algorithm to predict monsoon rainfall in peninsular India. Using Indian Institute of Tropical Meteorology data, we found the prediction accuracy of our proposed approach to be exceptionally good.
Fan, Shu-xiang; Huang, Wen-qian; Li, Jiang-bo; Zhao, Chun-jiang; Zhang, Bao-hua
2014-08-01
This study aimed to improve the precision and robustness of the NIR model for the soluble solids content (SSC) of pear. A total of 160 pears was used, split into calibration (n=120) and prediction (n=40) sets. Different spectral pretreatment methods, including standard normal variate (SNV) and multiplicative scatter correction (MSC), were applied before further analysis. A combination of genetic algorithm (GA) and successive projections algorithm (SPA) was proposed to select the most effective wavelengths after uninformative variable elimination (UVE) from the original spectra, the SNV-pretreated spectra and the MSC-pretreated spectra, respectively. The selected variables were used as inputs to least squares-support vector machine (LS-SVM) models for determining the SSC of pear. The results indicated that the LS-SVM model built using SNV-UVE-GA-SPA on 30 characteristic wavelengths selected from the full spectrum of 3112 wavelengths achieved the optimal performance. The correlation coefficient (Rp) and root mean square error of prediction (RMSEP) for the prediction set were 0.956 and 0.271 for SSC. The model is reliable and the predicted results are effective. The method can meet the requirement of rapid SSC measurement of pear and might be important for the development of portable instruments and online monitoring.
GrammarViz 3.0: Interactive Discovery of Variable-Length Time Series Patterns
Senin, Pavel; Lin, Jessica; Wang, Xing; ...
2018-02-23
The problems of recurrent and anomalous pattern discovery in time series, e.g., motifs and discords, respectively, have received a lot of attention from researchers in the past decade. However, since the pattern search space is usually intractable, most existing detection algorithms require that the patterns have discriminative characteristics and have their length known in advance and provided as input, which is an unreasonable requirement for many real-world problems. In addition, patterns of similar structure, but of different lengths, may co-exist in a time series. In order to address these issues, we have developed algorithms for variable-length time series pattern discovery that are based on symbolic discretization and grammar inference—two techniques whose combination enables the structured reduction of the search space and discovery of the candidate patterns in linear time. In this work, we present GrammarViz 3.0—a software package that provides implementations of the proposed algorithms and a graphical user interface for interactive variable-length time series pattern discovery. The current version of the software provides an alternative grammar inference algorithm that improves the time series motif discovery workflow, and introduces an experimental procedure for automated discretization parameter selection that builds upon the minimum cardinality maximum cover principle and aids the time series recurrent and anomalous pattern discovery.
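The symbolic discretization underlying GrammarViz is SAX: z-normalisation, piecewise aggregate approximation, then Gaussian-quantile breakpoints. Below is a compact sketch of that step alone, with made-up data and parameters; it is not the GrammarViz implementation or its grammar-inference stage.

```python
# SAX: turn a numeric window into a short symbolic word.
import numpy as np
from scipy.stats import norm

def sax_word(window, n_segments=4, alphabet_size=4):
    x = (window - window.mean()) / (window.std() + 1e-12)    # z-normalise the window
    paa = x.reshape(n_segments, -1).mean(axis=1)             # piecewise aggregate means
    # breakpoints that split N(0,1) into equiprobable regions
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)
    return "".join(chr(ord("a") + s) for s in symbols)

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 6 * np.pi, 240)) + 0.1 * rng.normal(size=240)
window_len = 24
words = [sax_word(series[i:i + window_len]) for i in range(0, 240 - window_len, 6)]
print(words[:8])      # repeated words hint at motifs; rare words hint at discords
```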
NASA Astrophysics Data System (ADS)
Hanan, Lu; Qiushi, Li; Shaobin, Li
2016-12-01
This paper presents an integrated optimization design method in which uniform design, response surface methodology and genetic algorithm are used in combination. In detail, uniform design is used to select the experimental sampling points in the experimental domain and the system performance is evaluated by means of computational fluid dynamics to construct a database. After that, response surface methodology is employed to generate a surrogate mathematical model relating the optimization objective and the design variables. Subsequently, genetic algorithm is adopted and applied to the surrogate model to acquire the optimal solution in the case of satisfying some constraints. The method has been applied to the optimization design of an axisymmetric diverging duct, dealing with three design variables including one qualitative variable and two quantitative variables. The method of modeling and optimization design performs well in improving the duct aerodynamic performance and can be also applied to wider fields of mechanical design and seen as a useful tool for engineering designers, by reducing the design time and computation consumption.
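The chain "sample design points, fit a response surface, search the surrogate with an evolutionary algorithm" can be sketched as follows. In this toy version a random design, a quadratic polynomial and SciPy's differential evolution stand in for the paper's uniform design, RSM model, genetic algorithm and CFD evaluations; the objective function is invented.

```python
# Surrogate-based optimisation: polynomial response surface + evolutionary search.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def expensive_simulation(x):            # placeholder for a CFD evaluation
    return (x[0] - 0.3) ** 2 + 2.0 * (x[1] + 0.5) ** 2 + 0.1 * x[0] * x[1]

rng = np.random.default_rng(0)
X_design = rng.uniform(-1, 1, size=(30, 2))          # sampled design points
y_design = np.array([expensive_simulation(x) for x in X_design])

surrogate = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
surrogate.fit(X_design, y_design)                    # quadratic response surface

result = differential_evolution(
    lambda x: float(surrogate.predict(x.reshape(1, -1))[0]),
    bounds=[(-1, 1), (-1, 1)], seed=0)               # evolutionary search on the surrogate
print("surrogate optimum:", result.x.round(3),
      "true value there:", round(expensive_simulation(result.x), 4))
```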
Encke-Beta Predictor for Orion Burn Targeting and Guidance
NASA Technical Reports Server (NTRS)
Robinson, Shane; Scarritt, Sara; Goodman, John L.
2016-01-01
The state vector prediction algorithm selected for Orion on-board targeting and guidance is known as the Encke-Beta method. Encke-Beta uses a universal anomaly (beta) as the independent variable, valid for circular, elliptical, parabolic, and hyperbolic orbits. The variable, related to the change in eccentric anomaly, results in integration steps that cover smaller arcs of the trajectory at or near perigee, when velocity is higher. Some burns in the EM-1 and EM-2 mission plans are much longer than burns executed with the Apollo and Space Shuttle vehicles. Burn length, as well as hyperbolic trajectories, has driven the use of the Encke-Beta numerical predictor by the predictor/corrector guidance algorithm in place of legacy analytic thrust and gravity integrals.
Design and Application of Drought Indexes in Highly Regulated Mediterranean Water Systems
NASA Astrophysics Data System (ADS)
Castelletti, A.; Zaniolo, M.; Giuliani, M.
2017-12-01
Costs of drought are progressively increasing due to the undergoing alteration of hydro-meteorological regimes induced by climate change. Although drought management is largely studied in the literature, most of the traditional drought indexes fail in detecting critical events in highly regulated systems, which generally rely on ad-hoc formulations and cannot be generalized to different contexts. In this study, we contribute a novel framework for the design of a basin-customized drought index. This index represents a surrogate of the state of the basin and is computed by combining the available information about the water available in the system to reproduce a representative target variable for the drought condition of the basin (e.g., water deficit). To select the relevant variables and combinations thereof, we use an advanced feature extraction algorithm called Wrapper for Quasi Equally Informative Subset Selection (W-QEISS). W-QEISS relies on a multi-objective evolutionary algorithm to find Pareto-efficient subsets of variables by maximizing the wrapper accuracy, minimizing the number of selected variables, and optimizing relevance and redundancy of the subset. The accuracy objective is evaluated through the calibration of an extreme learning machine of the water deficit for each candidate subset of variables, with the index selected from the resulting solutions identifying a suitable compromise between accuracy, cardinality, relevance, and redundancy. The approach is tested on Lake Como, Italy, a regulated lake mainly operated for irrigation supply. In the absence of an institutional drought monitoring system, we constructed the combined index using all the hydrological variables from the existing monitoring system as well as common drought indicators at multiple time aggregations. The soil moisture deficit in the root zone computed by a distributed-parameter water balance model of the agricultural districts is used as target variable. Numerical results show that our combined drought index successfully reproduces the deficit. The index provides valuable information for supporting appropriate drought management strategies, including the possibility of directly informing the lake operations about the drought conditions and improving the overall reliability of the irrigation supply system.
Guisande, Cástor; Vari, Richard P; Heine, Jürgen; García-Roselló, Emilio; González-Dacosta, Jacinto; Perez-Schofield, Baltasar J García; González-Vilas, Luis; Pelayo-Villamil, Patricia
2016-09-12
We present and discuss VARSEDIG, an algorithm which identifies the morphometric features that significantly discriminate two taxa and validates the morphological distinctness between them via a Monte-Carlo test. VARSEDIG is freely available as a function of the RWizard application PlotsR (http://www.ipez.es/RWizard) and as R package on CRAN. The variables selected by VARSEDIG with the overlap method were very similar to those selected by logistic regression and discriminant analysis, but overcomes some shortcomings of these methods. VARSEDIG is, therefore, a good alternative by comparison to current classical classification methods for identifying morphometric features that significantly discriminate a taxon and for validating its morphological distinctness from other taxa. As a demonstration of the potential of VARSEDIG for this purpose, we analyze morphological discrimination among some species of the Neotropical freshwater family Characidae.
NASA Astrophysics Data System (ADS)
Attia, Khalid A. M.; El-Abasawi, Nasr M.; El-Olemy, Ahmed; Abdelazim, Ahmed H.
2018-01-01
For the first time, three UV spectrophotometric methods have been developed for the simultaneous determination of two newly FDA-approved drugs, namely elbasvir and grazoprevir, in their combined pharmaceutical dosage form. These methods include the simultaneous equation method and partial least squares with and without a variable selection procedure (genetic algorithm). For the simultaneous equation method, the absorbance values at 369 nm (λmax of elbasvir) and 253 nm (λmax of grazoprevir) were selected to form the two simultaneous equations required for the mathematical processing and quantitative analysis of the studied drugs. Alternatively, partial least squares with and without the variable selection procedure (genetic algorithm) was applied to the spectral analysis, because the simultaneous inclusion of many wavelengths, rather than a single or dual wavelength, greatly increases the precision and predictive ability of the methods. Successful assay of the drugs in their pharmaceutical formulation was achieved by the proposed methods. A statistical comparison of the obtained results with those of the manufacturer's methods was performed. It is noteworthy that there was no significant difference between the proposed methods and the manufacturer's method with respect to the validation parameters.
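The simultaneous-equation (Vierordt) method amounts to solving a 2x2 linear system built from the absorptivities of the two drugs at the two selected wavelengths. The numerical sketch below uses invented absorptivity and absorbance values, not the published calibration data.

```python
# Simultaneous equation method for a two-component mixture:
#   A(369 nm) = a11*C_elbasvir + a12*C_grazoprevir
#   A(253 nm) = a21*C_elbasvir + a22*C_grazoprevir
import numpy as np

# Illustrative absorptivities (absorbance per ug/mL per cm) at 369 and 253 nm.
K = np.array([[0.045, 0.004],
              [0.012, 0.060]])
A_mix = np.array([0.52, 0.71])          # measured absorbances of the mixture

conc = np.linalg.solve(K, A_mix)        # [C_elbasvir, C_grazoprevir]
print("elbasvir    ~ %.1f ug/mL" % conc[0])
print("grazoprevir ~ %.1f ug/mL" % conc[1])
```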
Kim, Hyungjin; Park, Chang Min; Lee, Myunghee; Park, Sang Joon; Song, Yong Sub; Lee, Jong Hyuk; Hwang, Eui Jin; Goo, Jin Mo
2016-01-01
To identify the impact of reconstruction algorithms on CT radiomic features of pulmonary tumors and to reveal and compare the intra- and inter-reader and inter-reconstruction algorithm variability of each feature. Forty-two patients (M:F = 19:23; mean age, 60.43±10.56 years) with 42 pulmonary tumors (22.56±8.51mm) underwent contrast-enhanced CT scans, which were reconstructed with filtered back projection and commercial iterative reconstruction algorithm (level 3 and 5). Two readers independently segmented the whole tumor volume. Fifteen radiomic features were extracted and compared among reconstruction algorithms. Intra- and inter-reader variability and inter-reconstruction algorithm variability were calculated using coefficients of variation (CVs) and then compared. Among the 15 features, 5 first-order tumor intensity features and 4 gray level co-occurrence matrix (GLCM)-based features showed significant differences (p<0.05) among reconstruction algorithms. As for the variability, effective diameter, sphericity, entropy, and GLCM entropy were the most robust features (CV≤5%). Inter-reader variability was larger than intra-reader or inter-reconstruction algorithm variability in 9 features. However, for entropy, homogeneity, and 4 GLCM-based features, inter-reconstruction algorithm variability was significantly greater than inter-reader variability (p<0.013). Most of the radiomic features were significantly affected by the reconstruction algorithms. Inter-reconstruction algorithm variability was greater than inter-reader variability for entropy, homogeneity, and GLCM-based features.
Automated method for measuring the extent of selective logging damage with airborne LiDAR data
NASA Astrophysics Data System (ADS)
Melendy, L.; Hagen, S. C.; Sullivan, F. B.; Pearson, T. R. H.; Walker, S. M.; Ellis, P.; Kustiyo; Sambodo, Ari Katmoko; Roswintiarti, O.; Hanson, M. A.; Klassen, A. W.; Palace, M. W.; Braswell, B. H.; Delgado, G. M.
2018-05-01
Selective logging has an impact on the global carbon cycle, as well as on the forest micro-climate, and longer-term changes in erosion, soil and nutrient cycling, and fire susceptibility. Our ability to quantify these impacts is dependent on methods and tools that accurately identify the extent and features of logging activity. LiDAR-based measurements of these features offer significant promise. Here, we present a set of algorithms for automated detection and mapping of critical features associated with logging - roads/decks, skid trails, and gaps - using commercial airborne LiDAR data as input. The automated algorithm was applied to commercial LiDAR data collected over two logging concessions in Kalimantan, Indonesia in 2014. The algorithm results were compared to measurements of the logging features collected in the field soon after logging was complete. The automated algorithm-mapped road/deck and skid trail features match closely with features measured in the field, with agreement levels ranging from 69% to 99% when adjusting for GPS location error. The algorithm performed most poorly with gaps, which, by their nature, are variable due to the unpredictable impact of tree fall versus the linear and regular features directly created by mechanical means. Overall, the automated algorithm performs well and offers significant promise as a generalizable tool useful to efficiently and accurately capture the effects of selective logging, including the potential to distinguish reduced impact logging from conventional logging.
Yu, Peng; Sun, Jia; Wolz, Robin; Stephenson, Diane; Brewer, James; Fox, Nick C; Cole, Patricia E; Jack, Clifford R; Hill, Derek L G; Schwarz, Adam J
2014-04-01
The objective of this study was to evaluate the effect of computational algorithm, measurement variability, and cut point on hippocampal volume (HCV)-based patient selection for clinical trials in mild cognitive impairment (MCI). We used normal control and amnestic MCI subjects from the Alzheimer's Disease Neuroimaging Initiative 1 (ADNI-1) as normative reference and screening cohorts. We evaluated the enrichment performance of 4 widely used hippocampal segmentation algorithms (FreeSurfer, Hippocampus Multi-Atlas Propagation and Segmentation (HMAPS), Learning Embeddings Atlas Propagation (LEAP), and NeuroQuant) in terms of 2-year changes in Mini-Mental State Examination (MMSE), Alzheimer's Disease Assessment Scale-Cognitive Subscale (ADAS-Cog), and Clinical Dementia Rating Sum of Boxes (CDR-SB). We modeled the implications for sample size, screen fail rates, and trial cost and duration. HCV-based patient selection yielded reduced sample sizes (by ∼40%-60%) and lower trial costs (by ∼30%-40%) across a wide range of cut points. These results provide a guide to the choice of HCV cut point for amnestic MCI clinical trials, allowing an informed tradeoff between statistical and practical considerations. Copyright © 2014 Elsevier Inc. All rights reserved.
Data compression using adaptive transform coding. Appendix 1: Item 1. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Rost, Martin Christopher
1988-01-01
Adaptive low-rate source coders are described in this dissertation. These coders adapt by adjusting the complexity of the coder to match the local coding difficulty of the image. This is accomplished by using a threshold-driven maximum distortion criterion to select the specific coder used. The different coders are built using variable blocksized transform techniques, and the threshold criterion selects small transform blocks to code the more difficult regions and larger blocks to code the less complex regions. A theoretical framework is constructed from which the study of these coders can be explored. An algorithm for selecting the optimal bit allocation for the quantization of transform coefficients is developed; it achieves more accurate bit assignments than the algorithms currently used in the literature. Some upper and lower bounds for the bit-allocation distortion-rate function are developed. An obtainable distortion-rate function is developed for a particular scalar quantizer mixing method that can be used to code transform coefficients at any rate.
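As a generic illustration of the bit allocation problem the dissertation addresses, the sketch below allocates integer bits across transform coefficients by marginal analysis under the usual high-rate distortion model; it is a textbook-style stand-in, not the dissertation's algorithm.

```python
import numpy as np

def greedy_bit_allocation(variances, total_bits):
    """Allocate integer bits to transform coefficients by marginal analysis.

    Assumes the common high-rate model D_i = var_i * 2**(-2*b_i); each step
    gives one bit to the coefficient whose distortion drops the most.
    """
    variances = np.asarray(variances, dtype=float)
    bits = np.zeros(len(variances), dtype=int)
    for _ in range(total_bits):
        dist_now = variances * 2.0 ** (-2 * bits)
        dist_next = variances * 2.0 ** (-2 * (bits + 1))
        bits[np.argmax(dist_now - dist_next)] += 1
    return bits

# Example: 8 coefficient variances, 16 bits to spend across the block.
print(greedy_bit_allocation([40.0, 18.0, 9.0, 4.0, 2.0, 1.0, 0.5, 0.25], 16))
```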
Wang, Shenghao; Zhang, Yuyan; Cao, Fuyi; Pei, Zhenying; Gao, Xuewei; Zhang, Xu; Zhao, Yong
2018-02-13
This paper presents a novel spectrum analysis tool named synergy adaptive moving window modeling based on immune clone algorithm (SA-MWM-ICA), motivated by the tedious and inconvenient labor involved in selecting pre-processing methods and spectral variables by prior experience. In this work, the immune clone algorithm is introduced into the spectrum analysis field as a new optimization strategy, addressing shortcomings of the traditional methods. Following the working principle of the human immune system, the performance of the quantitative model is regarded as the antigen, and a special vector corresponding to that antigen is regarded as an antibody. The antibody contains a pre-processing method optimization region encoded by 11 decimal digits, and a spectrum variable optimization region formed by moving windows with changeable width and position. A set of original antibodies is created by modeling with this algorithm. After calculating the affinity of these antibodies, those with high affinity are selected for cloning; the higher the affinity, the more copies are made. In the next step, another important operation, hyper-mutation, is applied to the cloned antibodies; the lower the affinity, the higher the mutation probability. Several antibodies with high affinity are thereby created. Groups of simulated data, a gasoline near-infrared spectra dataset, and a soil near-infrared spectra dataset are employed to verify and illustrate the performance of SA-MWM-ICA. Analysis results show that the quantitative models built with SA-MWM-ICA perform better than traditional models such as partial least squares (PLS), moving window PLS (MWPLS), genetic algorithm PLS (GAPLS), and pretreatment method classification and adjustable parameter changeable size moving window PLS (CA-CSMWPLS), especially for relatively complex spectra. The selected pre-processing methods and spectrum variables are easily explained. The proposed method converges in a few generations and can be used not only for near-infrared spectroscopy analysis but also for other similar spectral analyses, such as infrared spectroscopy. Copyright © 2017 Elsevier B.V. All rights reserved.
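The clone-and-hypermutate loop described above (copies proportional to affinity, mutation probability inversely related to affinity) can be sketched generically as follows; the bit-string encoding and toy affinity function are assumptions for illustration, not the SA-MWM-ICA antibody structure.

```python
import random

def clonal_selection(affinity, init_pop, generations=50,
                     n_select=10, clone_factor=5):
    """Generic clonal selection loop on bit-string antibodies.

    affinity: callable mapping a bit list to a score (higher is better).
    """
    pop = list(init_pop)
    for _ in range(generations):
        pop.sort(key=affinity, reverse=True)
        selected = pop[:n_select]
        clones = []
        for rank, ab in enumerate(selected):
            # Higher-affinity (lower rank) antibodies receive more clones.
            n_clones = max(1, clone_factor * (n_select - rank) // n_select)
            for _ in range(n_clones):
                # Hypermutation: lower-affinity antibodies mutate more often.
                p_mut = 0.02 + 0.2 * rank / n_select
                clones.append([b ^ (random.random() < p_mut) for b in ab])
        pop = selected + clones
    return max(pop, key=affinity)

# Toy affinity: count of set bits (stand-in for model performance).
random.seed(0)
init = [[random.randint(0, 1) for _ in range(24)] for _ in range(20)]
best = clonal_selection(lambda ab: sum(ab), init)
print(sum(best))
```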
Quantifying natural delta variability using a multiple-point geostatistics prior uncertainty model
NASA Astrophysics Data System (ADS)
Scheidt, Céline; Fernandes, Anjali M.; Paola, Chris; Caers, Jef
2016-10-01
We address the question of quantifying uncertainty associated with autogenic pattern variability in a channelized transport system by means of a modern geostatistical method. This question has considerable relevance for practical subsurface applications as well, particularly those related to uncertainty quantification relying on Bayesian approaches. Specifically, we show how the autogenic variability in a laboratory experiment can be represented and reproduced by a multiple-point geostatistical prior uncertainty model. The latter geostatistical method requires selection of a limited set of training images from which a possibly infinite set of geostatistical model realizations, mimicking the training image patterns, can be generated. To that end, we investigate two methods to determine how many training images and what training images should be provided to reproduce natural autogenic variability. The first method relies on distance-based clustering of overhead snapshots of the experiment; the second method relies on a rate of change quantification by means of a computer vision algorithm termed the demon algorithm. We show quantitatively that with either training image selection method, we can statistically reproduce the natural variability of the delta formed in the experiment. In addition, we study the nature of the patterns represented in the set of training images as a representation of the "eigenpatterns" of the natural system. The eigenpatterns in the training image sets display patterns consistent with previous physical interpretations of the fundamental modes of this type of delta system: a highly channelized, incisional mode; a poorly channelized, depositional mode; and an intermediate mode between the two.
An Orthogonal Evolutionary Algorithm With Learning Automata for Multiobjective Optimization.
Dai, Cai; Wang, Yuping; Ye, Miao; Xue, Xingsi; Liu, Hailin
2016-12-01
Research on multiobjective optimization problems has become one of the most active topics in intelligent computation. In order to improve the search efficiency of an evolutionary algorithm and maintain the diversity of solutions, in this paper, the learning automata (LA) is first used for quantization orthogonal crossover (QOX), and a new fitness function based on decomposition is proposed to achieve these two purposes. Based on these, an orthogonal evolutionary algorithm with LA for complex multiobjective optimization problems with continuous variables is proposed. The experimental results show that in continuous states, the proposed algorithm is able to achieve accurate Pareto-optimal sets and wide Pareto-optimal fronts efficiently. Moreover, a comparison with several existing well-known algorithms (nondominated sorting genetic algorithm II, decomposition-based multiobjective evolutionary algorithm, decomposition-based multiobjective evolutionary algorithm with an ensemble of neighborhood sizes, multiobjective optimization by LA, and multiobjective immune algorithm with nondominated neighbor-based selection) on 15 multiobjective benchmark problems shows that the proposed algorithm is able to find more accurate and evenly distributed Pareto-optimal fronts than the compared ones.
Shouval, Roni; Hadanny, Amir; Shlomo, Nir; Iakobishvili, Zaza; Unger, Ron; Zahger, Doron; Alcalai, Ronny; Atar, Shaul; Gottlieb, Shmuel; Matetzky, Shlomi; Goldenberg, Ilan; Beigel, Roy
2017-11-01
Risk scores for prediction of mortality 30 days following an ST-segment elevation myocardial infarction (STEMI) have been developed using a conventional statistical approach. To evaluate an array of machine learning (ML) algorithms for prediction of mortality at 30 days in STEMI patients and to compare these to the conventional validated risk scores. This was a retrospective, supervised learning, data mining study. Out of a cohort of 13,422 patients from the Acute Coronary Syndrome Israeli Survey (ACSIS) registry, 2782 patients fulfilled inclusion criteria and 54 variables were considered. Prediction models for overall mortality 30 days after STEMI were developed using 6 ML algorithms. Models were compared to each other and to the Global Registry of Acute Coronary Events (GRACE) and Thrombolysis In Myocardial Infarction (TIMI) scores. Depending on the algorithm, using all available variables, prediction models' performance measured as the area under the receiver operating characteristic curve (AUC) ranged from 0.64 to 0.91. The best models performed similarly to the GRACE score (0.87 SD 0.06) and outperformed the TIMI score (0.82 SD 0.06, p<0.05). Performance of most algorithms plateaued when 15 variables were included. Among the top predictors were creatinine, Killip class on admission, blood pressure, glucose level, and age. We present a data mining approach for prediction of mortality post-ST-segment elevation myocardial infarction. The algorithms selected showed competence in prediction across an increasing number of variables. ML may be used for outcome prediction in complex cardiology settings. Copyright © 2017 Elsevier Ireland Ltd. All rights reserved.
Identification of eggs from different production systems based on hyperspectra and CS-SVM.
Sun, J; Cong, S L; Mao, H P; Zhou, X; Wu, X H; Zhang, X D
2017-06-01
1. To identify the origin of table eggs more accurately, a method based on hyperspectral imaging technology was studied. 2. The hyperspectral data of 200 samples of intensive and extensive eggs were collected. Standard normalised variables combined with a Savitzky-Golay filter were used to eliminate noise, then stepwise regression (SWR) was used for feature selection. Grid search algorithm (GS), genetic search algorithm (GA), particle swarm optimisation algorithm (PSO) and cuckoo search algorithm (CS) were each combined with support vector machine (SVM) methods to establish an SVM identification model with the optimal parameters. The full spectrum data and the data after feature selection were the input of the model, while egg category was the output. 3. The SWR-CS-SVM model performed better than the other models, including SWR-GS-SVM, SWR-GA-SVM, SWR-PSO-SVM and others based on full spectral data. The training and test classification accuracy of the SWR-CS-SVM model were 99.3% and 96%, respectively. 4. SWR-CS-SVM proved effective for identifying egg varieties and could also be useful for the non-destructive identification of other types of egg.
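A rough approximation of the preprocessing-plus-SVM pipeline above can be assembled from standard tools; in this sketch the spectra and labels are synthetic, a simple correlation filter stands in for stepwise regression, and grid search stands in for the cuckoo search parameter optimizer.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for 200 egg spectra (rows) with binary origin labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 256))
y = rng.integers(0, 2, size=200)

# Savitzky-Golay smoothing of each spectrum (noise removal step).
X_smooth = savgol_filter(X, window_length=11, polyorder=2, axis=1)

# Crude feature selection stand-in: keep the 20 bands most correlated with y.
corr = np.abs([np.corrcoef(X_smooth[:, j], y)[0, 1] for j in range(X.shape[1])])
X_sel = X_smooth[:, np.argsort(corr)[-20:]]

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)

# Grid search over (C, gamma) replaces the cuckoo search optimizer here.
search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
                      cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```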
NASA Astrophysics Data System (ADS)
Uma Maheswari, R.; Umamaheswari, R.
2017-02-01
Condition Monitoring System (CMS) substantiates potential economic benefits and enables prognostic maintenance in wind turbine-generator failure prevention. Vibration Monitoring and Analysis is a powerful tool in drive train CMS, which enables the early detection of impending failure/damage. In variable speed drives such as wind turbine-generator drive trains, the acquired vibration signal is non-stationary and non-linear. Traditional stationary signal processing techniques are inefficient for diagnosing machine faults under time-varying conditions. The current research trend in CMS for drive-trains focuses on developing/improving non-linear, non-stationary feature extraction and fault classification algorithms to improve fault detection/prediction sensitivity and selectivity, thereby reducing misdetection and false alarm rates. Stationary signal processing algorithms employed in vibration analysis have been reviewed extensively in the literature. In this paper, an attempt is made to review the recent research advances in non-linear, non-stationary signal processing algorithms particularly suited for variable speed wind turbines.
An evolution based biosensor receptor DNA sequence generation algorithm.
Kim, Eungyeong; Lee, Malrey; Gatton, Thomas M; Lee, Jaewan; Zang, Yupeng
2010-01-01
A biosensor is composed of a bioreceptor, an associated recognition molecule, and a signal transducer that can selectively detect target substances for analysis. DNA based biosensors utilize receptor molecules that allow hybridization with the target analyte. However, most DNA biosensor research uses oligonucleotides as the target analytes and does not address the potential problems of real samples. The identification of recognition molecules suitable for real target analyte samples is an important step towards further development of DNA biosensors. This study examines the characteristics of DNA used as bioreceptors and proposes a hybrid evolution-based DNA sequence generating algorithm, based on DNA computing, to identify suitable DNA bioreceptor recognition molecules for stable hybridization with real target substances. The Traveling Salesman Problem (TSP) approach is applied in the proposed algorithm to evaluate the safety and fitness of the generated DNA sequences. This approach improves efficiency and stability for enhanced and variable-length DNA sequence generation and allows extension to generation of variable-length DNA sequences with diverse receptor recognition requirements.
NASA Astrophysics Data System (ADS)
Xia, Y.; Tian, J.; d'Angelo, P.; Reinartz, P.
2018-05-01
3D reconstruction of plants is hard to implement, as the complex leaf distribution greatly increases the difficulty of dense matching. Semi-Global Matching has been successfully applied to recover the depth information of a scene, but may perform variably when different matching cost algorithms are used. In this paper two matching cost computation algorithms, Census transform and an algorithm using a convolutional neural network, are tested for plant reconstruction based on Semi-Global Matching. High resolution close-range photogrammetric images from a handheld camera are used for the experiment. The disparity maps generated based on the two selected matching cost methods are comparable and of acceptable quality, which shows the good performance of Census and the potential of neural networks to improve the dense matching.
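To make the Census-based matching cost concrete, the sketch below computes a Census transform over a 5x5 window and a per-pixel Hamming-distance cost between two rectified images for one candidate disparity; it is a simplified illustration, not the paper's Semi-Global Matching pipeline.

```python
import numpy as np

def census_transform(img, radius=2):
    """Census transform: bit code comparing each neighbor to the center pixel."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            # Append one bit per neighbor comparison.
            out = out * np.uint64(2) + (shifted < img).astype(np.uint64)
    return out

def census_cost(left, right, disparity):
    """Per-pixel Hamming distance between census codes at a given disparity."""
    cl = census_transform(left)
    cr = np.roll(census_transform(right), disparity, axis=1)
    xor = cl ^ cr
    return np.array([[bin(int(v)).count("1") for v in row] for row in xor])

left = np.random.rand(32, 48)
right = np.roll(left, 3, axis=1)           # toy scene shifted by 3 pixels
print(census_cost(left, right, 3).mean())  # low cost at the true disparity
```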
NASA Astrophysics Data System (ADS)
Moslemipour, Ghorbanali
2018-07-01
This paper proposes a quadratic assignment-based mathematical model to deal with the stochastic dynamic facility layout problem. In this problem, product demands are assumed to be dependent normally distributed random variables with known probability density function and covariance that change from period to period at random. To solve the proposed model, a novel hybrid intelligent algorithm is proposed by combining the simulated annealing and clonal selection algorithms. The proposed model and the hybrid algorithm are verified and validated using design of experiment and benchmark methods. The results show that the hybrid algorithm performs outstandingly in terms of both solution quality and computational time. In addition, the proposed model can be used in both stochastic and deterministic situations.
Some aspects of algorithm performance and modeling in transient analysis of structures
NASA Technical Reports Server (NTRS)
Adelman, H. M.; Haftka, R. T.; Robinson, J. C.
1981-01-01
The status of an effort to increase the efficiency of calculating transient temperature fields in complex aerospace vehicle structures is described. The advantages and disadvantages of explicit algorithms with variable time steps, known as the GEAR package, are described. Four test problems, used for evaluating and comparing various algorithms, were selected, and finite-element models of the configurations are described. These problems include a space shuttle frame component, an insulated cylinder, a metallic panel for a thermal protection system, and a model of the wing of the space shuttle orbiter. Results generally indicate a preference for implicit over explicit algorithms for solution of transient structural heat transfer problems when the governing equations are stiff (typical of many practical problems such as insulated metal structures).
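The stiffness issue driving the implicit-versus-explicit preference can be seen on a tiny lumped conduction model; the sketch below compares forward Euler (explicit) and backward Euler (implicit) with the same step size and is illustrative only, not the study's finite-element formulation.

```python
import numpy as np

# Stiff two-node lumped conduction model: dT/dt = A @ T,
# where T holds temperature deviations from ambient.
A = np.array([[-1000.0, 999.0],
              [   10.0, -11.0]])   # eigenvalues roughly -1 and -1010 (stiff)
T0 = np.array([100.0, -50.0])

def forward_euler(T, dt, steps):          # explicit update
    for _ in range(steps):
        T = T + dt * (A @ T)
    return T

def backward_euler(T, dt, steps):         # implicit update: solve each step
    M = np.eye(2) - dt * A
    for _ in range(steps):
        T = np.linalg.solve(M, T)
    return T

dt = 0.01   # well above the explicit stability limit of about 2/1010 s
print(forward_euler(T0.copy(), dt, 50))   # diverges at this step size
print(backward_euler(T0.copy(), dt, 50))  # stays stable and decays
```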
NASA Astrophysics Data System (ADS)
Müller, Aline Lima Hermes; Picoloto, Rochele Sogari; Mello, Paola de Azevedo; Ferrão, Marco Flores; dos Santos, Maria de Fátima Pereira; Guimarães, Regina Célia Lourenço; Müller, Edson Irineu; Flores, Erico Marlon Moraes
2012-04-01
Total sulfur concentration was determined in atmospheric residue (AR) and vacuum residue (VR) samples obtained from the petroleum distillation process by Fourier transform infrared spectroscopy with attenuated total reflectance (FT-IR/ATR) in association with chemometric methods. The calibration and prediction sets consisted of 40 and 20 samples, respectively. Calibration models were developed using two variable selection methods: interval partial least squares (iPLS) and synergy interval partial least squares (siPLS). Different treatments and pre-processing steps were also evaluated for the development of models. The pre-treatment based on multiplicative scatter correction (MSC) and the mean centered data were selected for model construction. The use of siPLS as the variable selection method provided a model with root mean square error of prediction (RMSEP) values significantly better than those obtained by the PLS model using all variables. The best model was obtained using the siPLS algorithm with spectra divided into 20 intervals and combinations of 3 intervals (911-824, 823-736 and 737-650 cm-1). This model produced an RMSECV of 400 mg kg-1 S and RMSEP of 420 mg kg-1 S, showing a correlation coefficient of 0.990.
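The siPLS search described above (splitting the spectrum into intervals and testing combinations of a few intervals with PLS, keeping the combination with the lowest RMSECV) can be sketched with scikit-learn; the spectra and reference values below are synthetic placeholders, and cross-validated RMSE stands in for RMSECV.

```python
import numpy as np
from itertools import combinations
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))                          # 40 spectra, 200 wavelengths
y = X[:, 150:160].sum(axis=1) + rng.normal(0, 0.1, 40)  # synthetic sulfur values

n_intervals, n_combine = 20, 3
bounds = np.array_split(np.arange(X.shape[1]), n_intervals)

best = (np.inf, None)
for combo in combinations(range(n_intervals), n_combine):   # brute-force search
    cols = np.concatenate([bounds[i] for i in combo])
    pls = PLSRegression(n_components=5)
    y_cv = cross_val_predict(pls, X[:, cols], y, cv=5).ravel()
    rmsecv = np.sqrt(np.mean((y - y_cv) ** 2))
    if rmsecv < best[0]:
        best = (rmsecv, combo)

print("best interval combination:", best[1], "RMSECV:", round(best[0], 3))
```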
Uncertain programming models for portfolio selection with uncertain returns
NASA Astrophysics Data System (ADS)
Zhang, Bo; Peng, Jin; Li, Shengguo
2015-10-01
In an indeterminate economic environment, experts' knowledge about the returns of securities involves much uncertainty rather than randomness. This paper discusses the portfolio selection problem in an uncertain environment in which security returns cannot be well reflected by historical data, but can be evaluated by the experts. In the paper, returns of securities are assumed to be given by uncertain variables. According to various decision criteria, the portfolio selection problem in an uncertain environment is formulated as an expected-variance-chance model and a chance-expected-variance model using uncertain programming. Within the framework of uncertainty theory, for the convenience of solving the models, some crisp equivalents are discussed under different conditions. In addition, a hybrid intelligent algorithm is designed in the paper to provide a general method for solving the new models in general cases. Finally, two numerical examples are provided to show the performance and applications of the models and algorithm.
Long-range prediction of Indian summer monsoon rainfall using data mining and statistical approaches
NASA Astrophysics Data System (ADS)
H, Vathsala; Koolagudi, Shashidhar G.
2017-10-01
This paper presents a hybrid model to better predict Indian summer monsoon rainfall. The algorithm considers suitable techniques for processing dense datasets. The proposed three-step algorithm comprises closed itemset generation-based association rule mining for feature selection, cluster membership for dimensionality reduction, and a simple logistic function for prediction. An application classifying rainfall into flood, excess, normal, deficit, and drought categories, based on 36 predictors consisting of land and ocean variables, is presented. Results show good accuracy in the considered study period of 37 years (1969-2005).
Statistical simplex approach to primary and secondary color correction in thick lens assemblies
NASA Astrophysics Data System (ADS)
Ament, Shelby D. V.; Pfisterer, Richard
2017-11-01
A glass selection optimization algorithm is developed for primary and secondary color correction in thick lens systems. The approach is based on the downhill simplex method, and requires manipulation of the surface color equations to obtain a single glass-dependent parameter for each lens element. Linear correlation is used to relate this parameter to all other glass-dependent variables. The algorithm provides a statistical distribution of Abbe numbers for each element in the system. Several example lenses, from 2-element to 6-element systems, are analyzed to verify this approach. The optimization algorithm proposed is capable of finding glass solutions with high color correction without requiring an exhaustive search of the glass catalog.
NASA Astrophysics Data System (ADS)
Tomiwa, K. G.
2017-09-01
The search for new physics in the H → γγ + MET channel relies on how well the missing transverse energy is reconstructed. The MET algorithm used by the ATLAS experiment in turn uses input variables such as photons and jets, which depend on the reconstruction of the primary vertex. This document presents the performance of di-photon vertex reconstruction algorithms (the hardest vertex method and the Neural Network method). Comparing the performance of these algorithms on the nominal Standard Model sample and the Beyond Standard Model sample, we find that the Neural Network method of primary vertex selection performs better overall than the hardest vertex method.
Gross, Douglas P; Zhang, Jing; Steenstra, Ivan; Barnsley, Susan; Haws, Calvin; Amell, Tyler; McIntosh, Greg; Cooper, Juliette; Zaiane, Osmar
2013-12-01
To develop a classification algorithm and accompanying computer-based clinical decision support tool to help categorize injured workers toward optimal rehabilitation interventions based on unique worker characteristics. Population-based historical cohort design. Data were extracted from a Canadian provincial workers' compensation database on all claimants undergoing work assessment between December 2009 and January 2011. Data were available on: (1) numerous personal, clinical, occupational, and social variables; (2) type of rehabilitation undertaken; and (3) outcomes following rehabilitation (receiving time loss benefits or undergoing repeat programs). Machine learning, concerned with the design of algorithms to discriminate between classes based on empirical data, was the foundation of our approach to build a classification system with multiple independent and dependent variables. The population included 8,611 unique claimants. Subjects were predominantly employed (85 %) males (64 %) with diagnoses of sprain/strain (44 %). Baseline clinician classification accuracy was high (ROC = 0.86) for selecting programs that lead to successful return-to-work. Classification performance for machine learning techniques outperformed the clinician baseline classification (ROC = 0.94). The final classifiers were multifactorial and included the variables: injury duration, occupation, job attachment status, work status, modified work availability, pain intensity rating, self-rated occupational disability, and 9 items from the SF-36 Health Survey. The use of machine learning classification techniques appears to have resulted in classification performance better than clinician decision-making. The final algorithm has been integrated into a computer-based clinical decision support tool that requires additional validation in a clinical sample.
Stochastic model search with binary outcomes for genome-wide association studies.
Russu, Alberto; Malovini, Alberto; Puca, Annibale A; Bellazzi, Riccardo
2012-06-01
The spread of case-control genome-wide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses. Our method is based on a latent variable model that links the observed outcomes to the underlying genetic variables. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor. BOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. Simulations show that BOSS achieves higher precisions than the reference methods while preserving good recall rates. In both experimental studies, BOSS successfully detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes. BOSS outperforms the other methods in terms of F-measure on simulated data. In the two real studies, BOSS successfully detects biologically relevant features, some of which are missed by univariate analysis and the three reference techniques. The proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results showed that BOSS proves effective in detecting relevant markers while providing a parsimonious model.
Implementation of Data Mining to Analyze Drug Cases Using C4.5 Decision Tree
NASA Astrophysics Data System (ADS)
Wahyuni, Sri
2018-03-01
Data mining is the process of finding useful information in large databases. One of the techniques used in data mining is classification. The method used here is the decision tree method, and the algorithm used is the C4.5 algorithm. The decision tree method transforms a very large set of facts into a decision tree that presents the rules. The decision tree method is useful for exploring data and for finding hidden relationships between a number of potential input variables and a target variable. The C4.5 decision tree is constructed in several stages, including selecting an attribute as the root, creating a branch for each value, and dividing the cases among the branches. These stages are repeated for each branch until all the cases on the branch have the same class. From the resulting decision tree, a set of rules for each case is obtained. In this study, the researcher classified data on prisoners at Labuhan Deli prison to identify the factors behind detainees committing drug offences. By applying the C4.5 algorithm, knowledge was obtained that can be used to help minimize drug offences. The findings of the research show that the most influential factor in a detainee committing a drug offence was the address variable.
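The attribute-selection step of C4.5 picks the attribute with the highest gain ratio at each node; a minimal sketch of that computation on a toy categorical dataset follows (the attributes and records are illustrative, not the prison data).

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """C4.5 split criterion: information gain divided by split information."""
    total_entropy = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    n = len(labels)
    cond_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attr_index] for row in rows])
    gain = total_entropy - cond_entropy
    return gain / split_info if split_info > 0 else 0.0

# Toy data: (address_zone, age_group, employed) -> drug offence yes/no
rows = [("urban", "young", "no"), ("urban", "adult", "yes"),
        ("rural", "young", "yes"), ("urban", "young", "no"),
        ("rural", "adult", "yes"), ("urban", "adult", "no")]
labels = ["yes", "yes", "no", "yes", "no", "no"]

scores = {i: gain_ratio(rows, labels, i) for i in range(3)}
print(scores)  # the attribute with the highest gain ratio becomes the root
```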
Shahinfar, Saleh; Page, David; Guenther, Jerry; Cabrera, Victor; Fricke, Paul; Weigel, Kent
2014-02-01
When making the decision about whether or not to breed a given cow, knowledge about the expected outcome would have an economic impact on profitability of the breeding program and net income of the farm. The outcome of each breeding can be affected by many management and physiological features that vary between farms and interact with each other. Hence, the ability of machine learning algorithms to accommodate complex relationships in the data and missing values for explanatory variables makes these algorithms well suited for investigation of reproduction performance in dairy cattle. The objective of this study was to develop a user-friendly and intuitive on-farm tool to help farmers make reproduction management decisions. Several different machine learning algorithms were applied to predict the insemination outcomes of individual cows based on phenotypic and genotypic data. Data from 26 dairy farms in the Alta Genetics (Watertown, WI) Advantage Progeny Testing Program were used, representing a 10-yr period from 2000 to 2010. Health, reproduction, and production data were extracted from on-farm dairy management software, and estimated breeding values were downloaded from the US Department of Agriculture Agricultural Research Service Animal Improvement Programs Laboratory (Beltsville, MD) database. The edited data set consisted of 129,245 breeding records from primiparous Holstein cows and 195,128 breeding records from multiparous Holstein cows. Each data point in the final data set included 23 and 25 explanatory variables and 1 binary outcome for of 0.756 ± 0.005 and 0.736 ± 0.005 for primiparous and multiparous cows, respectively. The naïve Bayes algorithm, Bayesian network, and decision tree algorithms showed somewhat poorer classification performance. An information-based variable selection procedure identified herd average conception rate, incidence of ketosis, number of previous (failed) inseminations, days in milk at breeding, and mastitis as the most effective explanatory variables in predicting pregnancy outcome. Copyright © 2014 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Stephen, Reejis; Boxwala, Aziz; Gertman, Paul
2003-01-01
Data from Clinical Data Warehouses (CDWs) can be used for retrospective studies and for benchmarking. However, automated identification of cases from large datasets containing data items in free text fields is challenging. We developed an algorithm for categorizing pediatric patients presenting with respiratory distress into Bronchiolitis, Bacterial pneumonia and Asthma using clinical variables from a CDW. A feasibility study of this approach indicates that case selection may be automated.
LQTA-QSAR: a new 4D-QSAR methodology.
Martins, João Paulo A; Barbosa, Euzébio G; Pasqualoto, Kerly F M; Ferreira, Márcia M C
2009-06-01
A novel 4D-QSAR approach which makes use of the molecular dynamics (MD) trajectories and topology information retrieved from the GROMACS package is presented in this study. This new methodology, named LQTA-QSAR (LQTA, Laboratório de Quimiometria Teórica e Aplicada), has a module (LQTAgrid) that calculates intermolecular interaction energies at each grid point considering probes and all aligned conformations resulting from MD simulations. These interaction energies are the independent variables or descriptors employed in a QSAR analysis. The comparison of the proposed methodology to other 4D-QSAR and CoMFA formalisms was performed using a set of forty-seven glycogen phosphorylase b inhibitors (data set 1) and a set of forty-four MAP p38 kinase inhibitors (data set 2). The QSAR models for both data sets were built using the ordered predictor selection (OPS) algorithm for variable selection. Model validation was carried out applying y-randomization and leave-N-out cross-validation in addition to the external validation. PLS models for data sets 1 and 2 provided the following statistics: q² = 0.72, r² = 0.81 for 12 variables selected and 2 latent variables, and q² = 0.82, r² = 0.90 for 10 variables selected and 5 latent variables, respectively. The descriptors visualized in 3D space were successfully interpreted from a chemical point of view, supporting the applicability of this new approach in rational drug design.
NASA Technical Reports Server (NTRS)
Brewin, Robert J.W.; Sathyendranath, Shubha; Muller, Dagmar; Brockmann, Carsten; Deschamps, Pierre-Yves; Devred, Emmanuel; Doerffer, Roland; Fomferra, Norman; Franz, Bryan; Grant, Mike;
2013-01-01
Satellite-derived remote-sensing reflectance (Rrs) can be used for mapping biogeochemically relevant variables, such as the chlorophyll concentration and the Inherent Optical Properties (IOPs) of the water, at global scale for use in climate-change studies. Prior to generating such products, suitable algorithms have to be selected that are appropriate for the purpose. Algorithm selection needs to account for both qualitative and quantitative requirements. In this paper we develop an objective methodology designed to rank the quantitative performance of a suite of bio-optical models. The objective classification is applied using the NASA bio-Optical Marine Algorithm Dataset (NOMAD). Using in situ Rrs as input to the models, the performance of eleven semianalytical models, as well as five empirical chlorophyll algorithms and an empirical diffuse attenuation coefficient algorithm, is ranked for spectrally-resolved IOPs, chlorophyll concentration and the diffuse attenuation coefficient at 489 nm. The sensitivity of the objective classification and the uncertainty in the ranking are tested using a Monte-Carlo approach (bootstrapping). Results indicate that the performance of the semi-analytical models varies depending on the product and wavelength of interest. For chlorophyll retrieval, empirical algorithms perform better than semi-analytical models, in general. The performance of these empirical models reflects either their immunity to scale errors or instrument noise in Rrs data, or simply that the data used for model parameterisation were not independent of NOMAD. Nonetheless, uncertainty in the classification suggests that the performance of some semi-analytical algorithms at retrieving chlorophyll is comparable with the empirical algorithms. For phytoplankton absorption at 443 nm, some semi-analytical models also perform with similar accuracy to an empirical model. We discuss the potential biases, limitations and uncertainty in the approach, as well as additional qualitative considerations for algorithm selection for climate-change studies. Our classification has the potential to be routinely implemented, such that the performance of emerging algorithms can be compared with existing algorithms as they become available. In the long-term, such an approach will further aid algorithm development for ocean-colour studies.
Application of Multivariate Modeling for Radiation Injury Assessment: A Proof of Concept
Bolduc, David L.; Villa, Vilmar; Sandgren, David J.; Ledney, G. David; Blakely, William F.; Bünger, Rolf
2014-01-01
Multivariate radiation injury estimation algorithms were formulated for estimating severe hematopoietic acute radiation syndrome (H-ARS) injury (i.e., response category three or RC3) in a rhesus monkey total-body irradiation (TBI) model. Classical CBC and serum chemistry blood parameters were examined prior to irradiation (d 0) and on d 7, 10, 14, 21, and 25 after irradiation involving 24 nonhuman primates (NHP) (Macaca mulatta) given 6.5-Gy 60Co γ-rays (0.4 Gy min−1) TBI. A correlation matrix was formulated with the RC3 severity level designated as the “dependent variable” and independent variables down selected based on their radioresponsiveness and relatively low multicollinearity using stepwise-linear regression analyses. Final candidate independent variables included CBC counts (absolute number of neutrophils, lymphocytes, and platelets) in formulating the “CBC” RC3 estimation algorithm. Additionally, the formulation of a diagnostic CBC and serum chemistry “CBC-SCHEM” RC3 algorithm expanded upon the CBC algorithm model with the addition of hematocrit and the serum enzyme levels of aspartate aminotransferase, creatine kinase, and lactate dehydrogenase. Both algorithms estimated RC3 with over 90% predictive power. Only the CBC-SCHEM RC3 algorithm, however, met the critical three assumptions of linear least squares demonstrating slightly greater precision for radiation injury estimation, but with significantly decreased prediction error indicating increased statistical robustness. PMID:25165485
Combinatorial Multiobjective Optimization Using Genetic Algorithms
NASA Technical Reports Server (NTRS)
Crossley, William A.; Martin, Eric T.
2002-01-01
The research proposed in this document investigated multiobjective optimization approaches based upon the Genetic Algorithm (GA). Several versions of the GA have been adopted for multiobjective design, but, prior to this research, there had not been significant comparisons of the most popular strategies. The research effort first generalized the two-branch tournament genetic algorithm into an N-branch genetic algorithm, then the N-branch GA was compared with a version of the popular Multi-Objective Genetic Algorithm (MOGA). Because the genetic algorithm is well suited to combinatorial (mixed discrete / continuous) optimization problems, the GA can be used in the conceptual phase of design to combine selection (discrete variable) and sizing (continuous variable) tasks. Using a multiobjective formulation for the design of a 50-passenger aircraft to meet the competing objectives of minimizing takeoff gross weight and minimizing trip time, the GA generated a range of tradeoff designs that illustrate which aircraft features change from a low-weight, slow trip-time aircraft design to a heavy-weight, short trip-time aircraft design. Given the objective formulation and analysis methods used, the results of this study identify where turboprop-powered aircraft and turbofan-powered aircraft become more desirable for the 50-seat passenger application. This aircraft design application also begins to suggest how a combinatorial multiobjective optimization technique could be used to assist in the design of morphing aircraft.
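The N-branch tournament idea can be sketched as a selection operator in which each objective gets its own tournament branch and candidates within a branch compete only on that objective; the sketch below is a generic illustration under that assumption, not the study's aircraft design code.

```python
import random

def n_branch_tournament(population, objectives, n_parents):
    """Select parents by rotating binary tournaments, one branch per objective.

    population: list of candidate designs.
    objectives: list of callables, each to be minimized.
    """
    parents = []
    n_branches = len(objectives)
    for k in range(n_parents):
        obj = objectives[k % n_branches]        # rotate through the branches
        a, b = random.sample(population, 2)     # binary tournament
        parents.append(a if obj(a) < obj(b) else b)
    return parents

# Toy designs: (gross_weight, trip_time) pairs standing in for aircraft designs.
random.seed(1)
designs = [(random.uniform(30, 60), random.uniform(1.5, 3.0)) for _ in range(20)]
parents = n_branch_tournament(designs,
                              [lambda d: d[0], lambda d: d[1]],
                              n_parents=10)
print(parents[:3])
```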
Evolutionary algorithm for vehicle driving cycle generation.
Perhinschi, Mario G; Marlowe, Christopher; Tamayo, Sergio; Tu, Jun; Wayne, W Scott
2011-09-01
Modeling transit bus emissions and fuel economy requires a large amount of experimental data over wide ranges of operational conditions. Chassis dynamometer tests are typically performed using representative driving cycles defined based on vehicle instantaneous speed as sequences of "microtrips", which are intervals between consecutive vehicle stops. Overall significant parameters of the driving cycle, such as average speed, stops per mile, kinetic intensity, and others, are used as independent variables in the modeling process. Performing tests at all the necessary combinations of parameters is expensive and time consuming. In this paper, a methodology is proposed for building driving cycles at prescribed independent variable values using experimental data through the concatenation of "microtrips" isolated from a limited number of standard chassis dynamometer test cycles. The selection of the adequate "microtrips" is achieved through a customized evolutionary algorithm. The genetic representation uses microtrip definitions as genes. Specific mutation, crossover, and karyotype alteration operators have been defined. The Roulette-Wheel selection technique with elitist strategy drives the optimization process, which consists of minimizing the errors to desired overall cycle parameters. This utility is part of the Integrated Bus Information System developed at West Virginia University.
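The heart of the cycle-generation approach is scoring a candidate concatenation of microtrips against target overall parameters; the sketch below computes average speed and stops per mile for a candidate cycle and a simple squared-error fitness, with units, sampling rate, and targets as illustrative assumptions.

```python
def cycle_parameters(microtrips, dt=1.0):
    """Compute overall parameters of a cycle built from concatenated microtrips.

    microtrips: list of speed traces (mph, sampled every dt seconds),
    each beginning and ending at a stop.
    """
    speeds = [v for trip in microtrips for v in trip]
    hours = len(speeds) * dt / 3600.0
    miles = sum(speeds) * dt / 3600.0
    avg_speed = miles / hours if hours else 0.0
    stops_per_mile = len(microtrips) / miles if miles else 0.0
    return avg_speed, stops_per_mile

def fitness(microtrips, target_speed, target_spm):
    """Squared error to the desired cycle parameters (to be minimized)."""
    avg_speed, spm = cycle_parameters(microtrips)
    return (avg_speed - target_speed) ** 2 + (spm - target_spm) ** 2

# Two toy microtrips (speed in mph, 1 Hz samples); illustrative targets.
trips = [[0, 5, 15, 20, 10, 0], [0, 10, 25, 25, 5, 0]]
print(cycle_parameters(trips), fitness(trips, 12.0, 3.0))
```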
Automatic design of basin-specific drought indexes for highly regulated water systems
NASA Astrophysics Data System (ADS)
Zaniolo, Marta; Giuliani, Matteo; Castelletti, Andrea Francesco; Pulido-Velazquez, Manuel
2018-04-01
Socio-economic costs of drought are progressively increasing worldwide due to ongoing alterations of hydro-meteorological regimes induced by climate change. Although drought management is largely studied in the literature, traditional drought indexes often fail to detect critical events in highly regulated systems, where natural water availability is conditioned by the operation of water infrastructures such as dams, diversions, and pumping wells. Here, ad hoc index formulations are usually adopted based on empirical combinations of several, supposed-to-be significant, hydro-meteorological variables. These customized formulations, however, while effective in the design basin, can hardly be generalized and transferred to different contexts. In this study, we contribute FRIDA (FRamework for Index-based Drought Analysis), a novel framework for the automatic design of basin-customized drought indexes. In contrast to ad hoc empirical approaches, FRIDA is fully automated, generalizable, and portable across different basins. FRIDA builds an index representing a surrogate of the drought conditions of the basin, computed by combining all the relevant available information about the water circulating in the system identified by means of a feature extraction algorithm. We used the Wrapper for Quasi-Equally Informative Subset Selection (W-QEISS), which features a multi-objective evolutionary algorithm to find Pareto-efficient subsets of variables by maximizing the wrapper accuracy, minimizing the number of selected variables, and optimizing relevance and redundancy of the subset. The preferred variable subset is selected among the efficient solutions and used to formulate the final index according to alternative model structures. We apply FRIDA to the case study of the Jucar river basin (Spain), a drought-prone and highly regulated Mediterranean water resource system, where an advanced drought management plan relying on the formulation of an ad hoc state index
is used for triggering drought management measures. The state index was constructed empirically with a trial-and-error process begun in the 1980s and finalized in 2007, guided by the experts from the Confederación Hidrográfica del Júcar (CHJ). Our results show that the automated variable selection outcomes align with CHJ's 25-year-long empirical refinement. In addition, the resultant FRIDA index outperforms the official State Index in terms of accuracy in reproducing the target variable and cardinality of the selected inputs set.
Glisson, Courtenay L; Altamar, Hernan O; Herrell, S Duke; Clark, Peter; Galloway, Robert L
2011-11-01
Image segmentation is integral to implementing intraoperative guidance for kidney tumor resection. Results seen in computed tomography (CT) data are affected by target organ physiology as well as by the segmentation algorithm used. This work studies variables involved in using level set methods found in the Insight Toolkit to segment kidneys from CT scans and applies the results to an image guidance setting. A composite algorithm drawing on the strengths of multiple level set approaches was built using the Insight Toolkit. This algorithm requires image contrast state and seed points to be identified as input, and functions independently thereafter, selecting and altering method and variable choice as needed. Semi-automatic results were compared to expert hand segmentation results directly and by the use of the resultant surfaces for registration of intraoperative data. Direct comparison using the Dice metric showed average agreement of 0.93 between semi-automatic and hand segmentation results. Use of the segmented surfaces in closest point registration of intraoperative laser range scan data yielded average closest point distances of approximately 1 mm. Application of both inverse registration transforms from the previous step to all hand segmented image space points revealed that the distance variability introduced by registering to the semi-automatically segmented surface versus the hand segmented surface was typically less than 3 mm both near the tumor target and at distal points, including subsurface points. Use of the algorithm shortened user interaction time and provided results which were comparable to the gold standard of hand segmentation. Further, the use of the algorithm's resultant surfaces in image registration provided comparable transformations to surfaces produced by hand segmentation. These data support the applicability and utility of such an algorithm as part of an image guidance workflow.
Acevedo-Sáenz, Liliana; Ochoa, Rodrigo; Rugeles, Maria Teresa; Olaya-García, Patricia; Velilla-Hernández, Paula Andrea; Diaz, Francisco J.
2015-01-01
One of the main characteristics of the human immunodeficiency virus is its genetic variability and rapid adaptation to changing environmental conditions. This variability, resulting from the lack of proofreading activity of the viral reverse transcriptase, generates mutations that could be fixed either by random genetic drift or by positive selection. Among the forces driving positive selection are antiretroviral therapy and CD8+ T-cells, the most important immune mechanism involved in viral control. Here, we describe mutations induced by these selective forces acting on the pol gene of HIV in a group of infected individuals. We used Maximum Likelihood analyses of the ratio of non-synonymous to synonymous mutations per site (dN/dS) to study the extent of positive selection in the protease and the reverse transcriptase, using 614 viral sequences from Colombian patients. We also performed computational approaches, docking and algorithmic analyses, to assess whether the positively selected mutations affected binding to the HLA molecules. We found 19 positively-selected codons in drug resistance-associated sites and 22 located within CD8+ T-cell epitopes. A high percentage of mutations in these epitopes has not been previously reported. According to the docking analyses only one of those mutations affected HLA binding. However, algorithmic methods predicted a decrease in the affinity for the HLA molecule in seven mutated peptides. The bioinformatics strategies described here are useful to identify putative positively selected mutations associated with immune escape but should be complemented with an experimental approach to define the impact of these mutations on the functional profile of the CD8+ T-cells. PMID:25803098
Zawbaa, Hossam M.; Szlȩk, Jakub; Grosan, Crina; Jachowicz, Renata; Mendyk, Aleksander
2016-01-01
Poly-lactide-co-glycolide (PLGA) is a copolymer of lactic and glycolic acid. Drug release from PLGA microspheres depends not only on polymer properties but also on drug type, particle size, morphology of microspheres, release conditions, etc. Selecting a subset of relevant properties for PLGA is a challenging machine learning task as there are over three hundred features to consider. In this work, we formulate the selection of critical attributes for PLGA as a multiobjective optimization problem with the aim of minimizing the error of predicting the dissolution profile while reducing the number of attributes selected. Four bio-inspired optimization algorithms: antlion optimization, binary version of antlion optimization, grey wolf optimization, and social spider optimization are used to select the optimal feature set for predicting the dissolution profile of PLGA. Besides these, LASSO algorithm is also used for comparisons. Selection of crucial variables is performed under the assumption that both predictability and model simplicity are of equal importance to the final result. During the feature selection process, a set of input variables is employed to find minimum generalization error across different predictive models and their settings/architectures. The methodology is evaluated using predictive modeling for which various tools are chosen, such as Cubist, random forests, artificial neural networks (monotonic MLP, deep learning MLP), multivariate adaptive regression splines, classification and regression tree, and hybrid systems of fuzzy logic and evolutionary computations (fugeR). The experimental results are compared with the results reported by Szlȩk. We obtain a normalized root mean square error (NRMSE) of 15.97% versus 15.4%, and the number of selected input features is smaller, nine versus eleven. PMID:27315205
Development of an automated energy audit protocol for office buildings
NASA Astrophysics Data System (ADS)
Deb, Chirag
This study aims to enhance the building energy audit process and to reduce the time and cost required to conduct a full physical audit. For this, a total of 5 Energy Service Companies in Singapore have collaborated and provided energy audit reports for 62 office buildings. Several statistical techniques are adopted to analyse these reports. These techniques comprise cluster analysis and the development of models to predict energy savings for buildings. The cluster analysis shows that there are 3 clusters of buildings experiencing different levels of energy savings. To understand the effect of building variables on the change in EUI, a robust iterative process for selecting the appropriate variables is developed. The results show that the 4 variables of GFA, non-air-conditioning energy consumption, average chiller plant efficiency and installed capacity of chillers should be used for clustering. This analysis is extended to the development of prediction models using linear regression and artificial neural networks (ANN). An exhaustive variable selection algorithm is developed to select the input variables for the two energy saving prediction models. The results show that the ANN prediction model can predict the energy saving potential of a given building with an accuracy of +/-14.8%.
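The exhaustive variable selection step can be read as scoring every subset of candidate inputs with a cross-validated regression and keeping the best one, which is tractable for a handful of building variables; the variable names and data below are placeholders, and plain linear regression stands in for the study's two model types.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
names = ["GFA", "non_AC_energy", "chiller_eff", "chiller_capacity", "age", "occupancy"]
X = rng.normal(size=(62, len(names)))                        # 62 audited buildings (synthetic)
y = 2.0 * X[:, 2] - 1.5 * X[:, 1] + rng.normal(0, 0.5, 62)   # synthetic change in EUI

best_score, best_subset = -np.inf, None
for k in range(1, len(names) + 1):
    for subset in combinations(range(len(names)), k):        # exhaustive search
        score = cross_val_score(LinearRegression(), X[:, list(subset)], y,
                                cv=5, scoring="r2").mean()
        if score > best_score:
            best_score, best_subset = score, subset

print([names[i] for i in best_subset], round(best_score, 3))
```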
NASA Astrophysics Data System (ADS)
Lü, Chengxu; Jiang, Xunpeng; Zhou, Xingfan; Zhang, Yinqiao; Zhang, Naiqian; Wei, Chongfeng; Mao, Wenhua
2017-10-01
Wet gluten is a useful quality indicator for wheat, and short wave near infrared spectroscopy (NIRS) is a high performance technique with the advantages of being economical, rapid, and nondestructive. To study the feasibility of using short wave NIRS to analyze wet gluten directly from wheat seed, 54 representative wheat seed samples were collected and scanned by spectrometer. Eight spectral pretreatment methods and a genetic algorithm (GA) variable selection method were used to optimize the analysis. Both quantitative and qualitative models of wet gluten were built by partial least squares regression and discriminant analysis. For quantitative analysis, normalization was the optimal pretreatment method, 17 wet-gluten-sensitive variables were selected by the GA, and the GA model performed better than the all-variable model, with R2V=0.88 and RMSEV=1.47. For qualitative analysis, automatic weighted least squares baseline correction was the optimal pretreatment method, and the all-variable models performed better than the GA models. The correct classification rates for the three classes of <24%, 24-30%, and >30% wet gluten content were 95.45%, 84.52%, and 90.00%, respectively. The short wave NIRS technique shows potential for both quantitative and qualitative analysis of wet gluten in wheat seed.
C-fuzzy variable-branch decision tree with storage and classification error rate constraints
NASA Astrophysics Data System (ADS)
Yang, Shiueng-Bien
2009-10-01
The C-fuzzy decision tree (CFDT), which is based on the fuzzy C-means algorithm, has recently been proposed. The CFDT is grown by selecting the nodes to be split according to its classification error rate. However, the CFDT design does not consider the classification time taken to classify the input vector. Thus, the CFDT can be improved. We propose a new C-fuzzy variable-branch decision tree (CFVBDT) with storage and classification error rate constraints. The design of the CFVBDT consists of two phases: growing and pruning. The CFVBDT is grown by selecting the nodes to be split according to the classification error rate and the classification time in the decision tree. Additionally, the pruning method selects the nodes to prune based on the storage requirement and the classification time of the CFVBDT. Furthermore, the number of branches of each internal node is variable in the CFVBDT. Experimental results indicate that the proposed CFVBDT outperforms the CFDT and other methods.
Variable Selection for Support Vector Machines in Moderately High Dimensions
Zhang, Xiang; Wu, Yichao; Wang, Lan; Li, Runze
2015-01-01
Summary The support vector machine (SVM) is a powerful binary classification tool with high accuracy and great flexibility. It has achieved great success, but its performance can be seriously impaired if many redundant covariates are included. Some efforts have been devoted to studying variable selection for SVMs, but asymptotic properties, such as variable selection consistency, are largely unknown when the number of predictors diverges to infinity. In this work, we establish a unified theory for a general class of nonconvex penalized SVMs. We first prove that in ultra-high dimensions, there exists one local minimizer to the objective function of nonconvex penalized SVMs possessing the desired oracle property. We further address the problem of nonunique local minimizers by showing that the local linear approximation algorithm is guaranteed to converge to the oracle estimator even in the ultra-high dimensional setting if an appropriate initial estimator is available. This condition on initial estimator is verified to be automatically valid as long as the dimensions are moderately high. Numerical examples provide supportive evidence. PMID:26778916
A method for the dynamic management of genetic variability in dairy cattle
Colleau, Jean-Jacques; Moureaux, Sophie; Briend, Michèle; Bechu, Jérôme
2004-01-01
According to the general approach developed in this paper, dynamic management of genetic variability in selected populations of dairy cattle is carried out for three simultaneous purposes: procreation of young bulls to be further progeny-tested, use of service bulls already selected and approval of recently progeny-tested bulls for use. At each step, the objective is to minimize the average pairwise relationship coefficient in the future population born from programmed matings and the existing population. As a common constraint, the average estimated breeding value of the new population, for a selection goal including many important traits, is set to a desired value. For the procreation of young bulls, breeding costs are additionally constrained. Optimization is fully analytical and directly considers matings. Corresponding algorithms are presented in detail. The efficiency of these procedures was tested on the current Norman population. Comparisons between optimized and real matings, clearly showed that optimization would have saved substantial genetic variability without reducing short-term genetic gains. PMID:15231230
Evenly spaced Detrended Fluctuation Analysis: Selecting the number of points for the diffusion plot
NASA Astrophysics Data System (ADS)
Liddy, Joshua J.; Haddad, Jeffrey M.
2018-02-01
Detrended Fluctuation Analysis (DFA) has become a widely-used tool to examine the correlation structure of a time series and has provided insights into neuromuscular health and disease states. As the popularity of utilizing DFA in the human behavioral sciences has grown, understanding its limitations and how to properly determine parameters is becoming increasingly important. DFA examines the correlation structure of variability in a time series by computing α, the slope of the log SD-log n diffusion plot. When using the traditional DFA algorithm, the timescales, n, are often selected as a set of integers between a minimum and maximum length based on the number of data points in the time series. This produces non-uniformly distributed values of n in logarithmic scale, which influences the estimation of α due to a disproportionate weighting of the long-timescale regions of the diffusion plot. Recently, the evenly spaced DFA and evenly spaced average DFA algorithms were introduced. Both algorithms compute α by selecting k points for the diffusion plot based on the minimum and maximum timescales of interest and improve the consistency of α estimates for simulated fractional Gaussian noise and fractional Brownian motion time series. Two issues that remain unaddressed are (1) how to select k and (2) whether the evenly-spaced DFA algorithms show similar benefits when assessing human behavioral data. We manipulated k and examined its effects on the accuracy, consistency, and confidence limits of α in simulated and experimental time series. We demonstrate that the accuracy and consistency of α are relatively unaffected by the selection of k. However, the confidence limits of α narrow as k increases, dramatically reducing measurement uncertainty for single trials. We provide guidelines for selecting k and discuss potential uses of the evenly spaced DFA algorithms when assessing human behavioral data.
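As a minimal illustration of the timescale-selection step described in this abstract, the sketch below picks k window sizes evenly spaced in logarithmic scale between a minimum and maximum timescale. The function name and the example values of n_min, n_max, and k are ours, not the authors'.

```python
import numpy as np

def evenly_spaced_scales(n_min, n_max, k):
    """Return k window sizes evenly spaced in log scale between n_min and n_max.

    Rounding to integers can create duplicates for short series, so fewer
    than k unique scales may be returned.
    """
    scales = np.logspace(np.log10(n_min), np.log10(n_max), num=k)
    return np.unique(np.round(scales).astype(int))

# Example: 18 window sizes for a 1000-point series
print(evenly_spaced_scales(n_min=4, n_max=250, k=18))
```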
NASA Astrophysics Data System (ADS)
Rosero-Vlasova, O.; Borini Alves, D.; Vlassova, L.; Perez-Cabello, F.; Montorio Lloveria, R.
2017-10-01
Deforestation in Amazon basin due, among other factors, to frequent wildfires demands continuous post-fire monitoring of soil and vegetation. Thus, the study posed two objectives: (1) evaluate the capacity of Visible - Near InfraRed - ShortWave InfraRed (VIS-NIR-SWIR) spectroscopy to estimate soil organic matter (SOM) in fire-affected soils, and (2) assess the feasibility of SOM mapping from satellite images. For this purpose, 30 soil samples (surface layer) were collected in 2016 in areas of grass and riparian vegetation of Campos Amazonicos National Park, Brazil, repeatedly affected by wildfires. Standard laboratory procedures were applied to determine SOM. Reflectance spectra of soils were obtained in controlled laboratory conditions using Fieldspec4 spectroradiometer (spectral range 350nm- 2500nm). Measured spectra were resampled to simulate reflectances for Landsat-8, Sentinel-2 and EnMap spectral bands, used as predictors in SOM models developed using Partial Least Squares regression and step-down variable selection algorithm (PLSR-SD). The best fit was achieved with models based on reflectances simulated for EnMap bands (R2=0.93; R2cv=0.82 and NMSE=0.07; NMSEcv=0.19). The model uses only 8 out of 244 predictors (bands) chosen by the step-down variable selection algorithm. The least reliable estimates (R2=0.55 and R2cv=0.40 and NMSE=0.43; NMSEcv=0.60) resulted from Landsat model, while Sentinel-2 model showed R2=0.68 and R2cv=0.63; NMSE=0.31 and NMSEcv=0.38. The results confirm high potential of VIS-NIR-SWIR spectroscopy for SOM estimation. Application of step-down produces sparser and better-fit models. Finally, SOM can be estimated with an acceptable accuracy (NMSE 0.35) from EnMap and Sentinel-2 data enabling mapping and analysis of impacts of repeated wildfires on soils in the study area.
Stratiform/convective rain delineation for TRMM microwave imager
NASA Astrophysics Data System (ADS)
Islam, Tanvir; Srivastava, Prashant K.; Dai, Qiang; Gupta, Manika; Wan Jaafar, Wan Zurina
2015-10-01
This article investigates the potential for using machine learning algorithms to delineate stratiform/convective (S/C) rain regimes for a passive microwave imager, taking calibrated brightness temperatures as the only spectral parameters. The algorithms have been implemented for the Tropical Rainfall Measuring Mission (TRMM) microwave imager (TMI), and calibrated as well as validated taking the Precipitation Radar (PR) S/C information as the target class variables. Two different algorithms are particularly explored for the delineation. The first one is the metaheuristic adaptive boosting algorithm that includes the real, gentle, and modest versions of AdaBoost. The second one is the classical linear discriminant analysis that includes the Fisher's and penalized versions of linear discriminant analysis. Furthermore, prior to the development of the delineation algorithms, a feature selection analysis has been conducted for a total of 85 features, which contains the combinations of brightness temperatures from 10 GHz to 85 GHz and some derived indexes, such as the scattering index, polarization corrected temperature, and polarization difference, with the help of the mutual information aided minimal redundancy maximal relevance criterion (mRMR). It has been found that the polarization corrected temperature at 85 GHz and the features derived from the "addition" operator associated with the 85 GHz channels have good statistical dependency to the S/C target class variables. Further, it has been shown how the mRMR feature selection technique helps to reduce the number of features without deteriorating the results when applied through the machine learning algorithms. The proposed scheme is able to delineate the S/C rain regimes with reasonable accuracy. Based on the statistical validation experience from the validation period, the Matthews correlation coefficients are in the range of 0.60-0.70. Since the proposed method does not rely on any a priori information, it is very suitable for other microwave sensors having similar channels to the TMI. The method could possibly benefit the constellation sensors in the Global Precipitation Measurement (GPM) mission era.
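A hedged sketch of the kind of greedy mRMR ranking used above is given below. For brevity it scores relevance with mutual information but approximates redundancy with absolute Pearson correlation rather than mutual information, and the function name, parameter values, and synthetic data are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification

def mrmr_select(X, y, n_features=10):
    """Greedy minimal-redundancy-maximal-relevance feature selection.

    Relevance is the mutual information between each feature and the class
    labels; redundancy is approximated by the mean absolute correlation with
    the features already selected.
    """
    relevance = mutual_info_classif(X, y, random_state=0)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    candidates = set(range(X.shape[1])) - set(selected)
    while len(selected) < n_features and candidates:
        scores = {j: relevance[j] - corr[j, selected].mean() for j in candidates}
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage on 85 synthetic brightness-temperature-like features
X, y = make_classification(n_samples=300, n_features=85, n_informative=10,
                           random_state=0)
print(mrmr_select(X, y, n_features=8))
```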
Improved Sparse Multi-Class SVM and Its Application for Gene Selection in Cancer Classification
Huang, Lingkang; Zhang, Hao Helen; Zeng, Zhao-Bang; Bushel, Pierre R.
2013-01-01
Background Microarray techniques provide promising tools for cancer diagnosis using gene expression profiles. However, molecular diagnosis based on high-throughput platforms presents great challenges due to the overwhelming number of variables versus the small sample size and the complex nature of multi-type tumors. Support vector machines (SVMs) have shown superior performance in cancer classification due to their ability to handle high dimensional low sample size data. The multi-class SVM algorithm of Crammer and Singer provides a natural framework for multi-class learning. Despite its effective performance, the procedure utilizes all variables without selection. In this paper, we propose to improve the procedure by imposing shrinkage penalties in learning to enforce solution sparsity. Results The original multi-class SVM of Crammer and Singer is effective for multi-class classification but does not conduct variable selection. We improved the method by introducing soft-thresholding type penalties to incorporate variable selection into multi-class classification for high dimensional data. The new methods were applied to simulated data and two cancer gene expression data sets. The results demonstrate that the new methods can select a small number of genes for building accurate multi-class classification rules. Furthermore, the important genes selected by the methods overlap significantly, suggesting general agreement among different variable selection schemes. Conclusions High accuracy and sparsity make the new methods attractive for cancer diagnostics with gene expression data and defining targets of therapeutic intervention. Availability: The source MATLAB code are available from http://math.arizona.edu/~hzhang/software.html. PMID:23966761
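The code below is not the Crammer-Singer formulation with soft-thresholding penalties proposed in this paper; it is a simpler stand-in, a one-vs-rest linear SVM with an L1 penalty from scikit-learn, which likewise performs embedded gene selection by driving coefficients to zero. The synthetic data and parameter values are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

# Synthetic stand-in for a gene expression matrix: 120 samples, 500 "genes".
X, y = make_classification(n_samples=120, n_features=500, n_informative=15,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# One-vs-rest linear SVM with an L1 penalty; the penalty zeroes most gene
# coefficients, giving an embedded variable selection.
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1,
                max_iter=10000).fit(X, y)

selected = np.unique(np.nonzero(clf.coef_)[1])
print(f"{selected.size} genes retained out of {X.shape[1]}")
```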
Annealed Importance Sampling Reversible Jump MCMC algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karagiannis, Georgios; Andrieu, Christophe
2013-03-20
It will soon be 20 years since reversible jump Markov chain Monte Carlo (RJ-MCMC) algorithms were proposed. They have significantly extended the scope of Markov chain Monte Carlo simulation methods, offering the promise to be able to routinely tackle transdimensional sampling problems, as encountered in Bayesian model selection problems for example, in a principled and flexible fashion. Their practical efficient implementation, however, still remains a challenge. A particular difficulty encountered in practice is in the choice of the dimension matching variables (both their nature and their distribution) and the reversible transformations which allow one to define the one-to-one mappings underpinning the design of these algorithms. Indeed, even seemingly sensible choices can lead to algorithms with very poor performance. The focus of this paper is the development and performance evaluation of a method, annealed importance sampling RJ-MCMC (aisRJ), which addresses this problem by mitigating the sensitivity of RJ-MCMC algorithms to the aforementioned poor design. As we shall see the algorithm can be understood as being an “exact approximation” of an idealized MCMC algorithm that would sample from the model probabilities directly in a model selection set-up. Such an idealized algorithm may have good theoretical convergence properties, but typically cannot be implemented, and our algorithms can approximate the performance of such idealized algorithms to an arbitrary degree while not introducing any bias for any degree of approximation. Our approach combines the dimension matching ideas of RJ-MCMC with annealed importance sampling and its Markov chain Monte Carlo implementation. We illustrate the performance of the algorithm with numerical simulations which indicate that, although the approach may at first appear computationally involved, it is in fact competitive.
NASA Astrophysics Data System (ADS)
Dao, Son Duy; Abhary, Kazem; Marian, Romeo
2017-06-01
Integration of production planning and scheduling is a class of problems commonly found in manufacturing industry. This class of problems associated with precedence constraint has been previously modeled and optimized by the authors, in which, it requires a multidimensional optimization at the same time: what to make, how many to make, where to make and the order to make. It is a combinatorial, NP-hard problem, for which no polynomial time algorithm is known to produce an optimal result on a random graph. In this paper, the further development of Genetic Algorithm (GA) for this integrated optimization is presented. Because of the dynamic nature of the problem, the size of its solution is variable. To deal with this variability and find an optimal solution to the problem, GA with new features in chromosome encoding, crossover, mutation, selection as well as algorithm structure is developed herein. With the proposed structure, the proposed GA is able to "learn" from its experience. Robustness of the proposed GA is demonstrated by a complex numerical example in which performance of the proposed GA is compared with those of three commercial optimization solvers.
Popov, Milen; Sotiriadis, Charalampos; Gay, Frederique; Jouannic, Anne-Marie; Lachenal, Yann; Hajdu, Steven D; Doenz, Francesco; Qanadli, Salah D
2017-04-01
To report our experience using a multilevel patient management algorithm to direct transarterial embolization (TAE) in managing spontaneous intramuscular hematoma (SIMH). From May 2006 to January 2014, twenty-seven patients with SIMH had been referred for TAE to our Radiology department. Clinical status and coagulation characteristics of the patients are analyzed. An algorithm integrating CT findings is suggested to manage SIMH. Patients were classified into three groups: Type I, SIMH with no active bleeding (AB); Type II, SIMH with AB and no muscular fascia rupture (MFR); and Type III, SIMH with MFR and AB. Type II is furthermore subcategorized as IIa, IIb and IIc. Types IIb, IIc and III were considered for TAE. The method of embolization as well as the materials used are described. Continuous variables are presented as mean ± SD. Categorical variables are reported as percentages. Technical success, clinical success, complications and 30-day mortality (d30 M) were analyzed. Two patients (7.5%) had Type IIb, four (15%) Type IIc and 21 (77.5%) presented Type III. The detailed CT and CTA findings, embolization procedure and materials used are described. Technical success was 96% with a complication rate of 4%. Clinical success was 88%. The bleeding-related thirty-day mortality was 15% (all with Type III). TAE is a safe and efficient technique to control bleeding that should be considered in selected SIMH as soon as possible. The proposed algorithm integrating CT features provides a comprehensive chart to select patients for TAE.
Analysis of decentralized variable structure control for collective search by mobile robots
NASA Astrophysics Data System (ADS)
Goldsmith, Steven Y.; Feddema, John T.; Robinett, Rush D., III
1998-10-01
This paper presents an analysis of a decentralized coordination strategy for organizing and controlling a team of mobile robots performing collective search. The alpha-beta coordination strategy is a family of collective search algorithms that allow teams of communicating robots to implicitly coordinate their search activities through a division of labor based on self-selected roles. In an alpha-beta team, alpha agents are motivated to improve their status by exploring new regions of the search space. Beta agents are conservative, and rely on the alpha agents to provide advanced information on favorable regions of the search space. An agent selects its current role dynamically based on its current status value relative to the current status values of the other team members. Status is determined by some function of the agent's sensor readings, and is generally a measurement of source intensity at the agent's current location. Variations on the decision rules determining alpha and beta behavior produce different versions of the algorithm that lead to different global properties. The alpha-beta strategy is based on a simple finite-state machine that implements a form of Variable Structure Control (VSC). The VSC system changes the dynamics of the collective system by abruptly switching at defined states to alternative control laws. In VSC, Lyapunov's direct method is often used to design control surfaces which guide the system to a given goal. We introduce the alpha-beta algorithm and present an analysis of the equilibrium point and the global stability of the alpha-beta algorithm based on Lyapunov's method.
Johnson, Kevin J; Wright, Bob W; Jarman, Kristin H; Synovec, Robert E
2003-05-09
A rapid retention time alignment algorithm was developed as a preprocessing utility to be used prior to chemometric analysis of large datasets of diesel fuel profiles obtained using gas chromatography (GC). Retention time variation from chromatogram-to-chromatogram has been a significant impediment against the use of chemometric techniques in the analysis of chromatographic data due to the inability of current chemometric techniques to correctly model information that shifts from variable to variable within a dataset. The alignment algorithm developed is shown to increase the efficacy of pattern recognition methods applied to diesel fuel chromatograms by retaining chemical selectivity while reducing chromatogram-to-chromatogram retention time variations and to do so on a time scale that makes analysis of large sets of chromatographic data practical. Two sets of diesel fuel gas chromatograms were studied using the novel alignment algorithm followed by principal component analysis (PCA). In the first study, retention times for corresponding chromatographic peaks in 60 chromatograms varied by as much as 300 ms between chromatograms before alignment. In the second study of 42 chromatograms, the retention time shifting exhibited was on the order of 10 s between corresponding chromatographic peaks, and required a coarse retention time correction prior to alignment with the algorithm. In both cases, an increase in retention time precision afforded by the algorithm was clearly visible in plots of overlaid chromatograms before and then after applying the retention time alignment algorithm. Using the alignment algorithm, the standard deviation for corresponding peak retention times following alignment was 17 ms throughout a given chromatogram, corresponding to a relative standard deviation of 0.003% at an average retention time of 8 min. This level of retention time precision is a 5-fold improvement over the retention time precision initially provided by a state-of-the-art GC instrument equipped with electronic pressure control and was critical to the performance of the chemometric analysis. This increase in retention time precision does not come at the expense of chemical selectivity, since the PCA results suggest that essentially all of the chemical selectivity is preserved. Cluster resolution between dissimilar groups of diesel fuel chromatograms in a two-dimensional scores space generated with PCA is shown to substantially increase after alignment. The alignment method is robust against missing or extra peaks relative to a target chromatogram used in the alignment, and operates at high speed, requiring roughly 1 s of computation time per GC chromatogram.
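The published alignment algorithm is not reproduced here. As a rough sketch of the underlying idea, the snippet below estimates a single retention-time shift between a chromatogram and a target chromatogram by cross-correlation and applies it; the actual method additionally handles local shifts and missing or extra peaks, and all names and values are illustrative.

```python
import numpy as np

def align_to_target(chrom, target, max_shift=50):
    """Estimate the lag (in sampling points) that best overlays chrom on target.

    The lag is chosen by maximizing the dot product within +/- max_shift
    points; np.roll wraps the trace at the edges, which is acceptable for a
    sketch but not for production use.
    """
    lags = np.arange(-max_shift, max_shift + 1)
    scores = [np.dot(np.roll(chrom, lag), target) for lag in lags]
    best = int(lags[np.argmax(scores)])
    return np.roll(chrom, best), best

# Example: a Gaussian peak displaced by 12 sampling points is shifted back.
t = np.linspace(0, 10, 1000)
target = np.exp(-((t - 5.0) ** 2) / 0.05)
aligned, lag = align_to_target(np.roll(target, 12), target)
print("estimated lag:", lag)  # -12
```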
Cider fermentation process monitoring by Vis-NIR sensor system and chemometrics.
Villar, Alberto; Vadillo, Julen; Santos, Jose I; Gorritxategi, Eneko; Mabe, Jon; Arnaiz, Aitor; Fernández, Luis A
2017-04-15
Optimization of a multivariate calibration process has been undertaken for a Visible-Near Infrared (400-1100nm) sensor system, applied in the monitoring of the fermentation process of the cider produced in the Basque Country (Spain). The main parameters that were monitored included alcoholic proof, l-lactic acid content, glucose+fructose and acetic acid content. The multivariate calibration was carried out using a combination of different variable selection techniques and the most suitable pre-processing strategies were selected based on the spectra characteristics obtained by the sensor system. The variable selection techniques studied in this work include Martens Uncertainty test, interval Partial Least Square Regression (iPLS) and Genetic Algorithm (GA). This procedure arises from the need to improve the calibration models prediction ability for cider monitoring. Copyright © 2016 Elsevier Ltd. All rights reserved.
Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario
2014-01-01
Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of the data, the type of variables, and the purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are studied using several medical examples. We present two ordinal-variable clustering examples, ordinal variables being more challenging to analyze, using the Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is achieved. Moreover, descriptive and inferential statistics, in addition to the modeling approach, must be selected based on the scale of the variables. PMID:24672565
NASA Astrophysics Data System (ADS)
Sahabiev, I. A.; Ryazanov, S. S.; Kolcova, T. G.; Grigoryan, B. R.
2018-03-01
The three most common techniques to interpolate soil properties at a field scale—ordinary kriging (OK), regression kriging with multiple linear regression drift model (RK + MLR), and regression kriging with principal component regression drift model (RK + PCR)—were examined. The results of the performed study were compiled into an algorithm of choosing the most appropriate soil mapping technique. Relief attributes were used as the auxiliary variables. When spatial dependence of a target variable was strong, the OK method showed more accurate interpolation results, and the inclusion of the auxiliary data resulted in an insignificant improvement in prediction accuracy. According to the algorithm, the RK + PCR method effectively eliminates multicollinearity of explanatory variables. However, if the number of predictors is less than ten, the probability of multicollinearity is reduced, and application of the PCR becomes irrational. In that case, the multiple linear regression should be used instead.
Sun, Wei; Huang, Guo H; Zeng, Guangming; Qin, Xiaosheng; Yu, Hui
2011-03-01
It is widely known that variation of the C/N ratio is dependent on many state variables during composting processes. This study attempted to develop a genetic algorithm aided stepwise cluster analysis (GASCA) method to describe the nonlinear relationships between the selected state variables and the C/N ratio in food waste composting. The experimental data from six bench-scale composting reactors were used to demonstrate the applicability of GASCA. Within the GASCA framework, GA searched optimal sets of both specified state variables and SCA's internal parameters; SCA established statistical nonlinear relationships between state variables and the C/N ratio; to avoid unnecessary and time-consuming calculation, a proxy table was introduced to save around 70% of the computational effort. The obtained GASCA cluster trees had smaller sizes and higher prediction accuracy than the conventional SCA trees. Based on the optimal GASCA tree, the effects of the GA-selected state variables on the C/N ratio were ranked in descending order as: NH₄+-N concentration>Moisture content>Ash Content>Mean Temperature>Mesophilic bacteria biomass. Such a ranking implied that the variation of ammonium nitrogen concentration, the associated temperature and moisture conditions, the total loss of both organic matter and available mineral constituents, and the mesophilic bacteria activity were critical factors affecting the C/N ratio during the investigated food waste composting. This first application of GASCA to composting modelling indicated that more direct search algorithms could be coupled with SCA or other multivariate analysis methods to analyze complicated relationships during composting and many other environmental processes. Copyright © 2010 Elsevier B.V. All rights reserved.
Hu, Yi; Loizou, Philipos C
2010-06-01
Attempts to develop noise-suppression algorithms that can significantly improve speech intelligibility in noise by cochlear implant (CI) users have met with limited success. This is partly because algorithms were sought that would work equally well in all listening situations. Accomplishing this has been quite challenging given the variability in the temporal/spectral characteristics of real-world maskers. A different approach is taken in the present study focused on the development of environment-specific noise suppression algorithms. The proposed algorithm selects a subset of the envelope amplitudes for stimulation based on the signal-to-noise ratio (SNR) of each channel. Binary classifiers, trained using data collected from a particular noisy environment, are first used to classify the mixture envelopes of each channel as either target-dominated (SNR>or=0 dB) or masker-dominated (SNR<0 dB). Only target-dominated channels are subsequently selected for stimulation. Results with CI listeners indicated substantial improvements (by nearly 44 percentage points at 5 dB SNR) in intelligibility with the proposed algorithm when tested with sentences embedded in three real-world maskers. The present study demonstrated that the environment-specific approach to noise reduction has the potential to restore speech intelligibility in noise to a level near to that attained in quiet.
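As a schematic of the channel-selection rule described above (not the trained binary classifiers themselves), the snippet below keeps only envelope channels whose estimated SNR meets the 0 dB criterion. Array shapes and variable names are illustrative placeholders.

```python
import numpy as np

def select_channels(mixture_env, snr_estimates, threshold_db=0.0):
    """Zero out masker-dominated channels.

    mixture_env   : (n_channels, n_frames) envelope amplitudes of the noisy mixture
    snr_estimates : per-channel, per-frame SNR estimates in dB (assumed to come
                    from a trained classifier, as in the study)
    Channels with SNR >= threshold_db are kept; the rest are suppressed.
    """
    keep = snr_estimates >= threshold_db
    return mixture_env * keep

# Toy example: 8 channels x 4 frames of random envelopes and SNR estimates
rng = np.random.default_rng(0)
env = rng.random((8, 4))
snr = rng.normal(0, 5, (8, 4))
print(select_channels(env, snr))
```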
Frequency Management for Electromagnetic Continuous Wave Conductivity Meters
Mazurek, Przemyslaw; Putynkowski, Grzegorz
2016-01-01
Ground conductivity meters use electromagnetic fields for the mapping of geological variations, like the determination of water amount, depending on ground layers, which is important for the state analysis of embankments. The VLF band is contaminated by numerous natural and artificial electromagnetic interference signals. Selecting the meter’s working frequency prior to the determination of ground conductivity is not possible, due to the variable frequency of the interferences. Frequency management based on the analysis of the selected band using track-before-detect (TBD) algorithms, which allows dynamic frequency changes of the conductivity meter’s transmitting part, is proposed in the paper. Naive maximum value search, spatio-temporal TBD (ST-TBD), Viterbi TBD and a new algorithm that uses combined ST-TBD and Viterbi TBD are compared. Monte Carlo tests are provided for the numerical analysis of the properties for a single interference signal in the considered band, and a new approach based on combined ST-TBD and Viterbi algorithms shows the best performance. The considered algorithms process spectrogram data for the selected band, so DFT (Discrete Fourier Transform) could be applied for the computation of the spectrogram. Real-time properties, related to the latency, are discussed also, and it is shown that TBD algorithms are feasible for real applications. PMID:27070608
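A minimal sketch of the spectrogram-based band monitoring that the naive maximum-value-search baseline operates on is shown below. The synthetic signal, band limits, and FFT settings are arbitrary stand-ins, and the TBD algorithms themselves are not implemented here.

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic stand-in for a sampled VLF-band record containing one interferer.
fs = 48_000
t = np.arange(0, 2.0, 1 / fs)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 7_300 * t) + 0.1 * rng.standard_normal(t.size)

f, times, Sxx = spectrogram(x, fs=fs, nperseg=4096)

# Naive strategy from the paper's comparison: pick the frequency bin whose
# maximum interference power over the observation window is lowest.
band = (f > 5_000) & (f < 15_000)
quietest = f[band][np.argmin(Sxx[band].max(axis=1))]
print(f"suggested working frequency: {quietest:.0f} Hz")
```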
NASA Astrophysics Data System (ADS)
Bonissone, Stefano R.
2001-11-01
There are many approaches to solving multi-objective optimization problems using evolutionary algorithms. We need to select methods for representing and aggregating preferences, as well as choosing strategies for searching in multi-dimensional objective spaces. First we suggest the use of linguistic variables to represent preferences and the use of fuzzy rule systems to implement tradeoff aggregations. After a review of alternative EA methods for multi-objective optimization, we explore the use of multi-sexual genetic algorithms (MSGA). In using an MSGA, we need to modify certain parts of the GAs, namely the selection and crossover operations. The selection operator groups solutions according to their gender tag to prepare them for crossover. The crossover is modified by appending a gender tag at the end of the chromosome. We use single and double point crossovers. We determine the gender of the offspring by the amount of genetic material provided by each parent. The parent that contributed the most to the creation of a specific offspring determines the gender that the offspring will inherit. This is still a work in progress, and in the conclusion we examine many future extensions and experiments.
Geraci, Joseph; Dharsee, Moyez; Nuin, Paulo; Haslehurst, Alexandria; Koti, Madhuri; Feilotter, Harriet E; Evans, Ken
2014-03-01
We introduce a novel method for visualizing high dimensional data via a discrete dynamical system. This method provides a 2D representation of the relationship between subjects according to a set of variables without geometric projections, transformed axes or principal components. The algorithm exploits a memory-type mechanism inherent in a certain class of discrete dynamical systems collectively referred to as the chaos game that are closely related to iterative function systems. The goal of the algorithm was to create a human readable representation of high dimensional patient data that was capable of detecting unrevealed subclusters of patients from within anticipated classifications. This provides a mechanism to further pursue a more personalized exploration of pathology when used with medical data. For clustering and classification protocols, the dynamical system portion of the algorithm is designed to come after some feature selection filter and before some model evaluation (e.g. clustering accuracy) protocol. In the version given here, a univariate features selection step is performed (in practice more complex feature selection methods are used), a discrete dynamical system is driven by this reduced set of variables (which results in a set of 2D cluster models), these models are evaluated for their accuracy (according to a user-defined binary classification) and finally a visual representation of the top classification models are returned. Thus, in addition to the visualization component, this methodology can be used for both supervised and unsupervised machine learning as the top performing models are returned in the protocol we describe here. Butterfly, the algorithm we introduce and provide working code for, uses a discrete dynamical system to classify high dimensional data and provide a 2D representation of the relationship between subjects. We report results on three datasets (two in the article; one in the appendix) including a public lung cancer dataset that comes along with the included Butterfly R package. In the included R script, a univariate feature selection method is used for the dimension reduction step, but in the future we wish to use a more powerful multivariate feature reduction method based on neural networks (Kriesel, 2007). A script written in R (designed to run on R studio) accompanies this article that implements this algorithm and is available at http://butterflygeraci.codeplex.com/. For details on the R package or for help installing the software refer to the accompanying document, Supporting Material and Appendix.
ERIC Educational Resources Information Center
Tsaparlis, Georgios
2005-01-01
This work provides a correlation study of the role of the following cognitive variables on problem solving in elementary physical chemistry: scientific reasoning (level of intellectual development/developmental level), working-memory capacity, functional mental ("M") capacity, and disembedding ability (i.e., degree of perceptual field…
Water quality parameter measurement using spectral signatures
NASA Technical Reports Server (NTRS)
White, P. E.
1973-01-01
Regression analysis is applied to the problem of measuring water quality parameters from remote sensing spectral signature data. The equations necessary to perform regression analysis are presented and methods of testing the strength and reliability of a regression are described. An efficient algorithm for selecting an optimal subset of the independent variables available for a regression is also presented.
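The report's subset-selection algorithm is not specified in this abstract; the snippet below shows a generic forward stepwise procedure driven by cross-validated R², as one common way to select an optimal regressor subset. All names, thresholds, and the toy data are our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_vars=5, cv=5):
    """Greedy forward selection of regressors by cross-validated R^2."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_vars:
        scores = {j: cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=cv).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # no remaining candidate improves the fit
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_score

# Toy usage: the response depends on the first two of ten spectral variables.
rng = np.random.default_rng(0)
X = rng.random((60, 10))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.05, 60)
print(forward_select(X, y))
```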
Preliminary Design of a Manned Nuclear Electric Propulsion Vehicle Using Genetic Algorithms
NASA Technical Reports Server (NTRS)
Irwin, Ryan W.; Tinker, Michael L.
2005-01-01
Nuclear electric propulsion (NEP) vehicles will be needed for future manned missions to Mars and beyond. Candidate designs must be identified for further detailed design from a large array of possibilities. Genetic algorithms have proven their utility in conceptual design studies by effectively searching a large design space to pinpoint unique optimal designs. This research combined analysis codes for NEP subsystems with a genetic algorithm. The use of penalty functions with scaling ratios was investigated to increase computational efficiency. Also, the selection of design variables for optimization was considered to reduce computation time without losing beneficial design search space. Finally, trend analysis of a reference mission to the asteroids yielded a group of candidate designs for further analysis.
Fernandes, David Douglas Sousa; Gomes, Adriano A; Costa, Gean Bezerra da; Silva, Gildo William B da; Véras, Germano
2011-12-15
This work evaluates the use of the visible and near-infrared (NIR) ranges, separately and combined, to determine the biodiesel content in biodiesel/diesel blends using Multiple Linear Regression (MLR) and variable selection by the Successive Projections Algorithm (SPA). Full-spectrum models employing Partial Least Squares (PLS), variable selection by Stepwise (SW) regression coupled with Multiple Linear Regression (MLR), and PLS models with variable selection by Jack-Knife (Jk) were compared with the proposed methodology. Several preprocessing strategies were evaluated; a Savitzky-Golay derivative with a second-order polynomial and a 17-point window was chosen for the NIR and visible-NIR ranges, with offset correction. A total of 100 blends with biodiesel content between 5 and 50% (v/v) were prepared from ten samples of biodiesel. In the NIR and visible regions the best model was the SPA-MLR, using only two and eight wavelengths with RMSEP of 0.6439% (v/v) and 0.5741, respectively, while in the visible-NIR region the best model was the SW-MLR, using five wavelengths with an RMSEP of 0.9533% (v/v). Results indicate that both spectral ranges evaluated show potential for developing a rapid and nondestructive method to quantify biodiesel in blends with mineral diesel. Finally, the improvement in prediction error obtained with the variable selection procedure was significant. Copyright © 2011 Elsevier B.V. All rights reserved.
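To make the preprocessing-plus-MLR pipeline concrete, the sketch below applies a Savitzky-Golay derivative with the second-order polynomial and 17-point window mentioned above and then fits an MLR model on a couple of wavelengths. The random spectra, the assumption of a first derivative, and the chosen wavelength indices are placeholders, not the SPA-selected variables.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data: spectra (n_samples, n_wavelengths) and biodiesel content (% v/v).
rng = np.random.default_rng(1)
spectra = rng.random((100, 600))
y = rng.uniform(5, 50, 100)

# Savitzky-Golay derivative, second-order polynomial, 17-point window
# (first derivative assumed here).
d1 = savgol_filter(spectra, window_length=17, polyorder=2, deriv=1, axis=1)

# MLR on a small set of wavelengths; the indices are arbitrary stand-ins.
selected = [120, 310]
model = LinearRegression().fit(d1[:, selected], y)
rmse = mean_squared_error(y, model.predict(d1[:, selected])) ** 0.5
print(f"RMSE on the toy data: {rmse:.3f}")
```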
Reid, Colleen E; Jerrett, Michael; Petersen, Maya L; Pfister, Gabriele G; Morefield, Philip E; Tager, Ira B; Raffuse, Sean M; Balmes, John R
2015-03-17
Estimating population exposure to particulate matter during wildfires can be difficult because of insufficient monitoring data to capture the spatiotemporal variability of smoke plumes. Chemical transport models (CTMs) and satellite retrievals provide spatiotemporal data that may be useful in predicting PM2.5 during wildfires. We estimated PM2.5 concentrations during the 2008 northern California wildfires using 10-fold cross-validation (CV) to select an optimal prediction model from a set of 11 statistical algorithms and 29 predictor variables. The variables included CTM output, three measures of satellite aerosol optical depth, distance to the nearest fires, meteorological data, and land use, traffic, spatial location, and temporal characteristics. The generalized boosting model (GBM) with 29 predictor variables had the lowest CV root mean squared error and a CV-R2 of 0.803. The most important predictor variable was the Geostationary Operational Environmental Satellite Aerosol/Smoke Product (GASP) Aerosol Optical Depth (AOD), followed by the CTM output and distance to the nearest fire cluster. Parsimonious models with various combinations of fewer variables also predicted PM2.5 well. Using machine learning algorithms to combine spatiotemporal data from satellites and CTMs can reliably predict PM2.5 concentrations during a major wildfire event.
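A minimal scikit-learn analogue of the model-evaluation step is sketched below: a gradient boosting regressor scored by 10-fold cross-validation, with variable importances standing in for the predictor ranking. The random arrays are placeholders, not the wildfire dataset, and the model settings are defaults rather than the tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, KFold

# Placeholder predictors (e.g. AOD retrievals, CTM output, distance to fire,
# meteorology) and observed PM2.5.
rng = np.random.default_rng(0)
X = rng.random((500, 29))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, 500)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
gbm = GradientBoostingRegressor(random_state=0)
r2 = cross_val_score(gbm, X, y, cv=cv, scoring="r2")
rmse = -cross_val_score(gbm, X, y, cv=cv, scoring="neg_root_mean_squared_error")
print(f"CV R^2 = {r2.mean():.3f}, CV RMSE = {rmse.mean():.3f}")

# Variable importances from a fit on all data mirror the ranking step.
importances = gbm.fit(X, y).feature_importances_
print("top predictor index:", int(np.argmax(importances)))
```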
A multipopulation PSO based memetic algorithm for permutation flow shop scheduling.
Liu, Ruochen; Ma, Chenlin; Ma, Wenping; Li, Yangyang
2013-01-01
The permutation flow shop scheduling problem (PFSSP) is part of production scheduling and belongs among the hardest combinatorial optimization problems. In this paper, a multipopulation particle swarm optimization (PSO) based memetic algorithm (MPSOMA) is proposed. In the proposed algorithm, the whole particle swarm population is divided into three subpopulations, in which each particle evolves itself by the standard PSO and then each subpopulation is updated by using different local search schemes such as variable neighborhood search (VNS) and an individual improvement scheme (IIS). Then, the best particle of each subpopulation is selected to construct a probabilistic model by using an estimation of distribution algorithm (EDA), and three particles are sampled from the probabilistic model to update the worst individual in each subpopulation. The best particle in the entire particle swarm is used to update the global optimal solution. The proposed MPSOMA is compared with two recently proposed algorithms, namely, the PSO based memetic algorithm (PSOMA) and hybrid particle swarm optimization with estimation of distribution algorithm (PSOEDA), on 29 well-known PFSSPs taken from the OR-Library, and the experimental results show that it is an effective approach for the PFSSP.
Müller, Aline Lima Hermes; Picoloto, Rochele Sogari; de Azevedo Mello, Paola; Ferrão, Marco Flores; de Fátima Pereira dos Santos, Maria; Guimarães, Regina Célia Lourenço; Müller, Edson Irineu; Flores, Erico Marlon Moraes
2012-04-01
Total sulfur concentration was determined in atmospheric residue (AR) and vacuum residue (VR) samples obtained from petroleum distillation process by Fourier transform infrared spectroscopy with attenuated total reflectance (FT-IR/ATR) in association with chemometric methods. Calibration and prediction set consisted of 40 and 20 samples, respectively. Calibration models were developed using two variable selection models: interval partial least squares (iPLS) and synergy interval partial least squares (siPLS). Different treatments and pre-processing steps were also evaluated for the development of models. The pre-treatment based on multiplicative scatter correction (MSC) and the mean centered data were selected for models construction. The use of siPLS as variable selection method provided a model with root mean square error of prediction (RMSEP) values significantly better than those obtained by PLS model using all variables. The best model was obtained using siPLS algorithm with spectra divided in 20 intervals and combinations of 3 intervals (911-824, 823-736 and 737-650 cm(-1)). This model produced a RMSECV of 400 mg kg(-1) S and RMSEP of 420 mg kg(-1) S, showing a correlation coefficient of 0.990. Copyright © 2011 Elsevier B.V. All rights reserved.
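The interval-PLS idea can be sketched as follows: the spectrum is split into equal intervals, a separate PLS model is cross-validated on each, and the best-performing interval is reported. The combination of several intervals used in siPLS and the spectral pre-treatments are omitted, and all settings and the toy data below are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def ipls_best_interval(X, y, n_intervals=20, n_components=3, cv=5):
    """Score each spectral interval with its own PLS model (iPLS-style).

    Returns the index of the interval with the lowest cross-validated RMSE
    and the per-interval RMSE values.
    """
    edges = np.linspace(0, X.shape[1], n_intervals + 1, dtype=int)
    rmse = []
    for i in range(n_intervals):
        Xi = X[:, edges[i]:edges[i + 1]]
        n_comp = min(n_components, Xi.shape[1])
        scores = cross_val_score(PLSRegression(n_components=n_comp), Xi, y,
                                 cv=cv, scoring="neg_root_mean_squared_error")
        rmse.append(-scores.mean())
    return int(np.argmin(rmse)), rmse

# Toy usage with random spectra whose signal sits in one spectral region
rng = np.random.default_rng(0)
X = rng.random((40, 400))
y = X[:, 100:120].sum(axis=1) + rng.normal(0, 0.1, 40)
best, _ = ipls_best_interval(X, y)
print("best interval index:", best)
```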
Arnold, J B; Liow, J S; Schaper, K A; Stern, J J; Sled, J G; Shattuck, D W; Worth, A J; Cohen, M S; Leahy, R M; Mazziotta, J C; Rottenberg, D A
2001-05-01
The desire to correct intensity nonuniformity in magnetic resonance images has led to the proliferation of nonuniformity-correction (NUC) algorithms with different theoretical underpinnings. In order to provide end users with a rational basis for selecting a given algorithm for a specific neuroscientific application, we evaluated the performance of six NUC algorithms. We used simulated and real MRI data volumes, including six repeat scans of the same subject, in order to rank the accuracy, precision, and stability of the nonuniformity corrections. We also compared algorithms using data volumes from different subjects and different (1.5T and 3.0T) MRI scanners in order to relate differences in algorithmic performance to intersubject variability and/or differences in scanner performance. In phantom studies, the correlation of the extracted with the applied nonuniformity was highest in the transaxial (left-to-right) direction and lowest in the axial (top-to-bottom) direction. Two of the six algorithms demonstrated a high degree of stability, as measured by the iterative application of the algorithm to its corrected output. While none of the algorithms performed ideally under all circumstances, locally adaptive methods generally outperformed nonadaptive methods. Copyright 2001 Academic Press.
Stochastic model search with binary outcomes for genome-wide association studies
Malovini, Alberto; Puca, Annibale A; Bellazzi, Riccardo
2012-01-01
Objective The spread of case–control genome-wide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses. Materials and methods Our method is based on a latent variable model that links the observed outcomes to the underlying genetic variables. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor. Results BOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. Simulations show that BOSS achieves higher precisions than the reference methods while preserving good recall rates. In both experimental studies, BOSS successfully detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes. Discussion BOSS outperforms the other methods in terms of F-measure on simulated data. In the two real studies, BOSS successfully detects biologically relevant features, some of which are missed by univariate analysis and the three reference techniques. Conclusion The proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results showed that BOSS proves effective in detecting relevant markers while providing a parsimonious model. PMID:22534080
Münnich, Timo; Klein, Jan; Hattingen, Elke; Noack, Anika; Herrmann, Eva; Seifert, Volker; Senft, Christian; Forster, Marie-Therese
2018-04-14
Tractography is a popular tool for visualizing the corticospinal tract (CST). However, results may be influenced by numerous variables, eg, the selection of seeding regions of interests (ROIs) or the chosen tracking algorithm. To compare different variable sets by correlating tractography results with intraoperative subcortical stimulation of the CST, correcting intraoperative brain shift by the use of intraoperative MRI. Seeding ROIs were created by means of motor cortex segmentation, functional MRI (fMRI), and navigated transcranial magnetic stimulation (nTMS). Based on these ROIs, tractography was run for each patient using a deterministic and a probabilistic algorithm. Tractographies were processed on pre- and postoperatively acquired data. Using a linear mixed effects statistical model, best correlation between subcortical stimulation intensity and the distance between tractography and stimulation sites was achieved by using the segmented motor cortex as seeding ROI and applying the probabilistic algorithm on preoperatively acquired imaging sequences. Tractographies based on fMRI or nTMS results differed very little, but with enlargement of positive nTMS sites the stimulation-distance correlation of nTMS-based tractography improved. Our results underline that the use of tractography demands for careful interpretation of its virtual results by considering all influencing variables.
Bisele, Maria; Bencsik, Martin; Lewis, Martin G C; Barnett, Cleveland T
2017-01-01
Assessment methods in human locomotion often involve the description of normalised graphical profiles and/or the extraction of discrete variables. Whilst useful, these approaches may not represent the full complexity of gait data. Multivariate statistical methods, such as Principal Component Analysis (PCA) and Discriminant Function Analysis (DFA), have been adopted since they have the potential to overcome these data handling issues. The aim of the current study was to develop and optimise a specific machine learning algorithm for processing human locomotion data. Twenty participants ran at a self-selected speed across a 15m runway in barefoot and shod conditions. Ground reaction forces (BW) and kinematics were measured at 1000 Hz and 100 Hz, respectively from which joint angles (°), joint moments (N.m.kg-1) and joint powers (W.kg-1) for the hip, knee and ankle joints were calculated in all three anatomical planes. Using PCA and DFA, power spectra of the kinematic and kinetic variables were used as a training database for the development of a machine learning algorithm. All possible combinations of 10 out of 20 participants were explored to find the iteration of individuals that would optimise the machine learning algorithm. The results showed that the algorithm was able to successfully predict whether a participant ran shod or barefoot in 93.5% of cases. To the authors' knowledge, this is the first study to optimise the development of a machine learning algorithm.
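The PCA-plus-discriminant-analysis pipeline can be expressed compactly in scikit-learn as below. Linear discriminant analysis stands in for the DFA step, the random arrays replace the measured power spectra, and the number of retained components is arbitrary.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Placeholder data: power spectra of the kinematic/kinetic waveforms, one row
# per trial; labels 0 = barefoot, 1 = shod.
rng = np.random.default_rng(0)
X = rng.random((80, 200))
y = rng.integers(0, 2, 80)

clf = make_pipeline(StandardScaler(), PCA(n_components=10),
                    LinearDiscriminantAnalysis())
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"mean classification accuracy: {acc.mean():.2%}")
```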
Zhan, Xue-yan; Zhao, Na; Lin, Zhao-zhou; Wu, Zhi-sheng; Yuan, Rui-juan; Qiao, Yan-jiang
2014-12-01
The appropriate algorithm for calibration set selection is one of the key technologies for a good NIR quantitative model. There are different algorithms for calibration set selection, such as the Random Sampling (RS) algorithm, the Conventional Selection (CS) algorithm, the Kennard-Stone (KS) algorithm and the Sample set Partitioning based on joint x-y distance (SPXY) algorithm. However, systematic comparisons among these algorithms are lacking. NIR quantitative models to determine the asiaticoside content in Centella total glucosides were established in the present paper, for which 7 indexes were classified and selected, and the effects of the CS, KS and SPXY algorithms for calibration set selection on the accuracy and robustness of the NIR quantitative models were investigated. The accuracy indexes of NIR quantitative models with the calibration set selected by the SPXY algorithm were significantly different from those with the calibration set selected by the CS or KS algorithm, while the robustness indexes, such as RMSECV and |RMSEP-RMSEC|, were not significantly different. Therefore, the SPXY algorithm for calibration set selection can improve the predictive accuracy of NIR quantitative models to determine asiaticoside content in Centella total glucosides, and has no significant effect on the robustness of the models, which provides a reference for determining the appropriate algorithm for calibration set selection when NIR quantitative models are established for solid systems of traditional Chinese medicine.
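For reference, a sketch of the Kennard-Stone selection mentioned above is given below; SPXY extends the same greedy scheme by adding a normalised distance in the response variable, which is not shown here. The toy spectra and sample counts are placeholders.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone(X, n_select):
    """Kennard-Stone calibration set selection.

    Starts from the two most mutually distant samples, then repeatedly adds
    the candidate whose nearest selected sample is farthest away.
    """
    dist = cdist(X, X)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    while len(selected) < n_select:
        remaining = [k for k in range(len(X)) if k not in selected]
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_dist))])
    return selected

# Example: pick 20 calibration samples out of 100 synthetic spectra
rng = np.random.default_rng(0)
spectra = rng.random((100, 50))
print(kennard_stone(spectra, 20))
```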
Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data.
Becker, Natalia; Toedt, Grischa; Lichter, Peter; Benner, Axel
2011-05-09
Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds rapidly and more precisely a global optimal solution. Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often better predicted than Elastic Net in terms of misclassification error. Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations. The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. We were the first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions on the optimization of tuning parameters. The penalized SVM classification algorithms as well as fixed grid and interval search for finding appropriate tuning parameters were implemented in our freely available R package 'penalizedSVM'. We conclude that the Elastic SCAD SVM is a flexible and robust tool for classification and feature selection tasks for high-dimensional data such as microarray data sets.
Schmidt, Johannes; Glaser, Bruno
2016-01-01
Tropical forests are significant carbon sinks and their soils’ carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms: random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms—including the model tuning and predictor selection—were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged from 0.2 to 17.7 kg m-2, displaying a huge variability with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models’ predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward compared to backward selection procedures. Choosing predictors based on their individual performance was outperformed by the two procedures which accounted for predictor interaction. PMID:27128736
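A condensed scikit-learn analogue of the model-comparison protocol (five repetitions of tenfold cross-validation) is shown below. Only three of the five algorithm families are included, hyperparameter tuning and predictor selection are omitted, and the random arrays merely stand in for the terrain and spectral predictors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder predictors and SOC stocks (kg m^-2).
rng = np.random.default_rng(0)
X = rng.random((150, 30))
y = rng.uniform(0.2, 17.7, 150)

models = {
    "boosted regression trees": GradientBoostingRegressor(random_state=0),
    "random forest": RandomForestRegressor(random_state=0),
    "support vector machine": SVR(),
}
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=cv,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {rmse.mean():.2f} +/- {rmse.std():.2f}")
```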
NASA Technical Reports Server (NTRS)
Mukhopadhyay, V.; Newsom, J. R.; Abel, I.
1981-01-01
A method of synthesizing reduced-order optimal feedback control laws for a high-order system is developed. A nonlinear programming algorithm is employed to search for the control law design variables that minimize a performance index defined by a weighted sum of mean-square steady-state responses and control inputs. An analogy with the linear quadratic Gaussian solution is utilized to select a set of design variables and their initial values. To improve the stability margins of the system, an input-noise adjustment procedure is used in the design algorithm. The method is applied to the synthesis of an active flutter-suppression control law for a wind tunnel model of an aeroelastic wing. The reduced-order controller is compared with the corresponding full-order controller and found to provide nearly optimal performance. The performance of the present method appeared to be superior to that of two other control law order-reduction methods. It is concluded that by using the present algorithm, nearly optimal low-order control laws with good stability margins can be synthesized.
Zhang, Zhenzhen; O'Neill, Marie S; Sánchez, Brisa N
2016-04-01
Factor analysis is a commonly used method of modelling correlated multivariate exposure data. Typically, the measurement model is assumed to have constant factor loadings. However, from our preliminary analyses of the Environmental Protection Agency's (EPA's) PM 2.5 fine speciation data, we have observed that the factor loadings for four constituents change considerably in stratified analyses. Since invariance of factor loadings is a prerequisite for valid comparison of the underlying latent variables, we propose a factor model with non-constant factor loadings that change over time and space, modelled with P-splines penalized using the generalized cross-validation (GCV) criterion. The model is implemented using the Expectation-Maximization (EM) algorithm and we select the multiple spline smoothing parameters by minimizing the GCV criterion with Newton's method during each iteration of the EM algorithm. The algorithm is applied to a one-factor model that includes four constituents. Through bootstrap confidence bands, we find that the factor loading for total nitrate changes across seasons and geographic regions.
Barbier, Paolo; Alimento, Marina; Berna, Giovanni; Celeste, Fabrizio; Gentile, Francesco; Mantero, Antonio; Montericcio, Vincenzo; Muratori, Manuela
2007-05-01
Large files produced by standard compression algorithms slow down the spread of digital and tele-echocardiography. We validated high-grade compression of echocardiographic video with the new Moving Picture Experts Group (MPEG)-4 algorithms in a multicenter study. Seven expert cardiologists blindly scored (5-point scale) 165 uncompressed and compressed 2-dimensional and color Doppler video clips, based on combined diagnostic content and image quality (uncompressed files as references). One digital video and 3 MPEG-4 algorithms (WM9, MV2, and DivX) were used, the latter at 3 compression levels (0%, 35%, and 60%). Compressed file sizes decreased from 12 to 83 MB to 0.03 to 2.3 MB (1:1051-1:26 reduction ratios). The mean SD of differences was 0.81 for intraobserver variability (uncompressed and digital video files). Compared with uncompressed files, only the DivX mean score at 35% (P = .04) and 60% (P = .001) compression was significantly reduced. At subcategory analysis, these differences were still significant for gray-scale and fundamental imaging but not for color or second harmonic tissue imaging. Original image quality, session sequence, compression grade, and bitrate were all independent determinants of mean score. Our study supports the use of MPEG-4 algorithms to greatly reduce echocardiographic file sizes, thus facilitating archiving and transmission. Quality evaluation studies should account for the many independent variables that affect image quality grading.
An Algorithm for the Mixed Transportation Network Design Problem
Liu, Xinyu; Chen, Qun
2016-01-01
This paper proposes an optimization algorithm, the dimension-down iterative algorithm (DDIA), for solving a mixed transportation network design problem (MNDP), which is generally expressed as a mathematical program with equilibrium constraints (MPEC). The upper level of the MNDP aims to optimize the network performance via both the expansion of the existing links and the addition of new candidate links, whereas the lower level is a traditional Wardrop user equilibrium (UE) problem. The idea of the proposed solution algorithm (DDIA) is to reduce the dimensions of the problem. A group of variables (discrete/continuous) is fixed to optimize another group of variables (continuous/discrete) alternately; the problem is thereby transformed into solving a series of CNDPs (continuous network design problems) and DNDPs (discrete network design problems) repeatedly until it converges to the optimal solution. The advantage of the proposed algorithm is that its solution process is very simple and easy to apply. Numerical examples show that for the MNDP without a budget constraint, the optimal solution can be found within a few iterations with DDIA. For the MNDP with a budget constraint, however, the result depends on the selection of initial values, which leads to different optimal solutions (i.e., different local optimal solutions). Some thoughts are given on how to derive meaningful initial values, such as by considering the budgets of new and reconstruction projects separately. PMID:27626803
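The dimension-down idea (fix one group of variables, optimize the other, and alternate) can be illustrated with a deliberately small toy problem; the quadratic cost, link benefits and budget below are invented for illustration and are not the paper's MNDP/UE formulation.

```python
# Hedged toy illustration of alternating continuous/discrete optimization:
# the quadratic "congestion" cost, link benefits and budget are invented,
# not the MNDP with a Wardrop user-equilibrium lower level.
import itertools
import numpy as np

targets = np.array([4.0, 2.0, 6.0])      # "ideal" expansions of existing links
benefit = np.array([3.0, 1.5, 2.5])      # benefit of each candidate new link
cost = np.array([2.0, 1.0, 2.0])         # construction cost of each new link
budget = 3.0
coupling = 0.8                           # a new link reduces the expansion needed

def objective(x, y):
    return np.sum((x - (targets - coupling * y)) ** 2) - benefit @ y

y = np.zeros(3)                          # start with no new links
for _ in range(20):
    # CNDP-like step: with y fixed, the continuous optimum is closed-form here
    x = targets - coupling * y
    # DNDP-like step: with x fixed, enumerate feasible 0/1 link choices
    best_val, best_y = None, y
    for bits in itertools.product([0, 1], repeat=3):
        cand = np.array(bits, dtype=float)
        if cost @ cand <= budget and (best_val is None or objective(x, cand) < best_val):
            best_val, best_y = objective(x, cand), cand
    if np.array_equal(best_y, y):
        break                            # no change in the discrete variables: converged
    y = best_y

print("new links:", y.astype(int), "expansions:", np.round(x, 2))
```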
NASA Technical Reports Server (NTRS)
Nobbs, Steven G.
1995-01-01
An overview of the performance seeking control (PSC) algorithm and details of the important components of the algorithm are given. The onboard propulsion system models, the linear programming optimization, and the engine control interface are described. The PSC algorithm receives input from various computers on the aircraft, including the digital flight computer, digital engine control, and electronic inlet control. The PSC algorithm contains compact models of the propulsion system including the inlet, engine, and nozzle. The models compute propulsion system parameters, such as inlet drag and fan stall margin, which are not directly measurable in flight. The compact models also compute sensitivities of the propulsion system parameters to changes in the control variables. The engine model consists of a linear steady state variable model (SSVM) and a nonlinear model. The SSVM is updated with efficiency factors calculated in the engine model update logic, or Kalman filter. The efficiency factors are used to adjust the SSVM to match the actual engine. The propulsion system models are mathematically integrated to form an overall propulsion system model. The propulsion system model is then optimized using a linear programming optimization scheme. The goal of the optimization is determined from the selected PSC mode of operation. The resulting trims are used to compute a new operating point about which the optimization process is repeated. This process is continued until an overall (global) optimum is reached before applying the trims to the controllers.
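The linear-programming step can be sketched as follows; the cost coefficients, constraint rows and trim limits are invented placeholders, not the PSC propulsion-model sensitivities.

```python
# Hedged sketch of the LP optimization step: minimize a linearized
# performance index over control-variable trims subject to linearized
# constraints.  All numbers are invented placeholders.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, -0.5, 0.2])          # linearized cost sensitivity to each trim
A_ub = np.array([[0.3, 0.1, 0.0],       # e.g. a stall-margin-like constraint
                 [0.0, 0.2, 0.4]])      # e.g. a temperature-limit-like constraint
b_ub = np.array([0.05, 0.08])
bounds = [(-0.1, 0.1)] * 3              # trim authority limits

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("optimal trims:", np.round(res.x, 3), "predicted improvement:", round(-res.fun, 4))
```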
Supercomputing on massively parallel bit-serial architectures
NASA Technical Reports Server (NTRS)
Iobst, Ken
1985-01-01
Research on the Goodyear Massively Parallel Processor (MPP) suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined train broadcast, and conditional branching at the processing element (PE) control unit level. The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. A description is given of the MPP design. Features of the system and its operation are illustrated in the form of charts and diagrams.
NASA Astrophysics Data System (ADS)
Nurhaida, Subanar, Abdurakhman, Abadi, Agus Maman
2017-08-01
Seismic data are usually modelled using autoregressive processes. The aim of this paper is to find the arrival times of the seismic waves of Mt. Rinjani in Indonesia. Kitagawa's algorithm is used to detect the seismic P- and S-waves. The Householder transformation used in the algorithm makes it effective in finding the number of change points and the parameters of the autoregressive models. The results show that the use of the Box-Cox transformation at the variable selection level makes the algorithm work well in detecting the change points. Furthermore, when the basic span of the subinterval is set to 200 seconds and the maximum AR order is 20, there are 8 change points, which occur at 1601, 2001, 7401, 7601, 7801, 8001, 8201 and 9601. Finally, the P- and S-wave arrival times are detected at times 1671 and 2045, respectively, using a precise detection algorithm.
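A simplified Python sketch of two of the ingredients mentioned here, fitting autoregressive models to successive sub-intervals and comparing their AIC values, on synthetic data; it is not Kitagawa's Householder-based procedure, and the series, spans and candidate orders are illustrative assumptions.

```python
# Hedged sketch: Gaussian AIC of least-squares AR(p) fits on successive
# sub-intervals of a synthetic series; a change in the preferred model
# hints at a change point.  Not Kitagawa's Householder-based algorithm.
import numpy as np

def ar_aic(x, p):
    # Lagged design matrix and least-squares AR(p) fit.
    X = np.column_stack([x[p - k - 1:-k - 1] for k in range(p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ coef) ** 2)
    return len(y) * np.log(sigma2) + 2 * (p + 1)

rng = np.random.default_rng(0)
quiet = rng.normal(0, 1.0, 600)                      # noise-like segment
event = np.zeros(600)
for t in range(2, 600):                              # stronger AR(2) structure after onset
    event[t] = 1.6 * event[t - 1] - 0.8 * event[t - 2] + rng.normal(0, 1.0)
series = np.concatenate([quiet, event])

span = 200                                           # basic span of the subinterval
for start in range(0, len(series) - span + 1, span):
    seg = series[start:start + span]
    aics = {p: ar_aic(seg, p) for p in (1, 2, 5)}
    best = min(aics, key=aics.get)
    print(f"t={start:4d}-{start + span:4d}  best AR order={best}  AIC={aics[best]:.1f}")
```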
ELF: An Extended-Lagrangian Free Energy Calculation Module for Multiple Molecular Dynamics Engines.
Chen, Haochuan; Fu, Haohao; Shao, Xueguang; Chipot, Christophe; Cai, Wensheng
2018-06-18
Extended adaptive biasing force (eABF), a collective variable (CV)-based importance-sampling algorithm, has proven to be very robust and efficient compared with the original ABF algorithm. Its implementation in Colvars, a software addition to molecular dynamics (MD) engines, is, however, currently limited to NAMD and LAMMPS. To broaden the scope of eABF and its variants, like its generalized form (egABF), and make them available to other MD engines, e.g., GROMACS, AMBER, CP2K, and openMM, we present a PLUMED-based implementation, called extended-Lagrangian free energy calculation (ELF). This implementation can be used as a stand-alone gradient estimator for other CV-based sampling algorithms, such as temperature-accelerated MD (TAMD) and extended-Lagrangian metadynamics (MtD). ELF provides the end user with a convenient framework to help select the best-suited importance-sampling algorithm for a given application without any commitment to a particular MD engine.
Water Quality Monitoring for Lake Constance with a Physically Based Algorithm for MERIS Data.
Odermatt, Daniel; Heege, Thomas; Nieke, Jens; Kneubühler, Mathias; Itten, Klaus
2008-08-05
A physically based algorithm is used for automatic processing of MERIS level 1B full resolution data. The algorithm is originally used with input variables for optimization with different sensors (i.e. channel recalibration and weighting), aquatic regions (i.e. specific inherent optical properties) or atmospheric conditions (i.e. aerosol models). For operational use, however, a lake-specific parameterization is required, representing an approximation of the spatio-temporal variation in atmospheric and hydrooptic conditions, and accounting for sensor properties. The algorithm performs atmospheric correction with a LUT for at-sensor radiance, and a downhill simplex inversion of chl-a, sm and y from subsurface irradiance reflectance. These outputs are enhanced by a selective filter, which makes use of the retrieval residuals. Regular chl-a sampling measurements by the Lake's protection authority coinciding with MERIS acquisitions were used for parameterization, training and validation.
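The downhill simplex inversion step can be illustrated with a toy forward model and SciPy's Nelder-Mead implementation; the reflectance function, parameters and noise level below are made-up placeholders, not the physically based model, its LUT or the MERIS bands.

```python
# Hedged sketch of downhill-simplex (Nelder-Mead) inversion: recover three
# water-constituent-like parameters from a made-up subsurface reflectance
# model.  The forward model is a placeholder, not the physical model.
import numpy as np
from scipy.optimize import minimize

wavelengths = np.linspace(400, 750, 40)               # nm, illustrative bands

def forward(params, wl):
    chl, sm, y = params                               # toy absorption/scattering mix
    return (0.05 * np.exp(-chl * (wl - 440) ** 2 / 2e4)
            + 0.002 * sm - 0.0005 * y * (wl / 500))

true = np.array([1.2, 3.0, 0.8])
rng = np.random.default_rng(1)
measured = forward(true, wavelengths) + rng.normal(0, 1e-4, wavelengths.size)

def misfit(params):
    return np.sum((forward(params, wavelengths) - measured) ** 2)

fit = minimize(misfit, x0=[0.5, 1.0, 0.5], method="Nelder-Mead")
print("retrieved parameters:", np.round(fit.x, 3))
```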
Automated selected reaction monitoring software for accurate label-free protein quantification.
Teleman, Johan; Karlsson, Christofer; Waldemarson, Sofia; Hansson, Karin; James, Peter; Malmström, Johan; Levander, Fredrik
2012-07-06
Selected reaction monitoring (SRM) is a mass spectrometry method with documented ability to quantify proteins accurately and reproducibly using labeled reference peptides. However, the use of labeled reference peptides becomes impractical if large numbers of peptides are targeted and when high flexibility is desired when selecting peptides. We have developed a label-free quantitative SRM workflow that relies on a new automated algorithm, Anubis, for accurate peak detection. Anubis efficiently removes interfering signals from contaminating peptides to estimate the true signal of the targeted peptides. We evaluated the algorithm on a published multisite data set and achieved results in line with manual data analysis. In complex peptide mixtures from whole proteome digests of Streptococcus pyogenes we achieved a technical variability across the entire proteome abundance range of 6.5-19.2%, which was considerably below the total variation across biological samples. Our results show that the label-free SRM workflow with automated data analysis is feasible for large-scale biological studies, opening up new possibilities for quantitative proteomics and systems biology.
Feature Selection for Nonstationary Data: Application to Human Recognition Using Medical Biometrics.
Komeili, Majid; Louis, Wael; Armanfard, Narges; Hatzinakos, Dimitrios
2018-05-01
Electrocardiogram (ECG) and transient evoked otoacoustic emission (TEOAE) are among the physiological signals that have attracted significant interest in the biometric community due to their inherent robustness to replay and falsification attacks. However, they are time-dependent signals, and this makes them hard to deal with in an across-session human recognition scenario where only one session is available for enrollment. This paper presents a novel feature selection method to address this issue. It is based on an auxiliary dataset with multiple sessions, from which it selects a subset of features that are more persistent across different sessions. It uses local information in terms of sample margins while enforcing an across-session measure. This makes it a perfect fit for the aforementioned biometric recognition problem. Comprehensive experiments on ECG and TEOAE variability due to time lapse and body posture are conducted. The performance of the proposed method is compared against seven state-of-the-art feature selection algorithms as well as six other approaches in the area of ECG and TEOAE biometric recognition. Experimental results demonstrate that the proposed method performs noticeably better than the other algorithms.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Popov, Milen; Sotiriadis, Charalampos; Gay, Frederique
Purpose: To report our experience using a multilevel patient management algorithm to direct transarterial embolization (TAE) in managing spontaneous intramuscular hematoma (SIMH). Materials and Methods: From May 2006 to January 2014, twenty-seven patients with SIMH had been referred for TAE to our Radiology department. Clinical status and coagulation characteristics of the patients are analyzed. An algorithm integrating CT findings is suggested to manage SIMH. Patients were classified into three groups: Type I, SIMH with no active bleeding (AB); Type II, SIMH with AB and no muscular fascia rupture (MFR); and Type III, SIMH with MFR and AB. Type II is furthermore subcategorized as IIa, IIb and IIc. Types IIb, IIc and III were considered for TAE. The method of embolization as well as the material used are described. Continuous variables are presented as mean ± SD. Categorical variables are reported as percentages. Technical success, clinical success, complications and 30-day mortality (d30 M) were analyzed. Results: Two patients (7.5%) had Type IIb, four (15%) Type IIc and 21 (77.5%) presented Type III. The detailed CT and CTA findings, embolization procedure and materials used are described. Technical success was 96% with a complication rate of 4%. Clinical success was 88%. The bleeding-related thirty-day mortality was 15% (all with Type III). Conclusion: TAE is a safe and efficient technique to control bleeding that should be considered in selected SIMH as soon as possible. The proposed algorithm integrating CT features provides a comprehensive chart to select patients for TAE. Level of Evidence: 4.
NASA Technical Reports Server (NTRS)
McClain, Charles R.; Signorini, Sergio
2002-01-01
Sensitivity analyses of sea-air CO2 flux to gas transfer algorithms, climatological wind speeds, sea surface temperatures (SST) and salinity (SSS) were conducted for the global oceans and selected regional domains. Large uncertainties in the global sea-air flux estimates are identified due to different gas transfer algorithms, global climatological wind speeds, and seasonal SST and SSS data. The global sea-air flux ranges from -0.57 to -2.27 Gt/yr, depending on the combination of gas transfer algorithms and global climatological wind speeds used. Different combinations of global SST and SSS fields resulted in changes as large as 35% in the global sea-air flux. An error as small as plus or minus 0.2 in SSS translates into a plus or minus 43% deviation in the mean global CO2 flux. This result emphasizes the need for highly accurate satellite SSS observations for the development of remote sensing sea-air flux algorithms.
A new clustering algorithm applicable to multispectral and polarimetric SAR images
NASA Technical Reports Server (NTRS)
Wong, Yiu-Fai; Posner, Edward C.
1993-01-01
We describe an application of a scale-space clustering algorithm to the classification of a multispectral and polarimetric SAR image of an agricultural site. After the initial polarimetric and radiometric calibration and noise cancellation, we extracted a 12-dimensional feature vector for each pixel from the scattering matrix. The clustering algorithm was able to partition a set of unlabeled feature vectors from 13 selected sites, each site corresponding to a distinct crop, into 13 clusters without any supervision. The cluster parameters were then used to classify the whole image. The classification map is much less noisy and more accurate than maps obtained by hierarchical rules. Starting with every point as a cluster, the algorithm works by melting the system to produce a tree of clusters in the scale space. It can cluster data in any multidimensional space and is insensitive to variability in cluster densities, sizes and ellipsoidal shapes. This algorithm, which is more powerful than existing ones, may be useful for remote sensing of land use.
Routing channels in VLSI layout
NASA Astrophysics Data System (ADS)
Cai, Hong
A number of algorithms for the automatic routing of interconnections in Very Large Scale Integration (VLSI) building-block layouts are presented. Algorithms for the topological definition of channels, the global routing and the geometrical definition of channels are presented. In contrast to traditional approaches the definition and ordering of the channels is done after the global routing. This approach has the advantage that global routing information can be taken into account to select the optimal channel structure. A polynomial algorithm for the channel definition and ordering problem is presented. The existence of a conflict-free channel structure is guaranteed by enforcing a sliceable placement. Algorithms for finding the shortest connection path are described. A separate algorithm is developed for the power net routing, because the two power nets must be planarly routed with variable wire width. An integrated placement and routing system for generating building-block layout is briefly described. Some experimental results and design experiences in using the system are also presented. Very good results are obtained.
1991-09-01
Contents excerpt (page numbers omitted): ... an Experimental Design; Selection of Variables; Defining Measures of Effectiveness; Specification of Required Number of Replications; Modification of Scenario Files; Analysis of the Main Effects of a Two Level Factorial Design; Analysis of the Interaction Effects of a Two Level Factorial Design; Yate's Algorithm.
Salas, Eric Ariel L; Valdez, Raul; Michel, Stefan
2017-11-01
We modeled summer and winter habitat suitability of Marco Polo argali in the Pamir Mountains in southeastern Tajikistan using these statistical algorithms: Generalized Linear Model, Random Forest, Boosted Regression Tree, Maxent, and Multivariate Adaptive Regression Splines. Using sheep occurrence data collected from 2009 to 2015 and a set of selected habitat predictors, we produced summer and winter habitat suitability maps and determined the important habitat suitability predictors for both seasons. Our results demonstrated that argali selected proximity to riparian areas and greenness as the two most relevant variables for summer, and the degree of slope (gentler slopes between 0° to 20°) and Landsat temperature band for winter. The terrain roughness was also among the most important variables in summer and winter models. Aspect was only significant for winter habitat, with argali preferring south-facing mountain slopes. We evaluated various measures of model performance such as the Area Under the Curve (AUC) and the True Skill Statistic (TSS). Comparing the five algorithms, the AUC scored highest for Boosted Regression Tree in summer (AUC = 0.94) and winter model runs (AUC = 0.94). In contrast, Random Forest underperformed in both model runs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Filho, Faete J; Tolbert, Leon M; Ozpineci, Burak
2012-01-01
The work developed here proposes a methodology for calculating switching angles for varying DC sources in a multilevel cascaded H-bridges converter. In this approach the required fundamental is achieved, the lower harmonics are minimized, and the system can be implemented in real time with low memory requirements. Genetic algorithm (GA) is the stochastic search method used to find the solution for the set of equations where the input voltages are the known variables and the switching angles are the unknown variables. With the dataset generated by GA, an artificial neural network (ANN) is trained to store the solutions without excessive memory storage requirements. This trained ANN then senses the voltage of each cell and produces the switching angles in order to regulate the fundamental at 120 V and eliminate or minimize the low order harmonics while operating in real time.
Fuzzy support vector machines for adaptive Morse code recognition.
Yang, Cheng-Hong; Jin, Li-Cheng; Chuang, Li-Yeh
2006-11-01
Morse code is now being harnessed for use in rehabilitation applications of augmentative-alternative communication and assistive technology, facilitating mobility, environmental control and adapted worksite access. In this paper, Morse code is selected as a communication adaptive device for persons who suffer from muscle atrophy, cerebral palsy or other severe handicaps. A stable typing rate is strictly required for Morse code to be effective as a communication tool. Therefore, an adaptive automatic recognition method with a high recognition rate is needed. The proposed system uses both fuzzy support vector machines and the variable-degree variable-step-size least-mean-square algorithm to achieve these objectives. We apply fuzzy memberships to each point, and provide different contributions to the decision learning function for support vector machines. Statistical analyses demonstrated that the proposed method elicited a higher recognition rate than other algorithms in the literature.
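As a loose illustration of the adaptive element only, the sketch below tracks a drifting "dot" duration with an LMS-style update whose step size varies with the error magnitude; the specific variable-degree variable-step-size rule of the paper and the fuzzy SVM classifier are not reproduced, and the data are synthetic.

```python
# Hedged sketch: a one-parameter LMS-style tracker with a variable step size
# following a drifting dot duration; a loose stand-in for the paper's
# variable-degree variable-step-size LMS, on synthetic timing data.
import numpy as np

rng = np.random.default_rng(0)
n = 400
true_dot = 0.10 + 0.05 * np.linspace(0, 1, n)          # typing slows over time (seconds)
observed = true_dot + rng.normal(0, 0.01, n)           # noisy observed dot durations

estimate = 0.2                                         # initial guess of dot duration
step_min, step_max = 0.02, 0.5
for x in observed:
    err = x - estimate
    # Variable step: large errors adapt quickly (bounded), small errors smooth.
    step = np.clip(abs(err) / 0.05, step_min, step_max)
    estimate += step * err                             # LMS-style update

print("final estimate: %.3f s  (true final dot: %.3f s)" % (estimate, true_dot[-1]))
```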
McParland, D; Phillips, C M; Brennan, L; Roche, H M; Gormley, I C
2017-12-10
The LIPGENE-SU.VI.MAX study, like many others, recorded high-dimensional continuous phenotypic data and categorical genotypic data. LIPGENE-SU.VI.MAX focuses on the need to account for both phenotypic and genetic factors when studying the metabolic syndrome (MetS), a complex disorder that can lead to higher risk of type 2 diabetes and cardiovascular disease. Interest lies in clustering the LIPGENE-SU.VI.MAX participants into homogeneous groups or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high dimensional, mixed data is developed to cluster LIPGENE-SU.VI.MAX participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated, estimation is via a Gibbs sampling algorithm and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes ('healthy' and 'at risk') are uncovered. A small subset of variables is deemed discriminatory, which notably includes phenotypic and genotypic variables, highlighting the need to jointly consider both factors. Further, 7 years after the LIPGENE-SU.VI.MAX data were collected, participants underwent further analysis to diagnose presence or absence of the MetS. The two uncovered sub-phenotypes strongly correspond to the 7-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to define the uncertainty in sub-phenotype membership at the participant level is synonymous with the concepts of precision medicine and nutrition. Copyright © 2017 John Wiley & Sons, Ltd.
Variability of ICA decomposition may impact EEG signals when used to remove eyeblink artifacts
PONTIFEX, MATTHEW B.; GWIZDALA, KATHRYN L.; PARKS, ANDREW C.; BILLINGER, MARTIN; BRUNNER, CLEMENS
2017-01-01
Despite the growing use of independent component analysis (ICA) algorithms for isolating and removing eyeblink-related activity from EEG data, we have limited understanding of how variability associated with ICA uncertainty may be influencing the reconstructed EEG signal after removing the eyeblink artifact components. To characterize the magnitude of this ICA uncertainty and to understand the extent to which it may influence findings within ERP and EEG investigations, ICA decompositions of EEG data from 32 college-aged young adults were repeated 30 times for three popular ICA algorithms. Following each decomposition, eyeblink components were identified and removed. The remaining components were back-projected, and the resulting clean EEG data were further used to analyze ERPs. Findings revealed that ICA uncertainty results in variation in P3 amplitude as well as variation across all EEG sampling points, but differs across ICA algorithms as a function of the spatial location of the EEG channel. This investigation highlights the potential of ICA uncertainty to introduce additional sources of variance when the data are back-projected without artifact components. Careful selection of ICA algorithms and parameters can reduce the extent to which ICA uncertainty may introduce an additional source of variance within ERP/EEG studies. PMID:28026876
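A minimal Python sketch of the decompose, remove and back-project step using scikit-learn's FastICA on synthetic channels; identifying the blink component by its peak amplitude is a crude heuristic assumed here for illustration, not the identification procedure or the EEG data used in the study.

```python
# Hedged sketch: ICA decomposition of synthetic "channels", removal of a
# blink-like component (picked by a crude peak-amplitude heuristic), and
# back-projection of the remaining components.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
t = np.linspace(0, 10, 2000)
neural = np.sin(2 * np.pi * 10 * t) + 0.3 * rng.normal(size=t.size)  # ongoing activity
blink = np.zeros_like(t)
blink[::400] = 8.0                                                    # sparse large blinks
blink = np.convolve(blink, np.hanning(80), mode="same")

A = rng.normal(size=(6, 2))                       # mixing into 6 channels
X = np.column_stack([neural, blink]) @ A.T

ica = FastICA(n_components=2, random_state=0)
S = ica.fit_transform(X)                          # estimated source activations
blink_idx = np.argmax(np.max(np.abs(S), axis=0))  # heuristic: spikiest source = blink
S_clean = S.copy()
S_clean[:, blink_idx] = 0.0                       # zero out the artifact component
X_clean = S_clean @ ica.mixing_.T + ica.mean_     # back-project without the blink

print("per-channel peak amplitude before:", np.round(np.abs(X).max(axis=0), 2))
print("per-channel peak amplitude after: ", np.round(np.abs(X_clean).max(axis=0), 2))
```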
Ahmadi, Mehdi; Shahlaei, Mohsen
2015-01-01
P2X7 antagonist activity for a set of 49 molecules of the P2X7 receptor antagonists, derivatives of purine, was modeled with the aid of chemometric and artificial intelligence techniques. The activity of these compounds was estimated by means of combination of principal component analysis (PCA), as a well-known data reduction method, genetic algorithm (GA), as a variable selection technique, and artificial neural network (ANN), as a non-linear modeling method. First, a linear regression, combined with PCA, (principal component regression) was operated to model the structure–activity relationships, and afterwards a combination of PCA and ANN algorithm was employed to accurately predict the biological activity of the P2X7 antagonist. PCA preserves as much of the information as possible contained in the original data set. Seven most important PC's to the studied activity were selected as the inputs of ANN box by an efficient variable selection method, GA. The best computational neural network model was a fully-connected, feed-forward model with 7−7−1 architecture. The developed ANN model was fully evaluated by different validation techniques, including internal and external validation, and chemical applicability domain. All validations showed that the constructed quantitative structure–activity relationship model suggested is robust and satisfactory. PMID:26600858
[Measurement of Water COD Based on UV-Vis Spectroscopy Technology].
Wang, Xiao-ming; Zhang, Hai-liang; Luo, Wei; Liu, Xue-mei
2016-01-01
Ultraviolet/visible (UV/Vis) spectroscopy technology was used to measure water COD. A total of 135 water samples were collected from Zhejiang province. Raw spectra processed with 3 different pretreatment methods (Multiplicative Scatter Correction (MSC), Standard Normal Variate (SNV) and 1st Derivative) were compared to determine the optimal pretreatment method for analysis. Spectral variable selection is an important strategy in spectrum modeling analysis, because it tends toward a parsimonious data representation and can lead to multivariate models with better performance. In order to simplify the calibration models, the preprocessed spectra were then used to select sensitive wavelengths by competitive adaptive reweighted sampling (CARS), Random frog and Successive Genetic Algorithm (GA) methods. Different numbers of sensitive wavelengths were selected by the different variable selection methods with the SNV preprocessing method. Partial least squares (PLS) was used to build models with the full spectra, and Extreme Learning Machine (ELM) was applied to build models with the selected wavelength variables. The overall results showed that the ELM model performed better than the PLS model, and the ELM model with the wavelengths selected by CARS obtained the best results, with a determination coefficient (R2), RMSEP and RPD of 0.82, 14.48 and 2.34, respectively, for the prediction set. The results indicated that it is feasible to use UV/Vis spectroscopy with characteristic wavelengths obtained by the CARS variable selection method, combined with ELM calibration, for the rapid and accurate determination of COD in aquaculture water. Moreover, this study laid the foundation for further implementation of online analysis of aquaculture water and rapid determination of other water quality parameters.
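A hedged Python sketch of the modeling chain in miniature, contrasting a full-spectrum PLS model with a model restricted to a few "selected" wavelengths, on synthetic spectra; the simple correlation filter below is only a stand-in for CARS, and ELM is replaced by ordinary linear regression.

```python
# Hedged sketch: full-spectrum PLS vs. a model built on a handful of
# "selected" wavelengths.  Synthetic spectra; the correlation filter is a
# crude stand-in for CARS, and ELM is not reproduced.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_samples, n_wl = 135, 300
X = rng.normal(size=(n_samples, n_wl)).cumsum(axis=1)       # smooth spectrum-like curves
informative = [40, 120, 210]                                # assumed COD-related bands
y = X[:, informative] @ np.array([0.8, -0.5, 0.6]) + rng.normal(0, 0.5, n_samples)

full_pls = PLSRegression(n_components=8)
print("full-spectrum PLS R^2:",
      round(cross_val_score(full_pls, X, y, cv=5, scoring="r2").mean(), 3))

corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_wl)])
selected = np.argsort(corr)[-10:]                           # keep the 10 most correlated bands
print("selected-wavelength model R^2:",
      round(cross_val_score(LinearRegression(), X[:, selected], y, cv=5,
                            scoring="r2").mean(), 3))
```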
NASA Technical Reports Server (NTRS)
Fitzjerrell, D. G.; Grounds, D. J.; Leonard, J. I.
1975-01-01
Using a whole body algorithm simulation model, a wide variety and large number of stresses as well as different stress levels were simulated including environmental disturbances, metabolic changes, and special experimental situations. Simulation of short term stresses resulted in simultaneous and integrated responses from the cardiovascular, respiratory, and thermoregulatory subsystems and the accuracy of a large number of responding variables was verified. The capability of simulating significantly longer responses was demonstrated by validating a four week bed rest study. In this case, the long term subsystem model was found to reproduce many experimentally observed changes in circulatory dynamics, body fluid-electrolyte regulation, and renal function. The value of systems analysis and the selected design approach for developing a whole body algorithm was demonstrated.
Efficient robust doubly adaptive regularized regression with applications.
Karunamuni, Rohana J; Kong, Linglong; Tu, Wei
2018-01-01
We consider the problem of estimation and variable selection for general linear regression models. Regularized regression procedures have been widely used for variable selection, but most existing methods perform poorly in the presence of outliers. We construct a new penalized procedure that simultaneously attains full efficiency and maximum robustness. Furthermore, the proposed procedure satisfies the oracle properties. The new procedure is designed to achieve sparse and robust solutions by imposing adaptive weights on both the decision loss and the penalty function. The proposed method of estimation and variable selection attains full efficiency when the model is correct and, at the same time, achieves maximum robustness when outliers are present. We examine the robustness properties using the finite-sample breakdown point and an influence function. We show that the proposed estimator attains the maximum breakdown point. Furthermore, there is no loss in efficiency when there are no outliers or the error distribution is normal. For practical implementation of the proposed method, we present a computational algorithm. We examine the finite-sample and robustness properties using Monte Carlo studies. Two datasets are also analyzed.
Zhang, Yan; Zou, Hong-Yan; Shi, Pei; Yang, Qin; Tang, Li-Juan; Jiang, Jian-Hui; Wu, Hai-Long; Yu, Ru-Qin
2016-01-01
Determination of benzo[a]pyrene (BaP) in cigarette smoke can be very important for tobacco quality control and the assessment of its harm to human health. In this study, mid-infrared spectroscopy (MIR) coupled with a chemometric algorithm (DPSO-WPT-PLS), which was based on the wavelet packet transform (WPT), the discrete particle swarm optimization algorithm (DPSO) and partial least squares regression (PLS), was used to quantify the harmful ingredient benzo[a]pyrene in cigarette mainstream smoke with promising results. Furthermore, the proposed method provided better performance compared to several other chemometric models, i.e., PLS, radial basis function-based PLS (RBF-PLS), PLS with stepwise regression variable selection (Stepwise-PLS) as well as WPT-PLS with informative wavelet coefficients selected by the correlation coefficient test (rtest-WPT-PLS). It can be expected that the proposed strategy could become a new, effective and rapid quantitative analysis technique for analyzing the harmful ingredient BaP in cigarette mainstream smoke. Copyright © 2015 Elsevier B.V. All rights reserved.
Automatic attention-based prioritization of unconstrained video for compression
NASA Astrophysics Data System (ADS)
Itti, Laurent
2004-06-01
We apply a biologically-motivated algorithm that selects visually-salient regions of interest in video streams to multiply-foveated video compression. Regions of high encoding priority are selected based on nonlinear integration of low-level visual cues, mimicking processing in primate occipital and posterior parietal cortex. A dynamic foveation filter then blurs (foveates) every frame, increasingly with distance from high-priority regions. Two variants of the model (one with continuously-variable blur proportional to saliency at every pixel, and the other with blur proportional to distance from three independent foveation centers) are validated against eye fixations from 4-6 human observers on 50 video clips (synthetic stimuli, video games, outdoors day and night home video, television newscast, sports, talk-shows, etc.). Significant overlap is found between human and algorithmic foveations on every clip with one variant, and on 48 out of 50 clips with the other. Substantial reductions in compressed file size, by a factor of about 0.5 on average, are obtained for foveated compared to unfoveated clips. These results suggest a general-purpose usefulness of the algorithm in improving compression ratios of unconstrained video.
Modeling multilayer x-ray reflectivity using genetic algorithms
NASA Astrophysics Data System (ADS)
Sánchez del Río, M.; Pareschi, G.; Michetschläger, C.
2000-06-01
The x-ray reflectivity of a multilayer is a non-linear function of many parameters (materials, layer thickness, density, roughness). Non-linear fitting of experimental data with simulations requires the use of initial values sufficiently close to the optimum value. This is a difficult task when the topology of the space of the variables is highly structured. We apply global optimization methods to fit multilayer reflectivity. Genetic algorithms are stochastic methods based on the model of natural evolution: the improvement of a population along successive generations. A complete set of initial parameters constitutes an individual. The population is a collection of individuals. Each generation is built from the parent generation by applying some operators (selection, crossover, mutation, etc.) to the members of the parent generation. The pressure of selection drives the population to include "good" individuals. For a large number of generations, the best individuals will approximate the optimum parameters. Some results on fitting experimental hard x-ray reflectivity data for Ni/C and W/Si multilayers using genetic algorithms are presented. This method can also be applied to design multilayers optimized for a target application.
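A compact, generic real-coded genetic algorithm sketch fitting two parameters of a toy "reflectivity-like" curve; the model, bounds, population size and operators are illustrative assumptions, not those used for the Ni/C and W/Si fits.

```python
# Hedged sketch: a generic real-coded GA (selection, blend crossover,
# Gaussian mutation) fitting two parameters of a toy oscillatory curve.
# The model and GA settings are illustrative only.
import numpy as np

rng = np.random.default_rng(4)
theta = np.linspace(0.5, 3.0, 120)                    # incidence angles (deg), illustrative

def model(params, th):
    d, sigma = params                                 # toy "period" and "roughness"
    return np.exp(-sigma * th) * (1 + 0.5 * np.cos(2 * np.pi * th * d))

target = model([2.0, 1.1], theta) + rng.normal(0, 0.01, theta.size)

def fitness(pop):
    return np.array([-np.sum((model(ind, theta) - target) ** 2) for ind in pop])

lo, hi = np.array([0.5, 0.1]), np.array([4.0, 3.0])   # parameter bounds
pop = rng.uniform(lo, hi, size=(60, 2))
for gen in range(80):
    order = np.argsort(fitness(pop))[::-1]
    parents = pop[order[:30]]                         # selection: keep the best half
    kids = []
    for _ in range(30):
        a, b = parents[rng.integers(30)], parents[rng.integers(30)]
        w = rng.uniform(size=2)
        child = w * a + (1 - w) * b                   # blend crossover
        child += rng.normal(0, 0.05, 2)               # Gaussian mutation
        kids.append(np.clip(child, lo, hi))
    pop = np.vstack([parents, kids])

best = pop[np.argmax(fitness(pop))]
print("best individual (d, sigma):", np.round(best, 3))
```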
Modeling and analysis of selected space station communications and tracking subsystems
NASA Technical Reports Server (NTRS)
Richmond, Elmer Raydean
1993-01-01
The Communications and Tracking System on board Space Station Freedom (SSF) provides space-to-ground, space-to-space, audio, and video communications, as well as tracking data reception and processing services. Each major category of service is provided by a communications subsystem which is controlled and monitored by software. Among these subsystems, the Assembly/Contingency Subsystem (ACS) and the Space-to-Ground Subsystem (SGS) provide communications with the ground via the Tracking and Data Relay Satellite (TDRS) System. The ACS is effectively SSF's command link, while the SGS is primarily intended as the data link for SSF payloads. The research activities of this project focused on the ACS and SGS antenna management algorithms identified in the Flight System Software Requirements (FSSR) documentation, including: (1) software modeling and evaluation of antenna management (positioning) algorithms; and (2) analysis and investigation of selected variables and parameters of these antenna management algorithms i.e., descriptions and definitions of ranges, scopes, and dimensions. In a related activity, to assist those responsible for monitoring the development of this flight system software, a brief summary of software metrics concepts, terms, measures, and uses was prepared.
Analysis of Sting Balance Calibration Data Using Optimized Regression Models
NASA Technical Reports Server (NTRS)
Ulbrich, N.; Bader, Jon B.
2010-01-01
Calibration data of a wind tunnel sting balance was processed using a candidate math model search algorithm that recommends an optimized regression model for the data analysis. During the calibration the normal force and the moment at the balance moment center were selected as independent calibration variables. The sting balance itself had two moment gages. Therefore, after analyzing the connection between calibration loads and gage outputs, it was decided to choose the difference and the sum of the gage outputs as the two responses that best describe the behavior of the balance. The math model search algorithm was applied to these two responses. An optimized regression model was obtained for each response. Classical strain gage balance load transformations and the equations of the deflection of a cantilever beam under load are used to show that the search algorithm's two optimized regression models are supported by a theoretical analysis of the relationship between the applied calibration loads and the measured gage outputs. The analysis of the sting balance calibration data set is a rare example of a situation when terms of a regression model of a balance can directly be derived from first principles of physics. In addition, it is interesting to note that the search algorithm recommended the correct regression model term combinations using only a set of statistical quality metrics that were applied to the experimental data during the algorithm's term selection process.
Design Optimization of a Centrifugal Fan with Splitter Blades
NASA Astrophysics Data System (ADS)
Heo, Man-Woong; Kim, Jin-Hyuk; Kim, Kwang-Yong
2015-05-01
Multi-objective optimization of a centrifugal fan with additionally installed splitter blades was performed to simultaneously maximize the efficiency and pressure rise using three-dimensional Reynolds-averaged Navier-Stokes equations and a hybrid multi-objective evolutionary algorithm. Two design variables, defining the location of the splitter and the height ratio between the inlet and outlet of the impeller, were selected for the optimization. In addition, the aerodynamic characteristics of the centrifugal fan were investigated with the variation of the design variables in the design space. Latin hypercube sampling was used to select the training points, and response surface approximation models were constructed as surrogate models of the objective functions. With the optimization, both the efficiency and pressure rise of the centrifugal fan with splitter blades were improved considerably compared to the reference model.
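The sampling-plus-surrogate part of such a workflow can be sketched with a Latin hypercube design and a quadratic response surface; the "performance" function, variable bounds and sample size below are invented placeholders, not the CFD-evaluated fan objectives.

```python
# Hedged sketch: Latin hypercube sampling of two design variables followed
# by a quadratic response-surface surrogate.  The "performance" function is
# an invented placeholder for the CFD evaluations.
import numpy as np
from scipy.stats import qmc
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def performance(x):                                    # placeholder objective
    splitter_loc, height_ratio = x[:, 0], x[:, 1]
    return -(splitter_loc - 0.6) ** 2 - 2 * (height_ratio - 1.1) ** 2

sampler = qmc.LatinHypercube(d=2, seed=0)
unit = sampler.random(n=30)                            # 30 training designs in [0, 1]^2
X = qmc.scale(unit, l_bounds=[0.3, 0.8], u_bounds=[0.9, 1.4])
y = performance(X)

surrogate = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
surrogate.fit(X, y)

# Query the surrogate on a dense grid and report its predicted optimum.
g1, g2 = np.meshgrid(np.linspace(0.3, 0.9, 50), np.linspace(0.8, 1.4, 50))
grid = np.column_stack([g1.ravel(), g2.ravel()])
best = grid[np.argmax(surrogate.predict(grid))]
print("surrogate optimum (splitter location, height ratio):", np.round(best, 3))
```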
Liu, Xiaona; Zhang, Qiao; Wu, Zhisheng; Shi, Xinyuan; Zhao, Na; Qiao, Yanjiang
2015-01-01
Laser-induced breakdown spectroscopy (LIBS) was applied to perform a rapid elemental analysis and provenance study of Blumea balsamifera DC. Principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA) were implemented to exploit the multivariate nature of the LIBS data. Scores and loadings of computed principal components visually illustrated the differing spectral data. The PLS-DA algorithm showed good classification performance. The PLS-DA model using complete spectra as input variables had similar discrimination performance to using selected spectral lines as input variables. The down-selection of spectral lines was specifically focused on the major elements of B. balsamifera samples. Results indicated that LIBS could be used to rapidly analyze elements and to perform provenance study of B. balsamifera. PMID:25558999
NASA Astrophysics Data System (ADS)
Shan, Jiajia; Wang, Xue; Zhou, Hao; Han, Shuqing; Riza, Dimas Firmanda Al; Kondo, Naoshi
2018-04-01
Synchronous fluorescence spectra, combined with multivariate analysis were used to predict flavonoids content in green tea rapidly and nondestructively. This paper presented a new and efficient spectral intervals selection method called clustering based partial least square (CL-PLS), which selected informative wavelengths by combining clustering concept and partial least square (PLS) methods to improve models’ performance by synchronous fluorescence spectra. The fluorescence spectra of tea samples were obtained and k-means and kohonen-self organizing map clustering algorithms were carried out to cluster full spectra into several clusters, and sub-PLS regression model was developed on each cluster. Finally, CL-PLS models consisting of gradually selected clusters were built. Correlation coefficient (R) was used to evaluate the effect on prediction performance of PLS models. In addition, variable influence on projection partial least square (VIP-PLS), selectivity ratio partial least square (SR-PLS), interval partial least square (iPLS) models and full spectra PLS model were investigated and the results were compared. The results showed that CL-PLS presented the best result for flavonoids prediction using synchronous fluorescence spectra.
A Cancer Gene Selection Algorithm Based on the K-S Test and CFS.
Su, Qiang; Wang, Yina; Jiang, Xiaobing; Chen, Fuxue; Lu, Wen-Cong
2017-01-01
To address the challenging problem of selecting distinguished genes from cancer gene expression datasets, this paper presents a gene subset selection algorithm based on the Kolmogorov-Smirnov (K-S) test and correlation-based feature selection (CFS) principles. The algorithm selects distinguished genes first using the K-S test, and then, it uses CFS to select genes from those selected by the K-S test. We adopted support vector machines (SVM) as the classification tool and used the criteria of accuracy to evaluate the performance of the classifiers on the selected gene subsets. This approach compared the proposed gene subset selection algorithm with the K-S test, CFS, minimum-redundancy maximum-relevancy (mRMR), and ReliefF algorithms. The average experimental results of the aforementioned gene selection algorithms for 5 gene expression datasets demonstrate that, based on accuracy, the performance of the new K-S and CFS-based algorithm is better than those of the K-S test, CFS, mRMR, and ReliefF algorithms. The experimental results show that the K-S test-CFS gene selection algorithm is a very effective and promising approach compared to the K-S test, CFS, mRMR, and ReliefF algorithms.
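A small Python sketch of the first stage (K-S screening) followed by an SVM accuracy check, on synthetic expression data; the CFS stage is omitted, and the sample sizes and the 0.01 p-value threshold are illustrative assumptions.

```python
# Hedged sketch: Kolmogorov-Smirnov screening of genes followed by a
# cross-validated SVM accuracy check on synthetic expression data.  The CFS
# stage is omitted and the p-value threshold is an assumption.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(5)
n_genes, n_per_class = 2000, 40
X0 = rng.normal(0.0, 1.0, (n_per_class, n_genes))           # class 0 samples
X1 = rng.normal(0.0, 1.0, (n_per_class, n_genes))           # class 1 samples
X1[:, :25] += 1.0                                           # 25 truly differential genes
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

pvals = np.array([ks_2samp(X0[:, j], X1[:, j]).pvalue for j in range(n_genes)])
keep = np.flatnonzero(pvals < 0.01)                         # K-S screening step
print("genes kept by the K-S test:", keep.size)

acc = cross_val_score(SVC(kernel="linear"), X[:, keep], y, cv=5, scoring="accuracy")
print("SVM accuracy on the selected genes: %.3f" % acc.mean())
```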
Fast local reconstruction by selective backprojection for low dose in dental computed tomography
NASA Astrophysics Data System (ADS)
Yan, Bin; Deng, Lin; Han, Yu; Zhang, Feng; Wang, Xian-Chao; Li, Lei
2014-10-01
The high radiation dose in computed tomography (CT) scans increases the lifetime risk of cancer, which becomes a major clinical concern. The backprojection-filtration (BPF) algorithm could reduce the radiation dose by reconstructing the images from truncated data in a short scan. In a dental CT, it could reduce the radiation dose for the teeth by using the projection acquired in a short scan, and could avoid irradiation to the other part by using truncated projection. However, the limit of integration for backprojection varies per PI-line, resulting in low calculation efficiency and poor parallel performance. Recently, a tent BPF has been proposed to improve the calculation efficiency by rearranging the projection. However, the memory-consuming data rebinning process is included. Accordingly, the selective BPF (S-BPF) algorithm is proposed in this paper. In this algorithm, the derivative of the projection is backprojected to the points whose x coordinate is less than that of the source focal spot to obtain the differentiated backprojection. The finite Hilbert inverse is then applied to each PI-line segment. S-BPF avoids the influence of the variable limit of integration by selective backprojection without additional time cost or memory cost. The simulation experiment and the real experiment demonstrated the higher reconstruction efficiency of S-BPF.
NASA Astrophysics Data System (ADS)
Friedel, Michael; Buscema, Massimo
2016-04-01
Aquatic ecosystem models can potentially be used to understand the influence of stresses on catchment resource quality. Given that catchment responses are functions of natural and anthropogenic stresses reflected in sparse and spatiotemporal biological, physical, and chemical measurements, an ecosystem is difficult to model using statistical or numerical methods. We propose an artificial adaptive systems approach to model ecosystems. First, an unsupervised machine-learning (ML) network is trained using the set of available sparse and disparate data variables. Second, an evolutionary algorithm with genetic doping is applied to reduce the number of ecosystem variables to an optimal set. Third, the optimal set of ecosystem variables is used to retrain the ML network. Fourth, a stochastic cross-validation approach is applied to quantify and compare the nonlinear uncertainty in selected predictions of the original and reduced models. Results are presented for aquatic ecosystems (tens of thousands of square kilometers) undergoing landscape change in the USA (Upper Illinois River Basin and Central Colorado Assessment Project Area) and in the Southland region, NZ.
NASA Astrophysics Data System (ADS)
Yu, Huiling; Liang, Hao; Lin, Xue; Zhang, Yizhuo
2018-04-01
A nondestructive methodology is proposed to determine the modulus of elasticity (MOE) of Fraxinus mandschurica samples by using near-infrared (NIR) spectroscopy. The test data consisted of 150 NIR absorption spectra of the wood samples obtained using an NIR spectrometer, with the wavelength range of 900 to 1900 nm. To eliminate the high-frequency noise and the systematic variations in the baseline, Savitzky-Golay convolution combined with standard normal variate and detrending transformation was applied as the data pretreatment method. Uninformative variable elimination (UVE), improved by the evolutionary Monte Carlo (EMC) algorithm, and the successive projections algorithm (SPA) selected three characteristic variables from the full 117 variables. The predictive ability of the models was evaluated in terms of the root-mean-square error of prediction (RMSEP) and the coefficient of determination (Rp2) in the prediction set. In comparison with the predicted results of all the models established in the experiments, UVE-EMC-SPA-LS-SVM presented the best results, with the smallest RMSEP of 0.652 and the highest Rp2 of 0.887. Thus, it is feasible to determine the MOE of F. mandschurica accurately using NIR spectroscopy.
NASA Astrophysics Data System (ADS)
Zhao, Jianhua; Zeng, Haishan; Kalia, Sunil; Lui, Harvey
2017-02-01
Background: Raman spectroscopy is a non-invasive optical technique which can measure molecular vibrational modes within tissue. A large-scale clinical study (n = 518) has demonstrated that real-time Raman spectroscopy could distinguish malignant from benign skin lesions with good diagnostic accuracy; this was validated by a follow-up independent study (n = 127). Objective: Most of the previous diagnostic algorithms have typically been based on analyzing the full band of the Raman spectra, either in the fingerprint or high wavenumber regions. Our objective in this presentation is to explore wavenumber selection based analysis in Raman spectroscopy for skin cancer diagnosis. Methods: A wavenumber selection algorithm was implemented using variably-sized wavenumber windows, which were determined by the correlation coefficient between wavenumbers. Wavenumber windows were chosen based on accumulated frequency from leave-one-out cross-validated stepwise regression or least and shrinkage selection operator (LASSO). The diagnostic algorithms were then generated from the selected wavenumber windows using multivariate statistical analyses, including principal component and general discriminant analysis (PC-GDA) and partial least squares (PLS). A total cohort of 645 confirmed lesions from 573 patients encompassing skin cancers, precancers and benign skin lesions were included. Lesion measurements were divided into training cohort (n = 518) and testing cohort (n = 127) according to the measurement time. Result: The area under the receiver operating characteristic curve (ROC) improved from 0.861-0.891 to 0.891-0.911 and the diagnostic specificity for sensitivity levels of 0.99-0.90 increased respectively from 0.17-0.65 to 0.20-0.75 by selecting specific wavenumber windows for analysis. Conclusion: Wavenumber selection based analysis in Raman spectroscopy improves skin cancer diagnostic specificity at high sensitivity levels.
NASA Technical Reports Server (NTRS)
Seldner, K.
1976-01-01
The development of control systems for jet engines requires a real-time computer simulation. The simulation provides an effective tool for evaluating control concepts and problem areas prior to actual engine testing. The development and use of a real-time simulation of the Pratt and Whitney F100-PW100 turbofan engine is described. The simulation was used in a multi-variable optimal controls research program using linear quadratic regulator theory. The simulation is used to generate linear engine models at selected operating points and to evaluate the control algorithm. To reduce the complexity of the design, it is desirable to reduce the order of the linear model. A technique to reduce the order of the model is discussed. Selected results from high- and low-order models are compared. The LQR control algorithms can be programmed on a digital computer. This computer will control the engine simulation over the desired flight envelope.
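The LQR design step itself can be shown in a few lines; the A, B, Q and R matrices below are invented placeholders for a low-order linear model, not the F100 engine model generated by the simulation.

```python
# Hedged sketch: LQR gain computation for an invented 2-state linear model;
# A, B, Q, R are placeholders, not the F100 engine model.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[-1.0, 0.5],
              [0.0, -2.0]])          # placeholder low-order dynamics
B = np.array([[0.0],
              [1.0]])                # single control input
Q = np.diag([10.0, 1.0])             # state weighting
R = np.array([[0.1]])                # control weighting

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)      # optimal state-feedback gain, u = -K x
print("LQR gain K:", np.round(K, 3))
print("closed-loop eigenvalues:", np.round(np.linalg.eigvals(A - B @ K), 3))
```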
Welch, Catherine A; Petersen, Irene; Bartlett, Jonathan W; White, Ian R; Marston, Louise; Morris, Richard W; Nazareth, Irwin; Walters, Kate; Carpenter, James
2014-01-01
Most implementations of multiple imputation (MI) of missing data are designed for simple rectangular data structures ignoring temporal ordering of data. Therefore, when applying MI to longitudinal data with intermittent patterns of missing data, some alternative strategies must be considered. One approach is to divide data into time blocks and implement MI independently at each block. An alternative approach is to include all time blocks in the same MI model. With increasing numbers of time blocks, this approach is likely to break down because of co-linearity and over-fitting. The new two-fold fully conditional specification (FCS) MI algorithm addresses these issues, by only conditioning on measurements, which are local in time. We describe and report the results of a novel simulation study to critically evaluate the two-fold FCS algorithm and its suitability for imputation of longitudinal electronic health records. After generating a full data set, approximately 70% of selected continuous and categorical variables were made missing completely at random in each of ten time blocks. Subsequently, we applied a simple time-to-event model. We compared efficiency of estimated coefficients from a complete records analysis, MI of data in the baseline time block and the two-fold FCS algorithm. The results show that the two-fold FCS algorithm maximises the use of data available, with the gain relative to baseline MI depending on the strength of correlations within and between variables. Using this approach also increases plausibility of the missing at random assumption by using repeated measures over time of variables whose baseline values may be missing. PMID:24782349
An Evaluation of a Flight Deck Interval Management Algorithm Including Delayed Target Trajectories
NASA Technical Reports Server (NTRS)
Swieringa, Kurt A.; Underwood, Matthew C.; Barmore, Bryan; Leonard, Robert D.
2014-01-01
NASA's first Air Traffic Management (ATM) Technology Demonstration (ATD-1) was created to facilitate the transition of mature air traffic management technologies from the laboratory to operational use. The technologies selected for demonstration are the Traffic Management Advisor with Terminal Metering (TMA-TM), which provides precise time-based scheduling in the terminal airspace; Controller Managed Spacing (CMS), which provides controllers with decision support tools enabling precise schedule conformance; and Interval Management (IM), which consists of flight deck automation that enables aircraft to achieve or maintain precise in-trail spacing. During high demand operations, TMA-TM may produce a schedule and corresponding aircraft trajectories that include delay to ensure that a particular aircraft will be properly spaced from other aircraft at each schedule waypoint. These delayed trajectories are not communicated to the automation onboard the aircraft, forcing the IM aircraft to use the published speeds to estimate the target aircraft's time of arrival. As a result, the aircraft performing IM operations may follow an aircraft whose TMA-TM generated trajectories have substantial speed deviations from the speeds expected by the spacing algorithm. Previous spacing algorithms were not designed to handle this magnitude of uncertainty. A simulation was conducted to examine a modified spacing algorithm with the ability to follow aircraft flying delayed trajectories. The simulation investigated the use of the new spacing algorithm with various delayed speed profiles and wind conditions, as well as several other variables designed to simulate real-life variability. The results and conclusions of this study indicate that the new spacing algorithm generally exhibits good performance; however, some types of target aircraft speed profiles can cause the spacing algorithm to command less than optimal speed control behavior.
A Multipopulation PSO Based Memetic Algorithm for Permutation Flow Shop Scheduling
Liu, Ruochen; Ma, Chenlin; Ma, Wenping; Li, Yangyang
2013-01-01
The permutation flow shop scheduling problem (PFSSP) is part of production scheduling and belongs to the class of hardest combinatorial optimization problems. In this paper, a multipopulation particle swarm optimization (PSO) based memetic algorithm (MPSOMA) is proposed. In the proposed algorithm, the whole particle swarm population is divided into three subpopulations in which each particle evolves itself by the standard PSO and then updates each subpopulation by using different local search schemes such as variable neighborhood search (VNS) and individual improvement scheme (IIS). Then, the best particle of each subpopulation is selected to construct a probabilistic model by using estimation of distribution algorithm (EDA) and three particles are sampled from the probabilistic model to update the worst individual in each subpopulation. The best particle in the entire particle swarm is used to update the global optimal solution. The proposed MPSOMA is compared with two recently proposed algorithms, namely, PSO based memetic algorithm (PSOMA) and hybrid particle swarm optimization with estimation of distribution algorithm (PSOEDA), on 29 well-known PFSSPs taken from the OR-library, and the experimental results show that it is an effective approach for the PFSSP. PMID:24453841
Arruti, Andoni; Cearreta, Idoia; Álvarez, Aitor; Lazkano, Elena; Sierra, Basilio
2014-01-01
Study of emotions in human–computer interaction is a growing research area. This paper shows an attempt to select the most significant features for emotion recognition in spoken Basque and Spanish using different methods for feature selection. The RekEmozio database was used as the experimental data set. Several Machine Learning paradigms were used for the emotion classification task. Experiments were executed in three phases, using different sets of features as classification variables in each phase. Moreover, feature subset selection was applied at each phase in order to seek the most relevant feature subset. The three-phase approach was selected to check the validity of the proposed approach. Achieved results show that an instance-based learning algorithm using feature subset selection techniques based on evolutionary algorithms is the best Machine Learning paradigm in automatic emotion recognition, with all different feature sets, obtaining a mean emotion recognition rate of 80.05% in Basque and 74.82% in Spanish. In order to check the goodness of the proposed process, a greedy searching approach (FSS-Forward) has been applied and a comparison between them is provided. Based on achieved results, a set of most relevant non-speaker dependent features is proposed for both languages and new perspectives are suggested. PMID:25279686
Armañanzas, Rubén; Bielza, Concha; Chaudhuri, Kallol Ray; Martinez-Martin, Pablo; Larrañaga, Pedro
2013-07-01
Is it possible to predict the severity staging of a Parkinson's disease (PD) patient using scores of non-motor symptoms? This is the kickoff question for a machine learning approach to classify two widely known PD severity indexes using individual tests from a broad set of non-motor PD clinical scales only. The Hoehn & Yahr index and clinical impression of severity index are global measures of PD severity. They constitute the labels to be assigned in two supervised classification problems using only non-motor symptom tests as predictor variables. Such predictors come from a wide range of PD symptoms, such as cognitive impairment, psychiatric complications, autonomic dysfunction or sleep disturbance. The classification was coupled with a feature subset selection task using an advanced evolutionary algorithm, namely an estimation of distribution algorithm. Results show how five different classification paradigms using a wrapper feature selection scheme are capable of predicting each of the class variables with estimated accuracy in the range of 72-92%. In addition, classification into the main three severity categories (mild, moderate and severe) was split into dichotomic problems where binary classifiers perform better and select different subsets of non-motor symptoms. The number of jointly selected symptoms throughout the whole process was low, suggesting a link between the selected non-motor symptoms and the general severity of the disease. Quantitative results are discussed from a medical point of view, reflecting a clear translation to the clinical manifestations of PD. Moreover, results include a brief panel of non-motor symptoms that could help clinical practitioners to identify patients who are at different stages of the disease from a limited set of symptoms, such as hallucinations, fainting, inability to control body sphincters or believing in unlikely facts. Copyright © 2013 Elsevier B.V. All rights reserved.
Towards a robust framework for catchment classification
NASA Astrophysics Data System (ADS)
Deshmukh, A.; Samal, A.; Singh, R.
2017-12-01
Classification of catchments based on various measures of similarity has emerged as an important technique to understand regional scale hydrologic behavior. Classification of catchment characteristics and/or streamflow response has been used to reveal which characteristics are more likely to explain the observed variability of hydrologic response. However, numerous algorithms for supervised or unsupervised classification are available, making it hard to identify the algorithm most suitable for the dataset at hand. Consequently, existing catchment classification studies vary significantly in the classification algorithms employed, with no previous attempt at understanding the degree of uncertainty in classification due to this algorithmic choice. This hinders the generalizability of interpretations related to hydrologic behavior. Our goal is to develop a protocol that can be followed while classifying hydrologic datasets. We focus on a classification framework for unsupervised classification and provide a step-by-step classification procedure. The steps include testing the clusterability of the original dataset prior to classification, feature selection, validation of clustered data, and quantification of similarity of two clusterings. We test several commonly available methods within this framework to understand the level of similarity of classification results across algorithms. We apply the proposed framework on recently developed datasets for India to analyze to what extent catchment properties can explain observed catchment response. Our testing dataset includes watershed characteristics for over 200 watersheds which comprise both natural (physio-climatic) characteristics and socio-economic characteristics. This framework allows us to understand the controls on observed hydrologic variability across India.
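The abstract lists the steps of the proposed protocol (clusterability test, clustering, internal validation, agreement between clusterings) without implementation detail. The sketch below is one possible rendering of those steps with standard scikit-learn tools; the Hopkins-style clusterability helper, the choice of algorithms and the synthetic attribute matrix are assumptions, not the authors' setup.

```python
# Sketch of the protocol on a catchment-attribute matrix X:
# (1) crude clusterability check, (2) cluster with two algorithms,
# (3) internal validation, (4) agreement between the two clusterings.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score
from sklearn.neighbors import NearestNeighbors

def hopkins_like(X, m=50, rng=np.random.default_rng(0)):
    """Values near 1 suggest clustered structure; near 0.5 suggests randomness."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    sample = X[rng.choice(len(X), size=min(m, len(X)), replace=False)]
    uniform = rng.uniform(X.min(0), X.max(0), size=sample.shape)
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]   # distance to nearest other real point
    d = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]  # distance from uniform points to data
    return d.sum() / (d.sum() + w.sum())

X = np.random.default_rng(1).normal(size=(200, 7))      # placeholder catchment attributes
print("clusterability ~", round(hopkins_like(X), 2))

labels_km = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_hc = AgglomerativeClustering(n_clusters=4).fit_predict(X)
print("silhouette:", silhouette_score(X, labels_km),
      "Davies-Bouldin:", davies_bouldin_score(X, labels_km))
print("agreement between algorithms (ARI):", adjusted_rand_score(labels_km, labels_hc))
```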
Wendel, Jochen; Buttenfield, Barbara P.; Stanislawski, Larry V.
2016-01-01
Knowledge of landscape type can inform cartographic generalization of hydrographic features, because landscape characteristics provide an important geographic context that affects variation in channel geometry, flow pattern, and network configuration. Landscape types are characterized by expansive spatial gradients, lacking abrupt changes between adjacent classes; and as having a limited number of outliers that might confound classification. The US Geological Survey (USGS) is exploring methods to automate generalization of features in the National Hydrography Data set (NHD), to associate specific sequences of processing operations and parameters with specific landscape characteristics, thus obviating manual selection of a unique processing strategy for every NHD watershed unit. A chronology of methods to delineate physiographic regions for the United States is described, including a recent maximum likelihood classification based on seven input variables. This research compares unsupervised and supervised algorithms applied to these seven input variables, to evaluate and possibly refine the recent classification. Evaluation metrics for unsupervised methods include the Davies–Bouldin index, the Silhouette index, and the Dunn index as well as quantization and topographic error metrics. Cross validation and misclassification rate analysis are used to evaluate supervised classification methods. The paper reports the comparative analysis and its impact on the selection of landscape regions. The compared solutions show problems in areas of high landscape diversity. There is some indication that additional input variables, additional classes, or more sophisticated methods can refine the existing classification.
Gorodeski, Eiran Z.; Ishwaran, Hemant; Kogalur, Udaya B.; Blackstone, Eugene H.; Hsich, Eileen; Zhang, Zhu-ming; Vitolins, Mara Z.; Manson, JoAnn E.; Curb, J. David; Martin, Lisa W.; Prineas, Ronald J.; Lauer, Michael S.
2013-01-01
Background The simultaneous contribution of hundreds of electrocardiographic biomarkers to prediction of long-term mortality in post-menopausal women with clinically normal resting electrocardiograms (ECGs) is unknown. Methods and Results We analyzed ECGs and all-cause mortality in 33,144 women enrolled in Women’s Health Initiative trials, who were without baseline cardiovascular disease or cancer, and had normal ECGs by Minnesota and Novacode criteria. Four hundred and seventy-seven ECG biomarkers, encompassing global and individual ECG findings, were measured using computer algorithms. During a median follow-up of 8.1 years (range for survivors 0.5–11.2 years), 1,229 women died. For analyses, the cohort was randomly split into derivation (n=22,096, deaths=819) and validation (n=11,048, deaths=410) subsets. ECG biomarkers, demographic, and clinical characteristics were simultaneously analyzed using both traditional Cox regression and Random Survival Forest (RSF), a novel algorithmic machine-learning approach. Regression modeling failed to converge. RSF variable selection yielded 20 variables that were independently predictive of long-term mortality, 14 of which were ECG biomarkers related to autonomic tone, atrial conduction, and ventricular depolarization and repolarization. Conclusions We identified 14 ECG biomarkers from amongst hundreds that were associated with long-term prognosis using a novel random forest variable selection methodology. These were related to autonomic tone, atrial conduction, ventricular depolarization, and ventricular repolarization. Quantitative ECG biomarkers have prognostic importance, and may be markers of subclinical disease in apparently healthy post-menopausal women. PMID:21862719
Stochastic Formal Correctness of Numerical Algorithms
NASA Technical Reports Server (NTRS)
Daumas, Marc; Lester, David; Martin-Dorel, Erik; Truffert, Annick
2009-01-01
We provide a framework to bound the probability that accumulated errors were never above a given threshold on numerical algorithms. Such algorithms are used for example in aircraft and nuclear power plants. This report contains simple formulas based on Levy's and Markov's inequalities and it presents a formal theory of random variables with a special focus on producing concrete results. We selected four very common applications that fit in our framework and cover the common practices of systems that evolve for a long time. We compute the number of bits that remain continuously significant in the first two applications with a probability of failure around one out of a billion, where worst case analysis considers that no significant bit remains. We are using PVS as such formal tools force explicit statement of all hypotheses and prevent incorrect uses of theorems.
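The report's "simple formulas based on Levy's and Markov's inequalities" are not reproduced in the abstract. As a hedged illustration only, a Markov-type bound of the general kind described might take the following form; this is a generic statement, not the report's theorem.

```latex
% Illustrative only: if the accumulated error after n operations is
% E_n = \sum_{i=1}^{n} \varepsilon_i with per-step bound \mathbb{E}|\varepsilon_i| \le \mu,
% then Markov's inequality gives
P\!\left(|E_n| \ge \lambda\right)
  \;\le\; \frac{\mathbb{E}|E_n|}{\lambda}
  \;\le\; \frac{n\,\mu}{\lambda},
% and choosing \lambda = 2^{-k} bounds the probability that fewer than k bits of the
% result remain continuously significant.
```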
Error propagation of partial least squares for parameters optimization in NIR modeling.
Du, Chenzhao; Dai, Shengyun; Qiao, Yanjiang; Wu, Zhisheng
2018-03-05
A novel methodology is proposed to determine the error propagation of partial least-square (PLS) for parameters optimization in near-infrared (NIR) modeling. The parameters include spectral pretreatment, latent variables and variable selection. In this paper, an open source dataset (corn) and a complicated dataset (Gardenia) were used to establish PLS models under different modeling parameters. Error propagation of modeling parameters for water quantity in corn and geniposide quantity in Gardenia was presented by both type I and type II errors. For example, when variable importance in the projection (VIP), interval partial least square (iPLS) and backward interval partial least square (BiPLS) variable selection algorithms were used for geniposide in Gardenia, compared with synergy interval partial least squares (SiPLS), the error weight varied from 5% to 65%, 55% and 15%. The results demonstrated how and to what extent the different modeling parameters affect error propagation of PLS for parameters optimization in NIR modeling. The larger the error weight, the worse the model. Finally, our trials completed a rigorous process for developing robust PLS models for corn and Gardenia under the optimal modeling parameters. Furthermore, it could provide significant guidance for the selection of modeling parameters of other multivariate calibration models. Copyright © 2017. Published by Elsevier B.V.
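One of the variable selection criteria named above, VIP, can be computed directly from a fitted PLS model. The sketch below shows the standard VIP score calculation with scikit-learn; the synthetic spectra, the number of latent variables and the VIP > 1 cut-off are common defaults assumed here, not the authors' settings, and the paper's error-propagation bookkeeping is not reproduced.

```python
# Sketch: fit a PLS model and compute VIP scores for variable selection.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls: PLSRegression, X: np.ndarray) -> np.ndarray:
    T = pls.transform(X)       # latent-variable scores (n_samples, n_components)
    W = pls.x_weights_         # (n_features, n_components)
    Q = pls.y_loadings_        # (n_targets, n_components)
    p = W.shape[0]
    ssy = np.sum(T ** 2, axis=0) * np.sum(Q ** 2, axis=0)   # y-variance explained per LV
    w_norm = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(p * (w_norm @ ssy) / ssy.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 700))                 # placeholder NIR spectra
y = 0.8 * X[:, 100] + rng.normal(scale=0.1, size=80)
pls = PLSRegression(n_components=5).fit(X, y)
selected = np.where(vip_scores(pls, X) > 1.0)[0]   # rule-of-thumb threshold
print(len(selected), "variables with VIP > 1")
```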
Error propagation of partial least squares for parameters optimization in NIR modeling
NASA Astrophysics Data System (ADS)
Du, Chenzhao; Dai, Shengyun; Qiao, Yanjiang; Wu, Zhisheng
2018-03-01
A novel methodology is proposed to determine the error propagation of partial least-square (PLS) for parameters optimization in near-infrared (NIR) modeling. The parameters include spectral pretreatment, latent variables and variable selection. In this paper, an open source dataset (corn) and a complicated dataset (Gardenia) were used to establish PLS models under different modeling parameters. Error propagation of modeling parameters for water quantity in corn and geniposide quantity in Gardenia was presented by both type I and type II errors. For example, when variable importance in the projection (VIP), interval partial least square (iPLS) and backward interval partial least square (BiPLS) variable selection algorithms were used for geniposide in Gardenia, compared with synergy interval partial least squares (SiPLS), the error weight varied from 5% to 65%, 55% and 15%. The results demonstrated how and to what extent the different modeling parameters affect error propagation of PLS for parameters optimization in NIR modeling. The larger the error weight, the worse the model. Finally, our trials completed a rigorous process for developing robust PLS models for corn and Gardenia under the optimal modeling parameters. Furthermore, it could provide significant guidance for the selection of modeling parameters of other multivariate calibration models.
Prediction of municipal solid waste generation using nonlinear autoregressive network.
Younes, Mohammad K; Nopiah, Z M; Basri, N E Ahmad; Basri, H; Abushammala, Mohammed F M; Maulud, K N A
2015-12-01
Most of the developing countries have solid waste management problems. Solid waste strategic planning requires accurate prediction of the quality and quantity of the generated waste. In developing countries, such as Malaysia, the solid waste generation rate is increasing rapidly, due to population growth and new consumption trends that characterize society. This paper proposes an artificial neural network (ANN) approach using feedforward nonlinear autoregressive network with exogenous inputs (NARX) to predict annual solid waste generation in relation to demographic and economic variables like population number, gross domestic product, electricity demand per capita and employment and unemployment numbers. In addition, variable selection procedures are also developed to select a significant explanatory variable. The model evaluation was performed using coefficient of determination (R(2)) and mean square error (MSE). The optimum model that produced the lowest testing MSE (2.46) and the highest R(2) (0.97) had three inputs (gross domestic product, population and employment), eight neurons and one lag in the hidden layer, and used Fletcher-Powell's conjugate gradient as the training algorithm.
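As a hedged illustration of the NARX setup described (lagged waste plus exogenous socio-economic inputs feeding a small feedforward network), the sketch below uses scikit-learn's MLPRegressor as a stand-in for the NARX network; the synthetic series, single output lag and hidden-layer size are assumptions, not the paper's configuration.

```python
# Sketch of a NARX-style regression: predict waste(t) from waste(t-1) plus
# exogenous drivers (GDP, population, employment). Data are synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
years = 30
gdp, pop, emp = (rng.normal(size=years) for _ in range(3))   # placeholder series
waste = 0.5 * gdp + 0.3 * pop + 0.2 * emp + rng.normal(scale=0.1, size=years)

X = np.column_stack([waste[:-1], gdp[1:], pop[1:], emp[1:]])  # one output lag + exogenous inputs
y = waste[1:]
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X, y)
pred = model.predict(X)
print("MSE:", mean_squared_error(y, pred), "R2:", r2_score(y, pred))
```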
Fouad, Marwa A; Tolba, Enas H; El-Shal, Manal A; El Kerdawy, Ahmed M
2018-05-11
The justified, continual emergence of new β-lactam antibiotics creates the need to develop suitable analytical methods that accelerate and facilitate their analysis. A face-centered central composite experimental design was adopted using different levels of phosphate buffer pH, acetonitrile percentage at zero time and after 15 min in a gradient program to obtain the optimum chromatographic conditions for the elution of 31 β-lactam antibiotics. Retention factors were used as the target property to build two QSRR models utilizing the conventional forward selection and the advanced nature-inspired firefly algorithm for descriptor selection, coupled with multiple linear regression. The obtained models showed high performance in both internal and external validation indicating their robustness and predictive ability. The Williams-Hotelling test and Student's t-test showed that there is no statistically significant difference between the models' results. Y-randomization validation showed that the obtained models are due to significant correlation between the selected molecular descriptors and the analytes' chromatographic retention. These results indicate that the generated FS-MLR and FFA-MLR models show comparable quality on both the training and validation levels. They also gave comparable information about the molecular features that influence the retention behavior of β-lactams under the current chromatographic conditions. We can conclude that in some cases a simple conventional feature selection algorithm can be used to generate robust and predictive models comparable to those generated using advanced ones. Copyright © 2018 Elsevier B.V. All rights reserved.
On the period determination of ASAS eclipsing binaries
NASA Astrophysics Data System (ADS)
Mayangsari, L.; Priyatikanto, R.; Putra, M.
2014-03-01
Variable stars, and particularly eclipsing binaries, are essential astronomical phenomena. Surveys are the backbone of astronomy, and many discoveries of variable stars are the results of surveys. The All-Sky Automated Survey (ASAS) is one of the observing projects whose ultimate goal is photometric monitoring of variable stars. Since its first light in 1997, ASAS has collected 50,099 variable stars, with 11,076 eclipsing binaries among them. In the present work we focus on the period determination of the eclipsing binaries. Since each ASAS eclipsing binary light curve is sparsely sampled, period determination of any system is not a straightforward process. For 30 samples of such systems we compare the implementation of the Lomb-Scargle algorithm, which is Fast Fourier Transform (FFT) based, and the Phase Dispersion Minimization (PDM) method, which is not FFT based, to determine their periods. It is demonstrated that PDM gives better performance at handling eclipsing detached (ED) systems whose variability is non-sinusoidal. Moreover, using semi-automatic recipes, we obtain better period solutions and satisfactorily improve 53% of the selected objects' light curves, but fail for another 7% of the selected objects. In addition, we also highlight 4 interesting objects for further investigation.
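The two period-search routes compared above can be illustrated on a sparse, unevenly sampled light curve. The sketch below uses astropy's Lomb-Scargle periodogram and a minimal phase-dispersion-minimization scan; the synthetic eclipse signal, bin count and period grid are assumptions, not the paper's recipe.

```python
# Sketch: Lomb-Scargle (FFT-style) vs. a minimal PDM scan on a sparse,
# unevenly sampled, non-sinusoidal (eclipse-like) light curve.
import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 300, 150))                    # sparse ASAS-like sampling
true_p = 2.73
phase = (t / true_p) % 1.0
mag = 12 + 0.5 * np.exp(-((phase - 0.5) / 0.04) ** 2) + rng.normal(0, 0.02, t.size)

freq, power = LombScargle(t, mag).autopower(minimum_frequency=0.05, maximum_frequency=2.0)
print("Lomb-Scargle best period:", 1 / freq[np.argmax(power)])

def pdm_theta(t, y, period, nbins=10):
    """Ratio of pooled within-bin variance to total variance (smaller = better)."""
    ph = (t / period) % 1.0
    idx = np.minimum((ph * nbins).astype(int), nbins - 1)
    within = sum(y[idx == b].var() * (idx == b).sum()
                 for b in range(nbins) if (idx == b).sum() > 1)
    return within / (y.var() * y.size)

periods = np.linspace(0.5, 10, 5000)
print("PDM best period:", periods[np.argmin([pdm_theta(t, mag, p) for p in periods])])
```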
Smart, Otis; Burrell, Lauren
2014-01-01
Pattern classification for intracranial electroencephalogram (iEEG) and functional magnetic resonance imaging (fMRI) signals has furthered epilepsy research toward understanding the origin of epileptic seizures and localizing dysfunctional brain tissue for treatment. Prior research has demonstrated that implicitly selecting features with a genetic programming (GP) algorithm more effectively determined the proper features to discern biomarker and non-biomarker interictal iEEG and fMRI activity than conventional feature selection approaches. However for each the iEEG and fMRI modalities, it is still uncertain whether the stochastic properties of indirect feature selection with a GP yield (a) consistent results within a patient data set and (b) features that are specific or universal across multiple patient data sets. We examined the reproducibility of implicitly selecting features to classify interictal activity using a GP algorithm by performing several selection trials and subsequent frequent itemset mining (FIM) for separate iEEG and fMRI epilepsy patient data. We observed within-subject consistency and across-subject variability with some small similarity for selected features, indicating a clear need for patient-specific features and possible need for patient-specific feature selection or/and classification. For the fMRI, using nearest-neighbor classification and 30 GP generations, we obtained over 60% median sensitivity and over 60% median selectivity. For the iEEG, using nearest-neighbor classification and 30 GP generations, we obtained over 65% median sensitivity and over 65% median selectivity except one patient. PMID:25580059
Rough sets and Laplacian score based cost-sensitive feature selection
Yu, Shenglong
2018-01-01
Cost-sensitive feature selection learning is an important preprocessing step in machine learning and data mining. Recently, most existing cost-sensitive feature selection algorithms are heuristic algorithms, which evaluate the importance of each feature individually and select features one by one. Obviously, these algorithms do not consider the relationship among features. In this paper, we propose a new algorithm for minimal cost feature selection called the rough sets and Laplacian score based cost-sensitive feature selection. The importance of each feature is evaluated by both rough sets and Laplacian score. Compared with heuristic algorithms, the proposed algorithm takes into consideration the relationship among features with locality preservation of Laplacian score. We select a feature subset with maximal feature importance and minimal cost when cost is undertaken in parallel, where the cost is given by three different distributions to simulate different applications. Different from existing cost-sensitive feature selection algorithms, our algorithm simultaneously selects out a predetermined number of “good” features. Extensive experimental results show that the approach is efficient and able to effectively obtain the minimum cost subset. In addition, the results of our method are more promising than the results of other cost-sensitive feature selection algorithms. PMID:29912884
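The Laplacian-score half of the method above can be sketched directly: score each feature by how well it preserves local structure on a kNN graph, then trade the score off against a per-feature cost. The cost weighting shown is an illustrative assumption and the rough-set component of the paper is omitted.

```python
# Sketch of the Laplacian score (lower = better locality preservation),
# combined with a toy per-feature cost trade-off.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X, k=5, sigma=1.0):
    W = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    W = np.where(W > 0, np.exp(-(W ** 2) / sigma), 0.0)   # heat-kernel weights
    W = np.maximum(W, W.T)                                # symmetrize the kNN graph
    D = W.sum(axis=1)
    L = np.diag(D) - W                                    # graph Laplacian
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        f = X[:, j] - (X[:, j] @ D) / D.sum()             # remove the degree-weighted mean
        scores[j] = (f @ L @ f) / max(f @ (D * f), 1e-12)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
cost = rng.uniform(1, 5, size=20)                         # per-feature test cost
ranking = np.argsort(laplacian_scores(X) + 0.1 * cost)    # toy importance/cost trade-off
print("cheapest informative features:", ranking[:5])
```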
Rough sets and Laplacian score based cost-sensitive feature selection.
Yu, Shenglong; Zhao, Hong
2018-01-01
Cost-sensitive feature selection learning is an important preprocessing step in machine learning and data mining. Recently, most existing cost-sensitive feature selection algorithms are heuristic algorithms, which evaluate the importance of each feature individually and select features one by one. Obviously, these algorithms do not consider the relationship among features. In this paper, we propose a new algorithm for minimal cost feature selection called the rough sets and Laplacian score based cost-sensitive feature selection. The importance of each feature is evaluated by both rough sets and Laplacian score. Compared with heuristic algorithms, the proposed algorithm takes into consideration the relationship among features with locality preservation of Laplacian score. We select a feature subset with maximal feature importance and minimal cost when cost is undertaken in parallel, where the cost is given by three different distributions to simulate different applications. Different from existing cost-sensitive feature selection algorithms, our algorithm simultaneously selects out a predetermined number of "good" features. Extensive experimental results show that the approach is efficient and able to effectively obtain the minimum cost subset. In addition, the results of our method are more promising than the results of other cost-sensitive feature selection algorithms.
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.
2011-01-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W
2012-06-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
Variable selection based cotton bollworm odor spectroscopic detection
NASA Astrophysics Data System (ADS)
Lü, Chengxu; Gai, Shasha; Luo, Min; Zhao, Bo
2016-10-01
Aiming at rapid automatic pest detection for efficient and targeted pesticide application, and addressing the problem that reflectance spectral signals are masked and attenuated by the solid plant, the possibility of near-infrared spectroscopy (NIRS) detection of cotton bollworm odor is studied. Three cotton bollworm odor samples and 3 blank air gas samples were prepared. Different concentrations of cotton bollworm odor were prepared by mixing the above gas samples, resulting in a calibration group of 62 samples and a validation group of 31 samples. The spectral collection system includes a light source, optical fiber, sample chamber and spectrometer. Spectra were pretreated by baseline correction, modeled with partial least squares (PLS), and optimized by genetic algorithm (GA) and competitive adaptive reweighted sampling (CARS). Minor count differences are found among spectra of different cotton bollworm odor concentrations. A PLS model on all variables was built, presenting an RMSEV of 14 and an RV2 of 0.89; its theoretical basis is that insects volatilize specific odors, including pheromones and allelochemicals, which are used for intra-specific and inter-specific communication and can be detected by NIR spectroscopy. GA selected 28 sensitive variables, giving a model with an RMSEV of 14 and an RV2 of 0.90. By comparison, CARS selected 8 sensitive variables, giving a model with an RMSEV of 13 and an RV2 of 0.92. The CARS model employs only 1.5% of the variables yet presents a smaller error than the all-variable model. The odor-gas-based NIR technique shows potential for cotton bollworm detection.
Semi-Automatic Extraction Algorithm for Images of the Ciliary Muscle
Kao, Chiu-Yen; Richdale, Kathryn; Sinnott, Loraine T.; Ernst, Lauren E.; Bailey, Melissa D.
2011-01-01
Purpose To develop and evaluate a semi-automatic algorithm for segmentation and morphological assessment of the dimensions of the ciliary muscle in Visante™ Anterior Segment Optical Coherence Tomography images. Methods Geometric distortions in Visante images analyzed as binary files were assessed by imaging an optical flat and human donor tissue. The appropriate pixel/mm conversion factor to use for air (n = 1) was estimated by imaging calibration spheres. A semi-automatic algorithm was developed to extract the dimensions of the ciliary muscle from Visante images. Measurements were also made manually using Visante software calipers. Interclass correlation coefficients (ICC) and Bland-Altman analyses were used to compare the methods. A multilevel model was fitted to estimate the variance of algorithm measurements that was due to differences within- and between-examiners in scleral spur selection versus biological variability. Results The optical flat and the human donor tissue were imaged and appeared without geometric distortions in binary file format. Bland-Altman analyses revealed that caliper measurements tended to underestimate ciliary muscle thickness at 3 mm posterior to the scleral spur in subjects with the thickest ciliary muscles (t = 3.6, p < 0.001). The percent variance due to within- or between-examiner differences in scleral spur selection was found to be small (6%) when compared to the variance due to biological difference across subjects (80%). Using the mean of measurements from three images achieved an estimated ICC of 0.85. Conclusions The semi-automatic algorithm successfully segmented the ciliary muscle for further measurement. Using the algorithm to follow the scleral curvature to locate more posterior measurements is critical to avoid underestimating thickness measurements. This semi-automatic algorithm will allow for repeatable, efficient, and masked ciliary muscle measurements in large datasets. PMID:21169877
NASA Astrophysics Data System (ADS)
Kurugol, Sila; Dy, Jennifer G.; Rajadhyaksha, Milind; Gossage, Kirk W.; Weissmann, Jesse; Brooks, Dana H.
2011-03-01
The examination of the dermis/epidermis junction (DEJ) is clinically important for skin cancer diagnosis. Reflectance confocal microscopy (RCM) is an emerging tool for detection of skin cancers in vivo. However, visual localization of the DEJ in RCM images, with high accuracy and repeatability, is challenging, especially in fair skin, due to low contrast, heterogeneous structure and high inter- and intra-subject variability. We recently proposed a semi-automated algorithm to localize the DEJ in z-stacks of RCM images of fair skin, based on feature segmentation and classification. Here we extend the algorithm to dark skin. The extended algorithm first decides the skin type and then applies the appropriate DEJ localization method. In dark skin, strong backscatter from the pigment melanin causes the basal cells above the DEJ to appear with high contrast. To locate those high contrast regions, the algorithm operates on small tiles (regions) and finds the peaks of the smoothed average intensity depth profile of each tile. However, for some tiles, due to heterogeneity, multiple peaks in the depth profile exist and the strongest peak might not be the basal layer peak. To select the correct peak, basal cells are represented with a vector of texture features. The peak with most similar features to this feature vector is selected. The results show that the algorithm detected the skin types correctly for all 17 stacks tested (8 fair, 9 dark). The DEJ detection algorithm achieved an average distance from the ground truth DEJ surface of around 4.7μm for dark skin and around 7-14μm for fair skin.
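The dark-skin branch described above (smooth a tile's average-intensity depth profile, find candidate peaks, pick the peak whose texture is most like a basal-cell template) can be sketched as follows. The texture features, template vector and peak-prominence setting are placeholders, not the authors' feature set.

```python
# Sketch of the dark-skin branch: find peaks in a tile's smoothed average
# intensity depth profile, then pick the candidate most similar to a
# basal-cell texture template.
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import find_peaks

def candidate_peaks(stack_tile):
    """stack_tile: (n_depths, h, w) RCM intensities for one lateral tile."""
    profile = stack_tile.mean(axis=(1, 2))
    smooth = uniform_filter1d(profile, size=5)
    peaks, _ = find_peaks(smooth, prominence=smooth.std() * 0.5)
    return peaks, smooth

def pick_basal_peak(stack_tile, peaks, template):
    feats = []
    for z in peaks:                          # toy texture features at each peak depth
        sl = stack_tile[z]
        feats.append([sl.mean(), sl.std(), np.abs(np.diff(sl, axis=1)).mean()])
    feats = np.asarray(feats)
    sim = feats @ template / (np.linalg.norm(feats, axis=1) * np.linalg.norm(template) + 1e-12)
    return peaks[np.argmax(sim)]             # depth index of the selected (basal) peak

rng = np.random.default_rng(0)
tile = rng.random((60, 32, 32))              # placeholder z-stack tile
peaks, _ = candidate_peaks(tile)
template = np.array([0.6, 0.25, 0.2])        # assumed basal-cell feature template
if peaks.size:
    print("estimated basal-layer depth index:", pick_basal_peak(tile, peaks, template))
```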
A Lifetime Maximization Relay Selection Scheme in Wireless Body Area Networks.
Zhang, Yu; Zhang, Bing; Zhang, Shi
2017-06-02
Network lifetime is one of the most important metrics in Wireless Body Area Networks (WBANs). In this paper, a relay selection scheme is proposed under the topology constraints specified in the IEEE 802.15.6 standard to maximize the lifetime of WBANs through formulating and solving an optimization problem where the relay selection of each node acts as the optimization variable. Considering the diversity of the sensor nodes in WBANs, the optimization problem takes not only energy consumption rate but also energy difference among sensor nodes into account to improve the network lifetime performance. Since it is Non-deterministic Polynomial-hard (NP-hard) and intractable, a heuristic solution is then designed to rapidly address the optimization. The simulation results indicate that the proposed relay selection scheme has better performance in network lifetime compared with existing algorithms and that the heuristic solution has low time complexity with only a negligible performance degradation gap from the optimal value. Furthermore, we also conduct simulations based on a general WBAN model to comprehensively illustrate the advantages of the proposed algorithm. At the end of the evaluation, we validate the feasibility of our proposed scheme via an implementation discussion.
Muhlestein, Whitney E; Akagi, Dallin S; Kallos, Justiss A; Morone, Peter J; Weaver, Kyle D; Thompson, Reid C; Chambless, Lola B
2018-04-01
Objective Machine learning (ML) algorithms are powerful tools for predicting patient outcomes. This study pilots a novel approach to algorithm selection and model creation using prediction of discharge disposition following meningioma resection as a proof of concept. Materials and Methods A diversity of ML algorithms were trained on a single-institution database of meningioma patients to predict discharge disposition. Algorithms were ranked by predictive power and top performers were combined to create an ensemble model. The final ensemble was internally validated on never-before-seen data to demonstrate generalizability. The predictive power of the ensemble was compared with a logistic regression. Further analyses were performed to identify how important variables impact the ensemble. Results Our ensemble model predicted disposition significantly better than a logistic regression (area under the curve of 0.78 and 0.71, respectively, p = 0.01). Tumor size, presentation at the emergency department, body mass index, convexity location, and preoperative motor deficit most strongly influence the model, though the independent impact of individual variables is nuanced. Conclusion Using a novel ML technique, we built a guided ML ensemble model that predicts discharge destination following meningioma resection with greater predictive power than a logistic regression, and that provides greater clinical insight than a univariate analysis. These techniques can be extended to predict many other patient outcomes of interest.
Analysis of Decentralized Variable Structure Control for Collective Search by Mobile Robots
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feddema, J.; Goldsmith, S.; Robinett, R.
1998-11-04
This paper presents an analysis of a decentralized coordination strategy for organizing and controlling a team of mobile robots performing collective search. The alpha-beta coordination strategy is a family of collective search algorithms that allow teams of communicating robots to implicitly coordinate their search activities through a division of labor based on self-selected roles. In an alpha-beta team, alpha agents are motivated to improve their status by exploring new regions of the search space. Beta agents are conservative, and rely on the alpha agents to provide advanced information on favorable regions of the search space. An agent selects its current role dynamically based on its current status value relative to the current status values of the other team members. Status is determined by some function of the agent's sensor readings, and is generally a measurement of source intensity at the agent's current location. Variations on the decision rules determining alpha and beta behavior produce different versions of the algorithm that lead to different global properties. The alpha-beta strategy is based on a simple finite-state machine that implements a form of Variable Structure Control (VSC). The VSC system changes the dynamics of the collective system by abruptly switching at defined states to alternative control laws. In VSC, Lyapunov's direct method is often used to design control surfaces which guide the system to a given goal. We introduce the alpha-beta algorithm and present an analysis of the equilibrium point and the global stability of the alpha-beta algorithm based on Lyapunov's method.
Sparse Zero-Sum Games as Stable Functional Feature Selection
Sokolovska, Nataliya; Teytaud, Olivier; Rizkalla, Salwa; Clément, Karine; Zucker, Jean-Daniel
2015-01-01
In large-scale systems biology applications, features are structured in hidden functional categories whose predictive power is identical. Feature selection, therefore, can lead not only to a problem with a reduced dimensionality, but also reveal some knowledge on functional classes of variables. In this contribution, we propose a framework based on a sparse zero-sum game which performs a stable functional feature selection. In particular, the approach is based on feature subsets ranking by a thresholding stochastic bandit. We provide a theoretical analysis of the introduced algorithm. We illustrate by experiments on both synthetic and real complex data that the proposed method is competitive from the predictive and stability viewpoints. PMID:26325268
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tessier, Francois; Vishwanath, Venkatram
2017-11-28
Reading and writing data efficiently from different tiers of storage is necessary for most scientific simulations to achieve good performance at scale. Many software solutions have been developed to decrease the I/O bottleneck. One well-known strategy, in the context of collective I/O operations, is the two-phase I/O scheme. This strategy consists of selecting a subset of processes to aggregate contiguous pieces of data before performing reads/writes. In our previous work, we implemented the two-phase I/O scheme with an MPI-based topology-aware algorithm. Our algorithm showed very good performance at scale compared to the standard I/O libraries such as POSIX I/O and MPI I/O. However, the algorithm had several limitations hindering a satisfying reproducibility of our experiments. In this paper, we extend our work by 1) identifying the obstacles we face in reproducing our experiments and 2) discovering solutions that reduce the unpredictability of our results.
Robust Integration Schemes for Generalized Viscoplasticity with Internal-State Variables
NASA Technical Reports Server (NTRS)
Saleeb, Atef F.; Li, W.; Wilt, Thomas E.
1997-01-01
The scope of the work in this presentation focuses on the development of algorithms for the integration of rate-dependent constitutive equations. In view of their robustness, i.e., their superior stability and convergence properties for isotropic and anisotropic coupled viscoplastic-damage models, implicit integration schemes have been selected. The scheme adopted is the simplest in its class and is one of the most widely used implicit integrators at present.
Proteomic Prediction of Breast Cancer Risk: A Cohort Study
2007-03-01
(c) Data processing. Data analysis was performed using in-house software (Du P, Angeletti RH. Automatic deconvolution of... isotope-resolved mass spectra using variable selection and quantized peptide mass distribution. Anal Chem., 78:3385-92, 2006; P Du, R Sudha, MB... control. Reportable Outcomes: So far our publications have been on the development of algorithms for signal processing: 1. Du P, Angeletti RH
Tanner, Evan P; Papeş, Monica; Elmore, R Dwayne; Fuhlendorf, Samuel D; Davis, Craig A
2017-01-01
Ecological niche models (ENMs) have increasingly been used to estimate the potential effects of climate change on species' distributions worldwide. Recently, predictions of species abundance have also been obtained with such models, though knowledge about the climatic variables affecting species abundance is often lacking. To address this, we used a well-studied guild (temperate North American quail) and the Maxent modeling algorithm to compare model performance of three variable selection approaches: correlation/variable contribution (CVC), biological (i.e., variables known to affect species abundance), and random. We then applied the best approach to forecast potential distributions, under future climatic conditions, and analyze future potential distributions in light of available abundance data and presence-only occurrence data. To estimate species' distributional shifts we generated ensemble forecasts using four global circulation models, four representative concentration pathways, and two time periods (2050 and 2070). Furthermore, we present distributional shifts where 75%, 90%, and 100% of our ensemble models agreed. The CVC variable selection approach outperformed our biological approach for four of the six species. Model projections indicated species-specific effects of climate change on future distributions of temperate North American quail. The Gambel's quail (Callipepla gambelii) was the only species predicted to gain area in climatic suitability across all three scenarios of ensemble model agreement. Conversely, the scaled quail (Callipepla squamata) was the only species predicted to lose area in climatic suitability across all three scenarios of ensemble model agreement. Our models projected future loss of areas for the northern bobwhite (Colinus virginianus) and scaled quail in portions of their distributions which are currently areas of high abundance. Climatic variables that influence local abundance may not always scale up to influence species' distributions. Special attention should be given to selecting variables for ENMs, and tests of model performance should be used to validate the choice of variables.
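The correlation/variable-contribution (CVC) approach compared above is described only in outline. The sketch below shows one common rendering of such a screen: drop one member of each highly correlated predictor pair, then rank survivors by a simple contribution proxy. The |r| threshold and the proxy are assumptions; Maxent's own contribution metric is not reproduced here.

```python
# Sketch of a CVC-style screen for bioclimatic predictors.
import numpy as np

def cvc_screen(X, presence, r_max=0.7):
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = list(range(X.shape[1]))
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if i in keep and j in keep and corr[i, j] > r_max:
                keep.remove(j)                      # drop the later member of a correlated pair
    # contribution proxy: standardized mean difference between presence and background cells
    contrib = [abs(X[presence == 1, j].mean() - X[presence == 0, j].mean())
               / (X[:, j].std() + 1e-12) for j in keep]
    return [v for _, v in sorted(zip(contrib, keep), reverse=True)]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 19))                      # 19 bioclim-like variables
X[:, 1] = 0.95 * X[:, 0] + rng.normal(scale=0.1, size=500)   # inject a collinear pair
presence = rng.integers(0, 2, size=500)
print("retained variables, most contributing first:", cvc_screen(X, presence))
```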
A Network Selection Algorithm Considering Power Consumption in Hybrid Wireless Networks
NASA Astrophysics Data System (ADS)
Joe, Inwhee; Kim, Won-Tae; Hong, Seokjoon
In this paper, we propose a novel network selection algorithm considering power consumption in hybrid wireless networks for vertical handover. CDMA, WiBro and WLAN networks are the candidate networks for this selection algorithm. This algorithm is composed of the power consumption prediction algorithm and the final network selection algorithm. The power consumption prediction algorithm estimates the expected lifetime of the mobile station based on the current battery level, traffic class and power consumption for each network interface card of the mobile station. If the expected lifetime of the mobile station in a certain network is not long enough compared to the handover delay, this particular network will be removed from the candidate network list, thereby preventing unnecessary handovers in the preprocessing procedure. On the other hand, the final network selection algorithm consists of AHP (Analytic Hierarchical Process) and GRA (Grey Relational Analysis). The global factors of the network selection structure are QoS, cost and lifetime. If the user preference is lifetime, our selection algorithm selects the network that offers the longest service duration due to low power consumption. Also, we conduct some simulations using the OPNET simulation tool. The simulation results show that the proposed algorithm provides a longer lifetime in the hybrid wireless network environment.
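As a hedged sketch of the two-stage selection described (prune networks whose predicted lifetime is too short relative to the handover delay, then rank the rest over QoS, cost and lifetime), the code below collapses the AHP weighting step into fixed weights and uses placeholder attribute values; none of the numbers come from the paper.

```python
# Sketch: (1) lifetime-based pruning, (2) simple grey relational analysis (GRA) ranking.
import numpy as np

networks = ["CDMA", "WiBro", "WLAN"]
power_draw_w = np.array([0.8, 1.2, 1.0])          # assumed per-interface draw for this traffic class
battery_j = 3600.0
handover_margin_s = 120.0
lifetime_s = battery_j / power_draw_w
candidates = [i for i in range(3) if lifetime_s[i] > handover_margin_s]

# attribute matrix: rows = candidate networks; cols = QoS (larger better),
# cost (smaller better), lifetime (larger better); values are placeholders
A = np.array([[0.6, 0.3, lifetime_s[0]],
              [0.8, 0.6, lifetime_s[1]],
              [0.9, 0.2, lifetime_s[2]]])[candidates]
weights = np.array([0.4, 0.3, 0.3])               # stand-in for AHP-derived weights

norm = A.copy()
norm[:, [0, 2]] = (A[:, [0, 2]] - A[:, [0, 2]].min(0)) / (np.ptp(A[:, [0, 2]], 0) + 1e-12)
norm[:, 1] = (A[:, 1].max() - A[:, 1]) / (np.ptp(A[:, 1]) + 1e-12)   # cost: smaller is better
delta = np.abs(1.0 - norm)                        # distance to the ideal (all-ones) sequence
xi = (delta.min() + 0.5 * delta.max()) / (delta + 0.5 * delta.max())  # grey coefficients
grade = xi @ weights
print("selected network:", networks[candidates[int(np.argmax(grade))]])
```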
Rapid and Simultaneous Prediction of Eight Diesel Quality Parameters through ATR-FTIR Analysis.
Nespeca, Maurilio Gustavo; Hatanaka, Rafael Rodrigues; Flumignan, Danilo Luiz; de Oliveira, José Eduardo
2018-01-01
Quality assessment of diesel fuel is highly necessary for society, but the costs and time spent are very high while using standard methods. Therefore, this study aimed to develop an analytical method capable of simultaneously determining eight diesel quality parameters (density; flash point; total sulfur content; distillation temperatures at 10% (T10), 50% (T50), and 85% (T85) recovery; cetane index; and biodiesel content) through attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy and the multivariate regression method, partial least square (PLS). For this purpose, the quality parameters of 409 samples were determined using standard methods, and their spectra were acquired in ranges of 4000-650 cm -1 . The use of the multivariate filters, generalized least squares weighting (GLSW) and orthogonal signal correction (OSC), was evaluated to improve the signal-to-noise ratio of the models. Likewise, four variable selection approaches were tested: manual exclusion, forward interval PLS (FiPLS), backward interval PLS (BiPLS), and genetic algorithm (GA). The multivariate filters and variables selection algorithms generated more fitted and accurate PLS models. According to the validation, the FTIR/PLS models presented accuracy comparable to the reference methods and, therefore, the proposed method can be applied in the diesel routine monitoring to significantly reduce costs and analysis time.
Rapid and Simultaneous Prediction of Eight Diesel Quality Parameters through ATR-FTIR Analysis
Hatanaka, Rafael Rodrigues; Flumignan, Danilo Luiz; de Oliveira, José Eduardo
2018-01-01
Quality assessment of diesel fuel is highly necessary for society, but the costs and time spent are very high while using standard methods. Therefore, this study aimed to develop an analytical method capable of simultaneously determining eight diesel quality parameters (density; flash point; total sulfur content; distillation temperatures at 10% (T10), 50% (T50), and 85% (T85) recovery; cetane index; and biodiesel content) through attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy and the multivariate regression method, partial least square (PLS). For this purpose, the quality parameters of 409 samples were determined using standard methods, and their spectra were acquired in ranges of 4000–650 cm−1. The use of the multivariate filters, generalized least squares weighting (GLSW) and orthogonal signal correction (OSC), was evaluated to improve the signal-to-noise ratio of the models. Likewise, four variable selection approaches were tested: manual exclusion, forward interval PLS (FiPLS), backward interval PLS (BiPLS), and genetic algorithm (GA). The multivariate filters and variables selection algorithms generated more fitted and accurate PLS models. According to the validation, the FTIR/PLS models presented accuracy comparable to the reference methods and, therefore, the proposed method can be applied in the diesel routine monitoring to significantly reduce costs and analysis time. PMID:29629209
Pathway-Based Kernel Boosting for the Analysis of Genome-Wide Association Studies
Manitz, Juliane; Burger, Patricia; Amos, Christopher I.; Chang-Claude, Jenny; Wichmann, Heinz-Erich; Kneib, Thomas; Bickeböller, Heike
2017-01-01
The analysis of genome-wide association studies (GWAS) benefits from the investigation of biologically meaningful gene sets, such as gene-interaction networks (pathways). We propose an extension to a successful kernel-based pathway analysis approach by integrating kernel functions into a powerful algorithmic framework for variable selection, to enable investigation of multiple pathways simultaneously. We employ genetic similarity kernels from the logistic kernel machine test (LKMT) as base-learners in a boosting algorithm. A model to explain case-control status is created iteratively by selecting pathways that improve its prediction ability. We evaluated our method in simulation studies adopting 50 pathways for different sample sizes and genetic effect strengths. Additionally, we included an exemplary application of kernel boosting to a rheumatoid arthritis and a lung cancer dataset. Simulations indicate that kernel boosting outperforms the LKMT in certain genetic scenarios. Applications to GWAS data on rheumatoid arthritis and lung cancer resulted in sparse models which were based on pathways interpretable in a clinical sense. Kernel boosting is highly flexible in terms of considered variables and overcomes the problem of multiple testing. Additionally, it enables the prediction of clinical outcomes. Thus, kernel boosting constitutes a new, powerful tool in the analysis of GWAS data and towards the understanding of biological processes involved in disease susceptibility. PMID:28785300
Pathway-Based Kernel Boosting for the Analysis of Genome-Wide Association Studies.
Friedrichs, Stefanie; Manitz, Juliane; Burger, Patricia; Amos, Christopher I; Risch, Angela; Chang-Claude, Jenny; Wichmann, Heinz-Erich; Kneib, Thomas; Bickeböller, Heike; Hofner, Benjamin
2017-01-01
The analysis of genome-wide association studies (GWAS) benefits from the investigation of biologically meaningful gene sets, such as gene-interaction networks (pathways). We propose an extension to a successful kernel-based pathway analysis approach by integrating kernel functions into a powerful algorithmic framework for variable selection, to enable investigation of multiple pathways simultaneously. We employ genetic similarity kernels from the logistic kernel machine test (LKMT) as base-learners in a boosting algorithm. A model to explain case-control status is created iteratively by selecting pathways that improve its prediction ability. We evaluated our method in simulation studies adopting 50 pathways for different sample sizes and genetic effect strengths. Additionally, we included an exemplary application of kernel boosting to a rheumatoid arthritis and a lung cancer dataset. Simulations indicate that kernel boosting outperforms the LKMT in certain genetic scenarios. Applications to GWAS data on rheumatoid arthritis and lung cancer resulted in sparse models which were based on pathways interpretable in a clinical sense. Kernel boosting is highly flexible in terms of considered variables and overcomes the problem of multiple testing. Additionally, it enables the prediction of clinical outcomes. Thus, kernel boosting constitutes a new, powerful tool in the analysis of GWAS data and towards the understanding of biological processes involved in disease susceptibility.
Insausti, Matías; Gomes, Adriano A; Cruz, Fernanda V; Pistonesi, Marcelo F; Araujo, Mario C U; Galvão, Roberto K H; Pereira, Claudete F; Band, Beatriz S F
2012-08-15
This paper investigates the use of UV-vis, near infrared (NIR) and synchronous fluorescence (SF) spectrometries coupled with multivariate classification methods to discriminate biodiesel samples with respect to the base oil employed in their production. More specifically, the present work extends previous studies by investigating the discrimination of corn-based biodiesel from two other biodiesel types (sunflower and soybean). Two classification methods are compared, namely full-spectrum SIMCA (soft independent modelling of class analogies) and SPA-LDA (linear discriminant analysis with variables selected by the successive projections algorithm). Regardless of the spectrometric technique employed, full-spectrum SIMCA did not provide an appropriate discrimination of the three biodiesel types. In contrast, all samples were correctly classified on the basis of a reduced number of wavelengths selected by SPA-LDA. It can be concluded that UV-vis, NIR and SF spectrometries can be successfully employed to discriminate corn-based biodiesel from the two other biodiesel types, but wavelength selection by SPA-LDA is key to the proper separation of the classes. Copyright © 2012 Elsevier B.V. All rights reserved.
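The successive projections algorithm (SPA) used above for wavelength selection admits a compact sketch: starting from one wavelength, repeatedly project the remaining spectral columns onto the orthogonal complement of those already chosen and pick the column with the largest residual norm, then feed the selected wavelengths to LDA. The spectra, class labels and the number of wavelengths selected are placeholders.

```python
# Sketch of SPA wavelength selection followed by LDA classification.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def spa_select(X, k, start=0):
    selected = [start]
    R = X.astype(float).copy()
    for _ in range(k - 1):
        v = R[:, selected[-1]]
        # project every column onto the orthogonal complement of the latest pick
        R = R - np.outer(v, (v @ R) / (v @ v + 1e-12))
        norms = np.linalg.norm(R, axis=0)
        norms[selected] = -1.0                    # never re-select a wavelength
        selected.append(int(np.argmax(norms)))
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 300))                   # placeholder UV-vis/NIR/SF spectra
y = rng.integers(0, 3, size=120)                  # corn / sunflower / soybean labels
waves = spa_select(X, k=10)
lda = LinearDiscriminantAnalysis().fit(X[:, waves], y)
print("selected wavelength indices:", waves,
      "training accuracy:", lda.score(X[:, waves], y))
```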
Determination of coronal magnetic fields from vector magnetograms
NASA Technical Reports Server (NTRS)
Mikic, Zoran
1992-01-01
The determination of coronal magnetic fields from vector magnetograms is studied, including the development and application of algorithms to determine force-free coronal fields above selected observations of active regions. Two additional active regions were selected and analyzed. The restriction of periodicity in the 3-D code used to determine the coronal field was removed; the new code has variable mesh spacing and is thus able to provide a more realistic description of coronal fields. The NOAA active region AR5747 of 20 Oct. 1989 was studied. A brief account of progress during the research performed is reported.
Boosting Learning Algorithm for Stock Price Forecasting
NASA Astrophysics Data System (ADS)
Wang, Chengzhang; Bai, Xiaoming
2018-03-01
To tackle the complexity and uncertainty of stock market behavior, a growing number of studies have introduced machine learning algorithms to forecast stock prices. The ANN (artificial neural network) is one of the most successful and promising applications. We propose a boosting-ANN model in this paper to predict the stock close price. On the basis of boosting theory, multiple weak predicting machines, i.e. ANNs, are assembled to build a stronger predictor, i.e. the boosting-ANN model. New error criteria for the weak learners and rules for updating their weights are adopted in this study. We select technical factors from financial markets as forecasting input variables. Final results demonstrate that the boosting-ANN model outperforms the other models for stock price forecasting.
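The core idea, assembling several small ANNs into a stronger predictor by boosting, can be sketched as below. This uses a simplified residual-fitting loop (gradient-boosting style) rather than the paper's specific error criteria and weight-update rules, and the technical-indicator inputs are synthetic.

```python
# Sketch of assembling weak ANN predictors into a stronger one via residual fitting.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))                        # placeholder technical indicators
y = 0.6 * X[:, 0] - 0.4 * X[:, 2] + rng.normal(scale=0.1, size=300)

def boosted_anns(X, y, n_rounds=10, lr=0.3):
    models, residual = [], y.astype(float).copy()
    for _ in range(n_rounds):
        m = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
        m.fit(X, residual)
        residual = residual - lr * m.predict(X)      # each weak ANN fits what is left
        models.append(m)
    return models

def predict(models, X, lr=0.3):
    return lr * sum(m.predict(X) for m in models)

models = boosted_anns(X, y)
print("train RMSE:", np.sqrt(np.mean((predict(models, X) - y) ** 2)))
```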
An assessment of support vector machines for land cover classification
Huang, C.; Davis, L.S.; Townshend, J.R.G.
2002-01-01
The support vector machine (SVM) is a group of theoretically superior machine learning algorithms. It was found competitive with the best available machine learning algorithms in classifying high-dimensional data sets. This paper gives an introduction to the theoretical development of the SVM and an experimental evaluation of its accuracy, stability and training speed in deriving land cover classifications from satellite images. The SVM was compared to three other popular classifiers, including the maximum likelihood classifier (MLC), neural network classifiers (NNC) and decision tree classifiers (DTC). The impacts of kernel configuration on the performance of the SVM and of the selection of training data and input variables on the four classifiers were also evaluated in this experiment.
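The four-way comparison described above can be sketched with scikit-learn, using a Gaussian quadratic discriminant as a stand-in for the maximum likelihood classifier, a small MLP for the neural network, and a decision tree. The synthetic band values, labels and hyperparameters are placeholders, not the study's satellite data or settings.

```python
# Sketch of the comparison on per-pixel band values: SVM vs. a Gaussian
# (maximum-likelihood-style) classifier, a neural network, and a decision tree.
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))                     # placeholder band values
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

classifiers = {
    "SVM (RBF kernel)": SVC(kernel="rbf", C=10, gamma="scale"),
    "Gaussian MLC-style": QuadraticDiscriminantAnalysis(),
    "Neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "Decision tree": DecisionTreeClassifier(max_depth=8, random_state=0),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()   # 5-fold cross-validated accuracy
    print(f"{name}: {acc:.3f}")
```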
Decision tree modeling using R.
Zhang, Zhongheng
2016-08-01
In the machine learning field, the decision tree learner is powerful and easy to interpret. It employs a recursive binary partitioning algorithm that splits the sample on the partitioning variable with the strongest association with the response variable. The process continues until some stopping criteria are met. In the example I focus on the conditional inference tree, which incorporates tree-structured regression models into conditional inference procedures. Because a single tree is sensitive to small changes in the training data, the random forests procedure is introduced to address this problem. The sources of diversity for random forests come from the random sampling and the restricted set of input variables to be selected. Finally, I introduce R functions to perform model-based recursive partitioning. This method incorporates recursive partitioning into conventional parametric model building.
NASA Technical Reports Server (NTRS)
Stramska, Malgorzata; Stramski, Dariusz
2005-01-01
We use satellite data from the Sea-viewing Wide Field-of-view Sensor (SeaWiFS) to investigate distributions of particulate organic carbon (POC) concentration in surface waters of the north polar Atlantic Ocean during the spring-summer season (April through August) over a 6-year period from 1998 through 2003. By use of field data collected at sea, we developed regional relationships for the purpose of estimating POC from remote-sensing observations of ocean color. Analysis of several approaches used in the POC algorithm development and match-up analysis of coincident in situ derived and satellite-derived estimates of POC resulted in selection of an algorithm that is based on the blue-to-green ratio of remote-sensing reflectance R(sub rs) (or normalized water-leaving radiance L(sub wn)). The application of the selected algorithm to a 6-year record of SeaWiFS monthly composite data of L(sub wn) revealed patterns of seasonal and interannual variability of POC in the study region. For example, the results show a clear increase of POC throughout the season. The lowest values, generally less than 200 mg per cubic meter, and at some locations often less than 50 mg per cubic meter, were observed in April. In May and June, POC can exceed 300 or even 400 mg per cubic meter in some parts of the study region. Patterns of interannual variability are intricate, as they depend on the geographic location within the study region and the particular time of year (month) considered. By comparing the results averaged over the entire study region and the entire season (April through August) for each year separately, we found that the lowest POC occurred in 2001 and the highest POC occurred in 2002 and 1999.
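The abstract does not reproduce the regional coefficients, so the snippet below only illustrates the generic power-law form of a blue-to-green reflectance-ratio POC algorithm; the coefficients A and B are placeholders, not the values derived from the field data.

```python
def poc_from_band_ratio(rrs_blue, rrs_green, A=200.0, B=-1.0):
    """Generic power-law POC algorithm: POC = A * (Rrs_blue / Rrs_green)**B.

    A and B are illustrative placeholders, not the paper's regional coefficients;
    rrs_* are remote-sensing reflectances (sr^-1) and POC is returned in mg m^-3.
    """
    return A * (rrs_blue / rrs_green) ** B

print(poc_from_band_ratio(0.004, 0.003))   # ~150 mg m^-3 for this band ratio
```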
NASA Astrophysics Data System (ADS)
Arshad, Muhammad; Ullah, Saleem; Khurshid, Khurram; Ali, Asad
2017-10-01
Leaf Water Content (LWC) is an essential constituent of plant leaves that determines vegetation health and its productivity. An accurate and on-time measurement of water content is crucial for planning irrigation, forecasting drought and predicting woodland fire. The retrieval of LWC from the Visible to Shortwave Infrared (VSWIR: 0.4-2.5 μm) has been extensively investigated, but little has been done in the Mid and Thermal Infrared (MIR and TIR: 2.5-14.0 μm) windows of the electromagnetic spectrum. This study is mainly focused on retrieval of LWC from the Mid and Thermal Infrared, using a Genetic Algorithm integrated with Partial Least Squares Regression (PLSR). The Genetic Algorithm fused with PLSR selects spectral wavebands with high predictive performance, i.e., it yields a high adjusted-R2 and low RMSE. In our case, GA-PLSR selected eight variables (bands) and yielded highly accurate models with an adjusted-R2 of 0.93 and RMSEcv equal to 7.1%. The study also demonstrated that MIR is more sensitive to the variation in LWC as compared to TIR. However, the combined use of MIR and TIR spectra enhances the predictive performance in retrieval of LWC. The integration of the Genetic Algorithm and PLSR not only increases the estimation precision by selecting the most sensitive spectral bands but also helps in identifying the important spectral regions for quantifying water stress in vegetation. The findings of this study will allow future space missions (like HyspIRI) to position wavebands at sensitive regions for characterizing vegetation stresses.
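A minimal sketch of coupling a genetic algorithm to PLSR for waveband selection is given below; the population size, mutation rate and cross-validated R2 fitness are illustrative choices rather than the settings used in the study, and the usage comment at the end assumes hypothetical arrays of leaf spectra and measured LWC.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def cv_r2(mask, X, y):
    """Cross-validated R^2 of a PLSR model built on the selected bands."""
    if mask.sum() < 2:
        return -np.inf
    pls = PLSRegression(n_components=min(5, int(mask.sum())))
    return cross_val_score(pls, X[:, mask], y, cv=5, scoring="r2").mean()

def ga_plsr_select(X, y, pop=30, gens=25, p_mut=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    P = rng.random((pop, n)) < 0.1                 # sparse initial band subsets
    for _ in range(gens):
        fit = np.array([cv_r2(m, X, y) for m in P])
        P = P[np.argsort(fit)[::-1]]               # rank population by CV R^2
        children = []
        while len(children) < pop // 2:
            a, b = P[rng.integers(0, pop // 2, size=2)]
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])   # one-point crossover
            child ^= rng.random(n) < p_mut               # bit-flip mutation
            children.append(child)
        P = np.vstack([P[: pop - len(children)], children])
    fit = np.array([cv_r2(m, X, y) for m in P])
    best = P[int(np.argmax(fit))]
    return np.flatnonzero(best), fit.max()

# bands, score = ga_plsr_select(X_spectra, lwc_values)   # hypothetical inputs
```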
Hatt, Mathieu; Lee, John A.; Schmidtlein, Charles R.; Naqa, Issam El; Caldwell, Curtis; De Bernardi, Elisabetta; Lu, Wei; Das, Shiva; Geets, Xavier; Gregoire, Vincent; Jeraj, Robert; MacManus, Michael P.; Mawlawi, Osama R.; Nestle, Ursula; Pugachev, Andrei B.; Schöder, Heiko; Shepherd, Tony; Spezi, Emiliano; Visvikis, Dimitris; Zaidi, Habib; Kirov, Assen S.
2017-01-01
Purpose The purpose of this educational report is to provide an overview of the present state-of-the-art PET auto-segmentation (PET-AS) algorithms and their respective validation, with an emphasis on providing the user with help in understanding the challenges and pitfalls associated with selecting and implementing a PET-AS algorithm for a particular application. Approach A brief description of the different types of PET-AS algorithms is provided using a classification based on method complexity and type. The advantages and the limitations of the current PET-AS algorithms are highlighted based on current publications and existing comparison studies. A review of the available image datasets and contour evaluation metrics in terms of their applicability for establishing a standardized evaluation of PET-AS algorithms is provided. The performance requirements for the algorithms and their dependence on the application, the radiotracer used and the evaluation criteria are described and discussed. Finally, a procedure for algorithm acceptance and implementation, as well as the complementary role of manual and auto-segmentation are addressed. Findings A large number of PET-AS algorithms have been developed within the last 20 years. Many of the proposed algorithms are based on either fixed or adaptively selected thresholds. More recently, numerous papers have proposed the use of more advanced image analysis paradigms to perform semi-automated delineation of the PET images. However, the level of algorithm validation is variable and for most published algorithms is either insufficient or inconsistent which prevents recommending a single algorithm. This is compounded by the fact that realistic image configurations with low signal-to-noise ratios (SNR) and heterogeneous tracer distributions have rarely been used. Large variations in the evaluation methods used in the literature point to the need for a standardized evaluation protocol. Conclusions Available comparison studies suggest that PET-AS algorithms relying on advanced image analysis paradigms provide generally more accurate segmentation than approaches based on PET activity thresholds, particularly for realistic configurations. However, this may not be the case for simple shape lesions in situations with a narrower range of parameters, where simpler methods may also perform well. Recent algorithms which employ some type of consensus or automatic selection between several PET-AS methods have potential to overcome the limitations of the individual methods when appropriately trained. In either case, accuracy evaluation is required for each different PET scanner and scanning and image reconstruction protocol. For the simpler, less robust approaches, adaptation to scanning conditions, tumor type, and tumor location by optimization of parameters is necessary. The results from the method evaluation stage can be used to estimate the contouring uncertainty. All PET-AS contours should be critically verified by a physician. A standard test, i.e., a benchmark dedicated to evaluating both existing and future PET-AS algorithms needs to be designed, to aid clinicians in evaluating and selecting PET-AS algorithms and to establish performance limits for their acceptance for clinical use. The initial steps toward designing and building such a standard are undertaken by the task group members. PMID:28120467
Multidisciplinary design optimization using genetic algorithms
NASA Technical Reports Server (NTRS)
Unal, Resit
1994-01-01
Multidisciplinary design optimization (MDO) is an important step in the conceptual design and evaluation of launch vehicles since it can have a significant impact on performance and life cycle cost. The objective is to search the system design space to determine values of design variables that optimize the performance characteristic subject to system constraints. Gradient-based optimization routines have been used extensively for aerospace design optimization. However, one limitation of gradient-based optimizers is their need for gradient information. Therefore, design problems which include discrete variables cannot be studied. Such problems are common in launch vehicle design. For example, the number of engines and material choices must be integer values or assume only a few discrete values. In this study, genetic algorithms are investigated as an approach to MDO problems involving discrete variables and discontinuous domains. Optimization by genetic algorithms (GA) uses a search procedure which is fundamentally different from that of gradient-based methods. Genetic algorithms seek to find good solutions in an efficient and timely manner rather than finding the best solution. GA are designed to mimic evolutionary selection. A population of candidate designs is evaluated at each iteration, and each individual's probability of reproduction (existence in the next generation) depends on its fitness value (related to the value of the objective function). Progress toward the optimum is achieved by the crossover and mutation operations. GA is attractive since it uses only objective function values in the search process, so gradient calculations are avoided. Hence, GA are able to deal with discrete variables. Studies report success in the use of GA for aircraft design optimization studies, trajectory analysis, space structure design and control systems design. In these studies reliable convergence was achieved, but the number of function evaluations was large compared with efficient gradient methods. Application of GA is underway for a cost optimization study for a launch-vehicle fuel-tank and structural design of a wing. The strengths and limitations of GA for launch vehicle design optimization are studied.
Dube, Timothy; Mutanga, Onisimo; Adam, Elhadi; Ismail, Riyad
2014-01-01
The quantification of aboveground biomass using remote sensing is critical for better understanding the role of forests in carbon sequestration and for informed sustainable management. Although remote sensing techniques have been proven useful in assessing forest biomass in general, more is required to investigate their capabilities in predicting intra-and-inter species biomass which are mainly characterised by non-linear relationships. In this study, we tested two machine learning algorithms, Stochastic Gradient Boosting (SGB) and Random Forest (RF) regression trees to predict intra-and-inter species biomass using high resolution RapidEye reflectance bands as well as the derived vegetation indices in a commercial plantation. The results showed that the SGB algorithm yielded the best performance for intra-and-inter species biomass prediction; using all the predictor variables as well as based on the most important selected variables. For example using the most important variables the algorithm produced an R2 of 0.80 and RMSE of 16.93 t·ha^-1 for E. grandis; R2 of 0.79, RMSE of 17.27 t·ha^-1 for P. taeda and R2 of 0.61, RMSE of 43.39 t·ha^-1 for the combined species data sets. Comparatively, RF yielded plausible results only for E. dunii (R2 of 0.79; RMSE of 7.18 t·ha^-1). We demonstrated that although the two statistical methods were able to predict biomass accurately, RF produced weaker results as compared to SGB when applied to combined species dataset. The result underscores the relevance of stochastic models in predicting biomass drawn from different species and genera using the new generation high resolution RapidEye sensor with strategically positioned bands. PMID:25140631
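A hedged sketch of such a comparison is shown below using scikit-learn, where GradientBoostingRegressor with subsampling stands in for SGB; the band/index matrix and biomass values are synthetic placeholders, and the hyperparameters are not those tuned in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error

# placeholder for RapidEye bands plus derived vegetation indices and field biomass (t/ha)
rng = np.random.default_rng(1)
X = rng.random((120, 10))
y = 50 + 100 * X[:, 0] - 30 * X[:, 3] + rng.normal(0, 5, 120)

models = {
    "SGB": GradientBoostingRegressor(subsample=0.5, random_state=0),  # stochastic boosting
    "RF": RandomForestRegressor(n_estimators=500, random_state=0),
}
for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=5)
    rmse = mean_squared_error(y, pred) ** 0.5
    print(f"{name}: R2={r2_score(y, pred):.2f} RMSE={rmse:.1f} t/ha")
```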
A Partitioning and Bounded Variable Algorithm for Linear Programming
ERIC Educational Resources Information Center
Sheskin, Theodore J.
2006-01-01
An interesting new partitioning and bounded variable algorithm (PBVA) is proposed for solving linear programming problems. The PBVA is a variant of the simplex algorithm which uses a modified form of the simplex method followed by the dual simplex method for bounded variables. In contrast to the two-phase method and the big M method, the PBVA does…
NASA Astrophysics Data System (ADS)
Tang, Jie; Liu, Rong; Zhang, Yue-Li; Liu, Mou-Ze; Hu, Yong-Fang; Shao, Ming-Jie; Zhu, Li-Jun; Xin, Hua-Wen; Feng, Gui-Wen; Shang, Wen-Jun; Meng, Xiang-Guang; Zhang, Li-Rong; Ming, Ying-Zi; Zhang, Wei
2017-02-01
Tacrolimus has a narrow therapeutic window and considerable variability in clinical use. Our goal was to compare the performance of multiple linear regression (MLR) and eight machine learning techniques in pharmacogenetic algorithm-based prediction of tacrolimus stable dose (TSD) in a large Chinese cohort. A total of 1,045 renal transplant patients were recruited, 80% of which were randomly selected as the “derivation cohort” to develop dose-prediction algorithm, while the remaining 20% constituted the “validation cohort” to test the final selected algorithm. MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied and their performances were compared in this work. Among all the machine learning models, RT performed best in both derivation [0.71 (0.67-0.76)] and validation cohorts [0.73 (0.63-0.82)]. In addition, the ideal rate of RT was 4% higher than that of MLR. To our knowledge, this is the first study to use machine learning models to predict TSD, which will further facilitate personalized medicine in tacrolimus administration in the future.
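The exact definition of the "ideal rate" is not given in the abstract; the sketch below assumes it means the share of predictions within +/-20% of the actual stable dose and compares a regression tree with MLR on placeholder data, mirroring the 80/20 derivation/validation split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

def ideal_rate(y_true, y_pred, tol=0.20):
    """Share of patients whose predicted dose falls within +/-20% of the actual dose."""
    return np.mean(np.abs(y_pred - y_true) <= tol * y_true)

# placeholders for clinical + pharmacogenetic covariates and tacrolimus stable dose
rng = np.random.default_rng(0)
X = rng.random((1045, 12))
y = 2.0 + 3.0 * X[:, 0] + 1.5 * (X[:, 1] > 0.5) + rng.normal(0, 0.3, 1045)

# 80/20 split mirrors the derivation/validation cohorts described above
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
for name, model in [("MLR", LinearRegression()),
                    ("RT", DecisionTreeRegressor(max_depth=4, random_state=0))]:
    model.fit(X_tr, y_tr)
    print(name, round(ideal_rate(y_va, model.predict(X_va)), 3))
```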
Sensitivity-Based Guided Model Calibration
NASA Astrophysics Data System (ADS)
Semnani, M.; Asadzadeh, M.
2017-12-01
A common practice in automatic calibration of hydrologic models is to apply sensitivity analysis prior to the global optimization to reduce the number of decision variables (DVs) by identifying the most sensitive ones. This two-stage process aims to improve the optimization efficiency. However, parameter sensitivity information can also be used to enhance the ability of optimization algorithms to find good quality solutions in fewer solution evaluations. This improvement can be achieved by increasing the focus of the optimization on sampling the most sensitive parameters in each iteration. In this study, the selection process of the dynamically dimensioned search (DDS) optimization algorithm is enhanced by utilizing a sensitivity analysis method to put more emphasis on the most sensitive decision variables for perturbation. The performance of DDS with the sensitivity information is compared to the original version of DDS for different mathematical test functions and a model calibration case study. Overall, the results show that DDS with sensitivity information finds nearly the same solutions as the original DDS, but in significantly fewer solution evaluations.
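A minimal sketch of the idea, assuming sensitivity weights are already available for each decision variable, is shown below; it follows the standard DDS recipe (shrinking inclusion probability, greedy acceptance) but scales each variable's selection probability by its normalized sensitivity, which may differ in detail from the authors' scheme.

```python
import numpy as np

def dds_with_sensitivity(f, lo, hi, sens, max_iter=2000, r=0.2, seed=1):
    """Greedy DDS-style search that perturbs sensitive decision variables more often."""
    rng = np.random.default_rng(seed)
    lo, hi, w = map(np.asarray, (lo, hi, sens))
    w = w / w.sum()                                  # normalized sensitivity weights
    x = lo + rng.random(lo.size) * (hi - lo)
    fx = f(x)
    for i in range(1, max_iter + 1):
        p = 1.0 - np.log(i) / np.log(max_iter)       # standard DDS shrinking probability
        pick = rng.random(x.size) < p * w * x.size   # bias selection toward sensitive DVs
        if not pick.any():
            pick[rng.choice(x.size, p=w)] = True     # always perturb at least one DV
        cand = x.copy()
        step = rng.normal(0.0, r * (hi - lo))
        cand[pick] = np.clip(x[pick] + step[pick], lo[pick], hi[pick])
        fc = f(cand)
        if fc < fx:                                  # greedy acceptance, as in DDS
            x, fx = cand, fc
    return x, fx

# toy use: minimize a sphere function where the first DV is known to matter most
best_x, best_f = dds_with_sensitivity(lambda v: float(np.sum(v ** 2)),
                                      [-5] * 4, [5] * 4, [0.7, 0.1, 0.1, 0.1])
```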
Kim, Taegu; Hong, Jungsik; Kang, Pilsung
2017-01-01
Accurate box office forecasting models are developed by considering competition and word-of-mouth (WOM) effects in addition to screening-related information. Nationality, genre, ratings, and distributors of motion pictures running concurrently with the target motion picture are used to describe the competition, whereas the numbers of informative, positive, and negative mentions posted on social network services (SNS) are used to gauge the atmosphere spread by WOM. Among these candidate variables, only significant variables are selected by genetic algorithm (GA), based on which machine learning algorithms are trained to build forecasting models. The forecasts are combined to improve forecasting performance. Experimental results on the Korean film market show that the forecasting accuracy in early screening periods can be significantly improved by considering competition. In addition, WOM has a stronger influence on total box office forecasting. Considering both competition and WOM improves forecasting performance to a larger extent than when only one of them is considered.
Kim, Taegu; Hong, Jungsik
2017-01-01
Accurate box office forecasting models are developed by considering competition and word-of-mouth (WOM) effects in addition to screening-related information. Nationality, genre, ratings, and distributors of motion pictures running concurrently with the target motion picture are used to describe the competition, whereas the numbers of informative, positive, and negative mentions posted on social network services (SNS) are used to gauge the atmosphere spread by WOM. Among these candidate variables, only significant variables are selected by genetic algorithm (GA), based on which machine learning algorithms are trained to build forecasting models. The forecasts are combined to improve forecasting performance. Experimental results on the Korean film market show that the forecasting accuracy in early screening periods can be significantly improved by considering competition. In addition, WOM has a stronger influence on total box office forecasting. Considering both competition and WOM improves forecasting performance to a larger extent than when only one of them is considered. PMID:28819355
Duan, Ran; Fu, Haoda
2015-08-30
Recurrent event data are an important data type for medical research. In particular, many safety endpoints are recurrent outcomes, such as hypoglycemic events. For such a situation, it is important to identify the factors causing these events and rank these factors by their importance. Traditional model selection methods are not able to provide variable importance in this context. Methods that are able to evaluate the variable importance, such as gradient boosting and random forest algorithms, cannot directly be applied to recurrent events data. In this paper, we propose a two-step method that enables us to evaluate the variable importance for recurrent events data. We evaluated the performance of our proposed method by simulations and applied it to a data set from a diabetes study. Copyright © 2015 John Wiley & Sons, Ltd.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Marathe, Aniruddha P.; Harris, Rachel A.; Lowenthal, David K.
The use of clouds to execute high-performance computing (HPC) applications has greatly increased recently. Clouds provide several potential advantages over traditional supercomputers and in-house clusters. The most popular cloud is currently Amazon EC2, which provides fixed-cost and variable-cost, auction-based options. The auction market trades lower cost for potential interruptions that necessitate checkpointing; if the market price exceeds the bid price, a node is taken away from the user without warning. We explore techniques to maximize performance per dollar given a time constraint within which an application must complete. Specifically, we design and implement multiple techniques to reduce expected cost by exploiting redundancy in the EC2 auction market. We then design an adaptive algorithm that selects a scheduling algorithm and determines the bid price. We show that our adaptive algorithm executes programs up to seven times cheaper than using the on-demand market and up to 44 percent cheaper than the best non-redundant, auction-market algorithm. We extend our adaptive algorithm to incorporate application scalability characteristics for further cost savings. In conclusion, we show that the adaptive algorithm informed with scalability characteristics of applications achieves up to 56 percent cost savings compared to the expected cost for the base adaptive algorithm run at a fixed, user-defined scale.
Parallel Algorithms for Switching Edges in Heterogeneous Graphs.
Bhuiyan, Hasanuzzaman; Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav
2017-06-01
An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors.
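For intuition, a serial (non-parallel) version of the basic edge-switch operation, including the checks that keep the graph simple, can be written as below; the distributed-memory algorithms of the paper additionally handle the synchronization and communication issues discussed above, which this sketch does not attempt.

```python
import random
import networkx as nx

def edge_switches(G, n_switches, seed=0):
    """Randomize a simple graph by edge switches while preserving every vertex degree."""
    rng = random.Random(seed)
    done = 0
    while done < n_switches:
        (u, v), (x, y) = rng.sample(list(G.edges()), 2)
        # proposed switch: (u,v),(x,y) -> (u,y),(x,v); reject anything that would
        # introduce a self-loop or a parallel edge so the graph stays simple
        if len({u, v, x, y}) < 4 or G.has_edge(u, y) or G.has_edge(x, v):
            continue
        G.remove_edge(u, v); G.remove_edge(x, y)
        G.add_edge(u, y); G.add_edge(x, v)
        done += 1
    return G

G = nx.barabasi_albert_graph(1000, 3, seed=42)
degrees_before = dict(G.degree())
edge_switches(G, n_switches=5 * G.number_of_edges())
assert dict(G.degree()) == degrees_before      # the degree sequence is preserved
```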
Parallel Algorithms for Switching Edges in Heterogeneous Graphs☆
Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav
2017-01-01
An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors. PMID:28757680
Application of Three Existing Stope Boundary Optimisation Methods in an Operating Underground Mine
NASA Astrophysics Data System (ADS)
Erdogan, Gamze; Yavuz, Mahmut
2017-12-01
The underground mine planning and design optimisation process has received little attention because of the complexity and variability of problems in underground mines. Although a number of optimisation studies and software tools are available, and some of them, in particular, have been implemented effectively to determine the ultimate-pit limits in an open pit mine, there is still a lack of studies on the optimisation of ultimate stope boundaries in underground mines. The proposed approaches for this purpose aim at maximizing the economic profit by selecting the best possible layout under operational, technical and physical constraints. In this paper, three existing heuristic techniques, including the Floating Stope Algorithm, the Maximum Value Algorithm and the Mineable Shape Optimiser (MSO), are examined for optimisation of the stope layout in a case study. Each technique is assessed in terms of applicability, algorithm capabilities and limitations considering the underground mine planning challenges. Finally, the results are evaluated and compared.
Negotiating Multicollinearity with Spike-and-Slab Priors.
Ročková, Veronika; George, Edward I
2014-08-01
In multiple regression under the normal linear model, the presence of multicollinearity is well known to lead to unreliable and unstable maximum likelihood estimates. This can be particularly troublesome for the problem of variable selection where it becomes more difficult to distinguish between subset models. Here we show how adding a spike-and-slab prior mitigates this difficulty by filtering the likelihood surface into a posterior distribution that allocates the relevant likelihood information to each of the subset model modes. For identification of promising high posterior models in this setting, we consider three EM algorithms, the fast closed form EMVS version of Rockova and George (2014) and two new versions designed for variants of the spike-and-slab formulation. For a multimodal posterior under multicollinearity, we compare the regions of convergence of these three algorithms. Deterministic annealing versions of the EMVS algorithm are seen to substantially mitigate this multimodality. A single simple running example is used for illustration throughout.
Study on the medical meteorological forecast of the number of hypertension inpatient based on SVR
NASA Astrophysics Data System (ADS)
Zhai, Guangyu; Chai, Guorong; Zhang, Haifeng
2017-06-01
The purpose of this study is to build a hypertension prediction model by examining the meteorological factors associated with hypertension incidence. Standardized data on relative humidity, air temperature, visibility, wind speed and air pressure in Lanzhou from 2010 to 2012 (computing the maximum, minimum and average values over 5-day units) are selected as the input variables of Support Vector Regression (SVR), and the standardized hypertension incidence data of the same period are the output variables; the optimal prediction parameters are obtained by a cross-validation algorithm, and then, through SVR learning and training, an SVR forecast model for hypertension incidence is built. The result shows that the hypertension prediction model is composed of 15 input independent variables, the training accuracy is 0.005, and the final error is 0.0026389. The forecast accuracy of the SVR model is 97.1429%, which is higher than that of a statistical forecast equation and a neural network prediction method. It is concluded that the SVR model provides a new method for hypertension prediction, with simple calculation, small error, and good historical sample fitting and independent sample forecast capability.
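A hedged sketch of fitting such an SVR model with cross-validated parameter selection is shown below; the 15 meteorological predictors and the admission counts are placeholders, and the parameter grid is illustrative rather than the one used in the study.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# placeholders: 15 meteorological predictors (5-day max/min/mean of humidity,
# temperature, visibility, wind speed, air pressure) and hypertension admissions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
y = 30 + 4 * X[:, 0] - 3 * X[:, 5] + rng.normal(0, 2, 200)

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = {"svr__C": [1, 10, 100], "svr__epsilon": [0.01, 0.1, 1.0]}
model = GridSearchCV(svr, grid, cv=5)     # cross-validation picks the SVR parameters
model.fit(X, y)
print(model.best_params_, round(model.best_score_, 3))
```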
Sensor fusion methods for reducing false alarms in heart rate monitoring.
Borges, Gabriel; Brusamarello, Valner
2016-12-01
Automatic patient monitoring is an essential resource in hospitals for good health care management. While alarms caused by abnormal physiological conditions are important for the delivery of fast treatment, they can also be a source of unnecessary noise because of false alarms caused by electromagnetic interference or motion artifacts. One significant source of false alarms is the heart rate alarm, which is triggered when the heart rhythm of the patient is too fast or too slow. In this work, the fusion of different physiological sensors is explored in order to create a robust heart rate estimation. A set of algorithms using a heart rate variability index, Bayesian inference, neural networks, fuzzy logic and majority voting is proposed to fuse the information from the electrocardiogram, arterial blood pressure and photoplethysmogram. Three kinds of information are extracted from each source, namely, heart rate variability, the heart rate difference between sensors and the spectral analysis of low and high noise of each sensor. This information is used as input to the algorithms. Twenty recordings selected from the MIMIC database were used to validate the system. The results showed that neural network fusion achieved the best false alarm reduction, 92.5%, while the Bayesian technique had a reduction of 84.3%, fuzzy logic 80.6%, the majority voter 72.5% and the heart rate variability index 67.5%. Therefore, the proposed algorithms showed good performance and could be useful in bedside monitors.
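The trained fusion networks of the study are not reproducible from the abstract; the sketch below shows only the simplest flavor of the idea, a median-based fusion of three heart-rate estimates with outlier rejection, where the 10 bpm tolerance and the alarm limits are assumptions.

```python
import numpy as np

def fuse_heart_rate(hr_ecg, hr_abp, hr_ppg, tol=10.0):
    """Fuse heart-rate estimates (bpm) from ECG, arterial blood pressure and PPG.

    Estimates that disagree with the median of the three by more than `tol`
    are treated as corrupted (e.g., by motion artifact) and dropped.
    """
    est = np.array([hr_ecg, hr_abp, hr_ppg], dtype=float)
    kept = est[np.abs(est - np.median(est)) <= tol]
    return float(np.mean(kept))

def heart_rate_alarm(fused_hr, low=40.0, high=140.0):
    return fused_hr < low or fused_hr > high

# ECG lead corrupted by artifact reads 210 bpm; ABP and PPG agree around 75 bpm
hr = fuse_heart_rate(210.0, 74.0, 77.0)
print(hr, heart_rate_alarm(hr))   # ~75.5 bpm, no alarm: the false alarm is suppressed
```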
Spatio-Temporal Process Variability in Watershed Scale Wetland Restoration Planning
NASA Astrophysics Data System (ADS)
Evenson, G. R.
2012-12-01
Watershed scale restoration decision making processes are increasingly informed by quantitative methodologies providing site-specific restoration recommendations - sometimes referred to as "systematic planning." The more advanced of these methodologies are characterized by a coupling of search algorithms and ecological models to discover restoration plans that optimize environmental outcomes. Yet while these methods have exhibited clear utility as decision support toolsets, they may be critiqued for flawed evaluations of spatio-temporally variable processes fundamental to watershed scale restoration. Hydrologic and non-hydrologic mediated process connectivity along with post-restoration habitat dynamics, for example, are commonly ignored yet known to appreciably affect restoration outcomes. This talk will present a methodology to evaluate such spatio-temporally complex processes in the production of watershed scale wetland restoration plans. Using the Tuscarawas Watershed in Eastern Ohio as a case study, a genetic algorithm will be coupled with the Soil and Water Assessment Tool (SWAT) to reveal optimal wetland restoration plans as measured by their capacity to maximize nutrient reductions. Then, a so-called "graphical" representation of the optimization problem will be implemented in-parallel to promote hydrologic and non-hydrologic mediated connectivity amongst existing wetlands and sites selected for restoration. Further, various search algorithm mechanisms will be discussed as a means of accounting for temporal complexities such as post-restoration habitat dynamics. Finally, generalized patterns of restoration plan optimality will be discussed as an alternative and possibly superior decision support toolset given the complexity and stochastic nature of spatio-temporal process variability.
NASA Astrophysics Data System (ADS)
Horton, Pascal; Weingartner, Rolf; Obled, Charles; Jaboyedoff, Michel
2017-04-01
Analogue methods (AMs) rely on the hypothesis that similar situations, in terms of atmospheric circulation, are likely to result in similar local or regional weather conditions. These methods consist of sampling a certain number of past situations, based on different synoptic-scale meteorological variables (predictors), in order to construct a probabilistic prediction for a local weather variable of interest (predictand). They are often used for daily precipitation prediction, either in the context of real-time forecasting, reconstruction of past weather conditions, or future climate impact studies. The relationship between predictors and predictands is defined by several parameters (predictor variable, spatial and temporal windows used for the comparison, analogy criteria, and number of analogues), which are often calibrated by means of a semi-automatic sequential procedure that has strong limitations. AMs may include several subsampling levels (e.g. first sorting a set of analogues in terms of circulation, then restricting to those with similar moisture status). The parameter space of the AMs can be very complex, with substantial co-dependencies between the parameters. Thus, global optimization techniques are likely to be necessary for calibrating most AM variants, as they can optimize all parameters of all analogy levels simultaneously. Genetic algorithms (GAs) were found to be successful in finding optimal values of AM parameters. They allow taking parameter inter-dependencies into account and objectively selecting some parameters that were previously chosen manually (such as the pressure levels and the temporal windows of the predictor variables), and thus obviate the need to assess a high number of combinations. The performance scores of the optimized methods increased compared to reference methods, and even more so for days with high precipitation totals. The resulting parameters were found to be relevant and spatially coherent. Moreover, they were obtained automatically and objectively, which reduces the effort invested in exploration attempts when adapting the method to a new region or a new predictand. In addition, the approach allowed for new degrees of freedom, such as a weighting between the pressure levels and non-overlapping spatial windows. Genetic algorithms were then used further in order to automatically select predictor variables and analogy criteria. This resulted in interesting outputs, providing new predictor-criterion combinations. However, some limitations of the approach were encountered, and expert input is likely to remain necessary. Nevertheless, letting GAs explore a dataset for the best predictor for a predictand of interest is certainly a useful tool, particularly when applied to a new predictand or a new region with different climatic characteristics.
PI-line-based image reconstruction in helical cone-beam computed tomography with a variable pitch.
Zou, Yu; Pan, Xiaochuan; Xia, Dan; Wang, Ge
2005-08-01
Current applications of helical cone-beam computed tomography (CT) involve primarily a constant pitch, where the translating speed of the table and the rotation speed of the source-detector remain constant. However, situations do exist where it may be more desirable to use a helical scan with a variable translating speed of the table, leading to a variable pitch. One such application could arise in helical cone-beam CT fluoroscopy for the determination of vascular structures through real-time imaging of contrast bolus arrival. Most of the existing reconstruction algorithms have been developed only for helical cone-beam CT with constant pitch, including the backprojection-filtration (BPF) and filtered-backprojection (FBP) algorithms that we proposed previously. It is possible to generalize some of these algorithms to reconstruct images exactly for helical cone-beam CT with a variable pitch. In this work, we generalize our BPF and FBP algorithms to reconstruct images directly from data acquired in helical cone-beam CT with a variable pitch. We have also performed a preliminary numerical study to demonstrate and verify the generalization of the two algorithms. The results of the study confirm that our generalized BPF and FBP algorithms can yield exact reconstruction in helical cone-beam CT with a variable pitch. It should be pointed out that our generalized BPF algorithm is the only algorithm that is capable of exactly reconstructing a region-of-interest image from data containing transverse truncations.
A Feature and Algorithm Selection Method for Improving the Prediction of Protein Structural Class.
Ni, Qianwu; Chen, Lei
2017-01-01
Correct prediction of protein structural class is beneficial to the investigation of protein functions, regulations and interactions. In recent years, several computational methods have been proposed in this regard. However, given the variety of available features, it is still a great challenge to select a proper classification algorithm and to extract the essential features to participate in classification. In this study, a feature and algorithm selection method was presented for improving the accuracy of protein structural class prediction. Amino acid compositions and physiochemical features were adopted to represent features, and thirty-eight machine learning algorithms collected in Weka were employed. All features were first analyzed by a feature selection method, minimum redundancy maximum relevance (mRMR), producing a feature list. Then, several feature sets were constructed by adding features from the list one by one. For each feature set, the thirty-eight algorithms were executed on a dataset in which proteins were represented by the features in the set. The classes predicted by these algorithms and the true class of each protein were collected to construct a dataset, which was analyzed by the mRMR method, yielding an algorithm list. From the algorithm list, algorithms were taken one by one to build an ensemble prediction model. Finally, we selected the ensemble prediction model with the best performance as the optimal ensemble prediction model. Experimental results indicate that the constructed model is much superior to models using a single algorithm and to models that adopt only the feature selection procedure or only the algorithm selection procedure. The feature selection and algorithm selection procedures are thus genuinely helpful for building an ensemble prediction model with better performance. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Moteghaed, Niloofar Yousefi; Maghooli, Keivan; Garshasbi, Masoud
2018-01-01
Background: Gene expression data are characteristically high dimensional with a small sample size in contrast to the feature size and variability inherent in biological processes that contribute to difficulties in analysis. Selection of highly discriminative features decreases the computational cost and complexity of the classifier and improves its reliability for prediction of a new class of samples. Methods: The present study used hybrid particle swarm optimization and genetic algorithms for gene selection and a fuzzy support vector machine (SVM) as the classifier. Fuzzy logic is used to infer the importance of each sample in the training phase and decrease the outlier sensitivity of the system to increase the ability to generalize the classifier. A decision-tree algorithm was applied to the most frequent genes to develop a set of rules for each type of cancer. This improved the abilities of the algorithm by finding the best parameters for the classifier during the training phase without the need for trial-and-error by the user. The proposed approach was tested on four benchmark gene expression profiles. Results: Good results have been demonstrated for the proposed algorithm. The classification accuracy for leukemia data is 100%, for colon cancer is 96.67% and for breast cancer is 98%. The results show that the best kernel used in training the SVM classifier is the radial basis function. Conclusions: The experimental results show that the proposed algorithm can decrease the dimensionality of the dataset, determine the most informative gene subset, and improve classification accuracy using the optimal parameters of the classifier with no user interface. PMID:29535919
Parrish, Robert M; Hohenstein, Edward G; Martínez, Todd J; Sherrill, C David
2013-05-21
We investigate the application of molecular quadratures obtained from either standard Becke-type grids or discrete variable representation (DVR) techniques to the recently developed least-squares tensor hypercontraction (LS-THC) representation of the electron repulsion integral (ERI) tensor. LS-THC uses least-squares fitting to renormalize a two-sided pseudospectral decomposition of the ERI, over a physical-space quadrature grid. While this procedure is technically applicable with any choice of grid, the best efficiency is obtained when the quadrature is tuned to accurately reproduce the overlap metric for quadratic products of the primary orbital basis. Properly selected Becke DFT grids can roughly attain this property. Additionally, we provide algorithms for adopting the DVR techniques of the dynamics community to produce two different classes of grids which approximately attain this property. The simplest algorithm is radial discrete variable representation (R-DVR), which diagonalizes the finite auxiliary-basis representation of the radial coordinate for each atom, and then combines Lebedev-Laikov spherical quadratures and Becke atomic partitioning to produce the full molecular quadrature grid. The other algorithm is full discrete variable representation (F-DVR), which uses approximate simultaneous diagonalization of the finite auxiliary-basis representation of the full position operator to produce non-direct-product quadrature grids. The qualitative features of all three grid classes are discussed, and then the relative efficiencies of these grids are compared in the context of LS-THC-DF-MP2. Coarse Becke grids are found to give essentially the same accuracy and efficiency as R-DVR grids; however, the latter are built from explicit knowledge of the basis set and may guide future development of atom-centered grids. F-DVR is found to provide reasonable accuracy with markedly fewer points than either Becke or R-DVR schemes.
Improved multivariate polynomial factoring algorithm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, P.S.
1978-10-01
A new algorithm for factoring multivariate polynomials over the integers based on an algorithm by Wang and Rothschild is described. The new algorithm has improved strategies for dealing with the known problems of the original algorithm, namely, the leading coefficient problem, the bad-zero problem and the occurrence of extraneous factors. It has an algorithm for correctly predetermining leading coefficients of the factors. A new and efficient p-adic algorithm named EEZ is described. Basically it is a linearly convergent variable-by-variable parallel construction. The improved algorithm is generally faster and requires less storage than the original algorithm. Machine examples with comparative timings are included.
NASA Astrophysics Data System (ADS)
Bashi-Azghadi, Seyyed Nasser; Afshar, Abbas; Afshar, Mohammad Hadi
2018-03-01
Previous studies on consequence management assume that the selected response action, including valve closure and/or hydrant opening, remains unchanged during the entire management period. This study presents a new embedded simulation-optimization methodology for deriving time-varying operational response actions in which the network topology may change from one stage to another. Dynamic programming (DP) and a genetic algorithm (GA) are used in order to minimize selected objective functions. Two networks of small and large sizes are used in order to illustrate the performance of the proposed modelling schemes if a time-dependent consequence management strategy is to be implemented. The results show that for a small number of decision variables, even in large-scale networks, DP is superior in terms of accuracy and computer runtime. However, as the number of potential actions grows, DP loses its merit over the GA approach. This study clearly demonstrates the superiority of the proposed dynamic operation strategy over the commonly used static strategy.
McTwo: a two-step feature selection algorithm based on maximal information coefficient.
Ge, Ruiquan; Zhou, Manli; Luo, Youxi; Meng, Qinghan; Mai, Guoqin; Ma, Dongli; Wang, Guoqing; Zhou, Fengfeng
2016-03-23
High-throughput bio-OMIC technologies are producing high-dimensional data from bio-samples at an ever-increasing rate, whereas the training sample number in a traditional experiment remains small due to various difficulties. This "large p, small n" paradigm in the area of biomedical "big data" may be at least partly solved by feature selection algorithms, which select only features significantly associated with phenotypes. Feature selection is an NP-hard problem. Due to the exponentially increasing time required to find the globally optimal solution, all existing feature selection algorithms employ heuristic rules to find locally optimal solutions, and their solutions achieve different performances on different datasets. This work describes a feature selection algorithm based on a recently published correlation measurement, the Maximal Information Coefficient (MIC). The proposed algorithm, McTwo, aims to select features that are associated with phenotypes and independent of each other, while achieving high classification performance with the nearest neighbor algorithm. Based on a comparative study of 17 datasets, McTwo performs about as well as or better than existing algorithms, with significantly reduced numbers of selected features. The features selected by McTwo also appear to have particular biomedical relevance to the phenotypes according to the literature. McTwo selects a feature subset with very good classification performance, as well as a small feature number. McTwo may therefore represent a complementary feature selection algorithm for high-dimensional biomedical datasets.
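A rough two-step filter in the spirit of McTwo is sketched below; mutual information and Pearson correlation stand in for MIC to keep the example dependency-light, so this is not the published algorithm, and the screening size and redundancy cutoff are arbitrary choices.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def two_step_select(X, y, k_screen=50, redundancy_cut=0.5, seed=0):
    """Two-step filter in the spirit of McTwo, with mutual information
    standing in for the Maximal Information Coefficient (MIC)."""
    # step 1: keep the k_screen features most associated with the phenotype
    relevance = mutual_info_classif(X, y, random_state=seed)
    candidates = np.argsort(relevance)[::-1][:k_screen]
    # step 2: greedily drop candidates highly correlated with an already-kept feature
    kept = []
    for j in candidates:
        if all(abs(np.corrcoef(X[:, j], X[:, i])[0, 1]) < redundancy_cut for i in kept):
            kept.append(j)
    return kept

# placeholders for a gene expression matrix (samples x genes) and phenotype labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(80, 500)), rng.integers(0, 2, 80)
features = two_step_select(X, y)
knn = KNeighborsClassifier(n_neighbors=3)
print(len(features), cross_val_score(knn, X[:, features], y, cv=5).mean())
```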
Zhao, Yu-Xiang; Chou, Chien-Hsing
2016-01-01
In this study, a new feature selection algorithm, the neighborhood-relationship feature selection (NRFS) algorithm, is proposed for identifying rat electroencephalogram signals and recognizing Chinese characters. In these two applications, dependent relationships exist among the feature vectors and their neighboring feature vectors. The proposed NRFS algorithm was therefore designed to take these relationships into account. When the NRFS algorithm is applied, unselected feature vectors have a high priority of being added into the feature subset if their neighboring feature vectors have been selected, and selected feature vectors have a high priority of being eliminated if their neighboring feature vectors are not selected. In the experiments conducted in this study, the NRFS algorithm was compared with two other feature selection algorithms. The experimental results indicated that the NRFS algorithm can extract the crucial frequency bands for identifying rat vigilance states and identify the crucial character regions for recognizing Chinese characters. PMID:27314346
The selection of construction sub-contractors using the fuzzy sets theory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krzemiński, Michał
The paper presents an algorithm for the selection of sub-contractors. The main area of the author's interest is scheduling flow models. The ranking task aims at an execution time that is as short as possible; brigade downtime should also be as small as possible. These targets are exposed to significant obsolescence. The criteria for the selection of subcontractors will therefore not be time and cost; it is assumed that all those criteria are met by the sub-contractors. The decision should instead be made with regard to factors that are difficult to measure, and assessing such factors is an ideal application of fuzzy sets theory. The paper will present a set of evaluation criteria, part of the knowledge base, and a description of the output variable.
Alshamlan, Hala M; Badr, Ghada H; Alohali, Yousef A
2015-06-01
Naturally inspired evolutionary algorithms prove effective when used for solving feature selection and classification problems. The Artificial Bee Colony (ABC) is a relatively new swarm intelligence method. In this paper, we propose a new hybrid gene selection method, namely the Genetic Bee Colony (GBC) algorithm. The proposed algorithm combines the use of a Genetic Algorithm (GA) along with the Artificial Bee Colony (ABC) algorithm. The goal is to integrate the advantages of both algorithms. The proposed algorithm is applied to a microarray gene expression profile in order to select the most predictive and informative genes for cancer classification. In order to test the accuracy performance of the proposed algorithm, extensive experiments were conducted. Three binary microarray datasets are used: colon, leukemia, and lung. In addition, another three multi-class microarray datasets are used: SRBCT, lymphoma, and leukemia. Results of the GBC algorithm are compared with our recently proposed technique, mRMR combined with the Artificial Bee Colony algorithm (mRMR-ABC). We also compared the combination of mRMR with GA (mRMR-GA) and with Particle Swarm Optimization (mRMR-PSO). In addition, we compared the GBC algorithm with other related algorithms that have been recently published in the literature, using all benchmark datasets. The GBC algorithm shows superior performance, as it achieved the highest classification accuracy along with the lowest average number of selected genes. This proves that the GBC algorithm is a promising approach for solving the gene selection problem in both binary and multi-class cancer classification. Copyright © 2015 Elsevier Ltd. All rights reserved.
Igne, Benoit; Shi, Zhenqi; Drennen, James K; Anderson, Carl A
2014-02-01
The impact of raw material variability on the prediction ability of a near-infrared calibration model was studied. Calibrations, developed from a quaternary mixture design comprising theophylline anhydrous, lactose monohydrate, microcrystalline cellulose, and soluble starch, were challenged by intentional variation of raw material properties. A design with two theophylline physical forms, three lactose particle sizes, and two starch manufacturers was created to test model robustness. Further challenges to the models were accomplished through environmental conditions. Along with full-spectrum partial least squares (PLS) modeling, variable selection by dynamic backward PLS and genetic algorithms was utilized in an effort to mitigate the effects of raw material variability. In addition to evaluating models based on their prediction statistics, prediction residuals were analyzed by analyses of variance and model diagnostics (Hotelling's T(2) and Q residuals). Full-spectrum models were significantly affected by lactose particle size. Models developed by selecting variables gave lower prediction errors and proved to be a good approach to limit the effect of changing raw material characteristics. Hotelling's T(2) and Q residuals provided valuable information that was not detectable when studying only prediction trends. Diagnostic statistics were demonstrated to be critical in the appropriate interpretation of the prediction of quality parameters. © 2013 Wiley Periodicals, Inc. and the American Pharmacists Association.
NASA Astrophysics Data System (ADS)
Murray, J. R.
2017-12-01
Earth surface displacements measured at Global Navigation Satellite System (GNSS) sites record crustal deformation due, for example, to slip on faults underground. A primary objective in designing geodetic networks to study crustal deformation is to maximize the ability to recover parameters of interest like fault slip. Given Green's functions (GFs) relating observed displacement to motion on buried dislocations representing a fault, one can use various methods to estimate spatially variable slip. However, assumptions embodied in the GFs, e.g., use of a simplified elastic structure, introduce spatially correlated model prediction errors (MPE) not reflected in measurement uncertainties (Duputel et al., 2014). In theory, selection algorithms should incorporate inter-site correlations to identify measurement locations that give unique information. I assess the impact of MPE on site selection by expanding existing methods (Klein et al., 2017; Reeves and Zhe, 1999) to incorporate this effect. Reeves and Zhe's algorithm sequentially adds or removes a predetermined number of data according to a criterion that minimizes the sum of squared errors (SSE) on parameter estimates. Adapting this method to GNSS network design, Klein et al. select new sites that maximize model resolution, using trade-off curves to determine when additional resolution gain is small. Their analysis uses uncorrelated data errors and GFs for a uniform elastic half space. I compare results using GFs for spatially variable strike slip on a discretized dislocation in a uniform elastic half space, a layered elastic half space, and a layered half space with inclusion of MPE. I define an objective criterion to terminate the algorithm once the next site removal would increase SSE more than the expected incremental SSE increase if all sites had equal impact. Using a grid of candidate sites with 8 km spacing, I find the relative value of the selected sites (defined by the percent increase in SSE that further removal of each site would cause) is more uniform when MPE is included. However, the number and distribution of selected sites depends primarily on site location relative to the fault. For this test case, inclusion of MPE has minimal practical impact; I will investigate whether these findings hold for more densely spaced candidate grids and dipping faults.
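A minimal sketch of such a sequential removal criterion is given below; it uses the trace of the posterior slip-parameter covariance, computed from the Green's functions and a data covariance that can include model prediction error, as a proxy for the SSE criterion, and it assumes enough observations remain for the inversion to stay well posed.

```python
import numpy as np

def greedy_site_removal(G, C_d, n_remove):
    """Sequentially drop the observation whose removal least degrades the
    slip-parameter estimates, judged by the trace of the posterior covariance.

    G   : (n_obs, n_params) Green's functions relating slip to displacement
    C_d : (n_obs, n_obs) data covariance, optionally including model prediction error
    """
    keep = list(range(G.shape[0]))
    for _ in range(n_remove):
        scores = []
        for i in keep:
            idx = [j for j in keep if j != i]
            Gi, Ci = G[idx, :], C_d[np.ix_(idx, idx)]
            cov = np.linalg.inv(Gi.T @ np.linalg.inv(Ci) @ Gi)   # parameter covariance
            scores.append((np.trace(cov), i))
        _, drop = min(scores)          # removing this observation hurts least
        keep.remove(drop)
    return keep

# kept = greedy_site_removal(G, C_d, n_remove=20)   # hypothetical inputs
```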
A provisional effective evaluation when errors are present in independent variables
NASA Technical Reports Server (NTRS)
Gurin, L. S.
1983-01-01
Algorithms are examined for evaluating the parameters of a regression model when there are errors in the independent variables. The algorithms are fast and the estimates they yield are stable with respect to the correlation of errors and measurements of both the dependent variable and the independent variables.
Automatic measurement of images on astrometric plates
NASA Astrophysics Data System (ADS)
Ortiz Gil, A.; Lopez Garcia, A.; Martinez Gonzalez, J. M.; Yershov, V.
1994-04-01
We present some results on the process of automatic detection and measurement of objects in overlapped fields of astrometric plates. The main steps of our algorithm are the following: determination of the scale and tilt between the charge coupled device (CCD) and microscope coordinate systems and estimation of the signal-to-noise ratio in each field; image identification and improvement of its position and size; final image centering; and image selection and storage. Several parameters allow the use of variable criteria for image identification, characterization and selection. Problems related to faint images and crowded fields will be approached by special techniques (morphological filters, histogram properties and fitting models).
A comparison of 12 algorithms for matching on the propensity score.
Austin, Peter C
2014-03-15
Propensity-score matching is increasingly being used to reduce the confounding that can occur in observational studies examining the effects of treatments or interventions on outcomes. We used Monte Carlo simulations to examine the following algorithms for forming matched pairs of treated and untreated subjects: optimal matching, greedy nearest neighbor matching without replacement, and greedy nearest neighbor matching without replacement within specified caliper widths. For each of the latter two algorithms, we examined four different sub-algorithms defined by the order in which treated subjects were selected for matching to an untreated subject: lowest to highest propensity score, highest to lowest propensity score, best match first, and random order. We also examined matching with replacement. We found that (i) nearest neighbor matching induced the same balance in baseline covariates as did optimal matching; (ii) when at least some of the covariates were continuous, caliper matching tended to induce balance on baseline covariates that was at least as good as the other algorithms; (iii) caliper matching tended to result in estimates of treatment effect with less bias compared with optimal and nearest neighbor matching; (iv) optimal and nearest neighbor matching resulted in estimates of treatment effect with negligibly less variability than did caliper matching; (v) caliper matching had amongst the best performance when assessed using mean squared error; (vi) the order in which treated subjects were selected for matching had at most a modest effect on estimation; and (vii) matching with replacement did not have superior performance compared with caliper matching without replacement. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons, Ltd.
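For reference, one of the better-performing variants above, greedy nearest-neighbor caliper matching without replacement on the logit of the propensity score with a caliper of 0.2 standard deviations and a random order of treated subjects, can be sketched as follows; the covariates and treatment indicator are simulated placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def greedy_caliper_match(X, treated, caliper_sd=0.2, seed=0):
    """Greedy 1:1 nearest-neighbor matching on the logit of the propensity score,
    without replacement, within a caliper of 0.2 SD of the logit."""
    rng = np.random.default_rng(seed)
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    logit = np.log(ps / (1 - ps))
    caliper = caliper_sd * logit.std()
    t_idx = np.flatnonzero(treated)
    c_idx = list(np.flatnonzero(~treated))
    pairs = []
    for t in rng.permutation(t_idx):          # random order of treated subjects
        if not c_idx:
            break
        dists = np.abs(logit[c_idx] - logit[t])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:               # only accept matches inside the caliper
            pairs.append((t, c_idx.pop(j)))
    return pairs

# placeholders for baseline covariates and a boolean treatment indicator
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
treated = rng.random(500) < 1 / (1 + np.exp(-X[:, 0]))
print(len(greedy_caliper_match(X, treated)), "matched pairs")
```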
A comparison of 12 algorithms for matching on the propensity score
Austin, Peter C
2014-01-01
Propensity-score matching is increasingly being used to reduce the confounding that can occur in observational studies examining the effects of treatments or interventions on outcomes. We used Monte Carlo simulations to examine the following algorithms for forming matched pairs of treated and untreated subjects: optimal matching, greedy nearest neighbor matching without replacement, and greedy nearest neighbor matching without replacement within specified caliper widths. For each of the latter two algorithms, we examined four different sub-algorithms defined by the order in which treated subjects were selected for matching to an untreated subject: lowest to highest propensity score, highest to lowest propensity score, best match first, and random order. We also examined matching with replacement. We found that (i) nearest neighbor matching induced the same balance in baseline covariates as did optimal matching; (ii) when at least some of the covariates were continuous, caliper matching tended to induce balance on baseline covariates that was at least as good as the other algorithms; (iii) caliper matching tended to result in estimates of treatment effect with less bias compared with optimal and nearest neighbor matching; (iv) optimal and nearest neighbor matching resulted in estimates of treatment effect with negligibly less variability than did caliper matching; (v) caliper matching had amongst the best performance when assessed using mean squared error; (vi) the order in which treated subjects were selected for matching had at most a modest effect on estimation; and (vii) matching with replacement did not have superior performance compared with caliper matching without replacement. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons, Ltd. PMID:24123228
NASA Astrophysics Data System (ADS)
Richards, Joseph W.; Starr, Dan L.; Brink, Henrik; Miller, Adam A.; Bloom, Joshua S.; Butler, Nathaniel R.; James, J. Berian; Long, James P.; Rice, John
2012-01-01
Despite the great promise of machine-learning algorithms to classify and predict astrophysical parameters for the vast numbers of astrophysical sources and transients observed in large-scale surveys, the peculiarities of the training data often manifest as strongly biased predictions on the data of interest. Typically, training sets are derived from historical surveys of brighter, more nearby objects than those from more extensive, deeper surveys (testing data). This sample selection bias can cause catastrophic errors in predictions on the testing data because (1) standard assumptions for machine-learned model selection procedures break down and (2) dense regions of testing space might be completely devoid of training data. We explore possible remedies to sample selection bias, including importance weighting, co-training, and active learning (AL). We argue that AL—where the data whose inclusion in the training set would most improve predictions on the testing set are queried for manual follow-up—is an effective approach and is appropriate for many astronomical applications. For a variable star classification problem on a well-studied set of stars from Hipparcos and Optical Gravitational Lensing Experiment, AL is the optimal method in terms of error rate on the testing data, beating the off-the-shelf classifier by 3.4% and the other proposed methods by at least 3.0%. To aid with manual labeling of variable stars, we developed a Web interface which allows for easy light curve visualization and querying of external databases. Finally, we apply AL to classify variable stars in the All Sky Automated Survey, finding dramatic improvement in our agreement with the ASAS Catalog of Variable Stars, from 65.5% to 79.5%, and a significant increase in the classifier's average confidence for the testing set, from 14.6% to 42.9%, after a few AL iterations.
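The pool-based active-learning loop described above can be sketched as follows; this is a generic margin-based uncertainty-sampling stand-in, with a hypothetical `oracle` labeling function and a Random Forest base classifier as assumptions, not the paper's criterion of querying the objects expected to most improve test-set predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def uncertainty_sampling_al(X_train, y_train, X_pool, oracle, n_iter=5, batch=10):
    """Minimal pool-based active-learning loop using margin-based uncertainty
    sampling. `oracle(indices)` is a hypothetical function returning manual
    labels for the queried pool items (e.g. from a labeling interface)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    pool_idx = np.arange(len(X_pool))
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    for _ in range(n_iter):
        clf.fit(X_train, y_train)
        proba = np.sort(clf.predict_proba(X_pool[pool_idx]), axis=1)
        margin = proba[:, -1] - proba[:, -2]          # small margin = uncertain
        query = pool_idx[np.argsort(margin)[:batch]]  # query the most uncertain
        X_train = np.vstack([X_train, X_pool[query]])
        y_train = np.concatenate([y_train, oracle(query)])
        pool_idx = np.setdiff1d(pool_idx, query)
    return clf.fit(X_train, y_train)
```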
Fuentes, Alejandra; Ortiz, Javier; Saavedra, Nicolás; Salazar, Luis A; Meneses, Claudio; Arriagada, Cesar
2016-04-01
The gene expression stability of candidate reference genes in the roots and leaves of Solanum lycopersicum inoculated with arbuscular mycorrhizal fungi was investigated. Eight candidate reference genes including elongation factor 1 α (EF1), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), phosphoglycerate kinase (PGK), protein phosphatase 2A (PP2Acs), ribosomal protein L2 (RPL2), β-tubulin (TUB), ubiquitin (UBI) and actin (ACT) were selected, and their expression stability was assessed to determine the most stable internal reference for quantitative PCR normalization in S. lycopersicum inoculated with the arbuscular mycorrhizal fungus Rhizophagus irregularis. The stability of each gene was analysed in leaves and roots both together and separately, using the geNorm and NormFinder algorithms. Differences were detected between leaves and roots, varying among the best-ranked genes depending on the algorithm used and the tissue analysed. PGK, TUB and EF1 genes showed higher stability in roots, while EF1 and UBI had higher stability in leaves. Statistical algorithms indicated that the GAPDH gene was the least stable under the experimental conditions assayed. Then, we analysed the expression levels of the LePT4 gene, a phosphate transporter whose expression is induced by fungal colonization in host plant roots. No differences were observed when the most stable genes were used as reference genes. However, when GAPDH was used as the reference gene, we observed an overestimation of LePT4 expression. In summary, our results revealed that candidate reference genes present variable stability in S. lycopersicum arbuscular mycorrhizal symbiosis depending on the algorithm and tissue analysed. Thus, reference gene selection is an important issue for obtaining reliable results in gene expression quantification. Copyright © 2016 Elsevier Masson SAS. All rights reserved.

Analysis of Genome-Wide Association Studies with Multiple Outcomes Using Penalization
Liu, Jin; Huang, Jian; Ma, Shuangge
2012-01-01
Genome-wide association studies have been extensively conducted, searching for markers for biologically meaningful outcomes and phenotypes. Penalization methods have been adopted in the analysis of the joint effects of a large number of SNPs (single nucleotide polymorphisms) and marker identification. This study is partly motivated by the analysis of heterogeneous stock mice dataset, in which multiple correlated phenotypes and a large number of SNPs are available. Existing penalization methods designed to analyze a single response variable cannot accommodate the correlation among multiple response variables. With multiple response variables sharing the same set of markers, joint modeling is first employed to accommodate the correlation. The group Lasso approach is adopted to select markers associated with all the outcome variables. An efficient computational algorithm is developed. Simulation study and analysis of the heterogeneous stock mice dataset show that the proposed method can outperform existing penalization methods. PMID:23272092
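As a rough sketch of joint marker selection across correlated outcomes, scikit-learn's MultiTaskLasso applies a group-type penalty (L2 over outcomes, L1 over features) so that each SNP is kept or dropped for all phenotypes jointly; this is a stand-in for the group Lasso formulation above, not the authors' algorithm, and the simulated data are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLassoCV

# Simulated data: n samples, p SNPs coded 0/1/2, q correlated phenotypes.
rng = np.random.default_rng(0)
n, p, q = 200, 500, 3
X = rng.integers(0, 3, size=(n, p)).astype(float)
B = np.zeros((p, q))
B[:10] = rng.normal(size=(10, q))            # only the first 10 SNPs are causal
Y = X @ B + rng.normal(scale=1.0, size=(n, q))

# MultiTaskLasso puts an L2 penalty on each SNP's row of coefficients across
# all phenotypes (a group-lasso-type penalty), so a SNP is selected for all
# outcomes jointly or dropped for all of them.
model = MultiTaskLassoCV(cv=5, random_state=0).fit(X, Y)
selected = np.flatnonzero(np.linalg.norm(model.coef_.T, axis=1) > 1e-8)
print("selected SNPs:", selected[:20])
```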
Yang, Cheng-Huei; Luo, Ching-Hsing; Yang, Cheng-Hong; Chuang, Li-Yeh
2004-01-01
Morse code is now being harnessed for use in rehabilitation applications of augmentative-alternative communication and assistive technology, including mobility, environmental control and adapted worksite access. In this paper, Morse code is selected as a communication adaptive device for disabled persons who suffer from muscle atrophy, cerebral palsy or other severe handicaps. A stable typing rate is strictly required for Morse code to be effective as a communication tool. This restriction is a major hindrance. Therefore, a switch adaptive automatic recognition method with a high recognition rate is needed. The proposed system combines counter-propagation networks with a variable degree variable step size LMS algorithm. It is divided into five stages: space recognition, tone recognition, learning process, adaptive processing, and character recognition. Statistical analyses demonstrated that the proposed method elicited a better recognition rate in comparison to alternative methods in the literature.
ROTSE All-Sky Surveys for Variable Stars. I. Test Fields
NASA Astrophysics Data System (ADS)
Akerlof, C.; Amrose, S.; Balsano, R.; Bloch, J.; Casperson, D.; Fletcher, S.; Gisler, G.; Hills, J.; Kehoe, R.; Lee, B.; Marshall, S.; McKay, T.; Pawl, A.; Schaefer, J.; Szymanski, J.; Wren, J.
2000-04-01
The Robotic Optical Transient Search Experiment I (ROTSE-I) experiment has generated CCD photometry for the entire northern sky in two epochs nightly since 1998 March. These sky patrol data are a powerful resource for studies of astrophysical transients. As a demonstration project, we present first results of a search for periodic variable stars derived from ROTSE-I observations. Variable identification, period determination, and type classification are conducted via automatic algorithms. In a set of nine ROTSE-I sky patrol fields covering roughly 2000 deg2, we identify 1781 periodic variable stars with mean magnitudes between mv=10.0 and mv=15.5. About 90% of these objects are newly identified as variable. Examples of many familiar types are presented. All classifications for this study have been manually confirmed. The selection criteria for this analysis have been conservatively defined and are known to be biased against some variable classes. This preliminary study includes only 5.6% of the total ROTSE-I sky coverage, suggesting that the full ROTSE-I variable catalog will include more than 32,000 periodic variable stars.
Awad, Joseph; Owrangi, Amir; Villemaire, Lauren; O'Riordan, Elaine; Parraga, Grace; Fenster, Aaron
2012-02-01
Manual segmentation of lung tumors is observer dependent and time-consuming but an important component of radiology and radiation oncology workflow. The objective of this study was to generate an automated lung tumor measurement tool for segmentation of pulmonary metastatic tumors from x-ray computed tomography (CT) images to improve reproducibility and decrease the time required to segment tumor boundaries. The authors developed an automated lung tumor segmentation algorithm for volumetric image analysis of chest CT images using shape constrained Otsu multithresholding (SCOMT) and sparse field active surface (SFAS) algorithms. The observer was required to select the tumor center and the SCOMT algorithm subsequently created an initial surface that was deformed using level set SFAS to minimize the total energy consisting of mean separation, edge, partial volume, rolling, distribution, background, shape, volume, smoothness, and curvature energies. The proposed segmentation algorithm was compared to manual segmentation whereby 21 tumors were evaluated using one-dimensional (1D) response evaluation criteria in solid tumors (RECIST), two-dimensional (2D) World Health Organization (WHO), and 3D volume measurements. Linear regression goodness-of-fit measures (r(2) = 0.63, p < 0.0001; r(2) = 0.87, p < 0.0001; and r(2) = 0.96, p < 0.0001), and Pearson correlation coefficients (r = 0.79, p < 0.0001; r = 0.93, p < 0.0001; and r = 0.98, p < 0.0001) for 1D, 2D, and 3D measurements, respectively, showed significant correlations between manual and algorithm results. Intra-observer intraclass correlation coefficients (ICC) demonstrated high reproducibility for algorithm (0.989-0.995, 0.996-0.997, and 0.999-0.999) and manual measurements (0.975-0.993, 0.985-0.993, and 0.980-0.992) for 1D, 2D, and 3D measurements, respectively. The intra-observer coefficient of variation (CV%) was low for algorithm (3.09%-4.67%, 4.85%-5.84%, and 5.65%-5.88%) and manual observers (4.20%-6.61%, 8.14%-9.57%, and 14.57%-21.61%) for 1D, 2D, and 3D measurements, respectively. The authors developed an automated segmentation algorithm requiring only that the operator select the tumor to measure pulmonary metastatic tumors in 1D, 2D, and 3D. Algorithm and manual measurements were significantly correlated. Since the algorithm segmentation involves selection of a single seed point, it resulted in reduced intra-observer variability and decreased time, for making the measurements.
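A heavily simplified sketch of only the initialization step (seeded multi-Otsu thresholding) is given below; the shape constraints and the sparse field active surface refinement of the actual algorithm are not reproduced, and the box size and class count are assumed values.

```python
import numpy as np
from skimage.filters import threshold_multiotsu
from skimage.measure import label

def initial_tumor_mask(ct_slice, seed_rc, box=40, classes=3):
    """Rough sketch of an initialization step: multi-Otsu thresholding of a box
    around a user-selected seed, keeping the connected component that contains
    the seed. The level-set refinement described above is not included."""
    r, c = seed_rc
    sub = ct_slice[max(r - box, 0):r + box, max(c - box, 0):c + box]
    thresholds = threshold_multiotsu(sub, classes=classes)
    regions = np.digitize(sub, bins=thresholds)     # 0..classes-1 intensity classes
    seed_local = (min(r, box), min(c, box))         # seed position inside the box
    labeled = label(regions == regions[seed_local])
    return labeled == labeled[seed_local]           # component containing the seed
```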
Pires, J C M; Gonçalves, B; Azevedo, F G; Carneiro, A P; Rego, N; Assembleia, A J B; Lima, J F B; Silva, P A; Alves, C; Martins, F G
2012-09-01
This study proposes three methodologies to define artificial neural network models through genetic algorithms (GAs) to predict the next-day hourly average surface ozone (O(3)) concentrations. GAs were applied to define the activation function in the hidden layer and the number of hidden neurons. Two of the methodologies define threshold models, which assume that the behaviour of the dependent variable (O(3) concentrations) changes when it enters a different regime (two and four regimes were considered in this study). The change from one regime to another depends on a specific value (threshold value) of an explanatory variable (threshold variable), which is also defined by GAs. The predictor variables were the hourly average concentrations of carbon monoxide (CO), nitrogen oxide, nitrogen dioxide (NO(2)), and O(3) (recorded on the previous day at an urban site with traffic influence) and also meteorological data (hourly averages of temperature, solar radiation, relative humidity and wind speed). The study was performed for the period from May to August 2004. Several models were achieved and only the best model of each methodology was analysed. In threshold models, the variables selected by GAs to define the O(3) regimes were temperature, CO and NO(2) concentrations, due to their importance in O(3) chemistry in an urban atmosphere. In the prediction of O(3) concentrations, the threshold model that considers two regimes was the one that fitted the data most efficiently.
An affine projection algorithm using grouping selection of input vectors
NASA Astrophysics Data System (ADS)
Shin, JaeWook; Kong, NamWoong; Park, PooGyeon
2011-10-01
This paper presents an affine projection algorithm (APA) using grouping selection of input vectors. To improve the performance of the conventional APA, the proposed algorithm adjusts the number of input vectors using two procedures: a grouping procedure and a selection procedure. In the grouping procedure, input vectors that carry overlapping information for the update are grouped using the normalized inner product. Then, in the selection procedure, the few input vectors that carry enough information for the coefficient update are selected using the steady-state mean square error (MSE). Finally, the filter coefficients are updated using the selected input vectors. The experimental results show that the proposed algorithm has smaller steady-state estimation errors compared with existing algorithms.
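For reference, a minimal numpy sketch of the standard APA coefficient update (without the proposed grouping/selection of input vectors) is shown below; the step size, regularization constant, and the toy system-identification loop are illustrative assumptions.

```python
import numpy as np

def apa_update(w, X, d, mu=0.5, delta=1e-3):
    """One iteration of the standard affine projection algorithm (APA).
    X : (L, K) matrix whose K columns are the most recent input vectors.
    d : (K,) vector of the corresponding desired samples."""
    e = d - X.T @ w                                   # a-priori error vector
    w = w + mu * X @ np.linalg.solve(X.T @ X + delta * np.eye(X.shape[1]), e)
    return w, e

# Toy system identification of an unknown length-L FIR filter h.
rng = np.random.default_rng(0)
L, K = 16, 4
h = rng.normal(size=L)
w = np.zeros(L)
x_hist = np.zeros((L, K))                             # K most recent input vectors
for n in range(2000):
    x = rng.normal(size=L)
    x_hist = np.column_stack([x_hist[:, 1:], x])
    d = x_hist.T @ h + 1e-3 * rng.normal(size=K)      # desired (noisy) outputs
    w, _ = apa_update(w, x_hist, d)
```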
NASA Astrophysics Data System (ADS)
Zhu, Zhe; Gallant, Alisa L.; Woodcock, Curtis E.; Pengra, Bruce; Olofsson, Pontus; Loveland, Thomas R.; Jin, Suming; Dahal, Devendra; Yang, Limin; Auch, Roger F.
2016-12-01
The U.S. Geological Survey's Land Change Monitoring, Assessment, and Projection (LCMAP) initiative is a new end-to-end capability to continuously track and characterize changes in land cover, use, and condition to better support research and applications relevant to resource management and environmental change. Among the LCMAP product suite are annual land cover maps that will be available to the public. This paper describes an approach to optimize the selection of training and auxiliary data for deriving the thematic land cover maps based on all available clear observations from Landsats 4-8. Training data were selected from map products of the U.S. Geological Survey's Land Cover Trends project. The Random Forest classifier was applied for different classification scenarios based on the Continuous Change Detection and Classification (CCDC) algorithm. We found that extracting training data proportionally to the occurrence of land cover classes was superior to an equal distribution of training data per class, and suggest using a total of 20,000 training pixels to classify an area about the size of a Landsat scene. The problem of unbalanced training data was alleviated by extracting a minimum of 600 training pixels and a maximum of 8000 training pixels per class. We additionally explored removing outliers contained within the training data based on their spectral and spatial criteria, but observed no significant improvement in classification results. We also tested the importance of different types of auxiliary data that were available for the conterminous United States, including: (a) five variables used by the National Land Cover Database, (b) three variables from the cloud screening "Function of mask" (Fmask) statistics, and (c) two variables from the change detection results of CCDC. We found that auxiliary variables such as a Digital Elevation Model and its derivatives (aspect, position index, and slope), potential wetland index, water probability, snow probability, and cloud probability improved the accuracy of land cover classification. Compared to the original strategy of the CCDC algorithm (500 pixels per class), the use of the optimal strategy improved the classification accuracies substantially (15-percentage point increase in overall accuracy and 4-percentage point increase in minimum accuracy).
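The proportional-with-caps sampling strategy described above can be sketched as follows; the function below is an illustrative simplification (class codes in a flat label array, simple rounding), not the LCMAP production code.

```python
import numpy as np

def sample_training_pixels(labels, total=20000, min_per_class=600, max_per_class=8000, seed=0):
    """Sketch of the training-data selection strategy described above: draw pixels
    proportionally to class occurrence, but never fewer than `min_per_class` nor
    more than `max_per_class` per class. `labels` is a 1-D array of class codes."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = np.clip(np.round(total * counts / counts.sum()).astype(int),
                     min_per_class, max_per_class)
    picks = []
    for cls, n in zip(classes, target):
        idx = np.flatnonzero(labels == cls)
        picks.append(rng.choice(idx, size=min(n, idx.size), replace=False))
    return np.concatenate(picks)
```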
Combination of minimum enclosing balls classifier with SVM in coal-rock recognition.
Song, QingJun; Jiang, HaiYan; Song, Qinghui; Zhao, XieGuang; Wu, Xiaoxuan
2017-01-01
Top-coal caving technology is a productive and efficient method in modern mechanized coal mining, and the study of coal-rock recognition is key to realizing automation in comprehensive mechanized coal mining. In this paper we propose a new discriminant analysis framework for coal-rock recognition. In the framework, a data acquisition model with vibration and acoustic signals is designed and a caving dataset with 10 feature variables and three classes is obtained. The best combination of feature variables can be decided automatically by using multi-class F-score (MF-Score) feature selection. To handle the nonlinear mapping in this real-world optimization problem, an effective minimum enclosing ball (MEB) algorithm combined with a support vector machine (SVM) is proposed for rapid detection of coal-rock in the caving process. In particular, we illustrate how to construct the MEB-SVM classifier for coal-rock recognition, where the data exhibit an inherently complex distribution. The proposed method is examined on UCI data sets and on the caving dataset, and compared with several recent SVM classifiers. We conduct experiments with accuracy and the Friedman test for comparison of multiple classifiers over the UCI data sets. Experimental results demonstrate that the proposed algorithm has good robustness and generalization ability. The results on the caving dataset show better performance, which points to promising feature selection and multi-class recognition in coal-rock recognition.
Combination of minimum enclosing balls classifier with SVM in coal-rock recognition
Song, QingJun; Jiang, HaiYan; Song, Qinghui; Zhao, XieGuang; Wu, Xiaoxuan
2017-01-01
Top-coal caving technology is a productive and efficient method in modern mechanized coal mining, and the study of coal-rock recognition is key to realizing automation in comprehensive mechanized coal mining. In this paper we propose a new discriminant analysis framework for coal-rock recognition. In the framework, a data acquisition model with vibration and acoustic signals is designed and a caving dataset with 10 feature variables and three classes is obtained. The best combination of feature variables can be decided automatically by using multi-class F-score (MF-Score) feature selection. To handle the nonlinear mapping in this real-world optimization problem, an effective minimum enclosing ball (MEB) algorithm combined with a support vector machine (SVM) is proposed for rapid detection of coal-rock in the caving process. In particular, we illustrate how to construct the MEB-SVM classifier for coal-rock recognition, where the data exhibit an inherently complex distribution. The proposed method is examined on UCI data sets and on the caving dataset, and compared with several recent SVM classifiers. We conduct experiments with accuracy and the Friedman test for comparison of multiple classifiers over the UCI data sets. Experimental results demonstrate that the proposed algorithm has good robustness and generalization ability. The results on the caving dataset show better performance, which points to promising feature selection and multi-class recognition in coal-rock recognition. PMID:28937987
Classification and Feature Selection Algorithms for Modeling Ice Storm Climatology
NASA Astrophysics Data System (ADS)
Swaminathan, R.; Sridharan, M.; Hayhoe, K.; Dobbie, G.
2015-12-01
Ice storms account for billions of dollars of winter storm loss across the continental US and Canada. In the future, increasing concentration of human populations in areas vulnerable to ice storms such as the northeastern US will only exacerbate the impacts of these extreme events on infrastructure and society. Quantifying the potential impacts of global climate change on ice storm prevalence and frequency is challenging, as ice storm climatology is driven by complex and incompletely defined atmospheric processes, processes that are in turn influenced by a changing climate. This makes the underlying atmospheric and computational modeling of ice storm climatology a formidable task. We propose a novel computational framework that uses sophisticated stochastic classification and feature selection algorithms to model ice storm climatology and quantify storm occurrences from both reanalysis and global climate model outputs. The framework is based on an objective identification of ice storm events by key variables derived from vertical profiles of temperature, humidity and geopotential height. Historical ice storm records are used to identify days with synoptic-scale upper air and surface conditions associated with ice storms. Evaluation using NARR reanalysis and historical ice storm records corresponding to the northeastern US demonstrates that an objective computational model with standard performance measures can, with a relatively high degree of accuracy, identify ice storm events based on upper-air circulation patterns and provide insights into the relationships between key climate variables associated with ice storms.
Steen, Valerie A.; Powell, Abby N.
2012-01-01
We examined wetland selection by the Black Tern (Chlidonias niger), a species that breeds primarily in the prairie pothole region, has experienced population declines, and is difficult to manage because of low site fidelity. To characterize its selection of wetlands in this region, we surveyed 589 wetlands throughout North and South Dakota. We documented breeding at 5% and foraging at 17% of wetlands. We created predictive habitat models with a machine-learning algorithm, Random Forests, to explore the relative role of local wetland characteristics and those of the surrounding landscape and to evaluate which characteristics were important to predicting breeding versus foraging. We also examined area-dependent wetland selection while addressing the passive sampling bias by replacing occurrence of terns in the models with an index of density. Local wetland variables were more important than landscape variables in predictions of occurrence of breeding and foraging. Wetland size was more important to prediction of foraging than of breeding locations, while floating matted vegetation was more important to prediction of breeding than of foraging locations. The amount of seasonal wetland in the landscape was the only landscape variable important to prediction of both foraging and breeding. Models based on a density index indicated that wetland selection by foraging terns may be more area dependent than that by breeding terns. Our study provides some of the first evidence for differential breeding and foraging wetland selection by Black Terns and for a more limited role of landscape effects and area sensitivity than has been previously shown.
A study of metaheuristic algorithms for high dimensional feature selection on microarray data
NASA Astrophysics Data System (ADS)
Dankolo, Muhammad Nasiru; Radzi, Nor Haizan Mohamed; Sallehuddin, Roselina; Mustaffa, Noorfa Haszlinna
2017-11-01
Microarray systems enable experts to examine gene profiles at the molecular level using machine learning algorithms. They increase the potential for classification and diagnosis of many diseases at the gene expression level. However, numerous difficulties may affect the efficiency of machine learning algorithms, including the vast number of gene features in the original data, many of which may be unrelated to the intended analysis. Therefore, feature selection needs to be performed during data pre-processing. Many feature selection algorithms have been developed and applied to microarray data, including metaheuristic optimization algorithms. This paper discusses the application of metaheuristic algorithms for feature selection in microarray datasets. This study reveals that the algorithms have yielded interesting results with limited resources, thereby saving the computational expense of machine learning algorithms.
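A minimal wrapper-style genetic algorithm of the kind surveyed above might look like the following sketch; the linear SVM fitness function, population settings, and genetic operators are illustrative assumptions, not any specific paper's configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def ga_feature_selection(X, y, n_gen=30, pop_size=30, p_mut=0.02, seed=0):
    """Minimal wrapper-style genetic algorithm for feature selection. Individuals
    are binary masks over features; fitness is 5-fold CV accuracy of a linear
    SVM trained on the selected features."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.1          # start with sparse masks

    def fitness(mask):
        if not mask.any():
            return 0.0
        return cross_val_score(LinearSVC(dual=False), X[:, mask], y, cv=5).mean()

    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        # Tournament selection of parents.
        parents = pop[[max(rng.choice(pop_size, 2), key=lambda i: scores[i])
                       for _ in range(pop_size)]]
        # One-point crossover followed by bit-flip mutation.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_feat)
            children[i, cut:], children[i + 1, cut:] = (parents[i + 1, cut:].copy(),
                                                        parents[i, cut:].copy())
        children ^= rng.random(children.shape) < p_mut
        children[0] = pop[int(np.argmax(scores))]        # elitism
        pop = children
    scores = np.array([fitness(ind) for ind in pop])
    return pop[int(np.argmax(scores))]                   # best binary feature mask
```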
Archambeau, Cédric; Verleysen, Michel
2007-01-01
A new variational Bayesian learning algorithm for Student-t mixture models is introduced. This algorithm leads to (i) robust density estimation, (ii) robust clustering and (iii) robust automatic model selection. Gaussian mixture models are learning machines which are based on a divide-and-conquer approach. They are commonly used for density estimation and clustering tasks, but are sensitive to outliers. The Student-t distribution has heavier tails than the Gaussian distribution and is therefore less sensitive to any departure of the empirical distribution from Gaussianity. As a consequence, the Student-t distribution is suitable for constructing robust mixture models. In this work, we formalize the Bayesian Student-t mixture model as a latent variable model in a different way from Svensén and Bishop [Svensén, M., & Bishop, C. M. (2005). Robust Bayesian mixture modelling. Neurocomputing, 64, 235-252]. The main difference resides in the fact that it is not necessary to assume a factorized approximation of the posterior distribution on the latent indicator variables and the latent scale variables in order to obtain a tractable solution. Not neglecting the correlations between these unobserved random variables leads to a Bayesian model having an increased robustness. Furthermore, it is expected that the lower bound on the log-evidence is tighter. Based on this bound, the model complexity, i.e. the number of components in the mixture, can be inferred with a higher confidence.
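For context, the latent-variable formulation rests on the standard Gaussian scale-mixture representation of the Student-t distribution, stated here as a sketch (precision matrix Λ, degrees of freedom ν):

```latex
% Gaussian scale-mixture representation underlying the latent-variable
% formulation of the Student-t distribution:
\[
  \mathcal{St}(x \mid \mu, \Lambda, \nu)
  = \int_{0}^{\infty} \mathcal{N}\!\left(x \,\middle|\, \mu, (u\Lambda)^{-1}\right)
    \mathrm{Gam}\!\left(u \,\middle|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\right) du ,
\]
% so each data point carries a latent scale u (and, in a mixture, a latent
% component indicator); the variational treatment discussed above differs in
% not assuming a factorized posterior over these two sets of latent variables.
```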
Sensor placement on Canton Tower for health monitoring using asynchronous-climb monkey algorithm
NASA Astrophysics Data System (ADS)
Yi, Ting-Hua; Li, Hong-Nan; Zhang, Xu-Dong
2012-12-01
Heuristic optimization algorithms have become a popular choice for solving complex and intricate sensor placement problems which are difficult to solve by traditional methods. This paper proposes a novel and interesting methodology called the asynchronous-climb monkey algorithm (AMA) for the optimum design of sensor arrays for a structural health monitoring system. Different from the existing algorithms, the dual-structure coding method is designed and adopted for the representation of the design variables. The asynchronous-climb process is incorporated in the proposed AMA that can adjust the trajectory of each individual dynamically in the search space according to its own experience and other monkeys. The concept of ‘monkey king’ is introduced in the AMA, which reflects the Darwinian principle of natural selection and can create an interaction network to correctly guide the movement of other monkeys. Numerical experiments are carried out using two different objective functions by considering the Canton Tower in China with or without the antenna mast to evaluate the performance of the proposed algorithm. Investigations have indicated that the proposed AMA exhibits faster convergence characteristics and can generate sensor configurations superior in all instances when compared to the conventional monkey algorithm. For structures with stiffness mutation such as the Canton Tower, the sensor placement needs to be considered for each part separately.
MotieGhader, Habib; Gharaghani, Sajjad; Masoudi-Sobhanzadeh, Yosef; Masoudi-Nejad, Ali
2017-01-01
Feature selection is of great importance in Quantitative Structure-Activity Relationship (QSAR) analysis. This problem has been addressed with meta-heuristic algorithms such as GA, PSO, ACO and others. In this work two novel hybrid meta-heuristic algorithms, Sequential GA and LA (SGALA) and Mixed GA and LA (MGALA), based on the genetic algorithm and learning automata, are proposed for QSAR feature selection. The SGALA algorithm exploits the advantages of the genetic algorithm and learning automata sequentially, while the MGALA algorithm exploits them simultaneously. We applied the proposed algorithms to select the minimum possible number of features from three different datasets and observed that the MGALA and SGALA algorithms had the best outcome, both individually and on average, compared to other feature selection algorithms. Comparing the proposed algorithms, we found that the rate of convergence to the optimal result of MGALA and SGALA was better than that of the GA, ACO, PSO and LA algorithms. Finally, the feature subsets obtained by the GA, ACO, PSO, LA, SGALA, and MGALA algorithms were used as input to an LS-SVR model; the results showed that the LS-SVR model had greater predictive ability with the input from the SGALA and MGALA algorithms than with the input from all the other algorithms. Therefore, the results corroborate that not only is the predictive efficiency of the proposed algorithms better, but their rate of convergence is also superior to all the other algorithms. PMID:28979308
MotieGhader, Habib; Gharaghani, Sajjad; Masoudi-Sobhanzadeh, Yosef; Masoudi-Nejad, Ali
2017-01-01
Feature selection is of great importance in Quantitative Structure-Activity Relationship (QSAR) analysis. This problem has been addressed with meta-heuristic algorithms such as GA, PSO, ACO and others. In this work two novel hybrid meta-heuristic algorithms, Sequential GA and LA (SGALA) and Mixed GA and LA (MGALA), based on the genetic algorithm and learning automata, are proposed for QSAR feature selection. The SGALA algorithm exploits the advantages of the genetic algorithm and learning automata sequentially, while the MGALA algorithm exploits them simultaneously. We applied the proposed algorithms to select the minimum possible number of features from three different datasets and observed that the MGALA and SGALA algorithms had the best outcome, both individually and on average, compared to other feature selection algorithms. Comparing the proposed algorithms, we found that the rate of convergence to the optimal result of MGALA and SGALA was better than that of the GA, ACO, PSO and LA algorithms. Finally, the feature subsets obtained by the GA, ACO, PSO, LA, SGALA, and MGALA algorithms were used as input to an LS-SVR model; the results showed that the LS-SVR model had greater predictive ability with the input from the SGALA and MGALA algorithms than with the input from all the other algorithms. Therefore, the results corroborate that not only is the predictive efficiency of the proposed algorithms better, but their rate of convergence is also superior to all the other algorithms.
NASA Astrophysics Data System (ADS)
Gonzalez, T.; Ruvalcaba, A.; Oliver, L.
2016-12-01
Electricity generation from renewable resources has acquired a leading role. Mexico in particular has great interest in renewable natural resources for power generation, especially wind energy, and the country is rapidly moving into the development of wind power generation sites. The development of a wind site as an energy project does not follow a standardized methodology; techniques for selecting the best place to install a wind turbine system vary according to the developer. Generally, developers consider three key factors: 1) the characteristics of the wind, 2) the potential distribution of electricity, and 3) transport access to the site. This paper presents a study with a different methodology, carried out in two stages. The first, at regional scale, uses "space" and "natural" criteria to select a region: cartographic features such as political and physiographic divisions, the location of natural protected areas, water bodies, and urban areas; and natural criteria such as the amount and direction of the wind, land use and land cover, vegetation, topography, and the biodiversity of the site. The application of these criteria yields a first optimal selection area. The second stage of the methodology applies criteria and variables at a detailed scale, and the analysis of all collected information provides new parameters (decision variables) for the site. The overall analysis based on these criteria indicates that the best location for the wind field would be southern Coahuila and the central part of Nuevo Leon. The wind power site will contribute to the economic growth of important cities including Monterrey. Finally, a genetic algorithm will be used as a computational tool to determine the best site selection depending on the parameters considered.
A novel approach to selecting and weighting nutrients for nutrient profiling of foods and diets.
Arsenault, Joanne E; Fulgoni, Victor L; Hersey, James C; Muth, Mary K
2012-12-01
Nutrient profiling of foods is the science of ranking or classifying foods based on their nutrient composition. Most profiling systems use similar weighting factors across nutrients due to lack of scientific evidence to assign levels of importance to nutrients. Our aim was to use a statistical approach to determine the nutrients that best explain variation in Healthy Eating Index (HEI) scores and to obtain β-coefficients for the nutrients for use as weighting factors for a nutrient-profiling algorithm. We used a cross-sectional analysis of nutrient intakes and HEI scores. Our subjects included 16,587 individuals from the National Health and Nutrition Examination Survey 2005-2008 who were 2 years of age or older and not pregnant. Our main outcome measure was variation (R(2)) in HEI scores. Linear regression analyses were conducted with HEI scores as the dependent variable and all possible combinations of 16 nutrients of interest as independent variables, with covariates age, sex, and ethnicity. The analyses identified the best 1-nutrient variable model (with the highest R(2)), the best 2-nutrient variable model, and up to the best 16-nutrient variable model. The model with 8 nutrients explained 65% of the variance in HEI scores, similar to the models with 9 to 16 nutrients, but substantially higher than previous algorithms reported in the literature. The model contained five nutrients with positive β-coefficients (ie, protein, fiber, calcium, unsaturated fat, and vitamin C) and three nutrients with negative coefficients (ie, saturated fat, sodium, and added sugar). β-coefficients from the model were used as weighting factors to create an algorithm that generated a weighted nutrient density score representing the overall nutritional quality of a food. The weighted nutrient density score can be easily calculated and is useful for describing the overall nutrient quality of both foods and diets. Copyright © 2012 Academy of Nutrition and Dietetics. Published by Elsevier Inc. All rights reserved.
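A hypothetical illustration of how such β-coefficients translate into a weighted nutrient density score is sketched below; the weights, nutrient keys, and per-100-kcal normalization are placeholders, not the coefficients estimated in the study.

```python
# Hypothetical illustration of turning regression beta-coefficients into a weighted
# nutrient density score; the coefficient values below are placeholders, not the
# ones estimated in the study.
BETA = {                       # sign indicates "encourage" (+) vs "limit" (-)
    "protein_g": +1.0, "fiber_g": +1.0, "calcium_mg": +1.0,
    "unsat_fat_g": +1.0, "vitamin_c_mg": +1.0,
    "sat_fat_g": -1.0, "sodium_mg": -1.0, "added_sugar_g": -1.0,
}

def weighted_nutrient_density(nutrients_per_100kcal):
    """Weighted sum of nutrient amounts (expressed per 100 kcal here, as an
    assumed normalization) using the beta-coefficients as weights."""
    return sum(BETA[k] * v for k, v in nutrients_per_100kcal.items() if k in BETA)

score = weighted_nutrient_density({"protein_g": 6.0, "fiber_g": 2.5, "sodium_mg": 150.0})
```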
Application of modern control theory to the design of optimum aircraft controllers
NASA Technical Reports Server (NTRS)
Power, L. J.
1973-01-01
The synthesis procedure presented is based on the solution of the output regulator problem of linear optimal control theory for time-invariant systems. By this technique, solution of the matrix Riccati equation leads to a constant linear feedback control law for an output regulator which will maintain a plant in a particular equilibrium condition in the presence of impulse disturbances. Two simple algorithms are presented that can be used in an automatic synthesis procedure for the design of maneuverable output regulators requiring only selected state variables for feedback. The first algorithm is for the construction of optimal feedforward control laws that can be superimposed upon a Kalman output regulator and that will drive the output of a plant to a desired constant value on command. The second algorithm is for the construction of optimal Luenberger observers that can be used to obtain feedback control laws for the output regulator requiring measurement of only part of the state vector. This algorithm constructs observers which have minimum response time under the constraint that the magnitude of the gains in the observer filter be less than some arbitrary limit.
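The Riccati-based core of the regulator design can be sketched with SciPy as follows; only the constant LQ feedback gain is shown (the feedforward and Luenberger-observer algorithms described above are not reproduced), and the double-integrator plant and weighting matrices are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, R):
    """Constant state-feedback gain for the time-invariant LQ regulator:
    solve the algebraic Riccati equation A'P + PA - PBR^{-1}B'P + Q = 0
    and return K = R^{-1} B' P, so that u = -K x."""
    P = solve_continuous_are(A, B, Q, R)
    return np.linalg.solve(R, B.T @ P)

# Toy second-order plant (double integrator) regulated to the origin.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
K = lqr_gain(A, B, Q=np.eye(2), R=np.array([[1.0]]))
```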
Detection of nasopharyngeal cancer using confocal Raman spectroscopy and genetic algorithm technique
NASA Astrophysics Data System (ADS)
Li, Shao-Xin; Chen, Qiu-Yan; Zhang, Yan-Jiao; Liu, Zhi-Ming; Xiong, Hong-Lian; Guo, Zhou-Yi; Mai, Hai-Qiang; Liu, Song-Hao
2012-12-01
Raman spectroscopy (RS) and a genetic algorithm (GA) were applied to distinguish nasopharyngeal cancer (NPC) from normal nasopharyngeal tissue. A total of 225 Raman spectra were acquired from 120 tissue sites of 63 nasopharyngeal patients: 56 Raman spectra from normal tissue and 169 Raman spectra from NPC tissue. The GA integrated with linear discriminant analysis (LDA) was developed to differentiate NPC and normal tissue according to spectral variables in the selected regions of 792-805, 867-880, 996-1009, 1086-1099, 1288-1304, 1663-1670, and 1742-1752 cm-1, related to proteins, nucleic acids and lipids of tissue. The GA-LDA algorithm with the leave-one-out cross-validation method provides a sensitivity of 69.2% and specificity of 100%. These results are better than those of principal component analysis applied to the same Raman dataset of nasopharyngeal tissue, which gives a sensitivity of 63.3% and specificity of 94.6%. This demonstrates that Raman spectroscopy combined with the GA-LDA diagnostic algorithm has enormous potential to detect and diagnose nasopharyngeal cancer.
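A minimal sketch of the leave-one-out cross-validated LDA evaluation step is given below; it assumes the GA wavelength selection has already reduced the spectra to the chosen bands, and the 1/0 label coding is an assumption for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loo_sensitivity_specificity(X, y):
    """Leave-one-out cross-validated LDA classification of spectra, reporting
    sensitivity and specificity (y: 1 = cancer, 0 = normal). X is assumed to
    already contain only the GA-selected spectral bands."""
    y = np.asarray(y)
    pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
    tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))
    return tp / (tp + fn), tn / (tn + fp)
```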
Nankali, Saber; Miandoab, Payam Samadi; Baghizadeh, Amin
2016-01-01
In external‐beam radiotherapy, using external markers is one of the most reliable tools to predict tumor position, in clinical applications. The main challenge in this approach is tumor motion tracking with highest accuracy that depends heavily on external markers location, and this issue is the objective of this study. Four commercially available feature selection algorithms entitled 1) Correlation‐based Feature Selection, 2) Classifier, 3) Principal Components, and 4) Relief were proposed to find optimum location of external markers in combination with two “Genetic” and “Ranker” searching procedures. The performance of these algorithms has been evaluated using four‐dimensional extended cardiac‐torso anthropomorphic phantom. Six tumors in lung, three tumors in liver, and 49 points on the thorax surface were taken into account to simulate internal and external motions, respectively. The root mean square error of an adaptive neuro‐fuzzy inference system (ANFIS) as prediction model was considered as metric for quantitatively evaluating the performance of proposed feature selection algorithms. To do this, the thorax surface region was divided into nine smaller segments and predefined tumors motion was predicted by ANFIS using external motion data of given markers at each small segment, separately. Our comparative results showed that all feature selection algorithms can reasonably select specific external markers from those segments where the root mean square error of the ANFIS model is minimum. Moreover, the performance accuracy of proposed feature selection algorithms was compared, separately. For this, each tumor motion was predicted using motion data of those external markers selected by each feature selection algorithm. Duncan statistical test, followed by F‐test, on final results reflected that all proposed feature selection algorithms have the same performance accuracy for lung tumors. But for liver tumors, a correlation‐based feature selection algorithm, in combination with a genetic search algorithm, proved to yield best performance accuracy for selecting optimum markers. PACS numbers: 87.55.km, 87.56.Fc PMID:26894358
Nankali, Saber; Torshabi, Ahmad Esmaili; Miandoab, Payam Samadi; Baghizadeh, Amin
2016-01-08
In external-beam radiotherapy, using external markers is one of the most reliable tools to predict tumor position, in clinical applications. The main challenge in this approach is tumor motion tracking with highest accuracy that depends heavily on external markers location, and this issue is the objective of this study. Four commercially available feature selection algorithms entitled 1) Correlation-based Feature Selection, 2) Classifier, 3) Principal Components, and 4) Relief were proposed to find optimum location of external markers in combination with two "Genetic" and "Ranker" searching procedures. The performance of these algorithms has been evaluated using four-dimensional extended cardiac-torso anthropomorphic phantom. Six tumors in lung, three tumors in liver, and 49 points on the thorax surface were taken into account to simulate internal and external motions, respectively. The root mean square error of an adaptive neuro-fuzzy inference system (ANFIS) as prediction model was considered as metric for quantitatively evaluating the performance of proposed feature selection algorithms. To do this, the thorax surface region was divided into nine smaller segments and predefined tumors motion was predicted by ANFIS using external motion data of given markers at each small segment, separately. Our comparative results showed that all feature selection algorithms can reasonably select specific external markers from those segments where the root mean square error of the ANFIS model is minimum. Moreover, the performance accuracy of proposed feature selection algorithms was compared, separately. For this, each tumor motion was predicted using motion data of those external markers selected by each feature selection algorithm. Duncan statistical test, followed by F-test, on final results reflected that all proposed feature selection algorithms have the same performance accuracy for lung tumors. But for liver tumors, a correlation-based feature selection algorithm, in combination with a genetic search algorithm, proved to yield best performance accuracy for selecting optimum markers.
A Study of Quasar Selection in the Supernova Fields of the Dark Energy Survey
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tie, S. S.; Martini, P.; Mudd, D.
In this paper, we present a study of quasar selection using the supernova fields of the Dark Energy Survey (DES). We used a quasar catalog from an overlapping portion of the SDSS Stripe 82 region to quantify the completeness and efficiency of selection methods involving color, probabilistic modeling, variability, and combinations of color/probabilistic modeling with variability. In all cases, we considered only objects that appear as point sources in the DES images. We examine color selection methods based on the Wide-field Infrared Survey Explorer (WISE) mid-IR W1-W2 color, a mixture of WISE and DES colors (g - i and i-W1), and a mixture of Vista Hemisphere Survey and DES colors (g - i and i - K). For probabilistic quasar selection, we used XDQSO, an algorithm that employs an empirical multi-wavelength flux model of quasars to assign quasar probabilities. Our variability selection uses the multi-band χ2-probability that sources are constant in the DES Year 1 griz-band light curves. The completeness and efficiency are calculated relative to an underlying sample of point sources that are detected in the required selection bands and pass our data quality and photometric error cuts. We conduct our analyses at two magnitude limits, i < 19.8 mag and i < 22 mag. For the subset of sources with W1 and W2 detections, the W1-W2 color or XDQSOz method combined with variability gives the highest completenesses of >85% for both i-band magnitude limits and efficiencies of >80% to the bright limit and >60% to the faint limit; however, the giW1 and giW1+variability methods give the highest quasar surface densities. The XDQSOz method and combinations of W1W2/giW1/XDQSOz with variability are among the better selection methods when both high completeness and high efficiency are desired. We also present the OzDES Quasar Catalog of 1263 spectroscopically confirmed quasars from three years of OzDES observation in the 30 deg2 of the DES supernova fields. Finally, the catalog includes quasars with redshifts up to z ~ 4 and brighter than i = 22 mag, although the catalog is not complete up to this magnitude limit.
A Study of Quasar Selection in the Supernova Fields of the Dark Energy Survey
Tie, S. S.; Martini, P.; Mudd, D.; ...
2017-02-15
In this paper, we present a study of quasar selection using the supernova fields of the Dark Energy Survey (DES). We used a quasar catalog from an overlapping portion of the SDSS Stripe 82 region to quantify the completeness and efficiency of selection methods involving color, probabilistic modeling, variability, and combinations of color/probabilistic modeling with variability. In all cases, we considered only objects that appear as point sources in the DES images. We examine color selection methods based on the Wide-field Infrared Survey Explorer (WISE) mid-IR W1-W2 color, a mixture of WISE and DES colors (g - i and i-W1), and a mixture of Vista Hemisphere Survey and DES colors (g - i and i - K). For probabilistic quasar selection, we used XDQSO, an algorithm that employs an empirical multi-wavelength flux model of quasars to assign quasar probabilities. Our variability selection uses the multi-band χ2-probability that sources are constant in the DES Year 1 griz-band light curves. The completeness and efficiency are calculated relative to an underlying sample of point sources that are detected in the required selection bands and pass our data quality and photometric error cuts. We conduct our analyses at two magnitude limits, i < 19.8 mag and i < 22 mag. For the subset of sources with W1 and W2 detections, the W1-W2 color or XDQSOz method combined with variability gives the highest completenesses of >85% for both i-band magnitude limits and efficiencies of >80% to the bright limit and >60% to the faint limit; however, the giW1 and giW1+variability methods give the highest quasar surface densities. The XDQSOz method and combinations of W1W2/giW1/XDQSOz with variability are among the better selection methods when both high completeness and high efficiency are desired. We also present the OzDES Quasar Catalog of 1263 spectroscopically confirmed quasars from three years of OzDES observation in the 30 deg2 of the DES supernova fields. Finally, the catalog includes quasars with redshifts up to z ~ 4 and brighter than i = 22 mag, although the catalog is not complete up to this magnitude limit.
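The multi-band χ2 "constant source" probability used for variability selection can be sketched as follows; the per-band weighted-mean formulation and the toy light curve are assumptions for illustration rather than the DES pipeline implementation.

```python
import numpy as np
from scipy.stats import chi2

def constant_source_pvalue(mags_by_band, errs_by_band):
    """Multi-band chi-square test that a light curve is constant: in each band,
    compute chi^2 of the magnitudes about their inverse-variance-weighted mean,
    sum over bands, and return the survival probability. Small values indicate
    variability (a sketch of the kind of statistic described above)."""
    chi2_tot, dof = 0.0, 0
    for m, e in zip(mags_by_band, errs_by_band):
        m, e = np.asarray(m, float), np.asarray(e, float)
        w = 1.0 / e**2
        mean = np.sum(w * m) / np.sum(w)
        chi2_tot += np.sum(w * (m - mean) ** 2)
        dof += m.size - 1
    return chi2.sf(chi2_tot, dof)

# Example: a source observed in g, r, i, z with a few epochs per band.
p_const = constant_source_pvalue(
    [[20.1, 20.3, 19.9], [19.8, 19.7], [19.5, 19.6, 19.4], [19.2, 19.3]],
    [[0.05, 0.05, 0.05], [0.04, 0.04], [0.04, 0.04, 0.04], [0.05, 0.05]])
```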
A chaos wolf optimization algorithm with self-adaptive variable step-size
NASA Astrophysics Data System (ADS)
Zhu, Yong; Jiang, Wanlu; Kong, Xiangdong; Quan, Lingxiao; Zhang, Yongshun
2017-10-01
To address the problem of parameter optimization for complex nonlinear functions, a chaos wolf optimization algorithm (CWOA) with self-adaptive variable step size was proposed. The algorithm is based on the swarm intelligence of the wolf pack and fully simulates the predation behavior and prey distribution of wolves. It comprises three intelligent behaviors: migration, summons, and siege. A "winner-take-all" competition rule and a "survival of the fittest" update mechanism are further characteristics of the algorithm. Moreover, it combines self-adaptive variable step-size search with chaos optimization strategies. The CWOA was applied to parameter optimization of twelve typical, complex nonlinear functions, and the results were compared with many existing algorithms, including the classical genetic algorithm, the particle swarm optimization algorithm and the leader wolf pack search algorithm. The investigation indicates that CWOA possesses preferable optimization ability, with advantages in optimization accuracy and convergence rate, and demonstrates high robustness and global searching ability.
Dynamic variable selection in SNP genotype autocalling from APEX microarray data.
Podder, Mohua; Welch, William J; Zamar, Ruben H; Tebbutt, Scott J
2006-11-30
Single nucleotide polymorphisms (SNPs) are DNA sequence variations, occurring when a single nucleotide--adenine (A), thymine (T), cytosine (C) or guanine (G)--is altered. Arguably, SNPs account for more than 90% of human genetic variation. Our laboratory has developed a highly redundant SNP genotyping assay consisting of multiple probes with signals from multiple channels for a single SNP, based on arrayed primer extension (APEX). This mini-sequencing method is a powerful combination of a highly parallel microarray with distinctive Sanger-based dideoxy terminator sequencing chemistry. Using this microarray platform, our current genotype calling system (known as SNP Chart) is capable of calling single SNP genotypes by manual inspection of the APEX data, which is time-consuming and exposed to user subjectivity bias. Using a set of 32 Coriell DNA samples plus three negative PCR controls as a training data set, we have developed a fully-automated genotyping algorithm based on simple linear discriminant analysis (LDA) using dynamic variable selection. The algorithm combines separate analyses based on the multiple probe sets to give a final posterior probability for each candidate genotype. We have tested our algorithm on a completely independent data set of 270 DNA samples, with validated genotypes, from patients admitted to the intensive care unit (ICU) of St. Paul's Hospital (plus one negative PCR control sample). Our method achieves a concordance rate of 98.9% with a 99.6% call rate for a set of 96 SNPs. By adjusting the threshold value for the final posterior probability of the called genotype, the call rate reduces to 94.9% with a higher concordance rate of 99.6%. We also reversed the two independent data sets in their training and testing roles, achieving a concordance rate up to 99.8%. The strength of this APEX chemistry-based platform is its unique redundancy having multiple probes for a single SNP. Our model-based genotype calling algorithm captures the redundancy in the system considering all the underlying probe features of a particular SNP, automatically down-weighting any 'bad data' corresponding to image artifacts on the microarray slide or failure of a specific chemistry. In this regard, our method is able to automatically select the probes which work well and reduce the effect of other so-called bad performing probes in a sample-specific manner, for any number of SNPs.
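A simplified sketch of posterior-probability-thresholded genotype calling with LDA is shown below; the probe-level feature construction, dynamic variable selection, and probe down-weighting of the actual method are not reproduced, and the threshold value is illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def call_genotypes(X_train, g_train, X_new, threshold=0.9):
    """Posterior-probability-based genotype calling with LDA: a call is made only
    when the largest class posterior exceeds `threshold`; otherwise the sample is
    left as a no-call, trading call rate for concordance as described above."""
    lda = LinearDiscriminantAnalysis().fit(X_train, g_train)
    post = lda.predict_proba(X_new)
    best = np.argmax(post, axis=1)
    calls = np.where(post[np.arange(len(best)), best] >= threshold,
                     lda.classes_[best], None)          # None marks a no-call
    return calls, post
```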
NASA Astrophysics Data System (ADS)
Legeais, JeanFrancois; Cazenave, Anny; Ablain, Michael; Larnicol, Gilles; Benveniste, Jerome; Johannessen, Johnny; Timms, Gary; Andersen, Ole; Cipollini, Paolo; Roca, Monica; Rudenko, Sergei; Fernandes, Joana; Balmaseda, Magdalena; Quartly, Graham; Fenoglio-Marc, Luciana; Meyssignac, Benoit; Scharffenberg, Martin
2016-04-01
Sea level is a very sensitive index of climate change and variability, integrating ocean warming and the melting of mountain glaciers and ice sheets. Understanding sea level variability and change implies accurate monitoring of the sea level variable at climate scales, in addition to understanding ocean variability and the exchanges between ocean, land, cryosphere, and atmosphere. That is why sea level is one of the Essential Climate Variables (ECV) selected in the frame of the ESA Climate Change Initiative (CCI) program, which aims at providing long-term monitoring of the sea level ECV with regular updates, as required for climate studies. The program is now in its second three-year phase (following phase I during 2011-2013). The objectives are, firstly, to involve the climate research community, to refine their needs and collect their feedback on product quality; and secondly, to develop, test and select the best algorithms and standards to generate an updated climate time series and to produce and validate the sea level ECV product. This will better answer climate user needs by improving the quality of the sea level products and maintaining a sustained, up-to-date production service. This has led to the production of the sea level ECV, which has benefited from yearly extensions and now covers the period 1993-2014. We will first present the main achievements of the ESA CCI Sea Level Project. On the one hand, the major steps required to produce the 22-year climate time series are briefly described: collecting and refining the user requirements, developing algorithms adapted for climate applications, and specifying the production system. On the other hand, the product characteristics are described, as well as the results from product validation performed by several groups of the ocean and climate modeling community. At last, new altimeter standards have been developed and the best ones have recently been selected in order to produce a full reprocessing of the dataset (performed in 2016) adapted for climate studies. These new standards will be presented, as well as other results regarding the improvement of the sea level estimation in the Arctic Ocean and in coastal areas, for which preliminary results suggest that significant improvements can be achieved.
Modelling Ecuador's rainfall distribution according to geographical characteristics.
NASA Astrophysics Data System (ADS)
Tobar, Vladimiro; Wyseure, Guido
2017-04-01
It is known that rainfall is affected by terrain characteristics, and some studies have focused on its distribution over complex terrain. Ecuador's temporal and spatial rainfall distribution is affected by its location on the ITCZ, the marine currents in the Pacific, the Amazon rainforest, and the Andes mountain range. Although all these factors are important, we think the latter may hold a key to modelling the spatial and temporal distribution of rainfall. The study considered 30 years of monthly data from 319 rainfall stations having at least 10 years of data available. The relatively low density of stations and their location in accessible sites near main roads or rivers leave large and important areas ungauged, making it inappropriate to rely on traditional interpolation techniques to estimate regional rainfall for water balance. The aim of this research was to develop a useful model for seasonal rainfall distribution in Ecuador based on geographical characteristics to allow its spatial generalization. The target for modelling was the seasonal rainfall, characterized by nine percentiles for each of the 12 months of the year, which results in 108 response variables, later reduced to four principal components comprising 94% of the total variability. Predictor variables for the model were: geographic coordinates, elevation, main wind effects from the Amazon and the coast, valley and hill indexes, and average and maximum elevation above the selected rainfall station to the east and to the west for each of 18 directions (50-135°, by 5°), adding up to 79 predictors. A multiple linear regression model was fitted by the Elastic-net algorithm with cross-validation for each of the PCs as response, to select the most important of the 79 predictor variables. The Elastic-net algorithm deals well with collinearity problems while allowing variable selection in a blended approach between Ridge and Lasso regression. The model fitting produced explained variances of 59%, 81%, 49% and 17% for PC1, PC2, PC3 and PC4, respectively, backing up the hypothesis of good correlation between geographical characteristics and seasonal rainfall patterns (comprised in the four principal components). With the coefficients obtained from the regression, the 108 rainfall percentiles for each station were back-estimated, giving very good results when compared with the original ones, with an overall 60% explained variance.
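The PCA-then-Elastic-net modelling chain described above can be sketched as follows; the l1_ratio grid, standardization choice, and 10-fold cross-validation are illustrative assumptions, not necessarily the settings used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

def fit_seasonal_rainfall_model(X_geo, Y_percentiles, n_pc=4):
    """Sketch of the modelling chain described above: reduce the 108 monthly
    rainfall percentiles to a few principal components, then fit one
    cross-validated Elastic-net regression per component on the (standardized)
    geographic predictors. Returns the fitted PCA and the list of models."""
    Xs = StandardScaler().fit_transform(X_geo)
    pca = PCA(n_components=n_pc).fit(Y_percentiles)
    scores = pca.transform(Y_percentiles)
    models = [ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10).fit(Xs, scores[:, k])
              for k in range(n_pc)]
    return pca, models
```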
Cheng, Weiwei; Sun, Da-Wen; Pu, Hongbin; Wei, Qingyi
2017-04-15
The feasibility of hyperspectral imaging (HSI) (400-1000 nm) for tracing the chemical spoilage extent of the raw meat used for two kinds of processed meats was investigated. Calibration models established separately for salted and cooked meats using the full set of wavebands showed good results, with determination coefficients in prediction (R²P) of 0.887 and 0.832, respectively. To simplify the calibration models, two variable selection methods were used and compared. The results showed that genetic algorithm-partial least squares (GA-PLS), with as many continuous wavebands selected as possible, always had better performance. The potential of HSI to develop one multispectral system for simultaneously tracing the chemical spoilage extent of the two kinds of processed meats was also studied. A good result, with an R²P of 0.854, was obtained using GA-PLS as the dimension reduction method, which was thus used to visualize total volatile base nitrogen (TVB-N) contents corresponding to each pixel of the image. Copyright © 2016 Elsevier Ltd. All rights reserved.
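A minimal sketch of the PLS calibration step is given below; the `band_mask` argument stands in for the wavebands a GA-PLS search would have selected, and the component count and cross-validation scheme are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def pls_tvbn_model(spectra, tvbn, band_mask, n_components=8):
    """Sketch of the PLS calibration step: regress TVB-N content on the reflectance
    of a subset of wavebands (`band_mask` stands in for the bands a GA-PLS search
    would have selected) and report the cross-validated R^2 of prediction."""
    tvbn = np.asarray(tvbn, float)
    X = spectra[:, band_mask]
    pls = PLSRegression(n_components=n_components)
    pred = cross_val_predict(pls, X, tvbn, cv=10).ravel()
    ss_res = np.sum((tvbn - pred) ** 2)
    ss_tot = np.sum((tvbn - tvbn.mean()) ** 2)
    return pls.fit(X, tvbn), 1.0 - ss_res / ss_tot
```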
Marathe, Aniruddha P.; Harris, Rachel A.; Lowenthal, David K.; ...
2015-12-17
The use of clouds to execute high-performance computing (HPC) applications has greatly increased recently. Clouds provide several potential advantages over traditional supercomputers and in-house clusters. The most popular cloud is currently Amazon EC2, which provides fixed-cost and variable-cost, auction-based options. The auction market trades lower cost for potential interruptions that necessitate checkpointing; if the market price exceeds the bid price, a node is taken away from the user without warning. We explore techniques to maximize performance per dollar given a time constraint within which an application must complete. Specifically, we design and implement multiple techniques to reduce expected cost by exploiting redundancy in the EC2 auction market. We then design an adaptive algorithm that selects a scheduling algorithm and determines the bid price. We show that our adaptive algorithm executes programs up to seven times cheaper than using the on-demand market and up to 44 percent cheaper than the best non-redundant, auction-market algorithm. We extend our adaptive algorithm to incorporate application scalability characteristics for further cost savings. In conclusion, we show that the adaptive algorithm informed with scalability characteristics of applications achieves up to 56 percent cost savings compared to the expected cost for the base adaptive algorithm run at a fixed, user-defined scale.
Extraction of incident irradiance from LWIR hyperspectral imagery
NASA Astrophysics Data System (ADS)
Lahaie, Pierre
2014-10-01
The atmospheric correction of thermal hyperspectral imagery can be separated into two distinct processes: Atmospheric Compensation (AC) and Temperature and Emissivity Separation (TES). TES requires as input, at each pixel, the ground-leaving radiance and the atmospheric downwelling irradiance, which are the outputs of the AC process. The extraction of the downwelling irradiance from imagery requires assumptions about the nature of some of the pixels, the sensor and the atmosphere. Another difficulty is that the sensor's spectral response is often not well characterized. To deal with this unknown, we defined a spectral mean operator that is used to filter the ground-leaving radiance and a downwelling irradiance computed from MODTRAN. A user selects a number of pixels in the image for which the emissivity is assumed to be known. The emissivity of these pixels is assumed to be spectrally smooth, so that the downwelling irradiance is the only spectrally fast-varying quantity. Using these assumptions, we built an algorithm to estimate the downwelling irradiance. The algorithm is applied to all the selected pixels, and the estimated irradiance is the average over the spectral channels of the resulting computations. The algorithm performs well in simulation, and results are shown for errors in the assumed emissivity and for errors in the atmospheric profiles. The sensor noise mainly influences the required number of pixels.
Inverse Ising problem in continuous time: A latent variable approach
NASA Astrophysics Data System (ADS)
Donner, Christian; Opper, Manfred
2017-12-01
We consider the inverse Ising problem: the inference of network couplings from observed spin trajectories for a model with continuous time Glauber dynamics. By introducing two sets of auxiliary latent random variables we render the likelihood into a form which allows for simple iterative inference algorithms with analytical updates. The variables are (1) Poisson variables to linearize an exponential term which is typical for point process likelihoods and (2) Pólya-Gamma variables, which make the likelihood quadratic in the coupling parameters. Using the augmented likelihood, we derive an expectation-maximization (EM) algorithm to obtain the maximum likelihood estimate of network parameters. Using a third set of latent variables we extend the EM algorithm to sparse couplings via L1 regularization. Finally, we develop an efficient approximate Bayesian inference algorithm using a variational approach. We demonstrate the performance of our algorithms on data simulated from an Ising model. For data which are simulated from a more biologically plausible network with spiking neurons, we show that the Ising model captures well the low order statistics of the data and how the Ising couplings are related to the underlying synaptic structure of the simulated network.
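A hedged sketch of the data-generating side of the problem above: simulating spin trajectories from Glauber dynamics of an Ising model with a random symmetric coupling matrix. The couplings, temperature scaling and length are arbitrary, and a discrete-time single-spin update is used as a stand-in for the continuous-time dynamics of the paper.

```python
# Hedged sketch: discrete-time Glauber dynamics producing spin trajectories.
import numpy as np

rng = np.random.default_rng(2)
n, T = 20, 5000
J = rng.normal(scale=0.3 / np.sqrt(n), size=(n, n))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)                   # no self-coupling

s = rng.choice([-1, 1], size=n)
trajectory = np.empty((T, n), dtype=int)
for t in range(T):
    i = rng.integers(n)                    # pick one spin per step
    h = J[i] @ s                           # local field on spin i
    p_up = 1.0 / (1.0 + np.exp(-2.0 * h))  # Glauber flip probability
    s[i] = 1 if rng.random() < p_up else -1
    trajectory[t] = s
```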
AVNM: A Voting based Novel Mathematical Rule for Image Classification.
Vidyarthi, Ankit; Mittal, Namita
2016-12-01
In machine learning, the accuracy of a system depends upon the classification result, and classification accuracy plays an imperative role in various domains. A non-parametric classifier like the K-Nearest Neighbor (KNN) is the most widely used classifier for pattern analysis. Besides its ease of use, simplicity and effectiveness, the main problem associated with the KNN classifier is the selection of the number of nearest neighbors, i.e. "k", for computation. At present, it is hard to find the optimal value of "k" using any statistical algorithm that gives perfect accuracy in terms of a low misclassification error rate. Motivated by this problem, a new sample-space-reduction weighted voting mathematical rule (AVNM) is proposed for classification in machine learning. The proposed AVNM rule is also non-parametric in nature, like KNN. AVNM uses a weighted voting mechanism with sample space reduction to learn and examine the predicted class label for an unidentified sample. AVNM is free from any initial selection of a predefined variable and from the neighbor selection found in the KNN algorithm. The proposed classifier also reduces the effect of outliers. To verify the performance of the proposed AVNM classifier, experiments were made on 10 standard datasets taken from the UCI database and one manually created dataset. The experimental results show that the proposed AVNM rule outperforms the KNN classifier and its variants. Experimental results based on the confusion-matrix accuracy measure show a higher accuracy value with the AVNM rule. The proposed AVNM rule is based on a sample space reduction mechanism for identification of an optimal number of nearest neighbors. AVNM results in better classification accuracy and a minimum error rate compared with the state-of-the-art KNN algorithm and its variants. The proposed rule automates the selection of nearest neighbors and improves the classification rate for the UCI datasets and the manually created dataset. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
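A hedged sketch of the general mechanism AVNM builds on, inverse-distance-weighted voting among nearest neighbours. This is not the AVNM rule itself: its sample-space reduction step is omitted, and the function name and data layout are assumptions for illustration.

```python
# Hedged sketch: distance-weighted voting over the k nearest training samples.
import numpy as np

def weighted_vote(X_train, y_train, x, k=5, eps=1e-9):
    """Predict a class label by inverse-distance-weighted voting (numpy arrays expected)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    weights = 1.0 / (d[nearest] + eps)          # closer neighbours get larger votes
    classes = np.unique(y_train)
    scores = [weights[y_train[nearest] == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]
```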
The artificial-free technique along the objective direction for the simplex algorithm
NASA Astrophysics Data System (ADS)
Boonperm, Aua-aree; Sinapiromsaran, Krung
2014-03-01
The simplex algorithm is a popular algorithm for solving linear programming problems. If the origin point satisfies all constraints, then the simplex algorithm can be started. Otherwise, artificial variables must be introduced to start the simplex algorithm. If the simplex algorithm can be started without artificial variables, the iterations will require less time. In this paper, we present an artificial-free technique for the simplex algorithm by mapping the problem into the objective plane and splitting constraints into three groups. In the objective plane, one of the variables with a nonzero coefficient in the objective function is fixed in terms of another variable. The constraints can then be split into three groups: the positive coefficient group, the negative coefficient group and the zero coefficient group. Along the objective direction, some constraints from the positive coefficient group will form the optimal solution. If the positive coefficient group is nonempty, the algorithm starts by relaxing constraints from the negative coefficient group and the zero coefficient group. We guarantee that the feasible region obtained from the positive coefficient group is nonempty. The transformed problem is solved using the simplex algorithm. Constraints from the negative coefficient group and the zero coefficient group are then added to the solved problem, and the dual simplex method is used to determine the new optimal solution. An example shows the effectiveness of our algorithm.
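For context, a minimal sketch of solving a small linear program with SciPy's HiGHS-based solver. This only illustrates the standard LP pipeline the abstract takes as its starting point, not the authors' artificial-free variant; the numbers are arbitrary.

```python
# Hedged sketch: a tiny LP solved with scipy.optimize.linprog.
from scipy.optimize import linprog

# maximize x1 + 2*x2  ->  minimize -(x1 + 2*x2)
c = [-1.0, -2.0]
A_ub = [[1.0, 1.0],      # x1 + x2 <= 4
        [-1.0, 2.0]]     # -x1 + 2*x2 <= 2
b_ub = [4.0, 2.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None), (0, None)], method="highs")
print(res.x, -res.fun)   # optimal point and objective value
```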
Model predictive controller design for boost DC-DC converter using T-S fuzzy cost function
NASA Astrophysics Data System (ADS)
Seo, Sang-Wha; Kim, Yong; Choi, Han Ho
2017-11-01
This paper proposes a Takagi-Sugeno (T-S) fuzzy method to select cost function weights of finite control set model predictive DC-DC converter control algorithms. The proposed method updates the cost function weights at every sample time by using T-S type fuzzy rules derived from the common optimal control engineering knowledge that a state or input variable with an excessively large magnitude can be penalised by increasing the weight corresponding to the variable. The best control input is determined via the online optimisation of the T-S fuzzy cost function for all the possible control input sequences. This paper implements the proposed model predictive control algorithm in real time on a Texas Instruments TMS320F28335 floating-point Digital Signal Processor (DSP). Some experimental results are given to illuminate the practicality and effectiveness of the proposed control system under several operating conditions. The results verify that our method can yield not only good transient and steady-state responses (fast recovery time, small overshoot, zero steady-state error, etc.) but also insensitiveness to abrupt load or input voltage parameter variations.
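A hedged sketch of the finite-control-set idea the abstract describes: enumerate the admissible switch states, predict one step ahead with a discretized converter model, and pick the input minimizing a weighted cost. The state-space matrices, the 24 V source, and the fixed weights below are hypothetical placeholders; the paper's T-S fuzzy rules, which adapt the weights online, are not reproduced.

```python
# Hedged sketch: one step of finite-control-set MPC with a placeholder converter model.
import numpy as np

A = {0: np.array([[0.98, -0.05], [0.05, 0.99]]),   # hypothetical model per switch state
     1: np.array([[0.98,  0.00], [0.00, 0.99]])}
B = {0: np.array([0.0, 0.02]), 1: np.array([0.02, 0.0])}

def best_input(x, v_ref, i_ref, w_v=1.0, w_i=0.1):
    """Evaluate both candidate switch states and return the cheaper one."""
    costs = {}
    for u in (0, 1):
        x_next = A[u] @ x + B[u] * 24.0            # predicted [voltage, current]
        costs[u] = w_v * (x_next[0] - v_ref) ** 2 + w_i * (x_next[1] - i_ref) ** 2
    return min(costs, key=costs.get)

print(best_input(np.array([11.5, 1.0]), v_ref=12.0, i_ref=1.2))
```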
Antwi, Philip; Li, Jianzheng; Boadi, Portia Opoku; Meng, Jia; Shi, En; Deng, Kaiwen; Bondinuba, Francis Kwesi
2017-03-01
Three-layered feedforward backpropagation (BP) artificial neural networks (ANN) and multiple nonlinear regression (MnLR) models were developed to estimate biogas and methane yield in an upflow anaerobic sludge blanket (UASB) reactor treating potato starch processing wastewater (PSPW). Anaerobic process parameters were optimized to identify their importance on methanation. pH, total chemical oxygen demand, ammonium, alkalinity, total Kjeldahl nitrogen, total phosphorus, volatile fatty acids and hydraulic retention time, selected based on principal component analysis, were used as input variables, while biogas and methane yield were employed as target variables. The quasi-Newton and conjugate gradient backpropagation algorithms were the best among eleven training algorithms. The coefficient of determination (R²) of the BP-ANN reached 98.72% and 97.93%, while the MnLR model attained 93.9% and 91.08% for biogas and methane yield, respectively. Compared with the MnLR model, the BP-ANN model demonstrated superior performance, suggesting possible control of the anaerobic digestion process with the BP-ANN model. Copyright © 2016 Elsevier Ltd. All rights reserved.
Darajeh, Negisa; Idris, Azni; Fard Masoumi, Hamid Reza; Nourani, Abolfazl; Truong, Paul; Rezania, Shahabaldin
2017-05-04
Artificial neural networks (ANNs) have been widely used to solve such problems because of their reliable, robust and salient ability to capture the nonlinear relationships between variables in complex systems. In this study, an ANN was applied for modeling of Chemical Oxygen Demand (COD) and biodegradable organic matter (BOD) removal from palm oil mill secondary effluent (POMSE) by a vetiver system. The independent variables, including POMSE concentration, vetiver slip density and removal time, were considered as input parameters to optimize the network, while the removal percentages of COD and BOD were selected as outputs. To determine the number of hidden-layer nodes, the root mean squared error of the testing set was minimized, and the topologies of the algorithms were compared by coefficient of determination and absolute average deviation. The comparison indicated that the quick propagation (QP) algorithm had the minimum root mean squared error and absolute average deviation, and the maximum coefficient of determination. The importance values of the variables were 42.41% for vetiver slip density, 29.8% for time, and 27.79% for POMSE concentration, which showed that none of them is negligible. Results show that the ANN has great potential for predicting COD and BOD removal from POMSE, with a residual standard error (RSE) of less than 0.45%.
Jeyasingh, Suganthi; Veluchamy, Malathi
2017-05-01
Early diagnosis of breast cancer is essential to save the lives of patients. Usually, medical datasets include a large variety of data that can lead to confusion during diagnosis. The Knowledge Discovery in Databases (KDD) process helps to improve efficiency. It requires the elimination of inappropriate and repeated data from the dataset before the final diagnosis. This can be done using any of the feature selection algorithms available in data mining. Feature selection is considered a vital step to increase classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection to eliminate irrelevant features from an original dataset. The Bat algorithm was modified using simple random sampling to select random instances from the dataset. Ranking with the global best features was used to recognize the predominant features available in the dataset. The selected features are used to train a Random Forest (RF) classification algorithm. The MBA feature selection algorithm enhanced the classification accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnosis Breast Cancer Dataset (WDBC) was used for estimating the performance of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of Kappa statistic, Matthews Correlation Coefficient, Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE). Creative Commons Attribution License
Sensitivity study of Space Station Freedom operations cost and selected user resources
NASA Technical Reports Server (NTRS)
Accola, Anne; Fincannon, H. J.; Williams, Gregory J.; Meier, R. Timothy
1990-01-01
The results of sensitivity studies performed to estimate probable ranges for four key Space Station parameters using the Space Station Freedom's Model for Estimating Space Station Operations Cost (MESSOC) are discussed. The variables examined are grouped into five main categories: logistics, crew, design, space transportation system, and training. The modification of these variables implies programmatic decisions in areas such as orbital replacement unit (ORU) design, investment in repair capabilities, and crew operations policies. The model utilizes a wide range of algorithms and an extensive trial logistics data base to represent Space Station operations. The trial logistics data base consists largely of a collection of the ORUs that comprise the mature station, and their characteristics based on current engineering understanding of the Space Station. A nondimensional approach is used to examine the relative importance of variables on parameters.
Statistical analysis for validating ACO-KNN algorithm as feature selection in sentiment analysis
NASA Astrophysics Data System (ADS)
Ahmad, Siti Rohaidah; Yusop, Nurhafizah Moziyana Mohd; Bakar, Azuraliza Abu; Yaakub, Mohd Ridzwan
2017-10-01
This research paper aims to propose a hybrid of the ant colony optimization (ACO) and k-nearest neighbor (KNN) algorithms as a feature selection method for selecting relevant features from customer review datasets. Information gain (IG), the genetic algorithm (GA), and rough set attribute reduction (RSAR) were used as baseline algorithms in a performance comparison with the proposed algorithm. This paper also discusses the significance test, which was used to evaluate the performance differences between the ACO-KNN, IG-GA, and IG-RSAR algorithms. This study evaluated the performance of the ACO-KNN algorithm using precision, recall, and F-score, which were validated using parametric statistical significance tests. The evaluation process has statistically proven that the ACO-KNN algorithm is significantly improved compared to the baseline algorithms. In addition, the experimental results have proven that the ACO-KNN can be used as a feature selection technique in sentiment analysis to obtain a quality, optimal feature subset that can represent the actual data in customer review data.
ERIC Educational Resources Information Center
Weissman, Alexander
2013-01-01
Convergence of the expectation-maximization (EM) algorithm to a global optimum of the marginal log likelihood function for unconstrained latent variable models with categorical indicators is presented. The sufficient conditions under which global convergence of the EM algorithm is attainable are provided in an information-theoretic context by…
NASA Astrophysics Data System (ADS)
Walawender, Ewelina; Walawender, Jakub P.; Ustrnul, Zbigniew
2017-02-01
The main purpose of the study is to introduce methods for mapping the spatial distribution of the occurrence of selected atmospheric phenomena (thunderstorms, fog, glaze and rime) over Poland from 1966 to 2010 (45 years). Limited in situ observations as well as the discontinuous and location-dependent nature of these phenomena make traditional interpolation inappropriate. Spatially continuous maps were created with the use of geospatial predictive modelling techniques. For each given phenomenon, an algorithm identifying its favourable meteorological and environmental conditions was created on the basis of observations recorded at 61 weather stations in Poland. Annual frequency maps presenting the probability of a day with a thunderstorm, fog, glaze or rime were created with the use of a modelled, gridded dataset by implementing the predefined algorithms. Relevant explanatory variables were derived from the NCEP/NCAR reanalysis and downscaled with the use of a Regional Climate Model. The resulting maps of favourable meteorological conditions were found to be valuable and representative on the country scale, but with different correlation (r) strengths against in situ data (from r = 0.84 for thunderstorms to r = 0.15 for fog). A weak correlation between gridded estimates of fog occurrence and observation data indicated the very local nature of this phenomenon. For this reason, additional environmental predictors of fog occurrence were also examined. Topographic parameters derived from the SRTM elevation model and reclassified CORINE Land Cover data were used as the external explanatory variables for the multiple linear regression kriging used to obtain the final map. The regression model explained 89% of the variability in the annual frequency of fog in the study area. Regression residuals were interpolated via simple kriging.
NASA Astrophysics Data System (ADS)
Wang, J.; Samms, T.; Meier, C.; Simmons, L.; Miller, D.; Bathke, D.
2005-12-01
Spatial evapotranspiration (ET) is usually estimated by the Surface Energy Balance Algorithm for Land (SEBAL). The average accuracy of the algorithm is 85% on a daily basis and 95% on a seasonal basis. However, the accuracy of the algorithm varies from 67% to 95% for instantaneous ET estimates and, as reported in 18 studies, from 70% to 98% for 1 to 10-day ET estimates. There is a need to understand the sensitivity of the ET calculation with respect to the algorithm variables and equations. With an increased understanding, information can be developed to improve the algorithm and to better identify the key variables and equations. A Modified Surface Energy Balance Algorithm for Land (MSEBAL) was developed and validated with data from a pecan orchard and an alfalfa field. The MSEBAL uses ground reflectance and temperature data from ASTER sensors along with humidity, wind speed, and solar radiation data from a local weather station. MSEBAL outputs hourly and daily ET at 90 m by 90 m resolution. A sensitivity analysis was conducted for MSEBAL on the ET calculation. In order to observe the sensitivity of the calculation to a particular variable, the value of that variable was changed while holding the magnitudes of the other variables constant. The key variables and equations to which the ET calculation is most sensitive were determined in this study. http://weather.nmsu.edu/pecans/SEBALFolder/San%20Francisco%20AGU%20meeting/ASensitivityAnalysisonMSE
Lu, Stephen M.; Lu, Wuyuan; Qasim, M. A.; Anderson, Stephen; Apostol, Izydor; Ardelt, Wojciech; Bigler, Theresa; Chiang, Yi Wen; Cook, James; James, Michael N. G.; Kato, Ikunoshin; Kelly, Clyde; Kohr, William; Komiyama, Tomoko; Lin, Tiao-Yin; Ogawa, Michio; Otlewski, Jacek; Park, Soon-Jae; Qasim, Sabiha; Ranjbar, Michael; Tashiro, Misao; Warne, Nicholas; Whatley, Harry; Wieczorek, Anna; Wieczorek, Maciej; Wilusz, Tadeusz; Wynn, Richard; Zhang, Wenlei; Laskowski, Michael
2001-01-01
An additivity-based sequence to reactivity algorithm for the interaction of members of the Kazal family of protein inhibitors with six selected serine proteinases is described. Ten consensus variable contact positions in the inhibitor were identified, and the 19 possible variants at each of these positions were expressed. The free energies of interaction of these variants and the wild type were measured. For an additive system, this data set allows for the calculation of all possible sequences, subject to some restrictions. The algorithm was extensively tested. It is exceptionally fast so that all possible sequences can be predicted. The strongest, the most specific possible, and the least specific inhibitors were designed, and an evolutionary problem was solved. PMID:11171964
Optimal design of dampers within seismic structures
NASA Astrophysics Data System (ADS)
Ren, Wenjie; Qian, Hui; Song, Wali; Wang, Liqiang
2009-07-01
An improved multi-objective genetic algorithm for structural passive control system optimization is proposed. Based on the two-branch tournament genetic algorithm, the selection operator is constructed by evaluating individuals according to their dominance in one run. For a constrained problem, the dominance-based penalty function method is advanced, containing information on an individual's status (feasible or infeasible), position in a search space, and distance from a Pareto optimal set. The proposed approach is used for the optimal designs of a six-storey building with shape memory alloy dampers subjected to earthquake. The number and position of dampers are chosen as the design variables. The number of dampers and peak relative inter-storey drift are considered as the objective functions. Numerical results generate a set of non-dominated solutions.
Mono-isotope Prediction for Mass Spectra Using Bayes Network.
Li, Hui; Liu, Chunmei; Rwebangira, Mugizi Robert; Burge, Legand
2014-12-01
Mass spectrometry is one of the widely utilized and important methods for studying protein functions and components. The challenge of mono-isotope pattern recognition from large-scale protein mass spectral data requires computational algorithms and tools to speed up the analysis and improve the analytic results. We utilized a naïve Bayes network as the classifier, with the assumption that the selected features are independent, to predict mono-isotope patterns from mass spectrometry. Mono-isotopes detected from validated theoretical spectra were used as prior information in the Bayes method. Three main features extracted from the dataset were employed as independent variables in our model. The application of the proposed algorithm to the public Mo dataset demonstrates that our naïve Bayes classifier is advantageous over existing methods in both accuracy and sensitivity.
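A minimal sketch of a naive Bayes classifier over a few assumed-independent features, as a stand-in for the mono-isotope pattern classifier above. The features, labels and class priors are synthetic; in the paper the priors come from validated theoretical spectra.

```python
# Hedged sketch: Gaussian naive Bayes with user-supplied class priors.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
               rng.normal(1.5, 1.0, size=(200, 3))])   # three features per pattern
y = np.r_[np.zeros(200), np.ones(200)]                 # 0 = non-mono-isotope, 1 = mono-isotope

clf = GaussianNB(priors=[0.7, 0.3]).fit(X, y)          # prior knowledge as class priors
print(clf.predict_proba(X[:2]))
```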
Detecting Anomalies in Process Control Networks
NASA Astrophysics Data System (ADS)
Rrushi, Julian; Kang, Kyoung-Don
This paper presents the estimation-inspection algorithm, a statistical algorithm for anomaly detection in process control networks. The algorithm determines if the payload of a network packet that is about to be processed by a control system is normal or abnormal based on the effect that the packet will have on a variable stored in control system memory. The estimation part of the algorithm uses logistic regression integrated with maximum likelihood estimation in an inductive machine learning process to estimate a series of statistical parameters; these parameters are used in conjunction with logistic regression formulas to form a probability mass function for each variable stored in control system memory. The inspection part of the algorithm uses the probability mass functions to estimate the normalcy probability of a specific value that a network packet writes to a variable. Experimental results demonstrate that the algorithm is very effective at detecting anomalies in process control networks.
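A hedged sketch of the inspection idea: model the probability that a value written to a control-system memory variable is normal and flag low-probability writes. The training data, the quadratic feature (added so a linear logistic model can learn a "normal band" of values), and the threshold are illustrative assumptions, not the paper's exact estimation procedure.

```python
# Hedged sketch: logistic model of write-value normalcy with a probability threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
v = np.r_[rng.normal(50.0, 5.0, 500), rng.uniform(0.0, 200.0, 60)]
X = np.c_[v, v ** 2]                         # quadratic term allows a band-shaped boundary
y = np.r_[np.ones(500), np.zeros(60)]        # 1 = normal write, 0 = abnormal

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

def p_normal(value):
    """Estimated probability that writing `value` to the variable is normal."""
    return model.predict_proba([[value, value ** 2]])[0, 1]

print(round(p_normal(51.0), 3), round(p_normal(180.0), 3))
```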
Variable-spot ion beam figuring
NASA Astrophysics Data System (ADS)
Wu, Lixiang; Qiu, Keqiang; Fu, Shaojun
2016-03-01
This paper introduces a new scheme of ion beam figuring (IBF), or rather variable-spot IBF, which is conducted at a constant scanning velocity with variable-spot ion beam collimated by a variable diaphragm. It aims at improving the reachability and adaptation of the figuring process within the limits of machine dynamics by varying the ion beam spot size instead of the scanning velocity. In contrast to the dwell time algorithm in the conventional IBF, the variable-spot IBF adopts a new algorithm, which consists of the scan path programming and the trajectory optimization using pattern search. In this algorithm, instead of the dwell time, a new concept, integral etching time, is proposed to interpret the process of variable-spot IBF. We conducted simulations to verify its feasibility and practicality. The simulation results indicate the variable-spot IBF is a promising alternative to the conventional approach.
NASA Astrophysics Data System (ADS)
Ushijima, T.; Yeh, W.
2013-12-01
An optimal experimental design algorithm is developed to select locations for a network of observation wells that provides the maximum information about unknown hydraulic conductivity in a confined, anisotropic aquifer. The design employs a maximal information criterion that chooses, among competing designs, the design that maximizes the sum of squared sensitivities while conforming to specified design constraints. Because the formulated problem is non-convex and contains integer variables (necessitating a combinatorial search), for a realistically scaled model the problem may be difficult, if not impossible, to solve through traditional mathematical programming techniques. Genetic Algorithms (GAs) are designed to search out the global optimum; however, because a GA requires a large number of calls to a groundwater model, the formulated optimization problem may still be infeasible to solve. To overcome this, Proper Orthogonal Decomposition (POD) is applied to the groundwater model to reduce its dimension. The information matrix in the full model space can then be searched without solving the full model.
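A hedged sketch of the design criterion only: rank candidate observation wells by the sum of squared sensitivities of model outputs to the unknown conductivities and keep the best ones. The sensitivity matrix here is random; the paper obtains it from a POD-reduced groundwater model, adds design constraints, and uses a GA rather than this simple top-k selection, which is only valid because the criterion shown is additive across wells.

```python
# Hedged sketch: selecting wells by their squared-sensitivity contribution.
import numpy as np

rng = np.random.default_rng(5)
J = rng.normal(size=(100, 12))       # 100 candidate wells x 12 conductivity parameters

def select_wells(J, n_wells=10):
    scores = (J ** 2).sum(axis=1)    # sum of squared sensitivities per well
    return np.argsort(scores)[::-1][:n_wells]

print(select_wells(J))
```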
Negotiating Multicollinearity with Spike-and-Slab Priors
Ročková, Veronika
2014-01-01
In multiple regression under the normal linear model, the presence of multicollinearity is well known to lead to unreliable and unstable maximum likelihood estimates. This can be particularly troublesome for the problem of variable selection where it becomes more difficult to distinguish between subset models. Here we show how adding a spike-and-slab prior mitigates this difficulty by filtering the likelihood surface into a posterior distribution that allocates the relevant likelihood information to each of the subset model modes. For identification of promising high posterior models in this setting, we consider three EM algorithms, the fast closed form EMVS version of Rockova and George (2014) and two new versions designed for variants of the spike-and-slab formulation. For a multimodal posterior under multicollinearity, we compare the regions of convergence of these three algorithms. Deterministic annealing versions of the EMVS algorithm are seen to substantially mitigate this multimodality. A single simple running example is used for illustration throughout. PMID:25419004
Mokeddem, Diab; Khellaf, Abdelhafid
2009-01-01
Optimal design problems are widely known for their multiple performance measures, which often compete with each other. In this paper, an optimal multiproduct batch chemical plant design is presented. The design is first formulated as a multiobjective optimization problem, to be solved using the well-suited non-dominated sorting genetic algorithm (NSGA-II). NSGA-II has the capability to fine-tune variables in determining a set of non-dominated solutions distributed along the Pareto front in a single run of the algorithm. The ability of NSGA-II to identify a set of optimal solutions provides the decision-maker (DM) with a complete picture of the optimal solution space, enabling better and more appropriate choices. An outranking with PROMETHEE II then helps the decision-maker finalize the selection of the best compromise. The effectiveness of the NSGA-II method on multiobjective optimization problems is illustrated through two carefully referenced examples. PMID:19543537
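A minimal sketch of the core building block behind NSGA-II: extracting the non-dominated (Pareto) set from a population of objective vectors, assuming all objectives are minimized. The full algorithm adds non-dominated ranking into fronts, crowding distance, and the genetic operators, none of which are shown here.

```python
# Hedged sketch: Pareto (non-dominated) filtering of objective vectors.
import numpy as np

def non_dominated(F):
    """Return indices of rows of F (n_points x n_objectives) not dominated by any other row."""
    n = F.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        dominates_i = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominates_i.any():
            keep[i] = False
    return np.flatnonzero(keep)

F = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]])
print(non_dominated(F))   # [0 1 3]; point 2 is dominated by point 1
```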
Selecting materialized views using random algorithm
NASA Astrophysics Data System (ADS)
Zhou, Lijuan; Hao, Zhongxiao; Liu, Chi
2007-04-01
A data warehouse is a repository of information collected from multiple, possibly heterogeneous, autonomous distributed databases. The information stored at the data warehouse is in the form of views referred to as materialized views. The selection of the materialized views is one of the most important decisions in designing a data warehouse. Materialized views are stored in the data warehouse for the purpose of efficiently implementing on-line analytical processing queries. The first issue for the user to consider is query response time. So in this paper, we develop algorithms to select a set of views to materialize in the data warehouse in order to minimize the total view maintenance cost under the constraint of a given query response time. We call this the query_cost view_selection problem. First, the cost graph and cost model of the query_cost view_selection problem are presented. Second, methods for selecting materialized views by using random algorithms are presented. The genetic algorithm is applied to the materialized view selection problem. However, as the genetic process develops, producing legal solutions becomes more and more difficult, so many solutions are eliminated and the time needed to produce solutions lengthens. Therefore, an improved algorithm is presented in this paper, which combines the simulated annealing algorithm and the genetic algorithm to solve the query_cost view_selection problem. Finally, in order to test the function and efficiency of our algorithms, simulation experiments are conducted. The experiments show that the given methods can provide near-optimal solutions in limited time and work better in practical cases. Randomized algorithms will become invaluable tools for data warehouse evolution.
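A hedged sketch of the simulated annealing component for a view-selection-style subset search: flip one view in or out of the materialized set and accept worse moves with a temperature-dependent probability. The cost function below is a toy placeholder, not the paper's maintenance-cost/response-time model, and the cooling schedule is arbitrary.

```python
# Hedged sketch: simulated annealing over binary "materialize this view" decisions.
import math
import random

def anneal(n_views, cost, n_iter=5000, t0=10.0, alpha=0.999):
    state = [random.random() < 0.5 for _ in range(n_views)]
    best, best_cost, t = list(state), cost(state), t0
    for _ in range(n_iter):
        cand = list(state)
        i = random.randrange(n_views)
        cand[i] = not cand[i]                      # flip one view in/out
        delta = cost(cand) - cost(state)
        if delta < 0 or random.random() < math.exp(-delta / t):
            state = cand
            if cost(state) < best_cost:
                best, best_cost = list(state), cost(state)
        t *= alpha                                 # cool down
    return best, best_cost

toy_cost = lambda s: abs(sum(s) - 7) + 0.1 * sum(s)   # placeholder objective
print(anneal(20, toy_cost)[1])
```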
Kepler AutoRegressive Planet Search: Motivation & Methodology
NASA Astrophysics Data System (ADS)
Caceres, Gabriel; Feigelson, Eric; Jogesh Babu, G.; Bahamonde, Natalia; Bertin, Karine; Christen, Alejandra; Curé, Michel; Meza, Cristian
2015-08-01
The Kepler AutoRegressive Planet Search (KARPS) project uses statistical methodology associated with autoregressive (AR) processes to model Kepler lightcurves in order to improve exoplanet transit detection in systems with high stellar variability. We also introduce a planet-search algorithm to detect transits in time-series residuals after application of the AR models. One of the main obstacles in detecting faint planetary transits is the intrinsic stellar variability of the host star. The variability displayed by many stars may have autoregressive properties, wherein later flux values are correlated with previous ones in some manner. Auto-Regressive Moving-Average (ARMA) models, Generalized Auto-Regressive Conditional Heteroskedasticity (GARCH), and related models are flexible, phenomenological methods used with great success to model stochastic temporal behaviors in many fields of study, particularly econometrics. Powerful statistical methods are implemented in the public statistical software environment R and its many packages. Modeling involves maximum likelihood fitting, model selection, and residual analysis. These techniques provide a useful framework to model stellar variability and are used in KARPS with the objective of reducing stellar noise to enhance opportunities to find as-yet-undiscovered planets. Our analysis procedure consists of three steps: pre-processing of the data to remove discontinuities, gaps and outliers; ARMA-type model selection and fitting; and transit signal search of the residuals using a new Transit Comb Filter (TCF) that replaces traditional box-finding algorithms. We apply the procedures to simulated Kepler-like time series with known stellar and planetary signals to evaluate the effectiveness of the KARPS procedures. The ARMA-type modeling is effective at reducing stellar noise, but also reduces and transforms the transit signal into ingress/egress spikes. A periodogram based on the TCF is constructed to concentrate the signal of these periodic spikes. When a periodic transit is found, the model is displayed on a standard period-folded averaged light curve. We also illustrate the efficient coding in R.
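KARPS works in R, but as a hedged sketch in this document's illustration language, the middle step (fit an ARMA-type model to a light curve and keep the residuals for a later transit search) can be shown with statsmodels. The simulated AR(1)-like flux, the (2,0,1) order, and the variable names are assumptions; the Transit Comb Filter step is not reproduced.

```python
# Hedged sketch: ARMA fitting and residual extraction on a simulated light curve.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(6)
n = 2000
noise = rng.normal(size=n)
flux = np.empty(n)
flux[0] = 0.0
for t in range(1, n):                      # AR(1)-like stellar variability
    flux[t] = 0.8 * flux[t - 1] + noise[t]

fit = ARIMA(flux, order=(2, 0, 1)).fit()   # ARMA(2,1); d=0 means no differencing
residuals = fit.resid                      # input to the subsequent transit search
print(fit.aic, residuals.std())
```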
Chen, Qiang; Chen, Yunhao; Jiang, Weiguo
2016-07-30
In the field of multiple-feature Object-Based Change Detection (OBCD) for very-high-resolution remotely sensed images, image objects have abundant features, and feature selection affects the precision and efficiency of OBCD. Through object-based image analysis, this paper proposes a Genetic Particle Swarm Optimization (GPSO)-based feature selection algorithm to solve the optimization problem of feature selection in multiple-feature OBCD. We select the Ratio of Mean to Variance (RMV) as the fitness function of GPSO and apply the proposed algorithm to the object-based hybrid multivariate alternative detection model. Two experimental cases on Worldview-2/3 images confirm that GPSO can significantly improve the speed of convergence and effectively avoid the problem of premature convergence, relative to other feature selection algorithms. According to the accuracy evaluation of OBCD, GPSO is superior to other algorithms in overall accuracy (84.17% and 83.59%) and Kappa coefficient (0.6771 and 0.6314). Moreover, the sensitivity analysis results show that the proposed algorithm is not easily influenced by the initial parameters, but the number of features to be selected and the size of the particle swarm would affect the algorithm. The comparison experiment results reveal that RMV is more suitable than other functions as the fitness function of the GPSO-based feature selection algorithm.
Affine Projection Algorithm with Improved Data-Selective Method Using the Condition Number
NASA Astrophysics Data System (ADS)
Ban, Sung Jun; Lee, Chang Woo; Kim, Sang Woo
Recently, a data-selective method has been proposed to achieve low misalignment in affine projection algorithm (APA) by keeping the condition number of an input data matrix small. We present an improved method, and a complexity reduction algorithm for the APA with the data-selective method. Experimental results show that the proposed algorithm has lower misalignment and a lower condition number for an input data matrix than both the conventional APA and the APA with the previous data-selective method.
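A hedged sketch of the standard affine projection update the two APA abstracts above build on, not the condition-number data-selective variant itself. The step size, regularization and matrix shapes are assumptions for illustration.

```python
# Hedged sketch: one regularized affine projection (APA) weight update.
import numpy as np

def apa_update(w, X, d, mu=0.5, delta=1e-3):
    """One APA step. X: (L x P) matrix of the P most recent length-L input vectors;
    d: length-P vector of the corresponding desired outputs."""
    e = d - X.T @ w                                   # a-priori errors
    P = X.shape[1]
    w = w + mu * X @ np.linalg.solve(X.T @ X + delta * np.eye(P), e)
    return w, e
```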
A selective-update affine projection algorithm with selective input vectors
NASA Astrophysics Data System (ADS)
Kong, NamWoong; Shin, JaeWook; Park, PooGyeon
2011-10-01
This paper proposes an affine projection algorithm (APA) with selective input vectors, which is based on the concept of selective update in order to reduce estimation errors and computations. The algorithm consists of two procedures: input-vector selection and state decision. The input-vector selection procedure determines the number of input vectors by checking with the mean square error (MSE) whether the input vectors have enough information for an update. The state-decision procedure determines the current state of the adaptive filter by using the state-decision criterion. While the adaptive filter is in the transient state, the algorithm updates the filter coefficients with the selected input vectors. On the other hand, as soon as the adaptive filter reaches the steady state, the update procedure is no longer performed. Through these two procedures, the proposed algorithm achieves small steady-state estimation errors, low computational complexity and low update complexity for colored input signals.
Xu, Libin; Li, Yang; Xu, Ning; Hu, Yong; Wang, Chao; He, Jianjun; Cao, Yueze; Chen, Shigui; Li, Dongsheng
2014-12-24
This work demonstrated the possibility of using artificial neural networks to classify soy sauce from China. The aroma profiles of different soy sauce samples were differentiated using headspace solid-phase microextraction. The soy sauce samples were analyzed by gas chromatography-mass spectrometry, and 22 and 15 volatile aroma compounds were selected for sensitivity analysis to classify the samples by fermentation and geographic region, respectively. The 15 selected samples can be classified by fermentation and geographic region with a prediction success rate of 100%. Furans and phenols represented the variables with the greatest contribution in classifying soy sauce samples by fermentation and geographic region, respectively.
Mean-variance model for portfolio optimization with background risk based on uncertainty theory
NASA Astrophysics Data System (ADS)
Zhai, Jia; Bai, Manying
2018-04-01
The aim of this paper is to develop a mean-variance model for portfolio optimization considering background risk, liquidity and transaction cost based on uncertainty theory. In the portfolio selection problem, the returns of securities and asset liquidity are assumed to be uncertain variables because of incidents or a lack of historical data, which are common in economic and social environments. We provide crisp forms of the model and a hybrid intelligent algorithm to solve it. Under a mean-variance framework, we analyze the portfolio frontier characteristics considering independently additive background risk. In addition, we discuss some effects of background risk and liquidity constraints on the portfolio selection. Finally, we demonstrate the proposed models by numerical simulations.
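For reference, a minimal sketch of the classical crisp mean-variance building block: the closed-form global minimum-variance weights for a fully invested portfolio. The covariance matrix is illustrative, and the paper's additions (background risk, liquidity, transaction costs, and uncertain variables) are not captured here.

```python
# Hedged sketch: closed-form global minimum-variance portfolio weights.
import numpy as np

Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])      # illustrative covariance of 3 assets

ones = np.ones(3)
w = np.linalg.solve(Sigma, ones)
w /= ones @ w                               # weights sum to 1
print(w, w @ Sigma @ w)                     # weights and resulting portfolio variance
```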
Optimal design of compact spur gear reductions
NASA Technical Reports Server (NTRS)
Savage, M.; Lattime, S. B.; Kimmel, J. A.; Coe, H. H.
1992-01-01
The optimal design of compact spur gear reductions includes the selection of bearing and shaft proportions in addition to gear mesh parameters. Designs for single mesh spur gear reductions are based on optimization of system life, system volume, and system weight including gears, support shafts, and the four bearings. The overall optimization allows component properties to interact, yielding the best composite design. A modified feasible directions search algorithm directs the optimization through a continuous design space. Interpolated polynomials expand the discrete bearing properties and proportions into continuous variables for optimization. After finding the continuous optimum, the designer can analyze near optimal designs for comparison and selection. Design examples show the influence of the bearings on the optimal configurations.
Use of EPANET solver to manage water distribution in Smart City
NASA Astrophysics Data System (ADS)
Antonowicz, A.; Brodziak, R.; Bylka, J.; Mazurkiewicz, J.; Wojtecki, S.; Zakrzewski, P.
2018-02-01
This paper presents a method of using the EPANET solver to support the management of a water distribution system in a Smart City. The main task is to develop an application that allows remote access to the simulation model of the water distribution network developed in the EPANET environment. The application allows performing both single and cyclic simulations with a specified step for changing the values of the selected process variables. The architecture of the application is shown in the paper. The application supports the selection of the best device control algorithm using optimization methods. Optimization procedures are possible with the following methods: brute force, SLSQP (Sequential Least SQuares Programming), and the Modified Powell Method. The article is supplemented by an example of using the developed computer tool.
2012-11-01
ICES REPORT 12-43, November 2012: Functional Entropy Variables: A New Methodology for Deriving Thermodynamically Consistent Algorithms for Complex … The report presents a predictor stage and a multicorrector stage for advancing the discrete density and velocity variables, with the corrector given by equations (157)-(178), and adopts the preconditioned GMRES algorithm [53] from PETSc [2] to solve the resulting linear system.
Variable-Metric Algorithm For Constrained Optimization
NASA Technical Reports Server (NTRS)
Frick, James D.
1989-01-01
The Variable Metric Algorithm for Constrained Optimization (VMACO) is a nonlinear computer program developed to calculate the least value of a function of n variables subject to general constraints, both equality and inequality. The first set of constraints are equalities and the remaining constraints are inequalities. The program utilizes an iterative method in seeking the optimal solution and is written in ANSI Standard FORTRAN 77.
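A hedged sketch of the problem class VMACO targets, minimizing a function of several variables under equality and inequality constraints, solved here with SciPy's SLSQP routine rather than the VMACO code itself. The objective, constraints and starting point are arbitrary.

```python
# Hedged sketch: constrained minimization with equality and inequality constraints.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2 + x[2] ** 2
constraints = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] + x[2] - 1.0},   # equality constraint
    {"type": "ineq", "fun": lambda x: x[1] + 0.2},                 # inequality: x1 >= -0.2
]
res = minimize(f, x0=np.zeros(3), method="SLSQP", constraints=constraints)
print(res.x, res.fun)
```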
Lim, Jun-Seok; Pang, Hee-Suk
2016-01-01
In this paper, an ℓ1-regularized recursive total least squares (RTLS) algorithm is considered for sparse system identification. Although recursive least squares (RLS) has been successfully applied in sparse system identification, the estimation performance of RLS-based algorithms becomes worse when both input and output are contaminated by noise (the error-in-variables problem). We propose an algorithm to handle the error-in-variables problem. The proposed ℓ1-RTLS algorithm is an RLS-like iteration using ℓ1 regularization. The proposed algorithm not only gives excellent performance but also reduces the required complexity through effective handling of the inversion matrix. Simulations demonstrate the superiority of the proposed ℓ1-regularized RTLS in the sparse system identification setting.
Multiple-variable neighbourhood search for the single-machine total weighted tardiness problem
NASA Astrophysics Data System (ADS)
Chung, Tsui-Ping; Fu, Qunjie; Liao, Ching-Jong; Liu, Yi-Ting
2017-07-01
The single-machine total weighted tardiness (SMTWT) problem is a typical discrete combinatorial optimization problem in the scheduling literature. This problem has been proved to be NP-hard and thus provides a challenging area for metaheuristics, especially the variable neighbourhood search algorithm. In this article, a multiple variable neighbourhood search (m-VNS) algorithm with multiple neighbourhood structures is proposed to solve the problem. Special mechanisms named matching and strengthening operations are employed in the algorithm, which has an auto-revising local search procedure to explore the solution space beyond local optimality. Two aspects, searching direction and searching depth, are considered, and neighbourhood structures are systematically exchanged. Experimental results show that the proposed m-VNS algorithm outperforms all the compared algorithms in solving the SMTWT problem.
An improved VSS NLMS algorithm for active noise cancellation
NASA Astrophysics Data System (ADS)
Sun, Yunzhuo; Wang, Mingjiang; Han, Yufei; Zhang, Congyan
2017-08-01
In this paper, an improved variable step size NLMS algorithm is proposed. NLMS has a fast convergence rate and low steady-state error compared to other traditional adaptive filtering algorithms. However, there is a trade-off between convergence speed and steady-state error that affects the performance of the NLMS algorithm. We therefore propose a new variable step size NLMS algorithm that dynamically changes the step size according to the current error and the iteration count. The proposed algorithm has a simple formulation and easily set parameters, and effectively resolves this trade-off. The simulation results show that the proposed algorithm simultaneously achieves good tracking ability, a fast convergence rate and low steady-state error.
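A hedged sketch of an NLMS filter with a simple error-dependent step size. The step-size rule below is only illustrative; the paper's exact rule, which also uses the iteration count, is not reproduced.

```python
# Hedged sketch: normalized LMS with a variable step size driven by the error magnitude.
import numpy as np

def vss_nlms(x, d, L=16, mu_max=1.0, eps=1e-6):
    """Adapt a length-L filter so that x filtered by w tracks the desired signal d."""
    w = np.zeros(L)
    e_hist = np.zeros(len(x))
    for n in range(L, len(x)):
        u = x[n - L:n][::-1]                    # most recent L input samples
        e = d[n] - w @ u
        mu = mu_max * e**2 / (e**2 + 1.0)       # larger error -> larger step (illustrative rule)
        w += (mu / (eps + u @ u)) * e * u       # normalized LMS update
        e_hist[n] = e
    return w, e_hist
```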
Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
2013-01-01
Background Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming increasingly important, as early detection is key to improving survival and recovery rates. To this end, the serum glycome has been proposed as a potential source of biomarkers for different types of cancers. High-throughput hydrophilic interaction liquid chromatography (HILIC) technology for glycan analysis allows for the detailed quantification of the glycan content in human serum. However, the experimental data from this analysis is compositional by nature. Compositional data are subject to a constant-sum constraint, which restricts the sample space to a simplex. Statistical analysis of glycan chromatography datasets should account for their unusual mathematical properties. As the volume of glycan HILIC data being produced increases, there is a considerable need for a framework to support appropriate statistical analysis. Proposed here is a methodology for feature selection in compositional data. The principal objective is to provide a template for the analysis of glycan chromatography data that may be used to identify potential glycan biomarkers. Results A greedy search algorithm, based on the generalized Dirichlet distribution, is carried out over the feature space to search for the set of “grouping variables” that best discriminate between known group structures in the data, modelling the compositional variables using beta distributions. The algorithm is applied to two glycan chromatography datasets. Statistical classification methods are used to test the ability of the selected features to differentiate between known groups in the data. Two well-known methods are used for comparison: correlation-based feature selection (CFS) and recursive partitioning (rpart). CFS is a feature selection method, while recursive partitioning is a learning tree algorithm that has been used for feature selection in the past. Conclusions The proposed feature selection method performs well for both glycan chromatography datasets. It is computationally slower, but results in a lower misclassification rate and a higher sensitivity rate than both correlation-based feature selection and the classification tree method. PMID:23651459
Development and Testing of Data Mining Algorithms for Earth Observation
NASA Technical Reports Server (NTRS)
Glymour, Clark
2005-01-01
The new algorithms developed under this project included a principled procedure for classification of objects, events or circumstances according to a target variable when a very large number of potential predictor variables is available but the number of cases that can be used for training a classifier is relatively small. These "high dimensional" problems require finding a minimal set of variables (called the Markov Blanket) sufficient for predicting the value of the target variable. An algorithm, the Markov Blanket Fan Search, was developed, implemented and tested on both simulated and real data in conjunction with a graphical model classifier, which was also implemented. Another algorithm developed and implemented in TETRAD IV for time series elaborated on work by C. Granger and N. Swanson, which in turn exploited some of our earlier work. The algorithms in question learn a linear time series model from data. Given such a time series, the simultaneous residual covariances, after factoring out time dependencies, may provide information about causal processes that occur more rapidly than the time series representation allows, so-called simultaneous or contemporaneous causal processes. Working with A. Monetta, a graduate student from Italy, we produced the correct statistics for estimating the contemporaneous causal structure from time series data using the TETRAD IV suite of algorithms. Two economists, David Bessler and Kevin Hoover, have independently published applications using TETRAD style algorithms to the same purpose. These implementations and algorithmic developments were separately used in two kinds of studies of climate data: short time series of geographically proximate climate variables predicting agricultural effects in California, and longer duration climate measurements of temperature teleconnections.
Revisiting negative selection algorithms.
Ji, Zhou; Dasgupta, Dipankar
2007-01-01
This paper reviews the progress of negative selection algorithms, an anomaly/change detection approach in Artificial Immune Systems (AIS). Following its initial model, we try to identify the fundamental characteristics of this family of algorithms and summarize their diversities. There exist various elements in this method, including data representation, coverage estimate, affinity measure, and matching rules, which are discussed for different variations. The various negative selection algorithms are categorized by different criteria as well. The relationship and possible combinations with other AIS or other machine learning methods are discussed. Prospective development and applicability of negative selection algorithms and their influence on related areas are then speculated based on the discussion.
ERIC Educational Resources Information Center
Fuwa, Minori; Kayama, Mizue; Kunimune, Hisayoshi; Hashimoto, Masami; Asano, David K.
2015-01-01
We have explored educational methods for algorithmic thinking for novices and implemented a block programming editor and a simple learning management system. In this paper, we propose a program/algorithm complexity metric specified for novice learners. This metric is based on the variable usage in arithmetic and relational formulas in learner's…
NASA Technical Reports Server (NTRS)
Hooker, Stanford B. (Editor); Firestone, Elaine R. (Editor); Acker, James G. (Editor); Campbell, Janet W.; Blaisdell, John M.; Darzi, Michael
1995-01-01
The level-3 data products from the Sea-viewing Wide Field-of-view Sensor (SeaWiFS) are statistical data sets derived from level-2 data. Each data set will be based on a fixed global grid of equal-area bins that are approximately 9 x 9 sq km. Statistics available for each bin include the sum and sum of squares of the natural logarithm of derived level-2 geophysical variables where sums are accumulated over a binning period. Operationally, products with binning periods of 1 day, 8 days, 1 month, and 1 year will be produced and archived. From these accumulated values and for each bin, estimates of the mean, standard deviation, median, and mode may be derived for each geophysical variable. This report contains two major parts: the first (Section 2) is intended as a users' guide for level-3 SeaWiFS data products. It contains an overview of level-0 to level-3 data processing, a discussion of important statistical considerations when using level-3 data, and details of how to use the level-3 data. The second part (Section 3) presents a comparative statistical study of several binning algorithms based on CZCS and moored fluorometer data. The operational binning algorithms were selected based on the results of this study.
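A minimal sketch of deriving per-bin statistics from the accumulated quantities the abstract describes: the sum and sum of squares of the natural logarithm of a level-2 variable, plus the sample count for the bin. The helper name and sample values are assumptions for illustration.

```python
# Hedged sketch: mean, standard deviation and geometric mean from accumulated log sums.
import numpy as np

def bin_stats(sum_log, sum_log_sq, n):
    """Mean and standard deviation of ln(x), and the geometric mean of x, for one bin."""
    mean_log = sum_log / n
    var_log = max(sum_log_sq / n - mean_log ** 2, 0.0)
    return mean_log, np.sqrt(var_log), np.exp(mean_log)

samples = np.array([0.12, 0.09, 0.15, 0.11])        # e.g. chlorophyll values falling in a bin
logs = np.log(samples)
print(bin_stats(logs.sum(), (logs ** 2).sum(), len(samples)))
```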
NASA Astrophysics Data System (ADS)
Zhou, Xin; Jun, Sun; Zhang, Bing; Jun, Wu
2017-07-01
In order to improve the reliability of the spectral features extracted by the wavelet transform, a method combining the wavelet transform (WT) with the bacterial colony chemotaxis and support vector machine (BCC-SVM) algorithm (WT-BCC-SVM) was proposed in this paper. In addition, we aimed to identify different kinds of pesticide residues on lettuce leaves in a novel, rapid and non-destructive way by using fluorescence spectral technology. The fluorescence spectral data of 150 lettuce leaf samples with five different kinds of pesticide residues on the surface of the lettuce were obtained using a Cary Eclipse fluorescence spectrometer. Standard normalized variable detrending (SNV detrending) and Savitzky-Golay coupled with SNV detrending (SG-SNV detrending) were used to preprocess the raw spectra. Bacterial colony chemotaxis combined with support vector machine (BCC-SVM) and support vector machine (SVM) classification models were established based on the full spectra (FS) and the wavelet transform characteristics (WTC), respectively, with the WTC selected by the WT. The results showed that the accuracies of the training, calibration and prediction sets of the best classification model (SG-SNV detrending-WT-BCC-SVM) were 100%, 98% and 93.33%, respectively. In addition, the results indicated that it is feasible to use WT-BCC-SVM to establish a diagnostic model of different kinds of pesticide residues on lettuce leaves.
Modulation Depth Estimation and Variable Selection in State-Space Models for Neural Interfaces
Hochberg, Leigh R.; Donoghue, John P.; Brown, Emery N.
2015-01-01
Rapid developments in neural interface technology are making it possible to record increasingly large signal sets of neural activity. Various factors such as asymmetrical information distribution and across-channel redundancy may, however, limit the benefit of high-dimensional signal sets, and the increased computational complexity may not yield corresponding improvement in system performance. High-dimensional system models may also lead to overfitting and lack of generalizability. To address these issues, we present a generalized modulation depth measure using the state-space framework that quantifies the tuning of a neural signal channel to relevant behavioral covariates. For a dynamical system, we develop computationally efficient procedures for estimating modulation depth from multivariate data. We show that this measure can be used to rank neural signals and select an optimal channel subset for inclusion in the neural decoding algorithm. We present a scheme for choosing the optimal subset based on model order selection criteria. We apply this method to neuronal ensemble spike-rate decoding in neural interfaces, using our framework to relate motor cortical activity with intended movement kinematics. With offline analysis of intracortical motor imagery data obtained from individuals with tetraplegia using the BrainGate neural interface, we demonstrate that our variable selection scheme is useful for identifying and ranking the most information-rich neural signals. We demonstrate that our approach offers several orders of magnitude lower complexity but virtually identical decoding performance compared to greedy search and other selection schemes. Our statistical analysis shows that the modulation depth of human motor cortical single-unit signals is well characterized by the generalized Pareto distribution. Our variable selection scheme has wide applicability in problems involving multisensor signal modeling and estimation in biomedical engineering systems. PMID:25265627
Rapid Calculation of Spacecraft Trajectories Using Efficient Taylor Series Integration
NASA Technical Reports Server (NTRS)
Scott, James R.; Martini, Michael C.
2011-01-01
A variable-order, variable-step Taylor series integration algorithm was implemented in NASA Glenn's SNAP (Spacecraft N-body Analysis Program) code. SNAP is a high-fidelity trajectory propagation program that can propagate the trajectory of a spacecraft about virtually any body in the solar system. The Taylor series algorithm's very high order accuracy and excellent stability properties lead to large reductions in computer time relative to the code's existing 8th order Runge-Kutta scheme. Head-to-head comparison on near-Earth, lunar, Mars, and Europa missions showed that Taylor series integration is 15.8 times faster than Runge- Kutta on average, and is more accurate. These speedups were obtained for calculations involving central body, other body, thrust, and drag forces. Similar speedups have been obtained for calculations that include J2 spherical harmonic for central body gravitation. The algorithm includes a step size selection method that directly calculates the step size and never requires a repeat step. High-order Taylor series integration algorithms have been shown to provide major reductions in computer time over conventional integration methods in numerous scientific applications. The objective here was to directly implement Taylor series integration in an existing trajectory analysis code and demonstrate that large reductions in computer time (order of magnitude) could be achieved while simultaneously maintaining high accuracy. This software greatly accelerates the calculation of spacecraft trajectories. At each time level, the spacecraft position, velocity, and mass are expanded in a high-order Taylor series whose coefficients are obtained through efficient differentiation arithmetic. This makes it possible to take very large time steps at minimal cost, resulting in large savings in computer time. The Taylor series algorithm is implemented primarily through three subroutines: (1) a driver routine that automatically introduces auxiliary variables and sets up initial conditions and integrates; (2) a routine that calculates system reduced derivatives using recurrence relations for quotients and products; and (3) a routine that determines the step size and sums the series. The order of accuracy used in a trajectory calculation is arbitrary and can be set by the user. The algorithm directly calculates the motion of other planetary bodies and does not require ephemeris files (except to start the calculation). The code also runs with Taylor series and Runge-Kutta used interchangeably for different phases of a mission.
Al-Rajab, Murad; Lu, Joan; Xu, Qiang
2017-07-01
This paper examines the accuracy and efficiency (time complexity) of high performance genetic data feature selection and classification algorithms for colon cancer diagnosis. The need for this research derives from the urgent and increasing need for accurate and efficient algorithms. Colon cancer is a leading cause of death worldwide, hence it is vitally important for the cancer tissues to be expertly identified and classified in a rapid and timely manner, both to assure fast detection of the disease and to expedite the drug discovery process. In this research, a three-phase approach was proposed and implemented: Phases One and Two examined the feature selection algorithms and classification algorithms employed separately, and Phase Three examined the performance of the combination of these. It was found from Phase One that the Particle Swarm Optimization (PSO) algorithm performed best with the colon dataset as a feature selection method (29 genes selected) and from Phase Two that the Support Vector Machine (SVM) algorithm outperformed the other classification algorithms, with an accuracy of almost 86%. It was also found from Phase Three that the combined use of PSO and SVM surpassed other algorithms in accuracy and performance, and was faster in terms of time analysis (94%). It is concluded that applying feature selection algorithms prior to classification algorithms results in better accuracy than when the latter are applied alone. This conclusion is important and significant to industry and society. Copyright © 2017 Elsevier B.V. All rights reserved.
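The evaluation step that such a PSO wrapper repeats can be sketched compactly. The following assumes synthetic expression data and a synthetic gene mask; it shows only the fitness evaluation (SVM cross-validation accuracy on a candidate gene subset), not the paper's PSO update rules.

```python
# Hedged sketch of the subset-evaluation step inside a PSO + SVM wrapper:
# score a candidate gene subset (boolean mask) by SVM cross-validation accuracy.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.standard_normal((62, 2000))          # e.g. 62 colon samples x 2000 genes (synthetic)
y = rng.integers(0, 2, 62)                   # tumour / normal labels (synthetic)

def fitness(mask):
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=5).mean()

candidate = rng.random(X.shape[1]) < 0.015   # ~30 genes "selected" by the swarm
print("genes:", candidate.sum(), "CV accuracy:", round(fitness(candidate), 3))
```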
Yu, Qiang; Wei, Dingbang; Huo, Hongwei
2018-06-18
Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
Structure-activity relationships between sterols and their thermal stability in oil matrix.
Hu, Yinzhou; Xu, Junli; Huang, Weisu; Zhao, Yajing; Li, Maiquan; Wang, Mengmeng; Zheng, Lufei; Lu, Baiyi
2018-08-30
Structure-activity relationships between 20 sterols and their thermal stabilities were studied in a model oil system. All sterol degradations were found to be consistent with a first-order kinetic model, with coefficients of determination (R²) higher than 0.9444. The number of double bonds in the sterol structure was negatively correlated with the thermal stability of the sterol, whereas the length of the branch chain was positively correlated with its thermal stability. A quantitative structure-activity relationship (QSAR) model to predict the thermal stability of sterols was developed using partial least squares regression (PLSR) combined with a genetic algorithm (GA). A regression model was built with an R² of 0.806. Almost all sterol degradation constants could be predicted accurately, with a cross-validation R² of 0.680. Four important variables were selected in the optimal QSAR model; the selected variables were related to information indices, RDF descriptors, and 3D-MoRSE descriptors. Copyright © 2018 Elsevier Ltd. All rights reserved.
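The first-order kinetic fit behind the reported R² values is simple enough to show directly. The sketch below assumes invented concentration data; it estimates the rate constant from ln(C_t/C_0) = -k t by ordinary least squares.

```python
# Minimal sketch of a first-order degradation fit: ln(C) is linear in time with
# slope -k. The heating times and remaining-sterol percentages are invented.
import numpy as np

t = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0])            # heating time (h)
C = np.array([100.0, 88.0, 79.0, 62.0, 39.0, 15.0])     # remaining sterol (%)

slope, ln_c0 = np.polyfit(t, np.log(C), 1)
k = -slope                                               # first-order rate constant
pred = ln_c0 + slope * t
ss_res = np.sum((np.log(C) - pred) ** 2)
ss_tot = np.sum((np.log(C) - np.log(C).mean()) ** 2)
print("rate constant k = %.3f 1/h, R^2 = %.4f" % (k, 1 - ss_res / ss_tot))
```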
Tumour auto-contouring on 2d cine MRI for locally advanced lung cancer: A comparative study.
Fast, Martin F; Eiben, Björn; Menten, Martin J; Wetscherek, Andreas; Hawkes, David J; McClelland, Jamie R; Oelfke, Uwe
2017-12-01
Radiotherapy guidance based on magnetic resonance imaging (MRI) is currently becoming a clinical reality. Fast 2d cine MRI sequences are expected to increase the precision of radiation delivery by facilitating tumour delineation during treatment. This study compares four auto-contouring algorithms for the task of delineating the primary tumour in six locally advanced (LA) lung cancer patients. Twenty-two cine MRI sequences were acquired using either a balanced steady-state free precession or a spoiled gradient echo imaging technique. Contours derived by the auto-contouring algorithms were compared against manual reference contours. A selection of eight image data sets was also used to assess the inter-observer delineation uncertainty. Algorithmically derived contours agreed well with the manual reference contours (median Dice similarity index: ⩾0.91). Multi-template matching and deformable image registration performed significantly better than feature-driven registration and the pulse-coupled neural network (PCNN). Neither MRI sequence nor image orientation was a conclusive predictor for algorithmic performance. Motion significantly degraded the performance of the PCNN. The inter-observer variability was of the same order of magnitude as the algorithmic performance. Auto-contouring of tumours on cine MRI is feasible in LA lung cancer patients. Despite large variations in implementation complexity, the different algorithms all have relatively similar performance. Copyright © 2017 The Author(s). Published by Elsevier B.V. All rights reserved.
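The Dice similarity index used as the agreement measure above has a one-line definition; a minimal sketch on two synthetic binary masks follows.

```python
# Sketch of the Dice similarity index used to compare auto-contours against manual
# reference contours, evaluated on two synthetic binary masks.
import numpy as np

def dice(a, b):
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

ref = np.zeros((128, 128), dtype=bool); ref[40:90, 40:90] = True    # manual contour
auto = np.zeros_like(ref);              auto[45:95, 42:92] = True   # algorithmic contour
print("Dice similarity index: %.3f" % dice(ref, auto))
```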
Algorithme intelligent d'optimisation d'un design structurel de grande envergure [Intelligent optimization algorithm for a large-scale structural design]
NASA Astrophysics Data System (ADS)
Dominique, Stephane
The implementation of an automated decision support system in the field of design and structural optimisation can give a significant advantage to any industry working on mechanical designs. Indeed, by providing solution ideas to a designer or by upgrading existing design solutions while the designer is not at work, the system may reduce the project cycle time, or allow more time to produce a better design. This thesis presents a new approach to automate a design process based on Case-Based Reasoning (CBR), in combination with a new genetic algorithm named Genetic Algorithm with Territorial core Evolution (GATE). This approach was developed in order to reduce the operating cost of the process. However, as the system implementation cost is quite high, the approach is better suited to large-scale design problems, and particularly to design problems that the designer plans to solve for many different specification sets. First, the CBR process uses a databank filled with every known solution to similar design problems. Then, the closest solutions to the current problem in terms of specifications are selected. After this, during the adaptation phase, an artificial neural network (ANN) interpolates amongst known solutions to produce an additional solution to the current problem, using the current specifications as inputs. Each solution produced and selected by the CBR is then used to initialize the population of an island of the genetic algorithm. The algorithm optimises the solution further during the refinement phase. Using progressive refinement, the algorithm starts with only the most important variables for the problem. Then, as the optimisation progresses, the remaining variables are gradually introduced, layer by layer. The genetic algorithm used is a new algorithm specifically created during this thesis to solve optimisation problems from the field of mechanical device structural design. The algorithm, named GATE, is essentially a real-number genetic algorithm that prevents new individuals from being born too close to previously evaluated solutions. The restricted area becomes smaller or larger during the optimisation to allow global or local search when necessary. Also, a new search operator named the Substitution Operator is incorporated in GATE. This operator allows an ANN surrogate model to guide the algorithm toward the most promising areas of the design space. The suggested CBR approach and GATE were tested on several simple test problems, as well as on the industrial problem of designing a gas turbine engine rotor's disc. These results are compared to results obtained for the same problems by many other popular optimisation algorithms, such as (depending on the problem) gradient algorithms, a binary genetic algorithm, a real-number genetic algorithm, a genetic algorithm using multiple-parent crossovers, a differential evolution genetic algorithm, the Hooke & Jeeves generalized pattern search method, and POINTER from the software I-SIGHT 3.5. Results show that GATE is quite competitive, giving the best results for 5 of the 6 constrained optimisation problems. GATE also provided the best results of all on problems produced by a Maximum Set Gaussian landscape generator. Finally, GATE provided a disc 4.3% lighter than the best other tested algorithm (POINTER) for the gas turbine engine rotor's disc problem. One drawback of GATE is a lower efficiency on highly multimodal unconstrained problems, for which it gave quite poor results relative to its implementation cost.
To conclude, according to the preliminary results obtained during this thesis, the suggested CBR process, combined with GATE, seems to be a very good candidate to automate and accelerate the structural design of mechanical devices, potentially reducing significantly the cost of industrial preliminary design processes.
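GATE's territorial mechanism described above can be illustrated with a few lines. The sketch below is a hedged interpretation, not the thesis implementation: the exclusion-radius update rule, the 2-D search space, and the acceptance loop are assumptions made for illustration.

```python
# Hedged sketch of a territorial acceptance check: reject offspring that fall inside
# an exclusion radius around previously evaluated solutions, and shrink the radius
# when the space gets crowded so that local search is still possible.
import numpy as np

def territory_ok(candidate, evaluated, radius):
    """True if the candidate is farther than `radius` from every evaluated point."""
    if len(evaluated) == 0:
        return True
    d = np.linalg.norm(np.asarray(evaluated) - candidate, axis=1)
    return bool(d.min() > radius)

rng = np.random.default_rng(0)
evaluated, radius = [], 0.2
for _ in range(200):
    child = rng.random(2)                      # offspring from crossover/mutation (stand-in)
    if territory_ok(child, evaluated, radius):
        evaluated.append(child)                # evaluate and archive it
    else:
        radius *= 0.95                         # too crowded: allow finer local search
print("accepted evaluations:", len(evaluated), "final radius:", round(radius, 3))
```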
Selectionist and Evolutionary Approaches to Brain Function: A Critical Appraisal
Fernando, Chrisantha; Szathmáry, Eörs; Husbands, Phil
2012-01-01
We consider approaches to brain dynamics and function that have been claimed to be Darwinian. These include Edelman’s theory of neuronal group selection, Changeux’s theory of synaptic selection and selective stabilization of pre-representations, Seung’s Darwinian synapse, Loewenstein’s synaptic melioration, Adam’s selfish synapse, and Calvin’s replicating activity patterns. Except for the last two, the proposed mechanisms are selectionist but not truly Darwinian, because no replicators with information transfer to copies and hereditary variation can be identified in them. All of them fit, however, a generalized selectionist framework conforming to the picture of Price’s covariance formulation, which deliberately was not specific even to selection in biology, and therefore does not imply an algorithmic picture of biological evolution. Bayesian models and reinforcement learning are formally in agreement with selection dynamics. A classification of search algorithms is shown to include Darwinian replicators (evolutionary units with multiplication, heredity, and variability) as the most powerful mechanism for search in a sparsely occupied search space. Examples are given of cases where parallel competitive search with information transfer among the units is more efficient than search without information transfer between units. Finally, we review our recent attempts to construct and analyze simple models of true Darwinian evolutionary units in the brain in terms of connectivity and activity copying of neuronal groups. Although none of the proposed neuronal replicators include miraculous mechanisms, their identification remains a challenge but also a great promise. PMID:22557963
Kebede, Mihiretu; Zegeye, Desalegn Tigabu; Zeleke, Berihun Megabiaw
2017-12-01
To monitor the progress of therapy and disease progression, periodic CD4 counts are required throughout the course of HIV/AIDS care and support. The demand for CD4 count measurement has increased as ART programs have expanded over the last decade. This study aimed to predict CD4 count changes and to identify the predictors of CD4 count changes among patients on ART. A cross-sectional study was conducted at the University of Gondar Hospital on 3,104 adult patients on ART with CD4 counts measured at least twice (baseline and most recent). Data were retrieved from the HIV care clinic electronic database and patients' charts. Descriptive data were analyzed with SPSS version 20. The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology was followed to undertake the study. WEKA version 3.8 was used to conduct the predictive data mining. Before building the predictive data mining models, information gain values and correlation-based feature selection methods were used for attribute selection. Variables were ranked according to their relevance based on their information gain values. J48, Neural Network, and Random Forest algorithms were tested to assess model accuracy. The median duration of ART was 191.5 weeks. The mean CD4 count change was 243 (SD 191.14) cells per microliter. Overall, 2427 (78.2%) patients had their CD4 counts increased by at least 100 cells per microliter, while 4% had a decline from the baseline CD4 value. Baseline variables including age, educational status, CD8 count, ART regimen, and hemoglobin levels predicted CD4 count changes, with predictive accuracies of J48, Neural Network, and Random Forest being 87.1%, 83.5%, and 99.8%, respectively. The Random Forest algorithm achieved a higher predictive accuracy than both J48 and the Artificial Neural Network. The precision, sensitivity and recall values of Random Forest were also more than 99%. Nearly accurate prediction results were obtained using the Random Forest algorithm. This algorithm could be used in a low-resource setting to build a web-based prediction model for CD4 count changes. Copyright © 2017 Elsevier B.V. All rights reserved.
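The attribute-ranking and classification steps can be sketched with scikit-learn stand-ins rather than WEKA. In the sketch below the data, the number of retained attributes, and the use of mutual information as an information-gain proxy are all assumptions; only the overall pattern (rank attributes, then fit a random forest on the top-ranked ones) mirrors the study.

```python
# Hedged sketch: rank attributes by information gain (approximated here by mutual
# information) and fit a random forest on the highest-ranked attributes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=25, n_informative=6, random_state=0)

gain = mutual_info_classif(X, y, random_state=0)
top = np.argsort(gain)[::-1][:8]                       # keep the 8 most informative attributes

rf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(rf, X[:, top], y, cv=5).mean()
print("selected attributes:", top, "CV accuracy: %.3f" % acc)
```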
Fast object detection algorithm based on HOG and CNN
NASA Astrophysics Data System (ADS)
Lu, Tongwei; Wang, Dandan; Zhang, Yanduo
2018-04-01
Object classification and object detection are widely used in many areas of computer vision. Traditional object detection has two main problems: the sliding-window region selection strategy has high time complexity and produces redundant windows, and the extracted features are not sufficiently robust. To address these problems, a Region Proposal Network (RPN) is used to select candidate regions instead of the selective search algorithm. Compared with traditional algorithms and the selective search algorithm, the RPN has higher efficiency and accuracy. We combine HOG features and a convolutional neural network (CNN) to extract features, and we use an SVM for classification. For TorontoNet, our algorithm's mAP is 1.6 percentage points higher; for OxfordNet, it is 1.3 percentage points higher.
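Only the HOG + SVM part of the pipeline is simple enough to sketch here; the RPN and CNN stages are omitted. The image patches, labels, and HOG parameters below are assumptions chosen for illustration.

```python
# Hedged sketch of the HOG + linear SVM component: extract HOG descriptors with
# scikit-image and train a linear SVM on synthetic grayscale patches.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
patches = rng.random((200, 64, 64))                    # stand-in 64x64 grayscale patches
labels = rng.integers(0, 2, 200)                       # object / background (synthetic)

feats = np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2)) for p in patches])

Xtr, Xte, ytr, yte = train_test_split(feats, labels, test_size=0.3, random_state=0)
clf = LinearSVC(max_iter=5000).fit(Xtr, ytr)
print("held-out accuracy: %.3f" % clf.score(Xte, yte))
```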
NASA Astrophysics Data System (ADS)
WANG, Qingrong; ZHU, Changfeng
2017-06-01
Integration of distributed heterogeneous data sources is a key issue in big data applications. In this paper, the strategy of variable precision is introduced into the concept lattice, and a one-to-one mapping between the variable precision concept lattice and the ontology concept lattice is constructed to produce a local ontology by building a variable precision concept lattice for each subsystem. A distributed generation algorithm for variable precision concept lattices based on an ontology of heterogeneous databases is then proposed, drawing on the special relationship between concept lattices and ontology construction. Finally, taking the main concept lattice generated from the existing heterogeneous databases as the standard, a case study is carried out to test the feasibility and validity of the algorithm, and the differences between the main concept lattice and the standard concept lattice are compared. The analysis results show that the algorithm can automatically carry out the construction of a distributed concept lattice over heterogeneous data sources.
NASA Astrophysics Data System (ADS)
Berselli, Luigi C.; Spirito, Stefano
2018-06-01
Obtaining reliable numerical simulations of turbulent fluids is a challenging problem in computational fluid mechanics. The large eddy simulation (LES) models are efficient tools to approximate turbulent fluids, and an important step in the validation of these models is the ability to reproduce relevant properties of the flow. In this paper, we consider a fully discrete approximation of the Navier-Stokes-Voigt model by an implicit Euler algorithm (with respect to the time variable) and a Fourier-Galerkin method (in the space variables). We prove the convergence to weak solutions of the incompressible Navier-Stokes equations satisfying the natural local entropy condition, hence selecting the so-called physically relevant solutions.
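For orientation, the Navier-Stokes-Voigt system referred to above is commonly written as follows, with alpha > 0 the Voigt regularization length and nu the viscosity; this standard form is given as context and is not quoted from the paper.

```latex
% Commonly cited form of the Navier-Stokes-Voigt regularization of the
% incompressible Navier-Stokes equations.
\begin{aligned}
\partial_t\bigl(u - \alpha^{2}\Delta u\bigr) + (u\cdot\nabla)u - \nu\Delta u + \nabla p &= f,\\
\nabla\cdot u &= 0 .
\end{aligned}
```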
Novel and efficient tag SNPs selection algorithms.
Chen, Wen-Pei; Hung, Che-Lun; Tsai, Suh-Jen Jane; Lin, Yaw-Ling
2014-01-01
SNPs are the most abundant forms of genetic variation amongst species; association studies between complex diseases and SNPs or haplotypes have received great attention. However, these studies are restricted by the cost of genotyping all SNPs; thus, it is necessary to find smaller subsets, or tag SNPs, representing the rest of the SNPs. In fact, the existing tag SNP selection algorithms are notoriously time-consuming. An efficient algorithm for tag SNP selection is presented and applied to analyze the HapMap YRI data. The experimental results show that the proposed algorithm achieves better performance than the existing tag SNP selection algorithms; in most cases, it is at least ten times faster than the existing methods. In many cases, when the redundant ratio of the block is high, the proposed algorithm can even be thousands of times faster than the previously known methods. Tools and web services for haplotype block analysis, integrated with the Hadoop MapReduce framework, have also been developed using the proposed algorithm as their computation kernel.
Chen, Qiang; Chen, Yunhao; Jiang, Weiguo
2016-01-01
In the field of multiple features Object-Based Change Detection (OBCD) for very-high-resolution remotely sensed images, image objects have abundant features, and feature selection affects the precision and efficiency of OBCD. Through object-based image analysis, this paper proposes a Genetic Particle Swarm Optimization (GPSO)-based feature selection algorithm to solve the optimization problem of feature selection in multiple features OBCD. We select the Ratio of Mean to Variance (RMV) as the fitness function of GPSO, and apply the proposed algorithm to the object-based hybrid multivariate alternative detection model. Two experiment cases on Worldview-2/3 images confirm that GPSO can significantly improve the speed of convergence and effectively avoid the problem of premature convergence, relative to other feature selection algorithms. According to the accuracy evaluation of OBCD, GPSO achieves higher overall accuracy (84.17% and 83.59%) and Kappa coefficients (0.6771 and 0.6314) than the other algorithms. Moreover, the sensitivity analysis results show that the proposed algorithm is not easily influenced by the initial parameters, but the number of features to be selected and the size of the particle swarm do affect the algorithm. The comparison experiment results reveal that RMV is more suitable than other functions as the fitness function of the GPSO-based feature selection algorithm. PMID:27483285
Online selective kernel-based temporal difference learning.
Chen, Xingguo; Gao, Yang; Wang, Ruili
2013-12-01
In this paper, an online selective kernel-based temporal difference (OSKTD) learning algorithm is proposed to deal with large scale and/or continuous reinforcement learning problems. OSKTD includes two online procedures: online sparsification and parameter updating for the selective kernel-based value function. A new sparsification method (i.e., a kernel distance-based online sparsification method) is proposed based on selective ensemble learning, which is computationally less complex compared with other sparsification methods. With the proposed sparsification method, the sparsified dictionary of samples is constructed online by checking whether a sample needs to be added to the sparsified dictionary. In addition, based on local validity, a selective kernel-based value function is proposed to select the best samples from the sample dictionary for the selective kernel-based value function approximator. The parameters of the selective kernel-based value function are iteratively updated by using the temporal difference (TD) learning algorithm combined with the gradient descent technique. The complexity of the online sparsification procedure in the OSKTD algorithm is O(n). In addition, two typical experiments (Maze and Mountain Car) are used to compare with both traditional and up-to-date O(n) algorithms (GTD, GTD2, and TDC using the kernel-based value function), and the results demonstrate the effectiveness of our proposed algorithm. In the Maze problem, OSKTD converges to an optimal policy and converges faster than both traditional and up-to-date algorithms. In the Mountain Car problem, OSKTD converges, requires less computation time compared with other sparsification methods, reaches a better local optimum than the traditional algorithms, and converges much faster than the up-to-date algorithms. In addition, OSKTD can reach a competitive final optimum compared with the up-to-date algorithms.
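A kernel distance-based sparsification rule of the kind described above can be sketched in a few lines. The Gaussian kernel, the threshold value, and the stream of states below are assumptions; the sketch is in the spirit of the rule, not a reproduction of OSKTD.

```python
# Hedged sketch of kernel distance-based online sparsification: a sample enters the
# dictionary only if its squared feature-space distance to every stored sample
# exceeds a threshold mu.
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def sparsify(stream, mu=0.3, sigma=1.0):
    dictionary = []
    for x in stream:
        if not dictionary:
            dictionary.append(x)
            continue
        # Squared feature-space distance: k(x,x) - 2k(x,d) + k(d,d) = 2 - 2k(x,d).
        dist2 = [2.0 - 2.0 * gaussian_kernel(x, d, sigma) for d in dictionary]
        if min(dist2) > mu:
            dictionary.append(x)
    return np.array(dictionary)

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(1000, 2))     # e.g. (position, velocity) pairs, synthetic
D = sparsify(states)
print("dictionary size:", len(D), "out of", len(states), "samples")
```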
Parsimony and goodness-of-fit in multi-dimensional NMR inversion
NASA Astrophysics Data System (ADS)
Babak, Petro; Kryuchkov, Sergey; Kantzas, Apostolos
2017-01-01
Multi-dimensional nuclear magnetic resonance (NMR) experiments are often used for study of molecular structure and dynamics of matter in core analysis and reservoir evaluation. Industrial applications of multi-dimensional NMR involve a high-dimensional measurement dataset with complicated correlation structure and require rapid and stable inversion algorithms from the time domain to the relaxation rate and/or diffusion domains. In practice, applying existing inverse algorithms with a large number of parameter values leads to an infinite number of solutions with a reasonable fit to the NMR data. The interpretation of such variability of multiple solutions and selection of the most appropriate solution could be a very complex problem. In most cases the characteristics of materials have sparse signatures, and investigators would like to distinguish the most significant relaxation and diffusion values of the materials. To produce an easy to interpret and unique NMR distribution with the finite number of the principal parameter values, we introduce a new method for NMR inversion. The method is constructed based on the trade-off between the conventional goodness-of-fit approach to multivariate data and the principle of parsimony guaranteeing inversion with the least number of parameter values. We suggest performing the inversion of NMR data using the forward stepwise regression selection algorithm. To account for the trade-off between goodness-of-fit and parsimony, the objective function is selected based on Akaike Information Criterion (AIC). The performance of the developed multi-dimensional NMR inversion method and its comparison with conventional methods are illustrated using real data for samples with bitumen, water and clay.
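The parsimony/goodness-of-fit trade-off described above can be illustrated with a generic forward stepwise selection scored by AIC. The sketch below assumes a random stand-in kernel matrix and a sparse synthetic signal; it is not the authors' NMR inversion code.

```python
# Hedged sketch of forward stepwise regression selection scored by AIC for a linear
# inversion y ~ A x restricted to a growing set of columns (parameter values).
import numpy as np

def aic(y, yhat, k):
    n = len(y)
    rss = np.sum((y - yhat) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(A, y, max_terms=10):
    selected, best_aic = [], np.inf
    while len(selected) < max_terms:
        scores = {}
        for j in range(A.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            coef, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
            scores[j] = aic(y, A[:, cols] @ coef, len(cols))
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_aic:        # no AIC improvement: stop (parsimony)
            break
        best_aic = scores[j_best]
        selected.append(j_best)
    return selected, best_aic

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))            # stand-in relaxation/diffusion kernel
x_true = np.zeros(50); x_true[[3, 17, 40]] = [1.0, 0.5, 0.8]
y = A @ x_true + 0.05 * rng.standard_normal(200)
print(forward_stepwise(A, y))
```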
ERIC Educational Resources Information Center
Yang, Ji Seung; Cai, Li
2014-01-01
The main purpose of this study is to improve estimation efficiency in obtaining maximum marginal likelihood estimates of contextual effects in the framework of nonlinear multilevel latent variable model by adopting the Metropolis-Hastings Robbins-Monro algorithm (MH-RM). Results indicate that the MH-RM algorithm can produce estimates and standard…
Automatic delineation of tumor volumes by co-segmentation of combined PET/MR data
NASA Astrophysics Data System (ADS)
Leibfarth, S.; Eckert, F.; Welz, S.; Siegel, C.; Schmidt, H.; Schwenzer, N.; Zips, D.; Thorwarth, D.
2015-07-01
Combined PET/MRI may be highly beneficial for radiotherapy treatment planning in terms of tumor delineation and characterization. To standardize tumor volume delineation, an automatic algorithm for the co-segmentation of head and neck (HN) tumors based on PET/MR data was developed. Ten HN patient datasets acquired in a combined PET/MR system were available for this study. The proposed algorithm uses both the anatomical T2-weighted MR and FDG-PET data. For both imaging modalities tumor probability maps were derived, assigning each voxel a probability of being cancerous based on its signal intensity. A combination of these maps was subsequently segmented using a threshold level set algorithm. To validate the method, tumor delineations from three radiation oncologists were available. Inter-observer variabilities and variabilities between the algorithm and each observer were quantified by means of the Dice similarity index and a distance measure. Inter-observer variabilities and variabilities between observers and algorithm were found to be comparable, suggesting that the proposed algorithm is adequate for PET/MR co-segmentation. Moreover, taking into account combined PET/MR data resulted in more consistent tumor delineations compared to MR information only.
NASA Astrophysics Data System (ADS)
De Vleeschouwer, Niels; Verhoest, Niko E. C.; Gobeyn, Sacha; De Baets, Bernard; Verwaeren, Jan; Pauwels, Valentijn R. N.
2015-04-01
The continuous monitoring of soil moisture in a permanent network can yield an interesting data product for use in hydrological modeling. Major advantages of in situ observations compared to remote sensing products are the potential vertical extent of the measurements, the smaller temporal resolution of the observation time series, the smaller impact of land cover variability on the observation bias, etc. However, two major disadvantages are the typically small integration volume of in situ measurements, and the often large spacing between monitoring locations. This causes only a small part of the modeling domain to be directly observed. Furthermore, the spatial configuration of the monitoring network is typically non-dynamic in time. Generally, e.g. when applying data assimilation, maximizing the observed information under given circumstances will lead to a better qualitative and quantitative insight of the hydrological system. It is therefore advisable to perform a prior analysis in order to select those monitoring locations which are most predictive for the unobserved modeling domain. This research focuses on optimizing the configuration of a soil moisture monitoring network in the catchment of the Bellebeek, situated in Belgium. A recursive algorithm, strongly linked to the equations of the Ensemble Kalman Filter, has been developed to select the most predictive locations in the catchment. The basic idea behind the algorithm is twofold. On the one hand a minimization of the modeled soil moisture ensemble error covariance between the different monitoring locations is intended. This causes the monitoring locations to be as independent as possible regarding the modeled soil moisture dynamics. On the other hand, the modeled soil moisture ensemble error covariance between the monitoring locations and the unobserved modeling domain is maximized. The latter causes a selection of monitoring locations which are more predictive towards unobserved locations. The main factors that will influence the outcome of the algorithm are the following: the choice of the hydrological model, the uncertainty model applied for ensemble generation, the general wetness of the catchment during which the error covariance is computed, etc. In this research the influence of the latter two is examined more in-depth. Furthermore, the optimal network configuration resulting from the newly developed algorithm is compared to network configurations obtained by two other algorithms. The first algorithm is based on a temporal stability analysis of the modeled soil moisture in order to identify catchment representative monitoring locations with regard to average conditions. The second algorithm involves the clustering of available spatially distributed data (e.g. land cover and soil maps) that is not obtained by hydrological modeling.
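The two covariance criteria described above admit a very rough greedy illustration. The sketch below is an interpretation made for illustration only: the ensemble data, the additive score (domain-wide covariance minus redundancy with already-selected sites), and the weighting are all assumptions, and the actual recursive, Ensemble-Kalman-Filter-linked algorithm differs.

```python
# Hedged sketch of greedy monitoring-location selection from a model ensemble:
# prefer candidate cells whose soil-moisture anomalies covary strongly with the rest
# of the domain and weakly with already-selected cells.
import numpy as np

def select_locations(ensemble, candidates, n_select=5, w_redundancy=1.0):
    # ensemble: (n_members, n_cells) soil-moisture anomalies from the model ensemble
    cov = np.cov(ensemble, rowvar=False)
    selected = []
    for _ in range(n_select):
        best, best_score = None, -np.inf
        for c in candidates:
            if c in selected:
                continue
            gain = np.abs(cov[c]).sum()                            # coverage of the domain
            redundancy = sum(abs(cov[c, s]) for s in selected)     # overlap with chosen sites
            score = gain - w_redundancy * redundancy
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
ens = rng.standard_normal((64, 300))           # 64 ensemble members x 300 model grid cells
print(select_locations(ens, candidates=range(300)))
```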
Space shuttle propulsion parameter estimation using optimal estimation techniques
NASA Technical Reports Server (NTRS)
1983-01-01
The first twelve system state variables are presented with the necessary mathematical developments for incorporating them into the filter/smoother algorithm. Other state variables, i.e., aerodynamic coefficients can be easily incorporated into the estimation algorithm, representing uncertain parameters, but for initial checkout purposes are treated as known quantities. An approach for incorporating the NASA propulsion predictive model results into the optimal estimation algorithm was identified. This approach utilizes numerical derivatives and nominal predictions within the algorithm with global iterations of the algorithm. The iterative process is terminated when the quality of the estimates provided no longer significantly improves.
Modification of the Integrated Sasang Constitutional Diagnostic Model
Nam, Jiho
2017-01-01
In 2012, the Korea Institute of Oriental Medicine proposed an objective and comprehensive physical diagnostic model to address quantification problems in the existing Sasang constitutional diagnostic method. However, certain issues have been raised regarding a revision of the proposed diagnostic model. In this paper, we propose various methodological approaches to address the problems of the previous diagnostic model. Firstly, more useful variables are selected in each component. Secondly, the least absolute shrinkage and selection operator is used to reduce multicollinearity without the modification of explanatory variables. Thirdly, proportions of SC types and age are considered to construct individual diagnostic models and classify the training set and the test set for reflecting the characteristics of the entire dataset. Finally, an integrated model is constructed with explanatory variables of individual diagnosis models. The proposed integrated diagnostic model significantly improves the sensitivities for both the male SY type (36.4% → 62.0%) and the female SE type (43.7% → 64.5%), which were areas of limitation of the previous integrated diagnostic model. The ideas of these new algorithms are expected to contribute not only to the scientific development of Sasang constitutional medicine in Korea but also to that of other diagnostic methods for traditional medicine. PMID:29317897
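The LASSO step mentioned above can be illustrated with an L1-penalized multinomial logistic regression as one plausible stand-in. The data, the penalty strength, and the choice of a classification formulation are assumptions; only the use of an L1 penalty to thin out collinear predictors mirrors the description.

```python
# Hedged sketch of L1 (LASSO-style) shrinkage for a multi-class diagnostic model:
# collinear body-measurement predictors are shrunk, leaving a sparse variable set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=40, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

lasso = LogisticRegression(penalty="l1", solver="saga", C=0.05, max_iter=5000)
lasso.fit(X, y)
kept = np.where(np.any(lasso.coef_ != 0, axis=0))[0]
print("variables retained by the L1 penalty:", kept)
```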
Quality assessment of data discrimination using self-organizing maps.
Mekler, Alexey; Schwarz, Dmitri
2014-10-01
One of the important aspects of the data classification problem lies in making the most appropriate selection of features. The set of variables should be small and, at the same time, should provide reliable discrimination of the classes. A method for evaluating discriminating power that enables comparison between different sets of variables is therefore useful in the search for such a set. A new approach to feature selection is presented. Two methods for evaluating the data-discriminating power of a feature set are suggested. Both methods implement self-organizing maps (SOMs) and newly introduced indices of the degree of data clusterization on the SOM. The first method is based on the comparison of intraclass and interclass distances on the map. The second method evaluates the relative number of a best matching unit's (BMU's) nearest neighbors belonging to the same class. Both methods make it possible to evaluate the discriminating power of a feature set in cases when this set provides nonlinear discrimination of the classes. The current algorithms, in program code, can be downloaded for free at http://mekler.narod.ru/Science/Articles_support.html, along with the supporting data files. Copyright © 2014 Elsevier Inc. All rights reserved.
Low complexity adaptive equalizers for underwater acoustic communications
NASA Astrophysics Data System (ADS)
Soflaei, Masoumeh; Azmi, Paeiz
2014-08-01
Interference signals due to scattering from the surface and reflection from the bottom are one of the most important problems for reliable communication in shallow-water channels. To address this problem, one of the best suggested approaches is to use adaptive equalizers. Convergence rate and misadjustment error of adaptive algorithms play important roles in adaptive equalizer performance. In this paper, the affine projection algorithm (APA), the selective regressor APA (SR-APA), the family of selective partial update (SPU) algorithms, the family of set-membership (SM) algorithms and the selective partial update selective regressor APA (SPU-SR-APA) are compared with conventional algorithms such as the least mean square (LMS) in underwater acoustic communications. We apply experimental data from the Strait of Hormuz to demonstrate the efficiency of the proposed methods over a shallow-water channel. We observe that the steady-state mean square error (MSE) values of the SR-APA, SPU-APA, SPU-normalized least mean square (SPU-NLMS), SPU-SR-APA, SM-APA and SM-NLMS algorithms decrease in comparison with the LMS algorithm. These algorithms also have better convergence rates than the LMS-type algorithm.
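For context, a normalized LMS (NLMS) equalizer of the kind used as a baseline above can be written in a few lines; removing the normalization term gives plain LMS. The channel taps, training sequence, and step size below are synthetic assumptions, not the Strait of Hormuz data.

```python
# Hedged sketch of an NLMS adaptive equalizer on a synthetic multipath channel.
import numpy as np

def nlms_equalizer(rx, desired, n_taps=8, mu=0.5, eps=1e-6):
    w = np.zeros(n_taps)
    err = np.zeros(len(rx))
    for n in range(n_taps, len(rx)):
        x = rx[n - n_taps + 1:n + 1][::-1]         # tapped delay line (most recent first)
        y = w @ x
        err[n] = desired[n] - y
        w += mu * err[n] * x / (eps + x @ x)       # normalized LMS update
    return w, err

rng = np.random.default_rng(0)
symbols = rng.choice([-1.0, 1.0], size=4000)                     # BPSK training sequence
channel = np.array([1.0, 0.45, -0.2])                            # multipath stand-in
rx = np.convolve(symbols, channel, mode="full")[:len(symbols)]
rx += 0.05 * rng.standard_normal(len(rx))
w, err = nlms_equalizer(rx, symbols)
print("steady-state MSE: %.4f" % np.mean(err[-500:] ** 2))
```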
Cawley, Gavin C; Talbot, Nicola L C
2006-10-01
Gene selection algorithms for cancer classification, based on the expression of a small number of biomarker genes, have been the subject of considerable research in recent years. Shevade and Keerthi propose a gene selection algorithm based on sparse logistic regression (SLogReg) incorporating a Laplace prior to promote sparsity in the model parameters, and provide a simple but efficient training procedure. The degree of sparsity obtained is determined by the value of a regularization parameter, which must be carefully tuned in order to optimize performance. This normally involves a model selection stage, based on a computationally intensive search for the minimizer of the cross-validation error. In this paper, we demonstrate that a simple Bayesian approach can be taken to eliminate this regularization parameter entirely, by integrating it out analytically using an uninformative Jeffreys prior. The improved algorithm (BLogReg) is then typically two or three orders of magnitude faster than the original algorithm, as there is no longer a need for a model selection step. The BLogReg algorithm is also free from selection bias in performance estimation, a common pitfall in the application of machine learning algorithms in cancer classification. The SLogReg, BLogReg and Relevance Vector Machine (RVM) gene selection algorithms are evaluated over the well-studied colon cancer and leukaemia benchmark datasets. The leave-one-out estimates of the probability of test error and cross-entropy of the BLogReg and SLogReg algorithms are very similar; however, the BLogReg algorithm is found to be considerably faster than the original SLogReg algorithm. Using nested cross-validation to avoid selection bias, performance estimation for SLogReg on the leukaemia dataset takes almost 48 h, whereas the corresponding result for BLogReg is obtained in only 1 min 24 s, making BLogReg by far the more practical algorithm. BLogReg also demonstrates better estimates of conditional probability than the RVM, which are of great importance in medical applications, with similar computational expense. A MATLAB implementation of the sparse logistic regression algorithm with Bayesian regularization (BLogReg) is available from http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/
Compact cancer biomarkers discovery using a swarm intelligence feature selection algorithm.
Martinez, Emmanuel; Alvarez, Mario Moises; Trevino, Victor
2010-08-01
Biomarker discovery is a typical application of functional genomics. Due to the large number of genes studied simultaneously in microarray data, feature selection is a key step. Swarm intelligence has emerged as a solution for the feature selection problem. However, typical swarm intelligence settings for feature selection fail to select small feature subsets. We have proposed a swarm intelligence feature selection algorithm based on the initialization and update of only a subset of particles in the swarm. In this study, we tested our algorithm on 11 microarray datasets for brain, leukemia, lung, prostate, and other cancers. We show that the proposed swarm intelligence algorithm successfully increases the classification accuracy and decreases the number of selected features compared to other swarm intelligence methods. Copyright © 2010 Elsevier Ltd. All rights reserved.
Multi-task feature selection in microarray data by binary integer programming.
Lan, Liang; Vucetic, Slobodan
2013-12-20
A major challenge in microarray classification is that the number of features is typically orders of magnitude larger than the number of examples. In this paper, we propose a novel feature filter algorithm to select the feature subset with maximal discriminative power and minimal redundancy by solving a quadratic objective function with binary integer constraints. To improve the computational efficiency, the binary integer constraints are relaxed and a low-rank approximation to the quadratic term is applied. The proposed feature selection algorithm was extended to solve multi-task microarray classification problems. We compared the single-task version of the proposed feature selection algorithm with 9 existing feature selection methods on 4 benchmark microarray data sets. The empirical results show that the proposed method achieved the most accurate predictions overall. We also evaluated the multi-task version of the proposed algorithm on 8 multi-task microarray datasets. The multi-task feature selection algorithm resulted in significantly higher accuracy than when using the single-task feature selection methods.
Alshamlan, Hala; Badr, Ghada; Alohali, Yousef
2015-01-01
An artificial bee colony (ABC) is a relatively recent swarm intelligence optimization approach. In this paper, we propose the first attempt at applying the ABC algorithm to analyzing a microarray gene expression profile. In addition, we propose an innovative feature selection algorithm, minimum redundancy maximum relevance (mRMR), and combine it with an ABC algorithm, mRMR-ABC, to select informative genes from microarray profiles. The new approach is based on a support vector machine (SVM) algorithm to measure the classification accuracy for selected genes. We evaluate the performance of the proposed mRMR-ABC algorithm by conducting extensive experiments on six binary and multiclass gene expression microarray datasets. Furthermore, we compare our proposed mRMR-ABC algorithm with previously known techniques. We reimplemented two of these techniques for the sake of a fair comparison using the same parameters. These two techniques are mRMR when combined with a genetic algorithm (mRMR-GA) and mRMR when combined with a particle swarm optimization algorithm (mRMR-PSO). The experimental results show that the proposed mRMR-ABC algorithm achieves accurate classification performance using a small number of predictive genes when tested on both binary and multiclass datasets and compared to previously suggested methods. This shows that mRMR-ABC is a promising approach for solving gene selection and cancer classification problems. PMID:25961028
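The mRMR criterion at the core of mRMR-ABC can be sketched as a greedy loop; the ABC search wrapper and SVM scoring are not shown. The data sizes, the target subset size, and the use of scikit-learn mutual information estimators are assumptions made for illustration.

```python
# Hedged sketch of greedy mRMR selection: add the gene whose relevance to the class
# labels most exceeds its average redundancy with the genes already selected.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = make_classification(n_samples=100, n_features=60, n_informative=8, random_state=0)
relevance = mutual_info_classif(X, y, random_state=0)

selected = [int(np.argmax(relevance))]
while len(selected) < 6:
    best_j, best_score = None, -np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        redundancy = np.mean([mutual_info_regression(X[:, [j]], X[:, s],
                                                     random_state=0)[0] for s in selected])
        score = relevance[j] - redundancy
        if score > best_score:
            best_j, best_score = j, score
    selected.append(best_j)
print("mRMR-selected genes:", selected)
```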
Joint optimization of maintenance, buffers and machines in manufacturing lines
NASA Astrophysics Data System (ADS)
Nahas, Nabil; Nourelfath, Mustapha
2018-01-01
This article considers a series manufacturing line composed of several machines separated by intermediate buffers of finite capacity. The goal is to find the optimal number of preventive maintenance actions performed on each machine, the optimal selection of machines and the optimal buffer allocation plan that minimize the total system cost, while providing the desired system throughput level. The mean times between failures of all machines are assumed to increase when applying periodic preventive maintenance. To estimate the production line throughput, a decomposition method is used. The decision variables in the formulated optimal design problem are buffer levels, types of machines and times between preventive maintenance actions. Three heuristic approaches are developed to solve the formulated combinatorial optimization problem. The first heuristic consists of a genetic algorithm, the second is based on the nonlinear threshold accepting metaheuristic and the third is an ant colony system. The proposed heuristics are compared and their efficiency is shown through several numerical examples. It is found that the nonlinear threshold accepting algorithm outperforms the genetic algorithm and ant colony system, while the genetic algorithm provides better results than the ant colony system for longer manufacturing lines.
BIG DATA ANALYTICS AND PRECISION ANIMAL AGRICULTURE SYMPOSIUM: Data to decisions.
White, B J; Amrine, D E; Larson, R L
2018-04-14
Big data are frequently used in many facets of business and agronomy to enhance knowledge needed to improve operational decisions. Livestock operations collect data of sufficient quantity to perform predictive analytics. Predictive analytics can be defined as a methodology and suite of data evaluation techniques to generate a prediction for specific target outcomes. The objective of this manuscript is to describe the process of using big data and the predictive analytic framework to create tools to drive decisions in livestock production, health, and welfare. The predictive analytic process involves selecting a target variable, managing the data, partitioning the data, then creating algorithms, refining algorithms, and finally comparing accuracy of the created classifiers. The partitioning of the datasets allows model building and refining to occur prior to testing the predictive accuracy of the model with naive data to evaluate overall accuracy. Many different classification algorithms are available for predictive use and testing multiple algorithms can lead to optimal results. Application of a systematic process for predictive analytics using data that is currently collected or that could be collected on livestock operations will facilitate precision animal management through enhanced livestock operational decisions.
Pattern Recognition for a Flight Dynamics Monte Carlo Simulation
NASA Technical Reports Server (NTRS)
Restrepo, Carolina; Hurtado, John E.
2011-01-01
The design, analysis, and verification and validation of a spacecraft relies heavily on Monte Carlo simulations. Modern computational techniques are able to generate large amounts of Monte Carlo data but flight dynamics engineers lack the time and resources to analyze it all. The growing amounts of data combined with the diminished available time of engineers motivates the need to automate the analysis process. Pattern recognition algorithms are an innovative way of analyzing flight dynamics data efficiently. They can search large data sets for specific patterns and highlight critical variables so analysts can focus their analysis efforts. This work combines a few tractable pattern recognition algorithms with basic flight dynamics concepts to build a practical analysis tool for Monte Carlo simulations. Current results show that this tool can quickly and automatically identify individual design parameters, and most importantly, specific combinations of parameters that should be avoided in order to prevent specific system failures. The current version uses a kernel density estimation algorithm and a sequential feature selection algorithm combined with a k-nearest neighbor classifier to find and rank important design parameters. This provides an increased level of confidence in the analysis and saves a significant amount of time.
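The sequential feature selection and k-nearest-neighbor pairing described above maps naturally onto scikit-learn stand-ins, shown below; the data, the number of selected parameters, and the neighbor count are assumptions, and the kernel density estimation stage of the authors' tool is omitted.

```python
# Hedged sketch: forward sequential feature selection wrapped around a kNN classifier,
# used to rank which dispersion parameters best separate failed from nominal runs.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Columns play the role of Monte Carlo dispersion parameters; the label marks failed runs.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=7)
sfs = SequentialFeatureSelector(knn, n_features_to_select=5, direction="forward", cv=5)
sfs.fit(X, y)
top = sfs.get_support(indices=True)
print("influential parameters:", top,
      "CV accuracy: %.3f" % cross_val_score(knn, X[:, top], y, cv=5).mean())
```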
NASA Astrophysics Data System (ADS)
Liu, Meiling; Liu, Xiangnan; Li, Jin; Ding, Chao; Jiang, Jiale
2014-12-01
Satellites routinely provide frequent, large-scale, near-surface views of many oceanographic variables pertinent to plankton ecology. However, the nutrient fertility of water can be challenging to detect accurately using remote sensing technology. This research has explored an approach to estimate the nutrient fertility in coastal waters through the fusion of synthetic aperture radar (SAR) images and optical images using the random forest (RF) algorithm. The estimation of total inorganic nitrogen (TIN) in the Hong Kong Sea, China, was used as a case study. In March of 2009 and May and August of 2010, a sequence of multi-temporal in situ data and CCD images from China's HJ-1 satellite and RADARSAT-2 images were acquired. Four sensitive parameters were selected as input variables to evaluate TIN: single-band reflectance, a normalized difference spectral index (NDSI) and HV and VH polarizations. The RF algorithm was used to merge the different input variables from the SAR and optical imagery to generate a new dataset (i.e., the TIN outputs). The results showed the temporal-spatial distribution of TIN. The TIN values decreased from coastal waters to the open water areas, and TIN values in the northeast area were higher than those found in the southwest region of the study area. The maximum TIN values occurred in May. Additionally, the estimation accuracy for estimating TIN was significantly improved when the SAR and optical data were used in combination rather than a single data type alone. This study suggests that this method of estimating nutrient fertility in coastal waters by effectively fusing data from multiple sensors is very promising.
Dual-threshold segmentation using Arimoto entropy based on chaotic bee colony optimization
NASA Astrophysics Data System (ADS)
Li, Li
2018-03-01
In order to extract targets from a complex background more quickly and accurately, and to further improve the detection of defects, a dual-threshold segmentation method using Arimoto entropy based on chaotic bee colony optimization is proposed. Firstly, single-threshold selection based on Arimoto entropy is extended to dual-threshold selection in order to separate the target from the background more accurately. Then the intermediate variables in the formulae for Arimoto entropy dual-threshold selection are calculated by recursion to eliminate redundant computation and reduce the amount of calculation. Finally, the local search phase of the artificial bee colony algorithm is improved with a chaotic sequence based on the tent map. The fast search for the two optimal thresholds is achieved using the improved bee colony optimization algorithm, substantially accelerating the search. A large number of experimental results show that, compared with existing segmentation methods such as multi-threshold segmentation using maximum Shannon entropy, two-dimensional Shannon entropy segmentation, two-dimensional Tsallis gray entropy segmentation and multi-threshold segmentation using reciprocal gray entropy, the proposed method can segment the target more quickly and accurately, with superior segmentation results. It proves to be a fast and effective method for image segmentation.
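Only the tent-map chaotic sequence driving the improved local search is simple enough to sketch; the Arimoto entropy objective and the bee colony bookkeeping are not reproduced. The initial seed and the mapping onto a gray-level range below are assumptions.

```python
# Hedged sketch of a tent-map chaotic sequence mapped onto candidate gray-level
# thresholds, as one way to drive chaotic local search probes.
import numpy as np

def tent_map_sequence(x0=0.37, n=50):
    seq, x = [], x0
    for _ in range(n):
        x = 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)
        seq.append(x)
    return np.array(seq)

chaos = tent_map_sequence()
low, high = 0, 255
candidate_thresholds = (low + chaos * (high - low)).astype(int)   # chaotic neighbourhood probes
print(candidate_thresholds[:10])
```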
NASA Astrophysics Data System (ADS)
Lin, Z. D.; Wang, Y. B.; Wang, R. J.; Wang, L. S.; Lu, C. P.; Zhang, Z. Y.; Song, L. T.; Liu, Y.
2017-07-01
A total of 130 topsoil samples collected from Guoyang County, Anhui Province, China, were used to establish a Vis-NIR model for the prediction of organic matter content (OMC) in lime concretion black soils. Different spectral pretreatments were applied for minimizing the irrelevant and useless information of the spectra and increasing the spectra correlation with the measured values. Subsequently, the Kennard-Stone (KS) method and sample set partitioning based on joint x-y distances (SPXY) were used to select the training set. Successive projection algorithm (SPA) and genetic algorithm (GA) were then applied for wavelength optimization. Finally, the principal component regression (PCR) model was constructed, in which the optimal number of principal components was determined using the leave-one-out cross validation technique. The results show that the combination of the Savitzky-Golay (SG) filter for smoothing and multiplicative scatter correction (MSC) can eliminate the effect of noise and baseline drift; the SPXY method is preferable to KS in the sample selection; both the SPA and the GA can significantly reduce the number of wavelength variables and favorably increase the accuracy, especially GA, which greatly improved the prediction accuracy of soil OMC with Rcc, RMSEP, and RPD up to 0.9316, 0.2142, and 2.3195, respectively.
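The Kennard-Stone (KS) split mentioned above has a compact greedy form: start from the two most distant spectra, then repeatedly add the sample whose minimum distance to the selected set is largest. The sketch below uses synthetic spectra; SPXY additionally folds the y-distance into the metric, which is not shown.

```python
# Hedged sketch of Kennard-Stone training-set selection on synthetic soil spectra.
import numpy as np

def kennard_stone(X, n_select):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    while len(selected) < n_select:
        remaining = [k for k in range(len(X)) if k not in selected]
        # For each remaining sample, distance to its nearest selected sample.
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_d))])
    return selected

rng = np.random.default_rng(0)
spectra = rng.random((130, 200))            # 130 soil spectra x 200 wavelengths (synthetic)
train_idx = kennard_stone(spectra, 90)
print("training set size:", len(train_idx))
```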
NASA Astrophysics Data System (ADS)
Sahraei, S.; Asadzadeh, M.
2017-12-01
Any modern multi-objective global optimization algorithm should be able to archive a well-distributed set of solutions. While solution diversity in the objective space has been explored extensively in the literature, little attention has been given to solution diversity in the decision space. Selection metrics such as the hypervolume contribution and crowding distance calculated in the objective space guide the search toward solutions that are well distributed across the objective space. In this study, the diversity of solutions in the decision space is used as the main selection criterion, besides the dominance check, in multi-objective optimization. To this end, currently archived solutions are clustered in the decision space and the ones in less crowded clusters are given a higher chance of being selected for generating new solutions. The proposed approach is first tested on benchmark mathematical test problems. Second, it is applied to a hydrologic model calibration problem with more than three objective functions. Results show that the chance of finding a more sparse set of high-quality solutions increases, and therefore the analyst receives a diverse set of options with the maximum amount of information. Pareto Archived-Dynamically Dimensioned Search, which is an efficient and parsimonious multi-objective optimization algorithm for model calibration, is utilized in this study.
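The decision-space diversity idea can be sketched as a clustering-based parent choice; the dominance bookkeeping of the archive is omitted. The archive contents, the number of clusters, and the selection rule below are assumptions made for illustration.

```python
# Hedged sketch: cluster archived solutions by their decision variables and draw the
# next parent from the least crowded cluster, giving sparse regions a higher chance.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
archive_x = rng.random((60, 12))                     # decision variables of archived solutions

k = 5
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(archive_x)
cluster_sizes = np.bincount(labels, minlength=k)
sparse_cluster = int(np.argmin(cluster_sizes))       # least crowded region of decision space

members = np.where(labels == sparse_cluster)[0]
parent = archive_x[rng.choice(members)]
print("cluster sizes:", cluster_sizes, "parent drawn from cluster", sparse_cluster)
```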
Doubling down on phosphorylation as a variable peptide modification.
Cooper, Bret
2016-09-01
Some mass spectrometrists believe that searching for variable PTMs like phosphorylation of serine or threonine when using database-search algorithms to interpret peptide tandem mass spectra will increase false-positive matching. The basis for this is the premise that the algorithm compares a spectrum to both a nonphosphorylated peptide candidate and a phosphorylated candidate, which is double the number of candidates compared to a search with no possible phosphorylation. Hence, if the search space doubles, false-positive matching could increase accordingly as the algorithm considers more candidates to which false matches could be made. In this study, it is shown that the search for variable phosphoserine and phosphothreonine modifications does not always double the search space or unduly impinge upon the FDR. A breakdown of how one popular database-search algorithm deals with variable phosphorylation is presented. Published 2016. This article is a U.S. Government work and is in the public domain in the USA.
Threshold automatic selection hybrid phase unwrapping algorithm for digital holographic microscopy
NASA Astrophysics Data System (ADS)
Zhou, Meiling; Min, Junwei; Yao, Baoli; Yu, Xianghua; Lei, Ming; Yan, Shaohui; Yang, Yanlong; Dan, Dan
2015-01-01
The conventional quality-guided (QG) phase unwrapping algorithm is difficult to apply to digital holographic microscopy because of its long execution time. In this paper, we present a threshold automatic selection hybrid phase unwrapping algorithm that combines the existing QG algorithm and the flood-fill (FF) algorithm to solve this problem. The original wrapped phase map is divided into high- and low-quality sub-maps by selecting a threshold automatically, and the FF and QG unwrapping algorithms are then used to unwrap the phase in the respective sub-maps. The feasibility of the proposed method is proved by experimental results, and the execution speed is shown to be much faster than that of the original QG unwrapping algorithm.
Thrust stand evaluation of engine performance improvement algorithms in an F-15 airplane
NASA Technical Reports Server (NTRS)
Conners, Timothy R.
1992-01-01
Results are presented from the evaluation of the performance seeking control (PSC) optimization algorithm developed by Smith et al. (1990) for F-15 aircraft, which optimizes the quasi-steady-state performance of an F100 derivative turbofan engine for several modes of operation. The PSC algorithm uses an onboard software engine model that calculates thrust, stall margin, and other unmeasured variables for use in the optimization. Comparisons are presented between the load cell measurements, the PSC onboard model thrust calculations, and posttest state variable model computations. Actual performance improvements using the PSC algorithm are presented for its various modes. The results of using the PSC algorithm are compared with similar test case results using the HIDEC algorithm.
Grainger, Matthew James; Aramyan, Lusine; Piras, Simone; Quested, Thomas Edward; Righi, Simone; Setti, Marco; Vittuari, Matteo; Stewart, Gavin Bruce
2018-01-01
Food waste from households contributes the greatest proportion to total food waste in developed countries. Therefore, food waste reduction requires an understanding of the socio-economic (contextual and behavioural) factors that lead to its generation within the household. Addressing such a complex subject calls for sound methodological approaches that until now have been conditioned by the large number of factors involved in waste generation, by the lack of a recognised definition, and by limited available data. This work contributes to food waste generation literature by using one of the largest available datasets that includes data on the objective amount of avoidable household food waste, along with information on a series of socio-economic factors. In order to address one aspect of the complexity of the problem, machine learning algorithms (random forests and boruta) for variable selection integrated with linear modelling, model selection and averaging are implemented. Model selection addresses model structural uncertainty, which is not routinely considered in assessments of food waste in literature. The main drivers of food waste in the home selected in the most parsimonious models include household size, the presence of fussy eaters, employment status, home ownership status, and the local authority. Results, regardless of which variable set the models are run on, point toward large households as being a key target element for food waste reduction interventions.
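The shadow-feature idea behind Boruta, paired with a random forest as in the study, can be shown in a single pass; the full Boruta procedure iterates this test with statistical corrections, and the linear modelling and model-averaging stages are not sketched. Data are synthetic stand-ins.

```python
# Hedged, single-pass sketch of a Boruta-style screen: keep a real predictor only if
# its random-forest importance beats the best importance of any shuffled "shadow" copy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=5, random_state=0)

rng = np.random.default_rng(0)
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(np.hstack([X, shadows]), y)

real_imp = rf.feature_importances_[:X.shape[1]]
shadow_max = rf.feature_importances_[X.shape[1]:].max()
kept = np.where(real_imp > shadow_max)[0]
print("variables passing the shadow test:", kept)
```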
Aramyan, Lusine; Piras, Simone; Quested, Thomas Edward; Righi, Simone; Setti, Marco; Vittuari, Matteo; Stewart, Gavin Bruce
2018-01-01
Food waste from households contributes the greatest proportion to total food waste in developed countries. Therefore, food waste reduction requires an understanding of the socio-economic (contextual and behavioural) factors that lead to its generation within the household. Addressing such a complex subject calls for sound methodological approaches that until now have been conditioned by the large number of factors involved in waste generation, by the lack of a recognised definition, and by limited available data. This work contributes to food waste generation literature by using one of the largest available datasets that includes data on the objective amount of avoidable household food waste, along with information on a series of socio-economic factors. In order to address one aspect of the complexity of the problem, machine learning algorithms (random forests and boruta) for variable selection integrated with linear modelling, model selection and averaging are implemented. Model selection addresses model structural uncertainty, which is not routinely considered in assessments of food waste in literature. The main drivers of food waste in the home selected in the most parsimonious models include household size, the presence of fussy eaters, employment status, home ownership status, and the local authority. Results, regardless of which variable set the models are run on, point toward large households as being a key target element for food waste reduction interventions. PMID:29389949
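A hedged sketch of the screening-then-modelling workflow summarized above, using a random-forest importance ranking as a stand-in for the random forests/Boruta step, followed by a cross-validated linear model on the retained predictors; the function name, the number of retained features, and the scoring choice are illustrative assumptions, not the study's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def screen_then_model(X, y, feature_names, keep=5, random_state=0):
    """Rank socio-economic predictors with a random forest, keep the strongest,
    and fit a linear model on the reduced set."""
    rf = RandomForestRegressor(n_estimators=500, random_state=random_state)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1][:keep]
    selected = [feature_names[i] for i in order]

    # Cross-validated fit of the reduced linear model (model averaging over
    # candidate subsets, as in the paper, is not reproduced here).
    lm = LinearRegression()
    r2 = cross_val_score(lm, X[:, order], y, cv=5, scoring="r2").mean()
    return selected, r2
```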
NASA Astrophysics Data System (ADS)
Jough, Fooad Karimi Ghaleh; Şensoy, Serhan
2016-12-01
Different performance levels may be obtained for sideway collapse evaluation of steel moment frames depending on the evaluation procedure used to handle uncertainties. In this article, the process of representing modelling uncertainties, record-to-record (RTR) variations and cognitive uncertainties for moment-resisting steel frames of various heights is discussed in detail. RTR uncertainty is handled through incremental dynamic analysis (IDA), modelling uncertainties are considered through the backbone curves and hysteresis loops of the components, and cognitive uncertainty is represented by three levels of material quality. IDA is used to evaluate RTR uncertainty based on strong ground motion records selected by the k-means algorithm, which is favoured over Monte Carlo selection due to its time-saving appeal. Analytical equations of the Response Surface Method are obtained from the IDA results by the Cuckoo algorithm, which predicts the mean and standard deviation of the collapse fragility curve. The Takagi-Sugeno-Kang model is used to represent material quality based on the response surface coefficients. Finally, collapse fragility curves with the various sources of uncertainties mentioned are derived through a large number of material quality values and meta variables inferred by the Takagi-Sugeno-Kang fuzzy model based on response surface method coefficients. It is concluded that a better risk management strategy in countries where material quality control is weak is to account for cognitive uncertainties in the fragility curves and the mean annual frequency.
Environmental diversity as a surrogate for species representation.
Beier, Paul; de Albuquerque, Fábio Suzart
2015-10-01
Because many species have not been described and most species ranges have not been mapped, conservation planners often use surrogates for conservation planning, but evidence for surrogate effectiveness is weak. Surrogates are well-mapped features such as soil types, landforms, occurrences of an easily observed taxon (discrete surrogates), and well-mapped environmental conditions (continuous surrogate). In the context of reserve selection, the idea is that a set of sites selected to span diversity in the surrogate will efficiently represent most species. Environmental diversity (ED) is a rarely used surrogate that selects sites to efficiently span multivariate ordination space. Because it selects across continuous environmental space, ED should perform better than discrete surrogates (which necessarily ignore within-bin and between-bin heterogeneity). Despite this theoretical advantage, ED appears to have performed poorly in previous tests of its ability to identify 50 × 50 km cells that represented vertebrates in Western Europe. Using an improved implementation of ED, we retested ED on Western European birds, mammals, reptiles, amphibians, and combined terrestrial vertebrates. We also tested ED on data sets for plants of Zimbabwe, birds of Spain, and birds of Arizona (United States). Sites selected using ED represented European mammals no better than randomly selected cells, but they represented species in the other 7 data sets with 20% to 84% effectiveness. This far exceeds the performance in previous tests of ED, and exceeds the performance of most discrete surrogates. We believe ED performed poorly in previous tests because those tests considered only a few candidate explanatory variables and used suboptimal forms of ED's selection algorithm. We suggest future work on ED focus on analyses at finer grain sizes more relevant to conservation decisions, explore the effect of selecting the explanatory variables most associated with species turnover, and investigate whether nonclimate abiotic variables can provide useful surrogates in an ED framework. © 2015 Society for Conservation Biology.
Local search to improve coordinate-based task mapping
Balzuweit, Evan; Bunde, David P.; Leung, Vitus J.; ...
2015-10-31
We present a local search strategy to improve the coordinate-based mapping of a parallel job’s tasks to the MPI ranks of its parallel allocation in order to reduce network congestion and the job’s communication time. The goal is to reduce the number of network hops between communicating pairs of ranks. Our target is applications with a nearest-neighbor stencil communication pattern running on mesh systems with non-contiguous processor allocation, such as Cray XE and XK Systems. Utilizing the miniGhost mini-app, which models the shock physics application CTH, we demonstrate that our strategy reduces application running time while also reducing runtime variability. Furthermore, we show that mapping quality can vary based on the selected allocation algorithm, even between allocation algorithms of similar apparent quality.
Probabilistic Structural Analysis Methods (PSAM) for select space propulsion system components
NASA Technical Reports Server (NTRS)
1991-01-01
This annual report summarizes the work completed during the third year of technical effort on the referenced contract. Principal developments continue to focus on the Probabilistic Finite Element Method (PFEM) which has been under development for three years. Essentially all of the linear capabilities within the PFEM code are in place. Major progress in the application or verifications phase was achieved. An EXPERT module architecture was designed and partially implemented. EXPERT is a user interface module which incorporates an expert system shell for the implementation of a rule-based interface utilizing the experience and expertise of the user community. The Fast Probability Integration (FPI) Algorithm continues to demonstrate outstanding performance characteristics for the integration of probability density functions for multiple variables. Additionally, an enhanced Monte Carlo simulation algorithm was developed and demonstrated for a variety of numerical strategies.
ERIC Educational Resources Information Center
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
ERIC Educational Resources Information Center
Stanford Univ., CA. School Mathematics Study Group.
This is the second unit of a 15-unit School Mathematics Study Group (SMSG) mathematics text for high school students. Topics presented in the first chapter (Informal Algorithms and Flow Charts) include: changing a flat tire; algorithms, flow charts, and computers; assignment and variables; input and output; using a variable as a counter; decisions…
A parallel variable metric optimization algorithm
NASA Technical Reports Server (NTRS)
Straeter, T. A.
1973-01-01
An algorithm designed to exploit the parallel computing or vector-streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariate minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.
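The cycle described in this abstract can be sketched in a few lines; the version below runs the p gradient evaluations serially in a loop (they are independent and could be dispatched in parallel), uses a symmetric rank-one update for the metric, and a crude grid line search. These choices are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def pvm_cycle(f, grad, x, H, p, step=1e-2, seed=0):
    """One cycle: p gradient evaluations (parallelizable), p rank-one metric
    corrections, then a single line minimization along the Newton-like direction."""
    rng = np.random.default_rng(seed)
    g0 = grad(x)

    # 1) gradient information at p displaced points (independent, parallelizable work)
    dirs = [step * rng.standard_normal(x.size) for _ in range(p)]
    grads = [grad(x + d) for d in dirs]

    # 2) p symmetric rank-one corrections to the inverse-Hessian approximation H
    for d, g in zip(dirs, grads):
        yk = g - g0
        v = d - H @ yk
        denom = v @ yk
        if abs(denom) > 1e-12 * np.linalg.norm(v) * np.linalg.norm(yk):
            H = H + np.outer(v, v) / denom

    # 3) single univariate minimization along -H g0 (simple grid search here)
    direction = -H @ g0
    ts = np.linspace(0.0, 2.0, 101)[1:]
    t_best = min(ts, key=lambda t: f(x + t * direction))
    return x + t_best * direction, H
```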
NASA Technical Reports Server (NTRS)
Suarez, Max J. (Editor); Chang, Alfred T. C.; Chiu, Long S.
1997-01-01
Seventeen months of rainfall data (August 1987-December 1988) from nine satellite rainfall algorithms (Adler, Chang, Kummerow, Prabhakara, Huffman, Spencer, Susskind, and Wu) were analyzed to examine the uncertainty of satellite-derived rainfall estimates. The variability among algorithms, measured as the standard deviation computed from the ensemble of algorithms, shows that regions of high algorithm variability tend to coincide with regions of high rain rates. Histograms of pattern correlation (PC) between algorithms suggest a bimodal distribution, with separation at a PC-value of about 0.85. Applying this threshold as a criterion for similarity, our analyses show that algorithms using the same sensor or satellite input tend to be similar, suggesting the dominance of sampling errors in these satellite estimates.
NASA Astrophysics Data System (ADS)
Zhang, Chen; Ni, Zhiwei; Ni, Liping; Tang, Na
2016-10-01
Feature selection is an important method of data preprocessing in data mining. In this paper, a novel feature selection method based on multi-fractal dimension and harmony search algorithm is proposed. Multi-fractal dimension is adopted as the evaluation criterion of feature subset, which can determine the number of selected features. An improved harmony search algorithm is used as the search strategy to improve the efficiency of feature selection. The performance of the proposed method is compared with that of other feature selection algorithms on UCI data-sets. Besides, the proposed method is also used to predict the daily average concentration of PM2.5 in China. Experimental results show that the proposed method can obtain competitive results in terms of both prediction accuracy and the number of selected features.
Resolving combinatorial ambiguities in dilepton tt̄ event topologies with constrained M2 variables
NASA Astrophysics Data System (ADS)
Debnath, Dipsikha; Kim, Doojin; Kim, Jeong Han; Kong, Kyoungchul; Matchev, Konstantin T.
2017-10-01
We advocate the use of on-shell constrained M2 variables in order to mitigate the combinatorial problem in supersymmetry-like events with two invisible particles at the LHC. We show that in comparison to other approaches in the literature, the constrained M2 variables provide superior ansätze for the unmeasured invisible momenta and therefore can be usefully applied to discriminate combinatorial ambiguities. We illustrate our procedure with the example of dilepton tt̄ events. We critically review the existing methods based on the Cambridge MT2 variable and MAOS reconstruction of invisible momenta, and show that their algorithm can be simplified without loss of sensitivity, due to a perfect correlation between events with complex solutions for the invisible momenta and events exhibiting a kinematic endpoint violation. Then we demonstrate that the efficiency for selecting the correct partition is further improved by utilizing the M2 variables instead. Finally, we also consider the general case when the underlying mass spectrum is unknown, and no kinematic endpoint information is available.
Framework for adaptive multiscale analysis of nonhomogeneous point processes.
Helgason, Hannes; Bartroff, Jay; Abry, Patrice
2011-01-01
We develop the methodology for hypothesis testing and model selection in nonhomogeneous Poisson processes, with an eye toward the application of modeling and variability detection in heart beat data. Modeling the process' non-constant rate function using templates of simple basis functions, we develop the generalized likelihood ratio statistic for a given template and a multiple testing scheme to model-select from a family of templates. A dynamic programming algorithm inspired by network flows is used to compute the maximum likelihood template in a multiscale manner. In a numerical example, the proposed procedure is nearly as powerful as the super-optimal procedures that know the true template size and true partition, respectively. Extensions to general history-dependent point processes are discussed.
Patients classification on weaning trials using neural networks and wavelet transform.
Arizmendi, Carlos; Viviescas, Juan; González, Hernando; Giraldo, Beatriz
2014-01-01
Determining the optimal time to wean patients from mechanical ventilation, distinguishing patients capable of maintaining spontaneous breathing from those who fail to do so, is a very important task in the intensive care unit. Wavelet Transform (WT) and Neural Network (NN) techniques were applied in order to develop a classifier for the study of patients in the weaning trial process. The respiratory pattern of each patient was characterized through different time series. Genetic Algorithms (GA) and Forward Selection were used as feature selection techniques. A classification performance of 77.00±0.06% well-classified patients was obtained using a NN and GA combination, with only 6 of the initial 14 variables.
The Time-domain Spectroscopic Survey: Target Selection for Repeat Spectroscopy
NASA Astrophysics Data System (ADS)
MacLeod, Chelsea L.; Green, Paul J.; Anderson, Scott F.; Eracleous, Michael; Ruan, John J.; Runnoe, Jessie; Nielsen Brandt, William; Badenes, Carles; Greene, Jenny; Morganson, Eric; Schmidt, Sarah J.; Schwope, Axel; Shen, Yue; Amaro, Rachael; Lebleu, Amy; Filiz Ak, Nurten; Grier, Catherine J.; Hoover, Daniel; McGraw, Sean M.; Dawson, Kyle; Hall, Patrick B.; Hawley, Suzanne L.; Mariappan, Vivek; Myers, Adam D.; Pâris, Isabelle; Schneider, Donald P.; Stassun, Keivan G.; Bershady, Matthew A.; Blanton, Michael R.; Seo, Hee-Jong; Tinker, Jeremy; Fernández-Trincado, J. G.; Chambers, Kenneth; Kaiser, Nick; Kudritzki, R.-P.; Magnier, Eugene; Metcalfe, Nigel; Waters, Chris Z.
2018-01-01
As astronomers increasingly exploit the information available in the time domain, spectroscopic variability in particular opens broad new channels of investigation. Here we describe the selection algorithms for all targets intended for repeat spectroscopy in the Time Domain Spectroscopic Survey (TDSS), part of the extended Baryon Oscillation Spectroscopic Survey within the Sloan Digital Sky Survey (SDSS)-IV. Also discussed are the scientific rationale and technical constraints leading to these target selections. The TDSS includes a large “repeat quasar spectroscopy” (RQS) program delivering ∼13,000 repeat spectra of confirmed SDSS quasars, and several smaller “few-epoch spectroscopy” (FES) programs targeting specific classes of quasars as well as stars. The RQS program aims to provide a large and diverse quasar data set for studying variations in quasar spectra on timescales of years, a comparison sample for the FES quasar programs, and an opportunity for discovering rare, serendipitous events. The FES programs cover a wide variety of phenomena in both quasars and stars. Quasar FES programs target broad absorption line quasars, high signal-to-noise ratio normal broad line quasars, quasars with double-peaked or very asymmetric broad emission line profiles, binary supermassive black hole candidates, and the most photometrically variable quasars. Strongly variable stars are also targeted for repeat spectroscopy, encompassing many types of eclipsing binary systems, and classical pulsators like RR Lyrae. Other stellar FES programs allow spectroscopic variability studies of active ultracool dwarf stars, dwarf carbon stars, and white dwarf/M dwarf spectroscopic binaries. We present example TDSS spectra and describe anticipated sample sizes and results.
Aguirre-Gutiérrez, Jesús; Carvalheiro, Luísa G; Polce, Chiara; van Loon, E Emiel; Raes, Niels; Reemer, Menno; Biesmeijer, Jacobus C
2013-01-01
Understanding species distributions and the factors limiting them is an important topic in ecology and conservation, including in nature reserve selection and predicting climate change impacts. While Species Distribution Models (SDM) are the main tool used for these purposes, choosing the best SDM algorithm is not straightforward as these are plentiful and can be applied in many different ways. SDM are used mainly to gain insight in 1) overall species distributions, 2) their past-present-future probability of occurrence and/or 3) to understand their ecological niche limits (also referred to as ecological niche modelling). The fact that these three aims may require different models and outputs is, however, rarely considered and has not been evaluated consistently. Here we use data from a systematically sampled set of species occurrences to specifically test the performance of Species Distribution Models across several commonly used algorithms. Species range in distribution patterns from rare to common and from local to widespread. We compare overall model fit (representing species distribution), the accuracy of the predictions at multiple spatial scales, and the consistency in selection of environmental correlations all across multiple modelling runs. As expected, the choice of modelling algorithm determines model outcome. However, model quality depends not only on the algorithm, but also on the measure of model fit used and the scale at which it is used. Although model fit was higher for the consensus approach and Maxent, Maxent and GAM models were more consistent in estimating local occurrence, while RF and GBM showed higher consistency in environmental variables selection. Model outcomes diverged more for narrowly distributed species than for widespread species. We suggest that matching study aims with modelling approach is essential in Species Distribution Models, and provide suggestions how to do this for different modelling aims and species' data characteristics (i.e. sample size, spatial distribution).
Li, Yongqiang; Abbaspour, Mohammadreza R; Grootendorst, Paul V; Rauth, Andrew M; Wu, Xiao Yu
2015-08-01
This study was performed to optimize the formulation of polymer-lipid hybrid nanoparticles (PLN) for the delivery of an ionic water-soluble drug, verapamil hydrochloride (VRP) and to investigate the roles of formulation factors. Modeling and optimization were conducted based on a spherical central composite design. Three formulation factors, i.e., weight ratio of drug to lipid (X1), and concentrations of Tween 80 (X2) and Pluronic F68 (X3), were chosen as independent variables. Drug loading efficiency (Y1) and mean particle size (Y2) of PLN were selected as dependent variables. The predictive performance of artificial neural networks (ANN) and the response surface methodology (RSM) were compared. As ANN was found to exhibit better recognition and generalization capability over RSM, multi-objective optimization of PLN was then conducted based upon the validated ANN models and continuous genetic algorithms (GA). The optimal PLN possess a high drug loading efficiency (92.4%, w/w) and a small mean particle size (∼100nm). The predicted response variables matched well with the observed results. The three formulation factors exhibited different effects on the properties of PLN. ANN in coordination with continuous GA represent an effective and efficient approach to optimize the PLN formulation of VRP with desired properties. Copyright © 2015 Elsevier B.V. All rights reserved.
Lin, Kuan-Cheng; Hsieh, Yi-Hsiu
2015-10-01
The classification and analysis of data is an important issue in today's research. Selecting a suitable set of features makes it possible to classify an enormous quantity of data quickly and efficiently. Feature selection is generally viewed as a problem of feature subset selection, such as combination optimization problems. Evolutionary algorithms using random search methods have proven highly effective in obtaining solutions to problems of optimization in a diversity of applications. In this study, we developed a hybrid evolutionary algorithm based on endocrine-based particle swarm optimization (EPSO) and artificial bee colony (ABC) algorithms in conjunction with a support vector machine (SVM) for the selection of optimal feature subsets for the classification of datasets. The results of experiments using specific UCI medical datasets demonstrate that the accuracy of the proposed hybrid evolutionary algorithm is superior to that of basic PSO, EPSO and ABC algorithms, with regard to classification accuracy using subsets with a reduced number of features.
Well-Tempered Metadynamics: A Smoothly Converging and Tunable Free-Energy Method
NASA Astrophysics Data System (ADS)
Barducci, Alessandro; Bussi, Giovanni; Parrinello, Michele
2008-01-01
We present a method for determining the free-energy dependence on a selected number of collective variables using an adaptive bias. The formalism provides a unified description which has metadynamics and canonical sampling as limiting cases. Convergence and errors can be rigorously and easily controlled. The parameters of the simulation can be tuned so as to focus the computational effort only on the physically relevant regions of the order parameter space. The algorithm is tested on the reconstruction of an alanine dipeptide free-energy landscape.
Well-tempered metadynamics: a smoothly converging and tunable free-energy method.
Barducci, Alessandro; Bussi, Giovanni; Parrinello, Michele
2008-01-18
We present a method for determining the free-energy dependence on a selected number of collective variables using an adaptive bias. The formalism provides a unified description which has metadynamics and canonical sampling as limiting cases. Convergence and errors can be rigorously and easily controlled. The parameters of the simulation can be tuned so as to focus the computational effort only on the physically relevant regions of the order parameter space. The algorithm is tested on the reconstruction of an alanine dipeptide free-energy landscape.
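A one-dimensional sketch of the well-tempered bias deposition described in the two abstracts above: each Gaussian hill is scaled by exp(-V_bias/ΔT) (reduced units with k_B = 1), so the bias converges smoothly instead of growing without bound. The overdamped Langevin propagator, the double-well example, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def well_tempered_metadynamics(force, kT=1.0, dT=9.0, w0=0.5, sigma=0.2,
                               n_steps=50_000, deposit_every=500,
                               dt=5e-3, gamma=1.0, x0=-1.0, seed=0):
    """Deposit Gaussian hills of height w0*exp(-V_bias(s)/dT) on a 1D collective
    variable driven by overdamped Langevin dynamics; `force` is the physical -dU/ds."""
    rng = np.random.default_rng(seed)
    centers, heights = [], []

    def bias(s):
        if not centers:
            return 0.0
        c, h = np.asarray(centers), np.asarray(heights)
        return float(np.sum(h * np.exp(-(s - c) ** 2 / (2 * sigma ** 2))))

    def bias_force(s):  # -dV_bias/ds
        if not centers:
            return 0.0
        c, h = np.asarray(centers), np.asarray(heights)
        return float(np.sum(h * np.exp(-(s - c) ** 2 / (2 * sigma ** 2))
                            * (s - c) / sigma ** 2))

    x = x0
    for step in range(n_steps):
        if step % deposit_every == 0:
            heights.append(w0 * np.exp(-bias(x) / dT))  # well-tempered height rule
            centers.append(x)
        noise = np.sqrt(2 * kT * dt / gamma) * rng.standard_normal()
        x += dt / gamma * (force(x) + bias_force(x)) + noise
    return np.asarray(centers), np.asarray(heights)

# Example: double-well potential U(s) = (s^2 - 1)^2, so force = -4 s (s^2 - 1)
centers, heights = well_tempered_metadynamics(lambda s: -4.0 * s * (s ** 2 - 1.0))
```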
Dynamic ocean provinces: a multi-sensor approach to global marine ecophysiology
NASA Astrophysics Data System (ADS)
Dowell, M.; Campbell, J.; Moore, T.
The concept of oceanic provinces or domains has existed for well over a century. Such systems, whether real or only conceptual, provide a useful framework for understanding the mechanisms controlling biological, physical and chemical processes and their interactions. Criteria have been established for defining provinces based on physical forcings, availability of light and nutrients, complexity of the marine food web, and other factors. In general, such classification systems reflect the heterogeneous nature of the ocean environment, and the effort of scientists to comprehend the whole system by understanding its various homogeneous components. If provinces are defined strictly on the basis of geospatial or temporal criteria (e.g., latitude zones, bathymetry, or season), the resulting maps exhibit discontinuities that are uncharacteristic of the ocean. While this may be useful for many purposes, it is unsatisfactory in that it does not capture the dynamic nature of fluid boundaries in the ocean. Boundaries fixed in time and space do not allow us to observe interannual or longer-term variability (e.g., regime shifts) that may result from climate change. The current study illustrates the potential of using fuzzy logic as a means of classifying the ocean into objectively defined provinces using properties measurable from satellite sensors (MODIS and SeaWiFS). This approach accommodates the dynamic variability of provinces which can be updated as each image is processed. We adopt this classification as the basis for parameterizing specific algorithms for each of the classes. Once the class-specific algorithms have been applied, retrievals are then recomposed into a single blended product based on the "weighted" fuzzy memberships. This will be demonstrated through animations of multi-year time series of monthly composites of the individual classes or provinces. The provinces themselves are identified on the basis of global fields of chlorophyll, sea surface temperature and PAR which will also be subsequently used to parameterize primary production (PP) algorithms. Two applications of the proposed dynamic classification are presented. The first applies different peer-reviewed PP algorithms to the different classes and objectively evaluates their performance to select the algorithm which performs best, and then merges results into a single primary production product. A second application illustrates the variability of P-I parameters in each province and analyzes province-specific variability in the quantum yield of photosynthesis. Finally, results illustrating how this approach is implemented in estimating global oceanic primary production are presented.
NASA Astrophysics Data System (ADS)
Lohvithee, Manasavee; Biguri, Ander; Soleimani, Manuchehr
2017-12-01
There are a number of powerful total variation (TV) regularization methods that have great promise in limited data cone-beam CT reconstruction with an enhancement of image quality. These promising TV methods require careful selection of the image reconstruction parameters, for which there are no well-established criteria. This paper presents a comprehensive evaluation of parameter selection in a number of major TV-based reconstruction algorithms. An appropriate way of selecting the values for each individual parameter has been suggested. Finally, a new adaptive-weighted projection-controlled steepest descent (AwPCSD) algorithm is presented, which implements the edge-preserving function for CBCT reconstruction with limited data. The proposed algorithm shows significant robustness compared to three other existing algorithms: ASD-POCS, AwASD-POCS and PCSD. The proposed AwPCSD algorithm is able to preserve the edges of the reconstructed images better with fewer sensitive parameters to tune.
Ma, Li; Fan, Suohai
2017-03-14
The random forests algorithm is a type of classifier with prominent universality, a wide application range, and robustness for avoiding overfitting. But there are still some drawbacks to random forests. Therefore, to improve the performance of random forests, this paper seeks to improve imbalanced data processing, feature selection and parameter optimization. We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that the combination of Clustering Using Representatives (CURE) enhances the original synthetic minority oversampling technique (SMOTE) algorithms effectively compared with the classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, the hybrid RF (random forests) algorithm has been proposed for feature selection and parameter optimization, which uses the minimum out of bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms, hybrid genetic-random forests algorithm, hybrid particle swarm-random forests algorithm and hybrid fish swarm-random forests algorithm can achieve the minimum OOB error and show the best generalization ability. The training set produced from the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. Thus, better classification results are produced from this feasible and effective algorithm. Moreover, the hybrid algorithm's F-value, G-mean, AUC and OOB scores demonstrate that they surpass the performance of the original RF algorithm. Hence, this hybrid algorithm provides a new way to perform feature selection and parameter optimization.
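A minimal sketch of one idea from this abstract, tuning a random forest by minimising its out-of-bag error, using scikit-learn's `oob_score_` and a plain random search in place of the genetic, particle-swarm, and fish-swarm optimisers used in the paper; the hyperparameter ranges are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tune_rf_by_oob(X, y, n_trials=30, seed=0):
    """Random-search RF tuning with the out-of-bag (OOB) error as the objective."""
    rng = np.random.default_rng(seed)
    best_err, best_params = np.inf, None
    for _ in range(n_trials):
        params = {
            "n_estimators": int(rng.integers(100, 500)),
            "max_features": float(rng.uniform(0.2, 1.0)),
            "min_samples_leaf": int(rng.integers(1, 10)),
        }
        rf = RandomForestClassifier(oob_score=True, bootstrap=True,
                                    random_state=0, **params)
        rf.fit(X, y)
        oob_error = 1.0 - rf.oob_score_   # objective to minimise
        if oob_error < best_err:
            best_err, best_params = oob_error, params
    return best_params, best_err
```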
Omran, Dalia Abd El Hamid; Awad, AbuBakr Hussein; Mabrouk, Mahasen Abd El Rahman; Soliman, Ahmad Fouad; Aziz, Ashraf Omar Abdel
2015-01-01
Hepatocellular carcinoma (HCC) is the second most common malignancy in Egypt. Data mining is a method of predictive analysis which can explore tremendous volumes of information to discover hidden patterns and relationships. Our aim here was to develop a non-invasive algorithm for prediction of HCC. Such an algorithm should be economical, reliable, easy to apply and acceptable by domain experts. This cross-sectional study enrolled 315 patients with hepatitis C virus (HCV) related chronic liver disease (CLD); 135 HCC, 116 cirrhotic patients without HCC and 64 patients with chronic hepatitis C. Using data mining analysis, we constructed a decision tree learning algorithm to predict HCC. The decision tree algorithm was able to predict HCC with recall (sensitivity) of 83.5% and precision (specificity) of 83.3% using only routine data. The correctly classified instances were 259 (82.2%), and the incorrectly classified instances were 56 (17.8%). Out of 29 attributes, serum alpha fetoprotein (AFP), with an optimal cutoff value of ≥50.3 ng/ml, was selected as the best predictor of HCC. To a lesser extent, male sex, presence of cirrhosis, AST >64 U/L, and ascites were variables associated with HCC. Data mining analysis allows discovery of hidden patterns and enables the development of models to predict HCC, utilizing routine data as an alternative to CT and liver biopsy. This study has highlighted a new cutoff for AFP (≥50.3 ng/ml). Presence of a score of >2 risk variables (out of 5) can successfully predict HCC with a sensitivity of 96% and specificity of 82%.
A novel artificial immune clonal selection classification and rule mining with swarm learning model
NASA Astrophysics Data System (ADS)
Al-Sheshtawi, Khaled A.; Abdul-Kader, Hatem M.; Elsisi, Ashraf B.
2013-06-01
Metaheuristic optimisation algorithms have become a popular choice for solving complex problems. By integrating the Artificial Immune clonal selection algorithm (CSA) and the particle swarm optimisation (PSO) algorithm, a novel hybrid Clonal Selection Classification and Rule Mining with Swarm Learning Algorithm (CS2) is proposed. The main goal of the approach is to exploit and explore the parallel computation merit of Clonal Selection and the speed and self-organisation merits of Particle Swarm by sharing information between the clonal selection population and the particle swarm. Hence, we employed the advantages of PSO to improve the mutation mechanism of the artificial immune CSA and to mine classification rules within datasets. Consequently, our proposed algorithm required less training time and fewer memory cells in comparison to other AIS algorithms. In this paper, classification rule mining has been modelled as a multiobjective optimisation problem with predictive accuracy. The multiobjective approach is intended to allow the PSO algorithm to return an approximation to the accuracy and comprehensibility border, containing solutions that are spread across the border. We compared the classification accuracy of our proposed CS2 algorithm with that of five commonly used CSAs, namely AIRS1, AIRS2, AIRS-Parallel, CLONALG, and CSCA, using eight benchmark datasets. We also compared the classification accuracy of CS2 with that of five other methods, namely Naïve Bayes, SVM, MLP, CART, and RFB. The results show that the proposed algorithm is comparable to the 10 studied algorithms. As a result, the hybridisation of CSA and PSO allows each algorithm to develop its respective merits and compensate for the other's weaknesses, improving both the quality and the speed of the search.
Large space structures control algorithm characterization
NASA Technical Reports Server (NTRS)
Fogel, E.
1983-01-01
Feedback control algorithms are developed for sensor/actuator pairs on large space systems. These algorithms have been sized in terms of (1) floating point operation (FLOP) demands; (2) storage for variables; and (3) input/output data flow. FLOP sizing (per control cycle) was done as a function of the number of control states and the number of sensor/actuator pairs. Storage for variables and I/O sizing was done for specific structure examples.
A Machine Learning Framework for Plan Payment Risk Adjustment.
Rose, Sherri
2016-12-01
To introduce cross-validation and a nonparametric machine learning framework for plan payment risk adjustment and then assess whether they have the potential to improve risk adjustment. 2011-2012 Truven MarketScan database. We compare the performance of multiple statistical approaches within a broad machine learning framework for estimation of risk adjustment formulas. Total annual expenditure was predicted using age, sex, geography, inpatient diagnoses, and hierarchical condition category variables. The methods included regression, penalized regression, decision trees, neural networks, and an ensemble super learner, all in concert with screening algorithms that reduce the set of variables considered. The performance of these methods was compared based on cross-validated R². Our results indicate that a simplified risk adjustment formula selected via this nonparametric framework maintains much of the efficiency of a traditional larger formula. The ensemble approach also outperformed classical regression and all other algorithms studied. The implementation of cross-validated machine learning techniques provides novel insight into risk adjustment estimation, possibly allowing for a simplified formula, thereby reducing incentives for increased coding intensity as well as the ability of insurers to "game" the system with aggressive diagnostic upcoding. © Health Research and Educational Trust.
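A hedged sketch of comparing candidate risk-adjustment estimators by cross-validated R² and adding a stacked ensemble as a rough analogue of the super learner; the candidate set, hyperparameters, and scoring below are illustrative and not those of the study.

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

def compare_risk_adjusters(X, y, cv=5):
    """Return mean cross-validated R^2 for each candidate and a stacked ensemble."""
    candidates = {
        "ols": LinearRegression(),
        "lasso": LassoCV(cv=cv),
        "tree": DecisionTreeRegressor(max_depth=6, random_state=0),
        "forest": RandomForestRegressor(n_estimators=300, random_state=0),
        "mlp": make_pipeline(StandardScaler(),
                             MLPRegressor(hidden_layer_sizes=(64,),
                                          max_iter=2000, random_state=0)),
    }
    # Stacked ensemble built from the same base learners (super-learner analogue)
    candidates["ensemble"] = StackingRegressor(
        estimators=list(candidates.items()),
        final_estimator=LinearRegression(), cv=cv)
    return {name: cross_val_score(est, X, y, cv=cv, scoring="r2").mean()
            for name, est in candidates.items()}
```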
Exchange inlet optimization by genetic algorithm for improved RBCC performance
NASA Astrophysics Data System (ADS)
Chorkawy, G.; Etele, J.
2017-09-01
A genetic algorithm based on real parameter representation using a variable selection pressure and variable probability of mutation is used to optimize an annular air breathing rocket inlet called the Exchange Inlet. A rapid and accurate design method which provides estimates for air breathing, mixing, and isentropic flow performance is used as the engine of the optimization routine. Comparison to detailed numerical simulations shows that the design method yields desired exit Mach numbers to within approximately 1% over 75% of the annular exit area and predicts entrained air massflows to between 1% and 9% of numerically simulated values depending on the flight condition. Optimum designs are shown to be obtained within approximately 8000 fitness function evaluations in a search space on the order of 10⁶. The method is also shown to be able to identify beneficial values for particular alleles when they exist while showing the ability to handle cases where physical and aphysical designs co-exist at particular values of a subset of alleles within a gene. For an air breathing engine based on a hydrogen fuelled rocket an exchange inlet is designed which yields a predicted air entrainment ratio within 95% of the theoretical maximum.
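A compact real-coded genetic algorithm sketch in the spirit of the optimizer described above, with a tournament size that grows over the run (variable selection pressure) and a per-gene mutation probability that decays; the schedules, operators, and the minimisation convention are assumptions, not the paper's implementation.

```python
import numpy as np

def real_coded_ga(fitness, bounds, pop_size=60, generations=200, seed=0):
    """Minimise `fitness` over real-valued genes constrained to `bounds` [(lo, hi), ...]."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    pop = rng.uniform(lo, hi, size=(pop_size, lo.size))
    fit = np.array([fitness(ind) for ind in pop])

    for gen in range(generations):
        frac = gen / max(generations - 1, 1)
        tour = 2 + int(round(4 * frac))        # selection pressure grows from 2 to 6
        p_mut = 0.20 * (1.0 - frac) + 0.02     # mutation probability decays to 0.02

        def select():
            idx = rng.integers(0, pop_size, tour)      # tournament selection
            return pop[idx[np.argmin(fit[idx])]]

        children = np.empty_like(pop)
        for i in range(pop_size):
            p1, p2 = select(), select()
            u = rng.uniform(size=lo.size)              # arithmetic crossover
            child = u * p1 + (1.0 - u) * p2
            mask = rng.uniform(size=lo.size) < p_mut   # per-gene mutation
            child += mask * rng.normal(0.0, 0.1 * (hi - lo))
            children[i] = np.clip(child, lo, hi)

        child_fit = np.array([fitness(ind) for ind in children])
        both = np.vstack([pop, children])              # elitist survivor selection
        both_fit = np.concatenate([fit, child_fit])
        keep = np.argsort(both_fit)[:pop_size]
        pop, fit = both[keep], both_fit[keep]

    return pop[np.argmin(fit)], float(fit.min())
```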
Barua, Shaibal; Begum, Shahina; Ahmed, Mobyen Uddin
2015-01-01
Machine learning algorithms play an important role in computer science research. Recent advancements in sensor data collection in the clinical sciences lead to complex, heterogeneous data processing and analysis for patient diagnosis and prognosis. Diagnosis and treatment of patients based on manual analysis of these sensor data are difficult and time consuming. Therefore, the development of knowledge-based systems to support clinicians in decision-making is important. However, it is necessary to perform experimental work to compare the performance of different machine learning methods and help select an appropriate method for the specific characteristics of a data set. This paper compares the classification performance of three popular machine learning methods, i.e., case-based reasoning, neural networks and support vector machines, in diagnosing the stress of vehicle drivers using finger temperature and heart rate variability. The experimental results show that case-based reasoning outperforms the other two methods in terms of classification accuracy. Case-based reasoning achieved 80% and 86% accuracy in classifying stress using finger temperature and heart rate variability, respectively. In contrast, both the neural network and the support vector machine achieved less than 80% accuracy using both physiological signals.
[Gaussian process regression and its application in near-infrared spectroscopy analysis].
Feng, Ai-Ming; Fang, Li-Min; Lin, Min
2011-06-01
Gaussian process (GP) regression is applied in the present paper as a chemometric method to explore the complicated relationship between near infrared (NIR) spectra and ingredients. After outliers were detected by the Monte Carlo cross validation (MCCV) method and removed from the dataset, different preprocessing methods, such as multiplicative scatter correction (MSC), smoothing and derivatives, were tried for the best performance of the models. Furthermore, uninformative variable elimination (UVE) was introduced as a variable selection technique and the characteristic wavelengths obtained were further employed as input for modeling. A public dataset with 80 NIR spectra of corn was introduced as an example for evaluating the new algorithm. The optimal models for oil, starch and protein were obtained by the GP regression method. The performance of the final models was evaluated according to the root mean square error of calibration (RMSEC), root mean square error of cross-validation (RMSECV), root mean square error of prediction (RMSEP) and correlation coefficient (r). The models give good calibration ability with r values above 0.99 and the prediction ability is also satisfactory with r values higher than 0.96. The overall results demonstrate that the GP algorithm is an effective chemometric method and is promising for NIR analysis.
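A minimal sketch of a GP calibration model for NIR spectra using scikit-learn, reporting calibration and prediction correlation coefficients and RMSEP; the RBF-plus-noise kernel is an assumption, and steps such as MCCV outlier removal, MSC preprocessing, and UVE wavelength selection described above would precede this in practice.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split

def gp_nir_model(spectra, concentration, test_size=0.25, seed=0):
    """Fit a GP to (preprocessed) NIR spectra and evaluate on a held-out set."""
    X_cal, X_test, y_cal, y_test = train_test_split(
        spectra, concentration, test_size=test_size, random_state=seed)

    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_cal, y_cal)

    r_cal = np.corrcoef(y_cal, gp.predict(X_cal))[0, 1]     # calibration r
    y_hat = gp.predict(X_test)
    r_pred = np.corrcoef(y_test, y_hat)[0, 1]               # prediction r
    rmsep = float(np.sqrt(np.mean((y_hat - y_test) ** 2)))  # RMSEP
    return gp, r_cal, r_pred, rmsep
```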
Chen, C L; Kaber, D B; Dempsey, P G
2000-06-01
A new and improved method for feedforward neural network (FNN) development for application to data classification problems, such as the prediction of levels of low-back disorder (LBD) risk associated with industrial jobs, is presented. Background on FNN development for data classification is provided along with discussions of previous research and neighborhood (local) solution search methods for hard combinatorial problems. An analytical study is presented which compared the prediction accuracy of an FNN based on an error back-propagation (EBP) algorithm with the accuracy of an FNN developed by considering the results of a local solution search (simulated annealing) for classifying industrial jobs as posing low or high risk for LBDs. The comparison demonstrated superior performance of the FNN generated using the new method. The architecture of this FNN included fewer input (predictor) variables and hidden neurons than the FNN developed based on the EBP algorithm. Independent variable selection methods and the phenomenon of 'overfitting' in FNN (and statistical model) generation for data classification are discussed. The results are supportive of the use of the new approach to FNN development for applications to musculoskeletal disorders and risk forecasting in other domains.
A Comparative Study of Optimization Algorithms for Engineering Synthesis.
1983-03-01
The ADS program demonstrates the flexibility a design engineer would have in selecting an optimization algorithm best suited to solve a particular problem. The ADS library of design optimization algorithms was developed by Vanderplaats.
Noh, Hwayoung; Freisling, Heinz; Assi, Nada; Zamora-Ros, Raul; Achaintre, David; Affret, Aurélie; Mancini, Francesca; Boutron-Ruault, Marie-Christine; Flögel, Anna; Boeing, Heiner; Kühn, Tilman; Schübel, Ruth; Trichopoulou, Antonia; Naska, Androniki; Kritikou, Maria; Palli, Domenico; Pala, Valeria; Tumino, Rosario; Ricceri, Fulvio; Santucci de Magistris, Maria; Cross, Amanda; Slimani, Nadia; Scalbert, Augustin; Ferrari, Pietro
2017-07-25
We identified urinary polyphenol metabolite patterns by a novel algorithm that combines dimension reduction and variable selection methods to explain polyphenol-rich food intake, and compared their respective performance with that of single biomarkers in the European Prospective Investigation into Cancer and Nutrition (EPIC) study. The study included 475 adults from four European countries (Germany, France, Italy, and Greece). Dietary intakes were assessed with 24-h dietary recalls (24-HDR) and dietary questionnaires (DQ). Thirty-four polyphenols were measured by ultra-performance liquid chromatography-electrospray ionization-tandem mass spectrometry (UPLC-ESI-MS-MS) in 24-h urine. Reduced rank regression-based variable importance in projection (RRR-VIP) and least absolute shrinkage and selection operator (LASSO) methods were used to select polyphenol metabolites. Reduced rank regression (RRR) was then used to identify patterns in these metabolites, maximizing the explained variability in intake of pre-selected polyphenol-rich foods. The performance of RRR models was evaluated using internal cross-validation to control for over-optimistic findings from over-fitting. High performance was observed for explaining recent intake (24-HDR) of red wine ( r = 0.65; AUC = 89.1%), coffee ( r = 0.51; AUC = 89.1%), and olives ( r = 0.35; AUC = 82.2%). These metabolite patterns performed better or equally well compared to single polyphenol biomarkers. Neither metabolite patterns nor single biomarkers performed well in explaining habitual intake (as reported in the DQ) of polyphenol-rich foods. This proposed strategy of biomarker pattern identification has the potential of expanding the currently still limited list of available dietary intake biomarkers.
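A hedged sketch of the LASSO step only: standardise the urinary metabolite matrix, let LassoCV shrink uninformative metabolites to zero for one polyphenol-rich food, and report a cross-validated correlation to guard against over-fitting. The RRR-based pattern step is not reproduced, and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def lasso_biomarker_selection(metabolites, intake, names, cv=5):
    """Select metabolites whose LASSO coefficients survive shrinkage when
    explaining reported intake of a single polyphenol-rich food."""
    model = make_pipeline(StandardScaler(), LassoCV(cv=cv, max_iter=50_000))
    model.fit(metabolites, intake)
    coefs = model.named_steps["lassocv"].coef_
    selected = [n for n, c in zip(names, coefs) if c != 0.0]

    # Cross-validated predictions to estimate performance without over-fitting
    pred = cross_val_predict(model, metabolites, intake, cv=cv)
    r = float(np.corrcoef(intake, pred)[0, 1])
    return selected, r
```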
The Cramér-Rao Bounds and Sensor Selection for Nonlinear Systems with Uncertain Observations.
Wang, Zhiguo; Shen, Xiaojing; Wang, Ping; Zhu, Yunmin
2018-04-05
This paper considers the problems of the posterior Cramér-Rao bound and sensor selection for multi-sensor nonlinear systems with uncertain observations. In order to effectively overcome the difficulties caused by uncertainty, we investigate two methods to derive the posterior Cramér-Rao bound. The first method is based on the recursive formula of the Cramér-Rao bound and the Gaussian mixture model. Nevertheless, it needs to compute a complex integral based on the joint probability density function of the sensor measurements and the target state. The computation burden of this method is relatively high, especially in large sensor networks. Inspired by the idea of the expectation maximization algorithm, the second method is to introduce some 0-1 latent variables to deal with the Gaussian mixture model. Since the regularity condition of the posterior Cramér-Rao bound is not satisfied for the discrete uncertain system, we use some continuous variables to approximate the discrete latent variables. Then, a new Cramér-Rao bound can be achieved by a limiting process of the Cramér-Rao bound of the continuous system. It avoids the complex integral, which can reduce the computation burden. Based on the new posterior Cramér-Rao bound, the optimal solution of the sensor selection problem can be derived analytically. Thus, it can be used to deal with sensor selection in large-scale sensor networks. Two typical numerical examples verify the effectiveness of the proposed methods.
NASA Astrophysics Data System (ADS)
Balzarolo, M.; Vescovo, L.; Hammerle, A.; Gianelle, D.; Papale, D.; Tomelleri, E.; Wohlfahrt, G.
2015-05-01
In this paper we explore the skill of hyperspectral reflectance measurements and vegetation indices (VIs) derived from these in estimating carbon dioxide (CO2) fluxes of grasslands. Hyperspectral reflectance data, CO2 fluxes and biophysical parameters were measured at three grassland sites located in European mountain regions using standardized protocols. The relationships between CO2 fluxes, ecophysiological variables, traditional VIs and VIs derived using all two-band combinations of wavelengths available from the whole hyperspectral data space were analysed. We found that VIs derived from hyperspectral data generally explained a large fraction of the variability in the investigated dependent variables but differed in their ability to estimate midday and daily average CO2 fluxes and various derived ecophysiological parameters. Relationships between VIs and CO2 fluxes and ecophysiological parameters were site-specific, likely due to differences in soils, vegetation parameters and environmental conditions. Chlorophyll and water-content-related VIs explained the largest fraction of variability in most of the dependent variables. Band selection based on a combination of a genetic algorithm with random forests (GA-rF) confirmed that it is difficult to select a universal band region suitable across the investigated ecosystems. Our findings have major implications for upscaling terrestrial CO2 fluxes to larger regions and for remote- and proximal-sensing sampling and analysis strategies and call for more cross-site synthesis studies linking ground-based spectral reflectance with ecosystem-scale CO2 fluxes.
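A simple brute-force counterpart to the band-selection problem described above: score every two-band normalized-difference index against the CO2 flux by squared Pearson correlation and keep the best pair. This stands in for, and is much cruder than, the paper's genetic-algorithm/random-forest (GA-rF) selection; the function name and inputs are illustrative.

```python
import numpy as np

def best_two_band_index(reflectance, flux, wavelengths):
    """reflectance: (n_samples, n_bands); flux: (n_samples,); wavelengths: (n_bands,)."""
    n_bands = reflectance.shape[1]
    best = (0.0, None)
    for i in range(n_bands):
        for j in range(i + 1, n_bands):
            # Normalized-difference vegetation-index form for the band pair (i, j)
            vi = (reflectance[:, i] - reflectance[:, j]) / \
                 (reflectance[:, i] + reflectance[:, j] + 1e-12)
            r2 = np.corrcoef(vi, flux)[0, 1] ** 2
            if r2 > best[0]:
                best = (r2, (wavelengths[i], wavelengths[j]))
    return best  # (r^2, (band_i, band_j))
```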
Taha, Zahari; Musa, Rabiu Muazu; P P Abdul Majeed, Anwar; Alim, Muhammad Muaz; Abdullah, Mohamad Razali
2018-02-01
Support Vector Machine (SVM) has been shown to be an effective learning algorithm for classification and prediction. However, the application of SVM for prediction and classification in a specific sport has rarely been used to quantify/discriminate low- and high-performance athletes. The present study classified and predicted high- and low-potential archers from a set of fitness and motor ability variables trained on different SVM kernel algorithms. Fifty youth archers (mean age ± standard deviation of 17.0 ± 0.6 years) drawn from various archery programmes completed a six-arrow shooting score test. Standard fitness and ability measurements, namely hand grip, vertical jump, standing broad jump, static balance, upper muscle strength and core muscle strength, were also recorded. Hierarchical agglomerative cluster analysis (HACA) was used to cluster the archers based on the performance variables tested. SVM models with linear, quadratic, cubic, fine RBF, medium RBF, as well as coarse RBF kernel functions were trained based on the measured performance variables. The HACA clustered the archers into high-potential archers (HPA) and low-potential archers (LPA), respectively. The linear, quadratic, cubic, and medium RBF kernel function models demonstrated excellent classification accuracy of 97.5% with a 2.5% error rate for the prediction of the HPA and the LPA. The findings of this investigation can be valuable to coaches and sports managers for recognising high-potential athletes from a small set of measured fitness and motor ability variables, which would consequently save cost, time and effort during talent identification programmes. Copyright © 2017 Elsevier B.V. All rights reserved.
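A minimal sketch of the cluster-then-classify workflow: Ward-linkage agglomerative clustering stands in for HACA to produce high/low-potential labels, and several SVM kernels are compared by cross-validation. The kernel settings and scaling are assumptions (scikit-learn has no direct "fine/medium/coarse RBF" presets).

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cluster_then_classify(performance_vars, cv=5):
    """Cluster athletes on fitness/motor-ability variables, then compare SVM kernels
    on the resulting two-group labels."""
    X = StandardScaler().fit_transform(performance_vars)
    labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

    kernels = {
        "linear": SVC(kernel="linear"),
        "quadratic": SVC(kernel="poly", degree=2),
        "cubic": SVC(kernel="poly", degree=3),
        "rbf": SVC(kernel="rbf", gamma="scale"),
    }
    scores = {name: cross_val_score(clf, X, labels, cv=cv).mean()
              for name, clf in kernels.items()}
    return labels, scores
```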
An efficient variable projection formulation for separable nonlinear least squares problems.
Gan, Min; Li, Han-Xiong
2014-05-01
We consider in this paper a class of nonlinear least squares problems in which the model can be represented as a linear combination of nonlinear functions. The variable projection algorithm projects the linear parameters out of the problem, leaving the nonlinear least squares problems involving only the nonlinear parameters. To implement the variable projection algorithm more efficiently, we propose a new variable projection functional based on matrix decomposition. The advantage of the proposed formulation is that the size of the decomposed matrix may be much smaller than those of previous ones. The Levenberg-Marquardt algorithm using finite difference method is then applied to minimize the new criterion. Numerical results show that the proposed approach achieves significant reduction in computing time.
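A compact sketch of the variable projection idea itself (not the matrix-decomposition refinement proposed in the paper): the linear coefficients are eliminated by a least-squares solve inside the residual, so only the nonlinear parameters are exposed to the outer solver. The helper `phi(alpha)` returning the basis matrix is an assumed user-supplied function.

```python
import numpy as np
from scipy.optimize import least_squares

def variable_projection_fit(y, phi, alpha0):
    """Fit y ≈ Phi(alpha) c over nonlinear parameters alpha; c is projected out."""
    def residual(alpha):
        Phi = phi(alpha)
        c, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # eliminate linear parameters
        return y - Phi @ c

    result = least_squares(residual, alpha0)  # outer nonlinear least-squares solve
    Phi = phi(result.x)                       # (the paper uses Levenberg-Marquardt)
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return result.x, c

# Example: fit y ≈ c1*exp(-a1*t) + c2*exp(-a2*t) over the nonlinear rates a1, a2
# t = np.linspace(0, 1, 50)
# phi = lambda a: np.column_stack([np.exp(-a[0] * t), np.exp(-a[1] * t)])
```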
Sale, Mark; Sherer, Eric A
2015-01-01
The current algorithm for selecting a population pharmacokinetic/pharmacodynamic model is based on the well-established forward addition/backward elimination method. A central strength of this approach is the opportunity for a modeller to continuously examine the data and postulate new hypotheses to explain observed biases. This algorithm has served the modelling community well, but the model selection process has essentially remained unchanged for the last 30 years. During this time, more robust approaches to model selection have been made feasible by new technology and dramatic increases in computation speed. We review these methods, with emphasis on genetic algorithm approaches and discuss the role these methods may play in population pharmacokinetic/pharmacodynamic model selection. PMID:23772792
Developing operation algorithms for vision subsystems in autonomous mobile robots
NASA Astrophysics Data System (ADS)
Shikhman, M. V.; Shidlovskiy, S. V.
2018-05-01
The paper analyzes algorithms for selecting keypoints in the image for the subsequent automatic detection of people and obstacles. The algorithm is based on the histogram of oriented gradients and the support vector machine method. The combination of these methods allows successful selection of dynamic and static objects. The algorithm can be applied in various autonomous mobile robots.
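A minimal example of the HOG-plus-SVM combination mentioned above, using OpenCV's pretrained pedestrian detector; the paper's own training data, thresholds, and obstacle classes are not reproduced, and the function name and parameters are illustrative.

```python
import cv2

def detect_people(image_path):
    """Run HOG + linear-SVM pedestrian detection and draw the detections."""
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    image = cv2.imread(image_path)
    boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
    for (x, y, w, h) in boxes:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    return image, boxes, weights
```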