NASA Astrophysics Data System (ADS)
Chen, Jie; Brissette, François P.; Lucas-Picher, Philippe
2016-11-01
Given the ever increasing number of climate change simulations being carried out, it has become impractical to use all of them to cover the uncertainty of climate change impacts. Various methods have been proposed to optimally select subsets of a large ensemble of climate simulations for impact studies. However, the behaviour of optimally-selected subsets of climate simulations for climate change impacts is unknown, since the transfer process from climate projections to the impact study world is usually highly non-linear. Consequently, this study investigates the transferability of optimally-selected subsets of climate simulations in the case of hydrological impacts. Two different methods were used for the optimal selection of subsets of climate scenarios, and both were found to be capable of adequately representing the spread of selected climate model variables contained in the original large ensemble. However, in both cases, the optimal subsets had limited transferability to hydrological impacts. To capture a similar variability in the impact model world, many more simulations have to be used than those that are needed to simply cover variability from the climate model variables' perspective. Overall, both optimal subset selection methods were better than random selection when small subsets were selected from a large ensemble for impact studies. However, as the number of selected simulations increased, random selection often performed better than the two optimal methods. To ensure adequate uncertainty coverage, the results of this study imply that selecting as many climate change simulations as possible is the best avenue. Where this was not possible, the two optimal methods were found to perform adequately.
Systematic wavelength selection for improved multivariate spectral analysis
Thomas, Edward V.; Robinson, Mark R.; Haaland, David M.
1995-01-01
Methods and apparatus for determining in a biological material one or more unknown values of at least one known characteristic (e.g., the concentration of an analyte such as glucose in blood, or the concentration of one or more blood gas parameters) with a model based on a set of samples with known values of the known characteristics and a multivariate algorithm using several wavelength subsets. The method includes selecting multiple wavelength subsets, from the electromagnetic spectral region appropriate for determining the known characteristic, for use by an algorithm wherein the selection of wavelength subsets improves the fitness of the model's determination of the unknown values of the known characteristic. The selection process utilizes multivariate search methods that select both predictive and synergistic wavelengths within the range of wavelengths utilized. The fitness of the wavelength subsets is determined by the fitness function F = f(cost, performance). The method includes the steps of: (1) using one or more applications of a genetic algorithm to produce one or more count spectra, with multiple count spectra then combined to produce a combined count spectrum; (2) smoothing the count spectrum; (3) selecting a threshold count from a count spectrum to select those wavelength subsets that optimize the fitness function; and (4) eliminating a portion of the selected wavelength subsets. The determination of the unknown values can be made: (1) noninvasively and in vivo; (2) invasively and in vivo; or (3) in vitro.
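The genetic-algorithm wavelength search above can be illustrated with a toy sketch. This is not the patented method: the concrete fitness function (absolute correlation of the selected-wavelength mean signal with the reference values, minus a size penalty standing in for cost) and all parameter values are illustrative assumptions.

```python
import random

def fitness(mask, spectra, y):
    """Toy fitness F = f(cost, performance): absolute correlation of the
    selected-wavelength mean signal with the reference values y (the
    performance term) minus a per-wavelength penalty (the cost term)."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return -1.0  # an empty subset predicts nothing
    means = [sum(s[i] for i in idx) / len(idx) for s in spectra]
    mm, my = sum(means) / len(means), sum(y) / len(y)
    cov = sum((a - mm) * (b - my) for a, b in zip(means, y))
    sm = sum((a - mm) ** 2 for a in means) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    perf = abs(cov) / (sm * sy) if sm > 0 and sy > 0 else 0.0
    return perf - 0.01 * len(idx)

def ga_select(spectra, y, n_wl, pop_size=20, gens=30, seed=0):
    """Evolve wavelength bitmasks with elitism, one-point crossover,
    and bit-flip mutation; return the fittest mask found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_wl)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda m: fitness(m, spectra, y), reverse=True)
        elite = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_wl)
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:
                j = rng.randrange(n_wl)
                child[j] ^= 1  # mutation: flip one wavelength in or out
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda m: fitness(m, spectra, y))
```

In a full implementation, repeated GA runs would be tallied into count spectra and thresholded as the abstract describes; the sketch stops at a single run.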
Data-driven confounder selection via Markov and Bayesian networks.
Häggström, Jenny
2018-06-01
To unbiasedly estimate a causal effect on an outcome, unconfoundedness is often assumed. If there is sufficient knowledge on the underlying causal structure then existing confounder selection criteria can be used to select subsets of the observed pretreatment covariates, X, sufficient for unconfoundedness, if such subsets exist. Here, estimation of these target subsets is considered when the underlying causal structure is unknown. The proposed method is to model the causal structure by a probabilistic graphical model, for example, a Markov or Bayesian network, estimate this graph from observed data and select the target subsets given the estimated graph. The approach is evaluated by simulation both in a high-dimensional setting where unconfoundedness holds given X and in a setting where unconfoundedness only holds given subsets of X. Several common target subsets are investigated and the selected subsets are compared with respect to accuracy in estimating the average causal effect. The proposed method is implemented with existing software that can easily handle high-dimensional data, in terms of large samples and a large number of covariates. The results from the simulation study show that, if unconfoundedness holds given X, this approach is very successful in selecting the target subsets, outperforming alternative approaches based on random forests and LASSO, and that the subset estimating the target subset containing all causes of outcome yields smallest MSE in the average causal effect estimation. © 2017, The International Biometric Society.
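Once a graph has been estimated, reading off a target subset is a graph traversal. A minimal sketch, assuming the estimated structure is supplied as a parent dictionary; `causes_of_outcome_subset` implements one of the common targets discussed above (all observed causes of the outcome), not the paper's exact estimator.

```python
def ancestors(node, parents):
    """All ancestors of `node` in a DAG given as {child: [parents...]}."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen

def causes_of_outcome_subset(parents, treatment, outcome):
    """Target subset 'all causes of outcome': every observed ancestor of
    the outcome in the estimated graph, excluding the treatment itself."""
    return ancestors(outcome, parents) - {treatment}
```

Usage: for the graph X1 -> T, X1 -> Y, X2 -> Y, T -> Y, the target subset for (T, Y) is {X1, X2}.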
An evaluation of exact methods for the multiple subset maximum cardinality selection problem.
Brusco, Michael J; Köhn, Hans-Friedrich; Steinley, Douglas
2016-05-01
The maximum cardinality subset selection problem requires finding the largest possible subset from a set of objects, such that one or more conditions are satisfied. An important extension of this problem is to extract multiple subsets, where the addition of one more object to a larger subset would always be preferred to increases in the size of one or more smaller subsets. We refer to this as the multiple subset maximum cardinality selection problem (MSMCSP). A recently published branch-and-bound algorithm solves the MSMCSP as a partitioning problem. Unfortunately, the computational requirement associated with the algorithm is often enormous, thus rendering the method infeasible from a practical standpoint. In this paper, we present an alternative approach that successively solves a series of binary integer linear programs to obtain a globally optimal solution to the MSMCSP. Computational comparisons of the methods using published similarity data for 45 food items reveal that the proposed sequential method is computationally far more efficient than the branch-and-bound approach. © 2016 The British Psychological Society.
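The sequential idea can be illustrated without an integer-programming solver. The sketch below substitutes a brute-force search for each binary-ILP stage (a hypothetical pairwise `compatible` predicate stands in for the problem's conditions; feasible only for small item sets), but it mirrors the successive extraction of maximum-cardinality subsets, largest first.

```python
from itertools import combinations

def max_cardinality_subset(items, compatible):
    """Largest subset of `items` whose members are pairwise compatible.
    Brute-force stand-in for one binary-ILP stage: try sizes from
    largest to smallest and return the first feasible subset."""
    for k in range(len(items), 0, -1):
        for cand in combinations(items, k):
            if all(compatible(a, b) for a, b in combinations(cand, 2)):
                return list(cand)
    return []

def msmcsp(items, compatible, n_subsets):
    """Sequential scheme for the multiple subset maximum cardinality
    selection problem: each stage solves one maximum-cardinality
    problem on the items not yet assigned to an earlier subset."""
    remaining, result = list(items), []
    for _ in range(n_subsets):
        best = max_cardinality_subset(remaining, compatible)
        if not best:
            break
        result.append(best)
        remaining = [x for x in remaining if x not in best]
    return result
```

The greedy largest-first order reflects the problem's preference: adding one object to a larger subset always beats growing a smaller one.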
Diagnosis of Chronic Kidney Disease Based on Support Vector Machine by Feature Selection Methods.
Polat, Huseyin; Danaei Mehr, Homay; Cetin, Aydin
2017-04-01
As Chronic Kidney Disease progresses slowly, early detection and effective treatment are the only cure to reduce the mortality rate. Machine learning techniques are gaining significance in medical diagnosis because of their classification ability with high accuracy rates. The accuracy of classification algorithms depends on the use of correct feature selection algorithms to reduce the dimension of datasets. In this study, the Support Vector Machine classification algorithm was used to diagnose Chronic Kidney Disease. To diagnose the disease, two essential types of feature selection methods, namely wrapper and filter approaches, were chosen to reduce the dimension of the Chronic Kidney Disease dataset. In the wrapper approach, the classifier subset evaluator with the greedy stepwise search engine and the wrapper subset evaluator with the Best First search engine were used. In the filter approach, the correlation feature selection subset evaluator with the greedy stepwise search engine and the filtered subset evaluator with the Best First search engine were used. The results showed that the Support Vector Machine classifier using the filtered subset evaluator with the Best First search engine has a higher accuracy rate (98.5%) in the diagnosis of Chronic Kidney Disease than the other selected methods.
Biochemical Sensors Using Carbon Nanotube Arrays
NASA Technical Reports Server (NTRS)
Meyyappan, Meyya (Inventor); Cassell, Alan M. (Inventor); Li, Jun (Inventor)
2011-01-01
Method and system for detecting presence of biomolecules in a selected subset, or in each of several selected subsets, in a fluid. Each of an array of two or more carbon nanotubes ("CNTs") is connected at a first CNT end to one or more electronics devices, each of which senses a selected electrochemical signal that is generated when a target biomolecule in the selected subset becomes attached to a functionalized second end of the CNT, which is covalently bonded with a probe molecule. This approach indicates when target biomolecules in the selected subset are present and indicates presence or absence of target biomolecules in two or more selected subsets. Alternatively, presence or absence of an analyte can be detected.
Adaptive feature selection using v-shaped binary particle swarm optimization.
Teng, Xuyang; Dong, Hongbin; Zhou, Xiurong
2017-01-01
Feature selection is an important preprocessing method in machine learning and data mining. This process can be used not only to reduce the amount of data to be analyzed but also to build models with stronger interpretability based on fewer features. Traditional feature selection methods evaluate the dependency and redundancy of features separately, which leads to a lack of measurement of their combined effect. Moreover, a greedy search considers only the optimization of the current round and thus cannot be a global search. To evaluate the combined effect of different subsets in the entire feature space, an adaptive feature selection method based on V-shaped binary particle swarm optimization is proposed. In this method, the fitness function is constructed using the correlation information entropy. Feature subsets are regarded as individuals in a population, and the feature space is searched using V-shaped binary particle swarm optimization. The above procedure overcomes the hard constraint on the number of features, enables the combined evaluation of each subset as a whole, and improves the search ability of conventional binary particle swarm optimization. The proposed algorithm is an adaptive method with respect to the number of feature subsets. The experimental results show the advantages of optimizing the feature subsets using the V-shaped transfer function and confirm the effectiveness and efficiency of the feature subsets obtained under different classifiers.
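A hedged sketch of the V-shaped update rule described above. The inertia and acceleration constants are illustrative defaults, not the paper's settings, and the fitness evaluation (correlation information entropy in the paper) is omitted.

```python
import math
import random

def v_transfer(v):
    """V-shaped transfer function: maps a velocity to a bit-flip
    probability. Symmetric about zero, unlike the S-shaped sigmoid."""
    return abs(math.tanh(v))

def vbpso_step(position, velocity, pbest, gbest, rng,
               w=0.7, c1=1.5, c2=1.5):
    """One V-shaped binary PSO update for a feature-subset bitmask.
    The V-shaped rule *flips* a bit with probability |tanh(v)| rather
    than sampling it directly, so small velocities preserve the current
    subset and large ones encourage exploration."""
    new_pos, new_vel = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        v = w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        x = 1 - x if rng.random() < v_transfer(v) else x
        new_pos.append(x)
        new_vel.append(v)
    return new_pos, new_vel
```

Note the stability property: when a particle already agrees with both its personal and global bests, its velocity decays and the subset stays fixed.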
The Cross-Entropy Based Multi-Filter Ensemble Method for Gene Selection.
Sun, Yingqiang; Lu, Chengbo; Li, Xiaobo
2018-05-17
Gene expression profiles are characterized by high dimensionality, small sample sizes, and continuous values, and it is a great challenge to use gene expression profile data for the classification of tumor samples. This paper proposes a cross-entropy based multi-filter ensemble (CEMFE) method for microarray data classification. Firstly, multiple filters are used to select the microarray data in order to obtain a plurality of pre-selected feature subsets with different classification abilities. The top N genes with the highest rank of each subset are integrated so as to form a new data set. Secondly, the cross-entropy algorithm is used to remove the redundant data in the data set. Finally, the wrapper method, which is based on forward feature selection, is used to select the best feature subset. The experimental results show that the proposed method is more efficient than other gene selection methods and that it can achieve a higher classification accuracy with fewer characteristic genes.
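The first stage of such a pipeline (rank the genes under each filter, then integrate the top-N of every ranking) can be sketched as follows. The filter scores are assumed given, and the cross-entropy redundancy-removal and wrapper stages are omitted.

```python
def top_n(scores, n):
    """Indices of the n highest-scoring genes under one filter."""
    return sorted(range(len(scores)), key=lambda i: scores[i],
                  reverse=True)[:n]

def multi_filter_union(filter_scores, n):
    """CEMFE-style stage one: integrate the top-N genes from every
    filter's ranking into a single candidate pool. Filters that
    disagree enlarge the pool; redundancy is pruned in later stages."""
    pool = set()
    for scores in filter_scores:
        pool.update(top_n(scores, n))
    return sorted(pool)
```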
A non-linear data mining parameter selection algorithm for continuous variables
Razavi, Marianne; Brady, Sean
2017-01-01
In this article, we propose a new data mining algorithm by which one can both capture the non-linearity in data and find the best subset model. To produce an enhanced subset of the original variables, a preferred selection method should have the potential of adding a supplementary level of regression analysis that would capture complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method that we present here has the potential to produce an optimal subset of variables, rendering the overall process of model selection more efficient. This algorithm introduces interpretable parameters by transforming the original inputs and also provides a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least squares regression framework. This new automatic variable transformation and model selection method could offer an optimal and stable model that minimizes the mean square error and variability, while combining all possible subset selection methodology with the inclusion of variable transformations and interactions. Moreover, this method controls multicollinearity, leading to an optimal set of explanatory variables. PMID:29131829
Two-stage atlas subset selection in multi-atlas based image segmentation.
Zhao, Tingting; Ruan, Dan
2015-06-01
Fast growing access to large databases and cloud stored data presents a unique opportunity for multi-atlas based image segmentation and also presents challenges in heterogeneous atlas quality and computation burden. This work aims to develop a novel two-stage method tailored to the special needs in the face of large atlas collection with varied quality, so that high-accuracy segmentation can be achieved with low computational cost. An atlas subset selection scheme is proposed to substitute a significant portion of the computationally expensive full-fledged registration in the conventional scheme with a low-cost alternative. More specifically, the authors introduce a two-stage atlas subset selection method. In the first stage, an augmented subset is obtained based on a low-cost registration configuration and a preliminary relevance metric; in the second stage, the subset is further narrowed down to a fusion set of desired size, based on full-fledged registration and a refined relevance metric. An inference model is developed to characterize the relationship between the preliminary and refined relevance metrics, and a proper augmented subset size is derived to ensure that the desired atlases survive the preliminary selection with high probability. The performance of the proposed scheme has been assessed with cross validation based on two clinical datasets consisting of manually segmented prostate and brain magnetic resonance images, respectively. The proposed scheme demonstrates end-to-end segmentation performance comparable to that of the conventional single-stage selection method, but with significant computation reduction. Compared with the alternative computation reduction method, their scheme improves the mean and median Dice similarity coefficient value from (0.74, 0.78) to (0.83, 0.85) and from (0.82, 0.84) to (0.95, 0.95) for prostate and corpus callosum segmentation, respectively, with statistical significance.
The authors have developed a novel two-stage atlas subset selection scheme for multi-atlas based segmentation. It achieves good segmentation accuracy with significantly reduced computation cost, making it a suitable configuration in the presence of extensive heterogeneous atlases.
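The two-stage scheme saves computation because the expensive metric is evaluated only on the augmented subset. A minimal sketch, with hypothetical `cheap_score` and `full_score` relevance callables standing in for the low-cost and full-fledged registration metrics:

```python
def two_stage_select(atlases, cheap_score, full_score,
                     n_augmented, n_fusion):
    """Two-stage atlas subset selection: the preliminary (cheap) metric
    prunes the library to an augmented subset of size n_augmented, and
    the refined (expensive) metric is computed only on that subset to
    pick the final fusion set of size n_fusion."""
    stage1 = sorted(atlases, key=cheap_score, reverse=True)[:n_augmented]
    return sorted(stage1, key=full_score, reverse=True)[:n_fusion]
```

Choosing `n_augmented` is the crux of the paper: it must be large enough that the atlases the refined metric would pick survive the preliminary cut with high probability.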
Feature selection for the classification of traced neurons.
López-Cabrera, José D; Lorenzo-Ginori, Juan V
2018-06-01
The great availability of computational tools to calculate the properties of traced neurons leads to the existence of many descriptors which allow the automated classification of neurons from these reconstructions. This situation determines the necessity to eliminate irrelevant features as well as making a selection of the most appropriate among them, in order to improve the quality of the classification obtained. The dataset used contains a total of 318 traced neurons, classified by human experts into 192 GABAergic interneurons and 126 pyramidal cells. The features were extracted by means of the L-measure software, which is one of the most used computational tools in neuroinformatics to quantify traced neurons. We review some current feature selection techniques, such as filter, wrapper, embedded and ensemble methods. The stability of the feature selection methods was measured. For the ensemble methods, several aggregation methods based on different metrics were applied to combine the subsets obtained during the feature selection process. The subsets obtained applying feature selection methods were evaluated using supervised classifiers, among which Random Forest, C4.5, SVM, Naïve Bayes, Knn, Decision Table and the Logistic classifier were used as classification algorithms. Feature selection methods of types filter, embedded, wrapper and ensemble were compared and the subsets returned were tested in classification tasks for different classification algorithms. L-measure features EucDistanceSD, PathDistanceSD, Branch_pathlengthAve, Branch_pathlengthSD and EucDistanceAve were present in more than 60% of the selected subsets, which provides evidence about their importance in the classification of these neurons. Copyright © 2018 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Kwon, Ki-Won; Cho, Yongsoo
This letter presents a simple joint estimation method for residual frequency offset (RFO) and sampling frequency offset (SFO) in OFDM-based digital video broadcasting (DVB) systems. The proposed method selects a continual pilot (CP) subset from an unsymmetrically and non-uniformly distributed CP set to obtain an unbiased estimator. Simulation results show that the proposed method using a properly selected CP subset is unbiased and performs robustly.
A hybrid feature selection method using multiclass SVM for diagnosis of erythemato-squamous disease
NASA Astrophysics Data System (ADS)
Maryam; Setiawan, Noor Akhmad; Wahyunggoro, Oyas
2017-08-01
The diagnosis of erythemato-squamous disease is a complex problem and difficult to detect in dermatology. Besides that, it is a major cause of skin cancer. Data mining implementation in the medical field helps experts diagnose precisely, accurately, and inexpensively. In this research, we use data mining techniques to develop a diagnosis model based on multiclass SVM with a novel hybrid feature selection method to diagnose erythemato-squamous disease. Our hybrid feature selection method, named ChiGA (Chi Square and Genetic Algorithm), uses the advantages of both filter and wrapper methods to select the optimal feature subset from the original features. Chi square is used as the filter method to remove redundant features, and GA as the wrapper method to select the ideal feature subset, with SVM used as the classifier. Experiments were performed with 10-fold cross validation on the erythemato-squamous diseases dataset taken from the University of California Irvine (UCI) machine learning database. The experimental results show that the proposed model based on multiclass SVM with Chi Square and GA can give an optimum feature subset. There are 18 optimum features with 99.18% accuracy.
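The filter stage can be sketched with a plain chi-square statistic between a categorical feature and the class label. This is a generic implementation, not the authors' code, and the GA wrapper and SVM stages are omitted.

```python
from collections import Counter

def chi_square(feature, labels):
    """Chi-square statistic between a categorical feature and the class
    label: sum over contingency-table cells of (observed - expected)^2
    / expected. Low-scoring features are dropped before the wrapper
    stage explores the survivors."""
    n = len(labels)
    obs = Counter(zip(feature, labels))      # joint cell counts
    f_tot = Counter(feature)                 # feature marginals
    l_tot = Counter(labels)                  # label marginals
    stat = 0.0
    for f in f_tot:
        for l in l_tot:
            expected = f_tot[f] * l_tot[l] / n
            stat += (obs.get((f, l), 0) - expected) ** 2 / expected
    return stat
```

A feature perfectly aligned with the labels scores n (the sample count for a balanced 2x2 table), while an independent feature scores zero.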
Tang, Rongnian; Chen, Xupeng; Li, Chuang
2018-05-01
Near-infrared spectroscopy is an efficient, low-cost technology that has potential as an accurate method in detecting the nitrogen content of natural rubber leaves. Successive projections algorithm (SPA) is a widely used variable selection method for multivariate calibration, which uses projection operations to select a variable subset with minimum multi-collinearity. However, due to the fluctuation of correlation between variables, high collinearity may still exist in non-adjacent variables of subset obtained by basic SPA. Based on analysis to the correlation matrix of the spectra data, this paper proposed a correlation-based SPA (CB-SPA) to apply the successive projections algorithm in regions with consistent correlation. The result shows that CB-SPA can select variable subsets with more valuable variables and less multi-collinearity. Meanwhile, models established by the CB-SPA subset outperform basic SPA subsets in predicting nitrogen content in terms of both cross-validation and external prediction. Moreover, CB-SPA is assured to be more efficient, for the time cost in its selection procedure is one-twelfth that of the basic SPA.
2013-01-01
Background Gene expression data could likely be a momentous help in the progress of proficient cancer diagnoses and classification platforms. Lately, many researchers analyze gene expression data using diverse computational intelligence methods, for selecting a small subset of informative genes from the data for cancer classification. Many computational methods face difficulties in selecting small subsets due to the small number of samples compared to the huge number of genes (high-dimension), irrelevant genes, and noisy genes. Methods We propose an enhanced binary particle swarm optimization to perform the selection of small subsets of informative genes which is significant for cancer classification. Particle speed, rule, and modified sigmoid function are introduced in this proposed method to increase the probability of the bits in a particle's position to be zero. The method was empirically applied to a suite of ten well-known benchmark gene expression data sets. Results The performance of the proposed method proved to be superior to that of previous related works, including the conventional version of binary particle swarm optimization (BPSO), in terms of classification accuracy and the number of selected genes. The proposed method also requires lower computational time compared to BPSO. PMID:23617960
Visual analytics in cheminformatics: user-supervised descriptor selection for QSAR methods.
Martínez, María Jimena; Ponzoni, Ignacio; Díaz, Mónica F; Vazquez, Gustavo E; Soto, Axel J
2015-01-01
The design of QSAR/QSPR models is a challenging problem, where the selection of the most relevant descriptors constitutes a key step of the process. Several feature selection methods that address this step concentrate on statistical associations among descriptors and target properties, whereas chemical knowledge is left out of the analysis. For this reason, the interpretability and generality of the QSAR/QSPR models obtained by these feature selection methods are drastically affected. Therefore, an approach for integrating domain experts' knowledge in the selection process is needed to increase the confidence in the final set of descriptors. In this paper we propose a software tool, named Visual and Interactive DEscriptor ANalysis (VIDEAN), that combines statistical methods with interactive visualizations for choosing a set of descriptors for predicting a target property. Domain expertise can be added to the feature selection process by means of an interactive visual exploration of data, aided by statistical tools and metrics based on information theory. Coordinated visual representations are presented for capturing different relationships and interactions among descriptors, target properties and candidate subsets of descriptors. The competencies of the proposed software were assessed through different scenarios. These scenarios reveal how an expert can use this tool to choose one subset of descriptors from a group of candidate subsets, or how to modify existing descriptor subsets and even incorporate new descriptors according to his or her own knowledge of the target property. The reported experiences showed the suitability of our software for selecting sets of descriptors with low cardinality, high interpretability, low redundancy and high statistical performance in a visual exploratory way.
Therefore, it is possible to conclude that the resulting tool allows the integration of a chemist's expertise in the descriptor selection process with a low cognitive effort, in contrast with the alternative of using an ad hoc manual analysis of the selected descriptors. Graphical abstract: VIDEAN allows the visual analysis of candidate subsets of descriptors for QSAR/QSPR. In the two panels on the top, users can interactively explore numerical correlations as well as co-occurrences in the candidate subsets through two interactive graphs.
Approximate error conjugate gradient minimization methods
Kallman, Jeffrey S
2013-05-21
In one embodiment, a method includes selecting a subset of rays from a set of all rays to use in an error calculation for a constrained conjugate gradient minimization problem, calculating an approximate error using the subset of rays, and calculating a minimum in a conjugate gradient direction based on the approximate error. In another embodiment, a system includes a processor for executing logic, logic for selecting a subset of rays from a set of all rays to use in an error calculation for a constrained conjugate gradient minimization problem, logic for calculating an approximate error using the subset of rays, and logic for calculating a minimum in a conjugate gradient direction based on the approximate error. In other embodiments, computer program products, methods, and systems are described capable of using approximate error in constrained conjugate gradient minimization problems.
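The embodiment described can be sketched as follows. The uniform sampler and the rescaling of the subset error to estimate the full-set error are illustrative assumptions; the patent does not mandate a specific selection rule.

```python
import random

def approximate_error(residual_per_ray, subset):
    """Approximate the full squared-error objective using only a subset
    of rays, rescaled by the subset fraction to estimate the error over
    the complete ray set."""
    total = sum(residual_per_ray[i] ** 2 for i in subset)
    return total * len(residual_per_ray) / len(subset)

def select_ray_subset(n_rays, fraction, seed=0):
    """Uniformly sample a fraction of the rays without replacement
    (one plausible selection rule among many)."""
    k = max(1, int(n_rays * fraction))
    return random.Random(seed).sample(range(n_rays), k)
```

In a constrained conjugate gradient loop, `approximate_error` would replace the exact error inside each line search along the conjugate direction, trading a small bias for a large reduction in per-iteration cost.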
Fizzy: feature subset selection for metagenomics.
Ditzler, Gregory; Morrison, J Calvin; Lan, Yemin; Rosen, Gail L
2015-11-04
Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- & β-diversity. Feature subset selection, a sub-field of machine learning, can also provide a unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we have used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome. We have developed a new Python command line tool, which is compatible with the widely adopted BIOM format, for microbial ecologists that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets. We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.
Vanderhaeghe, F; Smolders, A J P; Roelofs, J G M; Hoffmann, M
2012-03-01
Selecting an appropriate variable subset in linear multivariate methods is an important methodological issue for ecologists. Interest often exists in obtaining general predictive capacity or in finding causal inferences from predictor variables. Because of a lack of solid knowledge about the studied phenomenon, scientists explore predictor variables in order to find the most meaningful, i.e. discriminating, ones. As an example, we modelled the response of the amphibious softwater plant Eleocharis multicaulis using canonical discriminant function analysis. We asked how variables can be selected through comparison of several methods: univariate Pearson chi-square screening, principal components analysis (PCA) and step-wise analysis, as well as combinations of some methods. We expected PCA to perform best. The selected methods were evaluated through the fit and stability of the resulting discriminant functions and through correlations between these functions and the predictor variables. The chi-square subset, at P < 0.05, followed by a step-wise sub-selection, gave the best results. Contrary to expectations, PCA performed poorly, as did step-wise analysis. The different chi-square subset methods all yielded ecologically meaningful variables, while probable noise variables were also selected by PCA and step-wise analysis. We advise against the simple use of PCA or step-wise discriminant analysis to obtain an ecologically meaningful variable subset: the former because it does not take the response variable into account, the latter because noise variables are likely to be selected. We suggest that univariate screening techniques are a worthwhile alternative for variable selection in ecology. © 2011 German Botanical Society and The Royal Botanical Society of the Netherlands.
Analysis of Information Content in High-Spectral Resolution Sounders using Subset Selection Analysis
NASA Technical Reports Server (NTRS)
Velez-Reyes, Miguel; Joiner, Joanna
1998-01-01
In this paper, we summarize the results of the sensitivity analysis and data reduction carried out to determine the information content of AIRS and IASI channels. The analysis and data reduction were based on subset selection techniques developed in the linear algebra and statistics communities to study linear dependencies in high-dimensional data sets. We applied the subset selection method to study dependency among channels by studying the dependency among their weighting functions. We also applied the technique to study the information provided by the different levels into which the atmosphere is discretized for retrievals and analysis. Results from the method correlate well with intuition in many respects and point to possible modifications for band selection in sensor design and for the number and location of levels in the analysis process.
Bell, Andrew S; Bradley, Joseph; Everett, Jeremy R; Loesel, Jens; McLoughlin, David; Mills, James; Peakman, Marie-Claire; Sharp, Robert E; Williams, Christine; Zhu, Hongyao
2016-11-01
High-throughput screening (HTS) is an effective method for lead and probe discovery that is widely used in industry and academia to identify novel chemical matter and to initiate the drug discovery process. However, HTS can be time-consuming and costly, and the use of subsets as an efficient alternative to screening entire compound collections has been investigated. Subsets may be selected on the basis of chemical diversity, molecular properties, biological activity diversity or biological target focus. Previously, we described a novel form of subset screening: plate-based diversity subset (PBDS) screening, in which the screening subset is constructed by plate selection (rather than individual compound cherry-picking), using algorithms that select for compound quality and chemical diversity on a plate basis. In this paper, we describe a second-generation approach to the construction of an updated subset, PBDS2, using both plate and individual compound selection, that has improved coverage of the chemical space of the screening file whilst selecting the same number of plates for screening. We describe the validation of PBDS2 and its successful use in hit and lead discovery. PBDS2 screening became the default mode of singleton (one compound per well) HTS for lead discovery in Pfizer.
2012-01-01
Background Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task, because these genes might serve as tumor biomarkers, which is of great benefit not only to tumor molecular diagnosis but also to drug development. Results This paper proposes a novel gene selection method with rich biomedical meaning based on a Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method can suffer from over-fitting and selection bias problems. To address these potential problems, an HBSA-based ensemble classifier is constructed using a majority voting strategy from individual classifiers built on the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes through their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets, including three pairs of cross-platform datasets, indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtypes and even to hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. These findings are further justified by analyzing the top-ranked genes in the context of individual gene function, biological pathways, and protein-protein interaction networks. PMID:22830977
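The frequency-based ranking idea can be sketched simply: once a search such as HBSA has produced many near-optimal gene subsets, candidate biomarkers are ranked by how often they occur across those subsets. The gene names and subsets below are invented for illustration:

```python
from collections import Counter

def rank_genes_by_frequency(subsets):
    """Rank genes by occurrence frequency across the optimal gene subsets
    found by the search; frequently recurring genes are candidate
    biomarkers. Returns (gene, count) pairs, most frequent first."""
    counts = Counter(g for s in subsets for g in s)
    return counts.most_common()

# Toy output of a subset search: two genes recur in most optimal subsets,
# mirroring the power-law-like concentration reported in the paper.
subsets = [{"g1", "g7"}, {"g1", "g3"}, {"g1", "g7", "g9"}, {"g2", "g7"}]
print(rank_genes_by_frequency(subsets))
```

In the real setting the subsets come from the heuristic search over a cross-validated classifier, and the few genes at the head of the ranking are the ones examined for biological meaning.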
NASA Astrophysics Data System (ADS)
Seo, Seung Beom; Kim, Young-Oh; Kim, Youngil; Eum, Hyung-Il
2018-04-01
When selecting a subset of climate change scenarios (GCM models), the priority is to ensure that the subset reflects the comprehensive range of possible model results for all variables concerned. Though many studies have attempted to improve scenario selection, there is a lack of studies that discuss methods to ensure that the results from a subset of climate models contain the same range of uncertainty in hydrologic variables as when all models are considered. We applied the Katsavounidis-Kuo-Zhang (KKZ) algorithm to select a subset of climate change scenarios and demonstrated its ability to reduce the number of GCM models in an ensemble while preserving the ranges of multiple climate extremes indices. First, we analyzed the role of 27 ETCCDI climate extremes indices in scenario selection and selected the representative climate extreme indices. Before selecting a subset, we excluded a few deficient GCM models that could not represent the observed climate regime. Subsequently, we discovered that a subset of GCM models selected by the KKZ algorithm with the representative climate extreme indices could not capture the full potential range of changes in hydrologic extremes (e.g., 3-day peak flow and 7-day low flow) in some regional case studies. However, applying the KKZ algorithm with a different set of climate indices, correlated with the hydrologic extremes, made it possible to overcome this limitation. Key climate indices, dependent on the hydrologic extremes to be projected, must therefore be determined prior to the selection of a subset of GCM models.
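The KKZ algorithm referenced above is a maximin selection: start from an extreme point, then repeatedly add the candidate farthest from everything already chosen, so the subset spans the ensemble spread. A minimal sketch, with invented standardized-index coordinates for the GCMs (the published algorithm's starting rule and distance choices may differ in detail):

```python
def kkz_select(points, k):
    """KKZ-style maximin selection: seed with the point of largest norm,
    then greedily add the point whose minimum distance to the already
    selected set is largest, until k points are chosen. Returns indices."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    def norm(p):
        return sum(x * x for x in p) ** 0.5
    selected = [max(range(len(points)), key=lambda i: norm(points[i]))]
    while len(selected) < k:
        best = max((i for i in range(len(points)) if i not in selected),
                   key=lambda i: min(dist(points[i], points[j]) for j in selected))
        selected.append(best)
    return selected

# Toy example: 5 GCMs described by 2 standardized climate extreme indices.
gcms = [(0.1, 0.1), (2.0, 2.0), (-2.0, 2.0), (0.0, -2.5), (1.9, 1.9)]
print(kkz_select(gcms, 3))  # → [1, 3, 2]
```

Note how the near-duplicate model at index 4 is never chosen: maximin selection skips redundant ensemble members and keeps the corners of the scenario cloud.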
Application of machine learning on brain cancer multiclass classification
NASA Astrophysics Data System (ADS)
Panca, V.; Rustam, Z.
2017-07-01
Classification of brain cancer is a multiclass classification problem. One approach to solve this problem is to first transform it into several binary problems. Microarray gene expression datasets have the two main characteristics of medical data: extremely many features (genes) and only a small number of samples. The application of machine learning to microarray gene expression data mainly consists of two steps: feature selection and classification. In this paper, the features are selected using a method based on the support vector machine recursive feature elimination (SVM-RFE) principle, improved to solve multiclass classification, called multiple multiclass SVM-RFE. Instead of using only the selected features on a single classifier, this method combines the results of multiple classifiers. The features are divided into subsets and SVM-RFE is used on each subset. Then, the features selected on each subset are put on separate classifiers. This method enhances the feature selection ability of each single SVM-RFE. Twin support vector machine (TWSVM) is used as the classifier to reduce computational complexity. While an ordinary SVM finds a single optimum hyperplane, the main objective of Twin SVM is to find two non-parallel optimum hyperplanes. The experiment on the brain cancer microarray gene expression dataset shows this method could classify 71.4% of the overall test data correctly, using 100 and 1000 genes selected by the multiple multiclass SVM-RFE feature selection method. Furthermore, the per-class results show that this method could classify data of the normal and MD classes with 100% accuracy.
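The recursive feature elimination loop at the core of SVM-RFE can be sketched generically. In this toy, a per-feature correlation with the label stands in for the SVM hyperplane coefficients, so it illustrates only the elimination schedule, not the paper's multiple multiclass SVM-RFE; the data and helper names are invented:

```python
def rfe(X, y, n_keep, weight_fn):
    """Generic recursive feature elimination: at each round, drop the
    feature with the smallest absolute weight. weight_fn returns one
    weight per remaining feature (an SVM would supply its hyperplane
    coefficients here). Returns indices of the surviving features."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        cols = [[row[j] for j in remaining] for row in X]
        w = weight_fn(cols, y)
        remaining.pop(min(range(len(remaining)), key=lambda i: abs(w[i])))
    return remaining

def corr_weights(X, y):
    """Surrogate weights: Pearson correlation of each feature with the label."""
    n = len(X)
    out = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx, my = sum(col) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        scale = (sum((a - mx) ** 2 for a in col) ** 0.5
                 * sum((b - my) ** 2 for b in y) ** 0.5)
        out.append(cov / scale if scale else 0.0)
    return out

X = [[1, 0, 5], [1, 1, 3], [0, 0, 4], [0, 1, 6]]   # 4 samples, 3 "genes"
y = [1, 1, 0, 0]
print(rfe(X, y, 1, corr_weights))  # → [0]: gene 0 tracks the class exactly
```

Because weights are recomputed after every elimination, features that look weak only in the presence of stronger ones get re-scored, which is the point of the recursive (rather than one-shot) ranking.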
On the reliable and flexible solution of practical subset regression problems
NASA Technical Reports Server (NTRS)
Verhaegen, M. H.
1987-01-01
A new algorithm for solving subset regression problems is described. The algorithm performs a QR decomposition with a new column-pivoting strategy, which permits subset selection directly from the originally defined regression parameters. This, in combination with a number of extensions of the new technique, makes the method a very flexible tool for analyzing subset regression problems in which the parameters have a physical meaning.
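The column-pivoting idea behind such subset selection can be illustrated with a pivoted Gram-Schmidt pass: at each step, the column with the largest residual norm becomes the next selected regressor, and the remaining columns are deflated by its direction. This is a sketch of the general pivoting strategy, not the paper's specific algorithm:

```python
def qr_column_pivot_order(cols):
    """Column-pivoted Gram-Schmidt: repeatedly pick the column with the
    largest remaining norm, then remove its direction from all others.
    The resulting pivot order ranks regressors from most to least
    independently informative; redundant columns sink to the end."""
    cols = [list(c) for c in cols]
    order, active = [], list(range(len(cols)))
    while active:
        norms = {j: sum(x * x for x in cols[j]) for j in active}
        p = max(active, key=lambda j: norms[j])
        order.append(p)
        active.remove(p)
        pnorm = norms[p] ** 0.5
        if pnorm < 1e-12:
            order.extend(active)  # remaining columns are numerically null
            break
        q = [x / pnorm for x in cols[p]]
        for j in active:
            proj = sum(a * b for a, b in zip(q, cols[j]))
            cols[j] = [a - proj * b for a, b in zip(cols[j], q)]
    return order

# Three regressors: x2 = 2*x0 (redundant pair), x1 independent.
x0 = [1.0, 0.0, 1.0, 0.0]
x1 = [0.0, 1.0, 0.0, 1.0]
x2 = [2.0, 0.0, 2.0, 0.0]
print(qr_column_pivot_order([x0, x1, x2]))  # → [2, 1, 0]
```

After the larger copy of the collinear pair is selected, the deflation step zeroes out its duplicate, so the duplicate is ranked last; this is how pivoted QR exposes near-dependencies among physically meaningful regression parameters.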
NASA Astrophysics Data System (ADS)
Zhang, Aizhu; Sun, Genyun; Wang, Zhenjie
2015-12-01
The serious information redundancy in hyperspectral images (HIs) does not improve data analysis accuracy; instead, it requires expensive computational resources. Consequently, to identify the most useful and valuable information in HIs and thereby improve the accuracy of data analysis, this paper proposes a novel hyperspectral band selection method using a hybrid genetic algorithm and gravitational search algorithm (GA-GSA). In the proposed method, the GA-GSA is first mapped to the binary space. Then, the accuracy of a support vector machine (SVM) classifier and the number of selected spectral bands are used to measure the discriminative capability of the band subset. Finally, the band subset with the smallest number of spectral bands that covers the most useful and valuable information is obtained. To verify the effectiveness of the proposed method, studies conducted on an AVIRIS image against two recently proposed state-of-the-art GSA variants are presented. The experimental results reveal the superiority of the proposed method and indicate that it can considerably reduce data storage costs and efficiently identify a band subset with stable and high classification precision.
Gunter, Lacey; Zhu, Ji; Murphy, Susan
2012-01-01
For many years, subset analysis has been a popular topic for the biostatistics and clinical trials literature. In more recent years, the discussion has focused on finding subsets of genomes which play a role in the effect of treatment, often referred to as stratified or personalized medicine. Though highly sought after, methods for detecting subsets with altering treatment effects are limited and lacking in power. In this article we discuss variable selection for qualitative interactions with the aim to discover these critical patient subsets. We propose a new technique designed specifically to find these interaction variables among a large set of variables while still controlling for the number of false discoveries. We compare this new method against standard qualitative interaction tests using simulations and give an example of its use on data from a randomized controlled trial for the treatment of depression. PMID:22023676
Cheng, Qiang; Zhou, Hongbo; Cheng, Jie
2011-06-01
Selecting features for multiclass classification is a critically important task for pattern recognition and machine learning applications. Especially challenging is selecting an optimal subset of features from high-dimensional data, which typically have many more variables than observations and contain significant noise, missing components, or outliers. Existing methods either cannot handle high-dimensional data efficiently or scalably, or can only obtain a local optimum instead of the global optimum. Toward selecting the globally optimal subset of features efficiently, we introduce a new selector (which we call the Fisher-Markov selector) to identify those features that are the most useful in describing essential differences among the possible groups. In particular, in this paper we present a way to represent essential discriminating characteristics together with sparsity as an optimization objective. With properly identified measures for sparseness and discriminativeness in possibly high-dimensional settings, we take a systematic approach to optimizing the measures to choose the best feature subset. We use Markov random field optimization techniques to solve the formulated objective functions for simultaneous feature selection. Our results are noncombinatorial, and they can achieve the exact global optimum of the objective function for some special kernels. The method is fast; in particular, it can be linear in the number of features and quadratic in the number of observations. We apply our procedure to a variety of real-world data, including a mid-dimensional optical handwritten digit data set and high-dimensional microarray gene expression data sets. The effectiveness of our method is confirmed by experimental results. From a model selection viewpoint, our procedure shows that it is possible to select the most discriminating subset of variables by solving a very simple unconstrained objective function, which in fact can be obtained in explicit form.
HIV-1 protease cleavage site prediction based on two-stage feature selection method.
Niu, Bing; Yuan, Xiao-Cheng; Roeper, Preston; Su, Qiang; Peng, Chun-Rong; Yin, Jing-Yuan; Ding, Juan; Li, HaiPeng; Lu, Wen-Cong
2013-03-01
Knowledge of the mechanism of HIV protease cleavage specificity is critical to the design of specific and effective HIV inhibitors. Searching for an accurate, robust, and rapid method to correctly predict the cleavage sites in proteins is crucial when searching for possible HIV inhibitors. In this article, HIV-1 protease specificity was studied using the correlation-based feature subset (CfsSubset) selection method combined with a genetic algorithm. Thirty important biochemical features were found based on a jackknife test from the original data set containing 4,248 features. By using the AdaBoost method with the thirty selected features, the prediction model yields an accuracy of 96.7% for the jackknife test and 92.1% for an independent set test, an improvement over the original dataset of 6.7% and 77.4%, respectively. Our feature selection scheme could be a useful technique for finding effective competitive inhibitors of HIV protease.
Stochastic subset selection for learning with kernel machines.
Rhinelander, Jason; Liu, Xiaoping P
2012-06-01
Kernel machines have gained much popularity in applications of machine learning. Support vector machines (SVMs) are a subset of kernel machines and generalize well for classification, regression, and anomaly detection tasks. The training procedure for traditional SVMs involves solving a quadratic programming (QP) problem. The QP problem scales super-linearly in computational effort with the number of training samples and is often used for the offline batch processing of data. Kernel machines operate by retaining a subset of observed data during training. The data vectors contained within this subset are referred to as support vectors (SVs). The work presented in this paper introduces a subset selection method for the use of kernel machines in online, changing environments. Our algorithm works by using a stochastic indexing technique when selecting a subset of SVs for computing the kernel expansion. The work described here is novel because it separates the selection of kernel basis functions from the training algorithm used. The subset selection algorithm presented here can be used in conjunction with any online training technique. It is important for online kernel machines to be computationally efficient due to the real-time requirements of online environments. Our algorithm is an important contribution because it scales linearly with the number of training samples and is compatible with current training techniques. Our algorithm outperforms standard techniques in terms of computational efficiency and provides increased recognition accuracy in our experiments. We provide results from experiments using both simulated and real-world data sets to verify our algorithm.
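One way to realize a stochastic indexing of support vectors is to importance-sample the kernel expansion: evaluate only a random subset of SVs per query, sampled in proportion to their coefficient magnitudes and reweighted to keep the estimate unbiased. This sampling scheme is an assumed illustration, not necessarily the paper's exact method, and all data below are invented:

```python
import math
import random

def kernel_expansion(x, svs, alphas, kernel):
    """Exact kernel expansion: f(x) = sum_i alpha_i * k(x, sv_i)."""
    return sum(a * kernel(x, s) for s, a in zip(svs, alphas))

def stochastic_expansion(x, svs, alphas, kernel, m, rng):
    """Approximate f(x) using m SVs sampled with probability proportional
    to |alpha_i|, with importance weights so the estimate is unbiased."""
    total = sum(abs(a) for a in alphas)
    probs = [abs(a) / total for a in alphas]
    idx = rng.choices(range(len(svs)), weights=probs, k=m)
    return sum(alphas[i] * kernel(x, svs[i]) / (m * probs[i]) for i in idx)

# Gaussian (RBF) kernel and a tiny invented SV set.
def rbf(a, b):
    return math.exp(-sum((u - v) ** 2 for u, v in zip(a, b)))

svs = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (0.5, 0.5)]
alphas = [1.0, -0.5, 0.8, 0.3]
x = (0.2, 0.1)
exact = kernel_expansion(x, svs, alphas, rbf)
approx = stochastic_expansion(x, svs, alphas, rbf, 200, random.Random(0))
print(round(exact, 3), round(approx, 3))  # the two values are close
```

Because the subset is drawn per evaluation rather than fixed at training time, the basis selection is decoupled from the training algorithm, which is the property the abstract emphasizes.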
Using learning automata to determine proper subset size in high-dimensional spaces
NASA Astrophysics Data System (ADS)
Seyyedi, Seyyed Hossein; Minaei-Bidgoli, Behrouz
2017-03-01
In this paper, we offer a new method called FSLA (Finding the best candidate Subset using Learning Automata), which combines the filter and wrapper approaches for feature selection in high-dimensional spaces. Considering the difficulties of dimension reduction in high-dimensional spaces, FSLA's multi-objective functionality is to determine, in an efficient manner, a feature subset that leads to an appropriate tradeoff between the learning algorithm's accuracy and efficiency. First, using an existing weighting function, the feature list is sorted and subsets of different sizes from the sorted list are considered. Then, a learning automaton verifies the performance of each subset when it is used as the input space of the learning algorithm and estimates its fitness based on the algorithm's accuracy and the subset size, which determines the algorithm's efficiency. Finally, FSLA introduces the fittest subset as the best choice. We tested FSLA in the framework of text classification. The results confirm its promising performance in attaining the identified goal.
Libbrecht, Maxwell W; Bilmes, Jeffrey A; Noble, William Stafford
2018-04-01
Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions. © 2018 Wiley Periodicals, Inc.
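A concrete instance of the submodular approach is greedy maximization of the facility-location function f(S) = Σ_i max_{j∈S} sim(i, j), a standard monotone submodular objective for representative set selection whose greedy solution is within (1 − 1/e) of optimal. The similarity matrix below is invented; the paper's mixture objective is more elaborate:

```python
def greedy_facility_location(sim, k):
    """Greedy maximization of the facility-location function
    f(S) = sum_i max_{j in S} sim[i][j]. Because f is monotone submodular,
    each round adds the item with the largest marginal coverage gain.
    sim is a symmetric similarity matrix; returns k selected indices."""
    n = len(sim)
    chosen, best_cover = [], [0.0] * n
    for _ in range(k):
        def gain(j):
            return sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
        j = max((j for j in range(n) if j not in chosen), key=gain)
        chosen.append(j)
        best_cover = [max(best_cover[i], sim[i][j]) for i in range(n)]
    return chosen

# Toy similarity: sequences 0-2 are near-duplicates; 3 is an outlier family.
sim = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(greedy_facility_location(sim, 2))  # → [0, 3]
```

The greedy choice picks one member of the redundant cluster and then the outlier, rather than two near-duplicates, which is exactly the behavior that makes the submodular sets more structurally diverse than threshold-based clustering output.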
Identification of features in indexed data and equipment therefore
Jarman, Kristin H [Richland, WA; Daly, Don Simone [Richland, WA; Anderson, Kevin K [Richland, WA; Wahl, Karen L [Richland, WA
2002-04-02
Embodiments of the present invention provide methods of identifying a feature in an indexed dataset. Such embodiments encompass selecting an initial subset of indices, the initial subset of indices being encompassed by an initial window-of-interest and comprising at least one beginning index and at least one ending index; computing an intensity weighted measure of dispersion for the subset of indices using a subset of responses corresponding to the subset of indices; and comparing the intensity weighted measure of dispersion to a dispersion critical value determined from an expected value of the intensity weighted measure of dispersion under a null hypothesis of no transient feature present. Embodiments of the present invention also encompass equipment configured to perform the methods of the present invention.
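A hedged sketch of the intensity-weighted measure of dispersion: weight each index in the window-of-interest by its response and compute the weighted variance of the index positions. A tight transient feature concentrates the weight and gives a small value; a flat null response spreads it across the window. The exact statistic and critical-value construction in the patent may differ:

```python
def intensity_weighted_dispersion(indices, responses):
    """Intensity-weighted dispersion over a window-of-interest: the
    variance of the index positions, with each index weighted by its
    response intensity."""
    total = sum(responses)
    mean = sum(i * r for i, r in zip(indices, responses)) / total
    return sum(r * (i - mean) ** 2 for i, r in zip(indices, responses)) / total

window = list(range(10))
peak = [0, 0, 1, 8, 1, 0, 0, 0, 0, 0]   # transient feature centered at index 3
noise = [1] * 10                         # null hypothesis: flat response
assert intensity_weighted_dispersion(window, peak) < intensity_weighted_dispersion(window, noise)
```

In the claimed method, the observed statistic would be compared against a critical value derived from its expected value under the no-feature null; here the flat window plays the role of that null for illustration.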
Comparison of Different EHG Feature Selection Methods for the Detection of Preterm Labor
Alamedine, D.; Khalil, M.; Marque, C.
2013-01-01
Numerous types of linear and nonlinear features have been extracted from the electrohysterogram (EHG) in order to classify labor and pregnancy contractions. As a result, the number of available features is now very large. The goal of this study is to reduce the number of features by selecting only the relevant ones which are useful for solving the classification problem. This paper presents three methods for feature subset selection that can be applied to choose the best subsets for classifying labor and pregnancy contractions: an algorithm using the Jeffrey divergence (JD) distance, a sequential forward selection (SFS) algorithm, and a binary particle swarm optimization (BPSO) algorithm. The two last methods are based on a classifier and were tested with three types of classifiers. These methods have allowed us to identify common features which are relevant for contraction classification. PMID:24454536
System and method for progressive band selection for hyperspectral images
NASA Technical Reports Server (NTRS)
Fisher, Kevin (Inventor)
2013-01-01
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for progressive band selection for hyperspectral images. A system having a module configured to control a processor to practice the method calculates a virtual dimensionality of a hyperspectral image having multiple bands to determine a quantity Q of how many bands are needed for a threshold level of information, ranks each band based on a statistical measure, selects Q bands from the multiple bands to generate a subset of bands based on the virtual dimensionality, and generates a reduced image based on the subset of bands. This approach can create reduced datasets of full hyperspectral images tailored for individual applications. The system uses a metric specific to a target application to rank the image bands, and then selects the most useful bands. The number of bands selected can be specified manually or calculated from the hyperspectral image's virtual dimensionality.
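The ranking step can be sketched as follows. The variance score here is only a placeholder for the application-specific statistical measure, and in the claimed system the quantity Q would come from the virtual-dimensionality estimate rather than being fixed by hand; the band data are invented:

```python
def progressive_band_selection(image_bands, q, score):
    """Rank each band by an application-specific statistical measure and
    keep the top q bands, returned in their original spectral order."""
    ranked = sorted(range(len(image_bands)),
                    key=lambda b: score(image_bands[b]), reverse=True)
    return sorted(ranked[:q])

def variance(band):
    """Placeholder ranking metric: per-band variance of pixel values."""
    m = sum(band) / len(band)
    return sum((v - m) ** 2 for v in band) / len(band)

# 4 bands x 6 pixels; bands 1 and 3 carry most of the scene contrast.
bands = [
    [5, 5, 5, 5, 5, 5],
    [1, 9, 2, 8, 1, 9],
    [4, 5, 4, 5, 4, 5],
    [0, 10, 0, 10, 0, 10],
]
print(progressive_band_selection(bands, 2, variance))  # → [1, 3]
```

Swapping the `score` function is the "progressive" hook: the same machinery can produce different reduced images for different target applications.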
SU-E-J-128: Two-Stage Atlas Selection in Multi-Atlas-Based Image Segmentation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhao, T; Ruan, D
2015-06-15
Purpose: In the new era of big data, multi-atlas-based image segmentation is challenged by heterogeneous atlas quality and the high computational burden of extensive atlas collections, demanding efficient identification of the most relevant atlases. This study aims to develop a two-stage atlas selection scheme that achieves computational economy with a performance guarantee. Methods: We develop a low-cost fusion set selection scheme by introducing a preliminary selection that trims the full atlas collection into an augmented subset, alleviating the need for extensive full-fledged registrations. More specifically, fusion set selection is performed in two successive steps: preliminary selection and refinement. An augmented subset is first roughly selected from the whole atlas collection with a simple registration scheme and the corresponding preliminary relevance metric; the augmented subset is then refined to the desired fusion set size using full-fledged registration and the associated relevance metric. The main novelty of this work is the introduction of an inference model relating the preliminary and refined relevance metrics, from which the augmented subset size is rigorously derived to ensure that the desired atlases survive the preliminary selection with high probability. Results: The performance and complexity of the proposed two-stage atlas selection method were assessed using a collection of 30 prostate MR images. It achieved segmentation accuracy comparable to the conventional one-stage method with full-fledged registration, but significantly reduced computation time to 1/3 (from 30.82 to 11.04 min per segmentation). Compared with an alternative one-stage cost-saving approach, the proposed scheme yielded superior performance, with mean and median DSC of (0.83, 0.85) versus (0.74, 0.78). Conclusion: This work has developed a model-guided two-stage atlas selection scheme that achieves significant cost reduction while guaranteeing high segmentation accuracy. The benefit in both complexity and performance is expected to be most pronounced with large-scale heterogeneous data.
Decoys Selection in Benchmarking Datasets: Overview and Perspectives
Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu
2018-01-01
Virtual Screening (VS) is designed to prospectively help identifying potential hits, i.e., compounds capable of interacting with a given target and potentially modulate its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is the most adapted to the query/target system under study and that yields the most reliable output. To this aim, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compounds subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds that has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoys selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509
An improved wrapper-based feature selection method for machinery fault diagnosis
2017-01-01
A major issue in machinery fault diagnosis using vibration signals is that it is over-reliant on personnel knowledge and experience in interpreting the signal. Thus, machine learning has been adapted for machinery fault diagnosis. The quantity and quality of the input features, however, influence the fault classification performance. Feature selection plays a vital role in selecting the most representative feature subset for the machine learning algorithm. At the same time, a trade-off between the capability to select the best feature subset and the computational effort is inevitable in the wrapper-based feature selection (WFS) method. This paper proposes an improved WFS technique that is then integrated with a support vector machine (SVM) model classifier as a complete fault diagnosis system for a rolling element bearing case study. The bearing vibration dataset made available by the Case Western Reserve University Bearing Data Centre was processed using the proposed WFS, and its performance has been analysed and discussed. The results reveal that the proposed WFS secures the best feature subset with lower computational effort by eliminating the redundancy of re-evaluation. The proposed WFS has therefore been found to be capable of carrying out feature selection tasks efficiently. PMID:29261689
Deng, Changjian; Lv, Kun; Shi, Debo; Yang, Bo; Yu, Song; He, Zhiyi; Yan, Jia
2018-06-12
In this paper, a novel feature selection and fusion framework is proposed to enhance the discrimination ability of gas sensor arrays for odor identification. Firstly, we put forward an efficient feature selection method based on separability and dissimilarity to determine the feature selection order for each type of feature when increasing the dimension of the selected feature subsets. Secondly, the K-nearest neighbor (KNN) classifier is applied to determine the dimensions of the optimal feature subsets for different types of features. Finally, in the feature fusion stage, we propose a classification-dominance feature fusion strategy that constructs an effective basic feature. Experimental results on two datasets show that the recognition rates on Database I and Database II reach 97.5% and 80.11%, respectively, when k = 1 for the KNN classifier and the distance metric is the correlation distance (COR), which demonstrates the superiority of the proposed feature selection and fusion framework in representing signal features. The novel feature selection method proposed in this paper can effectively select feature subsets that are conducive to classification, while the feature fusion framework can fuse various features describing different characteristics of the sensor signals, enhancing the discrimination ability of gas sensors and, to a certain extent, suppressing the drift effect.
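The reported best configuration (k = 1 with the correlation distance, COR) is easy to reproduce in miniature; the sensor-array responses and odor labels below are invented for illustration:

```python
def correlation_distance(a, b):
    """COR distance: 1 - Pearson correlation between two feature vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return 1.0 - cov / (sa * sb)

def knn_predict(train, labels, query, k=1, dist=correlation_distance):
    """k-NN majority vote with a pluggable distance; k=1 and COR mirror
    the paper's best-performing configuration."""
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], query))[:k]
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

# Toy sensor-array responses (3 sensors) for two odors.
train = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.1, 2.2, 2.9]]
labels = ["ethanol", "acetone", "ethanol"]
print(knn_predict(train, labels, [0.9, 1.8, 3.1]))  # → ethanol
```

The correlation distance compares the shape of the response pattern across sensors rather than its magnitude, which is one reason it can be robust to the gain drift the abstract mentions.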
Minimally buffered data transfers between nodes in a data communications network
Miller, Douglas R.
2015-06-23
Methods, apparatus, and products for minimally buffered data transfers between nodes in a data communications network are disclosed that include: receiving, by a messaging module on an origin node, a storage identifier, an origin data type, and a target data type, the storage identifier specifying application storage containing data, the origin data type describing a data subset contained in the origin application storage, the target data type describing an arrangement of the data subset in application storage on a target node; creating, by the messaging module, origin metadata describing the origin data type; selecting, by the messaging module from the origin application storage in dependence upon the origin metadata and the storage identifier, the data subset; and transmitting, by the messaging module to the target node, the selected data subset for storing in the target application storage in dependence upon the target data type without temporarily buffering the data subset.
In Silico Syndrome Prediction for Coronary Artery Disease in Traditional Chinese Medicine
Lu, Peng; Chen, Jianxin; Zhao, Huihui; Gao, Yibo; Luo, Liangtao; Zuo, Xiaohan; Shi, Qi; Yang, Yiping; Yi, Jianqiang; Wang, Wei
2012-01-01
Coronary artery disease (CAD) is one of the leading causes of death in the world. The differentiation of syndrome (ZHENG) is the criterion of diagnosis and therapy in TCM. Therefore, in silico syndrome prediction can improve the performance of treatment. In this paper, we present a Bayesian network framework to construct a high-confidence syndrome predictor based on an optimum symptom subset collected by Support Vector Machine (SVM) feature selection. Syndromes of CAD can be divided into asthenia and sthenia syndromes. According to the hierarchical characteristics of syndrome, we first label every case with one of three types of syndrome (asthenia, sthenia, or both) to handle patients presenting several syndromes. On the basis of these three syndrome classes, we apply SVM feature selection to obtain the optimum symptom subset and compare this subset with Markov blanket feature selection using ROC curves. Using this subset, six predictors of CAD syndromes are constructed with the Bayesian network technique. We also compare the Bayesian network with Naïve Bayes, C4.5, Logistic, and radial basis function (RBF) network classifiers. In conclusion, the Bayesian network method based on the optimum symptoms offers a practical way to predict the six syndromes of CAD in TCM. PMID:22567030
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ditzler, Gregory; Morrison, J. Calvin; Lan, Yemin
Background: Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α– & β–diversity. Feature subset selection – a sub-field of machine learning – can also provide unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can identify the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome. Results: We have developed a new Python command line tool, compatible with the widely adopted BIOM format, that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets. Conclusions: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.
Testing Different Model Building Procedures Using Multiple Regression.
ERIC Educational Resources Information Center
Thayer, Jerome D.
The stepwise regression method of selecting predictors for computer-assisted multiple regression analysis was compared with forward, backward, and best subsets regression, using 16 data sets. The results indicated that the stepwise method was preferred because of its practical nature when the models chosen by the different selection methods were similar…
Unbiased feature selection in learning random forests for high-dimensional data.
Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi
2015-01-01
Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting, which gives RFs poor accuracy when working with high-dimensional data. RFs also exhibit bias in the feature selection process, favoring multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees while reducing dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs with the proposed approach outperformed existing random forests in both the accuracy and the AUC measures.
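The two-subset feature weighting sampling step can be sketched in a few lines. The helper below is a hypothetical illustration (the paper's exact statistical measures and weights are not given here): features whose p-value exceeds a threshold are treated as uninformative and discarded, and the rest are sampled for each tree with probability proportional to their informativeness.

```python
import random

def weighted_feature_sample(p_values, k, alpha=0.05, seed=0):
    """Sample k features for one tree, xRF-style (illustrative sketch).

    Features with p-value above alpha are treated as uninformative and
    removed; the remaining ones are drawn with weights proportional to
    (1 - p), so strongly informative features appear more often."""
    rng = random.Random(seed)
    informative = [f for f, p in sorted(p_values.items()) if p <= alpha]
    weights = [1.0 - p_values[f] for f in informative]
    return rng.choices(informative, weights=weights, k=k)

# Hypothetical per-feature p-values from a univariate test:
pvals = {"g1": 0.001, "g2": 0.72, "g3": 0.010, "g4": 0.90}
sample = weighted_feature_sample(pvals, k=6)
```

Sampling with replacement keeps the sketch short; a per-tree random subset without replacement would be closer to standard RF practice.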
Selection-Fusion Approach for Classification of Datasets with Missing Values
Ghannad-Rezaie, Mostafa; Soltanian-Zadeh, Hamid; Ying, Hao; Dong, Ming
2010-01-01
This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values, where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. Then, it combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade-off, a numerical criterion is proposed for the prediction of the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (eight datasets in total). Experimental results show that the classification accuracy of the proposed method is superior to those of the widely used multiple imputation method and four other methods. They also show that the level of superiority depends on the pattern and percentage of the missing values. PMID:20212921
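The first step — partitioning samples by which features they have available — can be illustrated with a small sketch (using `None` to mark a missing feature; the paper's actual subset selection is cast as a clustering problem, which this minimal version does not reproduce):

```python
from collections import defaultdict

def group_by_missing_pattern(rows):
    """Map each observed/missing pattern to the samples that share it,
    so that a separate classifier can be trained on each subset."""
    groups = defaultdict(list)
    for row in rows:
        pattern = tuple(v is not None for v in row)  # True = observed
        groups[pattern].append(row)
    return dict(groups)

# Three samples over three features; feature 2 is missing for two of them.
rows = [(1.0, None, 3.0), (2.0, None, 1.0), (5.0, 2.0, None)]
groups = group_by_missing_pattern(rows)
```

Each resulting group is complete with respect to its own observed features, so an ordinary classifier can be trained on it without imputation.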
Wang, Jie; Feng, Zuren; Lu, Na; Luo, Jing
2018-06-01
Feature selection plays an important role in the field of EEG-signal-based motor imagery pattern classification. It is a process that aims to select an optimal feature subset from the original set. Two significant advantages are involved: lowering the computational burden so as to speed up the learning procedure, and removing redundant and irrelevant features so as to improve the classification performance. Therefore, feature selection is widely employed in the classification of EEG signals in practical brain-computer interface systems. In this paper, we present a novel statistical model that selects the optimal feature subset based on the Kullback-Leibler divergence measure and automatically selects the optimal subject-specific time segment. The proposed method comprises four successive stages: broad frequency band filtering and common spatial pattern enhancement as preprocessing; feature extraction by autoregressive models and log-variance; Kullback-Leibler divergence based optimal feature and time segment selection; and linear discriminant analysis classification. More importantly, this paper provides a potential framework for combining other feature extraction models and classification algorithms with the proposed method for EEG signal classification. Experiments on single-trial EEG signals from two public competition datasets not only demonstrate that the proposed method is effective in selecting discriminative features and time segments, but also show that it yields relatively better classification results in comparison with other competitive methods. Copyright © 2018 Elsevier Ltd. All rights reserved.
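The Kullback-Leibler divergence at the heart of the selection criterion is straightforward to compute for discrete class-conditional feature distributions. The symmetrized form below is a generic sketch of such a discriminability score, not necessarily the exact statistic used in the paper:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Symmetrized divergence, usable to rank a feature's ability to
    separate two classes: the larger, the more discriminative."""
    return kl(p, q) + kl(q, p)

# Histograms of one feature's values under two motor-imagery classes:
left = [0.7, 0.2, 0.1]
right = [0.1, 0.3, 0.6]
score = symmetric_kl(left, right)
```

A feature whose class-conditional distributions coincide scores zero; ranking features by this score and keeping the top ones is the usual filter-style use of the measure.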
Compact cancer biomarkers discovery using a swarm intelligence feature selection algorithm.
Martinez, Emmanuel; Alvarez, Mario Moises; Trevino, Victor
2010-08-01
Biomarker discovery is a typical application of functional genomics. Due to the large number of genes studied simultaneously in microarray data, feature selection is a key step. Swarm intelligence has emerged as a solution for the feature selection problem. However, typical swarm intelligence settings for feature selection fail to select small feature subsets. We have proposed a swarm intelligence feature selection algorithm based on the initialization and update of only a subset of particles in the swarm. In this study, we tested our algorithm on 11 microarray datasets for brain, leukemia, lung, prostate, and other cancers. We show that the proposed swarm intelligence algorithm successfully increases the classification accuracy and decreases the number of selected features compared to other swarm intelligence methods. Copyright © 2010 Elsevier Ltd. All rights reserved.
Comparative study of feature selection with ensemble learning using SOM variants
NASA Astrophysics Data System (ADS)
Filali, Ameni; Jlassi, Chiraz; Arous, Najet
2017-03-01
Ensemble learning has succeeded in improving stability and clustering accuracy, but its runtime prohibits scaling up to real-world applications. This study addresses the problem of selecting a subset of the most pertinent features for every cluster of a dataset. The proposed method is an extension of the Random Forests approach, using self-organizing map (SOM) variants on unlabeled data, that estimates out-of-bag feature importance from a set of partitions. Every partition is created using a different bootstrap sample and a random subset of the features. We then show that the internal estimates used to measure variable pertinence in Random Forests are also applicable to feature selection in unsupervised learning. This approach targets dimensionality reduction, visualization, and cluster characterization at the same time. We provide empirical results on nineteen benchmark data sets indicating that RFS can lead to significant improvement in clustering accuracy over several state-of-the-art unsupervised methods, with a very limited subset of features. The approach shows promise for very broad domains.
Variable selection with stepwise and best subset approaches
2016-01-01
While purposeful selection is performed partly by software and partly by hand, the stepwise and best subset approaches are automatically performed by software. Two R functions, stepAIC() and bestglm(), are well designed for stepwise and best subset regression, respectively. The stepAIC() function begins with a full or null model, and the method for stepwise regression can be specified in the direction argument with character values “forward”, “backward” and “both”. The bestglm() function begins with a data frame containing explanatory variables and response variables. The response variable should be in the last column. A variety of goodness-of-fit criteria can be specified in the IC argument. The Bayesian information criterion (BIC) usually results in a more parsimonious model than the Akaike information criterion. PMID:27162786
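The exhaustive search that bestglm() performs can be mimicked in a few lines. The sketch below (Python rather than R, with a caller-supplied information-criterion function standing in for the IC argument) evaluates every candidate subset and keeps the lowest-scoring one:

```python
from itertools import combinations

def best_subset(features, score):
    """Exhaustive best-subset search: score() returns an information
    criterion (lower is better, e.g. AIC or BIC) for a candidate
    subset of predictors; every subset is evaluated."""
    best, best_score = (), score(())
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s < best_score:
                best, best_score = subset, s
    return best, best_score

# Toy criterion whose optimum is the subset {"x1", "x2"}:
target = {"x1", "x2"}
best, s = best_subset(["x1", "x2", "x3"],
                      lambda sub: len(set(sub) ^ target))
```

The cost is 2^p model fits, which is why stepwise procedures such as stepAIC() remain attractive when the number of predictors is large.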
Lin, Kuan-Cheng; Hsieh, Yi-Hsiu
2015-10-01
The classification and analysis of data is an important issue in today's research. Selecting a suitable set of features makes it possible to classify an enormous quantity of data quickly and efficiently. Feature selection is generally viewed as a feature subset selection problem, i.e., a combinatorial optimization problem. Evolutionary algorithms using random search methods have proven highly effective in obtaining solutions to optimization problems in a diversity of applications. In this study, we developed a hybrid evolutionary algorithm based on endocrine-based particle swarm optimization (EPSO) and artificial bee colony (ABC) algorithms in conjunction with a support vector machine (SVM) for the selection of optimal feature subsets for the classification of datasets. The results of experiments using specific UCI medical datasets demonstrate that the accuracy of the proposed hybrid evolutionary algorithm is superior to that of the basic PSO, EPSO, and ABC algorithms with regard to classification accuracy using subsets with a reduced number of features.
A Parameter Subset Selection Algorithm for Mixed-Effects Models
Schmidt, Kathleen L.; Smith, Ralph C.
2016-01-01
Mixed-effects models are commonly used to statistically model phenomena that include attributes associated with a population or general underlying mechanism as well as effects specific to individuals or components of the general mechanism. This can include individual effects associated with data from multiple experiments. However, the parameterizations used to incorporate the population and individual effects are often unidentifiable in the sense that the parameters are not uniquely specified by the data. As a result, the current literature focuses on model selection, by which insensitive parameters are fixed or removed from the model. Model selection methods that employ information criteria are applicable to both linear and nonlinear mixed-effects models, but such techniques are limited in that they are computationally prohibitive for large problems due to the number of possible models that must be tested. To limit the scope of possible models for model selection via information criteria, we introduce a parameter subset selection (PSS) algorithm for mixed-effects models, which orders the parameters by their significance. Finally, we provide examples to verify the effectiveness of the PSS algorithm and to test the performance of mixed-effects model selection that makes use of parameter subset selection.
Wang, Shuaiqun; Aorigele; Kong, Wei; Zeng, Weiming; Hong, Xiaomin
2016-01-01
Gene expression data composed of thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features over a large number of gene expression data. Lately, many researchers devote themselves to feature selection using diverse computational intelligence methods. However, in the progress of selecting informative genes, many computational methods face difficulties in selecting small subsets for cancer classification due to the huge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm HICATS incorporating imperialist competition algorithm (ICA) which performs global search and tabu search (TS) that conducts fine-tuned search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved to be superior to other related works including the conventional version of binary optimization algorithm in terms of classification accuracy and the number of selected genes.
Ballabio, Davide; Consonni, Viviana; Mauri, Andrea; Todeschini, Roberto
2010-01-11
In multivariate regression and classification, variable selection is an important procedure used to select an optimal subset of variables, with the aim of producing more parsimonious and potentially more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and highly dimensional analytical procedures. In this paper a novel method for variable selection for classification purposes is introduced. This method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). The CMC index is in this case calculated for two specific sets of variables, the former comprising the independent variables and the latter the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to give a ranking of the variables on the basis of their class discrimination capabilities. Alternatively, the CMC index can be calculated for all possible combinations of variables and the variable subset with the maximal CMC selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks' Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. A variable Forward Selection based on the CMC index was finally used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets. The results obtained were encouraging.
Variable screening via quantile partial correlation
Ma, Shujie; Tsai, Chih-Ling
2016-01-01
In quantile linear regression with ultra-high dimensional data, we propose an algorithm for screening all candidate variables and subsequently selecting relevant predictors. Specifically, we first employ quantile partial correlation for screening, and then we apply the extended Bayesian information criterion (EBIC) for best subset selection. Our proposed method can successfully select predictors when the variables are highly correlated, and it can also identify variables that make a contribution to the conditional quantiles but are marginally uncorrelated or only weakly correlated with the response. Theoretical results show that the proposed algorithm can yield the sure screening set. By controlling the false selection rate, model selection consistency can be achieved theoretically. In practice, we propose using EBIC for best subset selection so that the resulting model is screening consistent. Simulation studies demonstrate that the proposed algorithm performs well, and an empirical example is presented. PMID:28943683
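The extended BIC used in the best-subset step has a simple closed form (shown here for Gaussian errors; γ weights the extra penalty). This is a sketch of the standard EBIC formula, not code from the paper:

```python
import math

def ebic(rss, n, k, p, gamma=0.5):
    """Extended Bayesian information criterion for a linear model with
    k of p candidate predictors, residual sum of squares rss, and n
    samples. The extra 2*gamma*log(C(p, k)) term penalizes the size of
    the model space, which matters when p is ultra-high dimensional."""
    return (n * math.log(rss / n)
            + k * math.log(n)
            + 2 * gamma * math.log(math.comb(p, k)))
```

With gamma = 0 and k fixed the expression reduces to the ordinary BIC; larger gamma selects sparser models among the screened candidates.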
Identification of selection signatures in cattle breeds selected for dairy production.
Stella, Alessandra; Ajmone-Marsan, Paolo; Lazzari, Barbara; Boettcher, Paul
2010-08-01
The genomics revolution has spurred the undertaking of HapMap studies of numerous species, allowing for population genomics to increase the understanding of how selection has created genetic differences between subspecies populations. The objectives of this study were to (1) develop an approach to detect signatures of selection in subsets of phenotypically similar breeds of livestock by comparing single nucleotide polymorphism (SNP) diversity between the subset and a larger population, (2) verify this method in breeds selected for simply inherited traits, and (3) apply this method to the dairy breeds in the International Bovine HapMap (IBHM) study. The data consisted of genotypes for 32,689 SNPs of 497 animals from 19 breeds. For a given subset of breeds, the test statistic was the parametric composite log likelihood (CLL) of the differences in allelic frequencies between the subset and the IBHM for a sliding window of SNPs. The null distribution was obtained by calculating CLL for 50,000 random subsets (per chromosome) of individuals. The validity of this approach was confirmed by obtaining extremely large CLLs at the sites of causative variation for polled (BTA1) and black-coat-color (BTA18) phenotypes. Across the 30 bovine chromosomes, 699 putative selection signatures were detected. The largest CLL was on BTA6 and corresponded to KIT, which is responsible for the piebald phenotype present in four of the five dairy breeds. Potassium channel-related genes were at the site of the largest CLL on three chromosomes (BTA14, -16, and -25) whereas integrins (BTA18 and -19) and serine/arginine rich splicing factors (BTA20 and -23) each had the largest CLL on two chromosomes. On the basis of the results of this study, the application of population genomics to farm animals seems quite promising. 
Comparisons between breed groups have the potential to identify genomic regions influencing complex traits with no need for complex equipment and the collection of extensive phenotypic records and can contribute to the identification of candidate genes and to the understanding of the biological mechanisms controlling complex traits.
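The windowed test statistic can be sketched as a binomial composite log-likelihood of the subset's allele counts under the full-population frequencies, summed over the SNPs in a sliding window. This is an illustrative form only; the study's exact parametric CLL of allele-frequency differences may differ:

```python
import math

def composite_log_likelihood(subset_counts, pop_freqs):
    """Sum of binomial log-likelihoods over the SNPs in one window.

    subset_counts: list of (minor-allele count, allele total) per SNP
    pop_freqs:     matching minor-allele frequencies in the full panel"""
    cll = 0.0
    for (k, n), p in zip(subset_counts, pop_freqs):
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard against log(0)
        cll += (math.log(math.comb(n, k))
                + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return cll

# A window whose subset frequencies match the population scores higher
# than one where they diverge sharply (a candidate selection signature):
neutral = composite_log_likelihood([(5, 10)], [0.5])
diverged = composite_log_likelihood([(5, 10)], [0.01])
```

In the study, extreme values of the statistic relative to a null distribution built from 50,000 random subsets flag putative selection signatures.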
Karayianni, Katerina N; Grimaldi, Keith A; Nikita, Konstantina S; Valavanis, Ioannis K
2015-01-01
This paper aims to elucidate the complex etiology underlying obesity by analysing data from a large nutrigenetics study, in which nutritional and genetic factors associated with obesity were recorded for around two thousand individuals. In our previous work, these data were analysed using artificial neural network methods, which identified optimised subsets of factors for predicting obesity status. However, these methods did not reveal how the selected factors interact with each other in the obtained predictive models. For that reason, parallel Multifactor Dimensionality Reduction (pMDR) was used here to further analyse the pre-selected subsets of nutrigenetic factors. Within pMDR, predictive models using up to eight factors were constructed, further reducing the input dimensionality, while rules describing the interactive effects of the selected factors were derived. In this way, it was possible to identify specific genetic variations and their interactive effects with particular nutritional factors, which are now under further study.
Method and apparatus for wavefront sensing
Bahk, Seung-Whan
2016-08-23
A method of measuring characteristics of a wavefront of an incident beam includes obtaining an interferogram associated with the incident beam passing through a transmission mask and Fourier transforming the interferogram to provide a frequency domain interferogram. The method also includes selecting a subset of harmonics from the frequency domain interferogram, individually inverse Fourier transforming each of the subset of harmonics to provide a set of spatial domain harmonics, and extracting a phase profile from each of the set of spatial domain harmonics. The method further includes removing phase discontinuities in the phase profile, rotating the phase profile, and reconstructing a phase front of the wavefront of the incident beam.
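The phase-discontinuity removal step is classic one-dimensional phase unwrapping: wherever the extracted phase jumps by more than π between neighbouring samples, a multiple of 2π is added or subtracted. A minimal sketch, independent of the patent's actual implementation:

```python
import math

def unwrap_phase(phases):
    """Remove 2*pi discontinuities from a 1-D phase profile, as needed
    after extracting phase from each spatial-domain harmonic."""
    out = [phases[0]]
    offset = 0.0
    for prev, cur in zip(phases, phases[1:]):
        d = cur - prev
        if d > math.pi:
            offset -= 2 * math.pi
        elif d < -math.pi:
            offset += 2 * math.pi
        out.append(cur + offset)
    return out

# A linear phase ramp wrapped into (-pi, pi] is recovered exactly:
true = [0.5 * i for i in range(14)]
wrapped = [((t + math.pi) % (2 * math.pi)) - math.pi for t in true]
recovered = unwrap_phase(wrapped)
```

The same idea extends to 2-D phase maps, where the unwrap is typically applied along rows and columns or via more robust path-following schemes.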
Selecting sequence variants to improve genomic predictions for dairy cattle
USDA-ARS?s Scientific Manuscript database
Millions of genetic variants have been identified by population-scale sequencing projects, but subsets are needed for routine genomic predictions or to include on genotyping arrays. Methods of selecting sequence variants were compared using both simulated sequence genotypes and actual data from run ...
NASA Astrophysics Data System (ADS)
Hou, Yanqing; Verhagen, Sandra; Wu, Jie
2016-12-01
Ambiguity Resolution (AR) is a key technique in GNSS precise positioning. In the case of weak models (i.e., low precision of data), however, the success rate of AR may be low, and wrong fixing may consequently introduce large errors into the baseline solution. Partial Ambiguity Resolution (PAR) has therefore been proposed, such that the baseline precision can be improved by fixing only a subset of ambiguities with a high success rate. This contribution proposes a new PAR strategy that selects the subset so that the expected precision gain is maximized among a set of pre-selected subsets, while at the same time the failure rate is controlled. These pre-selected subsets are chosen to have the highest success rate among those of the same size. The strategy is called the Two-step Success Rate Criterion (TSRC), as it first tries to fix a relatively large subset, using the fixed failure rate ratio test (FFRT) to decide on acceptance or rejection. In case of rejection, a smaller subset is fixed and validated by the ratio test so as to fulfill the overall failure rate criterion. It is shown how the method can be used in practice without introducing a large additional computational effort and, more importantly, how it can improve (or at least not deteriorate) the availability in terms of baseline precision compared to the classical Success Rate Criterion (SRC) PAR strategy, based on a simulation validation. In the simulations, significant improvements are obtained for single-GNSS on short baselines with dual-frequency observations. For dual-constellation GNSS, the improvement for single-frequency observations on short baselines is very significant, on average 68%. For medium to long baselines with dual-constellation GNSS, the average improvement is around 20-30%.
A New Direction of Cancer Classification: Positive Effect of Low-Ranking MicroRNAs.
Li, Feifei; Piao, Minghao; Piao, Yongjun; Li, Meijing; Ryu, Keun Ho
2014-10-01
Many studies based on microRNA (miRNA) expression profiles have shown a new aspect of cancer classification. Because one characteristic of miRNA expression data is high dimensionality, feature selection methods have been used to facilitate dimensionality reduction. These feature selection methods have had one shortcoming thus far: they only consider feature-to-class relationships that are 1:1 or n:1. However, because one miRNA may influence more than one type of cancer, such miRNAs tend to be ranked low by traditional feature selection methods and are removed most of the time. Given the limited number of miRNAs, low-ranking miRNAs are also important to cancer classification. We considered both high- and low-ranking features to cover all cases (1:1, n:1, 1:n, and m:n) in cancer classification. First, we used the correlation-based feature selection method to select the high-ranking miRNAs, and chose the support vector machine, Bayes network, decision tree, k-nearest-neighbor, and logistic classifiers to construct cancer classifiers. Then, we used the Chi-square test, information gain, gain ratio, and Pearson's correlation feature selection methods to build the m:n feature subset, and used the selected miRNAs for cancer classification. The low-ranking miRNA expression profiles achieved higher classification accuracy than using only the high-ranking miRNAs from traditional feature selection methods. Our results demonstrate that the m:n feature subset yields a positive effect of low-ranking miRNAs in cancer classification.
Capela, Nicole A; Lemaire, Edward D; Baddour, Natalie
2015-01-01
Human activity recognition (HAR), using wearable sensors, is a growing area with the potential to provide valuable information on patient mobility to rehabilitation specialists. Smartphones with accelerometer and gyroscope sensors are a convenient, minimally invasive, and low cost approach for mobility monitoring. HAR systems typically pre-process raw signals, segment the signals, and then extract features to be used in a classifier. Feature selection is a crucial step in the process to reduce potentially large data dimensionality and provide viable parameters to enable activity classification. Most HAR systems are customized to an individual research group, including a unique data set, classes, algorithms, and signal features. These data sets are obtained predominantly from able-bodied participants. In this paper, smartphone accelerometer and gyroscope sensor data were collected from populations that can benefit from human activity recognition: able-bodied, elderly, and stroke patients. Data from a consecutive sequence of 41 mobility tasks (18 different tasks) were collected for a total of 44 participants. Seventy-six signal features were calculated and subsets of these features were selected using three filter-based, classifier-independent, feature selection methods (Relief-F, Correlation-based Feature Selection, Fast Correlation Based Filter). The feature subsets were then evaluated using three generic classifiers (Naïve Bayes, Support Vector Machine, j48 Decision Tree). Common features were identified for all three populations, although the stroke population subset had some differences from both able-bodied and elderly sets. Evaluation with the three classifiers showed that the feature subsets produced similar or better accuracies than classification with the entire feature set. Therefore, since these feature subsets are classifier-independent, they should be useful for developing and improving HAR systems across and within populations.
Column Subset Selection, Matrix Factorization, and Eigenvalue Optimization
2008-07-01
…Pietsch and Grothendieck, which are regarded as basic instruments in modern functional analysis [Pis86]. … The methods for computing these … Pietsch factorization and the maxcut semidefinite program [GW95]. 1.2. Overview. We focus on the algorithmic version of the Kashin–Tzafriri theorem … will see that the desired subset is exposed by factoring the random submatrix. This factorization, which was invented by Pietsch, is regarded as a basic…
Spectral Band Selection for Urban Material Classification Using Hyperspectral Libraries
NASA Astrophysics Data System (ADS)
Le Bris, A.; Chehata, N.; Briottet, X.; Paparoditis, N.
2016-06-01
In urban areas, information concerning very high resolution land cover, and especially material maps, is necessary for several city modelling or monitoring applications. That is to say, knowledge concerning roofing materials or the different kinds of ground areas is required. Airborne remote sensing techniques appear to be convenient for providing such information at a large scale. However, results obtained using most traditional processing methods based on the usual red-green-blue-near-infrared multispectral images remain limited for such applications. A possible way to improve classification results is to enhance the imagery's spectral resolution using superspectral or hyperspectral sensors. In this study, the intent is to design a superspectral sensor dedicated to urban material classification, and this work particularly focused on the selection of optimal spectral band subsets for such a sensor. First, reflectance spectral signatures of urban materials were collected from 7 spectral libraries. Then, spectral optimization was performed using this data set. The band selection workflow included two steps: first optimising the number of spectral bands using an incremental method, and then examining several possible optimised band subsets using a stochastic algorithm. The same wrapper relevance criterion, relying on a confidence measure of a Random Forests classifier, was used at both steps. To cope with the limited number of available spectra for several classes, additional synthetic spectra were generated from the collection of reference spectra: intra-class variability was simulated by multiplying reference spectra by a random coefficient. In the end, the selected band subsets were evaluated considering the classification quality reached using an RBF SVM classifier. It was confirmed that a limited band subset was sufficient to classify common urban materials.
The important contribution of bands from the Short Wave Infra-Red (SWIR) spectral domain (1000-2400 nm) to material classification was also shown.
GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets.
Jeong, Seongmun; Kim, Jae-Yoon; Jeong, Soon-Chun; Kang, Sung-Taeg; Moon, Jung-Kyung; Kim, Namshin
2017-01-01
Selecting core subsets from plant genotype datasets is important for enhancing cost-effectiveness and shortening the time required for analyses in genome-wide association studies (GWAS), genomics-assisted breeding of crop species, etc. Recently, large numbers of genetic markers (>100,000 single nucleotide polymorphisms) have been identified from high-density single nucleotide polymorphism (SNP) arrays and next-generation sequencing (NGS) data. However, no software has been available for picking out an efficient and consistent core subset from such a huge dataset. It is necessary to develop software that can coherently extract genetically important samples from a population. We here present a new program, GenoCore, which can quickly and efficiently find a core subset representing the entire population. We introduce simple measures of coverage and diversity scores, which reflect genotype errors and genetic variations, and can help to select samples rapidly and accurately from a crop genotype dataset. Comparisons of our method to other core collection software using example datasets were performed to validate its performance according to genetic distance, diversity, coverage, required system resources, and the number of selected samples. GenoCore selects the smallest, most consistent, and most representative core collection from all samples, using less memory with more efficient scores, and shows greater genetic coverage compared to the other software tested. GenoCore was written in the R language, and can be accessed online with an example dataset and test results at https://github.com/lovemun/Genocore.
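GenoCore's published coverage and diversity scores are not reproduced here, but the general idea of greedily growing a core subset until it covers the population's marker-allele diversity can be sketched as follows (a pure-Python illustration; the marker-allele coverage criterion is an assumed stand-in, not GenoCore's actual measure):

```python
def allele_pairs(samples):
    # set of (marker_index, allele) pairs observed in a group of samples
    return {(i, a) for s in samples for i, a in enumerate(s)}

def greedy_core(samples, target=1.0):
    """Greedily pick samples until `target` fraction of all
    marker-allele pairs present in the population is covered."""
    universe = allele_pairs(samples)
    chosen, covered = [], set()
    remaining = list(range(len(samples)))
    while remaining and len(covered) < target * len(universe):
        # add the sample contributing the most uncovered pairs
        best = max(remaining,
                   key=lambda j: len(allele_pairs([samples[j]]) - covered))
        chosen.append(best)
        covered |= allele_pairs([samples[best]])
        remaining.remove(best)
    return chosen
```

For a toy population of three two-marker genotypes, two samples suffice to cover every marker-allele pair, which mirrors the goal of a small but fully representative core collection.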
Gene selection for microarray data classification via subspace learning and manifold regularization.
Tang, Chang; Cao, Lijuan; Zheng, Xiao; Wang, Minhui
2017-12-19
With the rapid development of DNA microarray technology, large amounts of genomic data have been generated. Classification of these microarray data is a challenging task, since gene expression data often comprise thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data, with the irrelevant and redundant genes removed. Compared with the original data, the selected gene subset can benefit the classification task. We formulate the gene selection task as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high-dimensional microarray data into a lower-dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of the original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator for the different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better than other state-of-the-art methods in terms of microarray data classification.
Optimizing an Actuator Array for the Control of Multi-Frequency Noise in Aircraft Interiors
NASA Technical Reports Server (NTRS)
Palumbo, D. L.; Padula, S. L.
1997-01-01
Techniques developed for selecting an optimized actuator array for interior noise reduction at a single frequency are extended to the multi-frequency case. Transfer functions for 64 actuators were obtained at 5 frequencies from ground testing the rear section of a fully trimmed DC-9 fuselage. A single loudspeaker facing the left side of the aircraft was the primary source. A combinatorial search procedure (tabu search) was employed to find optimum actuator subsets of 2 to 16 actuators. Noise reduction predictions derived from the transfer functions were used as a basis for evaluating actuator subsets during optimization. Results indicate that it is necessary to constrain actuator forces during optimization. Unconstrained optimizations selected actuators which require unrealistically large forces. Two methods of constraint are evaluated. It is shown that a fast, but approximate, method yields results equivalent to an accurate, but computationally expensive, method.
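The combinatorial search step can be illustrated with a generic swap-move tabu search over fixed-size subsets (a minimal sketch; the cost function, tabu tenure, and neighbourhood below are placeholder assumptions, and the paper's actuator-force constraints are not modelled):

```python
import random

def tabu_subset(n, k, cost, iters=200, tenure=5, seed=0):
    """Swap-move tabu search over size-k subsets of range(n):
    each step swaps one member out and one non-member in."""
    rng = random.Random(seed)
    current = set(rng.sample(range(n), k))
    best, best_cost = set(current), cost(current)
    tabu = {}  # element -> iteration until which re-adding it is tabu
    for t in range(iters):
        moves = []
        for out in current:
            for inn in set(range(n)) - current:
                c = cost((current - {out}) | {inn})
                # aspiration: a tabu move is allowed if it beats the best
                if tabu.get(inn, -1) < t or c < best_cost:
                    moves.append((c, out, inn))
        if not moves:
            break
        c, out, inn = min(moves)  # best admissible neighbour
        current = (current - {out}) | {inn}
        tabu[out] = t + tenure
        if c < best_cost:
            best, best_cost = set(current), c
    return best, best_cost
```

The tabu list prevents immediate cycling back to a just-removed element, which is what lets the search escape local minima that a pure greedy swap would get stuck in.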
Arruti, Andoni; Cearreta, Idoia; Álvarez, Aitor; Lazkano, Elena; Sierra, Basilio
2014-01-01
Study of emotions in human–computer interaction is a growing research area. This paper shows an attempt to select the most significant features for emotion recognition in spoken Basque and Spanish languages using different methods for feature selection. The RekEmozio database was used as the experimental data set. Several Machine Learning paradigms were used for the emotion classification task. Experiments were executed in three phases, using different sets of features as classification variables in each phase. Moreover, feature subset selection was applied at each phase in order to seek the most relevant feature subset. The three-phase approach was selected to check the validity of the proposed approach. Achieved results show that an instance-based learning algorithm using feature subset selection techniques based on evolutionary algorithms is the best Machine Learning paradigm in automatic emotion recognition, with all different feature sets, obtaining a mean emotion recognition rate of 80.05% in Basque and 74.82% in Spanish. In order to check the goodness of the proposed process, a greedy searching approach (FSS-Forward) has been applied and a comparison between them is provided. Based on the achieved results, a set of the most relevant non-speaker-dependent features is proposed for both languages and new perspectives are suggested. PMID:25279686
Defining an essence of structure determining residue contacts in proteins.
Sathyapriya, R; Duarte, Jose M; Stehr, Henning; Filippis, Ioannis; Lappe, Michael
2009-12-01
The network of native non-covalent residue contacts determines the three-dimensional structure of a protein. However, not all contacts are of equal structural significance, and little knowledge exists about a minimal, yet sufficient, subset required to define the global features of a protein. Characterisation of this "structural essence" has remained elusive so far: no algorithmic strategy has been devised to date that could outperform a random selection in terms of 3D reconstruction accuracy (measured as the Ca RMSD). It is not only of theoretical interest (i.e., for the design of advanced statistical potentials) to identify the number and nature of essential native contacts; such a subset of spatial constraints is very useful in a number of novel experimental methods (like EPR) which rely heavily on constraint-based protein modelling. To derive accurate three-dimensional models from distance constraints, we implemented a reconstruction pipeline using distance geometry. We selected a test set of 12 protein structures from the four major SCOP fold classes and performed our reconstruction analysis. As a reference set, series of random subsets (ranging from 10% to 90% of native contacts) were generated for each protein, and the reconstruction accuracy was computed for each subset. We have developed a rational strategy, termed "cone-peeling", that combines sequence features and network descriptors to select minimal subsets that outperform the reference sets. We present, for the first time, a rational strategy to derive a structural essence of residue contacts and provide an estimate of the size of this minimal subset. Our algorithm computes sparse subsets capable of determining the tertiary structure at approximately 4.8 Å Ca RMSD with as little as 8% of the native contacts (Ca-Ca and Cb-Cb). At the same time, a randomly chosen subset of native contacts needs about twice as many contacts to reach the same level of accuracy.
This "structural essence" opens new avenues in the fields of structure prediction, empirical potentials and docking.
Efficient Simulation Budget Allocation for Selecting an Optimal Subset
NASA Technical Reports Server (NTRS)
Chen, Chun-Hung; He, Donghai; Fu, Michael; Lee, Loo Hay
2008-01-01
We consider a class of the subset selection problem in ranking and selection. The objective is to identify the top m out of k designs based on simulated output. Traditional procedures are conservative and inefficient. Using the optimal computing budget allocation framework, we formulate the problem as that of maximizing the probability of correctly selecting all of the top-m designs subject to a constraint on the total number of samples available. For an approximation of this correct selection probability, we derive an asymptotically optimal allocation and propose an easy-to-implement heuristic sequential allocation procedure. Numerical experiments indicate that the resulting allocations are superior to other methods in the literature that we tested, and the relative efficiency increases for larger problems. In addition, preliminary numerical results indicate that the proposed new procedure has the potential to enhance computational efficiency for simulation optimization.
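The paper's asymptotically optimal allocation rule is not reproduced here, but a simplified sequential heuristic in the same spirit — spend the next replication on the design whose sample mean is least well separated from the observed top-m boundary relative to its standard error — can be sketched as (this rule is an illustrative assumption, not the paper's derived allocation):

```python
import random
import statistics

def select_top_m(simulate, k, m, budget, n0=5):
    """Sequentially spend a simulation budget to identify the top-m
    of k designs; simulate(i) returns one noisy sample of design i."""
    data = [[simulate(i) for _ in range(n0)] for i in range(k)]
    spent = k * n0
    while spent < budget:
        means = [statistics.mean(d) for d in data]
        order = sorted(range(k), key=lambda i: means[i], reverse=True)
        # boundary between the observed m-th and (m+1)-th best designs
        c = 0.5 * (means[order[m - 1]] + means[order[m]])
        def separation(i):
            se = (statistics.stdev(data[i]) or 1e-9) / len(data[i]) ** 0.5
            return abs(means[i] - c) / se
        j = min(range(k), key=separation)  # most ambiguous design
        data[j].append(simulate(j))
        spent += 1
    means = [statistics.mean(d) for d in data]
    return set(sorted(range(k), key=lambda i: means[i], reverse=True)[:m])
```

Designs far above or far below the boundary receive few extra samples, which is the intuition behind concentrating the budget where misclassification between "top-m" and "not top-m" is most likely.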
2009-01-01
selection and uncertainty sampling significantly. Index Terms: Transcription, labeling, submodularity, submodular selection, active learning, sequence... name of batch active learning, where a subset of data that is most informative and representative of the whole is selected for labeling. Often... representative subset. Note that our Fisher kernel is over an unsupervised generative model, which enables us to bootstrap our active learning approach
Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE.
Chen, Qi; Meng, Zhaopeng; Liu, Xinyi; Jin, Qianguo; Su, Ran
2018-06-15
Feature selection, which identifies a set of the most informative features from the original feature space, has been widely used to simplify the predictor. Recursive feature elimination (RFE), as one of the most popular feature selection approaches, is effective in data dimension reduction and efficiency increase. A ranking of features, as well as candidate subsets with the corresponding accuracy, is produced through RFE. The subset with highest accuracy (HA) or a preset number of features (PreNum) is often used as the final subset. However, this may lead to a large number of features being selected, or, if there is no prior knowledge about this preset number, the final subset selection is often ambiguous and subjective. A proper decision variant is in high demand to automatically determine the optimal subset. In this study, we conduct pioneering work to explore the decision variant after obtaining a list of candidate subsets from RFE. We provide a detailed analysis and comparison of several decision variants to automatically select the optimal feature subset. A random forest (RF)-recursive feature elimination (RF-RFE) algorithm and a voting strategy are introduced. We validated the variants on two totally different molecular biology datasets, one for a toxicogenomic study and the other for protein sequence analysis. The study provides an automated way to determine the optimal feature subset when using RF-RFE.
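The RFE loop itself is simple to sketch: score the active features, drop the worst, and repeat. The sketch below uses absolute Pearson correlation with the label as a stand-in for random-forest importance (an assumption made purely to keep the example dependency-free):

```python
def pearson_abs(col, y):
    # |Pearson correlation| between one feature column and the label
    n = len(y)
    mx, my = sum(col) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    sx = sum((a - mx) ** 2 for a in col) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def rfe_order(columns, y, importance=pearson_abs):
    """Recursive feature elimination: repeatedly drop the least
    important active feature; returns feature indices, worst first."""
    active = list(range(len(columns)))
    eliminated = []
    while active:
        # with a model-based importance (e.g. a random forest), this
        # re-scoring after each removal is what makes RFE "recursive"
        worst = min(active, key=lambda j: importance(columns[j], y))
        eliminated.append(worst)
        active.remove(worst)
    return eliminated
```

The elimination order doubles as a ranking, and the candidate subsets the abstract refers to are simply the suffixes of that ranking.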
NASA Astrophysics Data System (ADS)
Zhang, Chen; Ni, Zhiwei; Ni, Liping; Tang, Na
2016-10-01
Feature selection is an important method of data preprocessing in data mining. In this paper, a novel feature selection method based on multi-fractal dimension and the harmony search algorithm is proposed. Multi-fractal dimension is adopted as the evaluation criterion of the feature subset, which can determine the number of selected features. An improved harmony search algorithm is used as the search strategy to improve the efficiency of feature selection. The performance of the proposed method is compared with that of other feature selection algorithms on UCI datasets. Besides, the proposed method is also used to predict the daily average concentration of PM2.5 in China. Experimental results show that the proposed method can obtain competitive results in terms of both prediction accuracy and the number of selected features.
Qian, Chongsheng; Wang, Yingying; Cai, Huili; Laroye, Caroline; De Carvalho Bittencourt, Marcelo; Clement, Laurence; Stoltz, Jean-François; Decot, Véronique; Reppel, Loïc; Bensoussan, Danièle
2016-01-01
Adoptive antiviral cellular immunotherapy by infusion of virus-specific T cells (VSTs) is becoming an alternative treatment for viral infection after hematopoietic stem cell transplantation. The T memory stem cell (TSCM) subset was recently described as exhibiting self-renewal and multipotency properties which are required for sustained efficacy in vivo. We wondered if such a crucial subset for immunotherapy was present in VSTs. We identified, by flow cytometry, TSCM in adenovirus (ADV)-specific interferon (IFN)-γ+ T cells before and after IFN-γ-based immunomagnetic selection, and analyzed the distribution of the main T-cell subsets in VSTs: naive T cells (TN), TSCM, T central memory cells (TCM), T effector memory cells (TEM), and effector T cells (TEFF). In this study all of the different T-cell subsets were observed in the blood sample from healthy donor ADV-VSTs, both before and after IFN-γ-based immunomagnetic selection. As the IFN-γ-based immunomagnetic selection system sorts mainly the most differentiated T-cell subsets, we observed that TEM was always the major T-cell subset of ADV-specific T cells after immunomagnetic isolation and especially after expansion in vitro. Comparing T-cell subpopulation profiles before and after in vitro expansion, we observed that in vitro cell culture with interleukin-2 resulted in a significant expansion of the TN-like, TCM, TEM, and TEFF subsets in CD4+IFN-γ+ T cells and of the TCM and TEM subsets only in CD8+IFN-γ+ T cells. We demonstrated the presence of all T-cell subsets in IFN-γ+ VSTs, including the TSCM subpopulation, although this was weakly selected by the IFN-γ-based immunomagnetic selection system.
ERIC Educational Resources Information Center
Brusco, Michael J.; Singh, Renu; Steinley, Douglas
2009-01-01
The selection of a subset of variables from a pool of candidates is an important problem in several areas of multivariate statistics. Within the context of principal component analysis (PCA), a number of authors have argued that subset selection is crucial for identifying those variables that are required for correct interpretation of the…
Variable Selection through Correlation Sifting
NASA Astrophysics Data System (ADS)
Huang, Jim C.; Jojic, Nebojsa
Many applications of computational biology require a variable selection procedure to sift through a large number of input variables and select some smaller number that influence a target variable of interest. For example, in virology, only some small number of viral protein fragments influence the nature of the immune response during viral infection. Due to the large number of variables to be considered, a brute-force search for the subset of variables is in general intractable. To approximate this, methods based on ℓ1-regularized linear regression have been proposed and have been found to be particularly successful. It is well understood however that such methods fail to choose the correct subset of variables if these are highly correlated with other "decoy" variables. We present a method for sifting through sets of highly correlated variables which leads to higher accuracy in selecting the correct variables. The main innovation is a filtering step that reduces correlations among variables to be selected, making the ℓ1-regularization effective for datasets on which many methods for variable selection fail. The filtering step changes both the values of the predictor variables and output values by projections onto components obtained through a computationally-inexpensive principal components analysis. In this paper we demonstrate the usefulness of our method on synthetic datasets and on novel applications in virology. These include HIV viral load analysis based on patients' HIV sequences and immune types, as well as the analysis of seasonal variation in influenza death rates based on the regions of the influenza genome that undergo diversifying selection in the previous season.
An Ensemble Framework Coping with Instability in the Gene Selection Process.
Castellanos-Garzón, José A; Ramos, Juan; López-Sánchez, Daniel; de Paz, Juan F; Corchado, Juan M
2018-03-01
This paper proposes an ensemble framework for gene selection, which is aimed at addressing instability problems presented in the gene filtering task. The complex process of gene selection from gene expression data faces different instability problems arising from the informative gene subsets found by different filter methods. This makes the identification of significant genes by the experts difficult. The instability of results can come from filter methods, gene classifier methods, different datasets of the same disease and multiple valid groups of biomarkers. Even though there are many proposals, the complexity imposed by this problem remains a challenge today. This work proposes a framework involving five stages of gene filtering to discover biomarkers for diagnosis and classification tasks. This framework performs a process of stable feature selection, facing the problems above and thus providing a more suitable and reliable solution for clinical and research purposes. Our proposal involves a process of multistage gene filtering, in which several ensemble strategies for gene selection were added in such a way that different classifiers simultaneously assess gene subsets to face instability. Firstly, we apply an ensemble of recent gene selection methods to obtain diversity in the genes found (stability according to filter methods). Next, we apply an ensemble of known classifiers to filter genes relevant to all classifiers at a time (stability according to classification methods). The achieved results were evaluated on two different datasets of the same disease (pancreatic ductal adenocarcinoma), in search of stability according to the disease, for which promising results were achieved.
Minimization search method for data inversion
NASA Technical Reports Server (NTRS)
Fymat, A. L.
1975-01-01
Technique has been developed for determining values of selected subsets of independent variables in mathematical formulations. Required computation time increases with first power of the number of variables. This is in contrast with classical minimization methods for which computational time increases with third power of the number of variables.
Furlanello, Cesare; Serafini, Maria; Merler, Stefano; Jurman, Giuseppe
2003-11-06
We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process). With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles. Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.
EFS: an ensemble feature selection tool implemented as R-package and web-application.
Neumann, Ursula; Genze, Nikita; Heider, Dominik
2017-01-01
Feature selection methods aim at identifying a subset of features that improves the prediction performance of subsequent classification models and thereby also simplifies their interpretability. Preceding studies demonstrated that single feature selection methods can have specific biases, whereas an ensemble feature selection has the advantage of alleviating and compensating for these biases. The software EFS (Ensemble Feature Selection) makes use of multiple feature selection methods and combines their normalized outputs into a quantitative ensemble importance. Currently, eight different feature selection methods have been integrated in EFS, which can be used separately or combined in an ensemble. EFS identifies relevant features while compensating for the specific biases of single methods due to its ensemble approach. Thereby, EFS can improve the prediction accuracy and interpretability of subsequent binary classification models. EFS can be downloaded as an R-package from CRAN or used via a web application at http://EFS.heiderlab.de.
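The core combination step — min-max normalise each method's raw importances and average them into one ensemble score — can be sketched as follows (a minimal sketch; EFS's eight integrated methods and their exact weighting are not reproduced here):

```python
def ensemble_importance(score_lists):
    """Combine per-method feature importances into one ensemble score.
    score_lists: list of dicts, each mapping feature -> raw importance
    from one selection method."""
    combined = {}
    for scores in score_lists:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against constant scores
        for f, s in scores.items():
            # min-max normalise to [0, 1], then average across methods
            combined[f] = combined.get(f, 0.0) + (s - lo) / span / len(score_lists)
    return combined
```

Because each method's scores are rescaled to [0, 1] before averaging, no single method with a large raw-score range can dominate the ensemble ranking.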
Linear methods for reducing EMG contamination in peripheral nerve motor decodes.
Kagan, Zachary B; Wendelken, Suzanne; Page, David M; Davis, Tyler; Hutchinson, Douglas T; Clark, Gregory A; Warren, David J
2016-08-01
Signals recorded from the peripheral nervous system (PNS) with high channel count penetrating microelectrode arrays, such as the Utah Slanted Electrode Array (USEA), often have electromyographic (EMG) signals contaminating the neural signal. This common-mode signal source may prevent single neural units from successfully being detected, thus hindering motor decode algorithms. Reducing this EMG contamination may lead to more accurate motor decode performance. A virtual reference (VR), created by a weighted linear combination of signals from a subset of all available channels, can be used to reduce this EMG contamination. Four methods of determining individual channel weights and six different methods of selecting subsets of channels were investigated (24 different VR types in total). The methods of determining individual channel weights were equal weighting, regression-based weighting, and two different proximity-based weightings. The subsets of channels were selected by a radius-based criterion, such that a channel was included if it was within a particular radius of inclusion from the target channel. These six radii of inclusion were 1.5, 2.9, 3.2, 5, 8.4, and 12.8 electrode-distances; the 12.8 electrode radius includes all USEA electrodes. We found that application of a VR improves the detectability of neural events via increasing the SNR, but we found no statistically meaningful difference amongst the VR types we examined. The computational complexity of implementation varies with respect to the method of determining channel weights and the number of channels in a subset, but does not correlate with VR performance. Hence, we examined the computational costs of calculating and applying the VR and based on these criteria, we recommend an equal weighting method of assigning weights with a 3.2 electrode-distance radius of inclusion. Further, we found empirically that application of the recommended VR will require less than 1 ms for 33.3 ms of data from one USEA.
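The recommended configuration — an equal-weight virtual reference built from all channels within a fixed radius of the target electrode — can be sketched as follows (the electrode layout and signal values here are illustrative assumptions):

```python
def apply_virtual_reference(signals, positions, target, radius):
    """Subtract from `target` the equal-weight mean of every other
    channel whose electrode lies within `radius` of it.
    signals: channel -> list of samples; positions: channel -> (x, y)."""
    tx, ty = positions[target]
    neighbours = [c for c in signals if c != target and
                  ((positions[c][0] - tx) ** 2 +
                   (positions[c][1] - ty) ** 2) ** 0.5 <= radius]
    # the VR at each time step is the mean of the neighbouring channels
    vr = [sum(signals[c][t] for c in neighbours) / len(neighbours)
          for t in range(len(signals[target]))]
    return [s - v for s, v in zip(signals[target], vr)]
```

Because EMG contamination is largely common-mode across nearby electrodes, subtracting the neighbourhood mean cancels it while leaving channel-specific neural events (the spike in the example below) intact.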
NASA Technical Reports Server (NTRS)
Tumer, Kagan; Oza, Nikunj C.; Clancy, Daniel (Technical Monitor)
2001-01-01
Using an ensemble of classifiers instead of a single classifier has been shown to improve generalization performance in many pattern recognition problems. However, the extent of such improvement depends greatly on the amount of correlation among the errors of the base classifiers. Therefore, reducing those correlations while keeping the classifiers' performance levels high is an important area of research. In this article, we explore input decimation (ID), a method which selects feature subsets for their ability to discriminate among the classes and uses them to decouple the base classifiers. We provide a summary of the theoretical benefits of correlation reduction, along with results of our method on two underwater sonar data sets, three benchmarks from the Proben1/UCI repositories, and two synthetic data sets. The results indicate that input decimated ensembles (IDEs) outperform ensembles whose base classifiers use all the input features; randomly selected subsets of features; and features created using principal components analysis, on a wide range of domains.
Surveillance system and method having parameter estimation and operating mode partitioning
NASA Technical Reports Server (NTRS)
Bickford, Randall L. (Inventor)
2003-01-01
A system and method for monitoring an apparatus or process asset including partitioning an unpartitioned training data set into a plurality of training data subsets each having an operating mode associated thereto; creating a process model comprised of a plurality of process submodels each trained as a function of at least one of the training data subsets; acquiring a current set of observed signal data values from the asset; determining an operating mode of the asset for the current set of observed signal data values; selecting a process submodel from the process model as a function of the determined operating mode of the asset; calculating a current set of estimated signal data values from the selected process submodel for the determined operating mode; and outputting the calculated current set of estimated signal data values for providing asset surveillance and/or control.
Scalable amplification of strand subsets from chip-synthesized oligonucleotide libraries
Schmidt, Thorsten L.; Beliveau, Brian J.; Uca, Yavuz O.; Theilmann, Mark; Da Cruz, Felipe; Wu, Chao-Ting; Shih, William M.
2015-01-01
Synthetic oligonucleotides are the main cost factor for studies in DNA nanotechnology, genetics and synthetic biology, which all require thousands of such strands at high quality. Inexpensive chip-synthesized oligonucleotide libraries can contain hundreds of thousands of distinct sequences, but only at sub-femtomole quantities per strand. Here we present a selective oligonucleotide amplification method, based on three rounds of rolling-circle amplification, that produces nanomole amounts of single-stranded oligonucleotides per millilitre of reaction. In a multistep one-pot procedure, subsets of hundreds or thousands of single-stranded DNAs with different lengths can be selectively amplified and purified together. These oligonucleotides are used to fold several DNA nanostructures and as primary fluorescence in situ hybridization probes. The amplification cost is lower than that of other reported methods (typically around US$20 per nanomole of total oligonucleotides produced) and is dominated by the use of commercial enzymes. PMID:26567534
A new mosaic method for three-dimensional surface
NASA Astrophysics Data System (ADS)
Yuan, Yun; Zhu, Zhaokun; Ding, Yongjun
2011-08-01
Three-dimensional (3-D) data mosaicking is an indispensable step in surface measurement and digital terrain map generation. To address the problem of mosaicking locally unorganized point clouds with only coarse registration and many mismatched points, a new RANSAC-based mosaic method for 3-D surfaces is proposed. Each iteration of this method proceeds sequentially through random sampling with an additional shape constraint, data normalization of the point cloud, absolute orientation, data denormalization of the point cloud, inlier counting, etc. After N random sample trials the largest consensus set is selected, and finally the model is re-estimated using all the points in the selected subset. The minimal subset consists of three non-collinear points forming a triangle. The shape of the triangle is taken into account during random sampling to keep the sample selection reasonable. A new coordinate system transformation algorithm presented in this paper is used to avoid singularity. The whole rotation transformation between the two coordinate systems can be solved by two successive rotations expressed as Euler angle vectors, each with an explicit physical meaning. Both simulated and real data are used to prove the correctness and validity of this mosaic method. The method has better noise immunity due to its robust estimation property, and high accuracy, as the shape constraint is added to the random sampling and data normalization is added to the absolute orientation. It is applicable to high-precision measurement of three-dimensional surfaces as well as 3-D terrain mosaicking.
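The consensus loop described above is standard RANSAC. The sketch below substitutes a translation-only 2-D alignment (one-point minimal sample) for the paper's three-point absolute orientation, purely to keep the illustration short; the sample/score/re-estimate structure is the same:

```python
import random

def ransac_translation(src, dst, iters=100, tol=0.1, seed=0):
    """RANSAC for a pure-translation model between paired 2-D points:
    sample one correspondence, count inliers within `tol`, keep the
    largest consensus set, then re-estimate from all of its points."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        i = rng.randrange(len(src))  # minimal sample: one correspondence
        dx, dy = dst[i][0] - src[i][0], dst[i][1] - src[i][1]
        inliers = [j for j in range(len(src))
                   if abs(dst[j][0] - src[j][0] - dx) <= tol
                   and abs(dst[j][1] - src[j][1] - dy) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
            # re-estimate the model from the full consensus set
            best_model = (
                sum(dst[j][0] - src[j][0] for j in inliers) / len(inliers),
                sum(dst[j][1] - src[j][1] for j in inliers) / len(inliers))
    return best_model, best_inliers
```

Mismatched points simply fail the inlier test for the correct model, which is why RANSAC tolerates the large outlier fractions the abstract mentions.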
Algorithms for Learning Preferences for Sets of Objects
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri L.; desJardins, Marie; Eaton, Eric
2010-01-01
A method is being developed that provides for an artificial-intelligence system to learn a user's preferences for sets of objects and to thereafter automatically select subsets of objects according to those preferences. The method was originally intended to enable automated selection, from among large sets of images acquired by instruments aboard spacecraft, of image subsets considered to be scientifically valuable enough to justify use of limited communication resources for transmission to Earth. The method is also applicable to other sets of objects: examples of sets of objects considered in the development of the method include food menus, radio-station music playlists, and assortments of colored blocks for creating mosaics. The method does not require the user to perform the often-difficult task of quantitatively specifying preferences; instead, the user provides examples of preferred sets of objects. This method goes beyond related prior artificial-intelligence methods for learning which individual items are preferred by the user: this method supports a concept of setbased preferences, which include not only preferences for individual items but also preferences regarding types and degrees of diversity of items in a set. Consideration of diversity in this method involves recognition that members of a set may interact with each other in the sense that when considered together, they may be regarded as being complementary, redundant, or incompatible to various degrees. The effects of such interactions are loosely summarized in the term portfolio effect. The learning method relies on a preference representation language, denoted DD-PREF, to express set-based preferences. In DD-PREF, a preference is represented by a tuple that includes quality (depth) functions to estimate how desired a specific value is, weights for each feature preference, the desired diversity of feature values, and the relative importance of diversity versus depth. 
The system applies statistical concepts to estimate quantitative measures of the user's preferences from training examples (preferred subsets) specified by the user. Once preferences have been learned, the system uses those preferences to select preferred subsets from new sets. The method was found to be viable when tested in computational experiments on menus, music playlists, and rover images. Contemplated future development efforts include further tests on more diverse sets and development of a sub-method for (a) estimating the parameter that represents the relative importance of diversity versus depth, and (b) incorporating background knowledge about the nature of quality functions, which are special functions that specify depth preferences for features.
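The DD-PREF tuple described above can be sketched as a scoring function that trades per-feature depth against diversity. This is a minimal illustrative sketch, not the published DD-PREF semantics: the quality functions, the spread-based diversity measure, and the trade-off parameter alpha are all assumed stand-ins.

```python
# Minimal sketch of set-based preference scoring in the spirit of DD-PREF.
# Quality functions, weights, and the diversity measure are illustrative
# assumptions, not the published DD-PREF definitions.

def set_score(items, quality_fns, weights, target_diversity, alpha):
    """Score a candidate subset: alpha trades diversity against depth."""
    n_feats = len(quality_fns)
    # Depth: average per-feature quality, weighted by feature importance.
    depth = 0.0
    for f in range(n_feats):
        avg_q = sum(quality_fns[f](item[f]) for item in items) / len(items)
        depth += weights[f] * avg_q
    depth /= sum(weights)
    # Diversity: penalize deviation of the observed spread of feature
    # values from the desired spread.
    diversity = 0.0
    for f in range(n_feats):
        values = [item[f] for item in items]
        spread = max(values) - min(values)
        diversity += weights[f] * (1.0 - abs(spread - target_diversity[f]))
    diversity /= sum(weights)
    return alpha * diversity + (1.0 - alpha) * depth

# Toy example: items are (price, spiciness) dishes on a menu, scaled to [0, 1].
menu = [(0.2, 0.1), (0.3, 0.9), (0.8, 0.5)]
score = set_score(
    menu,
    quality_fns=[lambda v: 1.0 - v, lambda v: v],  # prefer cheap, prefer spicy
    weights=[1.0, 1.0],
    target_diversity=[0.5, 0.5],
    alpha=0.5,
)
print(round(score, 3))  # → 0.667
```

A learner in this style would fit the weights, targets, and alpha from the user's example subsets rather than hand-coding them as here.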
NASA Astrophysics Data System (ADS)
Khehra, Baljit Singh; Pharwaha, Amar Partap Singh
2017-04-01
Ductal carcinoma in situ (DCIS) is one type of breast cancer. Clusters of microcalcifications (MCCs) are symptoms of DCIS that are recognized by mammography. Selection of a robust feature vector is the process of selecting an optimal subset of features from the large number of available features in a given problem domain, after feature extraction and before any classification scheme. Feature selection reduces the feature space, which improves the performance of the classifier and decreases the computational burden imposed by using many features. Selecting an optimal subset of features from a large number of available features is a difficult search problem: for n features, there are 2^n possible subsets, so the problem belongs to the category of NP-hard problems. In this paper, an attempt is made to find the optimal subset of MCCs features from all possible subsets using a genetic algorithm (GA), particle swarm optimization (PSO) and biogeography-based optimization (BBO). For simulation, a total of 380 benign and malignant MCCs samples have been selected from mammogram images of the DDSM database. A total of 50 features extracted from the benign and malignant MCCs samples are used in this study. In these algorithms, the fitness function is the correct classification rate of the classifier; a support vector machine is used as the classifier. From the experimental results, it is observed that the PSO-based and BBO-based algorithms select an optimal subset of features for classifying MCCs as benign or malignant better than the GA-based algorithm.
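The 2^n search space described above is what the GA, PSO and BBO variants navigate. A minimal GA over feature bitmasks might look like the following sketch; the fitness here is a hypothetical stand-in for the paper's SVM correct-classification rate, and the "informative" feature indices are invented for illustration.

```python
# Illustrative sketch of GA-based search over the 2^n feature subsets.
# The fitness is a stand-in for a classifier's correct classification rate
# (the paper uses an SVM); a real fitness would wrap cross-validation.
import random

random.seed(0)
N_FEATURES = 10
INFORMATIVE = {1, 3, 7}          # hypothetical "useful" features

def fitness(mask):
    # Stand-in score: reward informative features, penalize subset size.
    hits = sum(1 for f in INFORMATIVE if mask[f])
    return hits - 0.05 * sum(mask)

def ga_select(pop_size=30, generations=40, mut_rate=0.1):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]           # elitist truncation
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_FEATURES)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mut_rate) for bit in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = ga_select()
print([f for f, bit in enumerate(best) if bit])
```

PSO and BBO replace the crossover/mutation loop with velocity updates or migration between habitats, but score candidate masks with the same fitness.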
Mujalli, Randa Oqab; de Oña, Juan
2011-10-01
This study describes a method for reducing the number of variables frequently considered in modeling the severity of traffic accidents. The method's efficiency is assessed by constructing Bayesian networks (BN). It is based on a two stage selection process. Several variable selection algorithms, commonly used in data mining, are applied in order to select subsets of variables. BNs are built using the selected subsets and their performance is compared with the original BN (with all the variables) using five indicators. The BNs that improve the indicators' values are further analyzed for identifying the most significant variables (accident type, age, atmospheric factors, gender, lighting, number of injured, and occupant involved). A new BN is built using these variables, where the results of the indicators indicate, in most of the cases, a statistically significant improvement with respect to the original BN. It is possible to reduce the number of variables used to model traffic accidents injury severity through BNs without reducing the performance of the model. The study provides the safety analysts a methodology that could be used to minimize the number of variables used in order to determine efficiently the injury severity of traffic accidents without reducing the performance of the model. Copyright © 2011 Elsevier Ltd. All rights reserved.
Discrete Biogeography Based Optimization for Feature Selection in Molecular Signatures.
Liu, Bo; Tian, Meihong; Zhang, Chunhua; Li, Xiangtao
2015-04-01
Biomarker discovery from high-dimensional data is a complex task in the development of efficient cancer diagnosis and classification. However, these data are usually redundant and noisy, and only a subset of them presents distinct profiles for different classes of samples. Thus, selecting highly discriminative genes from gene expression data has become increasingly interesting in the field of bioinformatics. In this paper, a discrete biogeography-based optimization is proposed to select a good subset of informative genes relevant to the classification. In the proposed algorithm, the Fisher-Markov selector is first used to choose a fixed number of genes. Second, to make biogeography-based optimization suitable for the feature selection problem, a discrete migration model and a discrete mutation model are proposed to balance exploration and exploitation. Discrete biogeography-based optimization, called DBBO, is then formed by integrating the discrete migration and discrete mutation models. Finally, DBBO is used for feature selection, with three classifiers evaluated under 10-fold cross-validation. To show the effectiveness and efficiency of the algorithm, it is tested on four breast cancer benchmark datasets. Compared with a genetic algorithm, particle swarm optimization, a differential evolution algorithm and hybrid biogeography-based optimization, experimental results demonstrate that the proposed method is better than, or at least comparable with, previous methods from the literature in terms of the quality of the solutions obtained. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Nishi, Kanae; Kewley-Port, Diane
2008-01-01
Purpose: Nishi and Kewley-Port (2007) trained Japanese listeners to perceive nine American English monophthongs and showed that a protocol using all nine vowels (fullset) produced better results than one using only the three more difficult vowels (subset). The present study extended the target population to Koreans and examined whether protocols combining the two stimulus sets would provide more effective training. Method: Three groups of five Korean listeners were trained on American English vowels for nine days using one of three protocols: fullset only, the first three days on subset then six days on fullset, or the first six days on fullset then three days on subset. Participants' performance was assessed by pre- and post-training tests, as well as by a mid-training test. Results: 1) Fullset training was also effective for Koreans; 2) no advantage was found for the two combined protocols over the fullset-only protocol; and 3) sustained "non-improvement" was observed for training using one of the combined protocols. Conclusions: In using subsets for training American English vowels, care should be taken not only in the selection of subset vowels, but also in the order in which the subsets are trained. PMID:18664694
Rough sets and Laplacian score based cost-sensitive feature selection
Yu, Shenglong
2018-01-01
Cost-sensitive feature selection learning is an important preprocessing step in machine learning and data mining. Most existing cost-sensitive feature selection algorithms are heuristic algorithms, which evaluate the importance of each feature individually and select features one by one; consequently, these algorithms do not consider the relationships among features. In this paper, we propose a new algorithm for minimal-cost feature selection, called rough sets and Laplacian score based cost-sensitive feature selection. The importance of each feature is evaluated by both rough sets and the Laplacian score. Compared with heuristic algorithms, the proposed algorithm takes the relationships among features into consideration through the locality preservation of the Laplacian score. We select a feature subset with maximal feature importance and minimal cost, evaluated in parallel, where the cost is given by three different distributions to simulate different applications. Different from existing cost-sensitive feature selection algorithms, our algorithm simultaneously selects a predetermined number of "good" features. Extensive experimental results show that the approach is efficient and able to effectively obtain the minimum-cost subset. In addition, the results of our method are more promising than those of other cost-sensitive feature selection algorithms. PMID:29912884
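The Laplacian-score half of the combined criterion can be sketched as follows. This is a simplified illustration: the rough-set importance term and the cost model of the paper are omitted, and a full similarity graph with a heat kernel stands in for the usual k-nearest-neighbor graph.

```python
# Compact sketch of the Laplacian score used to weigh feature importance.
# Lower score = the feature better preserves the local (graph) structure.
import numpy as np

def laplacian_scores(X, t=1.0):
    """X: (n_samples, n_features) array; returns one score per feature."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dist.
    S = np.exp(-d2 / t)                                   # heat-kernel weights
    D = np.diag(S.sum(axis=1))
    L = D - S                                             # graph Laplacian
    ones = np.ones(n)
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ D @ ones) / (ones @ D @ ones)      # D-weighted centering
        denom = f_t @ D @ f_t
        scores.append((f_t @ L @ f_t) / denom if denom > 1e-12 else np.inf)
    return np.array(scores)

# Feature 0 follows two well-separated clusters; feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.r_[np.zeros(10), 3 * np.ones(10)] + 0.01 * rng.standard_normal(20),
    rng.standard_normal(20),
])
s = laplacian_scores(X)
print(bool(s[0] < s[1]))  # → True: the structured feature scores lower
```

A cost-sensitive selector in the paper's spirit would then combine such scores with per-feature costs before picking the predetermined number of features.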
LS Bound based gene selection for DNA microarray data.
Zhou, Xin; Mao, K Z
2005-04-15
One problem with discriminant analysis of DNA microarray data is that each sample is represented by quite a large number of genes, and many of them are irrelevant, insignificant or redundant for the discriminant problem at hand. Methods for selecting important genes are, therefore, of much significance in microarray data analysis. In the present study, a new criterion, called the LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from the leave-one-out procedure of LS-SVMs (least squares support vector machines); as an upper bound on leave-one-out classification results, it reflects to some extent the generalization performance of gene subsets. We applied the LS Bound measure to gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and the Mahalanobis class separability measure, and with other published gene selection algorithms, including the Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method, while its computational complexity is at the level of the filter method. A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; and (3) a proof of inequality (9).
Core Hunter 3: flexible core subset selection.
De Beukelaer, Herman; Davenport, Guy F; Fack, Veerle
2018-05-31
Core collections provide genebank curators and plant breeders with a way to reduce the size of their collections and populations while minimizing the impact on genetic diversity and allele frequency. Many methods have been proposed to generate core collections, often using distance metrics to quantify the similarity of two accessions based on genetic marker data or phenotypic traits. Core Hunter is a multi-purpose core subset selection tool that uses local search algorithms to generate subsets relying on one or more metrics, including several distance metrics and allelic richness. In version 3 of Core Hunter (CH3) we have incorporated two new, improved methods for summarizing distances to quantify the diversity or representativeness of the core collection. A comparison of CH3 and Core Hunter 2 (CH2) showed that these new metrics can be effectively optimized with less complex algorithms than those used in CH2. CH3 is more effective at maximizing the improved diversity metric than CH2, still ensures a high average and minimum distance, and is faster for large datasets. Using CH3, a simple stochastic hill-climber is able to find highly diverse core collections, and the more advanced parallel tempering algorithm further increases the quality of the core and reduces variability across independent samples. We also evaluated the ability of CH3 to simultaneously maximize diversity and either representativeness or allelic richness, and compared the results with those of the GDOpt and SimEli methods. CH3 can sample cores as representative as those of GDOpt, which was specifically designed for this purpose, and is able to construct cores that are simultaneously more diverse and either more representative or higher in allelic richness than those obtained by SimEli. In version 3, Core Hunter has been updated to include two new core subset selection metrics that construct cores for representativeness or diversity, with improved performance.
It combines the strengths of other methods and outperforms them, as it (simultaneously) optimizes a variety of metrics. In addition, CH3 improves on CH2 by offering the option to use genetic marker data, phenotypic traits, or both, and by improved speed. Core Hunter 3 is freely available at http://www.corehunter.org.
Prediction of lysine ubiquitylation with ensemble classifier and feature selection.
Zhao, Xiaowei; Li, Xiangtao; Ma, Zhiqiang; Yin, Minghao
2011-01-01
Ubiquitylation is an important process of post-translational modification. Correct identification of protein lysine ubiquitylation sites is of fundamental importance to understand the molecular mechanism of lysine ubiquitylation in biological systems. This paper develops a novel computational method to effectively identify the lysine ubiquitylation sites based on the ensemble approach. In the proposed method, 468 ubiquitylation sites from 323 proteins retrieved from the Swiss-Prot database were encoded into feature vectors by using four kinds of protein sequences information. An effective feature selection method was then applied to extract informative feature subsets. After different feature subsets were obtained by setting different starting points in the search procedure, they were used to train multiple random forests classifiers and then aggregated into a consensus classifier by majority voting. Evaluated by jackknife tests and independent tests respectively, the accuracy of the proposed predictor reached 76.82% for the training dataset and 79.16% for the test dataset, indicating that this predictor is a useful tool to predict lysine ubiquitylation sites. Furthermore, site-specific feature analysis was performed and it was shown that ubiquitylation is intimately correlated with the features of its surrounding sites in addition to features derived from the lysine site itself. The feature selection method is available upon request.
Sensitivity Analysis for Atmospheric Infrared Sounder (AIRS) CO2 Retrieval
NASA Technical Reports Server (NTRS)
Gat, Ilana
2012-01-01
The Atmospheric Infrared Sounder (AIRS) is a thermal infrared sensor able to retrieve the daily atmospheric state globally for clear as well as partially cloudy fields-of-view. The AIRS spectrometer has 2378 channels sensing from 15.4 micrometers to 3.7 micrometers, of which a small subset in the 15 micrometer region has been selected, to date, for CO2 retrieval. To improve upon the current retrieval method, we extended the retrieval calculations to include a prior estimate component and developed a channel ranking system to optimize the channels and the number of channels used. The channel ranking system uses a mathematical formalism to rapidly process and assess the retrieval potential of large numbers of channels. Implementing this system, we identified a larger optimized subset of AIRS channels that can decrease retrieval errors and minimize the overall sensitivity to other interfering contributors, such as water vapor, ozone, and atmospheric temperature. This methodology selects channels globally by accounting for the latitudinal, longitudinal, and seasonal dependencies of the subset. The new methodology increases accuracy in AIRS CO2 as well as other retrievals and enables the extension of retrieved CO2 vertical profiles to altitudes ranging from the lower troposphere to the upper stratosphere. The extended retrieval method estimates CO2 vertical profiles using a maximum-likelihood estimation method. We use model data to demonstrate the beneficial impact of the extended retrieval method with the new channel ranking system on CO2 retrieval.
Classification of urine sediment based on convolution neural network
NASA Astrophysics Data System (ADS)
Pan, Jingjing; Jiang, Cunbo; Zhu, Tiantian
2018-04-01
By designing a new convolutional neural network framework, this paper relaxes the constraints of the original framework, which requires large numbers of training samples of identical size. The input images are moved and cropped to generate sub-graphs of the same size, and dropout is applied to the generated sub-graphs to increase sample diversity and prevent overfitting. Proper subsets are then randomly drawn from the sub-graph set such that every subset contains the same number of elements but no two subsets are identical. These proper subsets are used as input layers for the convolutional neural network. Through the convolution layers, pooling, the fully connected layer and the output layer, the classification loss rates of the test and training sets are obtained. In an experiment classifying red blood cells, white blood cells and calcium oxalate crystals in urine sediment, the classification accuracy reached 97% or more.
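The "move and crop" step can be sketched as a sliding window producing equal-size sub-graphs suitable for a fixed-input CNN. The window size and stride below are assumed values, not those of the paper.

```python
# Minimal sketch of the move-and-crop step: slide a fixed-size window over
# an input image to produce same-size sub-graphs for a fixed-input CNN.
import numpy as np

def crop_subgraphs(image, window=32, stride=16):
    """Return a list of (window x window) crops from a 2-D image array."""
    h, w = image.shape
    crops = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            crops.append(image[top:top + window, left:left + window])
    return crops

img = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
subs = crop_subgraphs(img)
print(len(subs), subs[0].shape)  # → 9 (32, 32)
```

In the paper's pipeline these crops would then be thinned with dropout and grouped into equal-size, pairwise-distinct subsets before training.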
Gene selection heuristic algorithm for nutrigenomics studies.
Valour, D; Hue, I; Grimard, B; Valour, B
2013-07-15
Large datasets from -omics studies need to be deeply investigated. The aim of this paper is to provide a new method (the LEM method) for the search of transcriptome and metabolome connections. The heuristic algorithm described here extends classical canonical correlation analysis (CCA) to a high number of variables (without regularization) and combines well-conditioning with fast computation in R. Reduced CCA models are summarized in PageRank matrices, the product of which gives a stochastic matrix that summarizes the self-avoiding walk covered by the algorithm. A homogeneous Markov process applied to this stochastic matrix then converges to the probabilities of interconnection between genes, providing a selection of disjoint subsets of genes. This is an alternative to regularized generalized CCA for the determination of blocks within the structure matrix. Each gene subset is thus linked to the whole metabolic or clinical dataset that represents the biological phenotype of interest. Moreover, this selection process meets the needs of biologists, who often require small sets of genes for further validation or extended phenotyping. The algorithm is shown to work efficiently on three published datasets, resulting in meaningfully broadened gene networks.
Oliveira, Roberta B; Pereira, Aledir S; Tavares, João Manuel R S
2017-10-01
The number of deaths worldwide due to melanoma has risen in recent times, in part because melanoma is the most aggressive type of skin cancer. Computational systems have been developed to assist dermatologists in early diagnosis of skin cancer, or even to monitor skin lesions. However, there still remains a challenge to improve classifiers for the diagnosis of such skin lesions. The main objective of this article is to evaluate different ensemble classification models based on input feature manipulation to diagnose skin lesions. Input feature manipulation processes are based on feature subset selections from shape properties, colour variation and texture analysis to generate diversity for the ensemble models. Three subset selection models are presented here: (1) a subset selection model based on specific feature groups, (2) a correlation-based subset selection model, and (3) a subset selection model based on feature selection algorithms. Each ensemble classification model is generated using an optimum-path forest classifier and integrated with a majority voting strategy. The proposed models were applied on a set of 1104 dermoscopic images using a cross-validation procedure. The best results were obtained by the first ensemble classification model that generates a feature subset ensemble based on specific feature groups. The skin lesion diagnosis computational system achieved 94.3% accuracy, 91.8% sensitivity and 96.7% specificity. The input feature manipulation process based on specific feature subsets generated the greatest diversity for the ensemble classification model with very promising results. Copyright © 2017 Elsevier B.V. All rights reserved.
Accuracy of direct genomic values in Holstein bulls and cows using subsets of SNP markers
2010-01-01
Background At the current price, the use of high-density single nucleotide polymorphisms (SNP) genotyping assays in genomic selection of dairy cattle is limited to applications involving elite sires and dams. The objective of this study was to evaluate the use of low-density assays to predict direct genomic value (DGV) on five milk production traits, an overall conformation trait, a survival index, and two profit index traits (APR, ASI). Methods Dense SNP genotypes were available for 42,576 SNP for 2,114 Holstein bulls and 510 cows. A subset of 1,847 bulls born between 1955 and 2004 was used as a training set to fit models with various sets of pre-selected SNP. A group of 297 bulls born between 2001 and 2004 and all cows born between 1992 and 2004 were used to evaluate the accuracy of DGV prediction. Ridge regression (RR) and partial least squares regression (PLSR) were used to derive prediction equations and to rank SNP based on the absolute value of the regression coefficients. Four alternative strategies were applied to select subset of SNP, namely: subsets of the highest ranked SNP for each individual trait, or a single subset of evenly spaced SNP, where SNP were selected based on their rank for ASI, APR or minor allele frequency within intervals of approximately equal length. Results RR and PLSR performed very similarly to predict DGV, with PLSR performing better for low-density assays and RR for higher-density SNP sets. When using all SNP, DGV predictions for production traits, which have a higher heritability, were more accurate (0.52-0.64) than for survival (0.19-0.20), which has a low heritability. The gain in accuracy using subsets that included the highest ranked SNP for each trait was marginal (5-6%) over a common set of evenly spaced SNP when at least 3,000 SNP were used. Subsets containing 3,000 SNP provided more than 90% of the accuracy that could be achieved with a high-density assay for cows, and 80% of the high-density assay for young bulls. 
Conclusions Accurate genomic evaluation of the broader bull and cow population can be achieved with a single genotyping assay containing ~3,000 to 5,000 evenly spaced SNP. PMID:20950478
Weigel, K A; de los Campos, G; González-Recio, O; Naya, H; Wu, X L; Long, N; Rosa, G J M; Gianola, D
2009-10-01
The objective of the present study was to assess the predictive ability of subsets of single nucleotide polymorphism (SNP) markers for development of low-cost, low-density genotyping assays in dairy cattle. Dense SNP genotypes of 4,703 Holstein bulls were provided by the USDA Agricultural Research Service. A subset of 3,305 bulls born from 1952 to 1998 was used to fit various models (training set), and a subset of 1,398 bulls born from 1999 to 2002 was used to evaluate their predictive ability (testing set). After editing, data included genotypes for 32,518 SNP and August 2003 and April 2008 predicted transmitting abilities (PTA) for lifetime net merit (LNM$), the latter resulting from progeny testing. The Bayesian least absolute shrinkage and selection operator method was used to regress August 2003 PTA on marker covariates in the training set to arrive at estimates of marker effects and direct genomic PTA. The coefficient of determination (R(2)) from regressing the April 2008 progeny test PTA of bulls in the testing set on their August 2003 direct genomic PTA was 0.375. Subsets of 300, 500, 750, 1,000, 1,250, 1,500, and 2,000 SNP were created by choosing equally spaced and highly ranked SNP, with the latter based on the absolute value of their estimated effects obtained from the training set. The SNP effects were re-estimated from the training set for each subset of SNP, and the 2008 progeny test PTA of bulls in the testing set were regressed on corresponding direct genomic PTA. The R(2) values for subsets of 300, 500, 750, 1,000, 1,250, 1,500, and 2,000 SNP with largest effects (evenly spaced SNP) were 0.184 (0.064), 0.236 (0.111), 0.269 (0.190), 0.289 (0.179), 0.307 (0.228), 0.313 (0.268), and 0.322 (0.291), respectively. 
These results indicate that a low-density assay comprising selected SNP could be a cost-effective alternative for selection decisions and that significant gains in predictive ability may be achieved by increasing the number of SNP allocated to such an assay from 300 or fewer to 1,000 or more.
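The two subsetting strategies compared above, the highest-ranked SNP by absolute estimated effect versus evenly spaced SNP along the panel, can be sketched as follows. The effect sizes here are synthetic stand-ins for the estimates a training set would provide.

```python
# Sketch of two SNP-subsetting strategies: top-k SNP by absolute estimated
# effect vs. k evenly spaced SNP. Effect sizes are synthetic stand-ins.
import random

random.seed(1)
N_SNP = 32518                      # panel size after editing, as in the study
effects = [random.gauss(0.0, 1.0) for _ in range(N_SNP)]

def top_k(effects, k):
    """Indices of the k SNP with the largest absolute effects."""
    order = sorted(range(len(effects)), key=lambda i: abs(effects[i]),
                   reverse=True)
    return sorted(order[:k])

def evenly_spaced(n_snp, k):
    """k indices spread uniformly across the panel."""
    step = n_snp / k
    return [int(i * step) for i in range(k)]

sel_top = top_k(effects, 1000)
sel_even = evenly_spaced(N_SNP, 1000)
print(len(sel_top), len(sel_even), sel_even[:3])  # → 1000 1000 [0, 32, 65]
```

In the study, marker effects were re-estimated on each selected subset before predicting the testing set; the sketch only shows the index selection itself.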
Torija, Antonio J; Ruiz, Diego P
2015-02-01
The prediction of environmental noise in urban environments requires the solution of a complex and non-linear problem, since there are complex relationships among the multitude of variables involved in the characterization and modelling of environmental noise and environmental-noise magnitudes. Moreover, the inclusion of the great spatial heterogeneity characteristic of urban environments seems to be essential in order to achieve an accurate environmental-noise prediction in cities. This problem is addressed in this paper, where a procedure based on feature-selection techniques and machine-learning regression methods is proposed and applied to this environmental problem. Three machine-learning regression methods, which are considered very robust in solving non-linear problems, are used to estimate the energy-equivalent sound-pressure level descriptor (LAeq). These three methods are: (i) multilayer perceptron (MLP), (ii) sequential minimal optimisation (SMO), and (iii) Gaussian processes for regression (GPR). In addition, because of the high number of input variables involved in environmental-noise modelling and estimation in urban environments, which make LAeq prediction models quite complex and costly in terms of time and resources for application to real situations, three different techniques are used to approach feature selection or data reduction. The feature-selection techniques used are: (i) correlation-based feature-subset selection (CFS), (ii) wrapper for feature-subset selection (WFS), and the data reduction technique is principal-component analysis (PCA). The subsequent analysis leads to a proposal of different schemes, depending on the needs regarding data collection and accuracy. The use of WFS as the feature-selection technique with the implementation of SMO or GPR as regression algorithm provides the best LAeq estimation (R(2)=0.94 and mean absolute error (MAE)=1.14-1.16 dB(A)). Copyright © 2014 Elsevier B.V. All rights reserved.
Optimisation algorithms for ECG data compression.
Haugland, D; Heber, J G; Husøy, J H
1997-07-01
The use of exact optimisation algorithms for compressing digital electrocardiograms (ECGs) is demonstrated. As opposed to traditional time-domain methods, which use heuristics to select a small subset of representative signal samples, the problem of selecting the subset is formulated in rigorous mathematical terms. This approach makes it possible to derive algorithms guaranteeing the smallest possible reconstruction error when a bounded selection of signal samples is interpolated. The proposed model resembles well-known network models and is solved by a cubic dynamic programming algorithm. When applied to standard test problems, the algorithm produces a compressed representation for which the distortion is about one-half of that obtained by traditional time-domain compression techniques at reasonable compression ratios. This illustrates that, in terms of the accuracy of decoded signals, existing time-domain heuristics for ECG compression may be far from what is theoretically achievable. The paper is an attempt to bridge this gap.
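The sample-subset selection problem formulated above admits an exact dynamic program: choose k retained samples (endpoints included) minimizing the squared error of the linearly interpolated reconstruction. The sketch below mirrors the O(n^2 k) "cubic" recurrence on a toy signal; it is an illustration of the problem class, not the paper's network-model formulation.

```python
# Exact selection of k retained samples minimizing linear-interpolation
# squared error, via an O(n^2 * k) dynamic program on a toy signal.
def interp_error(signal, i, j):
    """Squared error of linearly interpolating signal between samples i, j."""
    err = 0.0
    for t in range(i + 1, j):
        est = signal[i] + (signal[j] - signal[i]) * (t - i) / (j - i)
        err += (signal[t] - est) ** 2
    return err

def best_subset(signal, k):
    n = len(signal)
    INF = float("inf")
    cost = [[INF] * (k + 1) for _ in range(n)]   # cost[j][m]: best error using
    prev = [[-1] * (k + 1) for _ in range(n)]    # m samples ending at j
    cost[0][1] = 0.0
    for j in range(1, n):
        for m in range(2, k + 1):
            for i in range(j):
                c = cost[i][m - 1] + interp_error(signal, i, j)
                if c < cost[j][m]:
                    cost[j][m], prev[j][m] = c, i
    # Backtrack from the last sample, which must be retained.
    subset, j, m = [], n - 1, k
    while j >= 0 and m >= 1:
        subset.append(j)
        j, m = prev[j][m], m - 1
    return subset[::-1], cost[n - 1][k]

sig = [0, 1, 2, 3, 10, 3, 2, 1, 0]   # a spike forces its own retained sample
subset, err = best_subset(sig, 4)
print(subset, err)  # → [0, 3, 4, 8] 31.5
```

A time-domain heuristic would pick samples greedily; the dynamic program instead guarantees the smallest possible reconstruction error for the given budget k, which is exactly the gap the paper quantifies.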
Advances in metaheuristics for gene selection and classification of microarray data.
Duval, Béatrice; Hao, Jin-Kao
2010-01-01
Gene selection aims at identifying a (small) subset of informative genes from the initial data in order to obtain high predictive accuracy for classification. Gene selection can be considered as a combinatorial search problem and thus be conveniently handled with optimization methods. In this article, we summarize some recent developments of using metaheuristic-based methods within an embedded approach for gene selection. In particular, we put forward the importance and usefulness of integrating problem-specific knowledge into the search operators of such a method. To illustrate the point, we explain how ranking coefficients of a linear classifier such as support vector machine (SVM) can be profitably used to reinforce the search efficiency of Local Search and Evolutionary Search metaheuristic algorithms for gene selection and classification.
Zhang, Daqing; Xiao, Jianfeng; Zhou, Nannan; Luo, Xiaomin; Jiang, Hualiang; Chen, Kaixian
2015-01-01
The blood-brain barrier (BBB) is a highly complex physical barrier determining what substances are allowed to enter the brain. Support vector machine (SVM) is a kernel-based machine learning method that is widely used in QSAR studies. For a successful SVM model, the kernel parameters and the feature subset selection are the most important factors affecting prediction accuracy. In most studies, they are treated as two independent problems, but it has been proven that they can affect each other. We designed and implemented a genetic algorithm (GA) to optimize the kernel parameters and feature subset selection for SVM regression, and applied it to BBB penetration prediction. The results show that our GA/SVM model is more accurate than other currently available log BB models. Therefore, optimizing both the SVM parameters and the feature subset simultaneously with a genetic algorithm is a better approach than methods that treat the two problems separately. Analysis of our log BB model suggests that the carboxylic acid group, polar surface area (PSA)/hydrogen-bonding ability, lipophilicity, and molecular charge play important roles in BBB penetration. Among these properties, lipophilicity enhances BBB penetration while all the others are negatively correlated with it. PMID:26504797
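The joint encoding argued for above, one chromosome carrying both kernel parameters and the feature mask so that the two are co-adapted, can be sketched as follows. A simple (1+1) hill-climber stands in for the full GA loop, and the fitness is a toy surrogate for a cross-validated SVM score; the "optimal" parameter values and informative features are assumptions of the example.

```python
# Sketch of a joint chromosome: [log2(C), log2(gamma), feature bits], so SVM
# parameters and the feature subset are optimized together, not separately.
import random

random.seed(2)
N_FEATURES = 8

def decode(chrom):
    C = 2.0 ** chrom[0]
    gamma = 2.0 ** chrom[1]
    mask = [b > 0.5 for b in chrom[2:]]
    return C, gamma, mask

def toy_fitness(chrom):
    # Toy surrogate: pretend the best model has C ~ 4, gamma ~ 1/8, and
    # keeps the first three features while dropping the rest.
    _, _, mask = decode(chrom)
    good = sum(mask[:3]) - 0.2 * sum(mask[3:])
    return good - abs(chrom[0] - 2.0) - abs(chrom[1] + 3.0)

def mutate(chrom, sigma=0.5, flip=0.2):
    out = chrom[:]
    out[0] += random.gauss(0, sigma)       # perturb log2(C)
    out[1] += random.gauss(0, sigma)       # perturb log2(gamma)
    for i in range(2, len(out)):
        if random.random() < flip:
            out[i] = 1.0 - out[i]          # flip a feature bit
    return out

def hill_climb(steps=500):
    cur = [0.0, 0.0] + [float(random.randint(0, 1)) for _ in range(N_FEATURES)]
    for _ in range(steps):
        cand = mutate(cur)
        if toy_fitness(cand) > toy_fitness(cur):
            cur = cand
    return cur

best = hill_climb()
C, gamma, mask = decode(best)
print(round(C, 1), round(gamma, 3), mask[:3])
```

A real GA would maintain a population with crossover, and toy_fitness would be replaced by the cross-validated error of an SVM trained with the decoded C, gamma, and feature mask.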
Elyasigomari, V; Lee, D A; Screen, H R C; Shaheed, M H
2017-03-01
For each cancer type, only a few genes are informative. Due to the so-called 'curse of dimensionality' problem, the gene selection task remains a challenge. To overcome this problem, we propose a two-stage gene selection method called MRMR-COA-HS. In the first stage, the minimum redundancy and maximum relevance (MRMR) feature selection is used to select a subset of relevant genes. The selected genes are then fed into a wrapper setup that combines a new algorithm, COA-HS, using the support vector machine as a classifier. The method was applied to four microarray datasets, and the performance was assessed by the leave one out cross-validation method. Comparative performance assessment of the proposed method with other evolutionary algorithms suggested that the proposed algorithm significantly outperforms other methods in selecting a fewer number of genes while maintaining the highest classification accuracy. The functions of the selected genes were further investigated, and it was confirmed that the selected genes are biologically relevant to each cancer type. Copyright © 2017. Published by Elsevier Inc.
YamiPred: A Novel Evolutionary Method for Predicting Pre-miRNAs and Selecting Relevant Features.
Kleftogiannis, Dimitrios; Theofilatos, Konstantinos; Likothanassis, Spiros; Mavroudi, Seferina
2015-01-01
MicroRNAs (miRNAs) are small non-coding RNAs that play a significant role in gene regulation. Predicting miRNA genes is a challenging bioinformatics problem, and existing experimental and computational methods fail to deal with it effectively. We developed YamiPred, an embedded classification method that combines the efficiency and robustness of support vector machines (SVM) with genetic algorithms (GA) for feature selection and parameter optimization. YamiPred was tested on a new and realistic human dataset and was compared with state-of-the-art computational intelligence approaches and the prevalent SVM-based tools for miRNA prediction. Experimental results indicate that YamiPred outperforms existing approaches in terms of accuracy and the geometric mean of sensitivity and specificity. The embedded feature selection component selects a compact feature subset that contributes to the performance optimization. Further experimentation with this minimal feature subset achieved very high classification performance and revealed the minimum number of samples required for developing a robust predictor. YamiPred also confirmed the important role of commonly used features such as entropy and enthalpy, and uncovered the significance of newly introduced features, such as %A-U aggregate nucleotide frequency and positional entropy. The best model trained on human data has successfully predicted pre-miRNAs in other organisms, including viruses.
NASA Astrophysics Data System (ADS)
Ma, Yuan-Zhuo; Li, Hong-Shuang; Yao, Wei-Xing
2018-05-01
The evaluation of the probabilistic constraints in reliability-based design optimization (RBDO) problems has always been significant and challenging work, which strongly affects the performance of RBDO methods. This article deals with RBDO problems using a recently developed generalized subset simulation (GSS) method and a posterior approximation approach. The posterior approximation approach is used to transform all the probabilistic constraints into ordinary constraints as in deterministic optimization. The assessment of multiple failure probabilities required by the posterior approximation approach is achieved by GSS in a single run at all supporting points, which are selected by a proper experimental design scheme combining Sobol' sequences and Bucher's design. Sequentially, the transformed deterministic design optimization problem can be solved by optimization algorithms, for example, the sequential quadratic programming method. Three optimization problems are used to demonstrate the efficiency and accuracy of the proposed method.
Hernandez, Maria Eugenia; Martinez-Fong, Daniel; Perez-Tapia, Mayra; Estrada-Garcia, Iris; Estrada-Parra, Sergio; Pavón, Lenin
2010-02-01
To date, only the effect of short-term antidepressant treatment (<12 weeks) on neuroendocrinoimmune alterations in patients with major depressive disorder has been evaluated. Our objective was to determine the effect of a 52-week treatment with selective serotonin-reuptake inhibitors on lymphocyte subsets. The participants were thirty-one patients and twenty-two healthy volunteers. The final number of patients (10) resulted from selection and course, as detailed in the enrollment scheme. Methods used to psychiatrically evaluate the participants included the Mini-International Neuropsychiatric Interview, the Hamilton Depression Scale, and the Beck Depression Inventory. The lymphocyte subsets were measured in peripheral blood using flow cytometry. Before treatment, patients showed significantly increased counts of natural killer (NK) cells compared with healthy volunteers (312±29 versus 158±30 cells/mL), but no differences in the populations of T and B cells were found. The patients showed remission of depressive episodes after 20 weeks of treatment, along with an increase in NK cell and B cell populations, which remained elevated until the end of the study. At the 52nd week of treatment, patients showed increased counts of NK cells (396±101 cells/mL) and B cells (268±64 cells/mL) compared to healthy volunteers (NK, 159±30 cells/mL; B cells, 179±37 cells/mL). We conclude that long-term treatment with selective serotonin-reuptake inhibitors not only causes remission of depressive symptoms but also affects lymphocyte subset populations. The physiopathological consequence of these changes remains to be determined.
Updating estimates of low streamflow statistics to account for possible trends
NASA Astrophysics Data System (ADS)
Blum, A. G.; Archfield, S. A.; Hirsch, R. M.; Vogel, R. M.; Kiang, J. E.; Dudley, R. W.
2017-12-01
Given evidence of both increasing and decreasing trends in low flows in many streams, methods are needed to update estimators of the low flow statistics used in water resources management. One such metric is the 10-year annual low-flow statistic (7Q10), the annual minimum seven-day streamflow that is exceeded in nine out of ten years on average. Historical streamflow records may not be representative of current conditions at a site if environmental conditions are changing. We present a new approach to frequency estimation under nonstationary conditions that applies a stationary nonparametric quantile estimator to a subset of the annual minimum flow record. Monte Carlo simulation experiments were used to evaluate this approach across a range of trend and no-trend scenarios. Relative to the standard practice of using the entire available streamflow record, a nonparametric quantile estimator combined with selection of the most recent 30 or 50 years for 7Q10 estimation was found to improve accuracy and reduce bias. Benefits of data subset selection were greater for higher-magnitude trends and for annual minimum flow records with lower coefficients of variation. A nonparametric trend test approach for subset selection did not significantly improve upon always selecting the last 30 years of record. At 174 stream gages in the Chesapeake Bay region, 7Q10 estimators based on the most recent 30 years of flow record were compared to estimators based on the entire period of record. Given the availability of long records of low streamflow, using only a subset of the flow record (~30 years) can update 7Q10 estimators to better reflect current streamflow conditions.
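The core estimator can be sketched as follows, assuming a Weibull plotting-position quantile (one common nonparametric choice; the paper's exact estimator may differ), applied to only the most recent years of an annual minimum 7-day flow record. The declining-trend record is invented for illustration.

```python
def weibull_quantile(values, p):
    """Nonparametric quantile via the Weibull plotting position p_i = i/(n+1),
    with linear interpolation between order statistics."""
    xs = sorted(values)
    n = len(xs)
    h = p * (n + 1)          # fractional 1-based rank of the target quantile
    if h <= 1:
        return xs[0]
    if h >= n:
        return xs[-1]
    k = int(h)
    return xs[k - 1] + (h - k) * (xs[k] - xs[k - 1])

def estimate_7q10(annual_min_7day, subset_years=30):
    """7Q10 = the 10th-percentile annual minimum 7-day flow, here estimated
    from only the most recent `subset_years` of record."""
    return weibull_quantile(annual_min_7day[-subset_years:], 0.10)

# Hypothetical 60-year record with declining low flows (arbitrary units)
record = [100 - 0.8 * t for t in range(60)]
full = weibull_quantile(record, 0.10)     # whole-record estimate
recent = estimate_7q10(record, 30)        # recent-subset estimate
```

With a downward trend, the recent-subset estimate is lower than the whole-record estimate, reflecting current rather than historical conditions.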
Feature Selection for Ridge Regression with Provable Guarantees.
Paul, Saurabh; Drineas, Petros
2016-04-01
We introduce single-set spectral sparsification as a deterministic sampling-based feature selection technique for regularized least-squares classification, which is the classification analog to ridge regression. The method is unsupervised and gives worst-case guarantees on the generalization power of the classification function after feature selection, relative to the classification function obtained using all features. We also introduce leverage-score sampling as an unsupervised randomized feature selection method for ridge regression. We provide risk bounds for both single-set spectral sparsification and leverage-score sampling on ridge regression in the fixed design setting and show that the risk in the sampled space is comparable to the risk in the full-feature space. We perform experiments on synthetic and real-world data sets (a subset of the TechTC-300 data sets) to support our theory. Experimental results indicate that the proposed methods perform better than existing feature selection methods.
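Leverage-score sampling of features can be sketched as follows. This is a generic illustration, not the paper's exact procedure: scores are computed from the top-k right singular vectors, and features are drawn with probability proportional to their scores. The toy design matrix (three informative columns plus near-zero noise columns) is invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def leverage_scores(X, k):
    """Leverage scores of the columns (features) of X, from the
    top-k right singular vectors; the scores sum to 1."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k, :]                        # k x d
    return np.sum(Vk ** 2, axis=0) / k

def sample_features(X, k, r):
    """Keep up to r features, drawn with replacement with probability
    proportional to their leverage scores, then deduplicated."""
    p = leverage_scores(X, k)
    idx = rng.choice(X.shape[1], size=r, replace=True, p=p)
    return np.unique(idx)

# Toy design: 3 informative columns plus 17 near-zero noise columns
n, d = 100, 20
X = np.zeros((n, d))
X[:, :3] = rng.normal(size=(n, 3))
X[:, 3:] = 0.01 * rng.normal(size=(n, 17))
kept = sample_features(X, k=3, r=10)
```

Because the top-3 singular subspace is dominated by the informative columns, nearly all sampling probability concentrates on them.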
Zhan, Xiaobin; Jiang, Shulan; Yang, Yili; Liang, Jian; Shi, Tielin; Li, Xiwen
2015-09-18
This paper proposes an ultrasonic measurement system based on least squares support vector machines (LS-SVM) for inline measurement of particle concentrations in multicomponent suspensions. First, the ultrasonic signals are analyzed and processed, and the optimal feature subset that contributes to the best model performance is selected based on the importance of features. Second, the LS-SVM model is tuned, trained, and tested with different feature subsets to obtain the optimal model. In addition, a comparison is made between the partial least squares (PLS) model and the LS-SVM model. Finally, the optimal LS-SVM model with the optimal feature subset is applied to inline measurement of particle concentrations in the mixing process. The results show that the proposed method is reliable and accurate for inline measurement of particle concentrations in multicomponent suspensions, and the measurement accuracy is sufficiently high for industrial application. Furthermore, the proposed method is applicable to dynamic modeling of nonlinear systems and provides a feasible way to monitor industrial processes.
Generation of synthetic flood hydrographs by hydrological donors (SHYDONHY method)
NASA Astrophysics Data System (ADS)
Paquet, Emmanuel
2017-04-01
For the design of hydraulic infrastructures such as dams, a design hydrograph is required in most cases. Some of its features (e.g. peak value, duration, volume) corresponding to a given return period are computed using a wide range of methods: historical records, mono- or multivariate statistical analysis, stochastic simulation, etc. Various methods have then been proposed to construct design hydrographs having such characteristics, ranging from the traditional unit hydrograph to statistical methods (Yue et al., 2002). A new method to build design hydrographs (or more generally synthetic hydrographs) is introduced here, named SHYDONHY, a French acronym for "Synthèse d'HYdrogrammes par DONneurs HYdrologiques". It is based on an extensive database of 100,000 flood hydrographs recorded at hourly time-step at 1300 gauging stations in France and Switzerland, covering a wide range of catchment sizes and climatologies. For each station, an average of two hydrographs per year of record has been selected by a peak-over-threshold (POT) method with independence criteria (Lang et al., 1999). This sampling ensures that only hydrographs of intense floods are gathered in the dataset. For a given catchment, where few or no hydrographs are available at the outlet, a sub-set of 10 "donor stations" is selected within the complete dataset, considering several criteria: proximity, size, mean annual values and regimes for both total runoff and POT-selected floods. This sub-set of stations (and their corresponding flood hydrographs) allows one to: • Estimate a characteristic duration of flood hydrographs (e.g. the duration for which the discharge is above 50% of the peak value). • For a given duration (e.g. one day), estimate the average peak-to-volume ratio of floods. • For a given duration and peak-to-volume ratio, generate a synthetic reference hydrograph by combining appropriate hydrographs from the sub-set. 
• For a given daily discharge sequence, whether observed or generated for extreme flood estimation, generate a suitable synthetic hydrograph, also by combining selected hydrographs from the sub-set. The reliability of the method is assessed by performing a jackknife validation on the whole dataset of stations, in particular by reconstructing the hydrograph of the biggest flood of each station and comparing it to the actual one. Some applications are presented, e.g. the coupling of SHYDONHY with the SCHADEX method (Paquet et al., 2013) for the stochastic simulation of extreme reservoir levels in dams. References: Lang, M., Ouarda, T. B. M. J., & Bobée, B. (1999). Towards operational guidelines for over-threshold modeling. Journal of Hydrology, 225(3), 103-117. Paquet, E., Garavaglia, F., Garçon, R., & Gailhard, J. (2013). The SCHADEX method: A semi-continuous rainfall-runoff simulation for extreme flood estimation. Journal of Hydrology, 495, 23-37. Yue, S., Ouarda, T. B., Bobée, B., Legendre, P., & Bruneau, P. (2002). Approach for describing statistical properties of flood hydrograph. Journal of Hydrologic Engineering, 7(2), 147-153.
Pathways to Disease: The Biological Consequences of Social Adversity on Asthma in Minority Youth
2016-10-01
the microbiome, and telomere length and relate these biomarkers to the measured exposures to adversity and stress. The selection of and methods to...granted approval from the HRPO at the end of December 2015. After selecting a subset of our study population for evaluation, we experienced a second delay...of almost three months in setting up our account for clinical lab testing. Since we were able to prepare the selected samples for biomarker testing
A tool for selecting SNPs for association studies based on observed linkage disequilibrium patterns.
De La Vega, Francisco M; Isaac, Hadar I; Scafe, Charles R
2006-01-01
The design of genetic association studies using single-nucleotide polymorphisms (SNPs) requires the selection of subsets of the variants providing high statistical power at a reasonable cost. SNPs must be selected to maximize the probability that a causative mutation is in linkage disequilibrium (LD) with at least one marker genotyped in the study. The HapMap project performed a genome-wide survey of genetic variation with about a million SNPs typed in four populations, providing a rich resource to inform the design of association studies. A number of strategies have been proposed for the selection of SNPs based on observed LD, including construction of metric LD maps and the selection of haplotype tagging SNPs. Power calculations are important at the study design stage to ensure successful results. Integrating these methods and annotations can be challenging: the algorithms required to implement these methods are complex to deploy, and all the necessary data and annotations are deposited in disparate databases. Here, we present the SNPbrowser Software, a freely available tool to assist in the LD-based selection of markers for association studies. This stand-alone application provides fast query capabilities and swift visualization of SNPs, gene annotations, power, haplotype blocks, and LD map coordinates. Wizards implement several common SNP selection workflows including the selection of optimal subsets of SNPs (e.g. tagging SNPs). Selected SNPs are screened for their conversion potential to either TaqMan SNP Genotyping Assays or the SNPlex Genotyping System, two commercially available genotyping platforms, expediting the set-up of genetic studies with an increased probability of success.
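Tag-SNP selection is often implemented as a greedy cover over pairwise LD. The sketch below is a generic illustration of that idea, not SNPbrowser's algorithm, and the r² values are invented: SNPs {A, B, C} and {D, E} form two LD blocks, so two tags suffice.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy tag-SNP selection: repeatedly pick the SNP that tags the most
    not-yet-covered SNPs (itself plus every SNP with r2 >= threshold to it)."""
    snps = set(r2)
    covered, tags = set(), []

    def tagged_by(s):
        return {t for t in snps if t == s or r2[s].get(t, 0.0) >= threshold}

    while covered != snps:
        # sorted() gives a deterministic, alphabetical tie-break
        best = max(sorted(snps - covered), key=lambda s: len(tagged_by(s) - covered))
        tags.append(best)
        covered |= tagged_by(best)
    return tags

# Hypothetical pairwise r2 values for five SNPs
r2 = {"A": {"B": 0.9, "C": 0.85}, "B": {"A": 0.9, "C": 0.3},
      "C": {"A": 0.85, "B": 0.3}, "D": {"E": 0.95}, "E": {"D": 0.95}}
tags = greedy_tag_snps(r2)
```

Because every SNP tags at least itself, the loop always makes progress and terminates.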
Image segmentation via foreground and background semantic descriptors
NASA Astrophysics Data System (ADS)
Yuan, Ding; Qiang, Jingjing; Yin, Jihao
2017-09-01
In the field of image processing, it has been a challenging task to obtain a complete foreground that is not uniform in color or texture. Unlike other methods, which segment the image by only using low-level features, we present a segmentation framework, in which high-level visual features, such as semantic information, are used. First, the initial semantic labels were obtained by using the nonparametric method. Then, a subset of the training images, with a similar foreground to the input image, was selected. Consequently, the semantic labels could be further refined according to the subset. Finally, the input image was segmented by integrating the object affinity and refined semantic labels. State-of-the-art performance was achieved in experiments with the challenging MSRC 21 dataset.
NASA Technical Reports Server (NTRS)
Ruane, Alex C.; Mcdermid, Sonali P.
2017-01-01
We present the Representative Temperature and Precipitation (T&P) GCM Subsetting Approach developed within the Agricultural Model Intercomparison and Improvement Project (AgMIP) to select a practical subset of global climate models (GCMs) for regional integrated assessment of climate impacts when resource limitations do not permit the full ensemble of GCMs to be evaluated given the need to also focus on impacts sector and economics models. Subsetting inherently leads to a loss of information but can free up resources to explore important uncertainties in the integrated assessment that would otherwise be prohibitive. The Representative T&P GCM Subsetting Approach identifies five individual GCMs that capture a profile of the full ensemble of temperature and precipitation change within the growing season while maintaining information about the probability that basic classes of climate changes (relatively cool/wet, cool/dry, middle, hot/wet, and hot/dry) are projected in the full GCM ensemble. We demonstrate the selection methodology for maize impacts in Ames, Iowa, and discuss limitations and situations when additional information may be required to select representative GCMs. We then classify 29 GCMs over all land areas to identify regions and seasons with characteristic diagonal skewness related to surface moisture as well as extreme skewness connected to snow-albedo feedbacks and GCM uncertainty. Finally, we employ this basic approach to recognize that GCM projections demonstrate coherence across space, time, and greenhouse gas concentration pathway. The Representative T&P GCM Subsetting Approach provides a quantitative basis for the determination of useful GCM subsets, provides a practical and coherent approach where previous assessments selected solely on availability of scenarios, and may be extended for application to a range of scales and sectoral impacts.
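One simple stand-in interpretation of the five-class selection is sketched below: relative to the ensemble median temperature and precipitation change, pick the most extreme member of each quadrant plus the GCM nearest the joint median as "middle". The paper's actual classification differs (it works with growing-season percentiles), and the projection values here are invented.

```python
from math import hypot
from statistics import median

def subset_gcms(projections):
    """projections: {name: (dT, dP)}. Returns one representative GCM for each
    of: cool/wet, cool/dry, hot/wet, hot/dry (most extreme in its quadrant)
    and 'middle' (nearest the joint ensemble median)."""
    t_med = median(dt for dt, dp in projections.values())
    p_med = median(dp for dt, dp in projections.values())

    def d(name):
        dt, dp = projections[name]
        return hypot(dt - t_med, dp - p_med)

    quadrants = {"cool/wet": (-1, 1), "cool/dry": (-1, -1),
                 "hot/wet": (1, 1), "hot/dry": (1, -1)}
    picks = {}
    for label, (st, sp) in quadrants.items():
        members = [n for n, (dt, dp) in projections.items()
                   if st * (dt - t_med) > 0 and sp * (dp - p_med) > 0]
        if members:
            picks[label] = max(members, key=d)    # most extreme in its quadrant
    picks["middle"] = min(projections, key=d)     # closest to the joint median
    return picks

# Invented (dT in K, dP in %) projections for nine GCMs
projections = {"A": (1.0, 5.0), "B": (1.2, -4.0), "C": (2.0, 0.0),
               "D": (3.0, 6.0), "E": (3.2, -5.0), "F": (2.1, 0.5),
               "G": (1.5, 2.0), "H": (2.8, -1.0), "I": (2.0, -0.2)}
picks = subset_gcms(projections)
```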
Sample size determination for bibliographic retrieval studies
Yao, Xiaomei; Wilczynski, Nancy L; Walter, Stephen D; Haynes, R Brian
2008-01-01
Background Research for developing search strategies to retrieve high-quality clinical journal articles from MEDLINE is expensive and time-consuming. The objective of this study was to determine the minimal number of high-quality articles in a journal subset that would need to be hand-searched to update or create new MEDLINE search strategies for treatment, diagnosis, and prognosis studies. Methods The desired width of the 95% confidence intervals (W) for the lowest sensitivity among existing search strategies was used to calculate the number of high-quality articles needed to reliably update search strategies. New search strategies were derived in journal subsets formed by 2 approaches: random sampling of journals and top journals (having the most high-quality articles). The new strategies were tested in both the original large journal database and in a low-yielding journal (having few high-quality articles) subset. Results For treatment studies, if W was 10% or less for the lowest sensitivity among our existing search strategies, a subset of 15 randomly selected journals or 2 top journals was adequate for updating search strategies, based on each approach having at least 99 high-quality articles. The new strategies derived in 15 randomly selected journals or 2 top journals performed well in the original large journal database. Nevertheless, the new search strategies developed using the random sampling approach performed better than those developed using the top journal approach in a low-yielding journal subset. For studies of diagnosis and prognosis, no journal subset had enough high-quality articles to achieve the expected W (10%). Conclusion The approach of randomly sampling a small subset of journals that includes sufficient high-quality articles is an efficient way to update or create search strategies for high-quality articles on therapy in MEDLINE. The concentrations of diagnosis and prognosis articles are too low for this approach. PMID:18823538
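The sample-size calculation behind a desired confidence-interval width W can be sketched with the usual normal approximation for a proportion: W = 2·z·√(Se(1−Se)/n), solved for n. The sensitivity value below is illustrative only, not taken from the study.

```python
from math import ceil

def n_for_ci_width(sensitivity, width, z=1.96):
    """Number of high-quality articles needed so the 95% CI for a search
    strategy's sensitivity has total width `width` (normal approximation)."""
    return ceil((2 * z / width) ** 2 * sensitivity * (1 - sensitivity))

# Illustrative: an assumed lowest sensitivity of 0.95 and a desired width of 10%
n = n_for_ci_width(0.95, 0.10)
```

Note the familiar worst case at Se = 0.5, which requires the largest sample.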
USDA-ARS?s Scientific Manuscript database
Support Vector Machine (SVM) was used in the Genetic Algorithms (GA) process to select and classify a subset of hyperspectral image bands. The method was applied to fluorescence hyperspectral data for the detection of aflatoxin contamination in Aspergillus flavus infected single corn kernels. In the...
2012-01-01
Background An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data. Results We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We show an example of priors for pathway- and network-based information and illustrate our proposed method on both synthetic response data and by an application to cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information. Conclusions The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with a competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge. PMID:22578440
The depth estimation of 3D face from single 2D picture based on manifold learning constraints
NASA Astrophysics Data System (ADS)
Li, Xia; Yang, Yang; Xiong, Hailiang; Liu, Yunxia
2018-04-01
The estimation of depth is vitally important in 3D face reconstruction. In this paper, we propose a t-SNE method based on manifold learning constraints and introduce the K-means method to divide the original database into several subsets; selecting the optimal subset to reconstruct the 3D face depth information greatly reduces the computational complexity. First, we carry out the t-SNE operation to reduce the key feature points in each 3D face model from 1×249 to 1×2. Second, the K-means method is applied to divide the training 3D database into several subsets. Third, the Euclidean distance between the 83 feature points of the image to be estimated and the feature-point information before the dimension reduction of each cluster center is calculated. The category of the image to be estimated is judged according to the minimum Euclidean distance. Finally, the method of Kong D. is applied only to the optimal subset to estimate the depth values of the 83 feature points of the 2D face image, yielding the final depth estimation with greatly reduced computational complexity. Compared with the traditional traversal search estimation method, the proposed method reduces the error rate by 0.49, and the number of searches decreases with the change of category. To validate our approach, we use a public database to mimic the task of estimating the depth of face images from 2D images. The average number of searches decreased by 83.19%.
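The clustering and cluster-restricted search can be sketched as follows: plain k-means (with a deterministic farthest-point initialization for reproducibility), then a nearest-neighbor search only within the cluster whose center is closest to the query. The 2-D points stand in for the dimension-reduced (t-SNE) face features and are invented.

```python
import random

random.seed(1)

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(points, k):
    """Farthest-point initialization: deterministic and well spread out."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Lloyd-style k-means; returns the centers and each point's cluster index."""
    centers = init_centers(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, labels

def nearest_in_cluster(query, points, centers, labels):
    """Search only the cluster whose center is nearest the query,
    instead of scanning the whole database."""
    c = min(range(len(centers)), key=lambda i: dist2(query, centers[i]))
    subset = [p for p, l in zip(points, labels) if l == c]
    return min(subset, key=lambda p: dist2(query, p)), len(subset)

# Hypothetical database: three well-separated clusters of 50 points each
means = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(mx + random.gauss(0, 0.5), my + random.gauss(0, 0.5))
          for mx, my in means for _ in range(50)]
centers, labels = kmeans(points, k=3)
best, searched = nearest_in_cluster((10.2, 9.8), points, centers, labels)
```

Only about a third of the database is scanned per query, which is the source of the complexity reduction.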
Evaluation of subset matching methods and forms of covariate balance.
de Los Angeles Resa, María; Zubizarreta, José R
2016-11-30
This paper conducts a Monte Carlo simulation study to evaluate the performance of multivariate matching methods that select a subset of treatment and control observations. The matching methods studied are the widely used nearest neighbor matching with propensity score calipers and the more recently proposed methods, optimal matching of an optimally chosen subset and optimal cardinality matching. The main findings are: (i) covariate balance, as measured by differences in means, variance ratios, Kolmogorov-Smirnov distances, and cross-match test statistics, is better with cardinality matching because by construction it satisfies balance requirements; (ii) for given levels of covariate balance, the matched samples are larger with cardinality matching than with the other methods; (iii) in terms of covariate distances, optimal subset matching performs best; (iv) treatment effect estimates from cardinality matching have lower root-mean-square errors, provided strong requirements for balance, specifically, fine balance, or strength-k balance, plus close mean balance. In standard practice, a matched sample is considered to be balanced if the absolute differences in means of the covariates across treatment groups are smaller than 0.1 standard deviations. However, the simulation results suggest that stronger forms of balance should be pursued in order to remove systematic biases due to observed covariates when a difference in means treatment effect estimator is used. In particular, if the true outcome model is additive, then marginal distributions should be balanced, and if the true outcome model is additive with interactions, then low-dimensional joints should be balanced. Copyright © 2016 John Wiley & Sons, Ltd.
Vafaee Sharbaf, Fatemeh; Mosafer, Sara; Moattar, Mohammad Hossein
2016-06-01
This paper proposes an approach for gene selection in microarray data. The proposed approach consists of a primary filter stage using the Fisher criterion, which reduces the initial genes and hence the search space and time complexity. Then, a wrapper approach based on cellular learning automata (CLA) optimized with the ant colony method (ACO) is used to find the set of features that improves the classification accuracy. CLA is applied due to its capability to learn and model complicated relationships. The features selected in the last phase are evaluated using ROC curves, and the smallest yet most effective feature subset is determined. The classifiers evaluated in the proposed framework are k-nearest neighbor, support vector machine, and naïve Bayes. The proposed approach is evaluated on 4 microarray datasets. The evaluations confirm that the proposed approach can find the smallest subset of genes while approaching the maximum accuracy. Copyright © 2016 Elsevier Inc. All rights reserved.
Assembly of ordered contigs of cosmids selected with YACs of human chromosome 13
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fischer, S.G.; Cayanis, E.; Boukhgalter, B.
1994-06-01
The authors have developed an efficient method for assembling ordered cosmid contigs aligned to mega-YACs and midi-YACs (average insert sizes of 1.0 and 0.35 Mb, respectively) and used this general method to initiate high-resolution physical mapping of human chromosome 13 (Chr 13). Chr 13-enriched midi-YAC (mYAC) and mega-YAC (MYAC) sublibraries were obtained from corresponding CEPH total human YAC libraries by selecting colonies with inter-Alu PCR probes derived from Chr 13 monochromosomal cell hybrid DNA. These sublibraries were arrayed on filters at high density. In this approach, the MYAC 13 sublibrary is screened by hybridization with cytogenetically assigned Chr 13 DNA probes to select one or a small subset of MYACs. Inter-Alu PCR products from each mYAC are then hybridized to the MYAC and mYAC sublibraries to identify overlapping YACs and to an arrayed Chr 13-specific cosmid library to select corresponding cosmids. The set of selected cosmids, gridded on filters at high density, is hybridized with inter-Alu PCR products from each of the overlapping YACs to identify subsets of cosmids and also with riboprobes from each cosmid of the arrayed set ("cosmid matrix cross-hybridization"). From these data, cosmid contigs are assembled by a specifically designed computer program. Application of this method generates cosmid contigs spanning the length of a MYAC with few gaps. To provide a high-resolution map, ends of cosmids are sequenced at preselected sites to position densely spaced sequence-tagged sites. 33 refs., 7 figs., 1 tab.
A Selective Overview of Variable Selection in High Dimensional Feature Space
Fan, Jianqing
2010-01-01
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discovery. The traditional idea of best subset selection, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality, and have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of recent developments in the theory, methods, and implementations of high dimensional variable selection. The questions of what limits of dimensionality such methods can handle, what the role of penalty functions is, and what their statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its role in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
Alvarez, George A; Gill, Jonathan; Cavanagh, Patrick
2012-01-01
Previous studies have shown independent attentional selection of targets in the left and right visual hemifields during attentional tracking (Alvarez & Cavanagh, 2005) but not during a visual search (Luck, Hillyard, Mangun, & Gazzaniga, 1989). Here we tested whether multifocal spatial attention is the critical process that operates independently in the two hemifields. It is explicitly required in tracking (attend to a subset of object locations, suppress the others) but not in the standard visual search task (where all items are potential targets). We used a modified visual search task in which observers searched for a target within a subset of display items, where the subset was selected based on location (Experiments 1 and 3A) or based on a salient feature difference (Experiments 2 and 3B). The results show hemifield independence in this subset visual search task with location-based selection but not with feature-based selection; this effect cannot be explained by general difficulty (Experiment 4). Combined, these findings suggest that hemifield independence is a signature of multifocal spatial attention and highlight the need for cognitive and neural theories of attention to account for anatomical constraints on selection mechanisms. PMID:22637710
Multi-task feature selection in microarray data by binary integer programming.
Lan, Liang; Vucetic, Slobodan
2013-12-20
A major challenge in microarray classification is that the number of features is typically orders of magnitude larger than the number of examples. In this paper, we propose a novel feature filter algorithm to select the feature subset with maximal discriminative power and minimal redundancy by solving a quadratic objective function with binary integer constraints. To improve the computational efficiency, the binary integer constraints are relaxed and a low-rank approximation to the quadratic term is applied. The proposed feature selection algorithm was extended to solve multi-task microarray classification problems. We compared the single-task version of the proposed feature selection algorithm with 9 existing feature selection methods on 4 benchmark microarray data sets. The empirical results show that the proposed method achieved the most accurate predictions overall. We also evaluated the multi-task version of the proposed algorithm on 8 multi-task microarray datasets. The multi-task feature selection algorithm resulted in significantly higher accuracy than when using the single-task feature selection methods.
Local Feature Selection for Data Classification.
Armanfard, Narges; Reilly, James P; Komeili, Majid
2016-06-01
Typical feature selection methods choose an optimal global feature subset that is applied over all regions of the sample space. In contrast, in this paper we propose a novel localized feature selection (LFS) approach whereby each region of the sample space is associated with its own distinct optimized feature set, which may vary both in membership and size across the sample space. This allows the feature set to optimally adapt to local variations in the sample space. An associated method for measuring the similarities of a query datum to each of the respective classes is also proposed. The proposed method makes no assumptions about the underlying structure of the samples; hence the method is insensitive to the distribution of the data over the sample space. The method is efficiently formulated as a linear programming optimization problem. Furthermore, we demonstrate the method is robust against the over-fitting problem. Experimental results on eleven synthetic and real-world data sets demonstrate the viability of the formulation and the effectiveness of the proposed algorithm. In addition we show several examples where localized feature selection produces better results than a global feature selection method.
Missing value imputation in DNA microarrays based on conjugate gradient method.
Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh
2012-02-01
Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm based on the conjugate gradient (CG) method is proposed to estimate missing values. The k-nearest neighbors of the missing entry are first selected based on the absolute values of their Pearson correlation coefficients. Then a subset of genes among the k-nearest neighbors is labeled as the best similar ones. The CG algorithm, with this subset as its input, is then used to estimate the missing values. Our proposed CG-based algorithm (CGimpute) is evaluated on different data sets. The results are compared with the sequential local least squares (SLLSimpute), Bayesian principal component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute), and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The average normalized root mean squared error (NRMSE) and relative NRMSE on different data sets with various missing rates show that CGimpute outperforms the other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
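The neighbour-selection stage can be sketched as follows; the final conjugate-gradient refinement is replaced here by a plain average of the best-correlated neighbours, so this is only the first half of the described pipeline (the function name and the toy matrix are illustrative):

```python
import numpy as np

def impute_entry(M, i, j, k=3):
    """Estimate the missing entry M[i, j] (stored as np.nan).

    Ranks the other rows by |Pearson correlation| with row i over row i's
    observed columns, then averages the k best neighbours' values in
    column j (the described method refines this estimate with conjugate
    gradient; plain averaging keeps the sketch short).
    """
    obs = ~np.isnan(M[i])                        # columns observed in row i
    scores = []
    for r in range(M.shape[0]):
        if r == i or np.isnan(M[r, j]) or np.isnan(M[r, obs]).any():
            continue
        c = np.corrcoef(M[i, obs], M[r, obs])[0, 1]
        scores.append((abs(c), M[r, j]))
    scores.sort(reverse=True)                    # best-correlated rows first
    return float(np.mean([v for _, v in scores[:k]]))

M = np.array([[1.0, 2.0, 3.0, np.nan],
              [1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [5.0, 1.0, 0.0, 2.0]])
est = impute_entry(M, 0, 3, k=2)                 # averages the two best-correlated rows
print(est)
```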
Study of high speed complex number algorithms. [for determining antenna far field radiation patterns]
NASA Technical Reports Server (NTRS)
Heisler, R.
1981-01-01
A method of evaluating the radiation integral on the curved surface of a reflecting antenna is presented. A three-dimensional Fourier transform approach is used to generate a two-dimensional radiation cross-section along a planar cut at any angle phi through the far field pattern. Salient to the method is an algorithm for evaluating a subset of the total three-dimensional discrete Fourier transform results. The subset elements are selectively evaluated to yield data along a geometric plane of constant phi. The algorithm is extremely efficient, so that computation of the induced surface currents via the physical optics approximation dominates the computer time required to compute a radiation pattern. Application to paraboloid reflectors with off-focus feeds is presented, but the method is easily extended to offset antenna systems and reflectors of arbitrary shapes. Numerical results were computed for both gain and phase and are compared with other published work.
ERIC Educational Resources Information Center
Kaplan, S.; Heiligenstein, J.; West, S.; Busner, J.; Harder, D.; Dittmann, R.; Casat, C.; Wernicke, J. F.
2004-01-01
Objective: To compare the safety and efficacy of atomoxetine, a selective inhibitor of the norepinephrine transporter, versus placebo in Attention-Deficit/Hyperactivity Disorder (ADHD) patients with comorbid Oppositional Defiant Disorder (ODD). Methods: A subset analysis of 98 children from two identical, multi-site, double-blind, randomized,…
Model selection for logistic regression models
NASA Astrophysics Data System (ADS)
Duller, Christine
2012-09-01
Model selection for logistic regression models decides which of some given potential regressors have an effect and hence should be included in the final model. The second interesting question is whether a certain factor is heterogeneous among some subsets, i.e., whether the model should include a random intercept or not. In this paper these questions will be answered with classical as well as with Bayesian methods. The application shows some results of recent research projects in medicine and business administration.
Lin, Xiaohui; Li, Chao; Zhang, Yanhui; Su, Benzhe; Fan, Meng; Wei, Hai
2017-12-26
Feature selection is an important topic in bioinformatics. Defining informative features from complex high-dimensional biological data is critical in disease study, drug development, and other applications. Support vector machine-recursive feature elimination (SVM-RFE) is an efficient feature selection technique that has shown its power in many applications. It ranks the features according to the recursive feature deletion sequence based on SVM. In this study, we propose a method, SVM-RFE-OA, which combines the classification accuracy rate and the average overlapping ratio of the samples to determine the number of features to be selected from the feature rank of SVM-RFE. Meanwhile, to measure the feature weights more accurately, we propose a modified SVM-RFE-OA (M-SVM-RFE-OA) algorithm that temporarily screens out the samples lying in a heavily overlapping area in each iteration. The experiments on the eight public biological datasets show that the discriminative ability of the feature subset could be measured more accurately by combining the classification accuracy rate with the average overlapping degree of the samples, compared with using the classification accuracy rate alone, and that shielding the samples in the overlapping area made the calculation of the feature weights more stable and accurate. The methods proposed in this study can also be used with other RFE techniques to define potential biomarkers from big biological data.
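The core recursive feature elimination loop can be sketched with a ridge-regularised linear model standing in for the linear SVM (a hedged simplification; the actual method derives weights from an SVM and additionally uses the accuracy rate and sample-overlap ratio to choose how many top-ranked features to keep):

```python
import numpy as np

def rfe_rank(X, y, lam=1.0):
    """Rank features by recursive feature elimination.

    Repeatedly fits a ridge-regularised linear model (a stand-in for the
    linear SVM of SVM-RFE) and removes the feature with the smallest
    squared weight. Returns feature indices ordered from last-removed
    (kept longest, most useful) to first-removed.
    """
    remaining = list(range(X.shape[1]))
    removed = []
    while remaining:
        Xs = X[:, remaining]
        # ridge solution: w = (Xs^T Xs + lam I)^(-1) Xs^T y
        w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(len(remaining)), Xs.T @ y)
        removed.append(remaining.pop(int(np.argmin(w ** 2))))
    return removed[::-1]

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(100)
rank = rfe_rank(X, y)
print(rank)
```

On this toy regression the two truly informative features survive longest, so they head the ranking.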
Hashimoto, Shunji; Zushi, Yasuyuki; Fushimi, Akihiro; Takazawa, Yoshikatsu; Tanabe, Kiyoshi; Shibata, Yasuyuki
2013-03-22
We developed a method that selectively extracts a subset from comprehensive 2D gas chromatography (GC×GC) and high-resolution time-of-flight mass spectrometry (HRTOFMS) data to detect and identify trace levels of organohalogens. The data were obtained by measuring several environmental and biological samples, namely fly ash, soil, sediment, the atmosphere, and human urine. For global analysis, some samples were measured without purification. By using our novel software, the mass spectra of organochlorines or organobromines were then extracted into a data subset under high mass accuracy conditions that were approximately equivalent to a mass resolution of 6000 for some samples. Mass defect filtering as pre-screening for the data extraction was very effective in removing the mass spectra of hydrocarbons. Those results showed that data obtained with HRTOFMS are valuable for global analysis of organohalogens, and probably of other compounds if specific data extraction methods can be devised. Copyright © 2013 Elsevier B.V. All rights reserved.
Adaptive 4D PSI-Based Change Detection
NASA Astrophysics Data System (ADS)
Yang, Chia-Hsiang; Soergel, Uwe
2018-04-01
In a previous work, we proposed a PSI-based 4D change detection method to detect disappearing and emerging PS points (3D) along with their occurrence dates (1D). Such change points are usually caused by anthropic events, e.g., building construction in cities. This method first divides an entire SAR image stack into several subsets by a set of break dates. The PS points, which are selected based on their temporal coherences before or after a break date, are regarded as change candidates. Change points are then extracted from these candidates according to their change indices, which are modelled from the temporal coherences of the divided image subsets. Finally, we check the evolution of the change indices for each change point to detect the break date at which the change occurred. The experiment validated both the feasibility and applicability of our method. However, two questions still remain. First, the selection of the temporal coherence threshold involves a trade-off between the quality and quantity of PS points. This selection also affects the number of change points in a more complex way. Second, heuristic selection of change index thresholds is fragile and causes loss of change points. In this study, we adapt our approach to identify change points based on the statistical characteristics of the change indices rather than thresholding. The experiment validates this adaptive approach and shows an increase in detected change points compared with the previous version. In addition, we explore and discuss the optimal selection of the temporal coherence threshold.
Effects of Sample Selection on Estimates of Economic Impacts of Outdoor Recreation
Donald B.K. English
1997-01-01
Estimates of the economic impacts of recreation often come from spending data provided by a self-selected subset of a random sample of site visitors. The subset is frequently less than half the onsite sample. Biased vectors of per-trip spending and impact estimates can result if self-selection is related to spending patterns, and proper corrective procedures are not...
An ensemble method for extracting adverse drug events from social media.
Liu, Jing; Zhao, Songzheng; Zhang, Xiaodi
2016-06-01
Because adverse drug events (ADEs) are a serious health problem and a leading cause of death, it is of vital importance to identify them correctly and in a timely manner. With the development of Web 2.0, social media has become a large data source for information on ADEs. The objective of this study is to develop a relation extraction system that uses natural language processing techniques to effectively distinguish between ADEs and non-ADEs in informal text on social media. We develop a feature-based approach that utilizes various lexical, syntactic, and semantic features. Information-gain-based feature selection is performed to address high-dimensional features. Then, we evaluate the effectiveness of four well-known kernel-based approaches (i.e., subset tree kernel, tree kernel, shortest dependency path kernel, and all-paths graph kernel) and several ensembles that are generated by adopting different combination methods (i.e., majority voting, weighted averaging, and stacked generalization). All of the approaches are tested using three data sets: two health-related discussion forums and one general social networking site (i.e., Twitter). When investigating the contribution of each feature subset, the feature-based approach attains the best area under the receiver operating characteristics curve (AUC) values, which are 78.6%, 72.2%, and 79.2% on the three data sets. When individual methods are used, we attain the best AUC values of 82.1%, 73.2%, and 77.0% using the subset tree kernel, shortest dependency path kernel, and feature-based approach on the three data sets, respectively. When using classifier ensembles, we achieve the best AUC values of 84.5%, 77.3%, and 84.5% on the three data sets, outperforming the baselines. Our experimental results indicate that ADE extraction from social media can benefit from feature selection. 
With respect to the effectiveness of different feature subsets, lexical features and semantic features can enhance the ADE extraction capability. Kernel-based approaches, which can stay away from the feature sparsity issue, are qualified to address the ADE extraction problem. Combining different individual classifiers using suitable combination methods can further enhance the ADE extraction effectiveness. Copyright © 2016 Elsevier B.V. All rights reserved.
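Two of the simpler combination methods mentioned, majority voting and weighted averaging, can be sketched in a few lines (function names are illustrative; stacked generalization, which trains a meta-classifier on base-classifier outputs, is omitted):

```python
def majority_vote(label_lists):
    """Combine binary predictions by majority vote, one list per
    classifier; ties resolve to 0."""
    return [int(2 * sum(labels) > len(labels)) for labels in zip(*label_lists)]

def weighted_average(score_lists, weights, threshold=0.5):
    """Average per-classifier probabilities with the given weights,
    then threshold into binary labels."""
    total = sum(weights)
    return [int(sum(w * s for w, s in zip(weights, scores)) / total >= threshold)
            for scores in zip(*score_lists)]

# three classifiers, three instances
votes = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
# two classifiers with weights 2 and 1, two instances
avg = weighted_average([[0.9, 0.2], [0.4, 0.4]], [2, 1])
print(votes, avg)
```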
Bayesian Ensemble Trees (BET) for Clustering and Prediction in Heterogeneous Data
Duan, Leo L.; Clancy, John P.; Szczesniak, Rhonda D.
2016-01-01
We propose a novel “tree-averaging” model that utilizes the ensemble of classification and regression trees (CART). Each constituent tree is estimated with a subset of similar data. We treat this grouping of subsets as Bayesian Ensemble Trees (BET) and model them as a Dirichlet process. We show that BET determines the optimal number of trees by adapting to the data heterogeneity. Compared with other ensemble methods, BET requires far fewer trees and shows equivalent prediction accuracy using weighted averaging. Moreover, each tree in BET provides a variable selection criterion and an interpretation for its subset. We developed an efficient estimation procedure with improved estimation strategies in both CART and mixture models. We demonstrate these advantages of BET with simulations and illustrate the approach with a real-world data example involving regression of lung function measurements obtained from patients with cystic fibrosis. Supplemental materials are available online. PMID:27524872
An opinion formation based binary optimization approach for feature selection
NASA Astrophysics Data System (ADS)
Hamedmoghadam, Homayoun; Jalili, Mahdi; Yu, Xinghuo
2018-02-01
This paper proposes a novel optimization method based on opinion formation in complex network systems. The proposed optimization technique mimics human-human interaction mechanisms based on a mathematical model derived from the social sciences. Our method encodes a subset of selected features as the opinion of an artificial agent and simulates the opinion formation process among a population of agents to solve the feature selection problem. The agents interact through an underlying interaction network structure and reach consensus in their opinions while finding better solutions to the problem. A number of mechanisms are employed to avoid getting trapped in local minima. We compare the performance of the proposed method with a number of classical population-based optimization methods and a state-of-the-art opinion formation based method. Our experiments on a number of high-dimensional datasets show that the proposed algorithm outperforms the others.
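A toy version of such an opinion-formation optimiser, shown here on the onemax problem (maximise the number of selected bits), conveys the flavour: agents copy opinion bits from better-scoring agents (consensus pressure) with occasional random flips to escape local minima. The update rule below is an illustrative assumption, not the paper's mathematical model:

```python
import random

def opinion_search(fitness, d, n_agents=20, n_iter=200, seed=0):
    """Toy binary optimiser in the opinion-formation spirit: each agent
    holds a bit-vector 'opinion', repeatedly adopts one bit from the best
    of a few randomly met agents, and occasionally flips a random bit."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_agents)]
    for _ in range(n_iter):
        for agent in pop:
            leader = max(rng.sample(pop, 3), key=fitness)   # best of three met agents
            j = rng.randrange(d)
            agent[j] = leader[j]                            # adopt one opinion bit
            if rng.random() < 0.05:
                agent[rng.randrange(d)] ^= 1                # rare random flip
    return max(pop, key=fitness)

# onemax: fitness is simply the number of selected features
best = opinion_search(sum, 12)
print(sum(best))
```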
Kay, Jeremy N; De la Huerta, Irina; Kim, In-Jung; Zhang, Yifeng; Yamagata, Masahito; Chu, Monica W; Meister, Markus; Sanes, Joshua R
2011-05-25
The retina contains ganglion cells (RGCs) that respond selectively to objects moving in particular directions. Individual members of a group of ON-OFF direction-selective RGCs (ooDSGCs) detect stimuli moving in one of four directions: ventral, dorsal, nasal, or temporal. Despite this physiological diversity, little is known about subtype-specific differences in structure, molecular identity, and projections. To seek such differences, we characterized mouse transgenic lines that selectively mark ooDSGCs preferring ventral or nasal motion as well as a line that marks both ventral- and dorsal-preferring subsets. We then used the lines to identify cell surface molecules, including Cadherin 6, CollagenXXVα1, and Matrix metalloprotease 17, that are selectively expressed by distinct subsets of ooDSGCs. We also identify a neuropeptide, CART (cocaine- and amphetamine-regulated transcript), that distinguishes all ooDSGCs from other RGCs. Together, this panel of endogenous and transgenic markers distinguishes the four ooDSGC subsets. Patterns of molecular diversification occur before eye opening and are therefore experience independent. They may help to explain how the four subsets obtain distinct inputs. We also demonstrate differences among subsets in their dendritic patterns within the retina and their axonal projections to the brain. Differences in projections indicate that information about motion in different directions is sent to different destinations.
A statistical approach to selecting and confirming validation targets in -omics experiments
2012-01-01
Background Genomic technologies are, by their very nature, designed for hypothesis generation. In some cases, the hypotheses that are generated require that genome scientists confirm findings about specific genes or proteins. But one major advantage of high-throughput technology is that global genetic, genomic, transcriptomic, and proteomic behaviors can be observed. Manual confirmation of every statistically significant genomic result is prohibitively expensive. This has led researchers in genomics to adopt the strategy of confirming only a handful of the most statistically significant results, a small subset chosen for biological interest, or a small random subset. But there is no standard approach for selecting and quantitatively evaluating validation targets. Results Here we present a new statistical method and approach for statistically validating lists of significant results based on confirming only a small random sample. We apply our statistical method to show that the usual practice of confirming only the most statistically significant results does not statistically validate result lists. We analyze an extensively validated RNA-sequencing experiment to show that confirming a random subset can statistically validate entire lists of significant results. Finally, we analyze multiple publicly available microarray experiments to show that statistically validating random samples can both (i) provide evidence to confirm long gene lists and (ii) save thousands of dollars and hundreds of hours of labor over manual validation of each significant result. Conclusions For high-throughput -omics studies, statistical validation is a cost-effective and statistically valid approach to confirming lists of significant results. PMID:22738145
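One way to make the "confirm a random subset" idea concrete is a lower confidence bound on the list-wide confirmation rate computed from the sampled validations; the Wilson score bound below is a standard choice for this, not necessarily the paper's exact procedure:

```python
import math

def wilson_lower_bound(confirmed, n, z=1.96):
    """One-sided lower Wilson score bound on the list-wide confirmation
    rate, estimated from n randomly sampled validation targets of which
    `confirmed` were confirmed."""
    p = confirmed / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

# 18 of 20 randomly chosen significant results confirmed:
# the whole list's confirmation rate is plausibly at least ~0.70
lb = wilson_lower_bound(18, 20)
print(round(lb, 3))
```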
Voils, Corrine I.; Olsen, Maren K.; Williams, John W.; for the IMPACT Study Investigators
2008-01-01
Objective: To determine whether a subset of depressive symptoms could be identified to facilitate diagnosis of depression in older adults in primary care. Method: Secondary analysis was conducted on 898 participants aged 60 years or older with major depressive disorder and/or dysthymic disorder (according to DSM-IV criteria) who participated in the Improving Mood–Promoting Access to Collaborative Treatment (IMPACT) study, a multisite, randomized trial of collaborative care for depression (recruitment from July 1999 to August 2001). Linear regression was used to identify a core subset of depressive symptoms associated with decreased social, physical, and mental functioning. The sensitivity and specificity, adjusting for selection bias, were evaluated for these symptoms. The sensitivity and specificity of a second subset of 4 depressive symptoms previously validated in a midlife sample was also evaluated. Results: Psychomotor changes, fatigue, and suicidal ideation were associated with decreased functioning and served as the core set of symptoms. Adjusting for selection bias, the sensitivity of these 3 symptoms was 0.012 and specificity 0.994. The sensitivity of the 4 symptoms previously validated in a midlife sample was 0.019 and specificity was 0.997. Conclusion: We identified 3 depression symptoms that were highly specific for major depressive disorder in older adults. However, these symptoms and a previously identified subset were too insensitive for accurate diagnosis. Therefore, we recommend a full assessment of DSM-IV depression criteria for accurate diagnosis. PMID:18311416
Relevance popularity: A term event model based feature selection scheme for text classification.
Feng, Guozhong; An, Baiguo; Yang, Fengqin; Wang, Han; Zhang, Libiao
2017-01-01
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency with which a given term appears in each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiment results obtained with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms representative feature selection methods.
A Cancer Gene Selection Algorithm Based on the K-S Test and CFS.
Su, Qiang; Wang, Yina; Jiang, Xiaobing; Chen, Fuxue; Lu, Wen-Cong
2017-01-01
To address the challenging problem of selecting distinguished genes from cancer gene expression datasets, this paper presents a gene subset selection algorithm based on the Kolmogorov-Smirnov (K-S) test and correlation-based feature selection (CFS) principles. The algorithm first selects distinguished genes using the K-S test and then uses CFS to select genes from those selected by the K-S test. We adopted support vector machines (SVM) as the classification tool and used accuracy as the criterion to evaluate the performance of the classifiers on the selected gene subsets. We compared the proposed gene subset selection algorithm with the K-S test, CFS, minimum-redundancy maximum-relevancy (mRMR), and ReliefF algorithms. The average experimental results of these gene selection algorithms on 5 gene expression datasets demonstrate that, based on accuracy, the new K-S and CFS-based algorithm performs better than the K-S test, CFS, mRMR, and ReliefF algorithms, showing that it is an effective and promising approach.
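The two-stage structure (K-S screening, then correlation-based refinement) can be sketched as follows; the greedy merit function is a common CFS-style heuristic and the stage sizes are illustrative assumptions:

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def ks_then_cfs(X, y, n_ks=4, k=2):
    """Stage 1: keep the n_ks genes whose class-conditional distributions
    differ most by the K-S statistic. Stage 2: greedily grow a k-gene
    subset, rewarding class correlation and penalising correlation with
    already-chosen genes (a simplified CFS merit)."""
    stats = np.array([ks_stat(X[y == 0, j], X[y == 1, j]) for j in range(X.shape[1])])
    cand = list(np.argsort(-stats)[:n_ks])
    chosen = [cand.pop(0)]                       # highest K-S gene first
    while len(chosen) < k and cand:
        def merit(j):
            rel = abs(np.corrcoef(X[:, j], y)[0, 1])
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, c])[0, 1]) for c in chosen])
            return rel - red
        best = max(cand, key=merit)
        cand.remove(best)
        chosen.append(best)
    return [int(g) for g in chosen]

rng = np.random.default_rng(1)
n = 100
y = np.array([0] * n + [1] * n)
X = rng.standard_normal((2 * n, 6))
X[n:, 0] += 3.0                                          # strongly shifted gene
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(2 * n)    # redundant copy of it
X[n:, 2] += 2.0                                          # independent shifted gene
chosen = ks_then_cfs(X, y)
print(chosen)
```

The redundancy penalty steers the second pick toward the independent gene rather than the near-duplicate of the first.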
Water quality parameter measurement using spectral signatures
NASA Technical Reports Server (NTRS)
White, P. E.
1973-01-01
Regression analysis is applied to the problem of measuring water quality parameters from remote sensing spectral signature data. The equations necessary to perform regression analysis are presented and methods of testing the strength and reliability of a regression are described. An efficient algorithm for selecting an optimal subset of the independent variables available for a regression is also presented.
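An all-subsets selection of regressors of the kind described can be sketched by exhaustive search; the residual-variance criterion used here is a simple stand-in, since the abstract does not specify the selection statistic:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, max_size=3):
    """Exhaustive all-subsets regression: fit ordinary least squares on
    every predictor subset up to max_size and keep the one minimising
    residual sum of squares per residual degree of freedom."""
    n, d = X.shape
    best, best_score = None, np.inf
    for size in range(1, max_size + 1):
        for cols in combinations(range(d), size):
            A = np.column_stack([np.ones(n), X[:, cols]])   # intercept + subset
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            score = np.sum((y - A @ beta) ** 2) / (n - size - 1)
            if score < best_score:
                best, best_score = cols, score
    return best

rng = np.random.default_rng(3)
X = rng.standard_normal((80, 5))                 # five candidate predictors
y = 2.0 * X[:, 0] - X[:, 3] + 0.2 * rng.standard_normal(80)
sel = best_subset(X, y)
print(sel)
```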
An Adaptive Genetic Association Test Using Double Kernel Machines.
Zhan, Xiang; Epstein, Michael P; Ghosh, Debashis
2015-10-01
Recently, gene set-based approaches have become very popular in gene expression profiling studies for assessing how genetic variants are related to disease outcomes. Since most genes are not differentially expressed, existing pathway tests considering all genes within a pathway suffer from considerable noise and power loss. Moreover, for a differentially expressed pathway, it is of interest to select important genes that drive the effect of the pathway. In this article, we propose an adaptive association test using double kernel machines (DKM), which can both select important genes within the pathway as well as test for the overall genetic pathway effect. This DKM procedure first uses the garrote kernel machines (GKM) test for the purposes of subset selection and then the least squares kernel machine (LSKM) test for testing the effect of the subset of genes. An appealing feature of the kernel machine framework is that it can provide a flexible and unified method for multi-dimensional modeling of the genetic pathway effect allowing for both parametric and nonparametric components. This DKM approach is illustrated with application to simulated data as well as to data from a neuroimaging genetics study.
Predictive equations for the estimation of body size in seals and sea lions (Carnivora: Pinnipedia)
Churchill, Morgan; Clementz, Mark T; Kohno, Naoki
2014-01-01
Body size plays an important role in pinniped ecology and life history. However, body size data is often absent for historical, archaeological, and fossil specimens. To estimate the body size of pinnipeds (seals, sea lions, and walruses) for today and the past, we used 14 commonly preserved cranial measurements to develop sets of single variable and multivariate predictive equations for pinniped body mass and total length. Principal components analysis (PCA) was used to test whether separate family specific regressions were more appropriate than single predictive equations for Pinnipedia. The influence of phylogeny was tested with phylogenetic independent contrasts (PIC). The accuracy of these regressions was then assessed using a combination of coefficient of determination, percent prediction error, and standard error of estimation. Three different methods of multivariate analysis were examined: bidirectional stepwise model selection using Akaike information criteria; all-subsets model selection using Bayesian information criteria (BIC); and partial least squares regression. The PCA showed clear discrimination between Otariidae (fur seals and sea lions) and Phocidae (earless seals) for the 14 measurements, indicating the need for family-specific regression equations. The PIC analysis found that phylogeny had a minor influence on relationship between morphological variables and body size. The regressions for total length were more accurate than those for body mass, and equations specific to Otariidae were more accurate than those for Phocidae. Of the three multivariate methods, the all-subsets approach required the fewest number of variables to estimate body size accurately. We then used the single variable predictive equations and the all-subsets approach to estimate the body size of two recently extinct pinniped taxa, the Caribbean monk seal (Monachus tropicalis) and the Japanese sea lion (Zalophus japonicus). 
Body size estimates using single variable regressions generally under- or over-estimated body size; however, the all-subsets regression produced body size estimates that were close to the historically recorded body lengths for these two species. This indicates that the all-subsets regression equations developed in this study can estimate body size accurately. PMID:24916814
Ensemble of sparse classifiers for high-dimensional biological data.
Kim, Sunghan; Scalzo, Fabien; Telesca, Donatello; Hu, Xiao
2015-01-01
Biological data are often high in dimension while the number of samples is small. In such cases, classification performance can be improved by reducing the dimension of the data, which is referred to as feature selection. Recently, a novel feature selection method was proposed that utilises the sparsity of high-dimensional biological data, where a small subset of features accounts for most of the variance of the dataset. In this study we propose a new classification method for high-dimensional biological data, which performs both feature selection and classification within a single framework. Our proposed method utilises a sparse linear solution technique and the bootstrap aggregating algorithm. We tested its performance on four public mass spectrometry cancer datasets and compared it with two conventional classification techniques, Support Vector Machines and Adaptive Boosting. The results demonstrate that our proposed method performs more accurate classification across the various cancer datasets than the conventional classification techniques.
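The combination of per-round sparse fitting and bootstrap aggregating can be illustrated with an intentionally extreme sketch in which each bootstrap round keeps a single feature; this captures the sparse-selection and bagging ingredients but is not the paper's actual sparse linear solver:

```python
import numpy as np

def bagged_sparse_stumps(X, y, n_boot=25, seed=0):
    """Bootstrap-aggregated one-feature classifiers: each round picks the
    feature most correlated with the labels on a bootstrap sample (an
    extreme form of sparsity) and thresholds it midway between the class
    means; class-1 votes are averaged over rounds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                  # bootstrap resample
        Xb, yb = X[idx], y[idx]
        corr = [abs(np.corrcoef(Xb[:, j], yb)[0, 1]) for j in range(X.shape[1])]
        j = int(np.argmax(corr))
        m0, m1 = Xb[yb == 0, j].mean(), Xb[yb == 1, j].mean()
        models.append((j, (m0 + m1) / 2, 1.0 if m1 > m0 else -1.0))
    def predict(Xq):
        votes = np.mean([sign * (Xq[:, j] - thr) > 0 for j, thr, sign in models], axis=0)
        return (votes > 0.5).astype(int)
    return predict

rng = np.random.default_rng(4)
y = np.array([0] * 100 + [1] * 100)
X = rng.standard_normal((200, 5))
X[:, 2] += 2.0 * y                                   # class signal on feature 2
accuracy = (bagged_sparse_stumps(X, y)(X) == y).mean()
print(round(accuracy, 2))
```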
NASA Technical Reports Server (NTRS)
Lo, Yunnhon; Johnson, Stephen B.; Breckenridge, Jonathan T.
2014-01-01
This paper describes the quantitative application of the theory of System Health Management and its operational subset, Fault Management, to the selection of abort triggers for a human-rated launch vehicle, the United States' National Aeronautics and Space Administration's (NASA) Space Launch System (SLS). The results demonstrate the efficacy of the theory to assess the effectiveness of candidate failure detection and response mechanisms to protect humans from time-critical and severe hazards. The quantitative method was successfully used on the SLS to aid selection of its suite of abort triggers.
Logistic regression trees for initial selection of interesting loci in case-control studies
Nickolov, Radoslav Z; Milanov, Valentin B
2007-01-01
Modern genetic epidemiology faces the challenge of dealing with hundreds of thousands of genetic markers. The selection of a small initial subset of interesting markers for further investigation can greatly facilitate genetic studies. In this contribution we suggest the use of a logistic regression tree algorithm known as logistic tree with unbiased selection. Using the simulated data provided for Genetic Analysis Workshop 15, we show how this algorithm, with incorporation of multifactor dimensionality reduction method, can reduce an initial large pool of markers to a small set that includes the interesting markers with high probability. PMID:18466557
AbdelRahman, Samir E; Zhang, Mingyuan; Bray, Bruce E; Kawamoto, Kensaku
2014-05-27
The aim of this study was to propose an analytical approach to develop high-performing predictive models for congestive heart failure (CHF) readmission using an operational dataset with incomplete records and changing data over time. Our analytical approach involves three steps: pre-processing, systematic model development, and risk factor analysis. For pre-processing, variables that were absent in >50% of records were removed. Moreover, the dataset was divided into a validation dataset and derivation datasets, which were separated into three temporal subsets based on changes to the data over time. For systematic model development, using the different temporal datasets and the remaining explanatory variables, the models were developed by combining the use of various (i) statistical analyses to explore the relationships between the validation and the derivation datasets; (ii) adjustment methods for handling missing values; (iii) classifiers; (iv) feature selection methods; and (v) discretization methods. We then selected the best derivation dataset and the models with the highest predictive performance. For risk factor analysis, factors in the highest-performing predictive models were analyzed and ranked using (i) statistical analyses of the best derivation dataset, (ii) feature rankers, and (iii) a newly developed algorithm to categorize risk factors as being strong, regular, or weak. The analysis dataset consisted of 2,787 CHF hospitalizations at University of Utah Health Care from January 2003 to June 2013. In this study, we used the complete-case analysis and mean-based imputation adjustment methods; the wrapper subset feature selection method; and four ranking strategies based on information gain, gain ratio, symmetrical uncertainty, and wrapper subset feature evaluators.
The best-performing models resulted from the use of a complete-case analysis derivation dataset combined with the Class-Attribute Contingency Coefficient discretization method and a voting classifier which averaged the results of multi-nominal logistic regression and voting feature intervals classifiers. Of 42 final model risk factors, discharge disposition, discretized age, and indicators of anemia were the most significant. This model achieved a c-statistic of 86.8%. The proposed three-step analytical approach enhanced predictive model performance for CHF readmissions. It could potentially be leveraged to improve predictive model performance in other areas of clinical medicine.
Feature Extraction and Selection Strategies for Automated Target Recognition
NASA Technical Reports Server (NTRS)
Greene, W. Nicholas; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin
2010-01-01
Several feature extraction and selection methods for an existing automatic target recognition (ATR) system using JPL's Grayscale Optical Correlator (GOC) and Optimal Trade-Off Maximum Average Correlation Height (OT-MACH) filter were tested using MATLAB. The ATR system is composed of three stages: a cursory region-of-interest (ROI) search using the GOC and OT-MACH filter, a feature extraction and selection stage, and a final classification stage. Feature extraction and selection concerns transforming potential target data into more useful forms as well as selecting important subsets of that data which may aid in detection and classification. The strategies tested were built around two popular extraction methods: Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Performance was measured based on the classification accuracy and free-response receiver operating characteristic (FROC) output of a support vector machine (SVM) and a neural net (NN) classifier.
Feature extraction and selection strategies for automated target recognition
NASA Astrophysics Data System (ADS)
Greene, W. Nicholas; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin
2010-04-01
Several feature extraction and selection methods for an existing automatic target recognition (ATR) system using JPL's Grayscale Optical Correlator (GOC) and Optimal Trade-Off Maximum Average Correlation Height (OT-MACH) filter were tested using MATLAB. The ATR system is composed of three stages: a cursory region-of-interest (ROI) search using the GOC and OT-MACH filter, a feature extraction and selection stage, and a final classification stage. Feature extraction and selection concerns transforming potential target data into more useful forms as well as selecting important subsets of that data which may aid in detection and classification. The strategies tested were built around two popular extraction methods: Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Performance was measured based on the classification accuracy and free-response receiver operating characteristic (FROC) output of a support vector machine (SVM) and a neural net (NN) classifier.
A new approach to human microRNA target prediction using ensemble pruning and rotation forest.
Mousavi, Reza; Eftekhari, Mahdi; Haghighi, Mehdi Ghezelbash
2015-12-01
MicroRNAs (miRNAs) are small non-coding RNAs that have important functions in gene regulation. Since finding miRNA targets experimentally is costly and time-consuming, the use of machine learning methods is a growing research area for miRNA target prediction. In this paper, a new approach is proposed by using two popular ensemble strategies, i.e. Ensemble Pruning and Rotation Forest (EP-RTF), to predict human miRNA targets. For EP, the approach utilizes a Genetic Algorithm (GA). In other words, a subset of classifiers from the heterogeneous ensemble is first selected by GA. Next, the selected classifiers are trained based on the RTF method and then are combined using weighted majority voting. In addition to seeking a better subset of classifiers, the parameter of RTF is also optimized by GA. Findings of the present study confirm that the newly developed EP-RTF outperforms (in terms of classification accuracy, sensitivity, and specificity) the previously applied methods over four datasets in the field of human miRNA targets. Diversity-error diagrams reveal that the proposed ensemble approach constructs individual classifiers that are more accurate and usually more diverse than those of the other ensemble approaches. Given these experimental results, we highly recommend EP-RTF for improving the performance of miRNA target prediction.
Algorithm For Solution Of Subset-Regression Problems
NASA Technical Reports Server (NTRS)
Verhaegen, Michel
1991-01-01
Reliable and flexible algorithm for solution of subset-regression problem performs QR decomposition with new column-pivoting strategy, enables selection of subset directly from originally defined regression parameters. This feature, in combination with number of extensions, makes algorithm very flexible for use in analysis of subset-regression problems in which parameters have physical meanings. Also extended to enable joint processing of columns contaminated by noise with those free of noise, without using scaling techniques.
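The column-pivoting idea at the heart of such subset-regression algorithms can be illustrated with a small greedy sketch. This is not Verhaegen's algorithm itself, and the data are invented for illustration: at each step, the remaining columns are orthogonalized against the span of those already chosen, and the column with the largest remaining norm becomes the next pivot.

```python
# Toy sketch of subset selection via greedy column pivoting (the idea behind
# QR with column pivoting). Pure Python; no noise-handling extensions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm2(u):
    return dot(u, u)

def select_columns(cols, k):
    """Greedily select k column indices from `cols` (a list of equal-length columns)."""
    residual = [list(c) for c in cols]   # working copies that get orthogonalized
    chosen = []
    for _ in range(k):
        # Pivot: the remaining column with the largest residual norm.
        j = max((i for i in range(len(cols)) if i not in chosen),
                key=lambda i: norm2(residual[i]))
        chosen.append(j)
        q = residual[j]
        nq = norm2(q)
        if nq == 0:
            break
        # Orthogonalize every unchosen column against the new pivot column.
        for i in range(len(cols)):
            if i not in chosen:
                coef = dot(residual[i], q) / nq
                residual[i] = [a - coef * b for a, b in zip(residual[i], q)]
    return chosen

# Three regressors: two independent directions plus a near-copy of the first.
c0 = [1.0, 0.0, 0.0, 0.0]
c1 = [0.0, 1.0, 0.0, 0.0]
c2 = [0.99, 0.0, 0.1, 0.0]   # almost collinear with c0
print(select_columns([c0, c1, c2], 2))  # → [0, 1]
```

Note how the near-duplicate column `c2` loses almost all of its norm after orthogonalization against `c0`, so the pivoting rule avoids it, which is exactly why this strategy is useful when parameters have physical meanings and redundant regressors must be dropped.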
Mori, K
1986-02-19
To examine differential carbohydrate expression among different subsets of primary afferent fibers, several fluorescein-isothiocyanate-conjugated lectins were used in a histochemical study of the dorsal root ganglion (DRG) and spinal cord of the rabbit. The lectin Ulex europaeus agglutinin I specifically labeled a subset of DRG cells and primary afferent fibers which projected to the superficial laminae of the dorsal horn. These results suggest that specific carbohydrates containing L-fucosyl residues are expressed selectively in small-diameter primary afferent fibers which subserve nociception or thermoception.
Takakusagi, Yoichi; Kuramochi, Kouji; Takagi, Manami; Kusayanagi, Tomoe; Manita, Daisuke; Ozawa, Hiroko; Iwakiri, Kanako; Takakusagi, Kaori; Miyano, Yuka; Nakazaki, Atsuo; Kobayashi, Susumu; Sugawara, Fumio; Sakaguchi, Kengo
2008-11-15
Here, we report an efficient one-cycle affinity selection using a natural-protein or random-peptide T7 phage pool for identification of binding proteins or peptides specific for small molecules. The screening procedure involved a cuvette-type 27-MHz quartz-crystal microbalance (QCM) apparatus with introduction of a self-assembled monolayer (SAM) for specific small-molecule immobilization on the gold electrode surface of a sensor chip. Using this apparatus, we attempted an affinity selection of proteins or peptides against a synthetic ligand for FK506-binding protein (SLF) or irinotecan (Iri, CPT-11). An affinity selection using SLF-SAM and a natural-protein T7 phage pool successfully detected FK506-binding protein 12 (FKBP12)-displaying T7 phage after an interaction time of only 10 min. Extensive exploration of time-consuming wash and/or elution conditions together with several rounds of selection was not required. Furthermore, in the selection using a 15-mer random-peptide T7 phage pool and subsequent analysis utilizing receptor ligand contact (RELIC) software, a subset of SLF-selected peptides clearly pinpointed several amino-acid residues within the binding site of FKBP12. Likewise, a subset of Iri-selected peptides pinpointed part of the positive amino-acid region of the Iri-binding site of the well-known direct targets, acetylcholinesterase (AChE) and carboxylesterase (CE). Our findings demonstrate the effectiveness of this method and its general applicability for a wide range of small molecules.
Hydraulic head interpolation using ANFIS—model selection and sensitivity analysis
NASA Astrophysics Data System (ADS)
Kurtulus, Bedri; Flipo, Nicolas
2012-01-01
The aim of this study is to investigate the efficiency of ANFIS (adaptive neuro fuzzy inference system) for interpolating hydraulic head in a 40-km 2 agricultural watershed of the Seine basin (France). Inputs of ANFIS are Cartesian coordinates and the elevation of the ground. Hydraulic head was measured at 73 locations during a snapshot campaign in September 2009, which characterizes the low-water-flow regime in the aquifer unit. The dataset was then split into three subsets using a square-based selection method: a calibration one (55%), a training one (27%), and a test one (18%). First, a method is proposed to select the best ANFIS model, which corresponds to a sensitivity analysis of ANFIS to the type and number of membership functions (MF). Triangular, Gaussian, general bell, and spline-based MF are used with 2, 3, 4, and 5 MF per input node. Performance criteria on the test subset are used to select the 5 best ANFIS models among 16. Then each is used to interpolate the hydraulic head distribution on a (50×50)-m grid, which is compared to the soil elevation. The cells where the hydraulic head is higher than the soil elevation are counted as "error cells." The ANFIS model that exhibits the fewest "error cells" is selected as the best ANFIS model. The best model selection reveals that ANFIS models are very sensitive to the type and number of MF. Finally, a sensitivity analysis of the best ANFIS model with four triangular MF is performed on the interpolation grid, which shows that ANFIS remains stable to error propagation with a higher sensitivity to soil elevation.
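The "error cell" criterion used above is easy to sketch: on the interpolation grid, any cell whose interpolated head lies above the ground surface is physically implausible and is counted against the model. The two small grids below are invented for illustration.

```python
# Toy sketch of the "error cell" count: a cell is an error cell when the
# interpolated hydraulic head exceeds the soil elevation at that cell.
def count_error_cells(head, elevation):
    """Count grid cells where interpolated head is above the ground surface."""
    return sum(1
               for h_row, e_row in zip(head, elevation)
               for h, e in zip(h_row, e_row)
               if h > e)

# Hypothetical 2x2 grids of head and soil elevation (same units, e.g. metres).
head      = [[10.2,  9.8], [11.5, 10.0]]
elevation = [[10.0, 10.0], [11.0, 10.5]]
print(count_error_cells(head, elevation))  # → 2
```

Among candidate interpolation models, the one minimizing this count would be preferred, mirroring the selection step described in the abstract.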
Quantitative assessment model for gastric cancer screening
Chen, Kun; Yu, Wei-Ping; Song, Liang; Zhu, Yi-Min
2005-01-01
AIM: To set up a mathematical model for gastric cancer screening and to evaluate its function in mass screening for gastric cancer. METHODS: A case-control study was carried out on 66 patients and 198 normal people, and the risk and protective factors of gastric cancer were determined, including heavy manual work, foods such as small yellow-fin tuna, dried small shrimps, squills, crabs, mothers suffering from gastric diseases, spouse alive, use of refrigerators and hot food, etc. According to some principles and methods of probability and fuzzy mathematics, a quantitative assessment model was established as follows: first, we selected some factors significant in statistics, and calculated a weight coefficient for each one by two different methods; second, the population space was divided into a gastric cancer fuzzy subset and a non-gastric-cancer fuzzy subset, a mathematical model for each subset was established, and we obtained a mathematical expression of the attribute degree (AD). RESULTS: Based on the data of 63 patients and 693 normal people, the AD of each subject was calculated. Considering the sensitivity and specificity, the thresholds of the calculated AD values were set at 0.20 and 0.17, respectively. According to these thresholds, the sensitivity and specificity of the quantitative model were about 69% and 63%. Moreover, statistical testing showed that the identification outcomes of these two different calculation methods were identical (P>0.05). CONCLUSION: The validity of this method is satisfactory. It is convenient, feasible, economic and can be used to determine individual and population risks of gastric cancer. PMID:15655813
ERIC Educational Resources Information Center
Crocker, Linda M.; Mehrens, William A.
Four new methods of item analysis were used to select subsets of items which would yield measures of attitude change. The sample consisted of 263 students at Michigan State University who were tested on the Inventory of Beliefs as freshmen and retested on the same instrument as juniors. Item change scores and total change scores were computed for…
Sanfilippo, Antonio [Richland, WA; Calapristi, Augustin J [West Richland, WA; Crow, Vernon L [Richland, WA; Hetzler, Elizabeth G [Kennewick, WA; Turner, Alan E [Kennewick, WA
2009-12-22
Document clustering methods, document cluster label disambiguation methods, document clustering apparatuses, and articles of manufacture are described. In one aspect, a document clustering method includes providing a document set comprising a plurality of documents, providing a cluster comprising a subset of the documents of the document set, using a plurality of terms of the documents, providing a cluster label indicative of subject matter content of the documents of the cluster, wherein the cluster label comprises a plurality of word senses, and selecting one of the word senses of the cluster label.
Zheng, Chunli; Wang, Jinan; Liu, Jianling; Pei, Mengjie; Huang, Chao; Wang, Yonghua
2014-08-01
The term systems pharmacology describes a field of study that uses computational and experimental approaches to broaden the view of drug actions rooted in molecular interactions and advance the process of drug discovery. The aim of this work is to highlight the role that systems pharmacology plays across multi-target drug discovery from natural products for cardiovascular diseases (CVDs). Firstly, based on network pharmacology methods, we reconstructed the drug-target and target-target networks to determine the putative protein target set of multi-target drugs for CVDs treatment. Secondly, we integrated a compound dataset of natural products and then obtained a multi-target compound subset by a virtual-screening process. Thirdly, a drug-likeness evaluation was applied to find the ADME-favorable compounds in this subset. Finally, we conducted in vitro experiments to evaluate the reliability of the selected chemicals and targets. We found that four of the five randomly selected natural molecules can effectively act on the target set for CVDs, indicating the reasonability of our systems-based method. This strategy may serve as a new model for multi-target drug discovery of complex diseases.
Scott, J.C.
1990-01-01
Computer software was written to randomly select sites for a ground-water-quality sampling network. The software uses digital cartographic techniques and subroutines from a proprietary geographic information system. The report presents the approaches, computer software, and sample applications. It is often desirable to collect ground-water-quality samples from various areas in a study region that have different values of a spatial characteristic, such as land-use or hydrogeologic setting. A stratified network can be used for testing hypotheses about relations between spatial characteristics and water quality, or for calculating statistical descriptions of water-quality data that account for variations that correspond to the spatial characteristic. In the software described, a study region is subdivided into areal subsets that have a common spatial characteristic to stratify the population into several categories from which sampling sites are selected. Different numbers of sites may be selected from each category of areal subsets. A population of potential sampling sites may be defined by either specifying a fixed population of existing sites, or by preparing an equally spaced population of potential sites. In either case, each site is identified with a single category, depending on the value of the spatial characteristic of the areal subset in which the site is located. Sites are selected from one category at a time. One of two approaches may be used to select sites. Sites may be selected randomly, or the areal subsets in the category can be grouped into cells and sites selected randomly from each cell.
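The stratified site-selection idea described above can be sketched in a few lines. The category names and per-stratum counts below are hypothetical, and the sketch omits the GIS machinery and the equally-spaced-population option:

```python
import random

def stratified_sample(sites, n_per_category, seed=0):
    """Randomly select sites from each category (stratum) of a study region."""
    rng = random.Random(seed)   # fixed seed so the sampling network is reproducible
    network = {}
    for category, members in sites.items():
        k = min(n_per_category.get(category, 0), len(members))
        network[category] = rng.sample(members, k)
    return network

# Hypothetical strata: candidate wells grouped by land-use setting.
candidates = {
    "urban":        [f"U{i}" for i in range(10)],
    "agricultural": [f"A{i}" for i in range(20)],
    "forest":       [f"F{i}" for i in range(5)],
}
picked = stratified_sample(candidates, {"urban": 3, "agricultural": 5, "forest": 2})
print({c: len(s) for c, s in picked.items()})  # → {'urban': 3, 'agricultural': 5, 'forest': 2}
```

As in the report, each potential site belongs to exactly one category, and a different number of sites can be drawn from each category.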
Enhancement web proxy cache performance using Wrapper Feature Selection methods with NB and J48
NASA Astrophysics Data System (ADS)
Mahmoud Al-Qudah, Dua'a.; Funke Olanrewaju, Rashidah; Wong Azman, Amelia
2017-11-01
The web proxy cache technique reduces response time by storing a copy of pages between the client and server sides. If requested pages are cached in the proxy, there is no need to access the server. Due to the limited size and excessive cost of cache compared to other storage, a cache replacement algorithm is used to determine which page to evict when the cache is full. On the other hand, conventional replacement algorithms such as Least Recently Used (LRU), First In First Out (FIFO), Least Frequently Used (LFU), Randomized Policy, etc. may discard important pages just before use. Furthermore, conventional algorithms cannot be well optimized, since replacement requires intelligent decisions about which page to evict. Hence, most researchers propose integrating intelligent classifiers with the replacement algorithm to improve replacement performance. This research proposes using automated wrapper feature selection methods to choose the best subset of features that are relevant and influence classifier prediction accuracy. The results show that using wrapper feature selection methods, namely Best First (BFS), Incremental Wrapper Subset Selection (IWSS) embedded NB, and particle swarm optimization (PSO), reduces the number of features and has a good impact on reducing computation time. Using PSO enhances NB classifier accuracy by 1.1%, 0.43% and 0.22% over using NB with all features, using BFS, and using IWSS-embedded NB, respectively. PSO raises J48 accuracy by 0.03%, 1.91% and 0.04% over using the J48 classifier with all features, using IWSS-embedded NB, and using BFS, respectively. Meanwhile, using IWSS-embedded NB speeds up the NB and J48 classifiers much more than BFS and PSO, reducing the computation time of NB by 0.1383 and of J48 by 2.998.
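A generic wrapper-style forward-selection loop, which methods like BFS, IWSS and PSO refine in different ways, might look like the toy sketch below. The nearest-centroid classifier and the data are stand-ins, not the paper's NB/J48 setup: a wrapper scores each candidate feature subset by the accuracy of the classifier it wraps.

```python
import statistics

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted to `feats`."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [statistics.mean(r[f] for r in rows) for f in feats]
        pred = min(centroids,
                   key=lambda lab: sum((X[i][f] - c) ** 2
                                       for f, c in zip(feats, centroids[lab])))
        correct += (pred == y[i])
    return correct / len(X)

def wrapper_forward_select(X, y):
    """Add one feature at a time, keeping it only if the wrapped accuracy improves."""
    selected, best_acc = [], 0.0
    remaining = list(range(len(X[0])))
    improved = True
    while improved and remaining:
        improved = False
        scores = {f: loo_accuracy(X, y, selected + [f]) for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] > best_acc:
            selected.append(f_best)
            remaining.remove(f_best)
            best_acc = scores[f_best]
            improved = True
    return selected, best_acc

# Invented data: feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 5], [0.1, 1], [0.2, 9], [1.0, 2], [1.1, 7], [0.9, 4]]
y = [0, 0, 0, 1, 1, 1]
print(wrapper_forward_select(X, y))  # → ([0], 1.0)
```

The wrapper correctly keeps only the informative feature, which is the behavior that shrinks feature sets and computation time in the study above.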
An analysis of the least-squares problem for the DSN systematic pointing error model
NASA Technical Reports Server (NTRS)
Alvarez, L. S.
1991-01-01
A systematic pointing error model is used to calibrate antennas in the Deep Space Network. The least-squares problem is described and analyzed along with the solution methods used to determine the model's parameters. Specifically studied are the rank degeneracy problems resulting from beam pointing error measurement sets that incorporate inadequate sky coverage. A least-squares parameter subset selection method is described and its applicability to the systematic error modeling process is demonstrated on a Voyager 2 measurement distribution.
2011-01-01
Genome targeting methods enable cost-effective capture of specific subsets of the genome for sequencing. We present here an automated, highly scalable method for carrying out the Solution Hybrid Selection capture approach that provides a dramatic increase in scale and throughput of sequence-ready libraries produced. Significant process improvements and a series of in-process quality control checkpoints are also added. These process improvements can also be used in a manual version of the protocol. PMID:21205303
Qi, Miao; Wang, Ting; Yi, Yugen; Gao, Na; Kong, Jun; Wang, Jianzhong
2017-04-01
Feature selection has been regarded as an effective tool to help researchers understand the generating process of data. For mining the synthesis mechanism of microporous AlPOs, this paper proposes a novel feature selection method by joint ℓ2,1-norm and Fisher discrimination constraints (JNFDC). In order to obtain a more effective feature subset, the proposed method can be achieved in two steps. The first step is to rank the features according to sparse and discriminative constraints. The second step is to establish a predictive model with the ranked features, and select the most significant features in the light of their contribution to improving the predictive accuracy. To the best of our knowledge, JNFDC is the first work which employs the sparse representation theory to explore the synthesis mechanism of six kinds of pore rings. Numerical simulations demonstrate that our proposed method can select significant features affecting the specified structural property and improve the predictive accuracy. Moreover, comparison results show that JNFDC can obtain better predictive performances than some other state-of-the-art feature selection methods. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Surface Estimation, Variable Selection, and the Nonparametric Oracle Property.
Storlie, Curtis B; Bondell, Howard D; Reich, Brian J; Zhang, Hao Helen
2011-04-01
Variable selection for multivariate nonparametric regression is an important, yet challenging, problem due, in part, to the infinite dimensionality of the function space. An ideal selection procedure should be automatic, stable, easy to use, and have desirable asymptotic properties. In particular, we define a selection procedure to be nonparametric oracle (np-oracle) if it consistently selects the correct subset of predictors and at the same time estimates the smooth surface at the optimal nonparametric rate, as the sample size goes to infinity. In this paper, we propose a model selection procedure for nonparametric models, and explore the conditions under which the new method enjoys the aforementioned properties. Developed in the framework of smoothing spline ANOVA, our estimator is obtained via solving a regularization problem with a novel adaptive penalty on the sum of functional component norms. Theoretical properties of the new estimator are established. Additionally, numerous simulated and real examples further demonstrate that the new approach substantially outperforms other existing methods in the finite sample setting.
Surface Estimation, Variable Selection, and the Nonparametric Oracle Property
Storlie, Curtis B.; Bondell, Howard D.; Reich, Brian J.; Zhang, Hao Helen
2010-01-01
Variable selection for multivariate nonparametric regression is an important, yet challenging, problem due, in part, to the infinite dimensionality of the function space. An ideal selection procedure should be automatic, stable, easy to use, and have desirable asymptotic properties. In particular, we define a selection procedure to be nonparametric oracle (np-oracle) if it consistently selects the correct subset of predictors and at the same time estimates the smooth surface at the optimal nonparametric rate, as the sample size goes to infinity. In this paper, we propose a model selection procedure for nonparametric models, and explore the conditions under which the new method enjoys the aforementioned properties. Developed in the framework of smoothing spline ANOVA, our estimator is obtained via solving a regularization problem with a novel adaptive penalty on the sum of functional component norms. Theoretical properties of the new estimator are established. Additionally, numerous simulated and real examples further demonstrate that the new approach substantially outperforms other existing methods in the finite sample setting. PMID:21603586
Choice: 36 band feature selection software with applications to multispectral pattern recognition
NASA Technical Reports Server (NTRS)
Jones, W. C.
1973-01-01
Feature selection software was developed at the Earth Resources Laboratory that is capable of inputting up to 36 channels and selecting channel subsets according to several criteria based on divergence. One of the criteria used is compatible with the table look-up classifier requirements. The software indicates which channel subset best separates (based on average divergence) each class from all other classes. The software employs an exhaustive search technique, and computer time is not prohibitive. A typical task to select the best 4 of 22 channels for 12 classes takes 9 minutes on a Univac 1108 computer.
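The exhaustive subset search over channels can be sketched as follows. The divergence table is made up, and real per-channel divergences would be computed from class statistics rather than assumed to add across channels; the structure of the search, though, matches the exhaustive approach described above.

```python
from itertools import combinations

# Hypothetical per-channel divergence for each class pair (invented numbers;
# in practice these come from class covariance/mean statistics).
n_channels = 6
pairs = [(0, 1), (0, 2), (1, 2)]
div = {
    (0, 1): [0.9, 0.1, 0.2, 0.8, 0.1, 0.3],
    (0, 2): [0.1, 0.7, 0.2, 0.1, 0.9, 0.2],
    (1, 2): [0.2, 0.2, 0.8, 0.1, 0.1, 0.9],
}

def subset_score(channels):
    # Average divergence over class pairs, assuming per-channel divergences
    # simply add across channels (a toy simplification).
    per_pair = [sum(div[p][c] for c in channels) for p in pairs]
    return sum(per_pair) / len(per_pair)

# Exhaustive search: evaluate every 3-channel subset and keep the best.
best = max(combinations(range(n_channels), 3), key=subset_score)
print(best, round(subset_score(best), 2))  # → (0, 2, 5) 1.27
```

Choosing 4 of 22 channels means evaluating C(22, 4) = 7,315 subsets, which explains why exhaustive search was feasible even on 1970s hardware.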
Frank, Laurence E; Heiser, Willem J
2008-05-01
A set of features is the basis for the network representation of proximity data achieved by feature network models (FNMs). Features are binary variables that characterize the objects in an experiment, with some measure of proximity as response variable. Sometimes features are provided by theory and play an important role in the construction of the experimental conditions. In some research settings, the features are not known a priori. This paper shows how to generate features in this situation and how to select an adequate subset of features that takes into account a good compromise between model fit and model complexity, using a new version of least angle regression that restricts coefficients to be non-negative, called the Positive Lasso. It will be shown that features can be generated efficiently with Gray codes that are naturally linked to the FNMs. The model selection strategy makes use of the fact that FNM can be considered as univariate multiple regression model. A simulation study shows that the proposed strategy leads to satisfactory results if the number of objects is less than or equal to 22. If the number of objects is larger than 22, the number of features selected by our method exceeds the true number of features in some conditions.
Varvil-Weld, Lindsey; Mallett, Kimberly A.; Turrisi, Rob; Abar, Caitlin C.
2012-01-01
Objective: Previous research identified a high-risk subset of college students experiencing a disproportionate number of alcohol-related consequences at the end of their first year. With the goal of identifying pre-college predictors of membership in this high-risk subset, the present study used a prospective design to identify latent profiles of student-reported maternal and paternal parenting styles and alcohol-specific behaviors and to determine whether these profiles were associated with membership in the high-risk consequences subset. Method: A sample of randomly selected 370 incoming first-year students at a large public university reported on their mothers’ and fathers’ communication quality, monitoring, approval of alcohol use, and modeling of drinking behaviors and on consequences experienced across the first year of college. Results: Students in the high-risk subset comprised 15.5% of the sample but accounted for almost half (46.6%) of the total consequences reported by the entire sample. Latent profile analyses identified four parental profiles: positive pro-alcohol, positive anti-alcohol, negative mother, and negative father. Logistic regression analyses revealed that students in the negative-father profile were at greatest odds of being in the high-risk consequences subset at a follow-up assessment 1 year later, even after drinking at baseline was controlled for. Students in the positive pro-alcohol profile also were at increased odds of being in the high-risk subset, although this association was attenuated after baseline drinking was controlled for. Conclusions: These findings have important implications for the improvement of existing parent- and individual-based college student drinking interventions designed to reduce alcohol-related consequences. PMID:22456248
ERIC Educational Resources Information Center
Bagner, Daniel M.; Sheinkopf, Stephen J.; Miller-Loncar, Cynthia; LaGasse, Linda L.; Lester, Barry M.; Liu, Jing; Bauer, Charles R.; Shankaran, Seetha; Bada, Henrietta; Das, Abhik
2009-01-01
Objective: To examine the relationship between early parenting stress and later child behavior in a high-risk sample and measure the effect of drug exposure on the relationship between parenting stress and child behavior. Methods: A subset of child-caregiver dyads (n = 607) were selected from the Maternal Lifestyle Study (MLS), which is a large…
Álvarez, Aitor; Sierra, Basilio; Arruti, Andoni; López-Gil, Juan-Miguel; Garay-Vitoria, Nestor
2015-01-01
In this paper, a new supervised classification paradigm, called classifier subset selection for stacked generalization (CSS stacking), is presented to deal with speech emotion recognition. The new approach consists of an improvement of a bi-level multi-classifier system known as stacking generalization by means of an integration of an estimation of distribution algorithm (EDA) in the first layer to select the optimal subset from the standard base classifiers. The good performance of the proposed new paradigm was demonstrated over different configurations and datasets. First, several CSS stacking classifiers were constructed on the RekEmozio dataset, using some specific standard base classifiers and a total of 123 spectral, quality and prosodic features computed using in-house feature extraction algorithms. These initial CSS stacking classifiers were compared to other multi-classifier systems and the employed standard classifiers built on the same set of speech features. Then, new CSS stacking classifiers were built on RekEmozio using a different set of both acoustic parameters (extended version of the Geneva Minimalistic Acoustic Parameter Set (eGeMAPS)) and standard classifiers and employing the best meta-classifier of the initial experiments. The performance of these two CSS stacking classifiers was evaluated and compared. Finally, the new paradigm was tested on the well-known Berlin Emotional Speech database. We compared the performance of single, standard stacking and CSS stacking systems using the same parametrization of the second phase. All of the classifications were performed at the categorical level, including the six primary emotions plus the neutral one. PMID:26712757
An Adaptive Genetic Association Test Using Double Kernel Machines
Zhan, Xiang; Epstein, Michael P.; Ghosh, Debashis
2014-01-01
Recently, gene set-based approaches have become very popular in gene expression profiling studies for assessing how genetic variants are related to disease outcomes. Since most genes are not differentially expressed, existing pathway tests considering all genes within a pathway suffer from considerable noise and power loss. Moreover, for a differentially expressed pathway, it is of interest to select important genes that drive the effect of the pathway. In this article, we propose an adaptive association test using double kernel machines (DKM), which can both select important genes within the pathway as well as test for the overall genetic pathway effect. This DKM procedure first uses the garrote kernel machines (GKM) test for the purposes of subset selection and then the least squares kernel machine (LSKM) test for testing the effect of the subset of genes. An appealing feature of the kernel machine framework is that it can provide a flexible and unified method for multi-dimensional modeling of the genetic pathway effect allowing for both parametric and nonparametric components. This DKM approach is illustrated with application to simulated data as well as to data from a neuroimaging genetics study. PMID:26640602
A weight based genetic algorithm for selecting views
NASA Astrophysics Data System (ADS)
Talebian, Seyed H.; Kareem, Sameem A.
2013-03-01
A data warehouse is a technology designed for supporting decision making. A data warehouse is built by extracting large amounts of data from different operational systems, transforming it to a consistent form and loading it into the central repository. The type of queries in a data warehouse environment differs from those in operational systems. In contrast to operational systems, the analytical queries that are issued in data warehouses involve summarization of large volumes of data and therefore under normal circumstances take a long time to be answered. On the other hand, the results of these queries must be returned in a short time to enable managers to make decisions as quickly as possible. As a result, an essential need in this environment is improving query performance. One of the most popular methods for this task is utilizing pre-computed query results. In this method, whenever a new query is submitted by the user, instead of calculating the query on the fly over a large underlying database, the pre-computed results or views are used to answer it. Although the ideal option would be to pre-compute and save all possible views, in practice disk-space constraints and the overhead of view updates make this infeasible. Therefore, we need to select a subset of possible views to save on disk. The problem of selecting the right subset of views is considered an important challenge in data warehousing. In this paper we suggest a Weight Based Genetic Algorithm (WBGA) for solving the view selection problem with two objectives.
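A minimal genetic algorithm for view selection under a disk-space budget might look like the sketch below. The benefits, sizes, penalty scheme and GA parameters are all invented, and this is not the paper's WBGA; it only illustrates the chromosome-as-view-subset encoding such approaches share.

```python
import random

# GA sketch for view selection: each chromosome is a bit vector over candidate
# views; fitness rewards query benefit and penalizes exceeding the disk budget.
rng = random.Random(42)
benefit = [10, 7, 6, 4, 9, 3]      # hypothetical benefit of materializing each view
size    = [ 8, 5, 4, 3, 7, 2]      # hypothetical storage cost of each view
budget  = 15                       # disk-space budget

def fitness(bits):
    b = sum(w for w, g in zip(benefit, bits) if g)
    s = sum(w for w, g in zip(size, bits) if g)
    return b if s <= budget else b - 100 * (s - budget)  # heavy overflow penalty

def evolve(pop_size=30, generations=60):
    n = len(benefit)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # truncation selection (elitist)
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)             # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n)                  # occasional point mutation
            child[i] ^= rng.random() < 0.2
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

The returned subset respects the budget because any overflow is punished far more than any achievable benefit, a common way to keep GA search inside the feasible region.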
The effect of feature selection methods on computer-aided detection of masses in mammograms
NASA Astrophysics Data System (ADS)
Hupse, Rianne; Karssemeijer, Nico
2010-05-01
In computer-aided diagnosis (CAD) research, feature selection methods are often used to improve generalization performance of classifiers and shorten computation times. In an application that detects malignant masses in mammograms, we investigated the effect of using a selection criterion that is similar to the final performance measure we are optimizing, namely the mean sensitivity of the system in a predefined range of the free-response receiver operating characteristics (FROC). To obtain the generalization performance of the selected feature subsets, a cross validation procedure was performed on a dataset containing 351 abnormal and 7879 normal regions, each region providing a set of 71 mass features. The same number of noise features, not containing any information, were added to investigate the ability of the feature selection algorithms to distinguish between useful and non-useful features. It was found that significantly higher performances were obtained using feature sets selected by the general test statistic Wilks' lambda than using feature sets selected by the more specific FROC measure. Feature selection leads to better performance when compared to a system in which all features were used.
A ℓ2, 1 norm regularized multi-kernel learning for false positive reduction in Lung nodule CAD.
Cao, Peng; Liu, Xiaoli; Zhang, Jian; Li, Wei; Zhao, Dazhe; Huang, Min; Zaiane, Osmar
2017-03-01
The aim of this paper is to describe a novel algorithm for false positive reduction in lung nodule computer-aided detection (CAD). We describe a new CT lung CAD method that aims to detect solid nodules. Specifically, we propose a multi-kernel classifier with an ℓ2,1 norm regularizer for heterogeneous feature fusion and selection at the feature-subset level, and design two efficient strategies to optimize the kernel weights in the non-smooth ℓ2,1 regularized multiple kernel learning algorithm. The first optimization algorithm adapts a proximal gradient method for handling the ℓ2,1 norm of the kernel weights and uses an accelerated scheme based on FISTA; the second employs an iterative scheme based on an approximate gradient descent method. The results demonstrate that the FISTA-style accelerated proximal descent method is efficient for the ℓ2,1 norm formulation of multiple kernel learning, with a theoretical guarantee on the convergence rate. Moreover, the experimental results demonstrate the effectiveness of the proposed methods in terms of geometric mean (G-mean) and area under the ROC curve (AUC); the approach significantly outperforms the competing methods. The proposed approach exhibits notable advantages in both the heterogeneous feature-subset fusion and classification phases. Compared with feature-level and decision-level fusion strategies, the proposed ℓ2,1 norm multi-kernel learning algorithm accurately fuses the complementary and heterogeneous feature sets, and automatically prunes the irrelevant and redundant feature subsets to form a more discriminative feature set, leading to promising classification performance. Moreover, the proposed algorithm consistently outperforms comparable classification approaches in the literature. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
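The core building block of ℓ2,1-regularized updates such as the FISTA-style method described above is block soft-thresholding, the proximal operator of the group ℓ2 norm. The groups, weights, and toy problem below are illustrative, not the paper's formulation:

```python
import numpy as np

def prox_group(v, lam):
    """Block soft-thresholding: the proximal operator of lam * ||v||_2,
    the per-group building block of l2,1-regularized (FISTA-style) updates."""
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n <= lam else (1.0 - lam / n) * v

# Toy prox problem over two groups of "kernel weights":
# argmin_w 0.5*||w - z||^2 + lam * (||w_g1||_2 + ||w_g2||_2)
z = np.array([3.0, 4.0, 0.1, -0.2])
groups = [slice(0, 2), slice(2, 4)]
lam = 1.0
w = np.concatenate([prox_group(z[g], lam) for g in groups])
# The weak second group is driven exactly to zero, pruning that kernel.
```

This group-level zeroing is what lets the ℓ2,1 penalty discard entire feature subsets rather than individual coordinates.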
Ramadan, Ahmed; Boss, Connor; Choi, Jongeun; Peter Reeves, N; Cholewicki, Jacek; Popovich, John M; Radcliffe, Clark J
2018-07-01
Estimating many parameters of biomechanical systems with limited data may achieve good fit but may also increase 95% confidence intervals in parameter estimates. This results in poor identifiability in the estimation problem. Therefore, we propose a novel method to select sensitive biomechanical model parameters that should be estimated, while fixing the remaining parameters to values obtained from preliminary estimation. Our method relies on identifying the parameters to which the measurement output is most sensitive. The proposed method is based on the Fisher information matrix (FIM). It was compared against the nonlinear least absolute shrinkage and selection operator (LASSO) method to guide modelers on the pros and cons of our FIM method. We present an application identifying a biomechanical parametric model of a head position-tracking task for ten human subjects. Using measured data, our method (1) reduced model complexity by only requiring five out of twelve parameters to be estimated, (2) significantly reduced parameter 95% confidence intervals by up to 89% of the original confidence interval, (3) maintained goodness of fit measured by variance accounted for (VAF) at 82%, (4) reduced computation time, where our FIM method was 164 times faster than the LASSO method, and (5) selected similar sensitive parameters to the LASSO method, where three out of five selected sensitive parameters were shared by FIM and LASSO methods.
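A minimal sketch of FIM-based parameter ranking in the spirit of the abstract above, assuming a precomputed output-sensitivity (Jacobian) matrix and Gaussian measurement noise; the biomechanical model itself is not reproduced, and all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: theta is a preliminary parameter estimate and J is the
# sensitivity (Jacobian) of the measured output w.r.t. each parameter.
theta = np.array([2.0, 0.5, 1.0, 3.0])
J = rng.normal(size=(200, 4))
J[:, 2] *= 1e-3                # the output is nearly insensitive to parameter 2
sigma2 = 0.1                   # assumed measurement-noise variance

F = J.T @ J / sigma2                         # Fisher information matrix
crlb = np.sqrt(np.diag(np.linalg.inv(F)))    # Cramer-Rao lower bounds (std. dev.)
rel_uncert = crlb / np.abs(theta)
selected = np.argsort(rel_uncert)[:3]        # estimate only the 3 most identifiable
```

Parameters outside `selected` would then be fixed at their preliminary values, mirroring the paper's strategy of estimating only the sensitive subset.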
Generating global network structures by triad types
Ferligoj, Anuška; Žiberna, Aleš
2018-01-01
This paper addresses the question of whether one can generate networks with a given global structure (defined by selected blockmodels, i.e., cohesive, core-periphery, hierarchical, and transitivity), considering only different types of triads. Two methods are used to generate networks: (i) the newly proposed method of relocating links; and (ii) the Monte Carlo Multi Chain algorithm implemented in the ergm package in R. Most of the selected blockmodel types can be generated by considering all types of triads. The selection of only a subset of triads can improve the generated networks’ blockmodel structure. Yet, in the case of a hierarchical blockmodel without complete blocks on the diagonal, additional local structures are needed to achieve the desired global structure of generated networks. This shows that blockmodels can emerge based only on local processes that do not take attributes into account. PMID:29847563
Criteria to Extract High-Quality Protein Data Bank Subsets for Structure Users.
Carugo, Oliviero; Djinović-Carugo, Kristina
2016-01-01
It is often necessary to build subsets of the Protein Data Bank to extract structural trends and average values. For this purpose it is mandatory that the subsets are non-redundant and of high quality. The first problem can be solved relatively easily at the sequence level or at the structural level. The second, on the contrary, needs special attention. It is not sufficient, in fact, to consider only the crystallographic resolution; other features must be taken into account: the absence of strings of residues from the electron density maps and from the files deposited in the Protein Data Bank; the B-factor values; the appropriate validation of the structural models; the quality of the electron density maps, which is not uniform; and the temperature of the diffraction experiments. More stringent criteria produce smaller subsets, which can be enlarged with more tolerant selection criteria. The incessant growth of the Protein Data Bank, and especially of the number of high-resolution structures, is allowing the use of more stringent selection criteria, with a consequent improvement in the quality of the subsets of the Protein Data Bank.
Mala, S.; Latha, K.
2014-01-01
Activity recognition is needed in a variety of applications, for example, surveillance systems, patient monitoring, and human-computer interfaces. Feature selection plays an important role in activity recognition, data mining, and machine learning. For selecting a subset of features, Differential Evolution (DE), an efficient evolutionary optimizer, is used to find informative features from eye movements recorded using electrooculography (EOG). Many researchers use EOG signals in human-computer interaction and analyze eye movements with various computational intelligence methods. The proposed system involves analysis of EOG signals using clearness based features, minimum redundancy maximum relevance features, and Differential Evolution based features. This work concentrates on the DE-based feature selection algorithm in order to improve classification for accurate activity recognition. PMID:25574185
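As a hedged illustration of DE-driven feature selection (not the authors' exact encoding or fitness), a continuous DE population can be thresholded into binary feature masks and scored with a simple Fisher-ratio relevance criterion on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic recording: 100 samples x 6 features; only features 0 and 3
# carry class information (everything here is invented for illustration).
y = rng.integers(0, 2, 100)
X = rng.normal(size=(100, 6))
X[:, 0] += 2.0 * y
X[:, 3] -= 2.0 * y

def fisher(j):
    a, b = X[y == 0, j], X[y == 1, j]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-9)

def score(mask):
    """Additive per-feature relevance minus a small subset-size penalty."""
    if not mask.any():
        return -1.0
    return sum(fisher(j) for j in np.flatnonzero(mask)) - 0.05 * mask.sum()

def de_select(pop=15, gens=40, F=0.7, CR=0.9):
    P = rng.random((pop, 6))                 # continuous genomes in [0, 1]
    for _ in range(gens):
        for i in range(pop):
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            trial = np.where(rng.random(6) < CR, a + F * (b - c), P[i])
            if score(trial > 0.5) > score(P[i] > 0.5):   # greedy selection
                P[i] = trial
    best = max(P, key=lambda v: score(v > 0.5))
    return best > 0.5

mask = de_select()
```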
Deep convolutional neural network based antenna selection in multiple-input multiple-output system
NASA Astrophysics Data System (ADS)
Cai, Jiaxin; Li, Yan; Hu, Ying
2018-03-01
Antenna selection in wireless communication systems has attracted increasing attention due to the challenge of balancing communication performance against computational complexity in large-scale Multiple-Input Multiple-Output antenna systems. Recently, deep learning based methods have achieved promising performance in large-scale data processing and analysis in many application fields. This paper is the first attempt to introduce the deep learning technique into the field of Multiple-Input Multiple-Output antenna selection in wireless communications. First, the label of the attenuation-coefficient channel matrix is generated by minimizing the key performance indicator of the training antenna systems. Then, a deep convolutional neural network that explicitly exploits the massive latent cues in the attenuation coefficients is learned on the training antenna systems. Finally, we use the trained deep convolutional neural network to classify the channel-matrix labels of the test antennas and select the optimal antenna subset. Simulation results demonstrate that our method achieves better performance than state-of-the-art baselines for data-driven wireless antenna selection.
NASA Astrophysics Data System (ADS)
Vatutin, Eduard
2017-12-01
The article analyzes the effectiveness of heuristic methods with limited depth-first search for obtaining decisions in the test problem of finding the shortest path in a graph. It briefly describes the group of methods based on limiting the number of branches of the combinatorial search tree and limiting the depth of the analyzed subtree. A methodology for comparing experimental data to estimate solution quality is considered, based on computational experiments with samples of pseudo-random graphs with selected numbers of vertices and arcs, carried out on the BOINC platform. The experimental results identify the areas in which the selected subset of heuristic methods is preferable, depending on the size of the problem and the strength of the constraints. It is shown that the considered pair of methods is ineffective for the selected problem and significantly inferior, in solution quality, to the ant colony optimization method and its modification with combinatorial returns.
Which products are available for subsetting?
Atmospheric Science Data Center
2014-12-08
... users to create smaller files (subsets) of the original data by selecting desired parameters, parameter criterion, or latitude and ... fluxes, where the net flux is constrained to the global heat storage in netCDF format. Single Scanner Footprint TOA/Surface Fluxes ...
New TES Search and Subset Application
Atmospheric Science Data Center
2017-08-23
... Wednesday, September 19, 2012 The Atmospheric Science Data Center (ASDC) at NASA Langley Research Center in collaboration ... pleased to announce the release of the TES Search and Subset Web Application for select TES Level 2 products. Features of the Search and ...
Correlative feature analysis of FFDM images
NASA Astrophysics Data System (ADS)
Yuan, Yading; Giger, Maryellen L.; Li, Hui; Sennett, Charlene
2008-03-01
Identifying the corresponding image pair of a lesion is an essential step in combining information from different views of the lesion to improve the diagnostic ability of both radiologists and CAD systems. Because of the non-rigidity of the breasts and the 2D projective property of mammograms, this task is not trivial. In this study, we present a computerized framework that differentiates the corresponding images from different views of a lesion from non-corresponding ones. A dual-stage segmentation method, which employs an initial radial gradient index (RGI) based segmentation and an active contour model, was first applied to extract mass lesions from the surrounding tissues. Then various lesion features were automatically extracted from each of the two views of each lesion to quantify the characteristics of margin, shape, size, texture and context of the lesion, as well as its distance to the nipple. We employed a two-step method to select an effective subset of features, and combined it with a BANN to obtain a discriminant score, which yielded an estimate of the probability that the two images are of the same physical lesion. ROC analysis was used to evaluate the performance of the individual features and the selected feature subset in the task of distinguishing between corresponding and non-corresponding pairs. Using an FFDM database with 124 corresponding image pairs and 35 non-corresponding pairs, the distance feature yielded an AUC (area under the ROC curve) of 0.8 with leave-one-out evaluation by lesion, and the feature subset, which included the distance feature, lesion size and lesion contrast, yielded an AUC of 0.86. The improvement from using multiple features was statistically significant compared to single-feature performance (p < 0.001).
Combining markers with and without the limit of detection
Dong, Ting; Liu, Catherine Chunling; Petricoin, Emanuel F.; Tang, Liansheng Larry
2014-01-01
In this paper, we consider the combination of markers with and without the limit of detection (LOD). LOD is often encountered when measuring proteomic markers. Because of the limited detecting ability of an equipment or instrument, it is difficult to measure markers at a relatively low level. Suppose that after some monotonic transformation, the marker values approximately follow multivariate normal distributions. We propose to estimate distribution parameters while taking the LOD into account, and then combine markers using the results from the linear discriminant analysis. Our simulation results show that the ROC curve parameter estimates generated from the proposed method are much closer to the truth than simply using the linear discriminant analysis to combine markers without considering the LOD. In addition, we propose a procedure to select and combine a subset of markers when many candidate markers are available. The procedure based on the correlation among markers is different from a common understanding that a subset of the most accurate markers should be selected for the combination. The simulation studies show that the accuracy of a combined marker can be largely impacted by the correlation of marker measurements. Our methods are applied to a protein pathway dataset to combine proteomic biomarkers to distinguish cancer patients from non-cancer patients. PMID:24132938
Sparse Zero-Sum Games as Stable Functional Feature Selection
Sokolovska, Nataliya; Teytaud, Olivier; Rizkalla, Salwa; Clément, Karine; Zucker, Jean-Daniel
2015-01-01
In large-scale systems biology applications, features are structured in hidden functional categories whose predictive power is identical. Feature selection, therefore, can lead not only to a problem with a reduced dimensionality, but also reveal some knowledge on functional classes of variables. In this contribution, we propose a framework based on a sparse zero-sum game which performs a stable functional feature selection. In particular, the approach is based on feature subsets ranking by a thresholding stochastic bandit. We provide a theoretical analysis of the introduced algorithm. We illustrate by experiments on both synthetic and real complex data that the proposed method is competitive from the predictive and stability viewpoints. PMID:26325268
Performance Analysis of Relay Subset Selection for Amplify-and-Forward Cognitive Relay Networks
Qureshi, Ijaz Mansoor; Malik, Aqdas Naveed; Zubair, Muhammad
2014-01-01
Cooperative communication is regarded as a key technology in wireless networks, including cognitive radio networks (CRNs); it increases the diversity order of the signal to combat the unfavorable effects of fading channels by allowing distributed terminals to collaborate through sophisticated signal processing. Underlay CRNs place strict interference constraints on the secondary users (SUs) active in the frequency band of the primary users (PUs), which limit their transmit power and coverage area. Relay selection offers a potential solution to the challenges faced by underlay networks, by selecting either the single best relay or a subset of the potential relay set under different design requirements and assumptions. The best-relay selection schemes proposed in the literature for amplify-and-forward (AF) based underlay cognitive relay networks have been very well studied in terms of outage probability (OP) and bit error rate (BER), whereas such analysis is lacking for multiple relay selection schemes. The novelty of this work is to study the outage behavior of multiple relay selection in the underlay CRN and derive closed-form expressions for the OP and BER through the cumulative distribution function (CDF) of the SNR received at the destination. The effectiveness of relay subset selection is shown through simulation results. PMID:24737980
Different hunting strategies select for different weights in red deer.
Martínez, María; Rodríguez-Vigal, Carlos; Jones, Owen R; Coulson, Tim; San Miguel, Alfonso
2005-09-22
Much insight can be derived from records of shot animals. Most researchers using such data assume that their data represents a random sample of a particular demographic class. However, hunters typically select a non-random subset of the population and hunting is, therefore, not a random process. Here, with red deer (Cervus elaphus) hunting data from a ranch in Toledo, Spain, we demonstrate that data collection methods have a significant influence upon the apparent relationship between age and weight. We argue that a failure to correct for such methodological bias may have significant consequences for the interpretation of analyses involving weight or correlated traits such as breeding success, and urge researchers to explore methods to identify and correct for such bias in their data.
Fourier spatial frequency analysis for image classification: training the training set
NASA Astrophysics Data System (ADS)
Johnson, Timothy H.; Lhamo, Yigah; Shi, Lingyan; Alfano, Robert R.; Russell, Stewart
2016-04-01
The Directional Fourier Spatial Frequencies (DFSF) of a 2D image can identify similarity in spatial patterns within groups of related images. A Support Vector Machine (SVM) can then be used to classify images if the inter-image variance of the FSF in the training set is bounded. However, if variation in FSF increases with training set size, accuracy may decrease as the size of the training set increases. This calls for a method to identify a set of training images from among the originals that can form a vector basis for the entire class. Applying the Cauchy product method we extract the DFSF spectrum from radiographs of osteoporotic bone, and use it as a matched filter set to eliminate noise and image specific frequencies, and demonstrate that selection of a subset of superclassifiers from within a set of training images improves SVM accuracy. Central to this challenge is that the size of the search space can become computationally prohibitive for all but the smallest training sets. We are investigating methods to reduce the search space to identify an optimal subset of basis training images.
Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Demmel, Jim W.; Marques, Osni A.; Parlett, Beresford N.
2007-04-19
We compare four algorithms from the latest LAPACK 3.1 release for computing eigenpairs of a symmetric tridiagonal matrix. These include QR iteration, bisection and inverse iteration (BI), the Divide-and-Conquer method (DC), and the method of Multiple Relatively Robust Representations (MR). Our evaluation considers speed and accuracy when computing all eigenpairs, and additionally subset computations. Using a variety of carefully selected test problems, our study includes a variety of today's computer architectures. Our conclusions can be summarized as follows. (1) DC and MR are generally much faster than QR and BI on large matrices. (2) MR almost always does the fewest floating point operations, but at a lower MFlop rate than all the other algorithms. (3) The exact performance of MR and DC strongly depends on the matrix at hand. (4) DC and QR are the most accurate algorithms with observed accuracy O(√n ε). The accuracy of BI and MR is generally O(n ε). (5) MR is preferable to BI for subset computations.
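For subset computations of the kind benchmarked above, SciPy exposes LAPACK's symmetric tridiagonal drivers through `scipy.linalg.eigh_tridiagonal`. The snippet below requests only the three smallest eigenpairs of the 1-D discrete Laplacian, whose spectrum is known analytically, so the result can be checked:

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

n = 8
d = 2.0 * np.ones(n)         # main diagonal
e = -1.0 * np.ones(n - 1)    # off-diagonal: the 1-D discrete Laplacian

# Subset computation: only eigenpairs with indices 0..2 (inclusive range).
w, v = eigh_tridiagonal(d, e, select='i', select_range=(0, 2))

# This matrix has the known spectrum 2 - 2*cos(k*pi/(n+1)), k = 1..n.
exact = 2.0 - 2.0 * np.cos(np.arange(1, n + 1) * np.pi / (n + 1))
```

Which LAPACK routine is dispatched under the hood depends on the driver selection, consistent with the paper's point that the best algorithm is matrix- and task-dependent.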
An Active RBSE Framework to Generate Optimal Stimulus Sequences in a BCI for Spelling
NASA Astrophysics Data System (ADS)
Moghadamfalahi, Mohammad; Akcakaya, Murat; Nezamfar, Hooman; Sourati, Jamshid; Erdogmus, Deniz
2017-10-01
A class of brain computer interfaces (BCIs) employs noninvasive recordings of electroencephalography (EEG) signals to enable users with severe speech and motor impairments to interact with their environment and social network. For example, EEG based BCIs for typing popularly utilize event related potentials (ERPs) for inference. Presentation paradigms in current ERP-based letter-by-letter typing BCIs typically query the user with an arbitrary subset of characters. However, both typing accuracy and typing speed can potentially be enhanced with more informed subset selection and flash assignment. In this manuscript, we introduce the active recursive Bayesian state estimation (active-RBSE) framework for inference and sequence optimization. Prior to presentation in each iteration, rather than showing a subset of randomly selected characters, the developed framework optimally selects a subset based on a query function. Selected queries are adaptively specialized for users during each intent detection. Through a simulation-based study, we assess the effect of active-RBSE on the performance of a language-model assisted typing BCI in terms of typing speed and accuracy. To provide a baseline for comparison, we also utilize standard presentation paradigms, namely the row-and-column matrix presentation paradigm and random rapid serial visual presentation paradigms. The results show that utilization of active-RBSE can enhance the online performance of the system, both in terms of typing accuracy and speed.
NASA Technical Reports Server (NTRS)
1979-01-01
A nonlinear, maximum likelihood, parameter identification computer program (NLSCIDNT) is described which evaluates rotorcraft stability and control coefficients from flight test data. The optimal estimates of the parameters (stability and control coefficients) are determined (identified) by minimizing the negative log likelihood cost function. The minimization technique is the Levenberg-Marquardt method, which behaves like the steepest descent method when it is far from the minimum and behaves like the modified Newton-Raphson method when it is nearer the minimum. Twenty-one states and 40 measurement variables are modeled, and any subset may be selected. States which are not integrated may be fixed at an input value, or time history data may be substituted for the state in the equations of motion. Any aerodynamic coefficient may be expressed as a nonlinear polynomial function of selected 'expansion variables'.
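The Levenberg-Marquardt minimization described above is available in standard libraries. The sketch below fits a toy exponential model with SciPy's LM driver; the model, data, and starting values are illustrative, not the rotorcraft equations of motion:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(5)

# Toy identification problem (illustrative; not the NLSCIDNT model):
# fit amplitude and decay rate of y = p0 * exp(p1 * t) from noisy data.
t = np.linspace(0.0, 5.0, 50)
true = np.array([1.2, -0.7])
ydat = true[0] * np.exp(true[1] * t) + 0.01 * rng.normal(size=t.size)

def resid(p):
    return p[0] * np.exp(p[1] * t) - ydat

# method='lm' selects the Levenberg-Marquardt algorithm, which blends
# steepest descent far from the minimum with Gauss-Newton near it.
fit = least_squares(resid, x0=[1.0, -0.5], method='lm')
```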
Multiobjective optimization of combinatorial libraries.
Agrafiotis, D K
2002-01-01
Combinatorial chemistry and high-throughput screening have caused a fundamental shift in the way chemists contemplate experiments. Designing a combinatorial library is a controversial art that involves a heterogeneous mix of chemistry, mathematics, economics, experience, and intuition. Although there seems to be little agreement as to what constitutes an ideal library, one thing is certain: only one property or measure seldom defines the quality of the design. In most real-world applications, a good experiment requires the simultaneous optimization of several, often conflicting, design objectives, some of which may be vague and uncertain. In this paper, we discuss a class of algorithms for subset selection rooted in the principles of multiobjective optimization. Our approach is to employ an objective function that encodes all of the desired selection criteria, and then use a simulated annealing or evolutionary approach to identify the optimal (or a nearly optimal) subset from among the vast number of possibilities. Many design criteria can be accommodated, including diversity, similarity to known actives, predicted activity and/or selectivity determined by quantitative structure-activity relationship (QSAR) models or receptor binding models, enforcement of certain property distributions, reagent cost and availability, and many others. The method is robust, convergent, and extensible, offers the user full control over the relative significance of the various objectives in the final design, and permits the simultaneous selection of compounds from multiple libraries in full- or sparse-array format.
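A minimal simulated-annealing subset selector in the spirit described above, with a hypothetical scalarized objective combining activity, diversity, and cost; the weights, pool, and diversity measure are invented for illustration:

```python
import math, random

random.seed(42)

# Hypothetical compound pool: (predicted activity, reagent cost, property value).
pool = [(random.random(), random.random(), random.random()) for _ in range(40)]
K = 8  # target library size

def objective(sel, w_act=1.0, w_div=1.0, w_cost=0.5):
    """Scalarized multiobjective score: activity + diversity - cost."""
    acts = sum(pool[i][0] for i in sel)
    cost = sum(pool[i][1] for i in sel)
    props = sorted(pool[i][2] for i in sel)
    div = props[-1] - props[0]                      # crude diversity: property spread
    return w_act * acts + w_div * div - w_cost * cost

def anneal(steps=3000, t0=1.0):
    sel = set(random.sample(range(len(pool)), K))
    f = objective(sel)
    best, best_f = set(sel), f
    for t in range(steps):
        temp = t0 * (1 - t / steps) + 1e-9          # linear cooling schedule
        swap_out = random.choice(sorted(sel))
        swap_in = random.choice([i for i in range(len(pool)) if i not in sel])
        cand = (sel - {swap_out}) | {swap_in}
        fc = objective(cand)
        if fc > f or random.random() < math.exp((fc - f) / temp):
            sel, f = cand, fc                       # accept (maybe uphill) move
            if f > best_f:
                best, best_f = set(sel), f
    return best, best_f

subset, score = anneal()
```

As the paper notes, the relative weights give the designer full control over the trade-off among the competing objectives.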
Reducing seed dependent variability of non-uniformly sampled multidimensional NMR data
NASA Astrophysics Data System (ADS)
Mobli, Mehdi
2015-07-01
The application of NMR spectroscopy to study the structure, dynamics and function of macromolecules requires the acquisition of several multidimensional spectra. The one-dimensional NMR time-response from the spectrometer is extended to additional dimensions by introducing incremented delays in the experiment that cause oscillation of the signal along "indirect" dimensions. For a given dimension the delay is incremented at twice the rate of the maximum frequency (Nyquist rate). To achieve high-resolution requires acquisition of long data records sampled at the Nyquist rate. This is typically a prohibitive step due to time constraints, resulting in sub-optimal data records to the detriment of subsequent analyses. The multidimensional NMR spectrum itself is typically sparse, and it has been shown that in such cases it is possible to use non-Fourier methods to reconstruct a high-resolution multidimensional spectrum from a random subset of non-uniformly sampled (NUS) data. For a given acquisition time, NUS has the potential to improve the sensitivity and resolution of a multidimensional spectrum, compared to traditional uniform sampling. The improvements in sensitivity and/or resolution achieved by NUS are heavily dependent on the distribution of points in the random subset acquired. Typically, random points are selected from a probability density function (PDF) weighted according to the NMR signal envelope. In extreme cases as little as 1% of the data is subsampled. The heavy under-sampling can result in poor reproducibility, i.e. when two experiments are carried out where the same number of random samples is selected from the same PDF but using different random seeds. Here, a jittered sampling approach is introduced that is shown to improve random seed dependent reproducibility of multidimensional spectra generated from NUS data, compared to commonly applied NUS methods. 
It is shown that this is achieved due to the low variability of the inherent sensitivity of the random subset chosen from a given PDF. Finally, it is demonstrated that metrics used to find optimal NUS distributions are heavily dependent on the inherent sensitivity of the random subset, and such optimisation is therefore less critical when using the proposed sampling scheme.
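The jittered idea above can be sketched as stratified inverse-CDF sampling: the cumulative of the sampling-density envelope is split into equal-probability strata and one point is drawn per stratum, which suppresses seed-to-seed variability relative to fully independent draws. The envelope (an exponential with an assumed decay constant) and grid sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

N = 256      # full Nyquist grid (number of indirect-dimension increments)
m = 64       # NUS points actually acquired
T2 = 80.0    # assumed decay constant shaping the sampling-density envelope

pdf = np.exp(-np.arange(N) / T2)
pdf /= pdf.sum()
cdf = np.cumsum(pdf)

# Jittered (stratified) sampling: one uniform draw inside each of the m
# equal-probability strata of the CDF, inverted back onto the grid.
u = (np.arange(m) + rng.random(m)) / m
idx = np.unique(np.clip(np.searchsorted(cdf, u), 0, N - 1))
```

Changing only the seed moves each sample within its stratum rather than redrawing the whole schedule, which is the source of the improved reproducibility.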
A genetic algorithm-based framework for wavelength selection on sample categorization.
Anzanello, Michel J; Yamashita, Gabrielli; Marcelo, Marcelo; Fogliatto, Flávio S; Ortiz, Rafael S; Mariotti, Kristiane; Ferrão, Marco F
2017-08-01
In forensic and pharmaceutical scenarios, the application of chemometrics and optimization techniques has unveiled common and peculiar features of seized medicine and drug samples, helping investigative forces to track illegal operations. This paper proposes a novel framework aimed at identifying relevant subsets of attenuated total reflectance Fourier transform infrared (ATR-FTIR) wavelengths for classifying samples into two classes, for example authentic or forged categories in the case of medicines, or salt or base form in cocaine analysis. In the first step of the framework, the ATR-FTIR spectra were partitioned into equidistant intervals and the k-nearest neighbour (KNN) classification technique was applied to each interval to insert samples into proper classes. In the next step, the selected intervals were refined through the genetic algorithm (GA) by identifying a limited number of wavelengths from the previously selected intervals so as to maximize classification accuracy. When applied to Cialis®, Viagra®, and cocaine ATR-FTIR datasets, the proposed method substantially decreased the number of wavelengths needed for categorization and increased the classification accuracy. From a practical perspective, the proposed method provides investigative forces with valuable information towards monitoring illegal production of drugs and medicines. In addition, focusing on a reduced subset of wavelengths allows the development of portable devices capable of testing the authenticity of samples during police checking events, avoiding the need for later laboratorial analyses and reducing equipment expenses. Theoretically, the proposed GA-based approach yields more refined solutions than the current methods relying on interval approaches, which tend to insert irrelevant wavelengths in the retained intervals. Copyright © 2016 John Wiley & Sons, Ltd.
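The interval-screening step of such a framework can be illustrated with synthetic spectra and a plain leave-one-out 1-NN classifier standing in for the authors' KNN setup; everything below (data sizes, interval width, signal location) is invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "spectra": 60 samples x 120 wavelengths split into 6 intervals
# of 20; only the third interval (columns 40:60) separates the two classes.
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 120))
X[y == 1, 40:60] += 1.5

def loo_1nn_acc(Z):
    """Leave-one-out accuracy of a plain 1-nearest-neighbour classifier."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)        # a sample must not match itself
    return float(np.mean(y[D.argmin(axis=1)] == y))

accs = [loo_1nn_acc(X[:, s:s + 20]) for s in range(0, 120, 20)]
best_interval = int(np.argmax(accs))   # interval to pass on to GA refinement
```

A GA would then search within the retained interval for the individual wavelengths that maximize accuracy, as the abstract describes.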
Methods Used to Streamline the CAHPS® Hospital Survey
Keller, San; O'Malley, A James; Hays, Ron D; Matthew, Rebecca A; Zaslavsky, Alan M; Hepner, Kimberly A; Cleary, Paul D
2005-01-01
Objective To identify a parsimonious subset of reliable, valid, and consumer-salient items from 33 questions asking for patient reports about hospital care quality. Data Source CAHPS® Hospital Survey pilot data were collected during the summer of 2003 using mail and telephone from 19,720 patients who had been treated in 132 hospitals in three states and discharged from November 2002 to January 2003. Methods Standard psychometric methods were used to assess the reliability (internal consistency reliability and hospital-level reliability) and construct validity (exploratory and confirmatory factor analyses, strength of relationship to overall rating of hospital) of the 33 report items. The best subset of items from among the 33 was selected based on their statistical properties in conjunction with the importance assigned to each item by participants in 14 focus groups. Principal Findings Confirmatory factor analysis (CFA) indicated that a subset of 16 questions proposed to measure seven aspects of hospital care (communication with nurses, communication with doctors, responsiveness to patient needs, physical environment, pain control, communication about medication, and discharge information) demonstrated excellent fit to the data. Scales in each of these areas had acceptable levels of reliability to discriminate among hospitals and internal consistency reliability estimates comparable with previously developed CAHPS instruments. Conclusion Although half the length of the original, the shorter CAHPS hospital survey demonstrates promising measurement properties, identifies variations in care among hospitals, and deals with aspects of the hospital stay that are important to patients' evaluations of care quality. PMID:16316438
Targeted depletion of a MDSC subset unmasks pancreatic ductal adenocarcinoma to adaptive immunity
Stromnes, Ingunn M.; Brockenbrough, Scott; Izeradjene, Kamel; Carlson, Markus A.; Cuevas, Carlos; Simmons, Randi M.; Greenberg, Philip D.; Hingorani, Sunil R.
2015-01-01
Objective Pancreatic ductal adenocarcinoma (PDA) is characterized by a robust desmoplasia, including the notable accumulation of immunosuppressive cells that shield neoplastic cells from immune detection. Immune evasion may be further enhanced if the malignant cells fail to express high levels of antigens that are sufficiently immunogenic to engender an effector T cell response. In this report, we investigate the predominant subsets of immunosuppressive cancer-conditioned myeloid cells that chronicle and shape pancreas cancer progression. We show that selective depletion of one subset of myeloid-derived suppressor cells (MDSC) in an autochthonous, genetically engineered mouse model (GEMM) of PDA unmasks the ability of the adaptive immune response to engage and target tumor epithelial cells. Methods A combination of in vivo and in vitro studies was performed employing a GEMM that faithfully recapitulates the cardinal features of human PDA. The predominant cancer-conditioned myeloid cell subpopulation was specifically targeted in vivo and the biological outcomes determined. Results PDA orchestrates the induction of distinct subsets of cancer-associated myeloid cells through the production of factors known to influence myelopoiesis. These immature myeloid cells inhibit the proliferation and induce apoptosis of activated T cells. Targeted depletion of granulocytic MDSC (Gr-MDSC) in autochthonous PDA increases the intratumoral accumulation of activated CD8 T cells and apoptosis of tumor epithelial cells, and also remodels the tumor stroma. Conclusions Neoplastic ductal cells of the pancreas induce distinct myeloid cell subsets that promote tumor cell survival and accumulation. Targeted depletion of a single myeloid subset, the Gr-MDSC, can unmask an endogenous T cell response, revealing an unexpected latent immunity and invoking targeting of Gr-MDSC as a potential strategy to exploit for treating this highly lethal disease. PMID:24555999
Feature Selection for Nonstationary Data: Application to Human Recognition Using Medical Biometrics.
Komeili, Majid; Louis, Wael; Armanfard, Narges; Hatzinakos, Dimitrios
2018-05-01
Electrocardiogram (ECG) and transient evoked otoacoustic emission (TEOAE) are among the physiological signals that have attracted significant interest in the biometric community due to their inherent robustness to replay and falsification attacks. However, they are time-dependent signals, which makes them hard to deal with in the across-session human recognition scenario where only one session is available for enrollment. This paper presents a novel feature selection method to address this issue. It is based on an auxiliary dataset with multiple sessions, from which it selects a subset of features that are more persistent across different sessions. It uses local information in terms of sample margins while enforcing an across-session measure, making it a perfect fit for the aforementioned biometric recognition problem. Comprehensive experiments on ECG and TEOAE variability due to time lapse and body posture were performed. The performance of the proposed method is compared against seven state-of-the-art feature selection algorithms as well as six other approaches in the area of ECG and TEOAE biometric recognition. Experimental results demonstrate that the proposed method performs noticeably better than the other algorithms.
A method for tailoring the information content of a software process model
NASA Technical Reports Server (NTRS)
Perkins, Sharon; Arend, Mark B.
1990-01-01
A framework is defined for a general method of selecting a necessary and sufficient subset of a general software life cycle's information products to support a new software development process. Procedures for characterizing problem domains in general and mapping them to a tailored set of life cycle processes and products are presented. An overview of the method is given using the following steps: (1) during the problem concept definition phase, perform standardized interviews and dialogs between developer and user, and between user and customer; (2) generate a quality needs profile of the software to be developed, based on the information gathered in step 1; (3) translate the quality needs profile into a profile of quality criteria that must be met by the software to satisfy the quality needs; (4) map the quality criteria to a set of accepted processes and products for achieving each criterion; (5) select the information products which match or support the accepted processes and products of step 4; and (6) select the design methodology which produces the information products selected in step 5.
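The chain of mappings in steps 2-5 can be sketched as a series of lookups. All profile names and mappings below are invented placeholders, not the report's actual taxonomy:

```python
# Hypothetical quality-needs profile (step 2) and mapping tables
# (steps 3-5); every entry here is illustrative only.
quality_needs = ["reliability", "maintainability"]

needs_to_criteria = {                       # step 3: needs -> criteria
    "reliability": ["error tolerance", "consistency"],
    "maintainability": ["modularity", "self-descriptiveness"],
}
criteria_to_processes = {                   # step 4: criteria -> processes
    "error tolerance": ["unit testing"],
    "consistency": ["design review"],
    "modularity": ["structured design"],
    "self-descriptiveness": ["coding standards"],
}
process_to_products = {                     # step 5: processes -> products
    "unit testing": ["test report"],
    "design review": ["review minutes"],
    "structured design": ["design document"],
    "coding standards": ["style guide"],
}

criteria = [c for n in quality_needs for c in needs_to_criteria[n]]
processes = [p for c in criteria for p in criteria_to_processes[c]]
products = sorted({d for p in processes for d in process_to_products[p]})
```

The resulting `products` list is the tailored, necessary-and-sufficient set of information products from which a design methodology would be chosen in step 6.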
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.
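The leader-side reduction can be sketched as follows. This is a minimal illustration assuming a simple "worst code wins" policy and a report shape of my own invention; the patent's actual wire format and Blue Gene-specific mechanics are not reproduced:

```python
def aggregate_exit_statuses(statuses):
    """Reduce per-node exit statuses into one upstream report.

    `statuses` maps node id -> integer exit code (0 == success).
    The job leader sends this single aggregate instead of N
    individual messages: the worst (highest) code wins, and the
    ids of failing nodes are preserved for diagnosis.
    """
    worst = max(statuses.values())
    failed = sorted(n for n, code in statuses.items() if code != 0)
    return {"exit_code": worst,
            "failed_nodes": failed,
            "total_nodes": len(statuses)}

# Four nodes in the subset; node 2 crashed (139), node 3 errored (1).
report = aggregate_exit_statuses({0: 0, 1: 0, 2: 139, 3: 1})
```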
Subset selective search on the basis of color and preview.
Donk, Mieke
2017-01-01
In the preview paradigm observers are presented with one set of elements (the irrelevant set) followed by the addition of a second set among which the target is presented (the relevant set). Search efficiency in such a preview condition has been demonstrated to be higher than that in a full-baseline condition in which both sets are simultaneously presented, suggesting that a preview of the irrelevant set reduces its influence on the search process. However, numbers of irrelevant and relevant elements are typically not independently manipulated. Moreover, subset selective search also occurs when both sets are presented simultaneously but differ in color. The aim of the present study was to investigate how numbers of irrelevant and relevant elements contribute to preview search in the absence and presence of a color difference between subsets. In two experiments it was demonstrated that a preview reduced the influence of the number of irrelevant elements in the absence but not in the presence of a color difference between subsets. In the presence of a color difference, a preview lowered the effect of the number of relevant elements but only when the target was defined by a unique feature within the relevant set (Experiment 1); when the target was defined by a conjunction of features (Experiment 2), search efficiency as a function of the number of relevant elements was not modulated by a preview. Together the results are in line with the idea that subset selective search is based on different simultaneously operating mechanisms.
Efficient feature subset selection with probabilistic distance criteria. [pattern recognition
NASA Technical Reports Server (NTRS)
Chittineni, C. B.
1979-01-01
Recursive expressions are derived for efficiently computing the commonly used probabilistic distance measures as a change in the criteria both when a feature is added to and when a feature is deleted from the current feature subset. A combinatorial algorithm for generating all possible r-feature combinations from a given set of s features in C(s, r) steps, with a change of a single feature at each step, is presented. These expressions can also be used for both forward and backward sequential feature selection.
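A single-exchange enumeration of this kind can be sketched with the classic revolving-door recursion, a standard Gray-code ordering of combinations (shown here for illustration; it is not necessarily the paper's exact scheme):

```python
def revolving_door(s, r):
    """All r-subsets of range(s), ordered so that consecutive subsets
    differ by exchanging exactly one feature (a Gray-code ordering)."""
    if r == 0:
        return [[]]
    if r == s:
        return [list(range(s))]
    # A(s, r) = A(s-1, r) followed by reversed A(s-1, r-1), each + {s-1}
    head = revolving_door(s - 1, r)
    tail = [c + [s - 1] for c in reversed(revolving_door(s - 1, r - 1))]
    return head + tail

combos = revolving_door(5, 3)
```

Because each step swaps exactly one feature, the recursive delta-update of a distance criterion (add one feature, delete one feature) can be applied at every step instead of recomputing the criterion from scratch.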
Novel and efficient tag SNPs selection algorithms.
Chen, Wen-Pei; Hung, Che-Lun; Tsai, Suh-Jen Jane; Lin, Yaw-Ling
2014-01-01
SNPs are the most abundant form of genetic variation amongst species; association studies between complex diseases and SNPs or haplotypes have received great attention. However, these studies are restricted by the cost of genotyping all SNPs; thus, it is necessary to find smaller subsets, or tag SNPs, representing the rest of the SNPs. In fact, the existing tag SNP selection algorithms are notoriously time-consuming. An efficient algorithm for tag SNP selection is presented and applied to analyze the HapMap YRI data. The experimental results show that the proposed algorithm achieves better performance than the existing tag SNP selection algorithms; in most cases, it is at least ten times faster than the existing methods. In many cases, when the redundant ratio of the block is high, the proposed algorithm can even be thousands of times faster than the previously known methods. Tools and web services for haplotype block analysis, integrated via the Hadoop MapReduce framework, have also been developed using the proposed algorithm as computation kernels.
NASA Astrophysics Data System (ADS)
Stas, Michiel; Dong, Qinghan; Heremans, Stien; Zhang, Beier; Van Orshoven, Jos
2016-08-01
This paper compares two machine learning techniques to predict regional winter wheat yields. The models, based on Boosted Regression Trees (BRT) and Support Vector Machines (SVM), are constructed from Normalized Difference Vegetation Index (NDVI) values derived from low-resolution SPOT VEGETATION satellite imagery. Three types of NDVI-related predictors were used: Single NDVI, Incremental NDVI and Targeted NDVI. BRT and SVM were first used to select features with high relevance for predicting the yield. Although the exact selections differed between the prefectures, certain periods with high influence scores for multiple prefectures could be identified. The same period of high influence, stretching from March to June, was detected by both machine learning methods. After feature selection, BRT and SVM models were applied to the subset of selected features for actual yield forecasting. While both machine learning methods returned very low prediction errors, BRT slightly but consistently outperformed SVM.
Zhao, Yu-Xiang; Chou, Chien-Hsing
2016-01-01
In this study, a new feature selection algorithm, the neighborhood-relationship feature selection (NRFS) algorithm, is proposed for identifying rat electroencephalogram signals and recognizing Chinese characters. In these two applications, dependent relationships exist among the feature vectors and their neighboring feature vectors, and the proposed NRFS algorithm was designed to exploit this structure. Under the NRFS algorithm, unselected feature vectors have a high priority of being added to the feature subset if their neighboring feature vectors have been selected. Likewise, selected feature vectors have a high priority of being eliminated if their neighboring feature vectors are not selected. In the experiments conducted in this study, the NRFS algorithm was compared with two other feature selection algorithms. The experimental results indicated that the NRFS algorithm can extract the crucial frequency bands for identifying rat vigilance states and the crucial character regions for recognizing Chinese characters. PMID:27314346
Determining which phenotypes underlie a pleiotropic signal
Majumdar, Arunabha; Haldar, Tanushree; Witte, John S.
2016-01-01
Discovering pleiotropic loci is important to understand the biological basis of seemingly distinct phenotypes. Most methods for assessing pleiotropy only test for the overall association between genetic variants and multiple phenotypes. To determine which specific traits are pleiotropic, we evaluate via simulation and application three different strategies. The first is model selection techniques based on the inverse regression of genotype on phenotypes. The second is a subset-based meta-analysis, ASSET [Bhattacharjee et al., 2012], which provides an optimal subset of non-null traits. The third is a modified Benjamini-Hochberg (B-H) procedure of controlling the expected false discovery rate [Benjamini and Hochberg, 1995] in the framework of a phenome-wide association study. From our simulations we see that an inverse regression based approach, MultiPhen [O'Reilly et al., 2012], is more powerful than ASSET for detecting overall pleiotropic association, except when all the phenotypes are associated and have genetic effects in the same direction. For determining which specific traits are pleiotropic, the modified B-H procedure performs consistently better than the other two methods. The inverse regression based selection methods perform competitively with the modified B-H procedure only when the phenotypes are weakly correlated. The efficiency of ASSET lies below that of the other two methods when the traits are weakly correlated, and between them when the traits are strongly correlated. In our application to a large GWAS, we find that the modified B-H procedure also performs well, indicating that this may be an optimal approach for determining the traits underlying a pleiotropic signal. PMID:27238845
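The classic (unmodified) B-H step-up rule at the core of the third strategy can be sketched in a few lines; the paper's modification for the phenome-wide setting is not reproduced here:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Classic step-up FDR control: reject the k smallest p-values,
    where k is the largest rank with p_(k) <= (k / m) * q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank              # step-up: keep the largest passing rank
    return sorted(order[:k])      # indices of rejected hypotheses

# Four trait-association p-values; three survive FDR control at q=0.05.
rejected = benjamini_hochberg([0.01, 0.50, 0.03, 0.02], q=0.05)
```

Note the step-up behaviour: a p-value can be rejected even if it misses its own threshold, as long as some larger-ranked p-value passes.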
NASA Astrophysics Data System (ADS)
Zheng, Feifei; Maier, Holger R.; Wu, Wenyan; Dandy, Graeme C.; Gupta, Hoshin V.; Zhang, Tuqiao
2018-02-01
Hydrological models are used for a wide variety of engineering purposes, including streamflow forecasting and flood-risk estimation. To develop such models, it is common to allocate the available data to calibration and evaluation data subsets. Surprisingly, the issue of how this allocation can affect model evaluation performance has been largely ignored in the research literature. This paper discusses the evaluation performance bias that can arise from how available data are allocated to calibration and evaluation subsets. As a first step to assessing this issue in a statistically rigorous fashion, we present a comprehensive investigation of the influence of data allocation on the development of data-driven artificial neural network (ANN) models of streamflow. Four well-known formal data splitting methods are applied to 754 catchments from Australia and the U.S. to develop 902,483 ANN models. Results clearly show that the choice of the method used for data allocation has a significant impact on model performance, particularly for runoff data that are more highly skewed, highlighting the importance of considering the impact of data splitting when developing hydrological models. The statistical behavior of the data splitting methods investigated is discussed and guidance is offered on the selection of the most appropriate data splitting methods to achieve representative evaluation performance for streamflow data with different statistical properties. Although our results are obtained for data-driven models, they highlight the fact that this issue is likely to have a significant impact on all types of hydrological models, especially conceptual rainfall-runoff models.
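One simple member of the family of formal splitting methods compared in such studies is systematic sampling along the sorted output variable. The sketch below is a minimal illustration of that idea with invented "streamflow" values, not the paper's actual algorithms (which also include more sophisticated schemes):

```python
def systematic_split(values, eval_fraction=0.25):
    """Split record indices into calibration/evaluation subsets by
    sorting on the output variable and sending every k-th record to
    evaluation, so both subsets span the full (possibly skewed) range."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    k = round(1 / eval_fraction)
    eval_idx = set(order[::k])
    calib_idx = [i for i in range(len(values)) if i not in eval_idx]
    return sorted(calib_idx), sorted(eval_idx)

# Skewed "streamflow" series: mostly small values with two large floods.
flows = [1, 2, 3, 4, 100, 5, 6, 7, 8, 200, 9, 10]
calib, evalu = systematic_split(flows, eval_fraction=0.25)
```

Contrast this with a purely random split, which on skewed runoff data can easily place all extreme events in one subset and so bias the evaluation score.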
Two methods for parameter estimation using multiple-trait models and beef cattle field data.
Bertrand, J K; Kriese, L A
1990-08-01
Two methods are presented for estimating variances and covariances from beef cattle field data using multiple-trait sire models. Both methods require that the first trait have no missing records and that the contemporary groups for the second trait be subsets of the contemporary groups for the first trait; however, the second trait may have missing records. One method uses pseudo expectations involving quadratics composed of the solutions and the right-hand sides of the mixed model equations. The other method is an extension of Henderson's Simple Method to the multiple trait case. Neither of these methods requires any inversions of large matrices in the computation of the parameters; therefore, both methods can handle very large sets of data. Four simulated data sets were generated to evaluate the methods. In general, both methods estimated genetic correlations and heritabilities that were close to the Restricted Maximum Likelihood estimates and the true data set values, even when selection within contemporary groups was practiced. The estimates of residual correlations by both methods, however, were biased by selection. These two methods can be useful in estimating variances and covariances from multiple-trait models in large populations that have undergone a minimal amount of selection within contemporary groups.
Ni, Ai; Cai, Jianwen
2018-07-01
Case-cohort designs are commonly used in large epidemiological studies to reduce the cost associated with covariate measurement. In many such studies the number of covariates is very large, so an efficient variable selection method is needed for case-cohort studies where the covariates are only observed in a subset of the sample. Current literature on this topic has focused on the proportional hazards model. However, in many studies the additive hazards model is preferred, either because the proportional hazards assumption is violated or because the additive hazards model provides more relevant information for the research question. Motivated by one such study, the Atherosclerosis Risk in Communities (ARIC) study, we investigate the properties of a regularized variable selection procedure in a stratified case-cohort design under an additive hazards model with a diverging number of parameters. We establish the consistency and asymptotic normality of the penalized estimator and prove its oracle property. Simulation studies are conducted to assess the finite sample performance of the proposed method with a modified cross-validation tuning parameter selection method. We apply the variable selection procedure to the ARIC study to demonstrate its practical use.
Fernández-Varela, R; Andrade, J M; Muniategui, S; Prada, D; Ramírez-Villalobos, F
2010-04-01
Identifying petroleum-related products released into the environment is a complex and difficult task, for which polycyclic aromatic hydrocarbons (PAHs) are nowadays of outstanding importance. Although traditional quantitative fingerprinting uses straightforward univariate statistical analyses to differentiate among oils and to assess their sources, a multivariate strategy based on Procrustes rotation (PR) was applied in this paper. The aim of PR is to select a reduced subset of PAHs still capable of performing a satisfactory identification of petroleum-related hydrocarbons. PR selected two subsets of three (C(2)-naphthalene, C(2)-dibenzothiophene and C(2)-phenanthrene) and five (C(1)-decahidronaphthalene, naphthalene, C(2)-phenanthrene, C(3)-phenanthrene and C(2)-fluoranthene) PAHs for each of the two datasets studied here. The classification abilities of each subset of PAHs were tested using principal components analysis, hierarchical cluster analysis and Kohonen neural networks, and it was demonstrated that they unraveled the same patterns as the overall set of PAHs. (c) 2009 Elsevier Ltd. All rights reserved.
Different hunting strategies select for different weights in red deer
Martínez, María; Rodríguez-Vigal, Carlos; Jones, Owen R; Coulson, Tim; Miguel, Alfonso San
2005-01-01
Much insight can be derived from records of shot animals. Most researchers using such data assume that their data represents a random sample of a particular demographic class. However, hunters typically select a non-random subset of the population and hunting is, therefore, not a random process. Here, with red deer (Cervus elaphus) hunting data from a ranch in Toledo, Spain, we demonstrate that data collection methods have a significant influence upon the apparent relationship between age and weight. We argue that a failure to correct for such methodological bias may have significant consequences for the interpretation of analyses involving weight or correlated traits such as breeding success, and urge researchers to explore methods to identify and correct for such bias in their data. PMID:17148205
Genetic Algorithms and Classification Trees in Feature Discovery: Diabetes and the NHANES database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Heredia-Langner, Alejandro; Jarman, Kristin H.; Amidan, Brett G.
2013-09-01
This paper presents a feature selection methodology that can be applied to datasets containing a mixture of continuous and categorical variables. Using a Genetic Algorithm (GA), this method explores a dataset and selects a small set of features relevant for the prediction of a binary (1/0) response. Binary classification trees and an objective function based on conditional probabilities are used to measure the fitness of a given subset of features. The method is applied to health data in order to find factors useful for the prediction of diabetes. Results show that our algorithm is capable of narrowing down the set of predictors to around 8 factors that can be validated using reputable medical and public health resources.
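A GA over feature bitmasks can be sketched as follows. This is a toy illustration: the paper scores subsets with binary classification trees and conditional probabilities, whereas here a stand-in fitness function simply rewards a known-informative subset (and penalizes size) so the sketch stays self-contained:

```python
import random

random.seed(1)
N_FEATURES = 12
INFORMATIVE = {2, 5, 7}   # pretend these features predict the response

def fitness(mask):
    # Stand-in for the tree-based objective: reward informative
    # features, penalize subset size (the "cost" term).
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

def evolve(pop_size=20, generations=30, mut_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_FEATURES)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mut_rate) for bit in child]
            children.append(child)
        pop = parents + children                 # parents survive (elitism)
    return max(pop, key=fitness)

best = evolve()
selected = {i for i, bit in enumerate(best) if bit}
```

Because the parents are carried over unchanged each generation, the best fitness in the population never decreases.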
Applications of self-organizing neural networks in virtual screening and diversity selection.
Selzer, Paul; Ertl, Peter
2006-01-01
Artificial neural networks provide a powerful technique for the analysis and modeling of nonlinear relationships between molecular structures and pharmacological activity. Many network types, including Kohonen and counterpropagation, also provide an intuitive method for the visual assessment of correspondence between the input and output data. This work shows how a combination of neural networks and radial distribution function molecular descriptors can be applied in various areas of industrial pharmaceutical research. These applications include the prediction of biological activity, the selection of screening candidates (cherry picking), and the extraction of representative subsets from large compound collections such as combinatorial libraries. The methods described have also been implemented as an easy-to-use Web tool, allowing chemists to perform interactive neural network experiments on the Novartis intranet.
Prediction of solvation enthalpy of gaseous organic compounds in propanol
NASA Astrophysics Data System (ADS)
Golmohammadi, Hassan; Dashtbozorgi, Zahra
2016-09-01
The purpose of this paper is to present a novel way of developing quantitative structure-property relationship (QSPR) models to predict the gas-to-propanol solvation enthalpy (ΔHsolv) of 95 organic compounds. Different kinds of descriptors were calculated for each compound using the Dragon software package. The replacement method (RM) variable selection technique was employed to select the optimal subset of solute descriptors. Our investigation reveals that the dependence of solvation enthalpy on the physicochemical properties of the solution is nonlinear, so the RM method alone is unable to model the solvation enthalpy accurately. The results established that the ΔHsolv values calculated by support vector machine (SVM) models were in good agreement with the experimental ones, and the performance of the SVM models was superior to that of the RM model.
Query3d: a new method for high-throughput analysis of functional residues in protein structures.
Ausiello, Gabriele; Via, Allegra; Helmer-Citterich, Manuela
2005-12-01
The identification of local similarities between two protein structures can provide clues of a common function. Many different methods exist for searching for similar subsets of residues in proteins of known structure. However, the lack of functional and structural information on single residues, together with the low level of integration of this information in comparison methods, is a limitation that prevents these methods from being fully exploited in high-throughput analyses. Here we describe Query3d, a program that is both a structural DBMS (Database Management System) and a local comparison method. The method conserves a copy of all the residues of the Protein Data Bank annotated with a variety of functional and structural information. New annotations can be easily added from a variety of methods and known databases. The algorithm makes it possible to create complex queries based on the residues' function and then to compare only subsets of the selected residues. Functional information is also essential to speed up the comparison and the analysis of the results. With Query3d, users can easily obtain statistics on how many and which residues share certain properties in all proteins of known structure. At the same time, the method also finds their structural neighbours in the whole PDB. Programs and data can be accessed through the PdbFun web interface.
NASA Astrophysics Data System (ADS)
Pande, Saket; Sharma, Ashish
2014-05-01
This study is motivated by the need to robustly specify, identify, and forecast runoff generation processes for hydroelectricity production. This requires, at the least, the identification of significant predictors of runoff generation and of the influence of each such significant predictor on the runoff response. To this end, we compare two non-parametric algorithms for predictor subset selection. One is based on information theory and assesses predictor significance (and hence selection) based on the Partial Information (PI) rationale of Sharma and Mehrotra (2014). The other is based on a frequentist approach that uses the bounds on probability of error concept of Pande (2005); it assesses all possible predictor subsets on the go and converges to a predictor subset in a computationally efficient manner. Both algorithms approximate the underlying system by locally constant functions and select predictor subsets corresponding to these functions. The performance of the two algorithms is compared on a set of synthetic case studies as well as a real-world case study of inflow forecasting. References: Sharma, A., and R. Mehrotra (2014), An information theoretic alternative to model a natural system using observational information alone, Water Resources Research, 49, doi:10.1002/2013WR013845. Pande, S. (2005), Generalized local learning in water resource management, PhD dissertation, Utah State University, UT-USA, 148p.
Ordering Elements and Subsets: Examples for Student Understanding
ERIC Educational Resources Information Center
Mellinger, Keith E.
2004-01-01
Teaching the art of counting can be quite difficult. Many undergraduate students have difficulty separating the ideas of permutation, combination, repetition, etc. This article develops some examples to help explain some of the underlying theory while looking carefully at the selection of various subsets of objects from a larger collection. The…
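The distinctions the article teaches (order vs no order, repetition vs none, subsets of every size) can be made concrete in a few lines of standard-library code:

```python
from itertools import (combinations, permutations,
                       combinations_with_replacement)

letters = "ABCD"

# Order matters, no repetition: permutations of 2 of 4 -> 4*3 = 12
two_perms = list(permutations(letters, 2))

# Order ignored, no repetition: subsets of size 2 -> C(4,2) = 6
two_subsets = list(combinations(letters, 2))

# Order ignored, repetition allowed: multisets of size 2 -> C(5,2) = 10
two_multisets = list(combinations_with_replacement(letters, 2))

# Every subset of any size (the power set): 2**4 = 16, counting {}
all_subsets = [c for r in range(len(letters) + 1)
               for c in combinations(letters, r)]
```

Enumerating the collections, rather than just applying the counting formulas, is a useful classroom check that the formulas agree with the actual selections.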
Kapetanovic, Suad; Aaron, Lisa; Montepiedra, Grace; Anthony, Patricia; Thuvamontolrat, Kasalyn; Pahwa, Savita; Burchett, Sandra; Weinberg, Adriana; Kovacs, Andrea
2015-01-01
Background We examined the effect of cytomegalovirus (CMV) co-infection and viremia on reconstitution of selected CD4+ and CD8+ T-cell subsets in perinatally HIV-infected (PHIV+) children ≥ 1-year old who participated in a partially randomized, open-label, 96-week combination antiretroviral therapy (cART)-algorithm study. Methods Participants were categorized as CMV-naïve, CMV-positive (CMV+) viremic, and CMV+ aviremic, based on blood, urine, or throat culture, CMV IgG and DNA polymerase chain reaction measured at baseline. At weeks 0, 12, 20 and 40, T-cell subsets including naïve (CD62L+CD45RA+; CD95-CD28+), activated (CD38+HLA-DR+) and terminally differentiated (CD62L-CD45RA+; CD95+CD28-) CD4+ and CD8+ T-cells were measured by flow cytometry. Results Of the 107 participants included in the analysis, 14% were CMV+ viremic; 49% CMV+ aviremic; 37% CMV-naïve. In longitudinal adjusted models, compared with CMV+ status, baseline CMV-naïve status was significantly associated with faster recovery of CD8+CD62L+CD45RA+% and CD8+CD95-CD28+% and faster decrease of CD8+CD95+CD28-%, independent of HIV VL response to treatment, cART regimen and baseline CD4%. Surprisingly, CMV status did not have a significant impact on longitudinal trends in CD8+CD38+HLA-DR+%. CMV status did not have a significant impact on any CD4+ T-cell subsets. Conclusions In this cohort of PHIV+ children, the normalization of naïve and terminally differentiated CD8+ T-cell subsets in response to cART was detrimentally affected by the presence of CMV co-infection. These findings may have implications for adjunctive treatment strategies targeting CMV co-infection in PHIV+ children, especially those that are now adults or reaching young adulthood and may have accelerated immunologic aging, increased opportunistic infections and aging diseases of the immune system. PMID:25794163
Selective accumulation of langerhans-type dendritic cells in small airways of patients with COPD
2010-01-01
Background Dendritic cells (DC) linking innate and adaptive immune responses are present in human lungs, but the characterization of different subsets and their role in COPD pathogenesis remain to be elucidated. The aim of this study is to characterize and quantify pulmonary myeloid DC subsets in small airways of current and ex-smokers with or without COPD. Methods Myeloid DC were characterized using flowcytometry on single cell suspensions of digested human lung tissue. Immunohistochemical staining for langerin, BDCA-1, CD1a and DC-SIGN was performed on surgical resection specimens from 85 patients. Expression of factors inducing Langerhans-type DC (LDC) differentiation was evaluated by RT-PCR on total lung RNA. Results Two segregated subsets of tissue resident pulmonary myeloid DC were identified in single cell suspensions by flowcytometry: the langerin+ LDC and the DC-SIGN+ interstitial-type DC (intDC). LDC partially expressed the markers CD1a and BDCA-1, which are also present on their known blood precursors. In contrast, intDC did not express langerin, CD1a or BDCA-1, but were more closely related to monocytes. Quantification of DC in the small airways by immunohistochemistry revealed a higher number of LDC in current smokers without COPD and in COPD patients compared to never smokers and ex-smokers without COPD. Importantly, there was no difference in the number of LDC between current and ex-smoking COPD patients. In contrast, the number of intDC did not differ between study groups. Interestingly, the number of BDCA-1+ DC was significantly lower in COPD patients compared to never smokers and further decreased with the severity of the disease. In addition, the accumulation of LDC in the small airways significantly correlated with the expression of the LDC inducing differentiation factor activin-A. 
Conclusions Myeloid DC differentiation is altered in small airways of current smokers and COPD patients resulting in a selective accumulation of the LDC subset which correlates with the pulmonary expression of the LDC-inducing differentiation factor activin-A. This study identified the LDC subset as an interesting focus for future research in COPD pathogenesis. PMID:20307269
Kandaswamy, Krishna Kumar; Pugalenthi, Ganesan; Möller, Steffen; Hartmann, Enno; Kalies, Kai-Uwe; Suganthan, P N; Martinetz, Thomas
2010-12-01
Apoptosis is an essential process for controlling tissue homeostasis by regulating a physiological balance between cell proliferation and cell death. The subcellular locations of proteins that execute cell death are determined by largely independent cellular mechanisms, and standard bioinformatics tools for predicting the subcellular locations of such apoptotic proteins often fail. This work proposes a model for the sorting of proteins involved in apoptosis, allowing both the prediction of their subcellular locations and the identification of the molecular properties that contribute to them. We report a novel hybrid Genetic Algorithm (GA)/Support Vector Machine (SVM) approach to predict apoptotic protein sequences using 119 sequence-derived properties such as frequency of amino acid groups, secondary structure, and physicochemical properties. The GA is used for selecting a near-optimal subset of informative features that is most relevant for the classification. Jackknife cross-validation is applied to test the predictive capability of the proposed method on 317 apoptosis proteins. Our method achieved 85.80% accuracy using all 119 features and 89.91% accuracy with the 25 features selected by the GA. Our models were further examined on a test dataset of 98 apoptosis proteins and obtained an overall accuracy of 90.34%. The results show that the proposed approach is promising: it is able to select small subsets of features and still improve the classification accuracy. Our model can contribute to the understanding of programmed cell death and to drug discovery. The software and dataset are available at http://www.inb.uni-luebeck.de/tools-demos/apoptosis/GASVM.
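The GA-driven feature selection described above can be illustrated with a minimal sketch. All names here are hypothetical, and a simple class-separability score stands in for the paper's SVM cross-validation fitness; this is an illustration of the technique, not the authors' implementation.

```python
import random

def fitness(mask, X, y):
    # Stand-in for SVM cross-validation accuracy: mean squared
    # class-mean difference over the selected features, which
    # rewards small, discriminative feature subsets.
    sel = [i for i, m in enumerate(mask) if m]
    if not sel:
        return 0.0
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    pos = [x for x, lbl in zip(X, y) if lbl == 1]
    neg = [x for x, lbl in zip(X, y) if lbl == 0]
    score = sum((mean(pos, j) - mean(neg, j)) ** 2 for j in sel)
    return score / len(sel)

def ga_select(X, y, n_feat, pop=20, gens=30, seed=0):
    """Genetic algorithm over binary feature masks: truncation
    selection, one-point crossover, and single-bit mutation."""
    rng = random.Random(seed)
    popn = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=lambda m: -fitness(m, X, y))
        parents = popn[: pop // 2]            # keep the fitter half
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_feat)    # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n_feat)] ^= 1  # point mutation
            children.append(child)
        popn = parents + children
    return max(popn, key=lambda m: fitness(m, X, y))
```

On a toy dataset where only the first feature separates the classes, `ga_select` converges to the mask that keeps that feature alone.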
Monocyte Subset Dynamics in Human Atherosclerosis Can Be Profiled with Magnetic Nano-Sensors
Wildgruber, Moritz; Lee, Hakho; Chudnovskiy, Aleksey; Yoon, Tae-Jong; Etzrodt, Martin; Pittet, Mikael J.; Nahrendorf, Matthias; Croce, Kevin; Libby, Peter; Weissleder, Ralph; Swirski, Filip K.
2009-01-01
Monocytes are circulating macrophage and dendritic cell precursors that populate healthy and diseased tissue. In humans, monocytes consist of at least two subsets whose proportions in the blood fluctuate in response to coronary artery disease, sepsis, and viral infection. Animal studies have shown that specific shifts in the monocyte subset repertoire either exacerbate or attenuate disease, suggesting a role for monocyte subsets as biomarkers and therapeutic targets. Assays are therefore needed that can selectively and rapidly enumerate monocytes and their subsets. This study shows that two major human monocyte subsets express similar levels of the receptor for macrophage colony stimulating factor (MCSFR) but differ in their phagocytic capacity. We exploit these properties and custom-engineer magnetic nanoparticles for ex vivo sensing of monocytes and their subsets. We present a two-dimensional enumerative mathematical model that simultaneously reports number and proportion of monocyte subsets in a small volume of human blood. Using a recently described diagnostic magnetic resonance (DMR) chip with 1 µl sample size and high throughput capabilities, we then show that application of the model accurately quantifies subset fluctuations that occur in patients with atherosclerosis. PMID:19461894
Wang, LiQiang; Li, CuiFeng
2014-10-01
A genetic algorithm (GA) coupled with multiple linear regression (MLR) was used to extract useful features from amino acids and g-gap dipeptides for distinguishing between thermophilic and non-thermophilic proteins. The method was trained by a benchmark dataset of 915 thermophilic and 793 non-thermophilic proteins. The method reached an overall accuracy of 95.4 % in a Jackknife test using nine amino acids, 38 0-gap dipeptides and 29 1-gap dipeptides. The accuracy as a function of protein size ranged between 85.8 and 96.9 %. The overall accuracies of three independent tests were 93, 93.4 and 91.8 %. The observed results of detecting thermophilic proteins suggest that the GA-MLR approach described herein should be a powerful method for selecting features that describe thermostable proteins and an aid in the design of more stable proteins.
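The g-gap dipeptide features mentioned above (residue pairs separated by g intervening positions) can be computed with a short sketch; the function name is hypothetical and this is only an illustration of the feature type, not the authors' code.

```python
def ggap_dipeptide_freq(seq, g=0):
    """Frequencies of g-gap dipeptides: pairs of residues at
    positions i and i+g+1 (g=0 gives ordinary dipeptides)."""
    pairs = {}
    n = len(seq) - g - 1          # number of (i, i+g+1) pairs
    for i in range(n):
        key = seq[i] + seq[i + g + 1]
        pairs[key] = pairs.get(key, 0) + 1
    return {k: v / n for k, v in pairs.items()}
```

For example, for the sequence `ACAC` the 0-gap dipeptides are `AC`, `CA`, `AC`, while the 1-gap dipeptides are `AA` and `CC`.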
Max-AUC Feature Selection in Computer-Aided Detection of Polyps in CT Colonography
Xu, Jian-Wu; Suzuki, Kenji
2014-01-01
We propose a feature selection method based on a sequential forward floating selection (SFFS) procedure to improve the performance of a classifier in computerized detection of polyps in CT colonography (CTC). The feature selection method is coupled with a nonlinear support vector machine (SVM) classifier. Unlike the conventional linear method based on Wilks' lambda, the proposed method selected the most relevant features that would maximize the area under the receiver operating characteristic curve (AUC), which directly maximizes classification performance, evaluated based on AUC value, in the computer-aided detection (CADe) scheme. We presented two variants of the proposed method with different stopping criteria used in the SFFS procedure. The first variant searched all feature combinations allowed in the SFFS procedure and selected the subsets that maximize the AUC values. The second variant performed a statistical test at each step during the SFFS procedure, and it was terminated if the increase in the AUC value was not statistically significant. The advantage of the second variant is its lower computational cost. To test the performance of the proposed method, we compared it against the popular stepwise feature selection method based on Wilks' lambda for a colonic-polyp database (25 polyps and 2624 nonpolyps). We extracted 75 morphologic, gray-level-based, and texture features from the segmented lesion candidate regions. The two variants of the proposed feature selection method chose 29 and 7 features, respectively. Two SVM classifiers trained with these selected features yielded a 96% by-polyp sensitivity at false-positive (FP) rates of 4.1 and 6.5 per patient, respectively. Experiments showed a significant improvement in the performance of the classifier with the proposed feature selection method over that with the popular stepwise feature selection based on Wilks' lambda that yielded 18.0 FPs per patient at the same sensitivity level. PMID:24608058
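The SFFS-with-AUC idea above can be sketched compactly. AUC is computed via the Mann-Whitney statistic; a plain sum-of-features score stands in for the paper's SVM output, and a step cap is added as a safety guard against the oscillation that floating steps can cause. Function names are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sffs_auc(X, y, k, max_steps=50):
    """Sequential forward floating selection maximizing AUC."""
    def score(feats):
        return auc([sum(row[j] for j in feats) for row in X], y)
    selected, steps = [], 0
    while len(selected) < k and steps < max_steps:
        steps += 1
        # forward step: add the feature that most improves the AUC
        rest = [f for f in range(len(X[0])) if f not in selected]
        best = max(rest, key=lambda f: score(selected + [f]))
        selected.append(best)
        # floating step: drop an earlier feature if removal helps
        for f in [g for g in selected if g != best]:
            if score([g for g in selected if g != f]) > score(selected):
                selected.remove(f)
    return selected
```

The statistical-test stopping criterion of the second variant would replace the fixed `k` with a significance check on the AUC increase at each forward step.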
2010-09-01
matrix is used in many methods, like Jacobi or Gauss-Seidel, for solving linear systems. Also, no partial pivoting is necessary for a strictly column... problems that arise during the procedure, which, in general, converges to the solving of a linear system. The most common issue with the solution is the... iterative procedure to find an appropriate subset of parameters that produce an optimal solution, commonly known as forward selection. Then, the
Selecting informative subsets of sparse supermatrices increases the chance to find correct trees.
Misof, Bernhard; Meyer, Benjamin; von Reumont, Björn Marcus; Kück, Patrick; Misof, Katharina; Meusemann, Karen
2013-12-03
Character matrices with extensive missing data are frequently used in phylogenomics with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. Drawbacks of these selections are their exclusive reliance on data coverage without consideration of actual signal in the data, which might thus not deliver optimal data matrices in terms of potential phylogenetic signal. In order to circumvent this problem, we have developed a heuristic, implemented in the software mare, which (1) assesses the information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10-30%. With these matrices, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. A selection of a data subset with the herein proposed approach increased the chance to recover correct partial trees more than 10-fold. The selection of data subsets with the proposed simple hill climbing procedure performed well whether it considered the information content of genes or just simple presence/absence information. We also applied our approach to an empirical data set addressing questions of vertebrate systematics. With this empirical dataset, selecting a data subset with high information content that supports a tree with high average bootstrap support was most successful when the information content of genes was considered. 
Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis outperforming the usually used simple selections of taxa and genes with high data coverage.
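The hill-climbing reduction described above can be sketched in a few lines. For brevity, data coverage alone stands in for the paper's combined signal-plus-coverage information measure, and all names are hypothetical; the greedy structure (repeatedly drop whichever gene most improves the score) is the point of the sketch.

```python
def density(matrix):
    """Fraction of non-missing cells (None marks missing data)."""
    cells = [c for row in matrix for c in row]
    return sum(c is not None for c in cells) / len(cells)

def reduce_matrix(matrix, target):
    """Greedy hill climbing over gene (column) subsets: drop the
    gene whose removal most increases coverage, until `target`
    coverage is reached or one gene remains."""
    cols = list(range(len(matrix[0])))
    def coverage(kept):
        return density([[row[c] for c in kept] for row in matrix])
    while len(cols) > 1 and coverage(cols) < target:
        drop = max(cols, key=lambda j: coverage([c for c in cols if c != j]))
        cols = [c for c in cols if c != drop]
    return cols
```

On a 3-taxon, 3-gene toy matrix where the third gene is mostly missing, the procedure discards exactly that gene.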
An active learning representative subset selection method using net analyte signal.
He, Zhonghai; Ma, Zhenhe; Luan, Jingmin; Cai, Xi
2018-05-05
To guarantee accurate predictions, representative samples are needed when building a calibration model for spectroscopic measurements. However, in general, it is not known whether a sample is representative prior to measuring its concentration, which is both time-consuming and expensive. In this paper, a method to determine whether a sample should be selected into a calibration set is presented. The selection is based on the difference of Euclidean norm of net analyte signal (NAS) vector between the candidate and existing samples. First, the concentrations and spectra of a group of samples are used to compute the projection matrix, NAS vector, and scalar values. Next, the NAS vectors of candidate samples are computed by multiplying projection matrix with spectra of samples. Scalar value of NAS is obtained by norm computation. The distance between the candidate set and the selected set is computed, and samples with the largest distance are added to selected set sequentially. Last, the concentration of the analyte is measured such that the sample can be used as a calibration sample. Using a validation test, it is shown that the presented method is more efficient than random selection. As a result, the amount of time and money spent on reference measurements is greatly reduced. Copyright © 2018 Elsevier B.V. All rights reserved.
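The NAS-based distance criterion above can be sketched as follows. The orthogonal projection is done by Gram-Schmidt rather than an explicit projection matrix, and all function names are hypothetical; this is an illustration of the selection rule, not the authors' implementation.

```python
def nas_norm(spectrum, interferents):
    """Norm of the net analyte signal: the component of `spectrum`
    orthogonal to the span of the interferent spectra."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    basis = []
    for v in interferents:                 # Gram-Schmidt orthonormalization
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = dot(w, w) ** 0.5
        if n > 1e-12:
            basis.append([wi / n for wi in w])
    r = list(spectrum)
    for b in basis:                        # remove interferent components
        c = dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return dot(r, r) ** 0.5

def pick_next(candidates, selected_norms, interferents):
    """Pick the candidate whose NAS norm lies farthest from the
    NAS norms of the already-selected calibration samples."""
    def dist(s):
        n = nas_norm(s, interferents)
        return min((abs(n - m) for m in selected_norms), default=n)
    return max(range(len(candidates)), key=lambda i: dist(candidates[i]))
```

Candidates are added sequentially: after each pick, its NAS norm joins `selected_norms` and the rule is applied again.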
RELAX: detecting relaxed selection in a phylogenetic framework.
Wertheim, Joel O; Murrell, Ben; Smith, Martin D; Kosakovsky Pond, Sergei L; Scheffler, Konrad
2015-03-01
Relaxation of selective strength, manifested as a reduction in the efficiency or intensity of natural selection, can drive evolutionary innovation and presage lineage extinction or loss of function. Mechanisms through which selection can be relaxed range from the removal of an existing selective constraint to a reduction in effective population size. Standard methods for estimating the strength and extent of purifying or positive selection from molecular sequence data are not suitable for detecting relaxed selection, because they lack power and can mistake an increase in the intensity of positive selection for relaxation of both purifying and positive selection. Here, we present a general hypothesis testing framework (RELAX) for detecting relaxed selection in a codon-based phylogenetic framework. Given two subsets of branches in a phylogeny, RELAX can determine whether selective strength was relaxed or intensified in one of these subsets relative to the other. We establish the validity of our test via simulations and show that it can distinguish between increased positive selection and a relaxation of selective strength. We also demonstrate the power of RELAX in a variety of biological scenarios where relaxation of selection has been hypothesized or demonstrated previously. We find that obligate and facultative γ-proteobacteria endosymbionts of insects are under relaxed selection compared with their free-living relatives and obligate endosymbionts are under relaxed selection compared with facultative endosymbionts. Selective strength is also relaxed in asexual Daphnia pulex lineages, compared with sexual lineages. Endogenous, nonfunctional, bornavirus-like elements are found to be under relaxed selection compared with exogenous Borna viruses. 
Finally, selection on the short-wavelength sensitive, SWS1, opsin genes in echolocating and nonecholocating bats is relaxed only in lineages in which this gene underwent pseudogenization; however, selection on the functional medium/long-wavelength sensitive opsin, M/LWS1, is found to be relaxed in all echolocating bats compared with nonecholocating bats. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Texture analysis of tissues in Gleason grading of prostate cancer
NASA Astrophysics Data System (ADS)
Alexandratou, Eleni; Yova, Dido; Gorpas, Dimitris; Maragos, Petros; Agrogiannis, George; Kavantzas, Nikolaos
2008-02-01
Prostate cancer is a common malignancy among aging men and the second leading cause of cancer death in the USA. Histopathological grading of prostate cancer is based on tissue structural abnormalities. The Gleason grading system is the gold standard and is based on the organizational features of prostatic glands. Although the Gleason score has contributed to cancer prognosis and treatment planning, its accuracy is about 58%, and is even lower for GG2, GG3 and GG5 grading. Moreover, it is strongly affected by inter- and intra-observer variation, making the whole process very subjective. Therefore, there is a need for grading tools based on imaging and computer vision techniques for a more accurate prostate cancer prognosis. The aim of this paper is the development of a novel method for objective grading of biopsy specimens in order to support histopathological prognosis of the tumor. This new method is based on texture analysis techniques, and particularly on the Gray Level Co-occurrence Matrix (GLCM), which estimates image properties related to second-order statistics. Histopathological images of prostate cancer, from Gleason grade 2 to Gleason grade 5, were acquired and subjected to image texture analysis. Thirteen texture characteristics, as proposed by Haralick, were calculated from this matrix. Using stepwise variable selection, a subset of four characteristics was selected and used for the description and classification of each image field. The selected characteristics profile was used for grading the specimen with the multiparameter statistical method of multiple logistic discrimination analysis. The subset of these characteristics provided 87% correct grading of the specimens. The addition of any of the remaining characteristics did not significantly improve the diagnostic ability of the method. 
This study demonstrated that texture analysis techniques could provide valuable grading decision support to pathologists concerning prostate cancer prognosis.
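A GLCM and one Haralick statistic (contrast) can be computed with a small sketch; this is a generic illustration of the second-order statistics involved, with hypothetical names, not the study's pipeline.

```python
def glcm(img, dx=1, dy=0, levels=4):
    """Normalized gray-level co-occurrence matrix for offset (dx, dy):
    p[i][j] is the probability that a pixel of level i has a neighbor
    of level j at the given offset."""
    h, w = len(img), len(img[0])
    m = [[0.0] * levels for _ in range(levels)]
    total = 0
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                m[img[y][x]][img[y2][x2]] += 1
                total += 1
    return [[v / total for v in row] for row in m]

def contrast(p):
    """Haralick contrast: sum over p[i][j] * (i - j)**2."""
    return sum(p[i][j] * (i - j) ** 2
               for i in range(len(p)) for j in range(len(p)))
```

A vertically striped image yields high contrast for a horizontal offset, while a flat image yields zero, which is the kind of discriminative behavior Haralick features exploit.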
Hash Bit Selection for Nearest Neighbor Search.
Xianglong Liu; Junfeng He; Shih-Fu Chang
2017-11-01
To overcome the barrier of storage and computation when dealing with gigantic-scale data sets, compact hashing has been studied extensively to approximate the nearest neighbor search. Despite the recent advances, critical design issues remain open in how to select the right features, hashing algorithms, and/or parameter settings. In this paper, we address these by posing an optimal hash bit selection problem, in which an optimal subset of hash bits is selected from a pool of candidate bits generated by different features, algorithms, or parameters. Inspired by the optimization criteria used in existing hashing algorithms, we adopt the bit reliability and their complementarity as the selection criteria that can be carefully tailored for hashing performance in different tasks. Then, the bit selection solution is discovered by finding the best tradeoff between search accuracy and time using a modified dynamic programming method. To further reduce the computational complexity, we employ the pairwise relationship among hash bits to approximate the high-order independence property, and formulate it as an efficient quadratic programming method that is theoretically equivalent to the normalized dominant set problem in a vertex- and edge-weighted graph. Extensive large-scale experiments have been conducted under several important application scenarios of hash techniques, where our bit selection framework can achieve superior performance over both the naive selection methods and the state-of-the-art hashing algorithms, with significant relative accuracy gains ranging from 10% to 50%.
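The reliability-plus-complementarity criterion above can be illustrated with a simple greedy sketch; a greedy trade-off stands in for the paper's quadratic-programming / dominant-set formulation, and all names are hypothetical.

```python
def select_bits(reliability, similarity, k):
    """Greedy hash bit selection: at each step pick the bit whose
    reliability minus its average pairwise similarity to the
    already-chosen bits (redundancy penalty) is largest."""
    chosen = []
    while len(chosen) < k:
        def gain(b):
            penalty = (sum(similarity[b][c] for c in chosen) / len(chosen)
                       if chosen else 0.0)
            return reliability[b] - penalty
        best = max((b for b in range(len(reliability)) if b not in chosen),
                   key=gain)
        chosen.append(best)
    return chosen
```

Given a highly reliable bit, a near-duplicate of it, and a weaker but independent bit, the rule keeps the reliable bit and the independent one, skipping the redundant near-duplicate.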
Generating a Simulated Fluid Flow over a Surface Using Anisotropic Diffusion
NASA Technical Reports Server (NTRS)
Rodriguez, David L. (Inventor); Sturdza, Peter (Inventor)
2016-01-01
A fluid-flow simulation over a computer-generated surface is generated using a diffusion technique. The surface is comprised of a surface mesh of polygons. A boundary-layer fluid property is obtained for a subset of the polygons of the surface mesh. A gradient vector is determined for a selected polygon, the selected polygon belonging to the surface mesh but not one of the subset of polygons. A maximum and minimum diffusion rate is determined along directions determined using the gradient vector corresponding to the selected polygon. A diffusion-path vector is defined between a point in the selected polygon and a neighboring point in a neighboring polygon. An updated fluid property is determined for the selected polygon using a variable diffusion rate, the variable diffusion rate based on the minimum diffusion rate, maximum diffusion rate, and the gradient vector.
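The variable-rate diffusion step above can be sketched on a simple edge list. The mapping from gradient alignment to diffusion rate here (blend between the minimum and maximum rates by the cosine of the angle between edge direction and gradient) is an assumption for illustration; the patent defines the rates along gradient-derived directions without specifying this exact blend, and all names are hypothetical.

```python
def diffuse(values, edges, dirs, grad, d_min, d_max, dt=0.1):
    """One anisotropic diffusion step on a mesh graph: the rate on
    each edge interpolates d_min..d_max by how well the edge
    direction aligns with the local gradient vector."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    norm = lambda a: sum(x * x for x in a) ** 0.5
    out = list(values)
    for (i, j), d in zip(edges, dirs):
        align = abs(dot(d, grad)) / (norm(d) * norm(grad) or 1.0)
        rate = d_min + (d_max - d_min) * align   # anisotropic rate
        flux = dt * rate * (values[j] - values[i])
        out[i] += flux                            # conservative exchange
        out[j] -= flux
    return out
```

With this blend, an edge parallel to the gradient diffuses at the maximum rate, while a perpendicular edge diffuses at the minimum rate, so the property smears preferentially along one direction.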
Liang, Ja-Der; Ping, Xiao-Ou; Tseng, Yi-Ju; Huang, Guan-Tarn; Lai, Feipei; Yang, Pei-Ming
2014-12-01
Recurrence of hepatocellular carcinoma (HCC) is an important issue despite effective treatments with tumor eradication. Identification of patients who are at high risk for recurrence may provide more efficacious screening and detection of tumor recurrence. The aim of this study was to develop recurrence predictive models for HCC patients who received radiofrequency ablation (RFA) treatment. From January 2007 to December 2009, 83 newly diagnosed HCC patients receiving RFA as their first treatment were enrolled. Five feature selection methods including genetic algorithm (GA), simulated annealing (SA) algorithm, random forests (RF) and hybrid methods (GA+RF and SA+RF) were utilized for selecting an important subset of features from a total of 16 clinical features. These feature selection methods were combined with support vector machine (SVM) for developing predictive models with better performance. Five-fold cross-validation was used to train and test SVM models. The developed SVM-based predictive models with hybrid feature selection methods and 5-fold cross-validation achieved average sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and area under the ROC curve of 67%, 86%, 82%, 69%, 90%, and 0.69, respectively. The SVM-derived predictive model can identify patients at high risk of recurrence, who should be closely followed up after complete RFA treatment. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Efficient feature selection using a hybrid algorithm for the task of epileptic seizure detection
NASA Astrophysics Data System (ADS)
Lai, Kee Huong; Zainuddin, Zarita; Ong, Pauline
2014-07-01
Feature selection is a very important aspect of machine learning. It entails searching for an optimal subset within a very large data set with a high-dimensional feature space. Apart from eliminating redundant features and reducing computational cost, a good selection of features also leads to higher prediction and classification accuracy. In this paper, an efficient feature selection technique is introduced in the task of epileptic seizure detection. The raw data are electroencephalography (EEG) signals. Using discrete wavelet transform, the biomedical signals were decomposed into several sets of wavelet coefficients. To reduce the dimension of these wavelet coefficients, a feature selection method that combines the strength of both filter and wrapper methods is proposed. Principal component analysis (PCA) is used as part of the filter method. As for the wrapper method, the evolutionary harmony search (HS) algorithm is employed. This metaheuristic method aims at finding the best discriminating set of features from the original data. The obtained features were then used as input for an automated classifier, namely wavelet neural networks (WNNs). The WNNs model was trained to perform a binary classification task, that is, to determine whether a given EEG signal was normal or epileptic. For comparison purposes, different sets of features were also used as input. Simulation results showed that the WNNs that used the features chosen by the hybrid algorithm achieved the highest overall classification accuracy.
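The harmony search wrapper above can be sketched for binary feature masks. This is a generic, hedged illustration: the fitness function is supplied by the caller (in the paper it would be WNN classification accuracy), and all names and parameter values are hypothetical.

```python
import random

def harmony_search_select(fitness, n_feat, hms=8, hmcr=0.9, par=0.3,
                          iters=200, seed=1):
    """Binary harmony search: each 'harmony' is a feature mask.
    New harmonies draw bits from memory with rate hmcr (with a
    pitch-adjust flip at rate par) or at random otherwise; the
    worst memory entry is replaced when improved upon."""
    rng = random.Random(seed)
    memory = [[rng.randint(0, 1) for _ in range(n_feat)]
              for _ in range(hms)]
    for _ in range(iters):
        new = []
        for j in range(n_feat):
            if rng.random() < hmcr:
                bit = rng.choice(memory)[j]   # memory consideration
                if rng.random() < par:
                    bit ^= 1                  # pitch adjustment = bit flip
            else:
                bit = rng.randint(0, 1)       # random consideration
            new.append(bit)
        worst = min(range(hms), key=lambda i: fitness(memory[i]))
        if fitness(new) > fitness(memory[worst]):
            memory[worst] = new               # keep the better harmonies
    return max(memory, key=fitness)
```

With a toy fitness that counts agreement with a known target mask, the search recovers the target exactly.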
1976-07-01
PURDUE UNIVERSITY, DEPARTMENT OF STATISTICS, DIVISION OF MATHEMATICAL SCIENCES. ON SUBSET SELECTION PROCEDURES FOR POISSON PROCESSES AND SOME... Department of Mathematical Sciences Mimeograph Series #457, July 1976. This research was supported by the Office of Naval Research under Contract N00014-75-C-0455 at Purdue University.
NASA Astrophysics Data System (ADS)
Rao, Xiong; Tang, Yunwei
2014-11-01
Land surface deformation evidently exists along a newly-built high-speed railway in the southeast of China. In this study, we utilize the Small BAseline Subsets (SBAS)-Differential Synthetic Aperture Radar Interferometry (DInSAR) technique to detect land surface deformation along the railway. In this work, 40 Cosmo-SkyMed satellite images were selected to analyze the spatial distribution and velocity of the deformation in the study area. 88 image pairs with high coherence were first chosen with an appropriate threshold. These images were used to deduce the deformation velocity map and the variation in time series. This result can provide information for orbit correction and ground control point (GCP) selection in the following steps. Then, more image pairs were selected to tighten the constraint in the time dimension and to improve the final result by decreasing the phase unwrapping error. 171 combinations of SAR pairs were ultimately selected. Reliable GCPs were re-selected according to the previously derived deformation velocity map. Orbital residual error was rectified using these GCPs, and nonlinear deformation components were estimated. Therefore, a more accurate surface deformation velocity map was produced. Precise geodetic leveling work was implemented in the meantime. We compared the leveling result with the geocoded SBAS product using the nearest neighbour method. The mean error and standard deviation of the error were 0.82 mm and 4.17 mm, respectively. This result demonstrates the effectiveness of the DInSAR technique for monitoring land surface deformation, which can reliably support high-speed railway project design, construction, operation and maintenance.
Morales, Dinora Araceli; Bengoetxea, Endika; Larrañaga, Pedro; García, Miguel; Franco, Yosu; Fresnada, Mónica; Merino, Marisa
2008-05-01
In vitro fertilization (IVF) is a medically assisted reproduction technique that enables infertile couples to achieve successful pregnancy. Given the uncertainty of the treatment, we propose an intelligent decision support system based on supervised classification by Bayesian classifiers to aid in the selection of the most promising embryos that will form the batch to be transferred to the woman's uterus. The aim of the supervised classification system is to improve the overall success rate of each IVF treatment in which a batch of embryos is transferred each time, where success is achieved when implantation (i.e. pregnancy) is obtained. For ethical reasons, legislative restrictions on this technique differ between countries. In Spain, legislation allows a maximum of three embryos to form each transfer batch. As a result, clinicians prefer to select the embryos by non-invasive embryo examination based on simple methods and observation focused on morphology and dynamics of embryo development after fertilization. This paper proposes the application of Bayesian classifiers to this embryo selection problem in order to provide a decision support system that allows a more accurate selection than the actual procedures, which fully rely on the expertise and experience of embryologists. For this, we propose to take into consideration a reduced subset of feature variables related to embryo morphology and clinical data of patients, and from this data to induce Bayesian classification models. Results obtained applying a filter technique to choose the subset of variables, and the performance of Bayesian classifiers using them, are presented.
Lee, Kyu Ha; Tadesse, Mahlet G; Baccarelli, Andrea A; Schwartz, Joel; Coull, Brent A
2017-03-01
The analysis of multiple outcomes is becoming increasingly common in modern biomedical studies. It is well-known that joint statistical models for multiple outcomes are more flexible and more powerful than fitting a separate model for each outcome; they yield more powerful tests of exposure or treatment effects by taking into account the dependence among outcomes and pooling evidence across outcomes. It is, however, unlikely that all outcomes are related to the same subset of covariates. Therefore, there is interest in identifying exposures or treatments associated with particular outcomes, which we term outcome-specific variable selection. In this work, we propose a variable selection approach for multivariate normal responses that incorporates not only information on the mean model, but also information on the variance-covariance structure of the outcomes. The approach effectively leverages evidence from all correlated outcomes to estimate the effect of a particular covariate on a given outcome. To implement this strategy, we develop a Bayesian method that builds a multivariate prior for the variable selection indicators based on the variance-covariance of the outcomes. We show via simulation that the proposed variable selection strategy can boost power to detect subtle effects without increasing the probability of false discoveries. We apply the approach to the Normative Aging Study (NAS) epigenetic data and identify a subset of five genes in the asthma pathway for which gene-specific DNA methylations are associated with exposures to either black carbon, a marker of traffic pollution, or sulfate, a marker of particles generated by power plants. © 2016, The International Biometric Society.
Profiling dendritic cell subsets in head and neck squamous cell tonsillar cancer and benign tonsils.
Abolhalaj, Milad; Askmyr, David; Sakellariou, Christina Alexandra; Lundberg, Kristina; Greiff, Lennart; Lindstedt, Malin
2018-05-23
Dendritic cells (DCs) have a key role in orchestrating immune responses and are considered important targets for immunotherapy against cancer. In order to develop effective cancer vaccines, detailed knowledge of the micromilieu in cancer lesions is warranted. In this study, flow cytometry and human transcriptome arrays were used to characterize subsets of DCs in head and neck squamous cell tonsillar cancer and compare them to their counterparts in benign tonsils to evaluate subset-selective biomarkers associated with tonsillar cancer. We describe, for the first time, four subsets of DCs in tonsillar cancer: CD123+ plasmacytoid DCs (pDCs), and CD1c+, CD141+, and CD1c-CD141- myeloid DCs (mDCs). An increased frequency of DCs and an elevated mDC/pDC ratio were shown in malignant compared to benign tonsillar tissue. The microarray data demonstrate characteristics specific to tonsillar cancer DC subsets, including expression of immunosuppressive molecules and lower expression levels of genes involved in the development of effector immune responses in DCs in malignant tonsillar tissue, compared to their counterparts in benign tonsillar tissue. Finally, we present target candidates selectively expressed by different DC subsets in malignant tonsils and confirm expression of CD206/MRC1 and CD207/Langerin on CD1c+ DCs at the protein level. This study describes DC characteristics in the context of head and neck cancer and adds valuable steps towards future DC-based therapies against tonsillar cancer.
System, method and apparatus for conducting a keyterm search
NASA Technical Reports Server (NTRS)
McGreevy, Michael W. (Inventor)
2004-01-01
A keyterm search is a method of searching a database for subsets of the database that are relevant to an input query. First, a number of relational models of subsets of a database are provided. A query is then input. The query can include one or more keyterms. Next, a gleaning model of the query is created. The gleaning model of the query is then compared to each one of the relational models of subsets of the database. The identifiers of the relevant subsets are then output.
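The abstract does not specify the internal structure of the relational or gleaning models; as a rough illustration of the comparison step only, the sketch below uses a toy relational model built from term co-occurrence counts and ranks database subsets by overlap with the query's model. All names, the windowed co-occurrence scheme, and the overlap score are assumptions, not the patent's actual method.

```python
from collections import Counter

def relational_model(text, window=3):
    """Toy relational model: counts of each term paired with the
    terms that follow it within a small window."""
    terms = text.lower().split()
    pairs = Counter()
    for i, t in enumerate(terms):
        for u in terms[i + 1:i + window]:
            pairs[(t, u)] += 1
    return pairs

def overlap(query_model, doc_model):
    """Shared pair counts between the query model and a subset model."""
    return sum(min(c, doc_model[p]) for p, c in query_model.items())

# Hypothetical database subsets (e.g., incident report excerpts)
subsets = {
    "d1": "engine fire warning light in the cockpit",
    "d2": "runway incursion reported by the tower",
}
models = {k: relational_model(v) for k, v in subsets.items()}
q = relational_model("engine fire warning")
ranked = sorted(models, key=lambda k: overlap(q, models[k]), reverse=True)
print(ranked)  # d1 shares the query's term relations
```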
System, method and apparatus for conducting a phrase search
NASA Technical Reports Server (NTRS)
McGreevy, Michael W. (Inventor)
2004-01-01
A phrase search is a method of searching a database for subsets of the database that are relevant to an input query. First, a number of relational models of subsets of a database are provided. A query is then input. The query can include one or more sequences of terms. Next, a relational model of the query is created. The relational model of the query is then compared to each one of the relational models of subsets of the database. The identifiers of the relevant subsets are then output.
Kesharaju, Manasa; Nagarajah, Romesh
2015-09-01
The motivation for this research stems from the need for a non-destructive testing method capable of detecting and locating defects and microstructural variations within armour ceramic components before issuing them to the soldiers who rely on them for their survival. The development of an automated classification system based on ultrasonic inspection would make it possible to check each ceramic component and immediately alert the operator to the presence of defects. In many classification problems the choice of features, or dimensionality reduction, is both significant and difficult, as a substantial computational effort is required to evaluate possible feature subsets. In this research, a combination of artificial neural networks and genetic algorithms is used to optimize the feature subset used in the classification of various defects in reaction-sintered silicon carbide ceramic components. Initially, wavelet-based feature extraction is implemented from the region of interest. An artificial neural network classifier is employed to evaluate the performance of these features. Genetic algorithm-based feature selection is then performed. Principal Component Analysis, a popular technique for feature selection, is compared with the genetic algorithm-based technique in terms of classification accuracy and the selection of an optimal number of features. The experimental results confirm that features identified by Principal Component Analysis lead to better classification accuracy (96%) than those selected by the genetic algorithm (94%). Copyright © 2015 Elsevier B.V. All rights reserved.
Metabolomics analysis was performed on the supernatant of human embryonic stem (hES) cell cultures exposed to a blinded subset of 11 chemicals selected from the chemical library of EPA's ToxCast™ chemical screening and prioritization research project. Metabolites from hES cultur...
Mapping tropical rainforest canopies using multi-temporal spaceborne imaging spectroscopy
NASA Astrophysics Data System (ADS)
Somers, Ben; Asner, Gregory P.
2013-10-01
The use of imaging spectroscopy for floristic mapping of forests is complicated by the spectral similarity among coexisting species. Here we evaluated an alternative spectral unmixing strategy combining a time series of EO-1 Hyperion images with an automated feature selection strategy in MESMA. Instead of using the same spectral subset to unmix each image pixel, our modified approach allowed the spectral subsets to vary on a per-pixel basis, such that each pixel is evaluated using a spectral subset tuned towards maximal separability of its specific endmember class combination or species mixture. The potential of the new approach for floristic mapping of tree species in Hawaiian rainforests was quantitatively demonstrated using both simulated and actual hyperspectral image time series. With a Cohen's Kappa coefficient of 0.65, our approach provided a more accurate tree species map than standard MESMA (Kappa = 0.54). In addition, through the selection of spectral subsets, our approach was about 90% faster than MESMA. Such flexible, adaptive use of band sets in spectral unmixing provides an interesting avenue for addressing spectral similarities in complex vegetation canopies.
An efficient ensemble learning method for gene microarray classification.
Osareh, Alireza; Shadgar, Bita
2013-01-01
Gene microarray analysis and classification have proven effective for the diagnosis of diseases and cancers. However, it has also been shown that basic classification techniques have intrinsic drawbacks in achieving accurate gene classification and cancer diagnosis. Meanwhile, classifier ensembles have received increasing attention in various applications. Here, we address the gene classification issue using the RotBoost ensemble methodology. This method is a combination of the Rotation Forest and AdaBoost techniques and thus preserves both desirable features of an ensemble architecture, namely accuracy and diversity. To select a concise subset of informative genes, 5 different feature selection algorithms are considered. To assess the efficiency of RotBoost, other non-ensemble and ensemble techniques, including Decision Trees, Support Vector Machines, Rotation Forest, AdaBoost, and Bagging, are also deployed. Experimental results reveal that the combination of the fast correlation-based feature selection method with the ICA-based RotBoost ensemble is highly effective for gene classification. In fact, the proposed method can create ensemble classifiers which outperform not only the classifiers produced by conventional machine learning but also the classifiers generated by two widely used ensemble learning methods, namely Bagging and AdaBoost.
Biagiotti, R; Desii, C; Vanzi, E; Gacci, G
1999-02-01
To compare the performance of artificial neural networks (ANNs) with that of multiple logistic regression (MLR) models for predicting ovarian malignancy in patients with adnexal masses by using transvaginal B-mode and color Doppler flow ultrasonography (US). A total of 226 adnexal masses were examined before surgery: fifty-one were malignant and 175 were benign. The data were divided into training and testing subsets by using a "leave-n-out" method. The training subsets were used to compute the optimum MLR equations and to train the ANNs. The cross-validation subsets were used to estimate the performance of each of the two models in predicting ovarian malignancy. At testing, three-layer back-propagation networks, based on the same input variables selected by using MLR (i.e., women's ages, papillary projections, random echogenicity, peak systolic velocity, and resistance index), had a significantly higher sensitivity than did MLR (96% vs 84%; McNemar test, P = .04). The Brier scores for ANNs were significantly lower than those calculated for MLR (Student t test for paired samples, P = .004). ANNs might have potential for categorizing adnexal masses as either malignant or benign on the basis of multiple variables related to demographic and US features.
The public cancer radiology imaging collections of The Cancer Imaging Archive.
Prior, Fred; Smith, Kirk; Sharma, Ashish; Kirby, Justin; Tarbox, Lawrence; Clark, Ken; Bennett, William; Nolan, Tracy; Freymann, John
2017-09-19
The Cancer Imaging Archive (TCIA) is the U.S. National Cancer Institute's repository for cancer imaging and related information. TCIA contains 30.9 million radiology images representing data collected from approximately 37,568 subjects. This data is organized into collections by tumor-type with many collections also including analytic results or clinical data. TCIA staff carefully de-identify and curate all incoming collections prior to making the information available via web browser or programmatic interfaces. Each published collection within TCIA is assigned a Digital Object Identifier that references the collection. Additionally, researchers who use TCIA data may publish the subset of information used in their analysis by requesting a TCIA generated Digital Object Identifier. This data descriptor is a review of a selected subset of existing publicly available TCIA collections. It outlines the curation and publication methods employed by TCIA and makes available 15 collections of cancer imaging data.
Modulation Depth Estimation and Variable Selection in State-Space Models for Neural Interfaces
Hochberg, Leigh R.; Donoghue, John P.; Brown, Emery N.
2015-01-01
Rapid developments in neural interface technology are making it possible to record increasingly large signal sets of neural activity. Various factors such as asymmetrical information distribution and across-channel redundancy may, however, limit the benefit of high-dimensional signal sets, and the increased computational complexity may not yield corresponding improvement in system performance. High-dimensional system models may also lead to overfitting and lack of generalizability. To address these issues, we present a generalized modulation depth measure using the state-space framework that quantifies the tuning of a neural signal channel to relevant behavioral covariates. For a dynamical system, we develop computationally efficient procedures for estimating modulation depth from multivariate data. We show that this measure can be used to rank neural signals and select an optimal channel subset for inclusion in the neural decoding algorithm. We present a scheme for choosing the optimal subset based on model order selection criteria. We apply this method to neuronal ensemble spike-rate decoding in neural interfaces, using our framework to relate motor cortical activity with intended movement kinematics. With offline analysis of intracortical motor imagery data obtained from individuals with tetraplegia using the BrainGate neural interface, we demonstrate that our variable selection scheme is useful for identifying and ranking the most information-rich neural signals. We demonstrate that our approach offers several orders of magnitude lower complexity but virtually identical decoding performance compared to greedy search and other selection schemes. Our statistical analysis shows that the modulation depth of human motor cortical single-unit signals is well characterized by the generalized Pareto distribution. Our variable selection scheme has wide applicability in problems involving multisensor signal modeling and estimation in biomedical engineering systems. 
PMID:25265627
Guo, Song; Liu, Chunhua; Zhou, Peng; Li, Yanling
2016-01-01
Tyrosine sulfation is one of the ubiquitous protein posttranslational modifications, in which sulfate groups are added to tyrosine residues. It plays significant roles in various physiological processes in eukaryotic cells. To explore the molecular mechanism of tyrosine sulfation, one of the prerequisites is to correctly identify possible protein tyrosine sulfation residues. In this paper, a novel method is presented to predict protein tyrosine sulfation residues from primary sequences. By means of informative feature construction and an elaborate feature selection and parameter optimization scheme, the proposed predictor achieved promising results and outperformed many other state-of-the-art predictors. Using the optimal feature subset, the proposed method achieved a mean MCC of 94.41% on the benchmark dataset and an MCC of 90.09% on the independent dataset. The experimental performance indicates that the new method could be effective in identifying important protein posttranslational modifications and that the feature selection scheme could prove powerful in protein functional residue prediction.
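The MCC figures quoted above come from the 2x2 confusion matrix of predicted versus true sulfation residues; a small reference implementation of the metric (the counts below are invented for illustration, not the paper's data):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal is empty (undefined denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative counts only: 90 true positives, 85 true negatives, etc.
value = mcc(90, 85, 5, 10)
print(round(value, 4))
```

MCC is well suited to residue prediction because sulfated sites are rare, and accuracy alone would reward the trivial all-negative predictor.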
Domino: Extracting, Comparing, and Manipulating Subsets across Multiple Tabular Datasets
Gratzl, Samuel; Gehlenborg, Nils; Lex, Alexander; Pfister, Hanspeter; Streit, Marc
2016-01-01
Answering questions about complex issues often requires analysts to take into account information contained in multiple interconnected datasets. A common strategy in analyzing and visualizing large and heterogeneous data is dividing it into meaningful subsets. Interesting subsets can then be selected and the associated data and the relationships between the subsets visualized. However, neither the extraction and manipulation nor the comparison of subsets is well supported by state-of-the-art techniques. In this paper we present Domino, a novel multiform visualization technique for effectively representing subsets and the relationships between them. By providing comprehensive tools to arrange, combine, and extract subsets, Domino allows users to create both common visualization techniques and advanced visualizations tailored to specific use cases. In addition to the novel technique, we present an implementation that enables analysts to manage the wide range of options that our approach offers. Innovative interactive features such as placeholders and live previews support rapid creation of complex analysis setups. We introduce the technique and the implementation using a simple example and demonstrate scalability and effectiveness in a use case from the field of cancer genomics. PMID:26356916
Intraclonal Cell Expansion and Selection Driven by B Cell Receptor in Chronic Lymphocytic Leukemia
Colombo, Monica; Cutrona, Giovanna; Reverberi, Daniele; Fabris, Sonia; Neri, Antonino; Fabbi, Marina; Quintana, Giovanni; Quarta, Giovanni; Ghiotto, Fabio; Fais, Franco; Ferrarini, Manlio
2011-01-01
The mutational status of the immunoglobulin heavy-chain variable region (IGHV) genes utilized by chronic lymphocytic leukemia (CLL) clones defines two disease subgroups. Patients with unmutated IGHV have a more aggressive disease and a worse outcome than patients with cells having somatic IGHV gene mutations. Moreover, up to 30% of the unmutated CLL clones exhibit very similar or identical B cell receptors (BcR), often encoded by the same IG genes. These “stereotyped” BcRs have been classified into defined subsets. The presence of an IGHV gene somatic mutation and the utilization of a skewed gene repertoire compared with normal B cells together with the expression of stereotyped receptors by unmutated CLL clones may indicate stimulation/selection by antigenic epitopes. This antigenic stimulation may occur prior to or during neoplastic transformation, but it is unknown whether this stimulation/selection continues after leukemogenesis has ceased. In this study, we focused on seven CLL cases with stereotyped BcR Subset #8 found among a cohort of 700 patients; in six, the cells expressed IgG and utilized IGHV4-39 and IGKV1-39/IGKV1D-39 genes, as reported for Subset #8 BcR. One case exhibited special features, including expression of IgM or IgG by different subclones consequent to an isotype switch, allelic inclusion at the IGH locus in the IgM-expressing cells and a particular pattern of cytogenetic lesions. Collectively, the data indicate a process of antigenic stimulation/selection of the fully transformed CLL cells leading to the expansion of the Subset #8 IgG-bearing subclone. PMID:21541442
Analyte separation utilizing temperature programmed desorption of a preconcentrator mesh
Linker, Kevin L.; Bouchier, Frank A.; Theisen, Lisa; Arakaki, Lester H.
2007-11-27
A method and system for controllably releasing contaminants from a contaminated porous metallic mesh by thermally desorbing and releasing a selected subset of contaminants, accomplished by rapidly raising the mesh to a pre-determined temperature step or plateau chosen beforehand to preferentially desorb a particular chemical species of interest, but not others. By providing a sufficiently long delay or dwell period between heating pulses, and by selecting the optimum plateau temperatures, different contaminant species can be controllably released in well-defined batches at different times to a chemical detector in gaseous communication with the mesh. For some detectors, such as an Ion Mobility Spectrometer (IMS), separating different species in time before they enter the IMS allows the detector to achieve enhanced selectivity.
NASA Technical Reports Server (NTRS)
Spirkovska, Liljana (Inventor)
2006-01-01
Method and system for automatically displaying, visually and/or audibly and/or by an audible alarm signal, relevant weather data for an identified aircraft pilot, when each of a selected subset of measured or estimated aviation situation parameters, corresponding to a given aviation situation, has a value lying in a selected range. Each range for a particular pilot may be a default range, may be entered by the pilot and/or may be automatically determined from experience and may be subsequently edited by the pilot to change a range and to add or delete parameters describing a situation for which a display should be provided. The pilot can also verbally activate an audible display or visual display of selected information by verbal entry of a first command or a second command, respectively, that specifies the information required.
Towards a Framework for Modeling Space Systems Architectures
NASA Technical Reports Server (NTRS)
Shames, Peter; Skipper, Joseph
2006-01-01
Topics covered include: 1) Statement of the problem: a) Space system architecture is complex; b) Existing terrestrial approaches must be adapted for space; c) Need a common architecture methodology and information model; d) Need appropriate set of viewpoints. 2) Requirements on a space systems model. 3) Model Based Engineering and Design (MBED) project: a) Evaluated different methods; b) Adapted and utilized RASDS & RM-ODP; c) Identified useful set of viewpoints; d) Did actual model exchanges among selected subset of tools. 4) Lessons learned & future vision.
Surveillance system and method having parameter estimation and operating mode partitioning
NASA Technical Reports Server (NTRS)
Bickford, Randall L. (Inventor)
2005-01-01
A system and method for monitoring an apparatus or process asset including creating a process model comprised of a plurality of process submodels each correlative to at least one training data subset partitioned from an unpartitioned training data set and each having an operating mode associated thereto; acquiring a set of observed signal data values from the asset; determining an operating mode of the asset for the set of observed signal data values; selecting a process submodel from the process model as a function of the determined operating mode of the asset; calculating a set of estimated signal data values from the selected process submodel for the determined operating mode; and determining asset status as a function of the calculated set of estimated signal data values for providing asset surveillance and/or control.
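The surveillance loop described in the claim (determine operating mode, select the matching submodel, estimate signals, assess status) can be sketched in a few lines; all mode rules, signal names, submodels, and the residual threshold below are hypothetical stand-ins, not the patent's actual partitioning:

```python
def monitor(submodels, reading, threshold=1.0):
    """Mode-partitioned surveillance sketch: classify the operating mode,
    estimate the expected signal with that mode's submodel, and flag the
    asset when the residual exceeds a threshold."""
    mode = "startup" if reading["rpm"] < 1000 else "steady"  # invented rule
    estimate = submodels[mode](reading)
    residual = abs(reading["temp"] - estimate)
    return {"mode": mode, "estimate": estimate,
            "status": "alarm" if residual > threshold else "ok"}

# Toy linear submodels, one per partitioned operating mode
submodels = {
    "startup": lambda r: 20 + 0.01 * r["rpm"],
    "steady":  lambda r: 25 + 0.005 * r["rpm"],
}
out1 = monitor(submodels, {"rpm": 500, "temp": 25.2})
out2 = monitor(submodels, {"rpm": 3000, "temp": 70.0})
print(out1)
print(out2)
```

The point of the partitioning is that a single global model would have to average over both regimes, inflating residuals in each; per-mode submodels keep the estimate tight.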
Stefan, Sabina; Schorr, Barbara; Lopez-Rolon, Alex; Kolassa, Iris-Tatjana; Shock, Jonathan P; Rosenfelder, Martin; Heck, Suzette; Bender, Andreas
2018-04-17
We applied the following methods to resting-state EEG data from patients with disorders of consciousness (DOC) for consciousness indexing and outcome prediction: microstates, entropy (i.e. approximate, permutation), power in alpha and delta frequency bands, and connectivity (i.e. weighted symbolic mutual information, symbolic transfer entropy, complex network analysis). Patients with unresponsive wakefulness syndrome (UWS) and patients in a minimally conscious state (MCS) were classified into these two categories by fitting and testing a generalised linear model. We aimed subsequently to develop an automated system for outcome prediction in severe DOC by selecting an optimal subset of features using sequential floating forward selection (SFFS). The two outcome categories were defined as UWS or dead, and MCS or emerged from MCS. Percentage of time spent in microstate D in the alpha frequency band performed best at distinguishing MCS from UWS patients. The average clustering coefficient obtained from thresholding beta coherence performed best at predicting outcome. The optimal subset of features selected with SFFS consisted of the frequency of microstate A in the 2-20 Hz frequency band, path length obtained from thresholding alpha coherence, and average path length obtained from thresholding alpha coherence. Combining these features seemed to afford high prediction power. Python and MATLAB toolboxes for the above calculations are freely available under the GNU public license for non-commercial use ( https://qeeg.wordpress.com ).
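Sequential floating forward selection greedily adds the best feature and then "floats" backwards, dropping any selected feature whose removal improves the score. A compact sketch, with an invented score function in which redundant features contribute nothing (the features and scores are illustrative, not the study's EEG measures):

```python
def sffs(features, score, k):
    """Sequential floating forward selection (sketch)."""
    selected = []
    while len(selected) < k:
        # forward step: add the single best remaining feature
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
        # floating step: drop a feature if that improves the score
        improved = True
        while improved and len(selected) > 1:
            improved = False
            for f in list(selected):
                reduced = [g for g in selected if g != f]
                if score(reduced) > score(selected):
                    selected = reduced
                    improved = True
    return selected

def score(subset):
    """Toy score: 'b' and 'b2' are redundant copies of one signal."""
    values = {"a": 3.0, "b": 2.0, "b2": 2.0, "c": 1.0}
    groups = {"a": "a", "b": "b", "b2": "b", "c": "c"}
    return sum(max(values[f] for f in subset if groups[f] == g)
               for g in {groups[f] for f in subset})

result = sffs(["a", "b", "b2", "c"], score, 3)
print(result)
```

Note how the redundant feature "b2" is skipped in favour of "c", the behaviour that makes SFFS preferable to pure forward selection when features are correlated, as EEG-derived measures typically are.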
Selecting climate simulations for impact studies based on multivariate patterns of climate change.
Mendlik, Thomas; Gobiet, Andreas
In climate change impact research it is crucial to carefully select the meteorological input for impact models. We present a method for model selection that enables the user to shrink the ensemble to a few representative members, conserving the model spread and accounting for model similarity. This is done in three steps: First, using principal component analysis for a multitude of meteorological parameters, to find common patterns of climate change within the multi-model ensemble. Second, detecting model similarities with regard to these multivariate patterns using cluster analysis. And third, sampling models from each cluster, to generate a subset of representative simulations. We present an application based on the ENSEMBLES regional multi-model ensemble with the aim to provide input for a variety of climate impact studies. We find that the two most dominant patterns of climate change relate to temperature and humidity patterns. The ensemble can be reduced from 25 to 5 simulations while still maintaining its essential characteristics. Having such a representative subset of simulations reduces computational costs for climate impact modeling and enhances the quality of the ensemble at the same time, as it prevents double-counting of dependent simulations that would lead to biased statistics. The online version of this article (doi:10.1007/s10584-015-1582-0) contains supplementary material, which is available to authorized users.
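The three-step pipeline (common change patterns, similarity clustering, sampling one member per cluster) can be approximated cheaply. The sketch below uses farthest-point seeding on invented two-dimensional climate-change signals as a simple stand-in for the PCA-plus-cluster analysis; it is not the paper's method, only an illustration of picking mutually dissimilar representatives:

```python
def representative_subset(changes, k):
    """Pick k mutually dissimilar ensemble members by farthest-point
    seeding (a cheap stand-in for cluster analysis + sampling)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    names = list(changes)
    dims = len(next(iter(changes.values())))
    mean = [sum(changes[n][i] for n in names) / len(names)
            for i in range(dims)]
    # seed with the member farthest from the ensemble-mean signal
    seeds = [max(names, key=lambda n: dist(changes[n], mean))]
    while len(seeds) < k:
        seeds.append(max(
            (n for n in names if n not in seeds),
            key=lambda n: min(dist(changes[n], changes[s]) for s in seeds)))
    return sorted(seeds)

# Hypothetical (dT, dP) climate-change signals for five models
changes = {
    "m1": (1.0, -5.0), "m2": (1.1, -4.8),   # warm-dry pair
    "m3": (3.0, 4.0),  "m4": (3.1, 4.2),    # hot-wet pair
    "m5": (2.0, 0.0),                        # middle of the road
}
subset = representative_subset(changes, 3)
print(subset)
```

Each near-duplicate pair contributes only one member, which is exactly the double-counting the paper warns would otherwise bias ensemble statistics.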
Generating a Simulated Fluid Flow Over an Aircraft Surface Using Anisotropic Diffusion
NASA Technical Reports Server (NTRS)
Rodriguez, David L. (Inventor); Sturdza, Peter (Inventor)
2013-01-01
A fluid-flow simulation over a computer-generated aircraft surface is generated using a diffusion technique. The surface is comprised of a surface mesh of polygons. A boundary-layer fluid property is obtained for a subset of the polygons of the surface mesh. A pressure-gradient vector is determined for a selected polygon, the selected polygon belonging to the surface mesh but not one of the subset of polygons. A maximum and minimum diffusion rate is determined along directions determined using a pressure gradient vector corresponding to the selected polygon. A diffusion-path vector is defined between a point in the selected polygon and a neighboring point in a neighboring polygon. An updated fluid property is determined for the selected polygon using a variable diffusion rate, the variable diffusion rate based on the minimum diffusion rate, maximum diffusion rate, and angular difference between the diffusion-path vector and the pressure-gradient vector.
Enhancing the Performance of LibSVM Classifier by Kernel F-Score Feature Selection
NASA Astrophysics Data System (ADS)
Sarojini, Balakrishnan; Ramaraj, Narayanasamy; Nickolas, Savarimuthu
Medical data mining is the search for relationships and patterns within medical datasets that could provide useful knowledge for effective clinical decisions. The inclusion of irrelevant, redundant, and noisy features in the process model results in poor predictive accuracy. Much research in data mining has gone into improving the predictive accuracy of classifiers by applying techniques of feature selection. Feature selection in medical data mining is valuable because the diagnosis of the disease can then be made in this patient-care activity with a minimum number of significant features. The objective of this work is to show that selecting the more significant features improves the performance of the classifier. We empirically evaluate the classification effectiveness of the LibSVM classifier on the reduced feature subset of a diabetes dataset. The evaluations suggest that the selected feature subset improves the predictive accuracy of the classifier and reduces false negatives and false positives.
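F-score filters rank each feature by between-class separation relative to within-class spread. A plain (non-kernel) version on invented feature columns, purely to show the ranking mechanics, not the paper's kernel F-score:

```python
def f_score(pos, neg):
    """Fisher-style F-score of one feature: squared deviations of the
    class means from the overall mean, over the summed class variances."""
    def mean(v):
        return sum(v) / len(v)
    def spread(v, mu):
        return sum((x - mu) ** 2 for x in v) / max(len(v) - 1, 1)
    m, mp, mn = mean(pos + neg), mean(pos), mean(neg)
    denom = spread(pos, mp) + spread(neg, mn)
    return ((mp - m) ** 2 + (mn - m) ** 2) / denom if denom else 0.0

# Hypothetical feature columns, split by class label
features = {
    "glucose": ([150, 160, 155], [100, 98, 102]),  # separates classes well
    "noise":   ([10, 50, 30],    [12, 48, 29]),    # overlapping classes
}
ranked = sorted(features, key=lambda f: f_score(*features[f]), reverse=True)
print(ranked)
```

Features scoring below a threshold would be dropped before training the SVM, which is the filter-style selection the abstract describes.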
Sherman, Stephen E; Kuljanin, Miljan; Cooper, Tyler T; Putman, David M; Lajoie, Gilles A; Hess, David A
2017-06-01
During culture expansion, multipotent mesenchymal stromal cells (MSCs) differentially express aldehyde dehydrogenase (ALDH), an intracellular detoxification enzyme that protects long-lived cells against oxidative stress. Thus, MSC selection based on ALDH activity may be used to reduce heterogeneity and distinguish MSC subsets with improved regenerative potency. After expansion of human bone marrow-derived MSCs, cell progeny was purified based on low versus high ALDH activity (ALDHlo versus ALDHhi) by fluorescence-activated cell sorting, and each subset was compared for multipotent stromal and provascular regenerative functions. Both ALDHlo and ALDHhi MSC subsets demonstrated similar expression of stromal cell (>95% CD73+, CD90+, CD105+) and pericyte (>95% CD146+) surface markers and showed multipotent differentiation into bone, cartilage, and adipose cells in vitro. Conditioned media (CDM) generated by ALDHhi MSCs demonstrated a potent proliferative and prosurvival effect on human microvascular endothelial cells (HMVECs) under serum-free conditions and augmented HMVEC tube-forming capacity in growth factor-reduced matrices. After subcutaneous transplantation within directed in vivo angiogenesis assay implants into immunodeficient mice, ALDHhi MSC or CDM produced by ALDHhi MSC significantly augmented murine vascular cell recruitment and perfused vessel infiltration compared with ALDHlo MSC.
Although both subsets demonstrated strikingly similar mRNA expression patterns, quantitative proteomic analyses performed on subset-specific CDM revealed that the ALDHhi MSC subset uniquely secreted multiple proangiogenic cytokines (vascular endothelial growth factor beta, platelet-derived growth factor alpha, and angiogenin) and actively produced multiple factors with chemoattractant (transforming growth factor-β; C-X-C motif chemokine ligands 1, 2, and 3 (GRO); C-C motif chemokine ligand 5 (RANTES); monocyte chemotactic protein 1 (MCP-1); interleukin [IL]-6; IL-8) and matrix-modifying functions (tissue inhibitors of metalloproteinase 1 and 2 (TIMP1/2)). Collectively, MSCs selected for ALDHhi demonstrated enhanced proangiogenic secretory functions and represent a purified MSC subset amenable to vascular regenerative applications. Stem Cells 2017;35:1542-1553. © 2017 AlphaMed Press.
Darmann, Andreas; Nicosia, Gaia; Pferschy, Ulrich; Schauer, Joachim
2014-03-16
In this work we address a game theoretic variant of the Subset Sum problem, in which two decision makers (agents/players) compete for the usage of a common resource represented by a knapsack capacity. Each agent owns a set of integer weighted items and wants to maximize the total weight of its own items included in the knapsack. The solution is built as follows: Each agent, in turn, selects one of its items (not previously selected) and includes it in the knapsack if there is enough capacity. The process ends when the remaining capacity is too small for including any item left. We look at the problem from a single agent point of view and show that finding an optimal sequence of items to select is an NP-hard problem. Therefore we propose two natural heuristic strategies and analyze their worst-case performance when (1) the opponent is able to play optimally and (2) the opponent adopts a greedy strategy. From a centralized perspective we observe that some known results on the approximation of the classical Subset Sum can be effectively adapted to the multi-agent version of the problem.
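The alternating selection process is easy to simulate. In the sketch below both agents follow a greedy strategy (pick the largest own item that still fits); the item weights and capacity are invented, and the paper's two heuristics and optimal-play analysis are not reproduced here:

```python
def greedy_game(items_a, items_b, capacity):
    """Alternating Subset Sum game with two greedy agents: each turn,
    the agent packs its largest remaining item that still fits; the
    game ends when both agents pass consecutively."""
    a, b = sorted(items_a, reverse=True), sorted(items_b, reverse=True)
    packed = {"A": 0, "B": 0}
    remaining = capacity
    turn, passes = "A", 0
    while passes < 2:
        pool = a if turn == "A" else b
        choice = next((w for w in pool if w <= remaining), None)
        if choice is None:
            passes += 1                 # agent cannot place anything
        else:
            passes = 0
            pool.remove(choice)
            packed[turn] += choice
            remaining -= choice
        turn = "B" if turn == "A" else "A"
    return packed, remaining

packed, remaining = greedy_game([7, 5, 1], [6, 4, 2], capacity=12)
print(packed, remaining)
```

Note that greedy play is not optimal in general: A's first pick of 7 blocks its own 5, which is the kind of self-interference that makes finding an optimal selection sequence NP-hard.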
What's wrong with hazard-ranking systems? An expository note.
Cox, Louis Anthony Tony
2009-07-01
Two commonly recommended principles for allocating risk management resources to remediate uncertain hazards are: (1) select a subset to maximize risk-reduction benefits (e.g., maximize the von Neumann-Morgenstern expected utility of the selected risk-reducing activities), and (2) assign priorities to risk-reducing opportunities and then select activities from the top of the priority list down until no more can be afforded. When different activities create uncertain but correlated risk reductions, as is often the case in practice, then these principles are inconsistent: priority scoring and ranking fails to maximize risk-reduction benefits. Real-world risk priority scoring systems used in homeland security and terrorism risk assessment, environmental risk management, information system vulnerability rating, business risk matrices, and many other important applications do not exploit correlations among risk-reducing opportunities or optimally diversify risk-reducing investments. As a result, they generally make suboptimal risk management recommendations. Applying portfolio optimization methods instead of risk prioritization ranking, rating, or scoring methods can achieve greater risk-reduction value for resources spent.
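The inconsistency between the two principles can be shown in a few lines. Below, overlapping hazard sets stand in for correlated risk reductions, and all activities, costs, and the budget are invented: top-down priority ranking double-counts two nearly redundant activities, while a joint subset search under the same budget does not.

```python
from itertools import combinations

# Hypothetical activities: (cost, set of hazards each one mitigates)
acts = {
    "A": (2, {"fire", "flood"}),
    "B": (2, {"fire", "flood"}),   # nearly redundant with A
    "C": (2, {"quake"}),
}
budget = 4

def benefit(subset):
    """Risk-reduction value = number of distinct hazards covered."""
    covered = set().union(*(acts[a][1] for a in subset)) if subset else set()
    return len(covered)

# Principle 2: score each activity alone, take from the top until broke
ranked = sorted(acts, key=lambda a: benefit([a]), reverse=True)
pick, spent = [], 0
for a in ranked:
    if spent + acts[a][0] <= budget:
        pick.append(a)
        spent += acts[a][0]
print("ranking:  ", sorted(pick), benefit(pick))

# Principle 1: optimize over subsets jointly under the budget
best = max((s for r in range(len(acts) + 1)
            for s in combinations(acts, r)
            if sum(acts[a][0] for a in s) <= budget),
           key=benefit)
print("portfolio:", sorted(best), benefit(best))
```

Ranking spends the whole budget on A and B, covering two hazards; the portfolio search pairs A with C and covers three, illustrating why scoring activities in isolation is suboptimal whenever their risk reductions overlap.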
A natural language query system for Hubble Space Telescope proposal selection
NASA Technical Reports Server (NTRS)
Hornick, Thomas; Cohen, William; Miller, Glenn
1987-01-01
The proposal selection process for the Hubble Space Telescope is assisted by a robust and easy-to-use query program (TACOS). The system parses an English subset language sentence regardless of the order of the keyword phrases, allowing the user greater flexibility than a standard command query language. Capabilities for macro and procedure definition are also integrated. The system was designed for flexibility in both use and maintenance. In addition, TACOS can be applied to any knowledge domain that can be expressed in terms of a single relation. The system was implemented mostly in Common LISP. The TACOS design is described in detail, with particular attention given to the implementation methods of sentence processing.
Wang, Nanyi; Wang, Lirong; Xie, Xiang-Qun
2017-11-27
Molecular docking is widely applied to computer-aided drug design and has become relatively mature in recent decades. Application of docking in modeling varies from single lead compound optimization to large-scale virtual screening. The performance of molecular docking is highly dependent on the protein structures selected. It is especially challenging for large-scale target prediction research when multiple structures are available for a single target. Therefore, we have established ProSelection, a docking preferred-protein selection algorithm, in order to generate the proper structure subset(s). By the ProSelection algorithm, protein structures of "weak selectors" are filtered out whereas structures of "strong selectors" are kept. Specifically, a structure that has a good statistical performance in distinguishing active ligands from inactive ligands is defined as a strong selector. In this study, 249 protein structures of 14 autophagy-related targets are investigated. Surflex-dock was used as the docking engine to distinguish active and inactive compounds against these protein structures. Both the t test and the Mann-Whitney U test were used to distinguish the strong from the weak selectors based on the normality of the docking score distribution. The suggested docking score threshold for active ligands (SDA) was generated for each strong selector structure according to the receiver operating characteristic (ROC) curve. The performance of ProSelection was further validated by predicting the potential off-targets of 43 U.S. Food and Drug Administration approved small molecule antineoplastic drugs. Overall, ProSelection will accelerate the computational work in protein structure selection and could be a useful tool for molecular docking, target prediction, and protein-chemical database establishment research.
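The "strong selector" criterion can be illustrated with a small stdlib sketch: a structure is kept when its docking scores separate active from inactive ligands. Here the rank-sum (Mann-Whitney U) statistic is converted to an AUC; the 0.8 cutoff, the score convention (higher is better), and all numbers are assumptions for illustration, not values from the paper.

```python
# Rank-sum based separation of active vs. inactive docking scores.

def mann_whitney_u(actives, inactives):
    """U statistic: count of active-over-inactive score wins (ties = 0.5)."""
    u = 0.0
    for a in actives:
        for b in inactives:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

def selector_auc(actives, inactives):
    # Normalizing U by the number of pairs gives the ROC AUC.
    return mann_whitney_u(actives, inactives) / (len(actives) * len(inactives))

def is_strong_selector(actives, inactives, auc_threshold=0.8):
    return selector_auc(actives, inactives) >= auc_threshold

# Toy docking scores (higher = better fit in this convention).
strong = is_strong_selector([9.1, 8.7, 8.9], [6.0, 7.2, 6.5])  # separates well
weak = is_strong_selector([7.0, 6.1, 8.0], [6.9, 7.5, 6.2])    # overlapping
```

A per-structure SDA threshold could then be read off the same ROC curve, e.g. the score maximizing the true-positive minus false-positive rate.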
Delineation of soil temperature regimes from HCMM data
NASA Technical Reports Server (NTRS)
Day, R. L.; Petersen, G. W. (Principal Investigator)
1981-01-01
Supplementary data including photographs as well as topographic, geologic, and soil maps were obtained and evaluated for ground truth purposes and control point selection. A study area (approximately 450 by 450 pixels) was subset from LANDSAT scene No. 2477-17142. Geometric corrections and scaling were performed. Initial enhancement techniques were initiated to aid control point selection and soils interpretation. The SUBSET program was modified to read HCMM tapes, and HCMM data were reformatted so that they are compatible with the ORSER system. Initial NMAP products of geometrically corrected and scaled raw data tapes (unregistered) of the study area were produced.
Decision Aids for Airborne Intercept Operations in Advanced Aircrafts
NASA Technical Reports Server (NTRS)
Madni, A.; Freedy, A.
1981-01-01
A tactical decision aid (TDA) for the F-14 aircrew, i.e., the naval flight officer and pilot, in conducting a multitarget attack during the performance of a Combat Air Patrol (CAP) role is presented. The TDA employs hierarchical multiattribute utility models for characterizing mission objectives in operationally measurable terms, rule based AI-models for tactical posture selection, and fast time simulation for maneuver consequence prediction. The TDA makes aspect maneuver recommendations, selects and displays the optimum mission posture, evaluates attackable and potentially attackable subsets, and recommends the 'best' attackable subset along with the required course perturbation.
Selection of noisy measurement locations for error reduction in static parameter identification
NASA Astrophysics Data System (ADS)
Sanayei, Masoud; Onipede, Oladipo; Babu, Suresh R.
1992-09-01
An incomplete set of noisy static force and displacement measurements is used for parameter identification of structures at the element level. Measurement location and the level of accuracy in the measured data can drastically affect the accuracy of the identified parameters. A heuristic method is presented to select a limited number of degrees of freedom (DOF) to perform a successful parameter identification and to reduce the impact of measurement errors on the identified parameters. This pretest simulation uses an error sensitivity analysis to determine the effect of measurement errors on the parameter estimates. The selected DOF can be used for nondestructive testing and health monitoring of structures. Two numerical examples, one for a truss and one for a frame, are presented to demonstrate that using the measurements at the selected subset of DOF can limit the error in the parameter estimates.
The First SeaWiFS HPLC Analysis Round-Robin Experiment (SeaHARRE-1)
NASA Technical Reports Server (NTRS)
Hooker, Stanford B. (Editor); Firestone, Elaine R. (Editor); Claustre, Herve; Ras, Josephine; VanHeukelem, Laurie; Berthon, Jean-Francois; Targa, Cristina; vanderLinde, Dirk; Barlow, Ray; Sessions, Heather
2001-01-01
Four laboratories, which had contributed to various aspects of SeaWiFS calibration and validation activities, participated in the first SeaWiFS HPLC Analysis Round-Robin Experiment (SeaHARRE-1): Horn Point Laboratory (USA), the Joint Research Centre (Italy), the Laboratoire de Physique et Chimie Marines (France), and the Marine and Coastal Management group (South Africa). The analyses of the data are presented in Chapter 1 and the individual methods of the four groups are presented in Chapters 2-5. The average (or overall) conclusions of the round-robin are derived from 12 in situ stations occupied during a cruise in the Mediterranean Sea, although only 11 stations are used in the analyses. The data set is composed of 12 replicates taken during each sampling opportunity, with 3 replicates going to each of the 4 laboratories. The average (or overall) results from the intercomparison of 15 pigments or pigment associations are as follows (in some cases, data subsets that exclude pigments which were not analyzed by all the laboratories, or that had unusually large variances, are used to exclude a variety of problematic pigments): a) the accuracy of the four methods in determining the concentration of total chlorophyll a is 7.9% (one method did not separate mono- and divinyl chlorophyll a, and if the samples containing significant divinyl chlorophyll a concentrations are ignored, the four methods have an accuracy of 6.7%); b) the accuracy in determining the full set of pigments is 19.1%; c) there is a reduction in accuracy of approximately 12.2% for every decade (factor of 10) decrease in concentration (based on a data subset); d) the precision of the four methods using the subset data is 8.6% (6.2% for an edited subset); e) the repeatability of the four methods using the subset data is 9.2% (7.2% for an edited subset); and f) the reproducibility of the four methods using the subset data is 21.3% (15.0% for an edited subset).
NASA Astrophysics Data System (ADS)
Xu, Kai; Wang, Yiwen; Wang, Yueming; Wang, Fang; Hao, Yaoyao; Zhang, Shaomin; Zhang, Qiaosheng; Chen, Weidong; Zheng, Xiaoxiang
2013-04-01
Objective. High-dimensional neural recordings bring computational challenges to movement decoding in motor brain-machine interfaces (mBMI), especially for portable applications. However, not all recorded neural activities relate to the execution of a certain movement task. This paper proposes to use a local-learning-based method to perform neuron selection for gesture prediction in a reaching and grasping task. Approach. Nonlinear neural activities are decomposed into a set of linear ones in a weighted feature space. A margin is defined to measure the distance between inter-class and intra-class neural patterns. The weights, reflecting the importance of neurons, are obtained by minimizing a margin-based exponential error function. To find the most dominant neurons in the task, 1-norm regularization is introduced to the objective function to obtain sparse weights, where near-zero weights indicate irrelevant neurons. Main results. The signals of only 10 neurons out of 70 selected by the proposed method could achieve over 95% of the full recording's decoding accuracy for gesture prediction, no matter which decoding method is used (support vector machine or K-nearest neighbor). The temporal activities of the selected neurons show visually distinguishable patterns associated with various hand states. Compared with other algorithms, the proposed method better eliminates the irrelevant neurons with near-zero weights and provides the important neuron subset with the best decoding performance in statistics. The weights of important neurons usually converge within 10-20 iterations. In addition, we study the temporal and spatial variation of neuron importance over a period of one and a half months in the same task. A high decoding performance can be maintained by updating the neuron subset. Significance. 
The proposed algorithm effectively ascertains the neuronal importance without assuming any coding model and provides high performance with different decoding models. It shows better robustness in identifying the important neurons when noisy signals are present. The low demand for computational resources, reflected by the fast convergence, indicates the feasibility of applying the method in portable BMI systems. The ascertainment of the important neurons helps to inspect neural patterns visually associated with the movement task. The elimination of irrelevant neurons greatly reduces the computational burden of mBMI systems and maintains the performance with better robustness.
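The margin-plus-sparsity idea above can be sketched without reproducing the paper's exact optimization. In this simplified stand-in (all data and the soft-threshold step are illustrative assumptions), each neuron gets a margin score (inter-class distance minus intra-class spread), an l1-style soft threshold shrinks scores toward zero, and neurons with exactly zero weight are discarded as irrelevant.

```python
# Margin-style neuron scoring with an l1-flavoured soft threshold.

def neuron_margin(rates_class0, rates_class1):
    """Inter-class mean distance minus average intra-class spread."""
    m0 = sum(rates_class0) / len(rates_class0)
    m1 = sum(rates_class1) / len(rates_class1)
    spread = (max(rates_class0) - min(rates_class0)
              + max(rates_class1) - min(rates_class1)) / 2.0
    return abs(m0 - m1) - spread

def sparse_weights(margins, l1_penalty):
    # Soft-thresholding: shrink every margin, clip negatives to exactly zero,
    # so irrelevant neurons receive weight 0 rather than merely small weights.
    return [max(m - l1_penalty, 0.0) for m in margins]

# Two toy neurons: the first separates the two gesture classes, the second
# fires indistinguishably for both.
margins = [neuron_margin([10, 11, 12], [20, 21, 22]),
           neuron_margin([10, 20, 15], [11, 19, 14])]
weights = sparse_weights(margins, l1_penalty=1.0)
selected = [i for i, w in enumerate(weights) if w > 0]
```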
NASA Astrophysics Data System (ADS)
Wang, Lijuan; Yan, Yong; Wang, Xue; Wang, Tao
2017-03-01
Input variable selection is an essential step in the development of data-driven models for environmental, biological and industrial applications. Through input variable selection, the irrelevant or redundant variables are eliminated and a suitable subset of variables is identified as the input of a model; at the same time, the complexity of the model structure is simplified and the computational efficiency is improved. This paper describes the procedures of input variable selection for data-driven models for the measurement of liquid mass flowrate and gas volume fraction under two-phase flow conditions using Coriolis flowmeters. Three advanced input variable selection methods, including partial mutual information (PMI), genetic algorithm-artificial neural network (GA-ANN) and tree-based iterative input selection (IIS), are applied in this study. Typical data-driven models incorporating support vector machines (SVM) are established individually based on the input candidates resulting from the selection methods. The validity of the selection outcomes is assessed through an output performance comparison of the SVM based data-driven models and sensitivity analysis. The validation and analysis results suggest that the input variables selected by the PMI algorithm provide more effective information for the models to measure liquid mass flowrate, while the IIS algorithm provides fewer but more effective variables for the models to predict gas volume fraction.
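A stdlib sketch of mutual-information-based input variable selection on discretized data illustrates the general idea: each candidate input is scored by I(X; Y) against the target and the top-k are retained. Note this is plain MI ranking; the PMI algorithm in the paper additionally conditions on inputs already selected, a refinement omitted here. Data and variable names are invented.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """I(X; Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_inputs(candidates, target, k):
    # Rank candidate inputs by MI with the target; keep the k best.
    ranked = sorted(candidates,
                    key=lambda name: -mutual_information(candidates[name], target))
    return ranked[:k]

target = [0, 0, 1, 1, 0, 1, 0, 1]
candidates = {
    "relevant": [0, 0, 1, 1, 0, 1, 0, 1],   # tracks the target exactly
    "noise":    [0, 1, 0, 1, 0, 1, 0, 1],   # unrelated alternating pattern
}
chosen = select_inputs(candidates, target, k=1)
```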
Modelling the pre-assessment learning effects of assessment: evidence in the validity chain
Cilliers, Francois J; Schuwirth, Lambert W T; van der Vleuten, Cees P M
2012-01-01
OBJECTIVES We previously developed a model of the pre-assessment learning effects of consequential assessment and started to validate it. The model comprises assessment factors, mechanism factors and learning effects. The purpose of this study was to continue the validation process. For stringency, we focused on a subset of assessment factor–learning effect associations that featured least commonly in a baseline qualitative study. Our aims were to determine whether these uncommon associations were operational in a broader but similar population to that in which the model was initially derived. METHODS A cross-sectional survey of 361 senior medical students at one medical school was undertaken using a purpose-made questionnaire based on a grounded theory and comprising pairs of written situational tests. In each pair, the manifestation of an assessment factor was varied. The frequencies at which learning effects were selected were compared for each item pair, using an adjusted alpha to assign significance. The frequencies at which mechanism factors were selected were calculated. RESULTS There were significant differences in the learning effect selected between the two scenarios of an item pair for 13 of this subset of 21 uncommon associations, even when a p-value of < 0.00625 was considered to indicate significance. Three mechanism factors were operational in most scenarios: agency; response efficacy, and response value. CONCLUSIONS For a subset of uncommon associations in the model, the role of most assessment factor–learning effect associations and the mechanism factors involved were supported in a broader but similar population to that in which the model was derived. Although model validation is an ongoing process, these results move the model one step closer to the stage of usefully informing interventions. 
Results illustrate how factors not typically included in studies of the learning effects of assessment could confound the results of interventions aimed at using assessment to influence learning. PMID:23078685
Ho, Vincent K.; Angelotti, Timothy
2013-01-01
Receptor expression enhancing proteins (REEPs) were identified by their ability to enhance cell surface expression of a subset of G protein-coupled receptors (GPCRs), specifically GPCRs that have proven difficult to express in heterologous cell systems. Further analysis revealed that they belong to the Yip (Ypt-interacting protein) family and that some REEP subtypes affect ER structure. Yip family comparisons have established other potential roles for REEPs, including regulation of ER-Golgi transport and processing/neuronal localization of cargo proteins. However, these other potential REEP functions and the mechanism by which they selectively enhance GPCR cell surface expression have not been clarified. By utilizing several REEP family members (REEP1, REEP2, and REEP6) and model GPCRs (α2A and α2C adrenergic receptors), we examined REEP regulation of GPCR plasma membrane expression, intracellular processing, and trafficking. Using a combination of immunolocalization and biochemical methods, we demonstrated that this REEP subset is localized primarily to ER, but not plasma membranes. Single cell analysis demonstrated that these REEPs do not specifically enhance surface expression of all GPCRs, but affect ER cargo capacity of specific GPCRs and thus their surface expression. REEP co-expression with α2 adrenergic receptors (ARs) revealed that this REEP subset interacts with and alters glycosidic processing of α2C, but not α2A ARs, demonstrating selective interaction with cargo proteins. Specifically, these REEPs enhanced expression of and interacted with minimally/non-glycosylated forms of α2C ARs. Most importantly, expression of a mutant REEP1 allele (hereditary spastic paraplegia SPG31) lacking the carboxyl terminus led to loss of this interaction. Thus, specific REEP isoforms have additional intracellular functions besides altering ER structure, such as enhancing ER cargo capacity, regulating ER-Golgi processing, and interacting with select cargo proteins. 
Therefore, some REEPs can be further described as ER membrane shaping adapter proteins. PMID:24098485
A genetic programming approach to oral cancer prognosis
Tan, Mei Sze; Tan, Jing Wei; Yap, Hwa Jen; Abdul Kareem, Sameem; Zain, Rosnah Binti
2016-01-01
Background The potential of genetic programming (GP) in various fields has been demonstrated in recent years. In the biomedical field, much GP research has focused on the recognition of cancerous cells and on gene expression profiling data. In this research, the aim is to study the performance of GP on survival prediction using a small oral cancer prognosis dataset, which is the first study in the field of oral cancer prognosis. Method GP is applied to an oral cancer dataset that contains 31 cases collected from the Malaysia Oral Cancer Database and Tissue Bank System (MOCDTBS). The feature subsets automatically selected through GP were noted, and the influence of these subsets on the results of GP was recorded. In addition, a comparison between the performance of GP and that of the Support Vector Machine (SVM) and logistic regression (LR) was also made in order to verify the predictive capabilities of GP. Result The results show that GP performed best (average accuracy of 83.87% and average AUROC of 0.8341) when the features selected were smoking, drinking, chewing, histological differentiation of SCC, and oncogene p63. In addition, based on the comparison results, we found that GP outperformed the SVM and LR in oral cancer prognosis. Discussion Some of the features in the dataset are found to be statistically correlated. This is because the accuracy of the GP prediction drops when one of the features in the best feature subset is excluded. Thus, GP provides an automatic feature selection function, which chooses features that are highly correlated to the prognosis of oral cancer. This makes GP an ideal prediction model for cancer clinical and genomic data that can be used to aid physicians in the decision-making stage of diagnosis or prognosis. PMID:27688975
Adaptive Greedy Dictionary Selection for Web Media Summarization.
Cong, Yang; Liu, Ji; Sun, Gan; You, Quanzeng; Li, Yuncheng; Luo, Jiebo
2017-01-01
Initializing an effective dictionary is an indispensable step for sparse representation. In this paper, we focus on the dictionary selection problem, with the objective to select a compact subset of basis elements from the original training data instead of learning a new dictionary matrix as dictionary learning models do. We first design a new dictionary selection model via the l2,0 norm. For model optimization, we propose two methods: one is the standard forward-backward greedy algorithm, which is not suitable for large-scale problems; the other is based on the gradient cues at each forward iteration and speeds up the process dramatically. In comparison with the state-of-the-art dictionary selection models, our model is not only more effective and efficient, but can also control the sparsity. To evaluate the performance of our new model, we select two practical web media summarization problems: 1) we build a new data set consisting of around 500 users, 3000 albums, and 1 million images, and achieve effective assisted albuming based on our model; and 2) by formulating the video summarization problem as a dictionary selection issue, we employ our model to extract keyframes from a video sequence in a more flexible way. Generally, our model outperforms the state-of-the-art methods in both of these two tasks.
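The forward greedy idea can be sketched in miniature. This is a simplified stand-in for the paper's l2,0-constrained model (assumptions: unit-norm 2-D atoms, one matching-pursuit step per signal as the reconstruction criterion, invented data): at each step the atom that most reduces the pool's residual energy joins the dictionary.

```python
# Forward greedy dictionary selection: pick k atoms from a candidate pool
# that best explain a set of training signals.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual_energy(signals, dictionary):
    """Energy left after projecting each signal onto its best single atom."""
    total = 0.0
    for s in signals:
        best = max((dot(s, d) ** 2 for d in dictionary), default=0.0)
        total += dot(s, s) - best
    return total

def greedy_select(candidates, signals, k):
    selected = []
    pool = list(candidates)
    for _ in range(k):
        # Forward step: the atom whose addition minimizes residual energy.
        best_atom = min(pool,
                        key=lambda d: residual_energy(signals, selected + [d]))
        selected.append(best_atom)
        pool.remove(best_atom)
    return selected

atoms = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]          # candidate basis
signals = [(2.0, 0.0), (1.9, 0.1), (0.0, 3.0)]        # training data
picked = greedy_select(atoms, signals, k=2)
```

The two axis-aligned atoms are picked because together they explain every signal almost exactly, while the oblique atom is redundant; the sparsity level is controlled directly through k.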
Gene selection for tumor classification using neighborhood rough sets and entropy measures.
Chen, Yumin; Zhang, Zunjun; Zheng, Jianzhong; Ma, Ying; Xue, Yu
2017-03-01
With the development of bioinformatics, tumor classification from gene expression data has become an important and useful technology for cancer diagnosis. Since gene expression data often contain thousands of genes and only a small number of samples, gene selection becomes a key step for tumor classification. Attribute reduction based on rough sets has been successfully applied to gene selection, as it is data-driven and requires no additional information. However, traditional rough set methods deal with discrete data only. Gene expression data containing real-valued or noisy entries are usually subjected to a discretization preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which has the ability to deal with real-valued data whilst maintaining the original gene classification information. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. The utilization of this measure can bring about a discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection method is effective for improving the accuracy of tumor classification. Copyright © 2017 Elsevier Inc. All rights reserved.
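The neighborhood rough set machinery can be sketched in simplified form (this omits the paper's entropy measure; data, delta, and the greedy stopping rule are illustrative assumptions): a sample lies in the positive region when every sample within distance delta of it, measured on the selected genes, carries the same tumor label, and genes are added greedily while the dependency (positive-region fraction) improves.

```python
# Toy neighborhood-rough-set gene selection on real-valued expression data.

def neighborhood(i, samples, genes, delta):
    """Indices of samples within Chebyshev distance delta on selected genes."""
    return [j for j in range(len(samples))
            if max(abs(samples[i][g] - samples[j][g]) for g in genes) <= delta]

def dependency(samples, labels, genes, delta):
    if not genes:
        return 0.0
    consistent = sum(1 for i in range(len(samples))
                     if all(labels[j] == labels[i]
                            for j in neighborhood(i, samples, genes, delta)))
    return consistent / len(samples)

def select_genes(samples, labels, delta):
    remaining = list(range(len(samples[0])))
    selected = []
    current = 0.0
    while remaining:
        g = max(remaining,
                key=lambda g: dependency(samples, labels, selected + [g], delta))
        gain = dependency(samples, labels, selected + [g], delta)
        if gain <= current:       # stop when dependency no longer improves
            break
        selected.append(g)
        current = gain
        remaining.remove(g)
    return selected

# Gene 0 separates the two tumor classes; gene 1 is noise.
samples = [(0.1, 0.5), (0.2, 0.9), (0.9, 0.4), (1.0, 0.8)]
labels = [0, 0, 1, 1]
chosen = select_genes(samples, labels, delta=0.3)
```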
Texture analysis based on the Hermite transform for image classification and segmentation
NASA Astrophysics Data System (ADS)
Estudillo-Romero, Alfonso; Escalante-Ramirez, Boris; Savage-Carmona, Jesus
2012-06-01
Texture analysis has become an important task in image processing because it is used as a preprocessing stage in different research areas including medical image analysis, industrial inspection, segmentation of remotely sensed imagery, and multimedia indexing and retrieval. In order to extract visual texture features, a texture image analysis technique is presented based on the Hermite transform. Psychovisual evidence suggests that Gaussian derivatives fit the receptive field profiles of mammalian visual systems. The Hermite transform describes locally basic texture features in terms of Gaussian derivatives. Multiresolution analysis combined with several analysis orders provides detection of patterns that characterize every texture class. The analysis of the local maximum energy direction and steering of the transformation coefficients increases the method's robustness against texture orientation. This method presents an advantage over classical filter bank design because in the latter a fixed number of orientations for the analysis has to be selected. During the training stage, a subset of the Hermite analysis filters is chosen in order to improve the inter-class separability, reduce the dimensionality of the feature vectors and lower the computational cost during the classification stage. We exhaustively evaluated the correct classification rate of real, randomly selected training and testing texture subsets using several kinds of commonly used texture features. A comparison between different distance measurements is also presented. Results of unsupervised real texture segmentation using this approach, and comparison with previous approaches, showed the benefits of our proposal.
USDA-ARS?s Scientific Manuscript database
The USDA rice (Oryza sativa L.) core subset (RCS) was assembled to represent the genetic diversity of the entire USDA-ARS National Small Grains Collection and consists of 1,794 accessions from 114 countries. The USDA rice mini-core (MC) is a subset of 217 accessions from the RCS and was selected to ...
Sarkar, Mohosin; Liu, Yun; Qi, Junpeng; Peng, Haiyong; Morimoto, Jumpei; Rader, Christoph; Chiorazzi, Nicholas; Kodadek, Thomas
2016-04-01
Chronic lymphocytic leukemia (CLL) is a disease in which a single B-cell clone proliferates relentlessly in peripheral lymphoid organs, bone marrow, and blood. DNA sequencing experiments have shown that about 30% of CLL patients have stereotyped antigen-specific B-cell receptors (BCRs) with a high level of sequence homology in the variable domains of the heavy and light chains. These include many of the most aggressive cases that have IGHV-unmutated BCRs whose sequences have not diverged significantly from the germ line. This suggests a personalized therapy strategy in which a toxin or immune effector function is delivered selectively to the pathogenic B-cells but not to healthy B-cells. To execute this strategy, serum-stable, drug-like compounds able to target the antigen-binding sites of most or all patients in a stereotyped subset are required. We demonstrate here the feasibility of this approach with the discovery of selective, high-affinity ligands for CLL BCRs of the aggressive, stereotyped subset 7p that cross-react with the BCRs of several CLL patients in subset 7p, but not with BCRs from patients outside this subset. © 2016 by The American Society for Biochemistry and Molecular Biology, Inc.
Yang, Ling; Liu, Zhichao; Li, Jianbin; He, Kaili; Kong, Lingna; Guo, Runqing; Liu, Wenjiao; Gao, Yundong; Zhong, Jifeng
2018-05-25
High immune response (HIR) cows have a balanced and robust host defense and lower disease incidence, and immune response is more important to consider when selecting young sires than when selecting cows. The protective immune response against foot-and-mouth disease (FMD) virus infection is T-cell-independent in an animal experimental model. However, there is no convenient method to select young sires with a HIR to FMD virus. In this study, 39 healthy Holstein young sires were vaccinated with the trivalent (A, O and Asia 1) FMD vaccine, and T-lymphocyte subsets in peripheral blood lymphocytes (PBLs) were detected using flow cytometric analysis before and after vaccination. The expression of interferon-gamma (IFN-γ), interleukin-2 (IL-2), IL-4, and IL-6 mRNA in PBLs was analyzed after stimulation by lipopolysaccharide (LPS) or Concanavalin A (ConA) after vaccination. When HIR young sires were selected according to the percentage of CD4+ lymphocytes and the CD4/CD8 ratio after vaccination, the results showed that the percentages of CD3+, CD4+, and CD3+CD4+ lymphocytes and the CD4/CD8 ratio in the HIR group were higher compared to those in the medium immune response (MIR) and low immune response (LIR) groups before vaccination. Additionally, the percentage of CD4+ lymphocytes and the CD4/CD8 ratio after vaccination were positively associated with the expression level of IFN-γ mRNA in the PBLs after stimulation by LPS. In conclusion, the in vitro expression level of IFN-γ mRNA in PBLs stimulated by LPS may serve as a parameter for selecting young sires with a HIR to FMD virus. Copyright © 2018 Elsevier Ltd. All rights reserved.
Collectively loading an application in a parallel computer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.
Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
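The load protocol above (identify subset, pick a leader, leader retrieves the application, leader broadcasts it to the subset) can be simulated in plain Python. This is a toy simulation, not real parallel-computer or MPI code; the class, function names, and the leader-choice rule are illustrative assumptions.

```python
# Plain-Python simulation of the collective-load protocol: only the job
# leader reads "storage", then the image is broadcast to the whole subset.

class ComputeNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.application = None  # filled in by the broadcast

def collectively_load(nodes, job_subset_ids, storage):
    # Control system identifies the subset of compute nodes for the job.
    subset = [n for n in nodes if n.node_id in job_subset_ids]
    leader = subset[0]                        # leader choice (assumed rule)
    leader.application = storage["app.elf"]   # single read from the file system
    for peer in subset:                       # broadcast to the subset
        peer.application = leader.application
    return leader

nodes = [ComputeNode(i) for i in range(6)]
leader = collectively_load(nodes, {1, 3, 5}, {"app.elf": b"\x7fELF..."})
loaded = [n.node_id for n in nodes if n.application is not None]
```

The point of the pattern is that the shared file system sees one read instead of one per node; nodes outside the job subset are untouched.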
Long-Term Variations of the EOP and ICRF2
NASA Technical Reports Server (NTRS)
Zharov, Vladimir; Sazhin, Mikhail; Sementsov, Valerian; Sazhina, Olga
2010-01-01
We analyzed the time series of the coordinates of the ICRF radio sources. We show that some of the radio sources, including the defining sources, exhibit significant apparent motion. The stability of the celestial reference frame is provided by a no-net-rotation condition applied to the defining sources. In our case this condition leads to a rotation of the frame axes with time. We calculated the effect of this rotation on the Earth orientation parameters (EOP). In order to improve the stability of the celestial reference frame, we suggest a new method for the selection of the defining sources. The method consists of two criteria: the first we call cosmological and the second kinematical. It is shown that a subset of the ICRF sources selected according to the cosmological criterion provides the most stable reference frame for the next decade.
Active-learning strategies in computer-assisted drug discovery.
Reker, Daniel; Schneider, Gisbert
2015-04-01
High-throughput compound screening is time- and resource-consuming, and considerable effort is invested into screening compound libraries, profiling, and selecting the most promising candidates for further testing. Active-learning methods assist the selection process by focusing on areas of chemical space that have the greatest chance of success while considering structural novelty. The core feature of these algorithms is their ability to adapt the structure-activity landscapes through feedback. Instead of full-deck screening, only focused subsets of compounds are tested, and the experimental readout is used to refine molecule selection for subsequent screening cycles. Once implemented, these techniques have the potential to reduce costs and save precious materials. Here, we provide a comprehensive overview of the various computational active-learning approaches and outline their potential for drug discovery. Copyright © 2014 Elsevier Ltd. All rights reserved.
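The screen-test-refine loop described above can be sketched with the standard library. Everything here is invented for illustration: the "assay" oracle, the 1-D descriptor space, and the novelty-driven acquisition rule (pick the untested compound farthest from anything already assayed), which is one simple stand-in for the uncertainty/novelty criteria the review surveys.

```python
# Toy active-learning screening loop: test a seed compound, then in each
# cycle acquire the most "novel" untested compound and feed the assay
# result back into the tested set.

def oracle_activity(x):
    """Stand-in for the wet-lab assay: active only in a narrow window."""
    return 1.0 if 0.4 <= x <= 0.6 else 0.0

def pick_most_novel(pool, tested):
    # Farthest-from-tested acquisition: a crude structural-novelty proxy.
    return max(pool, key=lambda x: min(abs(x - t) for t in tested))

def screen(pool, seed, cycles):
    tested = {seed: oracle_activity(seed)}
    remaining = [x for x in pool if x != seed]
    for _ in range(cycles):
        x = pick_most_novel(remaining, tested)
        tested[x] = oracle_activity(x)   # feedback refines later selections
        remaining.remove(x)
    return tested

pool = [0.0, 0.25, 0.5, 0.75, 1.0]
results = screen(pool, seed=0.0, cycles=3)
```

With three cycles the loop finds the active compound at 0.5 without assaying the full deck, which is the cost argument made in the abstract.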
Ma, Hong-Wu; Zhao, Xue-Ming; Yuan, Ying-Jin; Zeng, An-Ping
2004-08-12
Metabolic networks are organized in a modular, hierarchical manner. Methods for a rational decomposition of the metabolic network into relatively independent functional subsets are essential to better understand the modularity and organization principles of a large-scale, genome-wide network. Network decomposition is also necessary for functional analysis of metabolism by pathway analysis methods, which are often hampered by the problem of combinatorial explosion due to the complexity of the metabolic network. Decomposition methods proposed in the literature are mainly based on the connection degree of metabolites. To obtain a more reasonable decomposition, the global connectivity structure of metabolic networks should be taken into account. In this work, we use a reaction graph representation of a metabolic network for the identification of its global connectivity structure and for decomposition. A bow-tie connectivity structure similar to that previously discovered for the metabolite graph is found to exist in the reaction graph as well. Based on this bow-tie structure, a new decomposition method is proposed, which uses a distance definition derived from the path length between two reactions. A hierarchical classification tree is first constructed from the distance matrix among the reactions in the giant strong component of the bow-tie structure. These reactions are then grouped into different subsets based on the hierarchical tree. Reactions in the IN and OUT subsets of the bow-tie structure are subsequently placed in the corresponding subsets according to a 'majority rule'. Compared with the decomposition methods proposed in the literature, ours is based on combined properties of the global network structure and local reaction connectivity rather than, primarily, on the connection degree of metabolites. The method is applied to decompose the metabolic network of Escherichia coli. Eleven subsets are obtained. 
More detailed investigations of the subsets show that reactions in the same subset are indeed functionally related. The rational decomposition of metabolic networks, and subsequent studies of the subsets, make the inherent organization and functionality of metabolic networks more amenable to understanding at the modular level. http://genome.gbf.de/bioinformatics/
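The path-length distance between reactions can be illustrated with a stdlib sketch. Assumptions not from the paper: distances are taken on an undirected toy reaction graph so the matrix is symmetric, and the grouping is a simple complete-linkage pass with a distance threshold rather than the paper's hierarchical classification tree.

```python
from collections import deque

def bfs_distances(graph, start):
    """Shortest path length (in edges) from start to every reachable reaction."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def group_reactions(graph, max_dist):
    # Complete-linkage-style grouping: a reaction joins an existing group
    # only when it lies within max_dist of every member of that group.
    groups = []
    for r in graph:
        dist = bfs_distances(graph, r)
        for g in groups:
            if all(dist.get(m, float("inf")) <= max_dist for m in g):
                g.append(r)
                break
        else:
            groups.append([r])
    return groups

# A chain of five reactions; the threshold splits it into two modules.
graph = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2", "r4"],
         "r4": ["r3", "r5"], "r5": ["r4"]}
modules = group_reactions(graph, max_dist=2)
```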
PSYCHE Pure Shift NMR Spectroscopy.
Foroozandeh, Mohammadali; Morris, Gareth; Nilsson, Mathias
2018-03-13
Broadband homodecoupling techniques in NMR, also known as "pure shift" methods, aim to enhance spectral resolution by suppressing the effects of homonuclear coupling interactions to turn multiplet signals into singlets. Such techniques typically work by selecting a subset of "active" nuclear spins to observe, and selectively inverting the remaining, "passive", spins to reverse the effects of coupling. Pure Shift Yielded by Chirp Excitation (PSYCHE) is one such method; it is relatively recent, but has already been successfully implemented in a range of different NMR experiments. Paradoxically, PSYCHE is one of the trickiest of pure shift NMR techniques to understand but one of the easiest to use. Here we offer some insights into theoretical and practical aspects of the method, and into the effects and importance of the experimental parameters. Some recent improvements that enhance the spectral purity of PSYCHE spectra will be presented, and some experimental frameworks including examples in 1D and 2D NMR spectroscopy, for the implementation of PSYCHE will be introduced. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Limitations of inclusive fitness.
Allen, Benjamin; Nowak, Martin A; Wilson, Edward O
2013-12-10
Until recently, inclusive fitness has been widely accepted as a general method to explain the evolution of social behavior. Affirming and expanding earlier criticism, we demonstrate that inclusive fitness is instead a limited concept, which exists only for a small subset of evolutionary processes. Inclusive fitness assumes that personal fitness is the sum of additive components caused by individual actions. This assumption does not hold for the majority of evolutionary processes or scenarios. To sidestep this limitation, inclusive fitness theorists have proposed a method using linear regression. On the basis of this method, it is claimed that inclusive fitness theory (i) predicts the direction of allele frequency changes, (ii) reveals the reasons for these changes, (iii) is as general as natural selection, and (iv) provides a universal design principle for evolution. In this paper we evaluate these claims, and show that all of them are unfounded. If the objective is to analyze whether mutations that modify social behavior are favored or opposed by natural selection, then no aspect of inclusive fitness theory is needed.
Derivation of an artificial gene to improve classification accuracy upon gene selection.
Seo, Minseok; Oh, Sejong
2012-02-01
Classification analysis has been developed continuously since 1936. This research field has advanced through the development of classifiers such as KNN, ANN, and SVM, as well as through advances in data preprocessing. Feature (gene) selection is required for very high dimensional data such as microarrays before classification work. The goal of feature selection is to choose a subset of informative features that reduces processing time and provides higher classification accuracy. In this study, we devised a method of artificial gene making (AGM) for microarray data to improve classification accuracy. Our artificial gene was derived from a whole microarray dataset, and combined with the result of gene selection for classification analysis. We experimentally confirmed a clear improvement of classification accuracy after inserting the artificial gene. Our artificial gene worked well with popular feature (gene) selection algorithms and classifiers. The proposed approach can be applied to any type of high dimensional dataset. Copyright © 2011 Elsevier Ltd. All rights reserved.
USDA-ARS?s Scientific Manuscript database
Selective principal component regression analysis (SPCR) uses a subset of the original image bands for principal component transformation and regression. For optimal band selection before the transformation, this paper used genetic algorithms (GA). In this case, the GA process used the regression co...
Rethinking the assessment of risk of bias due to selective reporting: a cross-sectional study.
Page, Matthew J; Higgins, Julian P T
2016-07-08
Selective reporting is included as a core domain of Cochrane's tool for assessing risk of bias in randomised trials. There has been no evaluation of review authors' use of this domain. We aimed to evaluate assessments of selective reporting in a cross-section of Cochrane reviews and to outline areas for improvement. We obtained data on selective reporting judgements for 8434 studies included in 586 Cochrane reviews published from issue 1-8, 2015. One author classified the reasons for judgements of high risk of selective reporting bias. We randomly selected 100 reviews with at least one trial rated at high risk of outcome non-reporting bias (non-/partial reporting of an outcome on the basis of its results). One author recorded whether the authors of these reviews incorporated the selective reporting assessment when interpreting results. Of the 8434 studies, 1055 (13 %) were rated at high risk of bias on the selective reporting domain. The most common reason was concern about outcome non-reporting bias. Few studies were rated at high risk because of concerns about bias in selection of the reported result (e.g. reporting of only a subset of measurements, analysis methods or subsets of the data that were pre-specified). Review authors often specified in the risk of bias tables the study outcomes that were not reported (84 % of studies) but less frequently specified the outcomes that were partially reported (61 % of studies). At least one study was rated at high risk of outcome non-reporting bias in 31 % of reviews. In the random sample of these reviews, only 30 % incorporated this information when interpreting results, by acknowledging that the synthesis of an outcome was missing data that were not/partially reported. Our audit of user practice in Cochrane reviews suggests that the assessment of selective reporting in the current risk of bias tool does not work well. 
It is not always clear which outcomes were selectively reported or what the corresponding risk of bias is in the synthesis with missing outcome data. New tools that will make it easier for reviewers to convey this information are being developed.
Band selection method based on spectrum difference in targets of interest in hyperspectral imagery
NASA Astrophysics Data System (ADS)
Zhang, Xiaohan; Yang, Guang; Yang, Yongbo; Huang, Junhua
2016-10-01
While hyperspectral data carries rich spectral information, many of its bands are highly correlated, causing considerable data redundancy. A reasonable band selection is therefore important for subsequent processing: bands with a large amount of information and low mutual correlation should be selected. On this basis, and according to the needs of target detection applications, the spectral characteristics of the objects of interest are taken into consideration in this paper, and a new method based on spectrum difference is proposed. First, according to the spectrum differences of the targets of interest, a difference matrix is constructed that represents the spectral reflectance differences between targets in each band. By setting a threshold, the bands satisfying this condition are retained, constituting a subset of bands. Next, the correlation coefficients between bands are calculated to form a correlation matrix, and the bands are divided into several groups according to the size of their correlation coefficients. Finally, the normalized variance of each band is used to represent its information content, and the bands are sorted by this value. Given the required number of bands, the optimal band combination is obtained by these three steps. This method retains the greatest degree of difference between the targets of interest and is easy to automate on a computer. In addition, false color image synthesis experiments were carried out using the bands selected by this method and by three other methods to demonstrate the performance of the proposed method.
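The three steps can be sketched as follows; the toy data layout (per-target spectra and per-band pixel vectors), the thresholds, and the greedy grouping rule are assumptions for illustration, not the authors' implementation.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def select_bands(target_spectra, band_pixels, diff_thresh, corr_thresh, n_bands):
    """Three-step band selection sketch.
    target_spectra: per-target reflectance list, one value per band.
    band_pixels: per-band pixel-value list (for correlation and variance)."""
    # Step 1: keep bands where some pair of targets differs by >= diff_thresh
    cand = []
    for b in range(len(band_pixels)):
        vals = [s[b] for s in target_spectra]
        if max(vals) - min(vals) >= diff_thresh:
            cand.append(b)
    # Step 2: greedily group candidate bands with high mutual correlation
    groups = []
    for b in cand:
        for g in groups:
            if abs(pearson(band_pixels[b], band_pixels[g[0]])) >= corr_thresh:
                g.append(b)
                break
        else:
            groups.append([b])
    # Step 3: within each group keep the band with largest normalized variance
    def nvar(x):
        m = sum(x) / len(x)
        return (sum((v - m) ** 2 for v in x) / len(x)) / (m * m) if m else 0.0
    picked = [max(g, key=lambda b: nvar(band_pixels[b])) for g in groups]
    return sorted(picked, key=lambda b: nvar(band_pixels[b]), reverse=True)[:n_bands]
```

Grouping by correlation before ranking ensures the returned bands are both informative and mutually decorrelated.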
Pagnozzi, Alex M; Fiori, Simona; Boyd, Roslyn N; Guzzetta, Andrea; Doecke, James; Gal, Yaniv; Rose, Stephen; Dowson, Nicholas
2016-02-01
Several scoring systems for measuring brain injury severity have been developed to standardize the classification of MRI results, which allows for the prediction of functional outcomes to help plan effective interventions for children with cerebral palsy. The aim of this study is to use statistical techniques to optimize the clinical utility of a recently proposed template-based scoring method by weighting individual anatomical scores of injury, while maintaining its simplicity by retaining only a subset of scored anatomical regions. Seventy-six children with unilateral cerebral palsy were evaluated in terms of upper limb motor function using the Assisting Hand Assessment measure and injuries visible on MRI using a semiquantitative approach. This cohort included 52 children with periventricular white matter injury and 24 with cortical and deep gray matter injuries. A subset of the template-derived cerebral regions was selected using a data-driven region selection algorithm. Linear regression was performed using this subset, with interaction effects excluded. Linear regression improved multiple correlations between MRI-based and Assisting Hand Assessment scores for both periventricular white matter (R squared increased to 0.45 from 0, P < 0.0001) and cortical and deep gray matter (0.84 from 0.44, P < 0.0001) cohorts. In both cohorts, the data-driven approach retained fewer than 8 of the 40 template-derived anatomical regions. The equal or better prediction of the clinically meaningful Assisting Hand Assessment measure using fewer anatomical regions highlights the potential of these developments to enable enhanced quantification of injury and prediction of patient motor outcome, while maintaining the clinical expediency of the scoring approach.
Cipolleschi, Maria Grazia; Rovida, Elisabetta; Sbarba, Persio Dello
2013-01-01
The Culture-Repopulating Ability (CRA) assays are a method to measure in vitro the bone marrow-repopulating potential of haematopoietic cells. The method was developed in our laboratory in the course of studies based on the use of growth factor-supplemented liquid cultures to study haematopoietic stem/progenitor cell resistance to, and selection at, low oxygen tensions in the incubation atmosphere. These studies led us to put forward the first hypothesis of the existence in vivo of haematopoietic stem cell niches where oxygen tension is physiologically lower than in other bone marrow areas. The CRA assays and incubation in low oxygen were later adapted to the study of leukaemias. Stabilized leukaemia cell lines, which ensure genetically homogeneous cells and enhance the repeatability of results, were nevertheless found to be phenotypically heterogeneous, comprising cell subsets exhibiting functional phenotypes of stem or progenitor cells. These subsets can be assayed separately, provided an experimental system capable of selecting one from the other (such as different criteria for incubation in low oxygen) is established. On this basis, a two-step procedure was designed, comprising a primary culture of leukaemia cells in low oxygen for different times, during which drug treatment is applied, followed by the transfer of the residual cell population (CRA assay) to a drug-free secondary culture incubated at standard oxygen tension, where the population is allowed to expand. The CRA assays, applied first to cell lines and then to primary cells, represent a simple and relatively rapid, yet accurate and reliable, method for the pre-screening of drugs potentially active on leukaemias, which in our opinion could be adopted systematically before drugs are tested in vivo. PMID:23394087
Algae for biofuel: will the evolution of weeds limit the enterprise?
Bull, J. J.; Collins, Sinéad
2012-01-01
Algae hold promise as a source of biofuel. Yet the manner in which algae are most efficiently propagated and harvested is different from that used in traditional agriculture. In theory, algae can be grown in continuous culture and harvested frequently to maintain high yields with a short turnaround time. However, maintaining the population in a state of continuous growth will likely impose selection for fast growth, possibly opposing the maintenance of the lipid stores desirable for fuel. Any harvesting that removes a subset of the population and leaves the survivors to establish the next generation may quickly select for traits that escape harvesting. An understanding of these problems should help identify methods for retarding such evolution and enhancing biofuel production. PMID:22946819
A review of channel selection algorithms for EEG signal processing
NASA Astrophysics Data System (ADS)
Alotaiby, Turky; El-Samie, Fathi E. Abd; Alshebeili, Saleh A.; Ahmad, Ishtiaq
2015-12-01
Digital processing of electroencephalography (EEG) signals has now been popularly used in a wide variety of applications such as seizure detection/prediction, motor imagery classification, mental task classification, emotion classification, sleep state classification, and drug effects diagnosis. With the large number of EEG channels acquired, it has become apparent that efficient channel selection algorithms are needed, with their importance varying from one application to another. The main purpose of the channel selection process is threefold: (i) to reduce the computational complexity of any processing task performed on EEG signals by selecting the relevant channels and hence extracting the features of major importance, (ii) to reduce the amount of overfitting that may arise due to the utilization of unnecessary channels, for the purpose of improving the performance, and (iii) to reduce the setup time in some applications. Signal processing tools such as time-domain analysis, power spectral estimation, and wavelet transform have been used for feature extraction and hence for channel selection in most channel selection algorithms. In addition, different evaluation approaches such as filtering, wrapper, embedded, hybrid, and human-based techniques have been widely used for the evaluation of the selected subset of channels. In this paper, we survey the recent developments in the field of EEG channel selection methods along with their applications and classify these methods according to the evaluation approach.
Detection of artifacts from high energy bursts in neonatal EEG.
Bhattacharyya, Sourya; Biswas, Arunava; Mukherjee, Jayanta; Majumdar, Arun Kumar; Majumdar, Bandana; Mukherjee, Suchandra; Singh, Arun Kumar
2013-11-01
Detection of non-cerebral activities or artifacts, intermixed within the background EEG, is essential to discard them from subsequent pattern analysis. The problem is much harder in neonatal EEG, where the background EEG contains spikes, waves, and rapid fluctuations in amplitude and frequency. Existing artifact detection methods are mostly limited to detect only a subset of artifacts such as ocular, muscle or power line artifacts. Few methods integrate different modules, each for detection of one specific category of artifact. Furthermore, most of the reference approaches are implemented and tested on adult EEG recordings. Direct application of those methods on neonatal EEG causes performance deterioration, due to greater pattern variation and inherent complexity. A method for detection of a wide range of artifact categories in neonatal EEG is thus required. At the same time, the method should be specific enough to preserve the background EEG information. The current study describes a feature based classification approach to detect both repetitive (generated from ECG, EMG, pulse, respiration, etc.) and transient (generated from eye blinking, eye movement, patient movement, etc.) artifacts. It focuses on artifact detection within high energy burst patterns, instead of detecting artifacts within the complete background EEG with wide pattern variation. The objective is to find true burst patterns, which can later be used to identify the Burst-Suppression (BS) pattern, which is commonly observed during newborn seizure. Such selective artifact detection is proven to be more sensitive to artifacts and specific to bursts, compared to the existing artifact detection approaches applied on the complete background EEG. Several time domain, frequency domain, statistical features, and features generated by wavelet decomposition are analyzed to model the proposed bi-classification between burst and artifact segments. 
A feature selection method is also applied to select the feature subset producing highest classification accuracy. The suggested feature based classification method is executed using our recorded neonatal EEG dataset, consisting of burst and artifact segments. We obtain 78% sensitivity and 72% specificity as the accuracy measures. The accuracy obtained using the proposed method is found to be about 20% higher than that of the reference approaches. Joint use of the proposed method with our previous work on burst detection outperforms reference methods on simultaneous burst and artifact detection. As the proposed method supports detection of a wide range of artifact patterns, it can be improved to incorporate the detection of artifacts within other seizure patterns and background EEG information as well. © 2013 Elsevier Ltd. All rights reserved.
What MISR data are available for field experiments?
Atmospheric Science Data Center
2014-12-08
MISR data and imagery are available for many field campaigns. Select data products are subset for the region and dates of interest. Special gridded regional products may be available as well as Local Mode data for select targets...
Hierarchical Kohonen net for anomaly detection in network security.
Sarasamma, Suseela T; Zhu, Qiuming A; Huff, Julie
2005-04-01
A novel multilevel hierarchical Kohonen Net (K-Map) for an intrusion detection system is presented. Each level of the hierarchical map is modeled as a simple winner-take-all K-Map. One significant advantage of this multilevel hierarchical K-Map is its computational efficiency. Unlike other statistical anomaly detection methods such as the nearest neighbor approach, K-means clustering, or probabilistic analysis that employ distance computation in the feature space to identify the outliers, our approach does not involve costly point-to-point computation in organizing the data into clusters. Another advantage is the reduced network size. We use the classification capability of the K-Map on selected dimensions of the data set to detect anomalies. Randomly selected subsets that contain both attacks and normal records from the KDD Cup 1999 benchmark data are used to train the hierarchical net. We use a confidence measure to label the clusters. Then we use the test set from the same KDD Cup 1999 benchmark to test the hierarchical net. We show that a hierarchical K-Map in which each layer operates on a small subset of the feature space is superior to a single-layer K-Map operating on the whole feature space in detecting a variety of attacks in terms of detection rate as well as false positive rate.
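A minimal sketch of one winner-take-all K-Map level, operating on a selected subset of feature dimensions as each layer of the hierarchy does; the toy data, learning rate, and training schedule are illustrative assumptions, not the paper's configuration.

```python
import random

def train_kmap(data, dims, n_units, epochs=20, lr=0.5, seed=0):
    """One winner-take-all K-Map level: units compete for each sample on the
    selected feature dimensions `dims`; only the winning unit moves."""
    rng = random.Random(seed)
    units = [[rng.random() for _ in dims] for _ in range(n_units)]
    def d2(unit, x):
        return sum((unit[i] - x[d]) ** 2 for i, d in enumerate(dims))
    for _ in range(epochs):
        for x in data:
            winner = min(range(n_units), key=lambda k: d2(units[k], x))
            for i, d in enumerate(dims):
                # move the winner toward the sample (no neighborhood update)
                units[winner][i] += lr * (x[d] - units[winner][i])
    return units

def assign(units, dims, x):
    """Cluster label for a sample: index of the nearest unit."""
    return min(range(len(units)),
               key=lambda k: sum((units[k][i] - x[d]) ** 2
                                 for i, d in enumerate(dims)))
```

A deeper hierarchy would train another such map, on a different dimension subset, within each cluster found at the level above.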
NASA Astrophysics Data System (ADS)
Alexandre, E.; Cuadra, L.; Nieto-Borge, J. C.; Candil-García, G.; del Pino, M.; Salcedo-Sanz, S.
2015-08-01
Wave parameters computed from time series measured by buoys (significant wave height Hs, mean wave period, etc.) play a key role in coastal engineering and in the design and operation of wave energy converters. Storms or navigation accidents can make measuring buoys break down, leading to missing data gaps. In this paper we tackle the problem of locally reconstructing Hs at out-of-operation buoys by using wave parameters from nearby buoys, based on the spatial correlation among values at neighboring buoy locations. The novelty of our approach for its potential application to problems in coastal engineering is twofold. On one hand, we propose a genetic algorithm hybridized with an extreme learning machine that selects, among the available wave parameters from the nearby buoys, a subset FnSP with nSP parameters that minimizes the Hs reconstruction error. On the other hand, we evaluate to what extent the selected parameters in subset FnSP are good enough in assisting other machine learning (ML) regressors (extreme learning machines, support vector machines and Gaussian process regression) to reconstruct Hs. The results show that all the ML methods explored achieve a good Hs reconstruction at the two locations studied (Caribbean Sea and West Atlantic).
Seismic Characterizations of Fractures: Dynamic Diagnostics
NASA Astrophysics Data System (ADS)
Pyrak-Nolte, L. J.
2017-12-01
Fracture geometry controls fluid flow in a fracture, affects mechanical stability and influences energy partitioning that affects wave scattering. Our ability to detect and monitor fracture evolution is controlled by the frequency of the signal used to probe a fracture system, i.e. frequency selects the scales. No matter the frequency chosen, some set of discontinuities will be optimal for detection because different wavelengths sample different subsets of fractures. The selected subset of fractures is based on the stiffness of the fractures, which in turn is linked to fluid flow. One goal is to obtain information from scales outside the optimal detection regime. Fracture geometry trajectories are a potential approach to drive a fracture system across observation scales, i.e. moving systems between effective medium and scattering regimes. Dynamic trajectories (such as perturbing stress, fluid pressure, chemical alteration, etc.) can be used to perturb fracture geometry to enhance scattering or give rise to discrete modes that are intimately related to the micro-structural evolution of a fracture. However, identification of these signal features will require methods for identifying these micro-structural signatures in complicated scattered fields. Acknowledgment: This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Geosciences Research Program under Award Number (DE-FG02-09ER16022).
Solav, Dana; Rubin, M B; Cereatti, Andrea; Camomilla, Valentina; Wolf, Alon
2016-04-01
Accurate estimation of the position and orientation (pose) of a bone from a cluster of skin markers is limited mostly by the relative motion between the bone and the markers, which is known as the soft tissue artifact (STA). This work presents a method, based on continuum mechanics, to describe the kinematics of a cluster affected by STA. The cluster is characterized by triangular Cosserat point elements (TCPEs) defined by all combinations of three markers. The effects of the STA on the TCPEs are quantified using three parameters describing the strain in each TCPE and the relative rotation and translation between TCPEs. The method was evaluated using previously collected ex vivo kinematic data. Femur pose was estimated from 12 skin markers on the thigh, while its reference pose was measured using bone pins. Analysis revealed that instantaneous subsets of TCPEs exist which estimate bone position and orientation more accurately than the Procrustes Superimposition applied to the cluster of all markers. It has been shown that some of these parameters correlate well with femur pose errors, which suggests that they can be used to select, at each instant, subsets of TCPEs leading to an improved estimation of the underlying bone pose.
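One of the TCPE parameters above, the strain of each marker triangle, can be sketched as follows: enumerate all three-marker combinations, compare each triangle's edge lengths between a reference frame and the current frame, and keep the least-deformed triplets. The "keep the k least-deformed TCPEs" rule is a simplification of the authors' instantaneous subset selection, and the marker data are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def triangle_edges(points):
    """Edge lengths of the triangle spanned by three marker positions."""
    a, b, c = points
    return (dist(a, b), dist(b, c), dist(c, a))

def tcpe_strains(ref_markers, cur_markers):
    """For every 3-marker combination (one TCPE), a scalar strain: the mean
    relative change in the triangle's edge lengths between the reference
    and current frames. Zero strain means the triplet moved rigidly."""
    strains = {}
    for idx in combinations(range(len(ref_markers)), 3):
        e_ref = triangle_edges([ref_markers[i] for i in idx])
        e_cur = triangle_edges([cur_markers[i] for i in idx])
        strains[idx] = sum(abs(l1 - l0) / l0
                           for l0, l1 in zip(e_ref, e_cur)) / 3
    return strains

def least_deformed(strains, k):
    """Select the k TCPEs with smallest strain (a proxy for low STA)."""
    return sorted(strains, key=strains.get)[:k]
```

Pose would then be estimated from the selected low-strain triplets rather than from the full marker cluster.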
Lünse, Christina E.; Corbino, Keith A.; Ames, Tyler D.; Nelson, James W.; Roth, Adam; Perkins, Kevin R.; Sherlock, Madeline E.
2017-01-01
The discovery of structured non-coding RNAs (ncRNAs) in bacteria can reveal new facets of biology and biochemistry. Comparative genomics analyses executed by powerful computer algorithms have successfully been used to uncover many novel bacterial ncRNA classes in recent years. However, this general search strategy favors the discovery of more common ncRNA classes, whereas progressively rarer classes are correspondingly more difficult to identify. In the current study, we confront this problem by devising several methods to select subsets of intergenic regions that can concentrate these rare RNA classes, thereby increasing the probability that comparative sequence analysis approaches will reveal their existence. By implementing these methods, we discovered 224 novel ncRNA classes, which include ROOL RNA, an RNA class averaging 581 nt and present in multiple phyla, several highly conserved and widespread ncRNA classes with properties that suggest sophisticated biochemical functions and a multitude of putative cis-regulatory RNA classes involved in a variety of biological processes. We expect that further research on these newly found RNA classes will reveal additional aspects of novel biology, and allow for greater insights into the biochemistry performed by ncRNAs. PMID:28977401
A Scalable and Robust Multi-Agent Approach to Distributed Optimization
NASA Technical Reports Server (NTRS)
Tumer, Kagan
2005-01-01
Modularizing a large optimization problem so that the solutions to the subproblems provide a good overall solution is a challenging problem. In this paper we present a multi-agent approach to this problem based on aligning the agent objectives with the system objectives, obviating the need to impose external mechanisms to achieve collaboration among the agents. This approach naturally addresses scaling and robustness issues by ensuring that the agents do not rely on the reliable operation of other agents. We test this approach in the difficult distributed optimization problem of imperfect device subset selection [Challet and Johnson, 2002]. In this problem, there are n devices, each of which has a "distortion", and the task is to find the subset of those n devices that minimizes the average distortion. Our results show that in large systems (1000 agents) the proposed approach provides improvements of over an order of magnitude over both traditional optimization methods and traditional multi-agent methods. Furthermore, the results show that even in extreme cases of agent failures (i.e., half the agents fail midway through the simulation) the system remains coordinated and still outperforms a failure-free and centralized optimization algorithm.
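In the cited imperfect-device problem, distortions of opposite sign can partially cancel, so the quantity minimized is the magnitude of the subset's mean distortion. A brute-force baseline, feasible only at toy sizes (the paper's multi-agent approach is what scales to 1000 devices), might look like:

```python
from itertools import combinations

def best_subset(distortions, max_size=None):
    """Exhaustive search for the device subset whose mean distortion is
    closest to zero. Exponential in len(distortions): a baseline only."""
    n = len(distortions)
    max_size = max_size or n
    best, best_err = None, float('inf')
    for k in range(1, max_size + 1):
        for idx in combinations(range(n), k):
            # magnitude of the subset's average distortion
            err = abs(sum(distortions[i] for i in idx)) / k
            if err < best_err:
                best, best_err = idx, err
    return best, best_err
```

Even three devices show the cancellation effect: pairing +0.5 with -0.5 beats any single device.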
Inthachot, Montri; Boonjing, Veera; Intakosum, Sarun
2016-01-01
This study investigated the use of Artificial Neural Network (ANN) and Genetic Algorithm (GA) for prediction of Thailand's SET50 index trend. ANN is a widely accepted machine learning method that uses past data to predict future trend, while GA is an algorithm that can find better subsets of input variables for importing into ANN, hence enabling more accurate prediction by its efficient feature selection. The imported data were chosen technical indicators highly regarded by stock analysts, each represented by 4 input variables that were based on past time spans of 4 different lengths: 3-, 5-, 10-, and 15-day spans before the day of prediction. This import undertaking generated a big set of diverse input variables with an exponentially higher number of possible subsets that GA culled down to a manageable number of more effective ones. SET50 index data of the past 6 years, from 2009 to 2014, were used to evaluate this hybrid intelligence prediction accuracy, and the hybrid's prediction results were found to be more accurate than those made by a method using only one input variable for one fixed length of past time span. PMID:27974883
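The GA component can be sketched as a search over bitmask chromosomes, each bit marking whether an input variable is fed to the network; here a toy fitness function stands in for the paper's ANN-based prediction accuracy, and the population size, crossover, and mutation scheme are illustrative assumptions.

```python
import random

def ga_select(n_features, fitness, pop_size=20, gens=30, p_mut=0.05, seed=1):
    """Tiny GA over bitmask chromosomes; `fitness` scores a feature subset
    (in the paper, a wrapper around an ANN's prediction accuracy)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(gens):
        # keep the fitter half as parents (elitism)
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)       # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([1 - g if rng.random() < p_mut else g
                             for g in child])        # bit-flip mutation
        pop = parents + children
    return max(pop, key=fitness)
```

In the actual study the chromosome would cover all indicator/time-span variables and the fitness call would train and score an ANN, which is far more expensive than this toy stand-in.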
Feature selection with harmony search.
Diao, Ren; Shen, Qiang
2012-12-01
Many search strategies have been exploited for the task of feature selection (FS), in an effort to identify more compact and better quality subsets. Such work typically involves the use of greedy hill climbing (HC), or nature-inspired heuristics, in order to discover the optimal solution without going through exhaustive search. In this paper, a novel FS approach based on harmony search (HS) is presented. It is a general approach that can be used in conjunction with many subset evaluation techniques. The simplicity of HS is exploited to reduce the overall complexity of the search process. The proposed approach is able to escape from local solutions and identify multiple solutions owing to the stochastic nature of HS. Additional parameter control schemes are introduced to reduce the effort and impact of parameter configuration. These can be further combined with the iterative refinement strategy, tailored to enforce the discovery of quality subsets. The resulting approach is compared with those that rely on HC, genetic algorithms, and particle swarm optimization, accompanied by in-depth studies of the suggested improvements.
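A minimal harmony-search feature selector, using the standard HS parameters (harmony memory size, harmony memory considering rate HMCR, pitch adjusting rate PAR); the toy evaluation function in the usage below stands in for the subset evaluation techniques the paper pairs HS with, and the parameter values are illustrative assumptions.

```python
import random

def harmony_search_fs(n_features, evaluate, hms=10, hmcr=0.9, par=0.1,
                      iters=200, seed=0):
    """Harmony-search feature selection sketch: each 'harmony' is a bitmask
    over features. New candidates take pitches from memory (rate hmcr),
    occasionally adjust them (rate par), and otherwise pick random bits."""
    rng = random.Random(seed)
    memory = [[rng.randint(0, 1) for _ in range(n_features)]
              for _ in range(hms)]
    for _ in range(iters):
        new = []
        for j in range(n_features):
            if rng.random() < hmcr:
                bit = rng.choice(memory)[j]   # memory consideration
                if rng.random() < par:
                    bit = 1 - bit             # pitch adjustment
            else:
                bit = rng.randint(0, 1)       # random consideration
            new.append(bit)
        worst = min(range(hms), key=lambda i: evaluate(memory[i]))
        if evaluate(new) > evaluate(memory[worst]):
            memory[worst] = new               # replace the worst harmony
    return max(memory, key=evaluate)
```

Because the memory only ever replaces its worst member with something better, the best subset found never degrades as the search runs longer.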
Garcia-Chimeno, Yolanda; Garcia-Zapirain, Begonya; Gomez-Beldarrain, Marian; Fernandez-Ruanova, Begonya; Garcia-Monco, Juan Carlos
2017-04-13
Feature selection methods are commonly used to identify subsets of relevant features to facilitate the construction of models for classification, yet little is known about how feature selection methods perform on diffusion tensor images (DTIs). In this study, feature selection and machine learning classification methods were tested for the purpose of automating the diagnosis of migraines using both DTIs and questionnaire answers related to emotion and cognition, factors that influence pain perception. We selected 52 adult subjects for the study, divided into three groups: a control group (15), subjects with sporadic migraine (19) and subjects with chronic migraine and medication overuse (18). These subjects underwent diffusion tensor magnetic resonance imaging to assess the white matter pathway integrity of the regions of interest involved in pain and emotion. The tests also gathered data about pathology. The DTI images and test results were then introduced into feature selection algorithms (Gradient Tree Boosting, L1-based, Random Forest and Univariate) to reduce the features of the first dataset, and into classification algorithms (SVM (Support Vector Machine), Boosting (AdaBoost) and Naive Bayes) to classify the migraine group. Moreover, we implemented a committee method to improve classification accuracy based on the feature selection algorithms. When classifying the migraine group, the greatest improvements in accuracy were made using the proposed committee-based feature selection method. Using this approach, the accuracy of classification into three types improved from 67 to 93% when using the Naive Bayes classifier, from 90 to 95% with the support vector machine classifier, and from 93 to 94% with boosting. The features determined to be most useful for classification were related to pain, analgesics and the left uncinate region (connected with pain and emotions).
The proposed feature selection committee method improved the performance of migraine diagnosis classifiers compared to individual feature selection methods, producing a robust system that achieved over 90% accuracy in all classifiers. The results suggest that the proposed methods can be used to support specialists in the classification of migraines in patients undergoing magnetic resonance imaging.
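The committee idea described above can be sketched as a majority vote over the feature subsets chosen by several base selectors. The function and the feature names below are hypothetical illustrations, not the study's actual pipeline:

```python
from collections import Counter

def committee_select(selections, min_votes):
    """Keep features chosen by at least `min_votes` of the base selectors.

    `selections` is a list of feature-name sets, one per base method
    (e.g. gradient tree boosting, L1-based, random forest, univariate).
    """
    votes = Counter(f for sel in selections for f in set(sel))
    return {f for f, v in votes.items() if v >= min_votes}

# Hypothetical feature names, for illustration only.
base = [
    {"analgesics", "pain_score", "left_uncinate_FA"},  # gradient boosting
    {"analgesics", "pain_score", "age"},               # L1-based
    {"pain_score", "left_uncinate_FA", "analgesics"},  # random forest
    {"pain_score", "sleep_quality"},                   # univariate
]
consensus = committee_select(base, min_votes=3)
```

Requiring agreement among selectors tends to discard features that only one method finds useful, which is one plausible reason a committee can be more robust than any single selector.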
Multi-Domain Transfer Learning for Early Diagnosis of Alzheimer's Disease.
Cheng, Bo; Liu, Mingxia; Shen, Dinggang; Li, Zuoyong; Zhang, Daoqiang
2017-04-01
Recently, transfer learning has been successfully applied in early diagnosis of Alzheimer's Disease (AD) based on multi-domain data. However, most existing methods only use data from a single auxiliary domain, and thus cannot utilize the intrinsically useful correlation information from multiple domains. Accordingly, in this paper, we consider the joint learning of tasks in multi-auxiliary domains and the target domain, and propose a novel Multi-Domain Transfer Learning (MDTL) framework for early diagnosis of AD. Specifically, the proposed MDTL framework consists of two key components: 1) a multi-domain transfer feature selection (MDTFS) model that selects the most informative feature subset from multi-domain data, and 2) a multi-domain transfer classification (MDTC) model that can identify disease status for early AD detection. We evaluate our method on 807 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database using baseline magnetic resonance imaging (MRI) data. The experimental results show that the proposed MDTL method can effectively utilize multi-auxiliary domain data for improving the learning performance in the target domain, compared with several state-of-the-art methods.
Koutsoukas, Alexios; Paricharak, Shardul; Galloway, Warren R J D; Spring, David R; Ijzerman, Adriaan P; Glen, Robert C; Marcus, David; Bender, Andreas
2014-01-27
Chemical diversity is a widely applied approach to select structurally diverse subsets of molecules, often with the objective of maximizing the number of hits in biological screening. While many methods exist in the area, few systematic comparisons using current descriptors, in particular with the objective of assessing diversity in bioactivity space, have been published, a shortage that the current study aims to address. In this work, 13 widely used molecular descriptors were compared, including fingerprint-based descriptors (ECFP4, FCFP4, MACCS keys), pharmacophore-based descriptors (TAT, TAD, TGT, TGD, GpiDAPH3), shape-based descriptors (rapid overlay of chemical structures (ROCS) and principal moments of inertia (PMI)), a connectivity-matrix-based descriptor (BCUT), physicochemical-property-based descriptors (prop2D), and a more recently introduced molecular descriptor type (namely, "Bayes Affinity Fingerprints"). We assessed both the similar behavior of the descriptors in assessing the diversity of chemical libraries, and their ability to select compounds from libraries that are diverse in bioactivity space, which is a property of much practical relevance in screening library design. This is particularly evident, given that many future targets to be screened are not known in advance, but that the library should still maximize the likelihood of containing bioactive matter also for future screening campaigns. Overall, our results showed that descriptors based on atom topology (i.e., fingerprint-based descriptors and pharmacophore-based descriptors) correlate well in rank-ordering compounds, both within and between descriptor types. On the other hand, shape-based descriptors such as ROCS and PMI showed weak correlation with the other descriptors utilized in this study, demonstrating significantly different behavior.
We then applied eight of the molecular descriptors compared in this study to sample a diverse subset of sample compounds (4%) from an initial population of 2587 compounds, covering the 25 largest human activity classes from ChEMBL and measured the coverage of activity classes by the subsets. Here, it was found that "Bayes Affinity Fingerprints" achieved an average coverage of 92% of activity classes. Using the descriptors ECFP4, GpiDAPH3, TGT, and random sampling, 91%, 84%, 84%, and 84% of the activity classes were represented in the selected compounds respectively, followed by BCUT, prop2D, MACCS, and PMI (in order of decreasing performance). In addition, we were able to show that there is no visible correlation between compound diversity in PMI space and in bioactivity space, despite frequent utilization of PMI plots to this end. To summarize, in this work, we assessed which descriptors select compounds with high coverage of bioactivity space, and can hence be used for diverse compound selection for biological screening. In cases where multiple descriptors are to be used for diversity selection, this work describes which descriptors behave complementarily, and can hence be used jointly to focus on different aspects of diversity in chemical space.
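The diversity selection discussed above can be illustrated with a minimal MaxMin picker driven by Tanimoto similarity. The fingerprints below are toy on-bit sets, not real ECFP4 output:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def maxmin_pick(fps, n_pick, seed_idx=0):
    """Greedy MaxMin diversity selection: repeatedly add the compound whose
    maximum similarity to the already-picked set is smallest."""
    picked = [seed_idx]
    while len(picked) < n_pick:
        rest = [i for i in range(len(fps)) if i not in picked]
        best = min(rest, key=lambda i: max(tanimoto(fps[i], fps[j]) for j in picked))
        picked.append(best)
    return picked

# Toy on-bit sets standing in for ECFP4-style fingerprints.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {2, 3, 4}]
subset = maxmin_pick(fps, n_pick=2)
```

Starting from compound 0, the picker jumps to compound 2, the only one sharing no bits with it; whether such structural diversity translates into bioactivity-space coverage is exactly the question the study addresses.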
Kalkan, Erol; Chopra, Anil K.
2010-01-01
Earthquake engineering practice is increasingly using nonlinear response history analysis (RHA) to demonstrate the performance of structures. This rigorous method of analysis requires the selection and scaling of ground motions appropriate to design hazard levels. Presented herein is a modal-pushover-based scaling (MPS) method to scale ground motions for use in nonlinear RHA of buildings and bridges. In the MPS method, the ground motions are scaled to match (to a specified tolerance) a target value of the inelastic deformation of the first-"mode" inelastic single-degree-of-freedom (SDF) system whose properties are determined by first-"mode" pushover analysis. Appropriate for first-"mode" dominated structures, this approach is extended for structures with significant contributions of higher modes by considering the elastic deformation of the second-"mode" SDF system in selecting a subset of the scaled ground motions. Based on results presented for two bridges, covering single- and multi-span "ordinary standard" bridge types, and six buildings, covering low-, mid-, and tall building types in California, the accuracy and efficiency of the MPS procedure are established and its superiority over the ASCE/SEI 7-05 scaling procedure is demonstrated.
NASA Astrophysics Data System (ADS)
Yang, Jian; He, Yuhong
2017-02-01
Quantifying impervious surfaces in urban and suburban areas is a key step toward a sustainable urban planning and management strategy. With the availability of fine-scale remote sensing imagery, automated mapping of impervious surfaces has attracted growing attention. However, the vast majority of existing studies have selected pixel-based and object-based methods for impervious surface mapping, with few adopting sub-pixel analysis of high spatial resolution imagery. This research makes use of a vegetation-bright impervious-dark impervious linear spectral mixture model to characterize urban and suburban surface components. A WorldView-3 image acquired on May 9th, 2015 is analyzed for its potential in automated unmixing of meaningful surface materials for two urban subsets and one suburban subset in Toronto, ON, Canada. Given the wide distribution of shadows in urban areas, the linear spectral unmixing is implemented in non-shadowed and shadowed areas separately for the two urban subsets. The results indicate that the accuracy of impervious surface mapping in suburban areas reaches up to 86.99%, much higher than the accuracies in urban areas (80.03% and 79.67%). Despite its merits in mapping accuracy and automation, the application of our proposed vegetation-bright impervious-dark impervious model to map impervious surfaces is limited by the absence of a soil component. To further extend the operational transferability of our proposed method, especially for areas where abundant bare soil exists during urbanization or reclamation, it is still necessary to mask out bare soils by automated classification prior to the implementation of linear spectral unmixing.
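Linear spectral unmixing of the kind used above can be sketched as a least-squares abundance estimate with a weighted sum-to-one constraint. The three-band endmember spectra below are toy values, not WorldView-3 measurements, and `solve` is a tiny Gauss-Jordan helper:

```python
def solve(M, v):
    """Solve the square system M f = v by Gauss-Jordan elimination with
    partial pivoting (sufficient for the tiny systems used here)."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(n):
            if r != c:
                k = A[r][c] / A[c][c]
                A[r] = [x - k * y for x, y in zip(A[r], A[c])]
    return [A[i][n] / A[i][i] for i in range(n)]

def unmix(endmembers, pixel, w=10.0):
    """Least-squares endmember abundances; the sum-to-one constraint is
    imposed softly via an extra heavily weighted equation."""
    bands, m = len(pixel), len(endmembers)
    rows = [[e[b] for e in endmembers] for b in range(bands)] + [[w] * m]
    rhs = list(pixel) + [w]
    M = [[sum(row[i] * row[j] for row in rows) for j in range(m)] for i in range(m)]
    v = [sum(rows[r][i] * rhs[r] for r in range(len(rows))) for i in range(m)]
    return solve(M, v)

# Toy spectra for vegetation, bright impervious and dark impervious.
E = [[0.05, 0.5, 0.3], [0.4, 0.45, 0.5], [0.08, 0.08, 0.08]]
truth = [0.6, 0.3, 0.1]
pixel = [sum(E[j][b] * truth[j] for j in range(3)) for b in range(3)]
frac = unmix(E, pixel)
```

Since the synthetic pixel is an exact mixture, the recovered fractions match the known abundances; real imagery additionally requires handling shadow and, as the abstract notes, bare soil.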
Adoptive therapy with chimeric antigen receptor-modified T cells of defined subset composition.
Riddell, Stanley R; Sommermeyer, Daniel; Berger, Carolina; Liu, Lingfeng Steven; Balakrishnan, Ashwini; Salter, Alex; Hudecek, Michael; Maloney, David G; Turtle, Cameron J
2014-01-01
The ability to engineer T cells to recognize tumor cells through genetic modification with a synthetic chimeric antigen receptor has ushered in a new era in cancer immunotherapy. The most advanced clinical applications are in targeting CD19 on B-cell malignancies. The clinical trials of CD19 chimeric antigen receptor therapy have thus far not attempted to select defined subsets before transduction or imposed uniformity of the CD4 and CD8 cell composition of the cell products. This review will discuss the rationale for and challenges to using adoptive therapy with genetically modified T cells of defined subset and phenotypic composition.
Synthesis and evaluation of 7-substituted 4-aminoquinoline analogues for antimalarial activity.
Hwang, Jong Yeon; Kawasuji, Takashi; Lowes, David J; Clark, Julie A; Connelly, Michele C; Zhu, Fangyi; Guiguemde, W Armand; Sigal, Martina S; Wilson, Emily B; Derisi, Joseph L; Guy, R Kiplin
2011-10-27
We previously reported that substituted 4-aminoquinolines with a phenyl ether substituent at the 7-position of the quinoline ring and the capability of intramolecular hydrogen bonding between the protonated amine on the side chain and a hydrogen bond acceptor on the amine's alkyl substituents exhibited potent antimalarial activity against the multidrug resistant strain P. falciparum W2. We employed a parallel synthetic method to generate diaryl ether, biaryl, and alkylaryl 4-aminoquinoline analogues in the background of a limited number of side chain variations that had previously afforded potent 4-aminoquinolines. All subsets were evaluated for their antimalarial activity against the chloroquine-sensitive strain 3D7 and the chloroquine-resistant K1 strain as well as for cytotoxicity against mammalian cell lines. While all three arrays showed good antimalarial activity, only the biaryl-containing subset showed consistently good potency against the drug-resistant K1 strain and good selectivity with regard to mammalian cytotoxicity. Overall, our data indicate that the biaryl-containing series contains promising candidates for further study.
NASA Astrophysics Data System (ADS)
Parker, L.; Dye, R. A.; Perez, J.; Rinsland, P.
2012-12-01
Over the past decade the Atmospheric Science Data Center (ASDC) at NASA Langley Research Center has archived and distributed a variety of satellite mission and aircraft campaign data sets. These datasets posed unique challenges to the user community at large due to the sheer volume and variety of the data and the lack of intuitive features in the order tools available to the investigator. Some of these data sets also lack sufficient metadata to provide rudimentary data discovery. To meet the needs of emerging users, the ASDC addressed issues in data discovery and delivery through the use of standards in data and access methods, and distribution through appropriate portals. The ASDC is currently undergoing a refresh of its webpages and Ordering Tools that will leverage updated collection-level metadata in an effort to enhance the user experience. The ASDC is now providing search and subset capability for key mission satellite data sets. The ASDC has collaborated with Science Teams to accommodate prospective science users in the climate and modeling communities. The ASDC is using a common framework that enables more rapid development and deployment of search and subset tools that provide enhanced access features for the user community. Features of the Search and Subset web application enable a more sophisticated approach to selecting and ordering data subsets by parameter, date, time, and geographic area. The ASDC has also applied key practices from satellite missions to the multi-campaign aircraft missions executed for Earth Venture-1 and MEaSUReS.
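Subsetting by date and geographic area, as described above, reduces to a simple metadata filter. The record fields (`date`, `lat`, `lon`) and granule IDs below are hypothetical stand-ins for real archive metadata:

```python
from datetime import date

def subset(records, start, end, bbox):
    """Select records whose date falls in [start, end] and whose (lat, lon)
    lies inside the bounding box (lat_min, lat_max, lon_min, lon_max)."""
    lat_min, lat_max, lon_min, lon_max = bbox
    return [r for r in records
            if start <= r["date"] <= end
            and lat_min <= r["lat"] <= lat_max
            and lon_min <= r["lon"] <= lon_max]

# Hypothetical granule metadata records.
granules = [
    {"id": "g1", "date": date(2012, 6, 1), "lat": 37.0, "lon": -76.3},
    {"id": "g2", "date": date(2012, 6, 2), "lat": 10.0, "lon": 5.0},
    {"id": "g3", "date": date(2011, 1, 1), "lat": 37.5, "lon": -76.0},
]
hits = subset(granules, date(2012, 1, 1), date(2012, 12, 31), (30, 40, -80, -70))
```

A production subsetter would additionally filter by parameter and clip the data within each granule, but the selection logic is the same shape.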
Selecting a climate model subset to optimise key ensemble properties
NASA Astrophysics Data System (ADS)
Herger, Nadja; Abramowitz, Gab; Knutti, Reto; Angélil, Oliver; Lehmann, Karsten; Sanderson, Benjamin M.
2018-02-01
End users studying impacts and risks caused by human-induced climate change are often presented with large multi-model ensembles of climate projections whose composition and size are arbitrarily determined. An efficient and versatile method that finds a subset which maintains certain key properties from the full ensemble is needed, but very little work has been done in this area. Therefore, users typically make their own somewhat subjective subset choices and commonly use the equally weighted model mean as a best estimate. However, different climate model simulations cannot necessarily be regarded as independent estimates due to the presence of duplicated code and shared development history. Here, we present an efficient and flexible tool that makes better use of the ensemble as a whole by finding a subset with improved mean performance compared to the multi-model mean while at the same time maintaining the spread and addressing the problem of model interdependence. Out-of-sample skill and reliability are demonstrated using model-as-truth experiments. This approach is illustrated with one set of optimisation criteria but we also highlight the flexibility of cost functions, depending on the focus of different users. The technique is useful for a range of applications that, for example, minimise present-day bias to obtain an accurate ensemble mean, reduce dependence in ensemble spread, maximise future spread, ensure good performance of individual models in an ensemble, reduce the ensemble size while maintaining important ensemble characteristics, or optimise several of these at the same time. As in any calibration exercise, the final ensemble is sensitive to the metric, observational product, and pre-processing steps used.
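The subset-selection problem above can be caricatured as a search over candidate subsets under a cost function. The scalar "simulations" and the cost (mean bias plus loss of spread) below are illustrative assumptions, not the paper's actual optimisation criteria:

```python
from itertools import combinations

def pick_subset(models, obs, k, alpha=1.0):
    """Exhaustively search all k-member subsets, minimising a cost that
    combines subset-mean bias against `obs` with loss of ensemble spread."""
    full_spread = max(models) - min(models)
    best, best_cost = None, float("inf")
    for sub in combinations(models, k):
        bias = abs(sum(sub) / k - obs)
        spread_loss = abs((max(sub) - min(sub)) - full_spread)
        cost = bias + alpha * spread_loss
        if cost < best_cost:
            best, best_cost = sub, cost
    return best

# Toy scalar "simulations" (e.g. a regional mean temperature) and an observation.
ens = [14.2, 15.1, 15.9, 16.4, 13.8, 17.0]
sub = pick_subset(ens, obs=15.5, k=3)
```

Exhaustive search is only viable for small ensembles; the paper's tool uses proper optimisation, multivariate fields, and additional terms such as model interdependence, but the trade-off between accurate mean and preserved spread is the same.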
Zu, Chen; Jie, Biao; Liu, Mingxia; Chen, Songcan
2015-01-01
Multimodal classification methods using different modalities of imaging and non-imaging data have recently shown great advantages over traditional single-modality-based ones for diagnosis and prognosis of Alzheimer’s disease (AD), as well as its prodromal stage, i.e., mild cognitive impairment (MCI). However, to the best of our knowledge, most existing methods focus on mining the relationship across multiple modalities of the same subjects, while ignoring the potentially useful relationship across different subjects. Accordingly, in this paper, we propose a novel learning method for multimodal classification of AD/MCI, by fully exploring the relationships across both modalities and subjects. Specifically, our proposed method includes two subsequent components, i.e., label-aligned multi-task feature selection and multimodal classification. In the first step, the feature selection learning from multiple modalities are treated as different learning tasks and a group sparsity regularizer is imposed to jointly select a subset of relevant features. Furthermore, to utilize the discriminative information among labeled subjects, a new label-aligned regularization term is added into the objective function of standard multi-task feature selection, where label-alignment means that all multi-modality subjects with the same class labels should be closer in the new feature-reduced space. In the second step, a multi-kernel support vector machine (SVM) is adopted to fuse the selected features from multi-modality data for final classification. To validate our method, we perform experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database using baseline MRI and FDG-PET imaging data. The experimental results demonstrate that our proposed method achieves better classification performance compared with several state-of-the-art methods for multimodal classification of AD/MCI. PMID:26572145
NASA Astrophysics Data System (ADS)
O'Connell, D.; Ruan, D.; Thomas, D. H.; Dou, T. H.; Lewis, J. H.; Santhanam, A.; Lee, P.; Low, D. A.
2018-02-01
Breathing motion modeling requires observation of tissues at sufficiently distinct respiratory states for proper 4D characterization. This work proposes a method to improve sampling of the breathing cycle with limited imaging dose. We designed and tested a prospective free-breathing acquisition protocol with a simulation using datasets from five patients imaged with a model-based 4DCT technique. Each dataset contained 25 free-breathing fast helical CT scans with simultaneous breathing surrogate measurements. Tissue displacements were measured using deformable image registration. A correspondence model related tissue displacement to the surrogate. Model residual was computed by comparing predicted displacements to image registration results. To determine a stopping criteria for the prospective protocol, i.e. when the breathing cycle had been sufficiently sampled, subsets of N scans where 5 ⩽ N ⩽ 9 were used to fit reduced models for each patient. A previously published metric was employed to describe the phase coverage, or ‘spread’, of the respiratory trajectories of each subset. Minimum phase coverage necessary to achieve mean model residual within 0.5 mm of the full 25-scan model was determined and used as the stopping criteria. Using the patient breathing traces, a prospective acquisition protocol was simulated. In all patients, phase coverage greater than the threshold necessary for model accuracy within 0.5 mm of the 25 scan model was achieved in six or fewer scans. The prospectively selected respiratory trajectories ranked in the (97.5 ± 4.2)th percentile among subsets of the originally sampled scans on average. Simulation results suggest that the proposed prospective method provides an effective means to sample the breathing cycle with limited free-breathing scans. One application of the method is to reduce the imaging dose of a previously published model-based 4DCT protocol to 25% of its original value while achieving mean model residual within 0.5 mm.
Relevant, irredundant feature selection and noisy example elimination.
Lashkia, George V; Anthony, Laurence
2004-04-01
In many real-world situations, the method for computing the desired output from a set of inputs is unknown. One strategy for solving these types of problems is to learn the input-output functionality from examples in a training set. However, in many situations it is difficult to know what information is relevant to the task at hand. Subsequently, researchers have investigated ways to deal with the so-called problem of consistency of attributes, i.e., attributes that can distinguish examples from different classes. In this paper, we first prove that the notion of relevance of attributes is directly related to the consistency of attributes, and show how relevant, irredundant attributes can be selected. We then compare different relevant attribute selection algorithms, and show the superiority of algorithms that select irredundant attributes over those that select relevant attributes. We also show that searching for an "optimal" subset of attributes, which is considered to be the main purpose of attribute selection, is not the best way to improve the accuracy of classifiers. Employing sets of relevant, irredundant attributes improves classification accuracy in many more cases. Finally, we propose a new method for selecting relevant examples, which is based on filtering the so-called pattern frequency domain. By identifying examples that are nontypical in the determination of relevant, irredundant attributes, irrelevant examples can be eliminated prior to the learning process. Empirical results using artificial and real databases show the effectiveness of the proposed method in selecting relevant examples leading to improved performance even on greatly reduced training sets.
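The consistency notion above can be made concrete: an attribute subset is consistent if no two examples agreeing on every selected attribute carry different class labels. The backward-elimination loop below is a simplified stand-in for the paper's selection algorithms:

```python
def consistent(data, labels, attrs):
    """True if no two examples that agree on every attribute in `attrs`
    have different class labels."""
    seen = {}
    for row, y in zip(data, labels):
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, y) != y:
            return False
    return True

def select_irredundant(data, labels):
    """Greedy backward elimination: drop any attribute whose removal keeps
    the remaining subset consistent."""
    attrs = list(range(len(data[0])))
    for a in list(attrs):
        trial = [x for x in attrs if x != a]
        if trial and consistent(data, labels, trial):
            attrs = trial
    return attrs

# Attribute 0 duplicates attribute 1; attribute 2 is noise.
X = [(0, 0, 5), (1, 1, 5), (0, 0, 7), (1, 1, 9)]
y = ["a", "b", "a", "b"]
kept = select_irredundant(X, y)
```

On this toy data the redundant copy (attribute 0) and the noise attribute (2) are eliminated, leaving a single relevant, irredundant attribute that still distinguishes the classes.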
Chen, Chung-De; Huang, Yen-Chieh; Chiang, Hsin-Lin; Hsieh, Yin-Cheng; Guan, Hong-Hsiang; Chuankhayan, Phimonphan; Chen, Chun-Jung
2014-09-01
Optimization of the initial phasing has been a decisive factor in the success of the subsequent electron-density modification, model building and structure determination of biological macromolecules using the single-wavelength anomalous dispersion (SAD) method. Two possible phase solutions (φ1 and φ2) generated from two symmetric phase triangles in the Harker construction for the SAD method cause the well known phase ambiguity. A novel direct phase-selection method utilizing the θ(DS) list as a criterion to select optimized phases φ(am) from φ1 or φ2 of a subset of reflections with a high percentage of correct phases to replace the corresponding initial SAD phases φ(SAD) has been developed. Based on this work, reflections with an angle θ(DS) in the range 35-145° are selected for an optimized improvement, where θ(DS) is the angle between the initial phase φ(SAD) and a preliminary density-modification (DM) phase φ(DM)(NHL). The results show that utilizing the additional direct phase-selection step prior to simple solvent flattening without phase combination using existing DM programs, such as RESOLVE or DM from CCP4, significantly improves the final phases in terms of increased correlation coefficients of electron-density maps and diminished mean phase errors. With the improved phases and density maps from the direct phase-selection method, the completeness of residues of protein molecules built with main chains and side chains is enhanced for efficient structure determination.
Hyperspectral image classification based on local binary patterns and PCANet
NASA Astrophysics Data System (ADS)
Yang, Huizhen; Gao, Feng; Dong, Junyu; Yang, Yang
2018-04-01
Hyperspectral image classification has been well acknowledged as one of the challenging tasks of hyperspectral data processing. In this paper, we propose a novel hyperspectral image classification framework based on local binary pattern (LBP) features and PCANet. In the proposed method, linear prediction error (LPE) is first employed to select a subset of informative bands, and LBP is utilized to extract texture features. Then, the spectral and texture features are stacked into a high-dimensional vector. Next, the extracted features of a specified position are transformed to a 2-D image. The obtained images of all pixels are fed into PCANet for classification. Experimental results on a real hyperspectral dataset demonstrate the effectiveness of the proposed method.
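The LBP texture feature used above can be illustrated for a single 3x3 patch. This is the basic 8-neighbour LBP, without the rotation-invariant or uniform-pattern refinements the literature often adds:

```python
def lbp_code(patch):
    """8-neighbour local binary pattern of the centre pixel of a 3x3 patch.
    Neighbours are visited clockwise from the top-left; each contributes one
    bit, set when the neighbour is >= the centre value."""
    c = patch[1][1]
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (i, j) in enumerate(order):
        if patch[i][j] >= c:
            code |= 1 << bit
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch)
```

Applying this at every pixel and histogramming the codes over a window yields the texture descriptor that is stacked with the spectral bands in the framework above.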
Bliss, Sarah A; Paul, Sunirmal; Pobiarzyn, Piotr W; Ayer, Seda; Sinha, Garima; Pant, Saumya; Hilton, Holly; Sharma, Neha; Cunha, Maria F; Engelberth, Daniel J; Greco, Steven J; Bryan, Margarette; Kucia, Magdalena J; Kakar, Sham S; Ratajczak, Mariusz Z; Rameshwar, Pranela
2018-01-10
This study proposes that a novel developmental hierarchy of breast cancer (BC) cells (BCCs) could predict treatment response and outcome. The continued challenge to treat BC requires stratification of BCCs into distinct subsets. This would provide insights on how BCCs evade treatment and adapt to dormancy for decades. We selected three subsets, based on the relative expression of octamer-binding transcription factor 4A (Oct4A), and then analysed each with an Affymetrix gene chip. Oct4A is a stem cell gene and would separate subsets based on maturation. Data analyses and gene validation identified three membrane proteins, TMEM98, GPR64 and FAT4. BCCs from cell lines and blood from BC patients were analysed for these three membrane proteins by flow cytometry, along with known markers of cancer stem cells (CSCs), CD44, CD24 and Oct4, aldehyde dehydrogenase 1 (ALDH1) activity and telomere length. A novel working hierarchy of BCCs was established with the most immature subset as CSCs. This group was further subdivided into long- and short-term CSCs. Analyses of 20 post-treatment blood samples indicated that circulating CSCs and early BC progenitors may be associated with recurrence or early death. These results suggest that the novel hierarchy may predict treatment response and prognosis.
Morikawa, Masatoshi; Tsujibe, Satoshi; Kiyoshima-Shibata, Junko; Watanabe, Yohei; Kato-Nagaoka, Noriko; Shida, Kan; Matsumoto, Satoshi
2016-01-01
Phagocytes such as dendritic cells and macrophages, which are distributed in the small intestinal mucosa, play a crucial role in maintaining mucosal homeostasis by sampling the luminal gut microbiota. However, there is limited information regarding microbial uptake in a steady state. We investigated the composition of murine gut microbiota that is engulfed by phagocytes of specific subsets in the small intestinal lamina propria (SILP) and Peyer’s patches (PP). Analysis of bacterial 16S rRNA gene amplicon sequences revealed that: 1) all the phagocyte subsets in the SILP primarily engulfed Lactobacillus (the most abundant microbe in the small intestine), whereas CD11bhi and CD11bhiCD11chi cell subsets in PP mostly engulfed segmented filamentous bacteria (indigenous bacteria in rodents that are reported to adhere to intestinal epithelial cells); and 2) among the Lactobacillus species engulfed by the SILP cell subsets, L. murinus was engulfed more frequently than L. taiwanensis, although both these Lactobacillus species were abundant in the small intestine under physiological conditions. These results suggest that small intestinal microbiota is selectively engulfed by phagocytes that localize in the adjacent intestinal mucosa in a steady state. These observations may provide insight into the crucial role of phagocytes in immune surveillance of the small intestinal mucosa. PMID:27701454
Kavuluru, Ramakanth; Rios, Anthony; Lu, Yuan
2015-01-01
Background Diagnosis codes are assigned to medical records in healthcare facilities by trained coders by reviewing all physician-authored documents associated with a patient's visit. This is a necessary and complex task in which coders adhere to coding guidelines and assign all applicable codes. With the popularity of electronic medical records (EMRs), computational approaches to code assignment have been proposed in recent years. However, most efforts have focused on single and often short clinical narratives, while realistic scenarios warrant full EMR-level analysis for code assignment. Objective We evaluate supervised learning approaches to automatically assign international classification of diseases (ninth revision) - clinical modification (ICD-9-CM) codes to EMRs by experimenting with a large realistic EMR dataset. The overall goal is to identify methods that offer superior performance in this task when considering such datasets. Methods We use a dataset of 71,463 EMRs corresponding to in-patient visits with discharge dates falling in a two-year period (2011–2012) from the University of Kentucky (UKY) Medical Center. We curate a smaller subset of this dataset and also use a third gold standard dataset of radiology reports. We conduct experiments using different problem transformation approaches with feature and data selection components and employing suitable label calibration and ranking methods with novel features involving code co-occurrence frequencies and latent code associations. Results Over all codes with at least 50 training examples we obtain a micro F-score of 0.48. On the set of codes that occur in at least 1% of the two-year dataset, we achieve a micro F-score of 0.54. For the smaller radiology report dataset, the classifier chaining approach yields the best results. For the smaller subset of the UKY dataset, feature selection, data selection, and label calibration offer the best performance.
Conclusions We show that datasets at different scale (size of the EMRs, number of distinct codes) and with different characteristics warrant different learning approaches. For shorter narratives pertaining to a particular medical subdomain (e.g., radiology, pathology), classifier chaining is ideal given the codes are highly related with each other. For realistic in-patient full EMRs, feature and data selection methods offer high performance for smaller datasets. However, for large EMR datasets, we observe that the binary relevance approach with learning-to-rank based code reranking offers the best performance. Regardless of the training dataset size, for general EMRs, label calibration to select the optimal number of labels is an indispensable final step. PMID:26054428
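The "label calibration to select the optimal number of labels" step in the conclusions can be sketched as ranking codes by score and then cutting the list at a calibrated k. The ICD-9 codes, scores and the threshold calibrator below are hypothetical stand-ins for the learned models:

```python
def rank_and_calibrate(scores, predict_k):
    """Rank candidate codes by classifier score, then keep the top k labels,
    where k comes from a separate calibration model `predict_k`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: predict_k(scores)]

def threshold_calibrator(scores, t=0.5):
    """Toy calibrator: count scores above a threshold (always keep >= 1)."""
    return max(1, sum(s >= t for s in scores.values()))

# Hypothetical per-code classifier scores for one EMR.
scores = {"428.0": 0.91, "401.9": 0.77, "250.00": 0.55, "786.05": 0.12}
codes = rank_and_calibrate(scores, threshold_calibrator)
```

In the paper the calibrator is itself learned (and the ranking comes from learning-to-rank), but the two-stage shape of the final step is the same.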
A ROC-based feature selection method for computer-aided detection and diagnosis
NASA Astrophysics Data System (ADS)
Wang, Songyuan; Zhang, Guopeng; Liao, Qimei; Zhang, Junying; Jiao, Chun; Lu, Hongbing
2014-03-01
Image-based computer-aided detection and diagnosis (CAD) has been a very active research topic aiming to assist physicians in detecting lesions and distinguishing benign from malignant ones. However, the datasets fed into a classifier usually suffer from a small number of samples, as well as significantly fewer samples available in one class (have a disease) than the other, resulting in the classifier's suboptimal performance. Identifying the most characterizing features of the observed data for lesion detection is critical to improving the sensitivity and minimizing the false positives of a CAD system. In this study, we propose a novel feature selection method, mR-FAST, that combines the minimal-redundancy-maximal-relevance (mRMR) framework with the selection metric FAST (feature assessment by sliding thresholds), which is based on the area under a ROC curve (AUC) generated on optimal simple linear discriminants. With three feature datasets extracted from CAD systems for colon polyps and bladder cancer, we show that the space of candidate features selected by mR-FAST is more characterizing for lesion detection, with higher AUC, making it possible to find a compact subset of superior features at low cost.
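The AUC at the core of the FAST metric can be computed directly from scores via the Mann-Whitney statistic, without tracing the ROC curve; the scores below are illustrative:

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the Mann-Whitney U statistic: the
    probability that a random positive outranks a random negative
    (ties count one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

a = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Because AUC is a ranking measure, it is insensitive to the class imbalance the abstract highlights, which is one reason it makes a sensible per-feature selection metric.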
Efficient least angle regression for identification of linear-in-the-parameters models
Beach, Thomas H.; Rezgui, Yacine
2017-01-01
Least angle regression, as a promising model selection method, differentiates itself from conventional stepwise and stagewise methods in that it is neither too greedy nor too slow. It is closely related to L1-norm optimization, which has the advantage of low prediction variance, sacrificing part of the model bias property in order to enhance model generalization capability. In this paper, we propose an efficient least angle regression algorithm for model selection over a large class of linear-in-the-parameters models, with the purpose of accelerating the model selection process. The entire algorithm works completely in a recursive manner, where the correlations between model terms and residuals, the evolving directions and other pertinent variables are derived explicitly and updated successively at every subset selection step. The model coefficients are only computed when the algorithm finishes, so direct matrix inversions are avoided. A detailed computational complexity analysis indicates that the proposed algorithm possesses significant computational efficiency, compared with the original approach in which the well-known efficient Cholesky decomposition is involved in solving least angle regression. Three artificial and real-world examples are employed to demonstrate the effectiveness, efficiency and numerical stability of the proposed algorithm. PMID:28293140
Zhang, Dongqing; Zhao, Yiyuan; Noble, Jack H; Dawant, Benoit M
2018-04-01
Cochlear implants (CIs) are neural prostheses that restore hearing using an electrode array implanted in the cochlea. After implantation, the CI processor is programmed by an audiologist. One factor that negatively impacts outcomes and can be addressed by programming is cross-electrode neural stimulation overlap (NSO). We have proposed a system to assist the audiologist in programming the CI that we call image-guided CI programming (IGCIP). IGCIP permits using CT images to detect NSO and recommend deactivation of a subset of electrodes to avoid NSO. We have shown that IGCIP significantly improves hearing outcomes. Most of the IGCIP steps are robustly automated but electrode configuration selection still sometimes requires manual intervention. With expertise, distance-versus-frequency curves, which are a way to visualize the spatial relationship learned from CT between the electrodes and the nerves they stimulate, can be used to select the electrode configuration. We propose an automated technique for electrode configuration selection. A comparison between this approach and one we have previously proposed shows that our method produces results that are as good as those obtained with our previous method while being generic and requiring fewer parameters.
A picture's worth a thousand words: a food-selection observational method.
Carins, Julia E; Rundle-Thiele, Sharyn R; Parkinson, Joy E
2016-05-04
Issue addressed: Methods are needed to accurately measure and describe behaviour so that social marketers and other behaviour change researchers can gain consumer insights before designing behaviour change strategies and so, in time, they can measure the impact of strategies or interventions when implemented. This paper describes a photographic method developed to meet these needs. Methods: Direct observation and photographic methods were developed and used to capture food-selection behaviour and examine those selections according to their healthfulness. Four meals (two lunches and two dinners) were observed at a workplace buffet-style cafeteria over a 1-week period. The healthfulness of individual meals was assessed using a classification scheme developed for the present study and based on the Australian Dietary Guidelines. Results: Approximately 27% of meals (n = 168) were photographed. Agreement was high between raters classifying dishes using the scheme, as well as between researchers when coding photographs. The subset of photographs was representative of patterns observed in the entire dining room. Diners chose main dishes in line with the proportions presented, but in opposition to the proportions presented for side dishes. Conclusions: The present study developed a rigorous observational method to investigate food choice behaviour. The comprehensive food classification scheme produced consistent classifications of foods. The photographic data collection method was found to be robust and accurate. Combining the two observation methods allows researchers and/or practitioners to accurately measure and interpret food selections. Consumer insights gained suggest that, in this setting, increasing the availability of green (healthful) offerings for main dishes would assist in improving healthfulness, whereas other strategies (e.g. promotion) may be needed for side dishes. 
So what?: Visual observation methods that accurately measure and interpret food-selection behaviour provide both insight for those developing healthy eating interventions and a means to evaluate the effect of implemented interventions on food selection.
A LEAST ABSOLUTE SHRINKAGE AND SELECTION OPERATOR (LASSO) FOR NONLINEAR SYSTEM IDENTIFICATION
NASA Technical Reports Server (NTRS)
Kukreja, Sunil L.; Lofberg, Johan; Brenner, Martin J.
2006-01-01
Identification of parametric nonlinear models involves estimating unknown parameters and detecting the model's underlying structure. Structure computation is concerned with selecting a subset of parameters to give a parsimonious description of the system, which may afford greater insight into the functionality of the system or a simpler controller design. In this study, a least absolute shrinkage and selection operator (LASSO) technique is investigated for computing efficient model descriptions of nonlinear systems. The LASSO minimises the residual sum of squares with the addition of an l1 penalty term on the parameter vector to the traditional l2 minimisation problem. Its use for structure detection is a natural extension of this constrained minimisation approach to pseudolinear regression problems, since it produces some model parameters that are exactly zero and therefore yields a parsimonious system description. The performance of this LASSO structure detection method was evaluated by using it to estimate the structure of a nonlinear polynomial model. Applicability of the method to more complex systems such as those encountered in aerospace applications was shown by identifying a parsimonious system description of the F/A-18 Active Aeroelastic Wing using flight test data.
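The l1-penalised least-squares criterion described above admits a simple cyclic coordinate-descent solver based on soft-thresholding, which is what produces exactly-zero parameters. The sketch below is a minimal numpy illustration of the LASSO idea, not the structure-detection pipeline used in the study; all names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; returns exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=200):
    """Minimise 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate
    descent (columns of X assumed non-zero)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j excluded
            r = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return b
```

With a sufficiently large penalty `lam`, every coefficient is shrunk to exactly zero, which is the mechanism behind the parsimonious descriptions discussed in the abstract.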
T-ray relevant frequencies for osteosarcoma classification
NASA Astrophysics Data System (ADS)
Withayachumnankul, W.; Ferguson, B.; Rainsford, T.; Findlay, D.; Mickan, S. P.; Abbott, D.
2006-01-01
We investigate the classification of the T-ray response of normal human bone cells and human osteosarcoma cells, grown in culture. Given the magnitude and phase responses within a reliable spectral range as features for input vectors, a trained support vector machine can correctly classify the two cell types to some extent. Performance of the support vector machine is degraded by the curse of dimensionality, resulting from the comparatively large number of features in the input vectors. Feature subset selection methods are used to select only an optimal number of relevant features as inputs. As a result, an improvement in generalization performance is attainable, and the selected frequencies can be used to further describe the different mechanisms by which the cells respond to T-rays. We demonstrate a consistent classification accuracy of 89.6% while only one fifth of the original features is retained in the data set.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Palmintier, Bryan S; Bugbee, Bruce; Gotseff, Peter
Capturing technical and economic impacts of solar photovoltaics (PV) and other distributed energy resources (DERs) on electric distribution systems can require high-time-resolution (e.g. 1 minute), long-duration (e.g. 1 year) simulations. However, such simulations can be computationally prohibitive, particularly when including complex control schemes in quasi-steady-state time series (QSTS) simulation. Various approaches have been used in the literature to down-select representative time segments (e.g. days), but these are typically best suited to lower time resolutions or consider only a single data stream (e.g. PV production) for selection. We present a statistical approach that combines stratified sampling and bootstrapping to select representative days while also providing a simple method to reassemble annual results. We describe the approach in the context of a recent study with a utility partner. This approach enables much faster QSTS analysis by simulating only a subset of days, while maintaining accurate annual estimates.
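A minimal sketch of the stratified-sampling idea: bin days by a single daily summary statistic (e.g. daily PV energy), sample a few days per stratum, and reassemble an annual estimate by weighting each sampled day by its stratum size. The single data stream and all names here are assumptions for illustration; the study's actual procedure also uses bootstrapping for uncertainty estimates, which is omitted.

```python
import random

def select_representative_days(daily_values, n_strata=4, per_stratum=2, seed=1):
    """Sort days by their summary value, split into contiguous strata,
    and sample per_stratum days from each. Returns (sampled_days,
    stratum_size) pairs so annual results can be reassembled later."""
    rng = random.Random(seed)
    days = sorted(range(len(daily_values)), key=lambda d: daily_values[d])
    size = len(days) // n_strata
    strata = [days[i * size:(i + 1) * size] for i in range(n_strata - 1)]
    strata.append(days[(n_strata - 1) * size:])  # last stratum takes the remainder
    return [(rng.sample(s, min(per_stratum, len(s))), len(s)) for s in strata]

def annual_estimate(daily_values, picks):
    """Weight each stratum's sampled-day mean by the stratum size."""
    total = 0.0
    for sampled, stratum_size in picks:
        mean = sum(daily_values[d] for d in sampled) / len(sampled)
        total += mean * stratum_size
    return total
```

Only the sampled days would be run through the full QSTS simulation; the stratum weights then scale those results back up to an annual figure.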
SKOLAR MD: A Model for Self-Directed, In-Context Continuing Medical Education
Strasberg, Howard R.; Rindfleisch, Thomas C.; Hardy, Steven
2003-01-01
INTRODUCTION SKOLAR has implemented a web-based CME program with which physicians can earn AMA Category 1 credit for self-directed learning. METHODS Physicians researched their questions in SKOLAR and applied for CME. Physician auditors reviewed all requests across two phases of the project. A selection rule set was derived from phase one and used in phase two to flag a subset of requests for detailed review. The selection rule set is described. RESULTS In phase one, SKOLAR received 1039 CME applications. Applicants frequently found their answer (94%) and would apply it clinically (93%). A linear regression analysis comparing time awarded to time requested (capped at actual time spent) had R2=0.79. DISCUSSION We believe that this self-directed approach to CME is effective and an important complement to traditional CME programs. However, selective audit of self-directed CME requests is necessary to ensure the validity of credits awarded. PMID:14728250
Hybrid feature selection for supporting lightweight intrusion detection systems
NASA Astrophysics Data System (ADS)
Song, Jianglong; Zhao, Wentao; Liu, Qiang; Wang, Xin
2017-08-01
Redundant and irrelevant features not only cause high resource consumption but also degrade the performance of Intrusion Detection Systems (IDS), especially when coping with big data. These features slow down the process of training and testing in network traffic classification. Therefore, a hybrid feature selection approach combining wrapper and filter selection is designed in this paper to build a lightweight intrusion detection system. Two main phases are involved in this method. The first phase conducts a preliminary search for an optimal subset of features, in which chi-square feature selection is utilized. The set of features selected in the previous phase is further refined in the second phase in a wrapper manner, in which a Random Forest (RF) is used to guide the selection process and retain an optimized set of features. After that, we build an RF-based detection model and make a fair comparison with other approaches. The experimental results on NSL-KDD datasets show that our approach results in higher detection accuracy as well as faster training and testing processes.
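The filter phase described above can be sketched as a chi-square ranking of categorical (or binned) features against the class label; the wrapper phase would then refine the retained subset with a Random Forest. The sketch covers only the filter step, with hypothetical names:

```python
from collections import Counter

def chi_square(feature, labels):
    """Chi-square statistic of a categorical feature against class labels,
    computed from the observed vs. expected contingency-table counts."""
    n = len(labels)
    f_counts = Counter(feature)
    c_counts = Counter(labels)
    joint = Counter(zip(feature, labels))
    stat = 0.0
    for f in f_counts:
        for c in c_counts:
            expected = f_counts[f] * c_counts[c] / n
            stat += (joint[(f, c)] - expected) ** 2 / expected
    return stat

def filter_phase(X, y, k):
    """Keep the k features with the highest chi-square score; a wrapper
    (e.g. Random Forest) would then further refine this subset."""
    scores = [chi_square([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
```

A score of 0 means the feature is independent of the class and can be dropped cheaply before the (much more expensive) wrapper search runs.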
Characterization of Plasma Membrane Proteins from Ovarian Cancer Cells Using Mass Spectrometry
Springer, David L.; Auberry, Deanna L.; Ahram, Mamoun; ...
2004-01-01
To determine how the repertoire of plasma membrane proteins changes with disease state, specifically related to cancer, several methods for the preparation of plasma membrane proteins were evaluated. Cultured cells derived from stage IV ovarian tumors were grown to 90% confluence and harvested in buffer containing CHAPS detergent. This preparation was centrifuged at low speed to remove insoluble cellular debris, resulting in a crude homogenate. Glycosylated proteins in the crude homogenate were selectively enriched using lectin affinity chromatography. The crude homogenate and the lectin-purified sample were prepared for mass spectrometric evaluation. The general procedure for protein identification began with trypsin digestion of protein fractions, followed by separation by reversed-phase liquid chromatography coupled directly to a conventional tandem mass spectrometer (i.e. LCQ ion trap). Mass and fragmentation data for the peptides were searched against a human proteome database using the informatics program SEQUEST. Using this procedure, 398 proteins were identified with high confidence, including receptors, membrane-associated ligands, proteases, phosphatases, as well as structural and adhesion proteins. Results indicate that lectin chromatography provides a select subset of proteins and that the number, quality, and confidence of the protein identifications improve for this subset. These results represent the first step in the development of methods to separate and successfully identify plasma membrane proteins from advanced ovarian cancer cells. Further characterization of plasma membrane proteins will contribute to our understanding of the mechanisms underlying the progression of this deadly disease and may lead to new targeted interventions as well as new biomarkers for diagnosis.
Storage and retrieval of large digital images
Bradley, J.N.
1998-01-20
Image compression and viewing are implemented with (1) a method for performing DWT-based compression on a large digital image with a computer system possessing a two-level system of memory and (2) a method for selectively viewing areas of the image from its compressed representation at multiple resolutions and, if desired, in a client-server environment. The compression of a large digital image I(x,y) is accomplished by first defining a plurality of discrete tile image data subsets T_ij(x,y) that, upon superposition, form the complete set of image data I(x,y). A seamless wavelet-based compression process is effected on I(x,y) that is comprised of successively inputting the tiles T_ij(x,y) in a selected sequence to a DWT routine, and storing the resulting DWT coefficients in a first primary memory. These coefficients are periodically compressed and transferred to a secondary memory to maintain sufficient memory in the primary memory for data processing. The sequence of DWT operations on the tiles T_ij(x,y) effectively calculates a seamless DWT of I(x,y). Data retrieval consists of specifying a resolution and a region of I(x,y) for display. The subset of stored DWT coefficients corresponding to each requested scene is determined and then decompressed for input to an inverse DWT, the output of which forms the image display. The repeated process whereby image views are specified may take the form of an interaction with a computer pointing device on an image display from a previous retrieval.
M&A For Lithography Of Sparse Arrays Of Sub-Micrometer Features
Brueck, Steven R.J.; Chen, Xiaolan; Zaidi, Saleem; Devine, Daniel J.
1998-06-02
Methods and apparatuses are disclosed for the exposure of sparse hole and/or mesa arrays with line:space ratios of 1:3 or greater and sub-micrometer hole and/or mesa diameters in a layer of photosensitive material atop a layered material. Methods disclosed include: double exposure interferometric lithography pairs in which only those areas near the overlapping maxima of each single-period exposure pair receive a clearing exposure dose; double interferometric lithography exposure pairs with additional processing steps to transfer the array from a first single-period interferometric lithography exposure pair into an intermediate mask layer and a second single-period interferometric lithography exposure to further select a subset of the first array of holes; a double exposure of a single period interferometric lithography exposure pair to define a dense array of sub-micrometer holes and an optical lithography exposure in which only those holes near maxima of both exposures receive a clearing exposure dose; combination of a single-period interferometric exposure pair, processing to transfer resulting dense array of sub-micrometer holes into an intermediate etch mask, and an optical lithography exposure to select a subset of initial array to form a sparse array; combination of an optical exposure, transfer of exposure pattern into an intermediate mask layer, and a single-period interferometric lithography exposure pair; three-beam interferometric exposure pairs to form sparse arrays of sub-micrometer holes; five- and four-beam interferometric exposures to form a sparse array of sub-micrometer holes in a single exposure. Apparatuses disclosed include arrangements for the three-beam, five-beam and four-beam interferometric exposures.
Storage and retrieval of large digital images
Bradley, Jonathan N.
1998-01-01
Image compression and viewing are implemented with (1) a method for performing DWT-based compression on a large digital image with a computer system possessing a two-level system of memory and (2) a method for selectively viewing areas of the image from its compressed representation at multiple resolutions and, if desired, in a client-server environment. The compression of a large digital image I(x,y) is accomplished by first defining a plurality of discrete tile image data subsets T_ij(x,y) that, upon superposition, form the complete set of image data I(x,y). A seamless wavelet-based compression process is effected on I(x,y) that is comprised of successively inputting the tiles T_ij(x,y) in a selected sequence to a DWT routine, and storing the resulting DWT coefficients in a first primary memory. These coefficients are periodically compressed and transferred to a secondary memory to maintain sufficient memory in the primary memory for data processing. The sequence of DWT operations on the tiles T_ij(x,y) effectively calculates a seamless DWT of I(x,y). Data retrieval consists of specifying a resolution and a region of I(x,y) for display. The subset of stored DWT coefficients corresponding to each requested scene is determined and then decompressed for input to an inverse DWT, the output of which forms the image display. The repeated process whereby image views are specified may take the form of an interaction with a computer pointing device on an image display from a previous retrieval.
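Why a tile-sequenced DWT can be seamless is easiest to see with the Haar wavelet: its length-2 filters never straddle a tile boundary when tile dimensions are even, so independent per-tile transforms assemble into exactly the full-image transform. The sketch below illustrates only that special case; the patent's method handles general wavelets, coefficient compression, and the two-level memory system, none of which is shown, and all names are hypothetical.

```python
import numpy as np

def haar2d(block):
    """One-level 2-D Haar transform: returns (LL, LH, HL, HH) subbands."""
    a = (block[0::2, :] + block[1::2, :]) / 2.0   # vertical average
    d = (block[0::2, :] - block[1::2, :]) / 2.0   # vertical difference
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH

def tiled_LL(image, tile):
    """LL subband assembled from independent per-tile transforms; for
    even tile sizes this equals the LL subband of the full image."""
    h, w = image.shape
    rows = []
    for i in range(0, h, tile):
        row = [haar2d(image[i:i + tile, j:j + tile])[0] for j in range(0, w, tile)]
        rows.append(np.hstack(row))
    return np.vstack(rows)
```

For longer wavelet filters the tiles overlap at their boundaries, which is why the patent sequences the tiles through the DWT routine rather than transforming them in isolation.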
Nanolaminate microfluidic device for mobility selection of particles
Surh, Michael P [Livermore, CA; Wilson, William D [Pleasanton, CA; Barbee, Jr., Troy W.; Lane, Stephen M [Oakland, CA
2006-10-10
A microfluidic device made from nanolaminate materials that is capable of electrophoretic selection of particles on the basis of their mobility. Nanolaminate materials are generally alternating layers of two materials (one conducting, one insulating) that are made by sputter-coating a flat substrate with a large number of layers. Specific subsets of the conducting layers are coupled together to form a single, extended electrode, interleaved with other similar electrodes. Thereby, the subsets of conducting layers may be dynamically charged to create time-dependent potential fields that can trap or transport charged colloidal particles. The addition of time-dependence is applicable to all geometries of nanolaminate electrophoretic and electrochemical designs, from sinusoidal to nearly step-like.
Kader, Muhamuda; Bixler, Sandra; Roederer, Mario; Veazey, Ronald; Mattapallil, Joseph J.
2009-01-01
Background CD4 T cell depletion in the mucosa has been well documented during acute HIV and SIV infections. The demonstration that HIV/SIV can use the α4β7 receptor for viral entry suggests that these viruses selectively target CD4 T cells in the mucosa that express high levels of the α4β7 receptor. Methods Mucosal samples obtained from SIV-infected rhesus macaques during the early phase of infection were used for immunophenotypic analysis. CD4 T cell subsets were sorted based on the expression of β7 and CD95 to quantify the level of SIV infection in different subsets of CD4 T cells. Changes in IL-17, IL-21, IL-23 and TGFβ mRNA expression were determined using Taqman PCR. Results CD4 T cells in the mucosa were found to harbor two major populations of cells; ~25% of CD4 T cells expressed the α4+β7hi phenotype, whereas the remaining 75% expressed an α4+β7int phenotype. Both subsets were predominantly CD28+ Ki-67− HLA-DR− but CD69+, and expressed detectable levels of CCR5 on their surface. Interestingly, however, α4+β7hi CD4 T cells were found to harbor more SIV than the α4+β7int subsets at day 10 pi. Early infection was associated with a dramatic increase in the expression of IL-17, and of the IL-17-promoting cytokines IL-21, IL-23, and TGFβ, that stayed high even after the loss of mucosal CD4 T cells. Conclusions Our results suggest that the differential expression of the α4β7 receptor contributes to the differences in the extent of infection in CD4 T cell subsets in the mucosa. The dysregulation of the IL-17 network in mucosal tissues associated with early infection involves other non-Th-17 cells that likely contribute to the pro-inflammatory environment in the mucosa during acute stages of SIV infection. PMID:19863675
2015-01-01
Background Investigations into novel biomarkers using omics techniques generate large amounts of data. Due to their size and numbers of attributes, these data are suitable for analysis with machine learning methods. A key component of typical machine learning pipelines for omics data is feature selection, which is used to reduce the raw high-dimensional data into a tractable number of features. Feature selection needs to balance the objective of using as few features as possible, while maintaining high predictive power. This balance is crucial when the goal of data analysis is the identification of highly accurate but small panels of biomarkers with potential clinical utility. In this paper we propose a heuristic for the selection of very small feature subsets, via an iterative feature elimination process that is guided by rule-based machine learning, called RGIFE (Rule-guided Iterative Feature Elimination). We use this heuristic to identify putative biomarkers of osteoarthritis (OA), articular cartilage degradation and synovial inflammation, using both proteomic and transcriptomic datasets. Results and discussion Our RGIFE heuristic increased the classification accuracies achieved for all datasets relative to using no feature selection, and performed well in a comparison with other feature selection methods. Using this method the datasets were reduced to a smaller number of genes or proteins, including those known to be relevant to OA, cartilage degradation and joint inflammation. The results have shown the RGIFE feature reduction method to be suitable for analysing both proteomic and transcriptomic data. Methods that generate large 'omics' datasets are increasingly being used in the area of rheumatology. Conclusions Feature reduction methods are advantageous for the analysis of omics data in the field of rheumatology, as the application of such techniques is likely to result in improvements in diagnosis, treatment and drug discovery. PMID:25923811
Eradication of melanomas by targeted elimination of a minor subset of tumor cells
Schmidt, Patrick; Kopecky, Caroline; Hombach, Andreas; Zigrino, Paola; Mauch, Cornelia; Abken, Hinrich
2011-01-01
Proceeding on the assumption that all cancer cells have equal malignant capacities, current regimens in cancer therapy attempt to eradicate all malignant cells of a tumor lesion. Using in vivo targeting of tumor cell subsets, we demonstrate that selective elimination of a definite, minor tumor cell subpopulation is particularly effective in eradicating established melanoma lesions irrespective of the bulk of cancer cells. Tumor cell subsets were specifically eliminated in a tumor lesion by adoptive transfer of engineered cytotoxic T cells redirected in an antigen-restricted manner via a chimeric antigen receptor. Targeted elimination of less than 2% of the tumor cells, namely those that coexpress high molecular weight melanoma-associated antigen (HMW-MAA) (melanoma-associated chondroitin sulfate proteoglycan, MCSP) and CD20, lastingly eradicated melanoma lesions, whereas targeting of any random 10% tumor cell subset was not effective. Our data challenge current paradigms in biological therapy and drug development for the treatment of cancer. PMID:21282657
OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets.
García-Pedrajas, Nicolás; Perez-Rodríguez, Javier; de Haro-García, Aida
2013-02-01
In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
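The divide-and-conquer step described above can be sketched as partitioning the majority class into minority-sized chunks, each paired with the full minority class, so that an instance-selection routine runs on balanced subsets in linear total time. This is a hypothetical sketch of the partitioning idea only, not the authors' OligoIS implementation:

```python
import random

def balanced_subsets(majority_idx, minority_idx, seed=0):
    """Partition the majority class into chunks of the minority-class size,
    pairing each chunk with the full minority class. An instance-selection
    routine can then run on each balanced subset independently (and in
    parallel), without loading the whole data set at once."""
    rng = random.Random(seed)
    shuffled = majority_idx[:]
    rng.shuffle(shuffled)
    k = len(minority_idx)
    return [shuffled[i:i + k] + minority_idx for i in range(0, len(shuffled), k)]
```

Because each majority instance appears in exactly one subset, the total work across subsets grows linearly with the number of instances, which is the scalability property the abstract emphasizes.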
Fink, Herbert; Panne, Ulrich; Niessner, Reinhard
2002-09-01
An experimental setup for direct elemental analysis of recycled thermoplasts from consumer electronics by laser-induced plasma spectroscopy (LIPS, or laser-induced breakdown spectroscopy, LIBS) was realized. The combination of an echelle spectrograph, featuring high resolution with broad spectral coverage, with multivariate methods, such as PLS, PCR, and variable subset selection via a genetic algorithm, resulted in considerable improvements in selectivity and sensitivity for this complex matrix. With normalization to carbon as an internal standard, the limits of detection were in the ppm range. A preliminary pattern recognition study points to the possibility of polymer recognition via the line-rich echelle spectra. Several experiments at an extruder within a recycling plant successfully demonstrated the capability of LIPS for different kinds of routine on-line process analysis.
Dhaliwal, J S; Malar, B; Quck, C K; Sukumaran, K D; Hassan, K
1991-06-01
Immunoperoxidase staining was compared with flow cytometry for the enumeration of lymphocyte subsets. The percentages obtained for peripheral blood lymphocytes using immunoperoxidase (CD3 = 76, CD4 = 27.9, B = 10.7, CD4/CD8 = 1.8) differed significantly from those obtained by flow cytometry (CD3 = 65.7, CD4 = 39.4, CD8 = 25.6, B = 16.7, HLA-DR = 11.9, CD4/CD8 = 1.54) for certain subsets (CD3, CD4, B). There was no significant difference in lymphocyte subsets between children and adults using the same method. These differences are probably due to the different methods used to prepare lymphocytes for analysis. Other factors that should also be considered are the presence of the CD4 antigen on monocytes and of CD8 on natural killer cells.
Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K
2015-06-04
Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of the faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets, as it produces focused clusters that conform well to known biological facts.
Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.
Minimal ensemble based on subset selection using ECG to diagnose categories of CAN.
Abawajy, Jemal; Kelarev, Andrei; Yi, Xun; Jelinek, Herbert F
2018-07-01
Early diagnosis of cardiac autonomic neuropathy (CAN) is critical for reversing or slowing its progression and preventing complications. Diagnostic accuracy or precision is one of the core requirements of CAN detection. As the standard Ewing battery tests suffer from a number of shortcomings, research in automating and improving the early detection of CAN has recently received serious attention, both in identifying additional clinical variables and in designing advanced ensembles of classifiers to improve the accuracy or precision of CAN diagnostics. Although large ensembles are commonly proposed for the automated diagnosis of CAN, they are characterized by slow processing speed and computational complexity. This paper applies ECG features and proposes a new ensemble-based approach for the diagnosis of CAN progression. We introduce a Minimal Ensemble Based On Subset Selection (MEBOSS) for the diagnosis of all categories of CAN, including early, definite and atypical CAN. MEBOSS is based on a novel multi-tier architecture applying classifier subset selection as well as training subset selection during several steps of its operation. Our experiments determined the diagnostic accuracy or precision obtained in 5 × 2 cross-validation for various options employed in MEBOSS and other classification systems. The experiments demonstrate the operation of the MEBOSS procedure invoking the most effective classifiers available in the open-source software environment SageMath. The results of our experiments show that, for the large DiabHealth database of CAN-related parameters, MEBOSS outperformed other classification systems available in SageMath and achieved 94% to 97% precision in 5 × 2 cross-validation, correctly distinguishing any two CAN categories among up to five categorizations, including control, early, definite, severe and atypical CAN.
These results show that MEBOSS architecture is effective and can be recommended for practical implementations in systems for the diagnosis of CAN progression. Copyright © 2018 Elsevier B.V. All rights reserved.
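The 5 × 2 cross-validation protocol used to evaluate MEBOSS can be sketched as follows. This is a generic illustration of the Dietterich-style split scheme (five repetitions of a random 2-fold split), not the MEBOSS ensemble itself, and the sample count of 100 is an arbitrary stand-in:

```python
import random

def five_by_two_cv_splits(n_samples, seed=0):
    """Generate the ten train/test index pairs of 5x2 cross-validation:
    five repetitions of a random 2-fold split, where each fold serves
    once as the training set and once as the test set."""
    rng = random.Random(seed)
    splits = []
    for _ in range(5):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        half = n_samples // 2
        fold_a, fold_b = idx[:half], idx[half:]
        splits.append((fold_a, fold_b))  # train on A, test on B
        splits.append((fold_b, fold_a))  # train on B, test on A
    return splits

splits = five_by_two_cv_splits(100)
print(len(splits))  # 10 train/test pairs
```

Each classifier under comparison is then fitted on the training half and scored on the test half of every pair, and the ten scores are averaged.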
Non-specific filtering of beta-distributed data.
Wang, Xinhui; Laird, Peter W; Hinoue, Toshinori; Groshen, Susan; Siegmund, Kimberly D
2014-06-19
Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias. We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets. 
We found two different filter statistics that tended to prioritize features with different characteristics; each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data sets, evaluating the overlap of features selected and clusters discovered.
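The classical arcsine square-root transform is one common variance-stabilizing choice for proportion (Beta-like) data such as methylation beta values; a minimal sketch of a filter built on it follows. The paper's exact filter statistic may differ, so treat this as illustrative only:

```python
import math
from statistics import pstdev

def arcsine_sqrt(p):
    """Classical variance-stabilizing transform for proportions
    (Beta-like data bounded in [0, 1])."""
    return math.asin(math.sqrt(p))

def filter_by_transformed_sd(beta_matrix, k):
    """Rank features (rows of methylation proportions across samples)
    by the standard deviation of their transformed values and keep
    the indices of the top k most variable features."""
    scored = [(pstdev([arcsine_sqrt(p) for p in row]), i)
              for i, row in enumerate(beta_matrix)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

Unlike filtering on the raw standard deviation, the transform removes the mean-variance coupling, so probes with means near 0.5 are no longer systematically favoured.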
Chemical library subset selection algorithms: a unified derivation using spatial statistics.
Hamprecht, Fred A; Thiel, Walter; van Gunsteren, Wilfred F
2002-01-01
If similar compounds have similar activity, rational subset selection becomes superior to random selection in screening for pharmacological lead discovery programs. Traditional approaches to this experimental design problem fall into two classes: (i) a linear or quadratic response function is assumed; (ii) some space-filling criterion is optimized. The assumptions underlying the first approach are clear but not always defensible; the second approach yields more intuitive designs but lacks a clear theoretical foundation. We model activity in a bioassay as the realization of a stochastic process and use the best linear unbiased estimator to construct spatial sampling designs that optimize the integrated mean square prediction error, the maximum mean square prediction error, or the entropy. We argue that our approach constitutes a unifying framework encompassing most proposed techniques as limiting cases and sheds light on their underlying assumptions. In particular, vector quantization is obtained, in dimensions up to eight, in the limiting case of very smooth response surfaces for the integrated mean square error criterion. Closest packing is obtained for very rough surfaces under the integrated mean square error and entropy criteria. We suggest using either the integrated mean square prediction error or the entropy as optimization criteria rather than approximations thereof, and propose a scheme for direct iterative minimization of the integrated mean square prediction error. Finally, we discuss how the quality of chemical descriptors manifests itself and clarify the assumptions underlying the selection of diverse or representative subsets.
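A greedy maximin heuristic is one simple way to realize the space-filling (closest-packing-style) designs discussed above; the seeding rule and the Euclidean distance below are illustrative choices, not the authors' exact procedure:

```python
def greedy_maximin(points, k, dist):
    """Greedily build a space-filling subset of k points: repeatedly
    add the candidate whose minimum distance to the already-selected
    points is largest."""
    selected = [0]  # arbitrary seed: the first compound in the library
    while len(selected) < k:
        best, best_score = None, -1.0
        for i in range(len(points)):
            if i in selected:
                continue
            score = min(dist(points[i], points[j]) for j in selected)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Euclidean distance in descriptor space
euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

In the paper's framework such designs emerge as the rough-response-surface limit of minimizing the integrated mean square prediction error.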
Niu, Xiaoping; Qi, Jianmin; Zhang, Gaoyang; Xu, Jiantang; Tao, Aifen; Fang, Pingping; Su, Jianguang
2015-01-01
To accurately measure gene expression using quantitative reverse transcription PCR (qRT-PCR), reliable reference gene(s) are required for data normalization. Corchorus capsularis, an annual herbaceous fiber crop with predominant biodegradability and renewability, has not been investigated for the stability of reference genes with qRT-PCR. In this study, 11 candidate reference genes were selected and their expression levels were assessed using qRT-PCR. To account for the influence of experimental approach and tissue type, 22 different jute samples were selected from abiotic and biotic stress conditions as well as three different tissue types. The stability of the candidate reference genes was evaluated using the geNorm, NormFinder, and BestKeeper programs, and comprehensive rankings of gene stability were generated by aggregate analysis. For the biotic stress and NaCl stress subsets, ACT7 and RAN were suitable as stable reference genes for gene expression normalization. For the PEG stress subset, UBC and DnaJ were sufficient for accurate normalization. For the tissues subset, four reference genes TUBβ, UBI, EF1α, and RAN were sufficient for accurate normalization. The selected genes were further validated by comparing expression profiles of WRKY15 in various samples, and two stable reference genes were recommended for accurate normalization of qRT-PCR data. Our results provide researchers with appropriate reference genes for qRT-PCR in C. capsularis, and will facilitate gene expression studies under these conditions. PMID:26528312
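The geNorm stability measure used above is well documented: each candidate gene's M value is the average standard deviation of its pairwise log2 expression ratios with every other candidate across samples, with lower M indicating greater stability. A minimal sketch (omitting geNorm's iterative elimination of the least stable gene):

```python
import math
from statistics import pstdev

def genorm_m_values(expr):
    """geNorm-style stability measure.  expr maps gene name -> list of
    (linear-scale) expression values, one per sample.  For each gene,
    M is the average standard deviation of its log2 expression ratio
    with every other candidate gene (lower = more stable)."""
    genes = list(expr)
    m = {}
    for g in genes:
        sds = []
        for h in genes:
            if h == g:
                continue
            ratios = [math.log2(a / b) for a, b in zip(expr[g], expr[h])]
            sds.append(pstdev(ratios))
        m[g] = sum(sds) / len(sds)
    return m
```

A gene whose expression tracks the other candidates proportionally (constant ratio across samples) gets a low M and is a good normalization reference.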
Wang, Xiaorong; Kang, Yu; Luo, Chunxiong; Zhao, Tong; Liu, Lin; Jiang, Xiangdan; Fu, Rongrong; An, Shuchang; Chen, Jichao; Jiang, Ning; Ren, Lufeng; Wang, Qi; Baillie, J Kenneth; Gao, Zhancheng; Yu, Jun
2014-02-11
Heteroresistance refers to phenotypic heterogeneity of microbial clonal populations under antibiotic stress, and it has been thought to be an allocation of a subset of "resistant" cells for surviving in higher concentrations of antibiotic. The assumption fits the so-called bet-hedging strategy, where a bacterial population "hedges" its "bet" on different phenotypes to be selected by unpredicted environment stresses. To test this hypothesis, we constructed a heteroresistance model by introducing a blaCTX-M-14 gene (coding for a cephalosporin hydrolase) into a sensitive Escherichia coli strain. We confirmed heteroresistance in this clone and that a subset of the cells expressed more hydrolase and formed more colonies in the presence of ceftriaxone (exhibited stronger "resistance"). However, subsequent single-cell-level investigation by using a microfluidic device showed that a subset of cells with a distinguishable phenotype of slowed growth and intensified hydrolase expression emerged, and they were not positively selected but increased their proportion in the population with ascending antibiotic concentrations. Therefore, heteroresistance--the gradually decreased colony-forming capability in the presence of antibiotic--was a result of a decreased growth rate rather than of selection for resistant cells. Using a mock strain without the resistance gene, we further demonstrated the existence of two nested growth-centric feedback loops that control the expression of the hydrolase and maximize population growth in various antibiotic concentrations. In conclusion, phenotypic heterogeneity is a population-based strategy beneficial for bacterial survival and propagation through task allocation and interphenotypic collaboration, and the growth rate provides a critical control for the expression of stress-related genes and an essential mechanism in responding to environmental stresses. 
Heteroresistance is essentially phenotypic heterogeneity, where a population-based strategy is thought to be at work, with variable cell-to-cell resistance assumed to be selected under antibiotic stress. Exact mechanisms of heteroresistance and its roles in adaptation to antibiotic stress have yet to be fully understood at the molecular and single-cell levels. In our study, we have not been able to detect any apparent subset of "resistant" cells selected by antibiotics; on the contrary, cell populations differentiate into phenotypic subsets with variable growth statuses and hydrolase expression. The growth rate appears to be sensitive to stress intensity and plays a key role in controlling hydrolase expression at both the bulk-population and single-cell levels. We have shown here, for the first time, that phenotypic heterogeneity can be beneficial to a growing bacterial population through task allocation and interphenotypic collaboration rather than through partitioning cells into different categories of selective advantage.
2008-06-01
… capacity planning; • electrical generation capacity planning; • machine scheduling; • freight scheduling; • dairy farm expansion planning … Decision Support Systems and Multi-Criteria Decision Analysis Products, A.2.11.2.2.1 ELECTRE IS: ELECTRE IS is a generalization of ELECTRE I. It is a … criteria; ELECTRE IS supports the user in the process of selecting one alternative or a subset of alternatives. The method consists of two parts …
Communication methods, systems, apparatus, and devices involving RF tag registration
Burghard, Brion J [W. Richland, WA; Skorpik, James R [Kennewick, WA
2008-04-22
One technique of the present invention includes a number of Radio Frequency (RF) tags that each have a different identifier. Information is broadcast to the tags from an RF tag interrogator. This information corresponds to a maximum quantity of tag response time slots that are available. This maximum quantity may be less than the total number of tags. The tags each select one of the time slots as a function of the information and a random number provided by each respective tag. The different identifiers are transmitted to the interrogator from at least a subset of the RF tags.
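The slot-selection step described above resembles slotted-ALOHA-style anticollision; a hedged sketch follows, in which the per-tag seeding is a hypothetical illustrative choice and not the patented scheme:

```python
import random

def pick_slot(tag_id, max_slots, seed=None):
    """Each tag combines the interrogator's broadcast slot count with
    its own random draw to choose a response time slot.  Seeding the
    generator from the tag identifier here is only for reproducibility
    in this sketch."""
    rng = random.Random(seed if seed is not None else tag_id)
    return rng.randrange(max_slots)

def assign_slots(tag_ids, max_slots):
    """Map every tag to its chosen slot; collisions (two tags picking
    the same slot) are resolved by the interrogator in later rounds."""
    return {tag: pick_slot(tag, max_slots) for tag in tag_ids}
```

Because the slot count may be smaller than the tag population, some collisions are expected and handled by repeated interrogation rounds.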
2011-06-01
… implementing, and evaluating many feature selection algorithms. Mucciardi and Gose compared seven different techniques for choosing subsets of pattern … [1] A. Mucciardi and E. Gose, "A comparison of seven techniques for …
Wong, J T; Pinto, C E; Gifford, J D; Kurnick, J T; Kradin, R L
1989-11-15
To study the CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) in the antitumor response, we propagated these subsets directly from tumor tissues with anti-CD3:anti-CD8 (CD3,8) and anti-CD3:anti-CD4 (CD3,4) bispecific mAb (BSMAB). CD3,8 BSMAB cause selective cytolysis of CD8+ lymphocytes by bridging the CD8 molecules of target lymphocytes to the CD3 molecular complex of cytolytic T lymphocytes, with concurrent activation and proliferation of residual CD3+CD4+ T lymphocytes. Similarly, CD3,4 BSMAB cause selective lysis of CD4+ lymphocytes while concurrently activating the residual CD3+CD8+ T cells. Small tumor fragments from four malignant melanoma and three renal cell carcinoma patients were cultured in medium containing CD3,8 + IL-2, CD3,4 + IL-2, or IL-2 alone. CD3,8 led to selective propagation of the CD4+ TIL whereas CD3,4 led to selective propagation of the CD8+ TIL from each of the tumors. The phenotypes of the TIL subset cultures were generally stable when assayed over a 1- to 3-month period and after further expansion with anti-CD3 mAb or lectins. Specific 51Cr release of labeled target cells that were bridged to the CD3 molecular complexes of TIL suggested that both CD4+ and CD8+ TIL cultures have the capacity to mediate cytolysis via their Ti/CD3 TCR complexes. In addition, both CD4+ and CD8+ TIL cultures from most patients caused substantial (greater than 20%) lysis of the NK-sensitive K562 cell line. The majority of CD4+ but not CD8+ TIL cultures also produced substantial lysis of the NK-resistant Daudi cell line. Lysis of the autologous tumor by the TIL subsets was assessed in two patients with malignant melanoma. The CD8+ TIL from one tumor demonstrated cytotoxic activity against the autologous tumor but negligible lysis of allogeneic melanoma targets. In conclusion, immunocompetent CD4+ and CD8+ TIL subsets can be isolated and expanded directly from small tumor fragments of malignant melanoma and renal cell carcinoma using BSMAB.
The resultant TIL subsets can be further expanded for detailed studies or for adoptive immunotherapy.
Leger, Stefan; Zwanenburg, Alex; Pilz, Karoline; Lohaus, Fabian; Linge, Annett; Zöphel, Klaus; Kotzerke, Jörg; Schreiber, Andreas; Tinhofer, Inge; Budach, Volker; Sak, Ali; Stuschke, Martin; Balermpas, Panagiotis; Rödel, Claus; Ganswindt, Ute; Belka, Claus; Pigorsch, Steffi; Combs, Stephanie E; Mönnich, David; Zips, Daniel; Krause, Mechthild; Baumann, Michael; Troost, Esther G C; Löck, Steffen; Richter, Christian
2017-10-16
Radiomics applies machine learning algorithms to quantitative imaging data to characterise the tumour phenotype and predict clinical outcome. For the development of radiomics risk models, a variety of different algorithms is available and it is not clear which one gives optimal results. Therefore, we assessed the performance of 11 machine learning algorithms combined with 12 feature selection methods by the concordance index (C-Index), to predict loco-regional tumour control (LRC) and overall survival for patients with head and neck squamous cell carcinoma. The considered algorithms are able to deal with continuous time-to-event survival data. Feature selection and model building were performed on a multicentre cohort (213 patients) and validated using an independent cohort (80 patients). We found several combinations of machine learning algorithms and feature selection methods which achieve similar results, e.g. C-Index = 0.71 and BT-COX: C-Index = 0.70 in combination with Spearman feature selection. Using the best performing models, patients were stratified into groups of low and high risk of recurrence. Significant differences in LRC were obtained between both groups on the validation cohort. Based on the presented analysis, we identified a subset of algorithms which should be considered in future radiomics studies to develop stable and clinically relevant predictive models for time-to-event endpoints.
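Harrell's concordance index (C-Index) used to score the radiomics models can be computed directly; a minimal implementation for right-censored survival data (ignoring tied event times) might look like this:

```python
def concordance_index(times, events, risks):
    """Harrell's concordance index for right-censored survival data:
    the fraction of comparable patient pairs in which the model assigns
    the higher risk score to the patient who fails earlier.  A pair is
    comparable when the earlier of the two times is an observed event
    (events[i] == 1), not a censoring time."""
    concordant, ties, comparable = 0, 0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    ties += 1  # tied risk scores count half
    return (concordant + 0.5 * ties) / comparable
```

A C-Index of 0.5 corresponds to random ranking, and values around 0.70 (as reported for the best model combinations) indicate a usefully discriminative risk model.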
Badia, Jordi; Raspopovic, Stanisa; Carpaneto, Jacopo; Micera, Silvestro; Navarro, Xavier
2016-01-01
The selection of suitable peripheral nerve electrodes for biomedical applications implies a trade-off between invasiveness and selectivity. The optimal design should provide the highest selectivity for targeting a large number of nerve fascicles with the least invasiveness and potential damage to the nerve. The transverse intrafascicular multichannel electrode (TIME), transversally inserted in the peripheral nerve, has been shown to be useful for the selective activation of subsets of axons, both at inter- and intra-fascicular levels, in the small sciatic nerve of the rat. In this study we assessed the capabilities of TIME for the selective recording of neural activity, considering the topographical selectivity and the distinction of neural signals corresponding to different sensory types. Topographical recording selectivity was proved by the differential recording of CNAPs from different subsets of nerve fibers, such as those innervating toes 2 and 4 of the hindpaw of the rat. Neural signals elicited by sensory stimuli applied to the rat paw were successfully recorded. Signal processing allowed distinguishing three different types of sensory stimuli such as tactile, proprioceptive and nociceptive ones with high performance. These findings further support the suitability of TIMEs for neuroprosthetic applications, by exploiting the transversal topographical structure of the peripheral nerves.
Systems and methods for data quality control and cleansing
Wenzel, Michael; Boettcher, Andrew; Drees, Kirk; Kummer, James
2016-05-31
A method for detecting and cleansing suspect building automation system data is shown and described. The method includes using processing electronics to automatically determine which of a plurality of error detectors and which of a plurality of data cleansers to use with building automation system data. The method further includes using processing electronics to automatically detect errors in the data and cleanse the data using a subset of the error detectors and a subset of the cleansers.
Better physical activity classification using smartphone acceleration sensor.
Arif, Muhammad; Bilal, Mohsin; Kattan, Ahmed; Ahamed, S Iqbal
2014-09-01
Obesity is becoming one of the most serious problems for the health of the worldwide population. Social interactions on mobile phones and computers via the internet through social e-networks are one of the major causes of lack of physical activity. For the health specialist, it is important to track the record of physical activities of obese or overweight patients to supervise weight loss control. In this study, the acceleration sensor present in the smartphone is used to monitor the physical activity of the user. Physical activities including walking, jogging, sitting, standing, walking upstairs and walking downstairs are classified. Time domain features are extracted from the acceleration data recorded by the smartphone during different physical activities. The time and space complexity of the whole framework is reduced by optimal feature subset selection and pruning of instances. Classification results for the six physical activities are reported in this paper. Using simple time domain features, 99% classification accuracy is achieved. Furthermore, attribute subset selection is used to remove redundant features and to minimize the time complexity of the algorithm. A subset of 30 features produced more than 98% classification accuracy for the six physical activities.
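A few of the simple per-axis time-domain features commonly extracted from windows of accelerometer samples can be sketched as follows; the paper's full feature set is larger than this illustration:

```python
from statistics import mean, pstdev

def time_domain_features(window):
    """Compute basic time-domain features for one axis of a windowed
    acceleration signal (a list of samples).  Real activity-recognition
    pipelines add more, e.g. zero crossings and inter-axis correlation."""
    return {
        "mean": mean(window),
        "std": pstdev(window),
        "min": min(window),
        "max": max(window),
        "range": max(window) - min(window),
    }
```

Feature vectors built this way per axis and per window are what the attribute subset selection step then prunes down (to 30 features in the reported experiments).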
On Some Multiple Decision Problems
1976-08-01
… parameter space. Some recent results in the area of subset selection formulation are Gnanadesikan and Gupta [28], Gupta and Studden [43], Gupta and … York, pp. 363-376. [27] Gnanadesikan, M. (1966). Some Selection and Ranking Procedures for Multivariate Normal Populations. Ph.D. Thesis, Dept. of Statist., Purdue Univ., West Lafayette, Indiana 47907. [28] Gnanadesikan, M. and Gupta, S. S. (1970). Selection procedures for multivariate normal …
USDA-ARS?s Scientific Manuscript database
The phytopathogen Ralstonia solanacearum is a species complex that contains a subset of strains that are quarantined or select agent pathogens. An unidentified R. solanacearum strain is considered a select agent in the US until proven otherwise, which can be done by phylogenetic analysis of a partia...
Supplier Selection Using Weighted Utility Additive Method
NASA Astrophysics Data System (ADS)
Karande, Prasad; Chakraborty, Shankar
2015-10-01
Supplier selection is a multi-criteria decision-making (MCDM) problem which mainly involves evaluating a number of available suppliers according to a set of common criteria for choosing the best one to meet the organizational needs. For any manufacturing or service organization, selecting the right upstream suppliers is a key success factor that will significantly reduce purchasing cost, increase downstream customer satisfaction and improve competitive ability. Past researchers have attempted to solve the supplier selection problem employing different MCDM techniques which involve active participation of the decision makers in the decision-making process. This paper deals with the application of the weighted utility additive (WUTA) method for solving supplier selection problems. The WUTA method, an extension of the utility additive approach, is based on ordinal regression and consists of building a piece-wise linear additive decision model from a preference structure using linear programming (LP). It adopts the preference disaggregation principle and addresses decision-making activities through operational models which need implicit preferences in the form of a preorder of reference alternatives, or a subset of these alternatives, present in the process. The preferential preorder provided by the decision maker is used as a restriction of an LP problem whose objective function is the minimization of the sum of the errors associated with the ranking of each alternative. Based on a given reference ranking of alternatives, one or more additive utility functions are derived. Using these utility functions, the weighted utilities for individual criterion values are combined into an overall weighted utility for a given alternative. It is observed that the WUTA method, having a sound mathematical background, can provide an accurate ranking of the candidate suppliers and choose the best one to fulfill the organizational requirements.
Two real-time examples are illustrated to demonstrate its applicability and appropriateness in solving supplier selection problems.
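The aggregation step of an additive utility model can be sketched as below. Fitting the piecewise-linear marginal utilities from a reference preorder via LP is the core of WUTA and is not shown here, so the breakpoint curves are placeholders:

```python
def marginal_utility(value, breakpoints):
    """Piecewise-linear marginal utility: breakpoints is a list of
    (criterion_value, utility) pairs; intermediate values are linearly
    interpolated, values outside the range are clamped."""
    pts = sorted(breakpoints)
    if value <= pts[0][0]:
        return pts[0][1]
    for (x0, u0), (x1, u1) in zip(pts, pts[1:]):
        if value <= x1:
            return u0 + (u1 - u0) * (value - x0) / (x1 - x0)
    return pts[-1][1]

def weighted_utility(criteria_values, utility_curves, weights):
    """WUTA-style aggregation: combine the weighted marginal utilities
    of a supplier's criterion values into one overall score."""
    return sum(w * marginal_utility(v, curve)
               for v, curve, w in zip(criteria_values, utility_curves, weights))
```

Ranking the candidate suppliers by this overall weighted utility yields the final recommendation.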
Modal-pushover-based ground-motion scaling procedure
Kalkan, Erol; Chopra, Anil K.
2011-01-01
Earthquake engineering is increasingly using nonlinear response history analysis (RHA) to demonstrate the performance of structures. This rigorous method of analysis requires selection and scaling of ground motions appropriate to design hazard levels. This paper presents a modal-pushover-based scaling (MPS) procedure to scale ground motions for use in nonlinear RHA of buildings. In the MPS method, the ground motions are scaled to match, to a specified tolerance, a target value of the inelastic deformation of the first-mode inelastic single-degree-of-freedom (SDF) system whose properties are determined by first-mode pushover analysis. Appropriate for first-mode-dominated structures, this approach is extended to structures with significant contributions of higher modes by considering elastic deformation of second-mode SDF systems in selecting a subset of the scaled ground motions. Based on results presented for three actual buildings (4-, 6-, and 13-story), the accuracy and efficiency of the MPS procedure are established and its superiority over the ASCE/SEI 7-05 scaling procedure is demonstrated.
NASA Astrophysics Data System (ADS)
Wöhling, T.; Schöniger, A.; Geiges, A.; Nowak, W.; Gayler, S.
2013-12-01
The objective selection of appropriate models for realistic simulations of coupled soil-plant processes is a challenging task since the processes are complex, not fully understood at larger scales, and highly non-linear. Also, comprehensive data sets are scarce, and measurements are uncertain. In the past decades, a variety of different models have been developed that exhibit a wide range of complexity regarding their approximation of processes in the coupled model compartments. We present a method for evaluating experimental design for maximum confidence in the model selection task. The method considers uncertainty in parameters, measurements and model structures. Advancing the ideas behind Bayesian Model Averaging (BMA), we analyze the changes in posterior model weights and posterior model choice uncertainty when more data are made available. This allows assessing the power of different data types, data densities and data locations in identifying the best model structure from among a suite of plausible models. The models considered in this study are the crop models CERES, SUCROS, GECROS and SPASS, which are coupled to identical routines for simulating soil processes within the modelling framework Expert-N. The four models considerably differ in the degree of detail at which crop growth and root water uptake are represented. Monte-Carlo simulations were conducted for each of these models considering their uncertainty in soil hydraulic properties and selected crop model parameters. Using a Bootstrap Filter (BF), the models were then conditioned on field measurements of soil moisture, matric potential, leaf-area index, and evapotranspiration rates (from eddy-covariance measurements) during a vegetation period of winter wheat at a field site at the Swabian Alb in Southwestern Germany. Following our new method, we derived model weights when using all data or different subsets thereof. 
We discuss to which degree the posterior mean outperforms the prior mean and all individual posterior models, how informative the data types were for reducing prediction uncertainty of evapotranspiration and deep drainage, and how well the model structure can be identified based on the different data types and subsets. We further analyze the impact of measurement uncertainty and systematic model errors on the effective sample size of the BF and the resulting model weights.
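The Bayesian model averaging update behind the posterior model weights is standard: each model's weight is proportional to its prior probability times its (marginal) likelihood given the conditioning data. A numerically stable sketch using the log-sum-exp trick:

```python
import math

def posterior_model_weights(log_likelihoods, priors=None):
    """BMA update: posterior weight of each model is proportional to
    prior * likelihood.  Works in log space to avoid underflow when
    likelihoods are tiny (as is typical after conditioning on many
    observations).  Defaults to uniform model priors."""
    k = len(log_likelihoods)
    priors = priors or [1.0 / k] * k
    scores = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    m = max(scores)                      # log-sum-exp stabilization
    unnorm = [math.exp(s - m) for s in scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Tracking how these weights sharpen as more data (or different data types) are assimilated is exactly the model-choice-uncertainty analysis described above.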
Eliason, Michele J; Streed, Carl G
2017-10-01
Researchers struggle to find effective ways to measure sexual and gender identities to determine whether there are health differences among subsets of the LGBTQ+ population. This study examines responses on the National Health Interview Survey (NHIS) sexual identity questions among 277 LGBTQ+ healthcare providers. Eighteen percent indicated that their sexual identity was "something else" on the first question, and 57% of those also selected "something else" on the second question. Half of the genderqueer/gender variant participants and 100% of transgender-identified participants selected "something else" as their sexual identity. The NHIS question does not allow all respondents in LGBTQ+ populations to be categorized, thus we are potentially missing vital health disparity information about subsets of the LGBTQ+ population.
Nonparametric Bayesian Dictionary Learning for Analysis of Noisy and Incomplete Images
Zhou, Mingyuan; Chen, Haojun; Paisley, John; Ren, Lu; Li, Lingbo; Xing, Zhengming; Dunson, David; Sapiro, Guillermo; Carin, Lawrence
2013-01-01
Nonparametric Bayesian methods are considered for recovery of imagery based upon compressive, incomplete, and/or noisy measurements. A truncated beta-Bernoulli process is employed to infer an appropriate dictionary for the data under test and also for image recovery. In the context of compressive sensing, significant improvements in image recovery are manifested using learned dictionaries, relative to using standard orthonormal image expansions. The compressive-measurement projections are also optimized for the learned dictionary. Additionally, we consider simpler (incomplete) measurements, defined by measuring a subset of image pixels, uniformly selected at random. Spatial interrelationships within imagery are exploited through use of the Dirichlet and probit stick-breaking processes. Several example results are presented, with comparisons to other methods in the literature. PMID:21693421
NASA Astrophysics Data System (ADS)
Wen, Hongwei; Liu, Yue; Wang, Jieqiong; Zhang, Jishui; Peng, Yun; He, Huiguang
2016-03-01
Tourette syndrome (TS) is a childhood-onset neurobehavioral disorder characterized by the presence of multiple motor and vocal tics. Tic generation has been linked to disturbed networks of brain areas involved in the planning, controlling and execution of action. The aim of our work is to select the topological characteristics of structural networks which are most efficient for estimating classification models to identify early TS children. Here we employed diffusion tensor imaging (DTI) and deterministic tractography to construct the structural networks of 44 TS children and 48 age- and gender-matched healthy children. We calculated four different connection matrices (fiber number, mean FA, averaged fiber length weighted and binary matrices) and then applied graph theoretical methods to extract the regional nodal characteristics of the structural networks. For each weighted or binary network, nodal degree, nodal efficiency and nodal betweenness were selected as features. The Support Vector Machine Recursive Feature Elimination (SVM-RFE) algorithm was used to estimate the best feature subset for classification. An accuracy of 88.26%, evaluated by nested cross-validation, was achieved by combining the best feature subsets of each network characteristic. The identified discriminative brain nodes were mostly located in the basal ganglia and frontal cortico-cortical networks involved in TS children and were associated with tic severity. Our study holds promise for early identification and prognosis prediction in TS children.
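Recursive feature elimination iteratively drops the weakest-scoring feature until the desired subset size remains. True SVM-RFE scores features by the squared weights of a linear SVM; in the sketch below a simple |correlation-with-label| scorer stands in for the SVM, so this is an illustrative simplification of the algorithm's structure:

```python
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation (0.0 if either variable is constant)."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    dx = sum((x - mx) ** 2 for x in xs) ** 0.5
    dy = sum((y - my) ** 2 for y in ys) ** 0.5
    return num / (dx * dy) if dx and dy else 0.0

def rfe(X, y, n_keep):
    """Recursive feature elimination: rescore the surviving features at
    every round and remove the lowest-ranked one.  X is a list of
    sample rows, y the class labels."""
    surviving = list(range(len(X[0])))
    while len(surviving) > n_keep:
        scores = {f: abs(correlation([row[f] for row in X], y))
                  for f in surviving}
        surviving.remove(min(surviving, key=scores.get))
    return surviving
```

Rescoring after every removal is what distinguishes RFE from a single-pass filter: a feature's usefulness is re-evaluated in the context of the features that remain.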
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali
2017-01-01
Objectives: Widespread implementation of electronic databases has improved the accessibility of plaintext clinical information for supplementary use. Numerous machine learning techniques, such as supervised machine learning approaches or ontology-based approaches, have been employed to obtain useful information from plaintext clinical data. This study proposes an automatic multi-class classification system to predict accident-related causes of death from plaintext autopsy reports through expert-driven feature selection with supervised automatic text classification decision models. Methods: Accident-related autopsy reports were obtained from one of the largest hospitals in Kuala Lumpur. These reports belong to nine different accident-related causes of death. A master feature vector was prepared by extracting features from the collected autopsy reports using unigrams with lexical categorization. This master feature vector was used to detect the cause of death [according to the International Classification of Diseases, version 10 (ICD-10)] through five automated feature selection schemes, the proposed expert-driven approach, five subset sizes of features, and five machine learning classifiers. Model performance was evaluated using macro-averaged precision, recall and F-measure, accuracy, and area under the ROC curve. Four baselines were used to compare the results with the proposed system. Results: Random forest and J48 decision models parameterized using expert-driven feature selection yielded the highest evaluation measures (approaching 85% to 90% for most metrics) using a feature subset size of 30. The proposed system also showed approximately 14% to 16% improvement in overall accuracy compared with the existing techniques and four baselines. Conclusion: The proposed system is feasible and practical to use for automatic classification of ICD-10-related causes of death from autopsy reports.
The proposed system assists pathologists to accurately and rapidly determine underlying cause of death based on autopsy findings. Furthermore, the proposed expert-driven feature selection approach and the findings are generally applicable to other kinds of plaintext clinical reports. PMID:28166263
Distinguishing Different Strategies of Across-Dimension Attentional Selection
ERIC Educational Resources Information Center
Huang, Liqiang; Pashler, Harold
2012-01-01
Selective attention in multidimensional displays has usually been examined using search tasks requiring the detection of a single target. We examined the ability to perceive a spatial structure in multi-item subsets of a display that were defined either conjunctively or disjunctively. Observers saw two adjacent displays and indicated whether the…
[Regression analysis to select native-like structures from decoys of antigen-antibody docking].
Chen, Zhengshan; Chi, Xiangyang; Fan, Pengfei; Zhang, Guanying; Wang, Meirong; Yu, Changming; Chen, Wei
2018-06-25
Given the increasing exploitation of antibodies in different contexts such as molecular diagnostics and therapeutics, it would be beneficial to unravel the properties of antigen-antibody interactions with computational protein-protein docking, especially in the absence of a cocrystal structure. However, obtaining a native-like antigen-antibody structure remains challenging, due in part to the failure of existing scoring functions to reliably discriminate accurate from inaccurate structures among the tens of thousands of decoys produced by computational docking. We hypothesized that some important physicochemical and energetic features could be used to describe antigen-antibody interfaces and identify native-like antigen-antibody structures. We prepared a dataset, a subset of Protein-Protein Docking Benchmark Version 4.0 comprising 37 nonredundant 3D structures of antigen-antibody complexes, and used it to train and test a multivariate logistic regression model which took several important physicochemical and energetic features of decoys as predictor variables. Our results indicate that the ability of our method to identify native-like structures is superior to the ZRANK and ZDOCK scores for this subset of antigen-antibody complexes. We then used our method in a workflow for predicting the epitope of the anti-Ebola glycoprotein monoclonal antibody 4G7 and identified three accurate residues in its epitope.
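Scoring decoys with a fitted logistic regression reduces to a sigmoid over a weighted sum of interface features. In the sketch below, the features and coefficient values are placeholders for illustration, not the fitted model from the paper:

```python
import math

def native_likeness(features, coefficients, intercept):
    """Probability-like score that a docking decoy is native-like,
    from a logistic regression over interface descriptors (e.g.
    buried surface area, energy terms).  Coefficients would come from
    fitting on a benchmark of known complexes."""
    z = intercept + sum(c * f for c, f in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

def rank_decoys(decoys, coefficients, intercept):
    """Return decoy indices sorted from most to least native-like."""
    scores = [native_likeness(f, coefficients, intercept) for f in decoys]
    return sorted(range(len(decoys)), key=lambda i: -scores[i])
```

After docking, re-ranking the full decoy set with such a model is what replaces (or supplements) the built-in ZRANK/ZDOCK scores.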
NASA Astrophysics Data System (ADS)
Henderson, J. M.; Hoffman, R. N.; Leidner, S. M.; Atlas, R.; Brin, E.; Ardizzone, J. V.
2003-06-01
The ocean surface vector wind can be measured from space by scatterometers. For a set of measurements observed from several viewing directions and collocated in space and time, there will usually exist two, three, or four consistent wind vectors. These multiple wind solutions are known as ambiguities. Ambiguity removal procedures select one ambiguity at each location. We compare results of two different ambiguity removal algorithms, the operational median filter (MF) used by the Jet Propulsion Laboratory (JPL) and a two-dimensional variational analysis method (2d-VAR). We applied 2d-VAR to the entire NASA Scatterometer (NSCAT) mission, orbit by orbit, using European Centre for Medium-Range Weather Forecasts (ECMWF) 10-m wind analyses as background fields. We also applied 2d-VAR to a 51-day subset of the NSCAT mission using National Centers for Environmental Prediction (NCEP) 1000-hPa wind analyses as background fields. This second data set uses the same background fields as the MF data set. When both methods use the same NCEP background fields as a starting point for ambiguity removal, agreement is very good: only approximately 3% of the wind vector cells (WVCs) have different ambiguity selections; however, most of the WVCs with changes occur in coherent patches. Since at least one of the selections is in error, this implies that errors due to ambiguity selection are not isolated, but are horizontally correlated. When we examine ambiguity selection differences at synoptic scales, we often find that the 2d-VAR selections are more meteorologically reasonable and more consistent with cloud imagery.
Li, Guang-Qing; Liu, Zi; Shen, Hong-Bin; Yu, Dong-Jun
2016-10-01
As one of the most ubiquitous post-transcriptional modifications of RNA, N6-methyladenosine (m6A) plays an essential role in many vital biological processes. The identification of m6A sites in RNAs is important for both basic biomedical research and practical drug development. In this study, we designed a computational method, called TargetM6A, to rapidly and accurately target m6A sites solely from the primary RNA sequences. Two new features, i.e., position-specific nucleotide/dinucleotide propensities (PSNP/PSDP), are introduced and combined with the traditional nucleotide composition (NC) feature to formulate RNA sequences. The extracted features are further optimized to obtain a much more compact and discriminative feature subset by applying an incremental feature selection (IFS) procedure. Based on the optimized feature subset, we trained TargetM6A on the training dataset with a support vector machine (SVM) as the prediction engine. We compared the proposed TargetM6A method with existing methods for predicting m6A sites by performing stringent jackknife tests and independent validation tests on benchmark datasets. The experimental results show that the proposed TargetM6A method outperformed the existing methods for predicting m6A sites and remarkably improved the prediction performance, with MCC = 0.526 and AUC = 0.818. We also provide a user-friendly web server for TargetM6A, which is publicly accessible for academic use at http://csbio.njust.edu.cn/bioinf/TargetM6A.
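The incremental feature selection (IFS) step described above can be illustrated with a minimal sketch: rank features by a simple discriminative score, then grow the subset one feature at a time and keep the size that maximizes a cheap cross-validated accuracy. The toy data, the Fisher-style ranking score, and the nearest-centroid evaluator are all illustrative stand-ins for the paper's SVM-based pipeline:

```python
import random

# Toy data: 6 features, only the first two informative (an assumption for illustration).
random.seed(1)
def sample(label, n=15):
    rows = []
    for _ in range(n):
        row = [random.gauss(2.0 if label else 0.0, 1.0),
               random.gauss(0.0 if label else 2.0, 1.0)]
        row += [random.gauss(0, 1) for _ in range(4)]  # noise features
        rows.append(row)
    return rows

X = sample(0) + sample(1)
y = [0] * 15 + [1] * 15

def f_score(X, y, j):
    """Fisher-style score: squared mean difference over pooled variance."""
    a = [x[j] for x, t in zip(X, y) if t == 0]
    b = [x[j] for x, t in zip(X, y) if t == 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
    vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
    return (ma - mb) ** 2 / (va + vb + 1e-9)

def loo_centroid_acc(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on `feats`."""
    correct = 0
    for i in range(len(X)):
        trX = [x for k, x in enumerate(X) if k != i]
        trY = [t for k, t in enumerate(y) if k != i]
        cents = {}
        for c in (0, 1):
            rows = [x for x, t in zip(trX, trY) if t == c]
            cents[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
        d = {c: sum((X[i][j] - cents[c][k]) ** 2 for k, j in enumerate(feats))
             for c in (0, 1)}
        correct += (min(d, key=d.get) == y[i])
    return correct / len(X)

# IFS: rank once, then add features incrementally and keep the best subset.
ranked = sorted(range(6), key=lambda j: -f_score(X, y, j))
best_acc, best_subset = 0.0, []
for k in range(1, 7):
    acc = loo_centroid_acc(X, y, ranked[:k])
    if acc > best_acc:
        best_acc, best_subset = acc, ranked[:k]
```

The two informative features should rank first, and the retained subset should stay compact rather than including the noise features.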
Kavakiotis, Ioannis; Samaras, Patroklos; Triantafyllidis, Alexandros; Vlahavas, Ioannis
2017-11-01
Single Nucleotide Polymorphisms (SNPs) are nowadays becoming the marker of choice for biological analyses involving a wide range of applications of great medical, biological, economic and environmental interest. Classification tasks, i.e., the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields, such as forensic investigations and discrimination between wild and/or farmed populations. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification tasks, especially for SNP datasets, where the number of features can amount to hundreds of thousands or millions. In this paper, we present a novel data mining approach, called FIFS (Frequent Item Feature Selection), based on the use of frequent items for selection of the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them, in order to create the informative SNP subsets to be returned. The proposed method (FIFS) was tested on a real dataset comprising a comprehensive coverage of pig breed types present in Britain. This dataset consisted of 446 individuals divided into 14 sub-populations, genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, our method surpassed the assignment accuracy threshold of 95% needing only half the number of SNPs selected by other methods (FIFS: 28 SNPs; Delta: 70 SNPs; pairwise FST: 70 SNPs; In: 100 SNPs). CONCLUSION: Our approach successfully deals with the problem of informative marker selection in high-dimensional genomic datasets. It offers better results compared to existing approaches and can aid biologists in selecting the most informative markers with maximum discrimination power for optimization of cost-effective panels with applications related to e.g. species identification, wildlife management, and forensics. Copyright © 2017 Elsevier Ltd. All rights reserved.
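The first FIFS component, finding genotypes that are frequent within one population and rare elsewhere, can be sketched as follows; the genotype matrix and the 0.6 frequency threshold are invented for illustration:

```python
from collections import Counter

# Toy genotype matrix: rows = individuals, columns = SNPs coded 0/1/2.
# Population A is nearly fixed for genotype 2 at SNP 0; B for genotype 0 (assumed data).
pops = {
    "A": [[2, 1, 0], [2, 1, 1], [2, 0, 1], [2, 1, 0], [1, 1, 2]],
    "B": [[0, 1, 2], [0, 1, 1], [0, 0, 1], [0, 1, 2], [1, 1, 0]],
}

def frequent_unique(pops, min_freq=0.6):
    """For each population, find (snp, genotype) pairs frequent there and not
    frequent in any other population (the core FIFS idea, much simplified)."""
    n_snps = len(next(iter(pops.values()))[0])
    # Per-population genotype counts at each SNP.
    freqs = {name: [Counter(r[j] for r in rows) for j in range(n_snps)]
             for name, rows in pops.items()}
    out = {}
    for name in pops:
        n = len(pops[name])
        found = []
        for j in range(n_snps):
            for g, c in freqs[name][j].items():
                if c / n >= min_freq and all(
                        freqs[o][j][g] / len(pops[o]) < min_freq
                        for o in pops if o != name):
                    found.append((j, g))
        out[name] = found
    return out

markers = frequent_unique(pops)
```

Here SNP 0 discriminates the two toy populations, while SNP 1 (shared genotype) and SNP 2 (no frequent genotype) are rejected; the second FIFS component would then assemble such markers into panels.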
Paroxysmal atrial fibrillation prediction method with shorter HRV sequences.
Boon, K H; Khalil-Hani, M; Malarvili, M B; Sia, C W
2016-10-01
This paper proposes a method that predicts the onset of paroxysmal atrial fibrillation (PAF), using heart rate variability (HRV) segments that are shorter than those applied in existing methods, while maintaining good prediction accuracy. PAF is a common cardiac arrhythmia that increases the health risk of a patient, and the development of an accurate predictor of the onset of PAF is clinically important because it increases the possibility to stabilize (electrically) and prevent the onset of atrial arrhythmias with different pacing techniques. We investigate the effect of HRV features extracted from different lengths of HRV segments prior to PAF onset with the proposed PAF prediction method. The pre-processing stage of the predictor includes QRS detection, HRV quantification and ectopic beat correction. Time-domain, frequency-domain, non-linear and bispectrum features are then extracted from the quantified HRV. In the feature selection, the HRV feature set and classifier parameters are optimized simultaneously using an optimization procedure based on a genetic algorithm (GA). The full feature set and a statistically significant feature subset are each optimized by the GA. For the statistically significant feature subset, the Mann-Whitney U test is used to filter out features that fail the statistical test at the 20% significance level. The final stage of our predictor is a classifier based on a support vector machine (SVM). A 10-fold cross-validation is applied in performance evaluation, and the proposed method achieves 79.3% prediction accuracy using 15-minute HRV segments. This accuracy is comparable to that achieved by existing methods that use 30-minute HRV segments, most of which achieve an accuracy of around 80%. More importantly, our method significantly outperforms those that applied segments shorter than 30 minutes. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
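The GA-based subset optimization described above can be illustrated with a minimal genetic algorithm over binary feature masks. The fitness function here is a toy surrogate (it rewards a fixed set of "informative" features and penalizes extras), standing in for the cross-validated classifier accuracy the paper optimizes:

```python
import random
random.seed(42)

INFORMATIVE = {0, 3, 5}   # assumed ground truth for this toy example
N_FEATURES = 10

def fitness(mask):
    """Toy surrogate for classifier accuracy: reward informative features,
    penalize extra selected ones."""
    hits = sum(1 for j in INFORMATIVE if mask[j])
    extras = sum(mask) - hits
    return hits - 0.2 * extras

def ga(pop_size=30, generations=60, p_mut=0.05):
    """Elitist GA: keep the top half, refill with one-point crossover children
    plus bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(elite):
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, N_FEATURES)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < p_mut) for g in child]  # mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

best = ga()
```

In the paper the chromosome would also encode SVM hyperparameters, so the feature subset and classifier parameters are optimized jointly.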
A Metacommunity Framework for Enhancing the Effectiveness of Biological Monitoring Strategies
Roque, Fabio O.; Cottenie, Karl
2012-01-01
Because of inadequate knowledge and funding, the use of biodiversity indicators is often suggested as a way to support management decisions. Consequently, many studies have analyzed the performance of certain groups as indicator taxa. However, in addition to knowing whether certain groups can adequately represent the biodiversity as a whole, we must also know whether they show similar responses to the main structuring processes affecting biodiversity. Here we present an application of the metacommunity framework for evaluating the effectiveness of biodiversity indicators. Although the metacommunity framework has contributed to a better understanding of biodiversity patterns, there is still limited discussion about its implications for conservation and biomonitoring. We evaluated the effectiveness of indicator taxa in representing spatial variation in macroinvertebrate community composition in Atlantic Forest streams, and the processes that drive this variation. We focused on analyzing whether some groups conform to environmental processes and other groups are more influenced by spatial processes, and on how this can help in deciding which indicator group or groups should be used. We showed that a relatively small subset of taxa from the metacommunity would represent 80% of the variation in community composition shown by the entire metacommunity. Moreover, this subset does not have to be composed of predetermined taxonomic groups, but rather can be defined based on random subsets. We also found that some random subsets composed of a small number of genera performed better in responding to major environmental gradients. There were also random subsets that seemed to be affected by spatial processes, which could indicate important historical processes. 
We were able to integrate, within the same theoretical and practical framework, the selection of biodiversity surrogates and indicators of environmental conditions and, more importantly, to explicitly incorporate environmental and spatial processes into the selection approach. PMID:22937068
Berker, Yannick; Karp, Joel S; Schulz, Volkmar
2017-09-01
The use of scattered coincidences for attenuation correction of positron emission tomography (PET) data has recently been proposed. For practical applications, convergence speeds require further improvement, yet there exists a trade-off between convergence speed and the risk of non-convergence. In this respect, a maximum-likelihood gradient-ascent (MLGA) algorithm and a previously proposed two-branch back-projection (2BP) were evaluated. MLGA was combined with the Armijo step size rule and accelerated using conjugate gradients, Nesterov's momentum method, and data subsets of different sizes. In 2BP, we varied the subset size, an important determinant of convergence speed and computational burden. We used three sets of simulation data to evaluate the impact of a spatial scale factor. The Armijo step size rule allowed 10-fold larger step sizes than native MLGA. Conjugate gradients and Nesterov momentum led to slightly faster, yet non-uniform, convergence; improvements were mostly confined to later iterations, possibly due to the non-linearity of the problem. MLGA with data subsets achieved faster, uniform, and predictable convergence, with a speed-up factor equivalent to the number of subsets and no increase in computational burden. By contrast, the computational burden of 2BP increased linearly with the number of subsets due to repeated evaluation of the objective function, and convergence was limited to the case of many (and therefore small) subsets, which resulted in high computational burden. Possibilities for improving 2BP appear limited. While general-purpose acceleration methods appear insufficient for MLGA, the results suggest that data subsets are a promising way of improving MLGA performance.
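The Armijo step size rule mentioned above can be sketched in isolation: backtrack the step until a sufficient-ascent condition holds. The quadratic objective below is a toy stand-in for the scatter log-likelihood:

```python
def grad_ascent_armijo(f, grad, x0, s0=1.0, beta=0.5, sigma=1e-4, iters=100):
    """Gradient ascent with Armijo backtracking: shrink the step s until
    f(x + s*g) >= f(x) + sigma * s * ||g||^2 holds."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 < 1e-18:
            break
        s = s0
        while f([xi + s * gi for xi, gi in zip(x, g)]) < f(x) + sigma * s * gnorm2:
            s *= beta
            if s < 1e-12:
                break
        x = [xi + s * gi for xi, gi in zip(x, g)]
    return x

# Toy concave objective standing in for the log-likelihood (an assumption);
# maximum at x = (2, -1).
f = lambda x: -(x[0] - 2.0) ** 2 - 10.0 * (x[1] + 1.0) ** 2
grad = lambda x: [-2.0 * (x[0] - 2.0), -20.0 * (x[1] + 1.0)]
x_opt = grad_ascent_armijo(f, grad, [0.0, 0.0])
```

The backtracking loop is what lets the method start from aggressive step sizes without risking divergence, which is the trade-off the abstract refers to.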
Kavuluru, Ramakanth; Rios, Anthony; Lu, Yuan
2015-10-01
Diagnosis codes are assigned to medical records in healthcare facilities by trained coders who review all physician-authored documents associated with a patient's visit. This is a necessary and complex task that requires coders to adhere to coding guidelines and to assign all applicable codes. With the popularity of electronic medical records (EMRs), computational approaches to code assignment have been proposed in recent years. However, most efforts have focused on single and often short clinical narratives, while realistic scenarios warrant full EMR-level analysis for code assignment. We evaluate supervised learning approaches to automatically assign International Classification of Diseases (ninth revision) - Clinical Modification (ICD-9-CM) codes to EMRs by experimenting with a large realistic EMR dataset. The overall goal is to identify methods that offer superior performance in this task when considering such datasets. We use a dataset of 71,463 EMRs corresponding to in-patient visits with discharge dates falling in a two-year period (2011-2012) from the University of Kentucky (UKY) Medical Center. We curate a smaller subset of this dataset and also use a third gold standard dataset of radiology reports. We conduct experiments using different problem transformation approaches with feature and data selection components, employing suitable label calibration and ranking methods with novel features involving code co-occurrence frequencies and latent code associations. Over all codes with at least 50 training examples we obtain a micro F-score of 0.48. On the set of codes that occur in at least 1% of the two-year dataset, we achieve a micro F-score of 0.54. For the smaller radiology report dataset, the classifier chaining approach yields the best results. For the smaller subset of the UKY dataset, feature selection, data selection, and label calibration offer the best performance.
We show that datasets at different scales (size of the EMRs, number of distinct codes) and with different characteristics warrant different learning approaches. For shorter narratives pertaining to a particular medical subdomain (e.g., radiology, pathology), classifier chaining is ideal given that the codes are highly related to each other. For realistic in-patient full EMRs, feature and data selection methods offer high performance on smaller datasets. However, for large EMR datasets, we observe that the binary relevance approach with learning-to-rank based code reranking offers the best performance. Regardless of the training dataset size, for general EMRs, label calibration to select the optimal number of labels is an indispensable final step. Copyright © 2015 Elsevier B.V. All rights reserved.
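The binary relevance plus label calibration strategy that wins on large EMR datasets can be sketched as follows. The per-code weight vectors, the score-gap calibration rule, and the ICD-9 codes shown are all hypothetical illustrations, not the paper's trained models:

```python
def binary_relevance_scores(x, label_models):
    """Score each label independently (binary relevance); each 'model' here is
    just a weight vector, a stand-in for a per-code classifier."""
    return {lab: sum(w * v for w, v in zip(ws, x))
            for lab, ws in label_models.items()}

def calibrate_k(scores, threshold_gap=0.5):
    """Label calibration (sketch): walk down the ranked labels and stop at the
    first large score drop, instead of predicting a fixed top-k."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    keep = [ordered[0][0]]
    for (_, prev_s), (lab, s) in zip(ordered, ordered[1:]):
        if prev_s - s > threshold_gap:
            break
        keep.append(lab)
    return keep

# Hypothetical ICD-9 codes with hypothetical weight vectors over 3 features.
label_models = {
    "401.9": [0.9, 0.1, 0.0],
    "250.00": [0.8, 0.0, 0.2],
    "V58.69": [0.0, 0.1, 0.1],
}
scores = binary_relevance_scores([1.0, 1.0, 0.0], label_models)
predicted = calibrate_k(scores)
```

In the paper, the calibration step is itself learned from data; the fixed score-gap rule here only illustrates why selecting the number of labels matters as a final step.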
NASA Astrophysics Data System (ADS)
Yingst, R. A.; Bartley, J. K.; Chidsey, T. C.; Cohen, B. A.; Gilleaudeau, G. J.; Hynek, B. M.; Kah, L. C.; Minitti, M. E.; Williams, R. M. E.; Black, S.; Gemperline, J.; Schaufler, R.; Thomas, R. J.
2018-05-01
The GHOST field tests are designed to isolate and test science-driven rover operations protocols, to determine best practices. During a recent field test at a potential Mars 2020 landing site analog, we tested two Mars Science Laboratory data-acquisition and decision-making methods to assess the resulting science return and sample quality: a linear method, where sites of interest are studied in the order encountered, and a "walkabout-first" method, where sites of interest are examined remotely before down-selecting to a subset of sites that are interrogated with more resource-intensive instruments. The walkabout method required less time and fewer resources, while increasing confidence in interpretations. Contextual data critical to evaluating site geology were acquired earlier than with the linear method and given higher priority, which resulted in the development of more mature hypotheses earlier in the analysis process. Together, these factors saved time and energy in the collection of data with more limited spatial coverage. Based on these results, we suggest that the walkabout method be used where doing so would provide early context and time for the science team to develop hypotheses and critical tests, and that, in gathering context, coverage may be more important than higher resolution.
Use of direct and iterative solvers for estimation of SNP effects in genome-wide selection
2010-01-01
The aim of this study was to compare iterative and direct solvers for estimation of marker effects in genomic selection. One iterative and two direct methods were used: Gauss-Seidel with Residual Update, Cholesky decomposition, and Gentleman-Givens rotations. To represent different scenarios with respect to the number of markers and of genotyped animals, a simulated data set divided into 25 subsets was used. The number of markers ranged from 1,200 to 5,925 and the number of animals ranged from 1,200 to 5,865. The methods were also applied to real data comprising 3,081 individuals genotyped for 45,181 SNPs. Results from simulated data showed that the iterative solver was substantially faster than the direct methods for larger numbers of markers. Use of a direct solver may allow for computing (co)variances of SNP effects. When applied to real data, performance of the iterative method varied substantially, depending on the level of ill-conditioning of the coefficient matrix. From the results with real data, Gentleman-Givens rotations would be the method of choice in this particular application, as it provided an exact solution within a fairly reasonable time frame (less than two hours). It would indeed be the preferred method whenever computer resources allow its use. PMID:21637627
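The Gauss-Seidel with Residual Update (GSRU) solver compared above can be sketched for a ridge-type system (X'X + λI)g = X'y: keeping the residuals updated in place means the coefficient matrix never has to be formed. All data here are simulated for illustration:

```python
import random
random.seed(7)

# Toy genotype matrix (n animals x m SNPs, coded 0/1/2) and phenotypes.
n, m, lam = 50, 8, 5.0
X = [[random.choice((0, 1, 2)) for _ in range(m)] for _ in range(n)]
true_g = [0.5, -0.3, 0.0, 0.2, 0.0, -0.4, 0.1, 0.0]
y = [sum(xij * gj for xij, gj in zip(row, true_g)) + random.gauss(0, 0.5)
     for row in X]

def gsru(X, y, lam, iters=200):
    """Gauss-Seidel with Residual Update for (X'X + lam*I) g = X'y.
    The residual vector e = y - X g is adjusted after each SNP effect update,
    so only one column of X is touched at a time."""
    n, m = len(X), len(X[0])
    g = [0.0] * m
    e = list(y)                                     # residuals with g = 0
    diag = [sum(X[i][j] ** 2 for i in range(n)) + lam for j in range(m)]
    for _ in range(iters):
        for j in range(m):
            # X_j' e + (X_j' X_j) g_j is the Gauss-Seidel right-hand side.
            rhs = sum(X[i][j] * e[i] for i in range(n)) + (diag[j] - lam) * g[j]
            new_gj = rhs / diag[j]
            delta = new_gj - g[j]
            for i in range(n):
                e[i] -= X[i][j] * delta             # residual update
            g[j] = new_gj
    return g

g_hat = gsru(X, y, lam)
```

At convergence the normal-equation residual X'(y - X g) - λg vanishes, which is the property checked below; the direct methods in the study solve the same system exactly.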
Li, Yang; Cui, Weigang; Luo, Meilin; Li, Ke; Wang, Lina
2018-01-25
Electroencephalogram (EEG) signal analysis is a valuable tool in the evaluation of neurological disorders and is commonly used for the diagnosis of epileptic seizures. This paper presents a novel automatic EEG signal classification method for epileptic seizure detection. The proposed method first employs a continuous wavelet transform (CWT) to obtain the time-frequency images (TFI) of EEG signals. The processed EEG signals are then decomposed into five sub-band frequency components of clinical interest, since these sub-band frequency components exhibit much better discriminative characteristics. Both Gaussian Mixture Model (GMM) features and Gray Level Co-occurrence Matrix (GLCM) descriptors are then extracted from these sub-band TFI. Additionally, in order to improve classification accuracy, a compact feature selection method combining ReliefF with the support vector machine-based recursive feature elimination (RFE-SVM) algorithm is adopted to select the most discriminative feature subset, which is then input to an SVM with a radial basis function (RBF) kernel for classifying epileptic seizure EEG signals. The experimental results on a publicly available benchmark database demonstrate that the proposed approach provides better classification accuracy than recently proposed methods in the literature, indicating the effectiveness of the proposed method in the detection of epileptic seizures.
Weighted Geometric Dilution of Precision Calculations with Matrix Multiplication
Chen, Chien-Sheng
2015-01-01
To enhance the performance of location estimation in wireless positioning systems, the geometric dilution of precision (GDOP) is widely used as a criterion for selecting measurement units. Since GDOP represents the geometric effect on the relationship between measurement error and positioning determination error, the measurement unit subset with the smallest GDOP is usually chosen for positioning. The conventional GDOP calculation using the matrix inversion method requires many operations. Because more and more measurement units can be chosen nowadays, an efficient calculation should be designed to decrease the complexity. Since the performance of each measurement unit is different, the weighted GDOP (WGDOP), instead of GDOP, is used to select the measurement units to improve the accuracy of location. To calculate WGDOP effectively and efficiently, a closed-form solution for the WGDOP calculation is proposed for the case when more than four measurements are available. In this paper, an efficient WGDOP calculation method applying matrix multiplication that is easy for hardware implementation is proposed. In addition, the proposed method can be used when four or more measurements are available. Even when using the all-in-view method for positioning, the proposed method can still reduce the computational overhead. The proposed WGDOP methods, with less computation, are compatible with the global positioning system (GPS), wireless sensor networks (WSN) and cellular communication systems. PMID:25569755
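The baseline matrix-inversion WGDOP computation that the closed form accelerates follows directly from the definition WGDOP = sqrt(trace((HᵀWH)⁻¹)), with H the geometry matrix and W a diagonal weight matrix. The 2-D geometry below is an invented example, not from the paper:

```python
import math

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_inv(A):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))   # partial pivot
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def wgdop(H, weights):
    """WGDOP = sqrt(trace((H^T W H)^{-1})) with W = diag(weights)."""
    Ht = [list(col) for col in zip(*H)]
    WH = [[w * v for v in row] for w, row in zip(weights, H)]
    inv = mat_inv(mat_mul(Ht, WH))
    return math.sqrt(sum(inv[i][i] for i in range(len(inv))))

# Toy 2-D geometry: unit line-of-sight vectors from four measurement units.
H = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
gdop_equal = wgdop(H, [1.0] * 4)                 # equal weights reduce WGDOP to GDOP
gdop_weighted = wgdop(H, [2.0, 2.0, 1.0, 1.0])   # trusting two units more
```

For this symmetric geometry HᵀH = 2I, so the unweighted value is exactly 1; the paper's contribution is computing the same quantity via matrix multiplications only, which is friendlier to hardware than the inversion used here.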
NASA Astrophysics Data System (ADS)
Polewski, Przemyslaw; Yao, Wei; Heurich, Marco; Krzystek, Peter; Stilla, Uwe
2018-06-01
In this study, we present a method for improving the quality of automatic single fallen tree stem segmentation in ALS data by applying a specialized constrained conditional random field (CRF). The entire processing pipeline is composed of two steps. First, short stem segments of equal length are detected and a subset of them is selected for further processing, while in the second step the chosen segments are merged to form entire trees. The first step is accomplished using the specialized CRF defined on the space of segment labelings, capable of finding segment candidates which are easier to merge subsequently. To achieve this, the CRF considers not only the features of every candidate individually, but incorporates pairwise spatial interactions between adjacent segments into the model. In particular, pairwise interactions include a collinearity/angular deviation probability which is learned from training data as well as the ratio of spatial overlap, whereas unary potentials encode a learned probabilistic model of the laser point distribution around each segment. Each of these components enters the CRF energy with its own balance factor. To process previously unseen data, we first calculate the subset of segments for merging on a grid of balance factors by minimizing the CRF energy. Then, we perform the merging and rank the balance configurations according to the quality of their resulting merged trees, obtained from a learned tree appearance model. The final result is derived from the top-ranked configuration. We tested our approach on 5 plots from the Bavarian Forest National Park using reference data acquired in a field inventory. Compared to our previous segment selection method without pairwise interactions, an increase in detection correctness and completeness of up to 7 and 9 percentage points, respectively, was observed.
2011-01-01
Background CD8+ T cells may play an important role in protecting against HIV. However, the changes in CD8+ T cell subsets during the early period of ART have not been fully studied. Methods Twenty-one asymptomatic treatment-naïve HIV-infected patients with CD4+ T cells below 350 cells/μl were enrolled in the study. Naïve, central memory (CM), effector memory (EM) and terminally differentiated effector (EMRA) CD8+ cell subsets, and their activation and proliferation subsets, were evaluated in blood samples collected at baseline and at weeks 2, 4, 8 and 12 of ART. Results The total CD8+ T cells declined, and the naïve and CM subsets tended to increase. Activation levels of all CD8+ T cell subsets except the EMRA subset decreased after ART. However, proliferation levels of total CD8+ T cells and of the EMRA, EM and CM subsets increased during the first 4 weeks of ART, then decreased. The proliferation level of naïve cells decreased after ART. Conclusion The changes in CD8+ T cell subsets during initial ART are complex. Our results display a complete phenotypical picture of CD8+ cell subsets during initial ART and provide insights for understanding immune status during ART. PMID:21435275
Optimization of Self-Directed Target Coverage in Wireless Multimedia Sensor Network
Yang, Yang; Wang, Yufei; Pi, Dechang; Wang, Ruchuan
2014-01-01
Video and image sensors in wireless multimedia sensor networks (WMSNs) have a directed view and a limited sensing angle, so methods that solve the target coverage problem for traditional sensor networks, which use a circular sensing model, are not suitable for WMSNs. Based on the proposed FoV (field of view) sensing model and FoV disk model, the expected coverage of a target by a multimedia sensor is defined in terms of the deflection angle between the target and the sensor's current orientation and the distance between the target and the sensor. Target coverage optimization algorithms based on the expected coverage value are then presented for the single-sensor single-target, multisensor single-target, and single-sensor multitarget problems, respectively. For the multisensor multitarget problem, which is NP-complete, candidate orientations are generated by selecting, for each sensor, the orientations to which it can rotate so as to cover each target falling within its FoV disk; a genetic algorithm then produces an approximate minimum subset of sensors that covers all targets in the network. Simulation results show the algorithm's performance and the effect of the number of targets on the resulting subset. PMID:25136667
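The NP-complete core of the multisensor multitarget problem is a set cover over each sensor's coverable target set. As a simplified stand-in for the paper's genetic algorithm, a greedy cover over hypothetical coverage sets looks like this:

```python
def greedy_cover(coverage, targets):
    """Pick sensors one at a time, each covering the most still-uncovered
    targets: the classic greedy set-cover heuristic, used here as a simple
    stand-in for the paper's genetic algorithm."""
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best = max(coverage, key=lambda s: len(coverage[s] & uncovered))
        if not coverage[best] & uncovered:
            raise ValueError("some targets cannot be covered by any sensor")
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

# Hypothetical coverage sets: sensor -> targets inside its FoV disk after rotation.
coverage = {
    "s1": {"t1", "t2", "t3"},
    "s2": {"t3", "t4"},
    "s3": {"t4", "t5"},
    "s4": {"t5"},
}
subset = greedy_cover(coverage, {"t1", "t2", "t3", "t4", "t5"})
```

Greedy gives a logarithmic approximation guarantee but ignores the orientation-dependent expected coverage values; the GA in the paper searches the same combinatorial space while optimizing those values.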
Supernova Cosmology Inference with Probabilistic Photometric Redshifts (SCIPPR)
NASA Astrophysics Data System (ADS)
Peters, Christina; Malz, Alex; Hlozek, Renée
2018-01-01
The Bayesian Estimation Applied to Multiple Species (BEAMS) framework employs probabilistic supernova type classifications to do photometric SN cosmology. This work extends BEAMS to replace high-confidence spectroscopic redshifts with photometric redshift probability density functions, a capability that will be essential in the era of the Large Synoptic Survey Telescope and other next-generation photometric surveys, where it will not be possible to perform spectroscopic follow-up on every SN. We present the Supernova Cosmology Inference with Probabilistic Photometric Redshifts (SCIPPR) Bayesian hierarchical model for constraining the cosmological parameters from photometric lightcurves and host galaxy photometry, which includes selection effects and is extensible to uncertainty in the redshift-dependent supernova type proportions. We create a pair of realistic mock catalogs of joint posteriors over supernova type, redshift, and distance modulus informed by photometric supernova lightcurves, and over redshift from simulated host galaxy photometry. We perform inference under our model to obtain a joint posterior probability distribution over the cosmological parameters and compare our results with other methods, namely a spectroscopic subset, a subset of high-probability photometrically classified supernovae, and reduction of the photometric redshift probability to a single measurement and error bar.
Ensemble Feature Learning of Genomic Data Using Support Vector Machine
Anaissi, Ali; Goyal, Madhu; Catchpoole, Daniel R.; Braytee, Ali; Kennedy, Paul J.
2016-01-01
The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, but mostly for classification, not gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) method for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy that is the rationale of the RFE algorithm. The rationale is that building ensemble SVM models from randomly drawn bootstrap samples of the training set will produce different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision to eliminate a feature is based upon the rankings of multiple SVM models instead of one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing nearly balanced bootstrap samples. Our experiments show that ESVM-RFE for gene selection substantially increased classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that, on average, 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over a random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters within the selected data. PMID:27304923
Vijayan, R S K; Ghoshal, Nanda
2008-10-01
Given the heterogeneity of the GABA(A) receptor, the pharmacological significance of identifying subtype-selective modulators is increasingly being recognized. Thus, drugs selective for GABA(A) alpha(3) receptors are expected to display fewer side effects than the drugs presently in clinical use. Hence we carried out 3D QSAR (three-dimensional quantitative structure-activity relationship) studies on a series of novel GABA(A) alpha(3) subtype-selective modulators to gain more insight into subtype affinity. To identify the 3D functional attributes required for subtype selectivity, a chemical feature-based pharmacophore, primarily based on selective ligands representing diverse structural classes, was generated. The obtained pseudoreceptor model of the benzodiazepine binding site revealed a binding mode akin to the "message-address" concept. Scaffold hopping was carried out across the multi-conformational Maybridge database for the identification of novel chemotypes. Further, a focused data reduction approach was employed to choose a subset of enriched compounds based on "drug-likeness" and similarity-based methods. These results, taken together, could provide impetus for the rational design and optimization of more selective and high-affinity leads with the potential for decreased adverse effects.
Selection of suitable hand gestures for reliable myoelectric human computer interface.
Castro, Maria Claudia F; Arjunan, Sridhar P; Kumar, Dinesh K
2015-04-09
A myoelectrically controlled prosthetic hand requires machine-based identification of hand gestures using the surface electromyogram (sEMG) recorded from the forearm muscles. This study observed that a subset of the hand gestures has to be selected for accurate automated hand gesture recognition, and reports a method to select these gestures so as to maximize sensitivity and specificity. Experiments were conducted in which sEMG was recorded from the muscles of the forearm while subjects performed hand gestures, and the recordings were then classified off-line. The performance of ten gestures was ranked using the proposed Positive-Negative Performance Measurement Index (PNM), generated from a series of confusion matrices. When using all ten gestures, the sensitivity and specificity were 80.0% and 97.8%. After ranking the gestures using the PNM, six gestures were selected that gave sensitivity and specificity greater than 95% (96.5% and 99.3%): hand open, hand close, little finger flexion, ring finger flexion, middle finger flexion and thumb flexion. This work has shown that reliable myoelectric-based human computer interface systems require careful selection of the gestures to be recognized; without such selection, reliability is poor.
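The confusion-matrix bookkeeping behind such a ranking can be sketched as below. The exact PNM formula is the paper's own; here gestures are ranked by the product of per-class sensitivity and specificity as a simple stand-in, on an invented 3-gesture confusion matrix:

```python
def per_class_sens_spec(cm):
    """Per-class sensitivity and specificity from a confusion matrix where
    cm[i][j] = count of class-i samples predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    out = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        out.append((sens, spec))
    return out

# Invented 3-gesture confusion matrix (rows = true gesture).
cm = [[18, 1, 1],
      [4, 14, 2],
      [0, 2, 18]]
scores = per_class_sens_spec(cm)

# Rank gestures by sensitivity * specificity (a stand-in for the PNM index).
ranked = sorted(range(3), key=lambda c: -(scores[c][0] * scores[c][1]))
```

Dropping the lowest-ranked gestures and re-training is then what recovers the >95% sensitivity and specificity reported for the six retained gestures.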
Alpha-beta coordination method for collective search
Goldsmith, Steven Y.
2002-01-01
The present invention comprises a decentralized coordination strategy called alpha-beta coordination. The alpha-beta coordination strategy is a family of collective search methods that allow teams of communicating agents to implicitly coordinate their search activities through a division of labor based on self-selected roles and self-determined status. An agent can play one of two complementary roles. An agent in the alpha role is motivated to improve its status by exploring new regions of the search space. An agent in the beta role is also motivated to improve its status, but is conservative and tends to remain aggregated with other agents until alpha agents have clearly identified and communicated better regions of the search space. An agent can select its role dynamically based on its current status value relative to the status values of neighboring team members. Status can be determined by a function of the agent's sensor readings, and can generally be a measurement of source intensity at the agent's current location. An agent's decision cycle can comprise three sequential decision rules: (1) selection of a current role based on the evaluation of the current status data, (2) selection of a specific subset of the current data, and (3) determination of the next heading using the selected data. Variations of the decision rules produce different versions of alpha and beta behaviors that lead to different collective behavior properties.
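The three-rule decision cycle described above can be sketched as follows; the median-based role threshold and the two heading rules are assumptions chosen for illustration, since the patent leaves the exact status and selection functions general:

```python
def choose_role(status, neighbor_statuses):
    """Decision rule 1 (sketch): take the alpha (explorer) role when the agent's
    status is at least the neighborhood median, beta (conservative) otherwise.
    The median threshold is an assumption, not specified by the patent."""
    if not neighbor_statuses:
        return "alpha"
    ordered = sorted(neighbor_statuses)
    median = ordered[len(ordered) // 2]
    return "alpha" if status >= median else "beta"

def next_heading(role, position, readings):
    """Decision rules 2-3 (sketch): alphas head toward the strongest reported
    source reading; betas head toward the centroid of reported positions,
    i.e., they stay aggregated. readings = [(position, intensity), ...]."""
    if role == "alpha":
        target = max(readings, key=lambda r: r[1])[0]
    else:
        target = tuple(sum(c) / len(readings)
                       for c in zip(*(r[0] for r in readings)))
    return tuple(t - p for t, p in zip(target, position))

role = choose_role(5.0, [1.0, 2.0, 8.0])
heading = next_heading("alpha", (0.0, 0.0), [((3.0, 4.0), 9.0), ((1.0, 1.0), 2.0)])
```

Because roles are re-evaluated from local status each cycle, the division of labor emerges without any central coordinator, which is the point of the alpha-beta strategy.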
Wavelet-based energy features for glaucomatous image classification.
Dua, Sumeet; Acharya, U Rajendra; Chowriappa, Pradeep; Sree, S Vinitha
2012-01-01
Texture features within images are actively pursued for accurate and efficient glaucoma classification. Energy distribution over wavelet subbands is applied to find these important texture features. In this paper, we investigate the discriminatory potential of wavelet features obtained from the Daubechies (db3), Symlets (sym3), and biorthogonal (bio3.3, bio3.5, and bio3.7) wavelet filters. We propose a novel technique to extract energy signatures obtained using the 2-D discrete wavelet transform, and subject these signatures to different feature ranking and feature selection strategies. We gauged the effectiveness of the resultant ranked and selected feature subsets using support vector machine, sequential minimal optimization, random forest, and naïve Bayes classification strategies. We observed an accuracy of around 93% using tenfold cross-validation, demonstrating the effectiveness of these methods.
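The energy-signature idea can be illustrated with a single-level 2-D Haar transform in plain Python. The paper uses db3/sym3/biorthogonal filters (in practice via a library such as PyWavelets), so the Haar filter here is only a stand-in to keep the sketch self-contained:

```python
def haar_step(seq):
    """One level of the (unnormalized) Haar transform along a sequence."""
    avg = [(seq[2 * i] + seq[2 * i + 1]) / 2 for i in range(len(seq) // 2)]
    dif = [(seq[2 * i] - seq[2 * i + 1]) / 2 for i in range(len(seq) // 2)]
    return avg, dif

def haar2d(img):
    """Single-level 2-D Haar DWT: approximation LL plus detail subbands LH/HL/HH."""
    lo, hi = [], []
    for row in img:                          # filter along rows
        a, d = haar_step(row)
        lo.append(a)
        hi.append(d)
    def col_pass(mat):                       # then filter along columns
        pairs = [haar_step(col) for col in zip(*mat)]
        to_rows = lambda cols: [list(r) for r in zip(*cols)]
        return to_rows([p[0] for p in pairs]), to_rows([p[1] for p in pairs])
    LL, LH = col_pass(lo)
    HL, HH = col_pass(hi)
    return {"LL": LL, "LH": LH, "HL": HL, "HH": HH}

def energy(band):
    """Average squared coefficient: the energy signature of one subband."""
    return sum(v * v for row in band for v in row) / (len(band) * len(band[0]))

img = [[9, 7, 6, 2],
       [5, 3, 4, 4],
       [8, 2, 4, 0],
       [6, 0, 2, 2]]
features = {name: energy(band) for name, band in haar2d(img).items()}
print(features)   # one energy value per subband, usable as a texture feature
```

The per-subband energies form the feature vector that is then ranked, selected, and fed to the classifiers.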
504 Participation of Invariant NKT Cells (Vα24Jα18) during Asthma Exacerbation in Children
Carpio-Pedroza, Juan Carlos; del Rio-Navarro, Blanca Estela; del Río-Chivardí, Jaime Mariano; Morales-Flores, Amelia; Jiménez-Zamudio, Luis Antonio; Moreno-Lafont, Martha; Pedraza-Sánchez, Sigifredo; Vaughan-Figueroa, Juan Gilberto; Escobar-Gutiérrez, Alejandro
2012-01-01
Background Invariant NKT cells (or type 1 NKT cells) co-express the CD3 marker and NK receptors (CD56, CD161) and use a single type of TCRα chain (Vα24Jα18 in humans), comprising CD4-CD8-, CD4+ and CD8+ subsets. The participation of these cells and their cytokines in asthmatic children, in stable conditions and under exacerbation, was studied. Methods Three groups of children (6–12 years old) were selected at the Allergy and Clinical Immunology Service, Hospital Infantil de Mexico: 1) asthmatics under exacerbation attack (AE), within the first 24 hours after the attack and before starting any treatment; 2) asthmatics with stable asthma (SA), without symptoms for at least a month before bleeding; and 3) healthy controls (HC) without a history of asthma or atopy and with normal lung function. Invariant NKT cells and subset levels as well as intracellular cytokines were evaluated in whole blood by 4-color flow cytometry (antibodies against CD3, CD4, CD8, CD161, Va24, IL-4 and IFN-g). Results The proportion of iNKT cells among total CD3+ cells in the HC group was 0.9%, while in SA patients it was increased up to 2.6%; interestingly, during exacerbation these cells were diminished (1.8%). iNKT CD4+ cells were 0.6% in HC, 1.8% in SA, and 0.7% in AE, while iNKT CD8+ cells were 0.1% in HC, 0.7% in SA, and 0.4% in AE. Both iNKT cell subsets expressed intracellular IFN-g and IL-4 cytokines in AE, SA and HC, but predominantly IFN-g in iNKT CD8+ cells from AE patients. Conclusions The participation of iNKT cells in asthma pathogenesis was confirmed. The increase of IFN-g production in patients with exacerbations may provide a regulatory environment that stabilizes the condition.
NASA Astrophysics Data System (ADS)
Cromwell, G.; Johnson, C. L.; Tauxe, L.; Constable, C.; Jarboe, N.
2015-12-01
Previous paleosecular variation (PSV) and time-averaged field (TAF) models draw on compilations of paleodirectional data that lack equatorial and high latitude sites and use latitudinal virtual geomagnetic pole (VGP) cutoffs designed to remove transitional field directions. We present a new selected global dataset (PSV10) of paleodirectional data spanning the last 10 Ma. We include all results calculated with modern laboratory methods, regardless of site VGP colatitude, that meet statistically derived selection criteria. We exclude studies that target transitional field states or identify significant tectonic effects, and correct for any bias from serial correlation by averaging directions from sequential lava flows. PSV10 has an improved global distribution compared with previous compilations, comprising 1519 sites from 71 studies. VGP dispersion in PSV10 varies with latitude, exhibiting substantially higher values in the southern hemisphere than at corresponding northern latitudes. Inclination anomaly estimates at many latitudes are within error of an expected GAD field, but significant negative anomalies are found at equatorial and mid-northern latitudes. Current PSV models Model G and TK03 do not fit observed PSV or TAF latitudinal behavior in PSV10, or subsets of normal and reverse polarity data, particularly for southern hemisphere sites. Attempts to fit these observations with simple modifications to TK03 yielded slight statistical improvements, but the misfits still exceed acceptable levels. The root-mean-square misfit of TK03 (and subsequent iterations) is substantially lower for the normal polarity subset of PSV10 compared with the reverse polarity data. Two-thirds of the data in PSV10 are of normal polarity, most of which are from the last 5 Ma, so we develop a new TAF model using this subset of data. We use the resulting TAF model to explore whether new statistical PSV models can better describe our new global compilation.
Predicting Degree of Benefit From Adjuvant Trastuzumab in NSABP Trial B-31
Pogue-Geile, Katherine L.; Kim, Chungyeul; Jeong, Jong-Hyeon; Tanaka, Noriko; Bandos, Hanna; Gavin, Patrick G.; Fumagalli, Debora; Goldstein, Lynn C.; Sneige, Nour; Burandt, Eike; Taniyama, Yusuke; Bohn, Olga L.; Lee, Ahwon; Kim, Seung-Il; Reilly, Megan L.; Remillard, Matthew Y.; Blackmon, Nicole L.; Kim, Seong-Rim; Horne, Zachary D.; Rastogi, Priya; Fehrenbacher, Louis; Romond, Edward H.; Swain, Sandra M.; Mamounas, Eleftherios P.; Wickerham, D. Lawrence; Geyer, Charles E.; Costantino, Joseph P.; Wolmark, Norman
2013-01-01
Background National Surgical Adjuvant Breast and Bowel Project (NSABP) trial B-31 suggested the efficacy of adjuvant trastuzumab, even in HER2-negative breast cancer. This finding prompted us to develop a predictive model for degree of benefit from trastuzumab using archived tumor blocks from B-31. Methods Case subjects with tumor blocks were randomly divided into discovery (n = 588) and confirmation cohorts (n = 991). A predictive model was built from the discovery cohort through gene expression profiling of 462 genes with nCounter assay. A predefined cut point for the predictive model was tested in the confirmation cohort. Gene-by-treatment interaction was tested with Cox models, and correlations between variables were assessed with Spearman correlation. Principal component analysis was performed on the final set of selected genes. All statistical tests were two-sided. Results Eight predictive genes associated with HER2 (ERBB2, c17orf37, GRB7) or ER (ESR1, NAT1, GATA3, CA12, IGF1R) were selected for model building. Three-dimensional subset treatment effect pattern plot using two principal components of these genes was used to identify a subset with no benefit from trastuzumab, characterized by intermediate-level ERBB2 and high-level ESR1 mRNA expression. In the confirmation set, the predefined cut points for this model classified patients into three subsets with differential benefit from trastuzumab with hazard ratios of 1.58 (95% confidence interval [CI] = 0.67 to 3.69; P = .29; n = 100), 0.60 (95% CI = 0.41 to 0.89; P = .01; n = 449), and 0.28 (95% CI = 0.20 to 0.41; P < .001; n = 442; P interaction between the model and trastuzumab < .001). Conclusions We developed a gene expression–based predictive model for degree of benefit from trastuzumab and demonstrated that HER2-negative tumors belong to the moderate benefit group, thus providing justification for testing trastuzumab in HER2-negative patients (NSABP B-47). PMID:24262440
He, Hua; McDermott, Michael P.
2012-01-01
Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified. PMID:21856650
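The stratification idea can be sketched as follows, with invented data and a simplified estimator (the authors' exact method is not reproduced here): verified subjects are grouped into propensity-score strata, and each stratum's verified subjects are weighted by the inverse of the stratum's verification rate, which corrects the sensitivity estimate under the MAR assumption.

```python
def corrected_sensitivity(subjects, n_strata=2):
    """subjects: (test_result, propensity, disease) with disease None if unverified."""
    ranked = sorted(subjects, key=lambda s: s[1])       # sort by propensity score
    size = len(ranked) // n_strata
    num = den = 0.0
    for k in range(n_strata):
        stratum = ranked[k * size:] if k == n_strata - 1 else ranked[k * size:(k + 1) * size]
        verified = [s for s in stratum if s[2] is not None]
        if not verified:
            continue
        w = len(stratum) / len(verified)                # inverse verification rate
        num += w * sum(1 for t, _, d in verified if d == 1 and t == 1)
        den += w * sum(1 for t, _, d in verified if d == 1)
    return num / den

subjects = [
    (1, 0.2, 1), (0, 0.2, None), (0, 0.2, None), (0, 0.2, 0),   # rarely verified
    (1, 0.8, 1), (1, 0.8, 1), (0, 0.8, 1), (1, 0.8, 0),         # always verified
]
print(corrected_sensitivity(subjects))   # 0.8 here, vs 0.75 for the naive estimate
```

On this toy sample, the naive estimator that ignores the verification mechanism gives 0.75, while reweighting by stratum recovers 0.8, illustrating the direction of the correction.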
Li, Juan
2011-03-01
To study the pattern of change of serum IL-6, TNF-α and peripheral blood T lymphocyte subsets in pregnant women during the perinatal period, 100 pregnant women in our hospital from November 2009 to October 2010 were selected as research subjects, and serum IL-6, TNF-α and peripheral blood T lymphocyte subsets before labor onset, at labor onset, and on the first and third days after delivery were analyzed and compared. Serum IL-6 and TNF-α at labor onset were higher than before labor onset and on the first and third days after delivery, while CD3(+), CD4(+), CD8(+) and the CD4/CD8 ratio decreased first and then increased (all P < 0.05, statistically significant differences). The changes of serum IL-6, TNF-α and peripheral blood T lymphocyte subsets in pregnant women during the perinatal period follow a regular pattern and merit clinical attention.
MODIS Interactive Subsetting Tool (MIST)
NASA Astrophysics Data System (ADS)
McAllister, M.; Duerr, R.; Haran, T.; Khalsa, S. S.; Miller, D.
2008-12-01
In response to requests from the user community, NSIDC has teamed with the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) and the Moderate Resolution Data Center (MrDC) to provide time series subsets of satellite data covering stations in the Greenland Climate Network (GC-Net) and the International Arctic Systems for Observing the Atmosphere (IASOA) network. To serve these data NSIDC created the MODIS Interactive Subsetting Tool (MIST). MIST works with 7 km by 7 km subset time series of certain Version 5 (V005) MODIS products over GC-Net and IASOA stations. User-selected data are delivered in a text Comma Separated Value (CSV) file format. MIST also provides online analysis capabilities that include generating time series and scatter plots. Currently, MIST is a beta prototype, and NSIDC intends that user requests will drive future development of the tool. The intent of this poster is to introduce MIST to the MODIS data user audience and illustrate some of its online analysis capabilities.
Fish swarm intelligent to optimize real time monitoring of chips drying using machine vision
NASA Astrophysics Data System (ADS)
Hendrawan, Y.; Hawa, L. C.; Damayanti, R.
2018-03-01
This study attempted to apply a machine vision-based monitoring system for chip drying which is able to optimize the drying process of cassava chips. The objective of this study is to propose fish swarm intelligent (FSI) optimization algorithms to find the most significant set of image features suitable for predicting the water content of cassava chips during the drying process using an artificial neural network (ANN) model. Feature selection entails choosing the feature subset that maximizes the prediction accuracy of the ANN. Multi-objective optimization (MOO) was used in this study, consisting of prediction-accuracy maximization and feature-subset-size minimization. The results showed that the best feature subset comprised grey mean, L(Lab) mean, a(Lab) energy, red entropy, hue contrast, and grey homogeneity. The best feature subset was tested successfully in the ANN model to describe the relationship between image features and the water content of cassava chips during drying, with an R2 between measured and predicted data of 0.9.
McClelland, Shawn; Brennan, Gary P; Dubé, Celine; Rajpara, Seeta; Iyer, Shruti; Richichi, Cristina; Bernard, Christophe; Baram, Tallie Z
2014-01-01
The mechanisms generating epileptic neuronal networks following insults such as severe seizures are unknown. We have previously shown that interfering with the function of the neuron-restrictive silencer factor (NRSF/REST), an important transcription factor that influences neuronal phenotype, attenuated development of this disorder. In this study, we found that epilepsy-provoking seizures increased the low NRSF levels in mature hippocampus several-fold, yet surprisingly provoked repression of only a subset (∼10%) of potential NRSF target genes. Accordingly, the repressed gene-set was rescued when NRSF binding to chromatin was blocked. Unexpectedly, genes selectively repressed by NRSF had mid-range binding frequencies to the repressor, a property that rendered them sensitive to moderate fluctuations of NRSF levels. Genes selectively regulated by NRSF during epileptogenesis coded for ion channels, receptors, and other crucial contributors to neuronal function. Thus, dynamic, selective regulation of NRSF target genes may play a role in influencing neuronal properties in pathological and physiological contexts. DOI: http://dx.doi.org/10.7554/eLife.01267.001 PMID:25117540
2012-01-01
Background Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have analyzed single variants one at a time, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants themselves and in conjunction with one another. The construction of models that account for these subsets of variants requires methodologies that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the excessive number of variants, constructing these types of models has so far been computationally infeasible. Results We have implemented an algorithm, known as greedy RLS, which we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since the memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space-efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. As a proof of concept, we apply greedy RLS to the Hypertension – UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop.
On this dataset, we also show that greedy RLS has a better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS. Conclusions Greedy RLS is the first known implementation of a machine learning based method with the capability to conduct a wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/. PMID:22551170
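The wrapper idea behind greedy RLS can be sketched as a toy forward selection with a regularized least-squares scorer. This is only an illustration: the real algorithm achieves its speed through matrix-calculus shortcuts and scores candidates by cross-validation rather than the training error used here for brevity.

```python
def solve(A, b):
    """Gauss-Jordan elimination for a small dense system A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))   # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_error(X, y, cols, lam=1e-3):
    """Training MSE of ridge regression restricted to the given columns."""
    Xs = [[row[c] for c in cols] for row in X]
    k = len(cols)
    A = [[sum(r[i] * r[j] for r in Xs) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(Xs, y)) for i in range(k)]
    w = solve(A, b)
    preds = [sum(wi * xi for wi, xi in zip(w, r)) for r in Xs]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def greedy_select(X, y, n_select):
    """Add, one at a time, the feature that most reduces the ridge error."""
    chosen, remaining = [], list(range(len(X[0])))
    for _ in range(n_select):
        best = min(remaining, key=lambda c: ridge_error(X, y, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

X = [[1.0, 2.0, 0.5], [0.0, 4.0, 1.0], [1.0, 6.0, 0.2], [0.0, 8.0, 0.9]]
y = [2.0, 4.0, 6.0, 8.0]
print(greedy_select(X, y, 2))   # column 1 (the informative feature) comes first
```

Naively, each greedy step refits a model per candidate; greedy RLS's contribution is making each such update cheap enough to scale to hundreds of thousands of variants.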
Dense mesh sampling for video-based facial animation
NASA Astrophysics Data System (ADS)
Peszor, Damian; Wojciechowska, Marzena
2016-06-01
The paper describes an approach for the selection of feature points on a three-dimensional triangle mesh obtained using various techniques from video footage. This approach has a dual purpose. First, it allows minimizing the data stored for facial animation, so that instead of storing the position of each vertex in each frame, one stores only a small subset of vertices per frame and calculates the positions of the others from the subset. The second purpose is to select feature points that can be used for anthropometry-based retargeting of recorded mimicry to another model, with a sampling density beyond that achievable with marker-based performance capture techniques. The developed approach was successfully tested on artificial models, models constructed using a structured light scanner, and models constructed from video footage using stereophotogrammetry.
Kraschnewski, Jennifer L; Keyserling, Thomas C; Bangdiwala, Shrikant I; Gizlice, Ziya; Garcia, Beverly A; Johnston, Larry F; Gustafson, Alison; Petrovic, Lindsay; Glasgow, Russell E; Samuel-Hodge, Carmen D
2010-01-01
Studies of type 2 translation, the adaption of evidence-based interventions to real-world settings, should include representative study sites and staff to improve external validity. Sites for such studies are, however, often selected by convenience sampling, which limits generalizability. We used an optimized probability sampling protocol to select an unbiased, representative sample of study sites to prepare for a randomized trial of a weight loss intervention. We invited North Carolina health departments within 200 miles of the research center to participate (N = 81). Of the 43 health departments that were eligible, 30 were interested in participating. To select a representative and feasible sample of 6 health departments that met inclusion criteria, we generated all combinations of 6 from the 30 health departments that were eligible and interested. From the subset of combinations that met inclusion criteria, we selected 1 at random. Of 593,775 possible combinations of 6 counties, 15,177 (3%) met inclusion criteria. Sites in the selected subset were similar to all eligible sites in terms of health department characteristics and county demographics. Optimized probability sampling improved generalizability by ensuring an unbiased and representative sample of study sites.
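The selection protocol maps directly onto a short enumeration, sketched below with an invented inclusion criterion; the binomial count reproduces the 593,775 combinations reported in the abstract.

```python
import itertools
import math
import random

def sample_sites(sites, k, meets_criteria, seed=0):
    """Enumerate all k-site subsets, filter by criteria, pick one at random."""
    eligible = [c for c in itertools.combinations(sites, k) if meets_criteria(c)]
    return random.Random(seed).choice(eligible), len(eligible)

# 30 interested health departments, 6 selected per study:
print(math.comb(30, 6))   # 593775, matching the abstract

# Toy criterion (invented for illustration): combined site "size" must reach 20.
chosen, n_eligible = sample_sites(range(10), 3, lambda c: sum(c) >= 20)
print(chosen, n_eligible)
```

Because every criteria-meeting combination is equally likely to be drawn, the selected subset is an unbiased sample of the feasible study designs.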
NASA Astrophysics Data System (ADS)
Borgelt, Christian
In clustering we often face the situation that only a subset of the available attributes is relevant for forming clusters, even though this may not be known beforehand. In such cases it is desirable to have a clustering algorithm that automatically weights attributes or even selects a proper subset. In this paper I study such an approach for fuzzy clustering, which is based on the idea to transfer an alternative to the fuzzifier (Klawonn and Höppner, What is fuzzy about fuzzy clustering? Understanding and improving the concept of the fuzzifier, In: Proc. 5th Int. Symp. on Intelligent Data Analysis, 254-264, Springer, Berlin, 2003) to attribute weighting fuzzy clustering (Keller and Klawonn, Int J Uncertain Fuzziness Knowl Based Syst 8:735-746, 2000). In addition, by reformulating Gustafson-Kessel fuzzy clustering, a scheme for weighting and selecting principal axes can be obtained. While in Borgelt (Feature weighting and feature selection in fuzzy clustering, In: Proc. 17th IEEE Int. Conf. on Fuzzy Systems, IEEE Press, Piscataway, NJ, 2008) I already presented such an approach for a global selection of attributes and principal axes, this paper extends it to a cluster-specific selection, thus arriving at a fuzzy subspace clustering algorithm (Parsons, Haque, and Liu, 2004).
Hypergraph Based Feature Selection Technique for Medical Diagnosis.
Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar
2016-11-01
The impact of internet and information systems across various domains has resulted in the substantial generation of multidimensional datasets. The use of data mining and knowledge discovery techniques to extract the information contained in multidimensional datasets plays a significant role in exploiting the full benefit they provide. The presence of a large number of features in high-dimensional datasets incurs a high computational cost in terms of computing power and time. Hence, feature selection techniques are commonly used to build robust machine learning models by selecting a subset of relevant features that projects the maximal information content of the original dataset. In this paper, a novel Rough Set based K-Helly feature selection technique (RSKHT), which hybridizes Rough Set Theory (RST) and the K-Helly property of hypergraph representation, was designed to identify the optimal feature subset or reduct for medical diagnostic applications. Experiments carried out using medical datasets from the UCI repository demonstrate the dominance of RSKHT over other feature selection techniques with respect to reduct size, classification accuracy and time complexity. The performance of RSKHT was validated using the WEKA tool, showing that RSKHT is computationally attractive and flexible over massive datasets.
The proposal of architecture for chemical splitting to optimize QSAR models for aquatic toxicity.
Colombo, Andrea; Benfenati, Emilio; Karelson, Mati; Maran, Uko
2008-06-01
One of the challenges in the field of quantitative structure-activity relationship (QSAR) analysis is the correct assignment of a chemical compound to an appropriate model for the prediction of activity. Thus, in previous studies, compounds have been divided into distinct groups according to their mode of action or chemical class. In the current study, theoretical molecular descriptors were used to divide 568 organic substances, with toxicity measured as the 96-h lethal median concentration for the fathead minnow (Pimephales promelas), into subsets. Simple constitutional descriptors, such as the number of aliphatic and aromatic rings, and a quantum chemical descriptor, the maximum bond order of a carbon atom, divide the compounds into nine subsets. For each subset of compounds, automatic forward selection of descriptors was applied to construct QSAR models. Significant correlations were achieved for each subset of chemicals, and all models were validated with the leave-one-out internal validation procedure (R(2)(cv) approximately 0.80). The results encourage consideration of this alternative way of predicting toxicity using QSAR subset models, without direct reference to the mechanism of toxic action or the traditional chemical classification.
Liu, Aiming; Liu, Quan; Ai, Qingsong; Xie, Yi; Chen, Anqi
2017-01-01
Motor Imagery (MI) electroencephalography (EEG) is widely studied for its non-invasiveness, easy availability, portability, and high temporal resolution. As for MI EEG signal processing, the high dimensions of features represent a research challenge. It is necessary to eliminate redundant features, which not only create an additional overhead of managing the space complexity, but also might include outliers, thereby reducing classification accuracy. The firefly algorithm (FA) can adaptively select the best subset of features, and improve classification accuracy. However, the FA is easily entrapped in a local optimum. To solve this problem, this paper proposes a method of combining the firefly algorithm and learning automata (LA) to optimize feature selection for motor imagery EEG. We employed a method of combining common spatial pattern (CSP) and local characteristic-scale decomposition (LCD) algorithms to obtain a high dimensional feature set, and classified it by using the spectral regression discriminant analysis (SRDA) classifier. Both the fourth brain–computer interface competition data and real-time data acquired in our designed experiments were used to verify the validity of the proposed method. Compared with genetic and adaptive weight particle swarm optimization algorithms, the experimental results show that our proposed method effectively eliminates redundant features, and improves the classification accuracy of MI EEG signals. In addition, a real-time brain–computer interface system was implemented to verify the feasibility of our proposed methods being applied in practical brain–computer interface systems. PMID:29117100
Selected-node stochastic simulation algorithm
NASA Astrophysics Data System (ADS)
Duso, Lorenzo; Zechner, Christoph
2018-04-01
Stochastic simulations of biochemical networks are of vital importance for understanding complex dynamics in cells and tissues. However, existing methods to perform such simulations are associated with computational difficulties, and addressing them remains a daunting challenge. Here we introduce the selected-node stochastic simulation algorithm (snSSA), which allows us to exclusively simulate an arbitrary, selected subset of molecular species of a possibly large and complex reaction network. The algorithm is based on an analytical elimination of chemical species, thereby avoiding explicit simulation of the associated chemical events. These species are instead described continuously in terms of statistical moments derived from a stochastic filtering equation, resulting in a substantial speedup when compared to Gillespie's stochastic simulation algorithm (SSA). Moreover, we show that statistics obtained via snSSA profit from a variance reduction, which can significantly lower the number of Monte Carlo samples needed to achieve a certain performance. We demonstrate the algorithm using several biological case studies for which the simulation time could be reduced by orders of magnitude.
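For context, the baseline that snSSA accelerates, Gillespie's SSA, can be sketched for a simple birth-death network; the snSSA's moment-based elimination of species is not reproduced here.

```python
import random

def ssa_birth_death(birth, death, x0, t_end, rng):
    """Gillespie SSA for X -> X+1 at rate `birth`, X -> X-1 at rate `death * X`.
    Returns the copy number at time t_end."""
    t, x = 0.0, x0
    while True:
        a1, a2 = birth, death * x            # reaction propensities
        total = a1 + a2
        if total == 0:
            return x
        t += rng.expovariate(total)          # exponential waiting time
        if t > t_end:
            return x
        x += 1 if rng.random() < a1 / total else -1

rng = random.Random(1)
samples = [ssa_birth_death(10.0, 1.0, 0, 20.0, rng) for _ in range(200)]
print(sum(samples) / len(samples))           # stationary mean is birth/death = 10
```

Every single reaction event is simulated explicitly, which is exactly the per-event cost that snSSA avoids for the species it marginalizes out analytically.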
Gorla, Suresh Kumar; Kavitha, Mandapati; Zhang, Minjia; Liu, Xiaoping; Sharling, Lisa; Gollapalli, Deviprasad R.; Striepen, Boris; Hedstrom, Lizbeth; Cuny, Gregory D.
2012-01-01
Cryptosporidium parvum and related species are zoonotic intracellular parasites of the intestine. Cryptosporidium is a leading cause of diarrhea in small children around the world. Infection can cause severe pathology in children and immunocompromised patients. This waterborne parasite is resistant to common methods of water treatment and is therefore a prominent threat to drinking and recreational water even in countries with strong water safety systems. The drugs currently used to combat these organisms are ineffective. Genomic analysis revealed that the parasite relies solely on inosine-5′-monophosphate dehydrogenase (IMPDH) for the biosynthesis of guanine nucleotides. Herein, we report selective urea-based inhibitors of C. parvum IMPDH (CpIMPDH) identified by high-throughput screening. We performed a SAR study of these inhibitors, with some analogues exhibiting high potency (IC50 < 2 nM) against CpIMPDH, excellent selectivity (>1000-fold versus human IMPDH type 2), and good stability in mouse liver microsomes. A subset of inhibitors also displayed potent antiparasitic activity in a Toxoplasma gondii model. PMID:22950983
Efficient association study design via power-optimized tag SNP selection
Han, Buhm; Kang, Hyun Min; Seo, Myeong Seong; Zaitlen, Noah; Eskin, Eleazar
2008-01-01
Discovering statistical correlation between causal genetic variation and clinical traits through association studies is an important method for identifying the genetic basis of human diseases. Since fully resequencing a cohort is prohibitively costly, genetic association studies take advantage of local correlation structure (or linkage disequilibrium) between single nucleotide polymorphisms (SNPs) by selecting a subset of SNPs to be genotyped (tag SNPs). While many current association studies are performed using commercially available high-throughput genotyping products that define a set of tag SNPs, choosing tag SNPs remains an important problem for both custom follow-up studies as well as designing the high-throughput genotyping products themselves. The most widely used tag SNP selection method optimizes over the correlation between SNPs (r2). However, tag SNPs chosen based on an r2 criterion do not necessarily maximize the statistical power of an association study. We propose a study design framework that chooses SNPs to maximize power and efficiently measures the power through empirical simulation. Empirical results based on the HapMap data show that our method gains considerable power over a widely used r2-based method, or equivalently reduces the number of tag SNPs required to attain the desired power of a study. Our power-optimized 100k whole genome tag set provides equivalent power to the Affymetrix 500k chip for the CEU population. For the design of custom follow-up studies, our method provides up to twice the power increase using the same number of tag SNPs as r2-based methods. Our method is publicly available via web server at http://design.cs.ucla.edu. PMID:18702637
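The widely used r2-based baseline that the power-optimized method is compared against can be sketched as a greedy set-cover over pairwise r2 (toy genotype vectors, invented threshold; the power-maximizing selection itself is not reproduced here):

```python
def r2(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov * cov / (va * vb) if va and vb else 0.0

def greedy_tags(snps, threshold=0.8):
    """snps: dict name -> genotype vector (0/1/2 per individual).
    Greedily pick tags until every SNP has some tag with r^2 >= threshold."""
    uncovered, tags = set(snps), []
    while uncovered:
        best = max(uncovered, key=lambda s: sum(
            r2(snps[s], snps[u]) >= threshold for u in uncovered))
        tags.append(best)
        uncovered -= {u for u in uncovered
                      if r2(snps[best], snps[u]) >= threshold}
    return tags

snps = {
    "s1": [0, 1, 2, 0, 1],
    "s2": [0, 1, 2, 0, 1],   # perfect proxy for s1
    "s3": [1, 1, 0, 2, 0],   # only weakly correlated with s1/s2
}
print(greedy_tags(snps))     # two tags: one of s1/s2, plus s3
```

The paper's point is that covering every SNP at some r2 threshold is not the same as maximizing study power, which is why it scores candidate tag sets by simulated power instead.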
Enantioselectivity in Candida antarctica lipase B: A molecular dynamics study
Raza, Sami; Fransson, Linda; Hult, Karl
2001-01-01
A major problem in predicting the enantioselectivity of an enzyme toward substrate molecules is that even high selectivity toward one substrate enantiomer over the other corresponds to a very small difference in free energy. However, total free energies in enzyme-substrate systems are very large and fluctuate significantly because of general protein motion. Candida antarctica lipase B (CALB), a serine hydrolase, displays enantioselectivity toward secondary alcohols. Here, we present a modeling study where the aim has been to develop a molecular dynamics-based methodology for the prediction of enantioselectivity in CALB. The substrates modeled (seven in total) were 3-methyl-2-butanol with various aliphatic carboxylic acids and also 2-butanol, as well as 3,3-dimethyl-2-butanol with octanoic acid. The tetrahedral reaction intermediate was used as a model of the transition state. Investigative analyses were performed on ensembles of nonminimized structures and focused on the potential energies of a number of subsets within the modeled systems to determine which specific regions are important for the prediction of enantioselectivity. One category of subset was based on atoms that make up the core structural elements of the transition state. We considered that a more favorable energetic conformation of such a subset should relate to a greater likelihood for catalysis to occur, thus reflecting higher selectivity. The results of this study conveyed that the use of this type of subset was viable for the analysis of structural ensembles and yielded good predictions of enantioselectivity. PMID:11266619
Creating a non-linear total sediment load formula using polynomial best subset regression model
NASA Astrophysics Data System (ADS)
Okcu, Davut; Pektas, Ali Osman; Uyumaz, Ali
2016-08-01
The aim of this study is to derive a new total sediment load formula that is more accurate and has fewer application constraints than the well-known formulae in the literature. The five best-known stream power concept sediment formulae approved by the ASCE are used for benchmarking on a wide range of datasets that includes both field and flume (lab) observations. The dimensionless parameters of these widely used formulae are used as inputs in a new regression approach, called Polynomial Best Subset Regression (PBSR) analysis. The aim of PBSR analysis is to fit and test all possible combinations of the input variables and to select the best subset. All input variables, together with their second and third powers, are included in the regression to test possible relations between the explanatory variables and the dependent variable. The best subset is selected with a multistep approach that depends on significance values as well as the multicollinearity degrees of the inputs. The new formula is compared to the others on a holdout dataset, and detailed performance investigations are conducted for the field and lab datasets within this holdout data. Different goodness-of-fit statistics are used, as they represent different perspectives on model accuracy. After the detailed comparisons, we identified the most accurate equation, which is also applicable to both flume and river data. In particular, on the field dataset the prediction performance of the proposed formula outperformed the benchmark formulations.
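The best-subset idea (enumerate every combination of the input variables and their second and third powers, fit OLS, keep the best subset) can be sketched as follows. The adjusted-R^2 scoring criterion and the term limit are illustrative assumptions; the paper's multistep significance and multicollinearity screening is not reproduced here.

```python
import itertools
import numpy as np

def polynomial_best_subset(X, y, max_degree=3, max_terms=3):
    """Fit OLS on every subset of polynomial terms (each input variable
    raised to powers 1..max_degree) and return the subset with the best
    adjusted R^2, together with that score."""
    n, p = X.shape
    # candidate terms: (variable index, power)
    terms = [(j, d) for j in range(p) for d in range(1, max_degree + 1)]
    cols = {t: X[:, t[0]] ** t[1] for t in terms}
    ss_tot = ((y - y.mean()) ** 2).sum()
    best_subset, best_adj = None, -np.inf
    for k in range(1, max_terms + 1):
        for subset in itertools.combinations(terms, k):
            A = np.column_stack([np.ones(n)] + [cols[t] for t in subset])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            r2 = 1.0 - (resid @ resid) / ss_tot
            # adjusted R^2 penalises subset size k
            adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
            if adj > best_adj:
                best_subset, best_adj = subset, adj
    return best_subset, best_adj
```

For example, with y generated from the square of the first input, the search recovers the term (variable 0, power 2) in the winning subset.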
Atlas ranking and selection for automatic segmentation of the esophagus from CT scans
NASA Astrophysics Data System (ADS)
Yang, Jinzhong; Haas, Benjamin; Fang, Raymond; Beadle, Beth M.; Garden, Adam S.; Liao, Zhongxing; Zhang, Lifei; Balter, Peter; Court, Laurence
2017-12-01
In radiation treatment planning, the esophagus is an important organ-at-risk that should be spared in patients with head and neck cancer or thoracic cancer who undergo intensity-modulated radiation therapy. However, automatic segmentation of the esophagus from CT scans is extremely challenging because of the structure’s inconsistent intensity, low contrast against the surrounding tissues, complex and variable shape and location, and random air bubbles. The goal of this study is to develop an online atlas selection approach to choose a subset of optimal atlases for multi-atlas segmentation to delineate the esophagus automatically. We performed atlas selection in two phases. In the first phase, we used the correlation coefficient of the image content in a cubic region between each atlas and the new image to evaluate their similarity and to rank the atlases in an atlas pool. A subset of atlases based on this ranking was selected, and deformable image registration was performed to generate deformed contours and deformed images in the new image space. In the second phase of atlas selection, we used Kullback-Leibler divergence to measure the similarity of local-intensity histograms between the new image and each of the deformed images, and the measurements were used to rank the previously selected atlases. Deformed contours were overlapped sequentially, from the most to the least similar, and the overlap ratio was examined. We further identified a subset of optimal atlases by analyzing the variation of the overlap ratio versus the number of atlases. The deformed contours from these optimal atlases were fused together using a modified simultaneous truth and performance level estimation algorithm to produce the final segmentation. The approach was validated with promising results using both internal data sets (21 head and neck cancer patients and 15 thoracic cancer patients) and external data sets (30 thoracic patients).
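The two similarity measures used for atlas ranking (Pearson correlation over a cubic region, then Kullback-Leibler divergence between intensity histograms) can be sketched in a simplified form; the array shapes, ROI encoding, and histogram bin count below are assumptions, not the paper's settings.

```python
import numpy as np

def rank_atlases_by_correlation(target, atlases, roi):
    """Phase 1: rank atlases by the Pearson correlation of intensities
    inside a cubic region of interest, best first.
    `roi` is a list of (start, stop) index pairs, one per axis."""
    sl = tuple(slice(a, b) for a, b in roi)
    t = target[sl].ravel().astype(float)
    scores = [np.corrcoef(t, img[sl].ravel().astype(float))[0, 1]
              for img in atlases]
    return np.argsort(scores)[::-1], np.asarray(scores)

def kl_divergence(p_img, q_img, bins=32):
    """Phase 2 similarity: KL divergence between the intensity
    histograms of two images (0 means identical histograms)."""
    lo = min(p_img.min(), q_img.min())
    hi = max(p_img.max(), q_img.max())
    p, _ = np.histogram(p_img, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(q_img, bins=bins, range=(lo, hi), density=True)
    p = p + 1e-9          # avoid log(0)
    q = q + 1e-9
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

An atlas nearly identical to the target ranks first in phase 1 and has near-zero KL divergence in phase 2.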
NASA Astrophysics Data System (ADS)
Tang, Jian; Qiao, Junfei; Wu, ZhiWei; Chai, Tianyou; Zhang, Jian; Yu, Wen
2018-01-01
Frequency spectral data of mechanical vibration and acoustic signals relate to difficult-to-measure production quality and quantity parameters of complex industrial processes. A selective ensemble (SEN) algorithm can be used to build a soft sensor model of these process parameters by selectively fusing valued information from different perspectives. However, a combination of several optimized ensemble sub-models with SEN cannot guarantee the best prediction model. In this study, we use several techniques to construct a data-driven model of industrial process parameters from mechanical vibration and acoustic frequency spectra, based on the selective fusion of multi-condition samples and multi-source features. A multi-layer SEN (MLSEN) strategy is used to simulate the domain expert's cognitive process. A genetic algorithm and kernel partial least squares are used to construct the inside-layer SEN sub-model for each mechanical vibration and acoustic frequency spectral feature subset. Branch-and-bound and adaptive weighted fusion algorithms are integrated to select and combine the outputs of the inside-layer SEN sub-models, from which the outside-layer SEN is constructed. Thus, "sub-sampling training examples"-based and "manipulating input features"-based ensemble construction methods are integrated, realizing selective information fusion based on multi-condition history samples and multi-source input features. This novel approach is applied to a laboratory-scale ball mill grinding process. A comparison with other methods indicates that the proposed MLSEN approach effectively models mechanical vibration and acoustic signals.
Inference of Spatio-Temporal Functions Over Graphs via Multikernel Kriged Kalman Filtering
NASA Astrophysics Data System (ADS)
Ioannidis, Vassilis N.; Romero, Daniel; Giannakis, Georgios B.
2018-06-01
Inference of space-time varying signals on graphs emerges naturally in a plethora of network science related applications. A frequently encountered challenge pertains to reconstructing such dynamic processes, given their values over a subset of vertices and time instants. The present paper develops a graph-aware kernel-based kriged Kalman filter that accounts for the spatio-temporal variations, and offers efficient online reconstruction, even for dynamically evolving network topologies. The kernel-based learning framework bypasses the need for statistical information by capitalizing on the smoothness that graph signals exhibit with respect to the underlying graph. To address the challenge of selecting the appropriate kernel, the proposed filter is combined with a multi-kernel selection module. Such a data-driven method selects a kernel attuned to the signal dynamics on-the-fly within the linear span of a pre-selected dictionary. The novel multi-kernel learning algorithm exploits the eigenstructure of Laplacian kernel matrices to reduce computational complexity. Numerical tests with synthetic and real data demonstrate the superior reconstruction performance of the novel approach relative to state-of-the-art alternatives.
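A stripped-down, batch (non-Kalman) version of the kernel-based reconstruction step can be sketched as below: a graph signal is estimated on all vertices from a subset of observed vertices with a regularised Laplacian kernel. The kernel parameterisation and the ridge regulariser are illustrative assumptions and do not reproduce the paper's online multi-kernel filter.

```python
import numpy as np

def laplacian(A):
    """Combinatorial graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

def kernel_reconstruct(A, observed, values, sigma2=1.0, noise=1e-2):
    """Kernel ridge regression on a graph: reconstruct a smooth signal
    on every vertex from values on the `observed` vertex subset, using
    the (regularised) Laplacian kernel K = (L + sigma2*I)^-1."""
    L = laplacian(A)
    K = np.linalg.inv(L + sigma2 * np.eye(len(A)))
    Ko = K[np.ix_(observed, observed)]
    alpha = np.linalg.solve(O := Ko + noise * np.eye(len(observed)), values) if False else \
        np.linalg.solve(Ko + noise * np.eye(len(observed)), values)
    return K[:, observed] @ alpha
```

On a 5-node chain graph with a smooth signal observed at three vertices, the reconstruction nearly interpolates the observed values and fills in the rest.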
Xia, Junfeng; Yue, Zhenyu; Di, Yunqiang; Zhu, Xiaolei; Zheng, Chun-Hou
2016-01-01
The identification of hot spots, a small subset of protein interface residues that accounts for the majority of binding free energy, is becoming increasingly important for research into drug design and cancer development. Building on our previous methods (APIS and KFC2), here we propose a novel hot spot prediction method. For each potential hot spot residue, we first constructed a wide variety of 108 sequence, structural, and neighborhood features, including conventional ones and a new one (pseudo hydrophobicity) exploited in this study. We then selected the 3 top-ranking features that contribute the most to the classification through a two-step feature selection process consisting of the minimal-redundancy-maximal-relevance (mRMR) algorithm and an exhaustive search. We used support vector machines to build our final prediction model. When tested on an independent test set, our method showed the highest F1-score (0.70) and MCC (0.46) compared with the existing state-of-the-art hot spot prediction methods. Our results indicate that these features are more effective than the conventional features considered previously, and that the combination of our and traditional features may support the creation of a discriminative feature set for efficient prediction of hot spots in protein interfaces. PMID:26934646
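The two-step selection (an mRMR-style ranking followed by an exhaustive search over small feature subsets) can be sketched roughly as below. The correlation-based relevance and redundancy scores and the nearest-centroid stand-in for the SVM are assumptions made for illustration, not the paper's exact criteria.

```python
import numpy as np
from itertools import combinations

def mrmr_rank(X, y, n_keep=5):
    """Correlation-based mRMR-style ranking: relevance is |corr(x_j, y)|,
    redundancy is the mean |corr| with features already selected."""
    n, p = X.shape
    rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
    C = np.abs(np.corrcoef(X, rowvar=False))
    chosen = [int(np.argmax(rel))]
    while len(chosen) < min(n_keep, p):
        rest = [j for j in range(p) if j not in chosen]
        score = [rel[j] - C[j, chosen].mean() for j in rest]
        chosen.append(rest[int(np.argmax(score))])
    return chosen

def best_triple(X, y, candidates):
    """Exhaustive search over all 3-feature subsets of the candidates,
    scored with a nearest-centroid classifier (a simple stand-in for
    the SVM used in the paper)."""
    best, best_acc = None, -1.0
    for trio in combinations(candidates, 3):
        Z = X[:, trio]
        mu0, mu1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
        pred = (np.linalg.norm(Z - mu1, axis=1)
                < np.linalg.norm(Z - mu0, axis=1)).astype(int)
        acc = float((pred == y).mean())
        if acc > best_acc:
            best, best_acc = trio, acc
    return best, best_acc
```

On synthetic data with three informative features, the ranking keeps them among the candidates and the exhaustive search picks a well-separating triple.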
Turley, James P; Johnson, Todd R; Smith, Danielle Paige; Zhang, Jaijie; Brixey, Juliana J
2006-04-01
Use of medical devices often directly contributes to medical errors. Because it is difficult or impossible to change the design of existing devices, the best opportunity for improving medical device safety is during the purchasing process. However, most hospital personnel are not familiar with the usability evaluation methods designed to identify aspects of a user interface that do not support intuitive and safe use. A review of medical device operating manuals is proposed as a more practical method of usability evaluation. Operating manuals for five volumetric infusion pumps from three manufacturers were selected for this study (January-April 2003). Each manual's safety message content was evaluated to determine whether the message indicated a device design characteristic that violated known usability principles (heuristics) or indicated a violation of an affordance of the device. "Minimize memory load," with 65 violations, was the heuristic violated most frequently across pumps. Variations between pumps, including the frequency and severity of violations for each, were noted. Results suggest that manual review can provide a proxy for heuristic evaluation of the actual medical device. This method, intended to be a component of prepurchasing evaluation, can complement more formal usability evaluation methods and be used to select a subset of devices for more extensive and formal testing.
Wang, Jianing; Liu, Yuan; Noble, Jack H; Dawant, Benoit M
2017-10-01
Medical image registration establishes a correspondence between images of biological structures, and it is at the core of many applications. Commonly used deformable image registration methods depend on a good preregistration initialization. We develop a learning-based method to automatically find a set of robust landmarks in three-dimensional MR image volumes of the head. These landmarks are then used to compute a thin plate spline-based initialization transformation. The process involves two steps: (1) identifying a set of landmarks that can be reliably localized in the images and (2) selecting among them the subset that leads to a good initial transformation. To validate our method, we use it to initialize five well-established deformable registration algorithms that are subsequently used to register an atlas to MR images of the head. We compare our proposed initialization method with a standard approach that involves estimating an affine transformation with an intensity-based approach. We show that for all five registration algorithms the final registration results are statistically better when they are initialized with the method that we propose than when a standard approach is used. The technique that we propose is generic and could be used to initialize nonrigid registration algorithms for other applications.
Federal Register 2010, 2011, 2012, 2013, 2014
2012-07-26
... Proposed Rule Change Amending NYSE Arca Equities Rule 7.31(h) To Add a PL Select Order Type July 20, 2012...(h) to add a PL Select Order type. The proposed rule change was published for comment in the Federal... security at a specified, undisplayed price. The PL Select Order would be a subset of the PL Order that...
Towards semi-automatic rock mass discontinuity orientation and set analysis from 3D point clouds
NASA Astrophysics Data System (ADS)
Guo, Jiateng; Liu, Shanjun; Zhang, Peina; Wu, Lixin; Zhou, Wenhui; Yu, Yinan
2017-06-01
Obtaining accurate information on rock mass discontinuities for deformation analysis and the evaluation of rock mass stability is important. Obtaining measurements for high and steep zones with the traditional compass method is difficult. Photogrammetry, three-dimensional (3D) laser scanning and other remote sensing methods have gradually become mainstream methods. In this study, a method that is based on a 3D point cloud is proposed to semi-automatically extract rock mass structural plane information. The original data are pre-treated prior to segmentation by removing outlier points. The next step is to segment the point cloud into different point subsets. Various parameters, such as the normal, dip/direction and dip, can be calculated for each point subset after obtaining the equation of the best fit plane for the relevant point subset. A cluster analysis (a point subset that satisfies some conditions and thus forms a cluster) is performed based on the normal vectors by introducing the firefly algorithm (FA) and the fuzzy c-means (FCM) algorithm. Finally, clusters that belong to the same discontinuity sets are merged and coloured for visualization purposes. A prototype system is developed based on this method to extract the points of the rock discontinuity from a 3D point cloud. A comparison with existing software shows that this method is feasible. This method can provide a reference for rock mechanics, 3D geological modelling and other related fields.
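The clustering step above (grouping facet normal vectors with fuzzy c-means) can be sketched with a plain FCM implementation. The firefly-algorithm initialisation used in the paper is not reproduced; random membership initialisation and the fuzzifier value are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means: returns the membership matrix U (n x c)
    and the cluster centres (c x dim)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.dirichlet(np.ones(c), size=n)          # random fuzzy memberships
    for _ in range(n_iter):
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        # distances from every point to every centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-9
        # standard FCM membership update: u_ik proportional to d^(-2/(m-1))
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)
    return U, centres
```

Applied to unit normal vectors drawn around two distinct orientations, the hard labels (argmax of U) separate the two discontinuity sets.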
A small number of candidate gene SNPs reveal continental ancestry in African Americans
KODAMAN, NURI; ALDRICH, MELINDA C.; SMITH, JEFFREY R.; SIGNORELLO, LISA B.; BRADLEY, KEVIN; BREYER, JOAN; COHEN, SARAH S.; LONG, JIRONG; CAI, QIUYIN; GILES, JUSTIN; BUSH, WILLIAM S.; BLOT, WILLIAM J.; MATTHEWS, CHARLES E.; WILLIAMS, SCOTT M.
2013-01-01
Using genetic data from an obesity candidate gene study of self-reported African Americans and European Americans, we investigated the number of Ancestry Informative Markers (AIMs) and candidate gene SNPs necessary to infer continental ancestry. Proportions of African and European ancestry were assessed with STRUCTURE (K = 2), using 276 AIMs. These reference values were compared to estimates derived using 120, 60, 30, and 15 SNP subsets randomly chosen from the 276 AIMs and from 1144 SNPs in 44 candidate genes. All subsets generated estimates of ancestry consistent with the reference estimates, with mean correlations greater than 0.99 for all subsets of AIMs, and mean correlations of 0.99 ± 0.003, 0.98 ± 0.01, 0.93 ± 0.03, and 0.81 ± 0.11 for subsets of 120, 60, 30, and 15 candidate gene SNPs, respectively. Among African Americans, the median absolute difference from reference African ancestry values ranged from 0.01 to 0.03 for the four AIMs subsets and from 0.03 to 0.09 for the four candidate gene SNP subsets. Furthermore, YRI/CEU Fst values provided a metric to predict the performance of candidate gene SNPs. Our results demonstrate that a small number of SNPs randomly selected from candidate genes can be used to reliably estimate admixture proportions in African Americans. PMID:23278390
Zhang, Shanxin; Zhou, Zhiping; Chen, Xinmeng; Hu, Yong; Yang, Lindong
2017-08-07
DNase I hypersensitive sites (DHSs) are accessible chromatin regions hypersensitive to cleavage by DNase I endonucleases. DHSs are indicative of cis-regulatory DNA elements (CREs), all of which play important roles in global gene expression regulation; recognizing DHSs in a genome therefore helps to discover CREs. To accelerate such investigations, cost-effective computational methods for identifying DHSs are an important complement to experiment, yet tools for identifying DHSs in plant genomes have been lacking. Here we present pDHS-SVM, a computational predictor to identify plant DHSs. To integrate global sequence-order information and local DNA properties, reverse-complement kmer features and dinucleotide-based auto covariance of DNA sequences were applied to construct the feature space. In this work, fifteen physical-chemical properties of dinucleotides were used and a Support Vector Machine (SVM) was employed. To further improve the performance of the predictor and extract an optimized subset of nucleotide physical-chemical properties informative for DHSs, a heuristic nucleotide physical-chemical property selection algorithm was introduced. With the optimized subset of properties, experimental results on Arabidopsis thaliana and rice (Oryza sativa) showed that pDHS-SVM could achieve accuracies of up to 87.00% and 85.79%, respectively. The results indicate the effectiveness of the proposed method for predicting DHSs. Furthermore, pDHS-SVM could provide a helpful complement for predicting CREs in plant genomes. Our implementation of pDHS-SVM is freely available as source code at https://github.com/shanxinzhang/pDHS-SVM. Copyright © 2017 Elsevier Ltd. All rights reserved.
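The reverse-complement kmer feature mentioned above pools each kmer with its reverse complement, so a sequence and its reverse complement yield identical feature vectors. A minimal sketch follows; the kmer length and the lexicographic canonicalisation rule are illustrative choices, not necessarily the paper's.

```python
from itertools import product

def revcomp(s):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(s))

def revcomp_kmer_features(seq, k=2):
    """Normalised k-mer counts in which each k-mer is pooled with its
    reverse complement, so a sequence and its reverse complement map
    to the same feature vector."""
    # canonical form: lexicographic min of a kmer and its reverse complement
    canon = sorted({min("".join(p), revcomp("".join(p)))
                    for p in product("ACGT", repeat=k)})
    index = {c: i for i, c in enumerate(canon)}
    counts = [0] * len(canon)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[index[min(kmer, revcomp(kmer))]] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

For k = 2 the 16 dinucleotides collapse to 10 canonical features (4 are their own reverse complement), and the vector is strand-symmetric by construction.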
NASA Astrophysics Data System (ADS)
Li, Yun; Zhang, Ji; Li, Tao; Liu, Honggao; Li, Jieqing; Wang, Yuanzhong
2017-04-01
In this work, a data fusion strategy combining Fourier transform mid-infrared (FT-MIR) spectroscopy and inductively coupled plasma-atomic emission spectrometry (ICP-AES) was used with a Support Vector Machine (SVM) to determine the geographic origin of Boletus edulis collected from nine regions of Yunnan Province in China. First, competitive adaptive reweighted sampling (CARS) was used to select an optimal combination of key wavenumbers of the second-derivative FT-MIR spectra, and thirteen elements were ranked by variable importance in projection (VIP) scores. Second, thirteen multi-element subsets with the best VIP scores were generated, and each subset was fused with FT-MIR. Finally, the classification models were established by SVM, and the combination of the parameters C and γ (gamma) of the SVM models was determined by grid search (GS) and by a genetic algorithm (GA). The results showed that both the GS-SVM and GA-SVM models achieved good performance based on the #9 subset, with prediction accuracies in the calibration and validation sets of 81.40% and 90.91%, respectively. In conclusion, the data fusion strategy of FT-MIR and ICP-AES coupled with SVM can be used as a reliable tool for accurate identification of B. edulis, and it provides a useful approach to the quality control of edible mushrooms.
Melzer, Nina; Wittenburg, Dörte; Repsilber, Dirk
2013-01-01
In this study, the benefit of metabolome-level analysis for predicting the genetic value of three traditional milk traits was investigated. Our proposed approach consists of three steps. First, milk metabolite profiles are used to predict three traditional milk traits of 1,305 Holstein cows; two regression methods, both enabling variable selection, are applied to identify important milk metabolites in this step. Second, the prediction of these important milk metabolites from single nucleotide polymorphisms (SNPs) enables the detection of SNPs with significant genetic effects. Finally, these SNPs are used to predict the milk traits. The observed precision of the predicted genetic values was compared to the results of the classical genotype-phenotype prediction using all SNPs or a reduced SNP subset (reduced classical approach). To enable a comparison between SNP subsets, a special invariable evaluation design was implemented. SNPs close to or within known quantitative trait loci (QTL) were determined, which enabled us to assess whether the detected important SNP subsets were enriched in these regions. The results show that our approach can achieve genetic value prediction while requiring less than 1% of the total number of (40,317) SNPs. Moreover, significantly more important SNPs in known QTL regions were detected using our approach compared with the reduced classical approach. In conclusion, our approach allows a deeper insight into the associations between the different levels of the genotype-phenotype map (genotype-metabolome, metabolome-phenotype, genotype-phenotype). PMID:23990900
Multi-atlas segmentation enables robust multi-contrast MRI spleen segmentation for splenomegaly
NASA Astrophysics Data System (ADS)
Huo, Yuankai; Liu, Jiaqi; Xu, Zhoubing; Harrigan, Robert L.; Assad, Albert; Abramson, Richard G.; Landman, Bennett A.
2017-02-01
Non-invasive spleen volume estimation is essential in detecting splenomegaly. Magnetic resonance imaging (MRI) has been used to facilitate splenomegaly diagnosis in vivo. However, achieving accurate spleen volume estimation from MR images is challenging given the great inter-subject variance of human abdomens and wide variety of clinical images/modalities. Multi-atlas segmentation has been shown to be a promising approach to handle heterogeneous data and difficult anatomical scenarios. In this paper, we propose to use multi-atlas segmentation frameworks for MRI spleen segmentation for splenomegaly. To the best of our knowledge, this is the first work that integrates multi-atlas segmentation for splenomegaly as seen on MRI. To address the particular concerns of spleen MRI, automated and novel semi-automated atlas selection approaches are introduced. The automated approach iteratively selects a subset of atlases using the selective and iterative method for performance level estimation (SIMPLE). To further control the outliers, semi-automated craniocaudal length based SIMPLE atlas selection (L-SIMPLE) is proposed to introduce a spatial prior that guides the iterative atlas selection. A dataset from a clinical trial containing 55 MRI volumes (28 T1 weighted and 27 T2 weighted) was used to evaluate different methods. Both automated and semi-automated methods achieved median DSC > 0.9. The outliers were alleviated by the L-SIMPLE (≈1 min manual effort per scan), which achieved 0.9713 Pearson correlation compared with the manual segmentation. The results demonstrated that the multi-atlas segmentation is able to achieve accurate spleen segmentation from the multi-contrast splenomegaly MRI scans.
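The SIMPLE-style selection loop (fuse the current candidate segmentations, score each against the fused estimate, drop the worst) can be sketched on binary masks as follows. The drop fraction, iteration count, and majority-vote fusion are simplifying assumptions; the paper uses a modified STAPLE for the final fusion.

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def simple_select(masks, drop_frac=0.25, n_iter=3):
    """SIMPLE-style iterative atlas selection: fuse the kept candidate
    segmentations by majority vote, score each against the fused
    estimate with Dice, and drop the worst performers each iteration."""
    keep = list(range(len(masks)))
    for _ in range(n_iter):
        if len(keep) <= 2:
            break
        fused = np.mean([masks[i] for i in keep], axis=0) >= 0.5
        scores = [dice(masks[i], fused) for i in keep]
        n_drop = max(1, int(drop_frac * len(keep)))
        order = np.argsort(scores)          # worst first
        keep = [keep[i] for i in order[n_drop:]]
    return sorted(keep)
```

With four consistent masks and one outlier, the outlier is eliminated in the first iteration because its Dice score against the majority-vote estimate is near zero.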
Pauci ex tanto numero: reduce redundancy in multi-model ensembles
NASA Astrophysics Data System (ADS)
Solazzo, E.; Riccio, A.; Kioutsioukis, I.; Galmarini, S.
2013-08-01
We explicitly address the fundamental issue of member diversity in multi-model ensembles. To date, no attempts in this direction have been documented within the air quality (AQ) community despite the extensive use of ensembles in this field. Common biases and redundancy are the two issues directly deriving from lack of independence, undermining the significance of a multi-model ensemble, and are the subject of this study. Shared, dependent biases among models do not cancel out but will instead determine a biased ensemble. Redundancy derives from having too large a portion of common variance among the members of the ensemble, producing overconfidence in the predictions and underestimation of the uncertainty. The two issues of common biases and redundancy are analysed in detail using the AQMEII ensemble of AQ model results for four air pollutants in two European regions. We show that models share large portions of bias and variance, extending well beyond those induced by common inputs. We make use of several techniques to further show that subsets of models can explain the same amount of variance as the full ensemble with the advantage of being poorly correlated. Selecting the members for generating skilful, non-redundant ensembles from such subsets proved, however, non-trivial. We propose and discuss various methods of member selection and rate the ensemble performance they produce. In most cases, the full ensemble is outscored by the reduced ones. We conclude that, although independence of outputs may not always guarantee enhancement of scores (but this depends upon the skill being investigated), we discourage selecting the members of the ensemble simply on the basis of scores; that is, independence and skills need to be considered disjointly.
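One simple member-selection heuristic in the spirit of this discussion is to keep members that are minimally correlated with those already chosen. The sketch below is an illustrative assumption (the seeding rule and the average-correlation criterion are invented here) and is not one of the selection methods rated in the paper.

```python
import numpy as np

def select_diverse_members(runs, k):
    """Greedy pick of k ensemble members: seed with the member closest
    to the ensemble mean, then repeatedly add the member least
    correlated (on average, in absolute value) with those selected."""
    runs = np.asarray(runs, dtype=float)
    mean = runs.mean(axis=0)
    err = ((runs - mean) ** 2).sum(axis=1)
    chosen = [int(np.argmin(err))]
    C = np.corrcoef(runs)                  # member-by-member correlations
    while len(chosen) < k:
        rest = [i for i in range(len(runs)) if i not in chosen]
        avg_corr = [np.mean([abs(C[i, j]) for j in chosen]) for i in rest]
        chosen.append(rest[int(np.argmin(avg_corr))])
    return sorted(chosen)
```

Given two nearly identical members and two independent ones, the greedy criterion never keeps both near-duplicates, which is exactly the redundancy the abstract warns about.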
Pauci ex tanto numero: reducing redundancy in multi-model ensembles
NASA Astrophysics Data System (ADS)
Solazzo, E.; Riccio, A.; Kioutsioukis, I.; Galmarini, S.
2013-02-01
We explicitly address the fundamental issue of member diversity in multi-model ensembles. To date, no attempts in this direction have been documented within the air quality (AQ) community, despite the extensive use of ensembles in this field. Common biases and redundancy are the two issues directly deriving from lack of independence, undermining the significance of a multi-model ensemble, and are the subject of this study. Shared biases among models will determine a biased ensemble, making it essential that the errors of the ensemble members be independent so that the biases can cancel out. Redundancy derives from having too large a portion of common variance among the members of the ensemble, producing overconfidence in the predictions and underestimation of the uncertainty. The two issues of common biases and redundancy are analysed in detail using the AQMEII ensemble of AQ model results for four air pollutants in two European regions. We show that models share large portions of bias and variance, extending well beyond those induced by common inputs. We make use of several techniques to further show that subsets of models can explain the same amount of variance as the full ensemble with the advantage of being poorly correlated. Selecting the members for generating skilful, non-redundant ensembles from such subsets proved, however, non-trivial. We propose and discuss various methods of member selection and rate the ensemble performance they produce. In most cases, the full ensemble is outscored by the reduced ones. We conclude that, although independence of outputs may not always guarantee enhancement of scores (this depends upon the skill being investigated), we discourage selecting the members of the ensemble simply on the basis of scores; that is, independence and skills need to be considered disjointly.
Ovarian phagocyte subsets and their distinct tissue distribution patterns.
Carlock, Colin; Wu, Jean; Zhou, Cindy; Ross, April; Adams, Henry; Lou, Yahuan
2013-01-01
Ovarian macrophages, which play critical roles in various ovarian events, are probably derived from multiple lineages. Thus, a systematic classification of their subsets is a necessary first step for determining their functions. Utilizing antibodies to five phagocyte markers, i.e. IA/IE (major histocompatibility complex class II), F4/80, CD11b (Mac-1), CD11c, and CD68, this study investigated subsets of ovarian phagocytes in mice. Three-color immunofluorescence and flow cytometry, together with morphological observation of isolated ovarian cells, demonstrated complicated phenotypes of ovarian phagocytes. Four macrophage subsets and one dendritic cell subset, in addition to many minor phagocyte subsets, were identified. A dendritic cell-like population with a unique phenotype of CD11c(high)IA/IE⁻F4/80⁻ was also frequently observed. A preliminary age-dependent study showed dramatic increases in IA/IE⁺ macrophages and IA/IE⁺ dendritic cells after puberty. Furthermore, immunofluorescence on ovarian sections showed that each subset displayed a distinct tissue distribution pattern, which may hint at its role in an ovarian function. In addition, partial isolation of an ovarian macrophage subset using CD11b antibodies was attempted. Establishment of this isolation method may provide a tool for more precise investigation of each subset's functions at the cellular and molecular levels.
Hybrid spread spectrum radio system
Smith, Stephen F.; Dress, William B.
2010-02-02
Systems and methods are described for hybrid spread spectrum radio systems. A method includes modulating a signal by utilizing a subset of bits from a pseudo-random code generator to control an amplification circuit that provides a gain to the signal. Another method includes: modulating a signal by utilizing a subset of bits from a pseudo-random code generator to control a fast hopping frequency synthesizer; and fast frequency hopping the signal with the fast hopping frequency synthesizer, wherein multiple frequency hops occur within a single data-bit time.
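The second claim above can be illustrated with a toy sketch, assuming a 16-bit LFSR as the pseudo-random code generator and a hypothetical 8-channel hop table (the patent itself does not specify these values):

```python
# Hypothetical 8-channel hop set (frequency values invented for illustration).
HOP_TABLE_MHZ = [902.0 + 0.5 * k for k in range(8)]

def lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR (polynomial x^16+x^14+x^13+x^11+1).
    Returns (output_bit, next_state)."""
    feedback = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return state & 1, (state >> 1) | (feedback << 15)

def hop_sequence(seed, n_hops, bits_per_hop=3):
    """Draw bits_per_hop bits from the PN generator per hop and use them to
    index the hop table -- the 'subset of bits' idea in the claims. In fast
    hopping, several of these hops occur within one data-bit time."""
    state, freqs = seed, []
    for _ in range(n_hops):
        idx = 0
        for _ in range(bits_per_hop):
            bit, state = lfsr16(state)
            idx = (idx << 1) | bit
        freqs.append(HOP_TABLE_MHZ[idx])
    return freqs
```

Because transmitter and receiver share the seed, both can regenerate the same hop sequence deterministically.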
Carriage of group B streptococcus in pregnant women from Oxford, UK
Jones, N; Oliver, K; Jones, Y; Haines, A; Crook, D
2006-01-01
Objective To investigate asymptomatic vagino‐rectal carriage of group B streptococcus (GBS) in pregnant women. Methods Women in the final trimester of pregnancy were recruited. A single vagino‐rectal swab was taken, with consent, for culture of GBS. Two microbiological methods for isolation of GBS from vagino‐rectal swabs were compared. The distribution of capsular serotypes of the GBS identified was determined. Epidemiological data for a subset (n = 167) of the participating pregnant women were examined. Results 21.3% were colonised vagino‐rectally with GBS. Risk factors for neonatal GBS disease (maternal fever, prolonged rupture of membranes, and preterm delivery) were present in 34 of 167 women (20.4%), and the presence of these factors correlated poorly with GBS carriage. Capsular serotypes III (26.4%), IA (25.8%), V (18.9%), and IB (15.7%) were prevalent among the GBS isolates. Selective broth culture of vagino‐rectal swabs was superior to selective plate culture, but the combination of both methods was associated with increased detection of GBS (7.5%). An algorithm for the identification of GBS from vagino‐rectal swabs was developed. Conclusions GBS carriage is prevalent in pregnant women in Oxfordshire, UK. The poor correlation between risk factors and GBS carriage requires further investigation in larger groups, given that the identification of these surrogate markers is recommended to guide administration of intrapartum antibiotic prophylaxis by the Royal College of Obstetricians of the UK. Selective broth culture detected more GBS carriers than selective plate culture. PMID:16473927
Shahraz, Saeid; Lagu, Tara; Ritter, Grant A; Liu, Xiadong; Tompkins, Christopher
2017-03-01
Selection of International Classification of Diseases (ICD)-based coded information for complex conditions such as severe sepsis is a subjective process, and the results are sensitive to the codes selected. We use an innovative data exploration method to guide ICD-based case selection for severe sepsis. Using the Nationwide Inpatient Sample, we applied Latent Class Analysis (LCA) to determine whether medical coders follow any uniform and sensible coding practice for observations with severe sepsis. We examined whether ICD-9 codes specific to sepsis (038.xx for septicemia, a subset of 995.9 codes representing Systemic Inflammatory Response Syndrome, and 785.52 for septic shock) could all be members of the same latent class. Hospitalizations coded with sepsis-specific codes could be assigned to a latent class of their own. This class constituted 22.8% of all potential sepsis observations. The probability of an observation with any sepsis-specific code being assigned to the residual class was near 0, and the chance of an observation in the residual class having a sepsis-specific code as the principal diagnosis was close to 0. The validity of the sepsis class assignment is supported by empirical results, which indicated that in-hospital deaths in the sepsis-specific class were around 4 times as likely as those in the residual class. The conventional methods of defining severe sepsis cases in observational data substantially misclassify sepsis cases. We suggest a methodology that supports the reliable selection of ICD codes for conditions that require complex coding.
Design and Application of Drought Indexes in Highly Regulated Mediterranean Water Systems
NASA Astrophysics Data System (ADS)
Castelletti, A.; Zaniolo, M.; Giuliani, M.
2017-12-01
Costs of drought are progressively increasing due to the ongoing alteration of hydro-meteorological regimes induced by climate change. Although drought management is largely studied in the literature, most traditional drought indexes fail to detect critical events in highly regulated systems, which generally rely on ad hoc index formulations that cannot be generalized to different contexts. In this study, we contribute a novel framework for the design of a basin-customized drought index. This index represents a surrogate of the state of the basin and is computed by combining the available information about the water available in the system to reproduce a target variable representative of the drought condition of the basin (e.g., water deficit). To select the relevant variables and combinations thereof, we use an advanced feature extraction algorithm called Wrapper for Quasi Equally Informative Subset Selection (W-QEISS). W-QEISS relies on a multi-objective evolutionary algorithm to find Pareto-efficient subsets of variables by maximizing the wrapper accuracy, minimizing the number of selected variables, and optimizing the relevance and redundancy of the subset. The accuracy objective is evaluated through the calibration of an extreme learning machine of the water deficit for each candidate subset of variables, with the index selected from the resulting solutions identifying a suitable compromise between accuracy, cardinality, relevance, and redundancy. The approach is tested on Lake Como, Italy, a regulated lake mainly operated for irrigation supply. In the absence of an institutional drought monitoring system, we constructed the combined index using all the hydrological variables from the existing monitoring system as well as common drought indicators at multiple time aggregations. The soil moisture deficit in the root zone, computed by a distributed-parameter water balance model of the agricultural districts, is used as the target variable.
Numerical results show that our combined drought index successfully reproduces the deficit. The index provides valuable information for supporting appropriate drought management strategies, including the possibility of directly informing the lake operations about drought conditions and improving the overall reliability of the irrigation supply system.
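The wrapper idea underlying W-QEISS can be sketched as follows. This is a deliberately simplified, single-objective greedy stand-in (plain linear regression instead of an extreme learning machine, and no evolutionary multi-objective search), not the actual W-QEISS algorithm:

```python
import numpy as np

def wrapper_score(X, y, subset, train_frac=0.7, seed=0):
    """Hold-out R^2 of a linear model fit on the candidate variable subset
    (a stand-in for the extreme learning machine used in the paper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    tr, te = idx[:n_tr], idx[n_tr:]
    Xs = X[:, subset]
    w, *_ = np.linalg.lstsq(np.column_stack([Xs[tr], np.ones(len(tr))]),
                            y[tr], rcond=None)
    pred = np.column_stack([Xs[te], np.ones(len(te))]) @ w
    ss_res = np.sum((y[te] - pred) ** 2)
    ss_tot = np.sum((y[te] - y[te].mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def greedy_wrapper(X, y, max_vars):
    """Greedy forward wrapper: grow the subset while hold-out accuracy
    improves, trading accuracy against cardinality."""
    selected, best = [], -np.inf
    while len(selected) < max_vars:
        scores = {j: wrapper_score(X, y, selected + [j])
                  for j in range(X.shape[1]) if j not in selected}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected, best
```

In the paper, relevance and redundancy of the subset are additional objectives optimized simultaneously rather than handled by this simple stopping rule.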
Automatic design of basin-specific drought indexes for highly regulated water systems
NASA Astrophysics Data System (ADS)
Zaniolo, Marta; Giuliani, Matteo; Castelletti, Andrea Francesco; Pulido-Velazquez, Manuel
2018-04-01
Socio-economic costs of drought are progressively increasing worldwide due to ongoing alterations of hydro-meteorological regimes induced by climate change. Although drought management is largely studied in the literature, traditional drought indexes often fail to detect critical events in highly regulated systems, where natural water availability is conditioned by the operation of water infrastructures such as dams, diversions, and pumping wells. Here, ad hoc index formulations are usually adopted, based on empirical combinations of several, supposedly significant, hydro-meteorological variables. These customized formulations, however, while effective in the design basin, can hardly be generalized and transferred to different contexts. In this study, we contribute FRIDA (FRamework for Index-based Drought Analysis), a novel framework for the automatic design of basin-customized drought indexes. In contrast to ad hoc empirical approaches, FRIDA is fully automated, generalizable, and portable across different basins. FRIDA builds an index representing a surrogate of the drought conditions of the basin, computed by combining all the relevant available information about the water circulating in the system, identified by means of a feature extraction algorithm. We used the Wrapper for Quasi-Equally Informative Subset Selection (W-QEISS), which features a multi-objective evolutionary algorithm to find Pareto-efficient subsets of variables by maximizing the wrapper accuracy, minimizing the number of selected variables, and optimizing the relevance and redundancy of the subset. The preferred variable subset is selected among the efficient solutions and used to formulate the final index according to alternative model structures. We apply FRIDA to the case study of the Jucar river basin (Spain), a drought-prone and highly regulated Mediterranean water resource system, where an advanced drought management plan relying on the formulation of an ad hoc state index
is used for triggering drought management measures. The state index was constructed empirically with a trial-and-error process begun in the 1980s and finalized in 2007, guided by the experts from the Confederación Hidrográfica del Júcar (CHJ). Our results show that the automated variable selection outcomes align with CHJ's 25-year-long empirical refinement. In addition, the resultant FRIDA index outperforms the official State Index in terms of accuracy in reproducing the target variable and cardinality of the selected inputs set.
Tan, Maxine; Pu, Jiantao; Zheng, Bin
2014-01-01
Purpose: Improving radiologists’ performance in classification between malignant and benign breast lesions is important to increase cancer detection sensitivity and reduce false-positive recalls. For this purpose, developing computer-aided diagnosis (CAD) schemes has been attracting research interest in recent years. In this study, we investigated a new feature selection method for the task of breast mass classification. Methods: We initially computed 181 image features based on mass shape, spiculation, contrast, presence of fat or calcifications, texture, isodensity, and other morphological features. From this large image feature pool, we used a sequential forward floating selection (SFFS)-based feature selection method to select relevant features, and analyzed their performance using a support vector machine (SVM) model trained for the classification task. On a database of 600 benign and 600 malignant mass regions of interest (ROIs), we performed the study using a ten-fold cross-validation method. Feature selection and optimization of the SVM parameters were conducted on the training subsets only. Results: The area under the receiver operating characteristic curve (AUC) = 0.805±0.012 was obtained for the classification task. The results also showed that the most frequently-selected features by the SFFS-based algorithm in 10-fold iterations were those related to mass shape, isodensity and presence of fat, which are consistent with the image features frequently used by radiologists in the clinical environment for mass classification. The study also indicated that accurately computing mass spiculation features from the projection mammograms was difficult, and failed to perform well for the mass classification task due to tissue overlap within the benign mass regions. 
Conclusions: In conclusion, this comprehensive feature analysis study provided new and valuable information for optimizing computerized mass classification schemes that may have potential to be useful as a “second reader” in future clinical practice. PMID:24664267
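A minimal sketch of sequential forward floating selection with a generic, user-supplied scoring function (in the study above, the SVM cross-validation performance would play this role; the simplification of never re-excluding the feature just added is ours, to guarantee termination):

```python
def sffs(features, score, k):
    """Sequential forward floating selection (simplified sketch): after each
    forward inclusion, conditionally exclude features while doing so beats
    the best subset of the smaller size found so far."""
    subset, best = [], {}
    while len(subset) < k:
        # Forward step: include the feature that maximizes the criterion.
        add = max((f for f in features if f not in subset),
                  key=lambda f: score(subset + [f]))
        subset = subset + [add]
        if len(subset) not in best or score(subset) > best[len(subset)][0]:
            best[len(subset)] = (score(subset), list(subset))
        # Floating backward step(s): never re-exclude the feature just added.
        while len(subset) > 2:
            drop = max((f for f in subset if f != add),
                       key=lambda f: score([g for g in subset if g != f]))
            reduced = [g for g in subset if g != drop]
            if score(reduced) > best[len(reduced)][0]:
                subset = reduced
                best[len(subset)] = (score(subset), list(subset))
            else:
                break
    return best[k][1]
```

The floating backward step is what lets SFFS escape the nesting effect of plain sequential forward selection.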
Bootstrap Enhanced Penalized Regression for Variable Selection with Neuroimaging Data.
Abram, Samantha V; Helwig, Nathaniel E; Moodie, Craig A; DeYoung, Colin G; MacDonald, Angus W; Waller, Niels G
2016-01-01
Recent advances in fMRI research highlight the use of multivariate methods for examining whole-brain connectivity. Complementary data-driven methods are needed for determining the subset of predictors related to individual differences. Although commonly used for this purpose, ordinary least squares (OLS) regression may not be ideal due to multi-collinearity and over-fitting issues. Penalized regression is a promising and underutilized alternative to OLS regression. In this paper, we propose a nonparametric bootstrap quantile (QNT) approach for variable selection with neuroimaging data. We use real and simulated data, as well as annotated R code, to demonstrate the benefits of our proposed method. Our results illustrate the practical potential of our proposed bootstrap QNT approach. Our real data example demonstrates how our method can be used to relate individual differences in neural network connectivity with an externalizing personality measure. Also, our simulation results reveal that the QNT method is effective under a variety of data conditions. Penalized regression yields more stable estimates and sparser models than OLS regression in situations with large numbers of highly correlated neural predictors. Our results demonstrate that penalized regression is a promising method for examining associations between neural predictors and clinically relevant traits or behaviors. These findings have important implications for the growing field of functional connectivity research, where multivariate methods produce numerous, highly correlated brain networks.
PMID:27516732
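The bootstrap quantile idea can be sketched as follows; to stay dependency-free, this toy version fits ordinary least squares inside the bootstrap loop, whereas the paper applies the same quantile-interval selection rule to penalized regression coefficients:

```python
import numpy as np

def bootstrap_qnt_select(X, y, n_boot=500, alpha=0.05, seed=0):
    """Select predictors whose bootstrap quantile interval for the regression
    coefficient excludes zero. (A sketch of the QNT idea; the paper uses
    penalized rather than ordinary least-squares estimates.)"""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)            # resample rows with replacement
        A = np.column_stack([X[idx], np.ones(n)])
        w, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        coefs[b] = w[:p]
    lo = np.quantile(coefs, alpha / 2, axis=0)
    hi = np.quantile(coefs, 1 - alpha / 2, axis=0)
    return np.where((lo > 0) | (hi < 0))[0]    # interval excludes zero
```

With many highly correlated neural predictors, swapping the inner fit for a penalized estimator (ridge or lasso) is what yields the stability gains the paper reports.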
An overview of the Columbia Habitat Monitoring Program's (CHaMP) spatial-temporal design framework
We briefly review the concept of a master sample applied to stream networks in which a randomized set of stream sites is selected across a broad region to serve as a list of sites from which a subset of sites is selected to achieve multiple objectives of specific designs. The Col...
ERIC Educational Resources Information Center
Scott, Marcia Strong; Delgado, Christine F.; Tu, Shihfen; Fletcher, Kathryn L.
2005-01-01
In this study, predictive classification accuracy was used to select those tasks from a kindergarten screening battery that best identified children who, three years later, were classified as educable mentally handicapped or as having a specific learning disability. A subset of measures enabled correct classification of 91% of the children in…
Selecting climate change scenarios using impact-relevant sensitivities
Julie A. Vano; John B. Kim; David E. Rupp; Philip W. Mote
2015-01-01
Climate impact studies often require the selection of a small number of climate scenarios. Ideally, a subset would have simulations that both (1) appropriately represent the range of possible futures for the variable/s most important to the impact under investigation and (2) come from global climate models (GCMs) that provide plausible results for future climate in the...
[Varicocele and coincidental abacterial prostato-vesiculitis: negative role about the sperm output].
Vicari, Enzo; La Vignera, Sandro; Tracia, Angelo; Cardì, Francesco; Donati, Angelo
2003-03-01
To evaluate the frequency and the role of a coincidentally expressed abacterial prostato-vesiculitis (PV) on sperm output in patients with left varicocele (Vr). We evaluated 143 selected infertile patients (mean age 27 years, range 21-43) with oligo- and/or astheno- and/or teratozoospermia (OAT), subdivided into two groups. Group A included 76 patients with previous varicocelectomy and persistent OAT. Group B included 67 infertile patients (mean age 26 years, range 21-37) with OAT who had not undergone varicocelectomy. Patients with Vr and coincidental didymo-epididymal ultrasound (US) abnormalities were excluded from the study. Following rectal prostato-vesicular ultrasonography, each group was subdivided into two subsets on the basis of the absence (group A: subset Vr-/PV-; group B: subset Vr+/PV-) or the presence of an abacterial PV (group A: subset Vr-/PV+; group B: subset Vr+/PV+). PV was present in 47.4% and 41.8% of patients in groups A and B, respectively. This coincidental pathology was ipsilateral to the Vr in 61% of cases. Semen analysis was performed in all patients. Patients of group A showed a total sperm number significantly higher than that found in group B. In the presence of PV, sperm parameters did not differ significantly between matched subsets (Vr-/PV+ vs. Vr+/PV+). In the absence of PV, sperm density, total sperm number, and the percentage of forward motility in the subset with previous varicocelectomy (Vr-/PV-) were significantly higher than in the matched subset (Vr+/PV-). Sperm analysis alone performed in patients with left Vr is not a useful prognostic post-varicocelectomy marker. Since a lack of sperm response following varicocelectomy could mask another coincidental pathology, identification of a possible PV through US scans may be mandatory.
On the other hand, an integrated uro-andrological approach, including US scans, makes it possible to identify subsets of patients with Vr alone, who can be expected to have a better sperm response following Vr repair.
Dunham, Richard M; Cervasi, Barbara; Brenchley, Jason M; Albrecht, Helmut; Weintrob, Amy; Sumpter, Beth; Engram, Jessica; Gordon, Shari; Klatt, Nichole R; Frank, Ian; Sodora, Donald L; Douek, Daniel C; Paiardini, Mirko; Silvestri, Guido
2008-04-15
Decreased CD4(+) T cell counts are the best marker of disease progression during HIV infection. However, CD4(+) T cells are heterogeneous in phenotype and function, and it is unknown how preferential depletion of specific CD4(+) T cell subsets influences disease severity. CD4(+) T cells can be classified into three subsets by the expression of receptors for two T cell-tropic cytokines, IL-2 (CD25) and IL-7 (CD127). The CD127(+)CD25(low/-) subset includes IL-2-producing naive and central memory T cells; the CD127(-)CD25(-) subset includes mainly effector T cells expressing perforin and IFN-gamma; and the CD127(low)CD25(high) subset includes FoxP3-expressing regulatory T cells. Herein we investigated how the proportions of these T cell subsets are changed during HIV infection. When compared with healthy controls, HIV-infected patients show a relative increase in CD4(+)CD127(-)CD25(-) T cells that is related to an absolute decline of CD4(+)CD127(+)CD25(low/-) T cells. Interestingly, this expansion of CD4(+)CD127(-) T cells was not observed in naturally SIV-infected sooty mangabeys. The relative expansion of CD4(+)CD127(-)CD25(-) T cells correlated directly with the levels of total CD4(+) T cell depletion and immune activation. CD4(+)CD127(-)CD25(-) T cells were not selectively resistant to HIV infection as levels of cell-associated virus were similar in all non-naive CD4(+) T cell subsets. These data indicate that, during HIV infection, specific changes in the fraction of CD4(+) T cells expressing CD25 and/or CD127 are associated with disease progression. Further studies will determine whether monitoring the three subsets of CD4(+) T cells defined based on the expression of CD25 and CD127 should be used in the clinical management of HIV-infected individuals.
NASA Astrophysics Data System (ADS)
Chen, Hui; Tan, Chao; Lin, Zan; Wu, Tong
2018-01-01
Milk is among the most popular nutrient sources worldwide and is of great interest due to its beneficial medicinal properties. The feasibility of classifying milk powder samples with respect to their brands and of determining protein concentration is investigated by NIR spectroscopy along with chemometrics. Two datasets were prepared for the experiment: one contains 179 samples of four brands for classification, and the other contains 30 samples for quantitative analysis. Principal component analysis (PCA) was used for exploratory analysis. Based on an effective model-independent variable selection method, i.e., minimal-redundancy maximal-relevance (MRMR), only 18 variables were selected to construct a partial least-squares discriminant analysis (PLS-DA) model. On the test set, the PLS-DA model based on the selected variable set was compared with the full-spectrum PLS-DA model; both achieved 100% accuracy. In quantitative analysis, the partial least-squares regression (PLSR) model constructed on a selected subset of 260 variables significantly outperforms the full-spectrum model. The combination of NIR spectroscopy, MRMR, and PLS-DA or PLSR thus appears to be a powerful tool for classifying different brands of milk and determining protein content.
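A minimal sketch of MRMR-style greedy selection, using absolute Pearson correlation as a stand-in for the mutual-information measures typically used for relevance and redundancy:

```python
import numpy as np

def mrmr(X, y, k):
    """Greedy minimal-redundancy maximal-relevance selection. At each step,
    add the variable maximizing (relevance to y) - (mean redundancy with the
    variables already selected); correlation is used here as a simple proxy."""
    n_feat = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy   # MRMR difference criterion
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

The redundancy penalty is what lets a small subset (such as the 18 variables above) cover the informative spectral regions without piling up collinear wavelengths.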
Feature Selection Using Information Gain for Improved Structural-Based Alert Correlation
Siraj, Maheyzah Md; Zainal, Anazida; Elshoush, Huwaida Tagelsir; Elhaj, Fatin
2016-01-01
Grouping and clustering alerts for intrusion detection based on the similarity of features is referred to as structural-based alert correlation and can discover a list of attack steps. Previous researchers selected different features and data sources manually based on their knowledge and experience, which leads to less accurate identification of attack steps and inconsistent clustering accuracy. Furthermore, existing alert correlation systems deal with a huge amount of data containing null values, incomplete information, and irrelevant features, making the analysis of the alerts tedious, time-consuming, and error-prone. Therefore, this paper focuses on selecting accurate and significant features of alerts that are appropriate to represent the attack steps, thus enhancing the structural-based alert correlation model. A two-tier feature selection method is proposed to obtain the significant features. The first tier ranks the subset of features by information gain entropy in decreasing order. The second tier extends the ranking with additional features that have a better discriminative ability than the initially ranked features. Performance analysis shows the significance of the selected features in terms of clustering accuracy on the DARPA 2000 intrusion detection scenario-specific dataset. PMID:27893821
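The first tier's information-gain ranking can be sketched for discrete-valued features as follows (feature and label encodings here are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(labels) - sum over values v of p(v) * H(labels | feature = v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        sub = [lab for f, lab in zip(feature_values, labels) if f == v]
        gain -= (len(sub) / n) * entropy(sub)
    return gain

def rank_features(rows, labels):
    """Rank feature columns by information gain, highest first
    (the tier-one step of the two-tier selection described above)."""
    n_feat = len(rows[0])
    gains = [(information_gain([r[j] for r in rows], labels), j)
             for j in range(n_feat)]
    return [j for g, j in sorted(gains, reverse=True)]
```

A feature that perfectly separates the alert classes attains the maximum gain H(labels); an irrelevant feature scores near zero and falls to the bottom of the ranking.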
NASA Astrophysics Data System (ADS)
Wang, Baijie; Wang, Xin; Chen, Zhangxin
2013-08-01
Reservoir characterization refers to the process of quantitatively assigning reservoir properties using all available field data. Artificial neural networks (ANN) have recently been introduced to solve reservoir characterization problems dealing with the complex underlying relationships inherent in well log data. Despite the utility of ANNs, the current limitation is that most existing applications simply focus on directly implementing existing ANN models instead of improving/customizing them to fit the specific reservoir characterization tasks at hand. In this paper, we propose a novel intelligent framework that integrates fuzzy ranking (FR) and multilayer perceptron (MLP) neural networks for reservoir characterization. FR automatically identifies a minimum subset of well log data to serve as neural inputs, and the MLP is trained to learn the complex correlations from the selected well log data to a target reservoir property. FR guarantees the selection of an optimal subset of representative data from the overall well log data set for the characterization of a specific reservoir property, which implicitly improves the modeling and prediction accuracy of the MLP. In addition, a growing number of industrial agencies are implementing geographic information systems (GIS) in field data management, and we have designed the GFAR solution (GIS-based FR ANN Reservoir characterization solution), which integrates the proposed framework into a GIS system and provides an efficient characterization solution. Three separate petroleum wells from southwestern Alberta, Canada, were used in the presented case study of reservoir porosity characterization. Our experiments demonstrate that our method can generate reliable results.
Maeng, Daniel D; Geng, Zhi; Marshall, Wendy M; Hess, Allison L; Tomcavage, Janet F
2017-11-14
Since 2012, a large health care system has offered an employee wellness program providing premium discounts for those who voluntarily undergo biometric screenings and meet goals. This study evaluates the program's impact on care utilization and total cost of care, taking into account employee self-selection into the program. A retrospective claims data analysis of 6453 employees between 2011 and 2015 was conducted, categorizing the sample into 3 mutually exclusive subgroups: Subgroup 1 enrolled and met goals in all years, Subgroup 2 enrolled or met goals in some years but not all, and Subgroup 3 never enrolled. Each subgroup was compared to a cohort of employees in other employer groups (N = 24,061). Using a difference-in-difference method, significant reductions in total medical cost (14.2%; P = 0.014) and emergency department (ED) visits (11.2%; P = 0.058) were observed only among Subgroup 2 in 2015. No significant impact was detected among those in Subgroup 1, who were less likely to have chronic conditions at baseline. The results indicate that wellness program enrollment was characterized by self-selection of healthier employees, among whom the program appeared to have no significant impact. Yet, cost savings and reductions in ED visits were observed among the subset of employees who enrolled or met goals in some years but not all, suggesting a potential link between the wellness program and positive behavior changes among certain subsets of the employee population.
Kling, Teresia; Johansson, Patrik; Sanchez, José; Marinescu, Voichita D.; Jörnsten, Rebecka; Nelander, Sven
2015-01-01
Statistical network modeling techniques are increasingly important tools to analyze cancer genomics data. However, current tools and resources are not designed to work across multiple diagnoses and technical platforms, thus limiting their applicability to comprehensive pan-cancer datasets such as The Cancer Genome Atlas (TCGA). To address this, we describe a new data driven modeling method, based on generalized Sparse Inverse Covariance Selection (SICS). The method integrates genetic, epigenetic and transcriptional data from multiple cancers, to define links that are present in multiple cancers, a subset of cancers, or a single cancer. It is shown to be statistically robust and effective at detecting direct pathway links in data from TCGA. To facilitate interpretation of the results, we introduce a publicly accessible tool (cancerlandscapes.org), in which the derived networks are explored as interactive web content, linked to several pathway and pharmacological databases. To evaluate the performance of the method, we constructed a model for eight TCGA cancers, using data from 3900 patients. The model rediscovered known mechanisms and contained interesting predictions. Possible applications include prediction of regulatory relationships, comparison of network modules across multiple forms of cancer and identification of drug targets. PMID:25953855
A global optimization approach to multi-polarity sentiment analysis.
Li, Xinmiao; Li, Jing; Wu, Yukeng
2015-01-01
Following the rapid development of social media, sentiment analysis has become an important social media mining technique. The performance of automatic sentiment analysis primarily depends on feature selection and sentiment classification. While information gain (IG) and support vector machines (SVM) are two important techniques, few studies have optimized both approaches in sentiment analysis, and the effectiveness of applying a global optimization approach to sentiment analysis remains unclear. We propose a global optimization-based sentiment analysis (PSOGO-Senti) approach to improve sentiment analysis, with IG for feature selection and an SVM as the learning engine. The PSOGO-Senti approach utilizes a particle swarm optimization algorithm to obtain a globally optimal combination of feature dimensions and parameters in the SVM. We evaluated the PSOGO-Senti model on two datasets from different fields. The experimental results showed that the PSOGO-Senti model can improve binary and multi-polarity Chinese sentiment analysis. We compared the optimal feature subset selected by PSOGO-Senti with the features in the sentiment dictionary; this comparison indicated that PSOGO-Senti can effectively remove redundant and noisy features and select a domain-specific feature subset with higher explanatory power for a particular sentiment analysis task. The experimental results showed that the PSOGO-Senti approach is effective and robust for sentiment analysis tasks in different domains. By comparing the improvements for two-polarity, three-polarity, and five-polarity sentiment analysis, we found that five-polarity sentiment analysis delivered the largest improvement and two-polarity the smallest. We conclude that PSOGO-Senti achieves greater improvement for more complicated sentiment analysis tasks. We also compared the results of PSOGO-Senti with those of a genetic algorithm (GA) and a grid search method.
From the results of this comparison, we found that PSOGO-Senti is more suitable for improving a difficult multi-polarity sentiment analysis problem.
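The global-optimization engine can be illustrated with a minimal particle swarm optimizer. In PSOGO-Senti the position vector would encode the feature dimensions and the SVM parameters; here, as a sketch, a generic continuous objective is minimized:

```python
import numpy as np

def pso_minimize(f, dim, bounds, n_particles=30, iters=100, seed=0,
                 w=0.7, c1=1.5, c2=1.5):
    """Basic particle swarm optimization: each particle is pulled toward its
    personal best and the swarm's global best position."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))     # positions
    v = np.zeros((n_particles, dim))                # velocities
    pbest = x.copy()
    pbest_val = np.array([f(p) for p in x])
    g = pbest[np.argmin(pbest_val)].copy()          # global best
    g_val = pbest_val.min()
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        if pbest_val.min() < g_val:
            g_val = pbest_val.min()
            g = pbest[np.argmin(pbest_val)].copy()
    return g, g_val
```

In the sentiment setting, f would be the (negated) cross-validated SVM accuracy evaluated at a candidate feature-dimension/parameter combination.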
Piao, Yongjun; Piao, Minghao; Ryu, Keun Ho
2017-01-01
Cancer classification has been a crucial topic of research in cancer treatment. In the last decade, messenger RNA (mRNA) expression profiles have been widely used to classify different types of cancers. With the discovery of a new class of small non-coding RNAs, known as microRNAs (miRNAs), various studies have shown that the expression patterns of miRNAs can also accurately classify human cancers. Therefore, there is great demand for the development of machine learning approaches that accurately classify various types of cancers using miRNA expression data. In this article, we propose a feature subset-based ensemble method in which each model is learned from a different projection of the original feature space to classify multiple cancers. In our method, feature relevance and redundancy are considered to generate multiple feature subsets, the base classifiers are learned from each independent miRNA subset, and the average posterior probability is used to combine the base classifiers. To test the performance of our method, we used bead-based and sequence-based miRNA expression datasets and conducted 10-fold and leave-one-out cross-validations. The experimental results show that the proposed method yields good results and has higher prediction accuracy than popular ensemble methods. The Java program and source code of the proposed method and the datasets used in the experiments are freely available at https://sourceforge.net/projects/mirna-ensemble/. Copyright © 2016 Elsevier Ltd. All rights reserved.
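The average-posterior combination step can be sketched as follows (the predict functions and toy posterior values below are hypothetical stand-ins for the trained base classifiers; the published implementation is in Java):

```python
import numpy as np

def ensemble_predict(predict_fns, feature_subsets, X):
    """Feature subset-based ensemble combination: each base classifier sees
    only its own feature subset; posteriors are averaged across classifiers
    and the class with the highest mean posterior wins."""
    posteriors = [fn(X[:, list(s)])                     # (n_samples, n_classes)
                  for fn, s in zip(predict_fns, feature_subsets)]
    avg = np.mean(posteriors, axis=0)
    return np.argmax(avg, axis=1), avg
```

Averaging posteriors (rather than hard majority voting) lets confident base classifiers outweigh uncertain ones in the combined decision.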
AOIPS data base management systems support for GARP data sets
NASA Technical Reports Server (NTRS)
Gary, J. P.
1977-01-01
A data base management system developed to provide flexible access to data sets produced by GARP during its data systems tests is identified. The content and coverage of the data base are defined, and a computer-aided, interactive information storage and retrieval system, implemented to facilitate access to user-specified data subsets, is described. The computer programs developed to provide this capability were implemented on the highly interactive, minicomputer-based AOIPS and are referred to as the data retrieval system (DRS). Implemented as a user-interactive, menu-guided system, the DRS permits users to inventory the data tape library and create duplicate or subset data sets based on a user-selected window defined by time and latitude/longitude boundaries. The DRS also permits users to select, display, or produce formatted hard copy of individual data items contained within the data records.
Yashin, Anatoliy I.; Arbeev, Konstantin G.; Wu, Deqing; Arbeeva, Liubov; Kulminski, Alexander; Kulminskaya, Irina; Akushevich, Igor; Ukraintseva, Svetlana V.
2016-01-01
Background and Objective To clarify the mechanisms of genetic regulation of human aging and longevity traits, a number of genome-wide association studies (GWAS) of these traits have been performed. However, the results of these analyses did not meet researchers' expectations: most detected genetic associations have not reached a genome-wide level of statistical significance and have suffered from a lack of replication in studies of independent populations. The reasons for slow progress in this research area include the low efficiency of statistical methods used in data analyses, the genetic heterogeneity of aging- and longevity-related traits, the possibility of pleiotropic (e.g., age-dependent) effects of genetic variants on such traits, underestimation of the effects of (i) mortality selection in genetically heterogeneous cohorts and (ii) external factors and differences in the genetic backgrounds of individuals in the populations under study, and the weakness of a conceptual biological framework that does not fully account for the above-mentioned factors. A further limitation of the studies conducted is that they did not fully realize the potential of longitudinal data, which allow one to evaluate how genetic influences on life span are mediated by physiological variables and other biomarkers during the life course. The objective of this paper is to address these issues. Data and Methods We performed GWAS of human life span using different subsets of data from the original Framingham Heart Study cohort corresponding to different quality control (QC) procedures and used one subset of selected genetic variants for further analyses. We used a simulation study to show that our approach to combining data improves the quality of GWAS. We used FHS longitudinal data to compare average age trajectories of physiological variables in carriers and non-carriers of selected genetic variants.
We used a stochastic process model of human mortality and aging to investigate genetic influence on hidden biomarkers of aging and on the dynamic interaction between aging and longevity. We also investigated the properties of genes related to the selected variants and their roles in signaling and metabolic pathways. Results We showed that the use of different QC procedures results in different sets of genetic variants associated with life span. We selected 24 genetic variants negatively associated with life span. We showed that joint analyses of genetic data collected at the time of bio-specimen collection and follow-up data substantially improved the significance of the associations of the selected 24 SNPs with life span. We also showed that aging-related changes in physiological variables and in hidden biomarkers of aging differ between carriers and non-carriers of the selected variants. Conclusions The results of these analyses demonstrated the benefits of using biodemographic models and methods in genetic association studies of these traits. Our findings showed that the absence of a large number of genetic variants with deleterious effects may make a substantial contribution to exceptional longevity. These effects are dynamically mediated by a number of physiological variables and hidden biomarkers of aging. The results of this research demonstrated the benefits of using integrative statistical models of mortality risks in genetic studies of human aging and longevity. PMID:27773987
Casati, Anna; Varghaei-Nahvi, Azam; Feldman, Steven Alexander; Assenmacher, Mario; Rosenberg, Steven Aaron; Dudley, Mark Edward; Scheffold, Alexander
2013-10-01
The adoptive transfer of lymphocytes genetically engineered to express tumor-specific antigen receptors is a potent strategy to treat cancer patients. T lymphocyte subsets, such as naïve or central memory T cells, selected in vitro prior to genetic engineering have been extensively investigated in preclinical mouse models, where they demonstrated improved therapeutic efficacy. However, this has so far been challenging to realize in the clinical setting, since good manufacturing practice (GMP) procedures for complex cell sorting and genetic manipulation are limited. To be able to directly compare the immunological attributes and therapeutic efficacy of naïve (T(N)) and central memory (T(CM)) CD8(+) T cells, we investigated clinical-scale procedures for their parallel selection and in vitro manipulation. We also evaluated currently available GMP-grade reagents for stimulation of T cell subsets, including a new type of anti-CD3/anti-CD28 nanomatrix. An optimized protocol was established for the isolation of both CD8(+) T(N) cells (CD4(-)CD62L(+)CD45RA(+)) and CD8(+) T(CM) (CD4(-)CD62L(+)CD45RA(-)) from a single patient. The highly enriched T cell subsets can be efficiently transduced and expanded to large cell numbers, sufficient for clinical applications and equivalent to or better than current cell and gene therapy approaches with unselected lymphocyte populations. The GMP protocols for selection of T(N) and T(CM) reported here will be the basis for clinical trials analyzing safety, in vivo persistence, and clinical efficacy in cancer patients and will help to generate a more reliable and efficacious cellular product.
Determining Individual Particle Magnetizations in Assemblages of Micrograins
NASA Astrophysics Data System (ADS)
de Groot, Lennart V.; Fabian, Karl; Béguin, Annemarieke; Reith, Pim; Barnhoorn, Auke; Hilgenkamp, Hans
2018-04-01
Obtaining reliable information from even the most challenging paleomagnetic recorders, such as the oldest igneous rocks and meteorites, is paramount to opening new windows into Earth's history. Currently, such information is acquired by simultaneously sensing millions of particles in small samples or single crystals using superconducting quantum interference device magnetometers. The obtained rock-magnetic signal is a statistical ensemble of grains potentially differing in reliability as paleomagnetic recorders due to variations in physical dimensions, chemistry, and magnetic behavior. Here we go beyond bulk magnetic measurements and combine computed tomography and scanning magnetometry to uniquely invert for the magnetic moments of individual grains. This enables us to select and consider contributions of subsets of grains as a function of particle-specific selection criteria and to avoid contributions from particles that are altered or contain unreliable magnetic carriers. This new, nondestructive method unlocks information from complex paleomagnetic recorders that until now remained obscured.
Renner, S; Sahlén, G; Périco, E
2016-06-01
We surveyed 15 bodies of water among remnants of the Atlantic Forest biome in southern Brazil for adult dragonflies and damselflies to test whether an empirical selection method for diversity indicators could be applied in a subtropical ecosystem where limited ecological knowledge at the species level is available. We found a regional species pool of 34 species distributed in a nested subset pattern, with a mean of 11.2 species per locality. There was a pronounced difference in species composition between spring, summer, and autumn, but no difference in species numbers between seasons. Two species, Homeoura chelifera (Selys) and Ischnura capreolus (Hagen), were the strongest candidates for regional diversity indicators, being found only at species-rich localities in our surveyed area and likewise in an undisturbed national forest reserve serving as a reference site for the Atlantic Forest. Using our selection method, we found it possible to obtain a tentative list of diversity indicators without detailed ecological information on each species, provided a reference site is available for comparison. The method thus allows indicator species to be selected "in blanco" (without detailed prior knowledge) from taxonomic groups that are little known. We hence argue that Odonata can already be incorporated in ongoing assessment programs in the Neotropics, which would also increase the ecological knowledge of the group and allow extrapolation to other taxa.
Longitudinal analyses of correlated response efficiencies of fillet traits in Nile tilapia.
Turra, E M; Fernandes, A F A; de Alvarenga, E R; Teixeira, E A; Alves, G F O; Manduca, L G; Murphy, T W; Silva, M A
2018-03-01
Recent studies with Nile tilapia have shown divergent results regarding the possibility of selecting on morphometric measurements to promote indirect genetic gains in fillet yield (FY). The use of indirect selection for fillet traits is important, as these traits are only measurable after harvesting. Random regression models are a powerful tool in association studies to identify the best time point to measure and select animals. Random regression models can also be applied in a multiple-trait approach to analyze indirect response to selection, which would avoid the need to sacrifice candidate fish. Therefore, the aim of this study was to investigate the genetic relationships between several body measurements, weight and fillet traits throughout the growth period and to evaluate the possibility of indirect selection for fillet traits in Nile tilapia. Data were collected from 2042 fish and divided into two subsets. The first subset was used to estimate genetic parameters, including the permanent environmental effect for BW and body measurements (8758 records for each body measurement, as each fish was individually weighed and measured a maximum of six times). The second subset (2042 records for each trait) was used to estimate genetic correlations and heritabilities, which enabled the calculation of correlated response efficiencies between body measurements and the fillet traits. Heritability estimates across ages ranged from 0.05 to 0.5 for height, 0.02 to 0.48 for corrected length (CL), 0.05 to 0.68 for width, 0.08 to 0.57 for fillet weight (FW) and 0.12 to 0.42 for FY. All genetic correlation estimates between body measurements and FW were positive and strong (0.64 to 0.98). The estimates of genetic correlation between body measurements and FY were positive (except for CL at some ages), but weak to moderate (-0.08 to 0.68). These estimates resulted in strong and favorable correlated response efficiencies for FW, and positive but only moderate efficiencies for FY.
These results indicate the possibility of achieving indirect genetic gains for FW by selecting on morphometric traits, but low efficiency for FY compared with direct selection.
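The correlated-response comparison above follows Falconer's classic ratio: with equal selection intensities, the efficiency of indirect relative to direct selection is the genetic correlation times the ratio of the square roots of the heritabilities. A one-function sketch, with illustrative values rather than the paper's estimates:

```python
def correlated_response_efficiency(h2_indirect, h2_direct, r_g):
    """Ratio CR/R of correlated to direct response for equal selection
    intensities: r_g * h_indirect / h_direct, with h = sqrt(heritability)."""
    return r_g * (h2_indirect ** 0.5) / (h2_direct ** 0.5)

# Illustrative (not the paper's) values: selecting on a body measurement
# (h2 = 0.5) to improve fillet weight (h2 = 0.4) with r_g = 0.9.
eff = correlated_response_efficiency(h2_indirect=0.5, h2_direct=0.4, r_g=0.9)
```

An efficiency above 1 means indirect selection on the easily measured trait would beat direct selection on the fillet trait itself; values well below 1, as the abstract reports for FY, mean direct selection is preferable when it is feasible.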
Yaacoub, Charles; Mhanna, Georges; Rihana, Sandy
2017-01-01
Electroencephalography is a non-invasive measure of the brain electrical activity generated by millions of neurons. Feature extraction in electroencephalography analysis is a core issue that may lead to accurate brain mental state classification. This paper presents a new feature selection method that improves left/right hand movement identification of a motor imagery brain-computer interface, based on genetic algorithms and artificial neural networks used as classifiers. Raw electroencephalography signals are first preprocessed using appropriate filtering. Feature extraction is carried out afterwards, based on spectral and temporal signal components, and thus a feature vector is constructed. As various features might be inaccurate and mislead the classifier, thus degrading the overall system performance, the proposed approach identifies a subset of features from a large feature space, such that the classifier error rate is reduced. Experimental results show that the proposed method is able to reduce the number of features to as low as 0.5% (i.e., the number of ignored features can reach 99.5%) while improving the accuracy, sensitivity, specificity, and precision of the classifier. PMID:28124985
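The genetic-algorithm feature selection described above can be sketched as a search over feature-subset bitmasks. The fitness function below is a hypothetical stand-in for the (negated) classifier error rate the paper minimizes; population size, generations, and mutation rate are illustrative.

```python
import random

def ga_select(n_features, fitness, pop_size=30, gens=60, p_mut=0.02):
    """Toy genetic algorithm over feature-subset bitmasks: tournament
    selection, one-point crossover, per-bit mutation, best-ever tracking."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament():
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(gens):
        new = []
        while len(new) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = random.randrange(1, n_features)      # one-point crossover
            child = [b ^ (random.random() < p_mut)     # bit-flip mutation
                     for b in p1[:cut] + p2[cut:]]
            new.append(child)
        pop = new
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best
    return best

# Hypothetical fitness: the first 3 features are informative, the rest add
# a small cost (standing in for classifier error on a validation set).
random.seed(7)
fit = lambda mask: sum(mask[:3]) - 0.2 * sum(mask[3:])
best = ga_select(10, fit)
```

In the paper's setting, evaluating `fitness` means training and testing the neural-network classifier on the masked feature vector, so each generation trades computation for a smaller, cleaner feature subset.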
Usability-driven pruning of large ontologies: the case of SNOMED CT
Boeker, Martin; Illarramendi, Arantza; Schulz, Stefan
2012-01-01
Objectives To study ontology modularization techniques when applied to SNOMED CT in a scenario in which no previous corpus of information exists and to examine if frequency-based filtering using MEDLINE can reduce subset size without discarding relevant concepts. Materials and Methods Subsets were first extracted using four graph-traversal heuristics and one logic-based technique, and were subsequently filtered with frequency information from MEDLINE. Twenty manually coded discharge summaries from cardiology patients were used as signatures and test sets. The coverage, size, and precision of extracted subsets were measured. Results Graph-traversal heuristics provided high coverage (71–96% of terms in the test sets of discharge summaries) at the expense of subset size (17–51% of the size of SNOMED CT). Pre-computed subsets and logic-based techniques extracted small subsets (1%), but coverage was limited (24–55%). Filtering reduced the size of large subsets to 10% while still providing 80% coverage. Discussion Extracting subsets to annotate discharge summaries is challenging when no previous corpus exists. Ontology modularization provides valuable techniques, but the resulting modules grow as signatures spread across subhierarchies, yielding a very low precision. Conclusion Graph-traversal strategies and frequency data from an authoritative source can prune large biomedical ontologies and produce useful subsets that still exhibit acceptable coverage. However, a clinical corpus closer to the specific use case is preferred when available. PMID:22268217
Optimal Frequency-Domain System Realization with Weighting
NASA Technical Reports Server (NTRS)
Juang, Jer-Nan; Maghami, Peiman G.
1999-01-01
Several approaches are presented to identify an experimental system model directly from frequency response data. The formulation uses a matrix-fraction description as the model structure. Frequency weighting such as exponential weighting is introduced to solve a weighted least-squares problem to obtain the coefficient matrices for the matrix-fraction description. A multi-variable state-space model can then be formed using the coefficient matrices of the matrix-fraction description. Three different approaches are introduced to fine-tune the model using nonlinear programming methods to minimize the desired cost function. The first method uses an eigenvalue assignment technique to reassign a subset of system poles to improve the identified model. The second method deals with the model in the real Schur or modal form, reassigns a subset of system poles, and adjusts the columns (rows) of the input (output) influence matrix using a nonlinear optimizer. The third method also optimizes a subset of poles, but the input and output influence matrices are refined at every optimization step through least-squares procedures.
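The exponential frequency weighting enters the identification as a weighted least-squares problem. A minimal sketch for a two-parameter scalar model (the paper works with matrix-fraction coefficient matrices; this line fit only illustrates the weighting mechanics, and the weight form is an assumed example):

```python
import math

def wls_line(freqs, vals, tau):
    """Weighted least-squares fit of vals ~ c0 + c1*f with exponential
    frequency weighting w(f) = exp(-f / tau), emphasizing low frequencies.
    Solves the 2x2 normal equations directly."""
    w = [math.exp(-f / tau) for f in freqs]
    s = sum(w)
    sx = sum(wi * f for wi, f in zip(w, freqs))
    sxx = sum(wi * f * f for wi, f in zip(w, freqs))
    sy = sum(wi * v for wi, v in zip(w, vals))
    sxy = sum(wi * f * v for wi, f, v in zip(w, freqs, vals))
    det = s * sxx - sx * sx
    c0 = (sxx * sy - sx * sxy) / det
    c1 = (s * sxy - sx * sy) / det
    return c0, c1

freqs = [0.0, 1.0, 2.0, 3.0]
vals = [1.0, 3.0, 5.0, 7.0]              # exactly y = 1 + 2 f
c0, c1 = wls_line(freqs, vals, tau=1.0)  # exact data -> (1.0, 2.0)
```

On noisy frequency response data the weights change which frequency band dominates the fit; on exact data, as here, any positive weighting recovers the same coefficients.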
Gooding, Thomas Michael; McCarthy, Patrick Joseph
2010-03-02
A data collector for a massively parallel computer system obtains call-return stack traceback data for multiple nodes by retrieving partial call-return stack traceback data from each node, grouping the nodes into subsets according to the partial traceback data, and obtaining further call-return stack traceback data from a representative node or nodes of each subset. Preferably, the partial data is a respective instruction address from each node, nodes having identical instruction addresses being grouped together in the same subset. Preferably, a single node of each subset is chosen and full stack traceback data is retrieved from the call-return stack within the chosen node.
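The grouping step can be sketched directly: collect nodes by their partial traceback datum (here a single instruction address) and pick one representative per group for full traceback retrieval. Node IDs and addresses below are hypothetical.

```python
from collections import defaultdict

def group_nodes_by_address(partial_tracebacks):
    """Group compute nodes whose partial traceback (one instruction address
    each) matches, then choose one representative node per group."""
    groups = defaultdict(list)
    for node, addr in partial_tracebacks.items():
        groups[addr].append(node)
    # one representative node per subset for full-stack retrieval
    return {addr: sorted(nodes)[0] for addr, nodes in groups.items()}

# Hypothetical 6-node snapshot: nodes stalled at two distinct addresses
partial = {0: 0x4005d0, 1: 0x4005d0, 2: 0x400710,
           3: 0x4005d0, 4: 0x400710, 5: 0x4005d0}
reps = group_nodes_by_address(partial)
```

On a machine with hundreds of thousands of nodes, retrieving full stacks only from the handful of representatives is what makes the collection tractable.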
Vremec, David
2016-01-01
Dendritic cells (DCs) form a complex network of cells that initiate and orchestrate immune responses against a vast array of pathogenic challenges. Developmentally and functionally distinct DC subtypes differentially regulate T-cell function. Importantly, it is the ability of DC to capture and process antigen, whether from pathogens, vaccines, or self-components, and present it to naive T cells that is the key to their ability to initiate an immune response. Our typical isolation procedure for DC from murine spleen was designed to efficiently extract all DC subtypes, without bias and without alteration to their in vivo phenotype, and involves a short collagenase digestion of the tissue, followed by selection for cells of light density and finally negative selection for DC. The isolation procedure can accommodate DC numbers that have been artificially increased via administration of fms-like tyrosine kinase 3 ligand (Flt3L), either directly through a series of subcutaneous injections or by seeding with an Flt3L-secreting murine melanoma. Flt3L may also be added to bone marrow cultures to produce large numbers of in vitro equivalents of the spleen DC subsets. Total DC, or their subsets, may be further purified using immunofluorescent labeling and flow cytometric cell sorting. Cell sorting may be completely bypassed by separating DC subsets using a combination of fluorescent antibody labeling and anti-fluorochrome magnetic beads. Our procedure enables efficient separation of the distinct DC subsets, even in cases where mouse numbers or flow cytometric cell sorting time is limiting.
G-STRATEGY: Optimal Selection of Individuals for Sequencing in Genetic Association Studies
Wang, Miaoyan; Jakobsdottir, Johanna; Smith, Albert V.; McPeek, Mary Sara
2017-01-01
In a large-scale genetic association study, the number of phenotyped individuals available for sequencing may, in some cases, be greater than the study’s sequencing budget will allow. In that case, it can be important to prioritize individuals for sequencing in a way that optimizes power for association with the trait. Suppose a cohort of phenotyped individuals is available, with some subset of them possibly already sequenced, and one wants to choose an additional fixed-size subset of individuals to sequence in such a way that the power to detect association is maximized. When the phenotyped sample includes related individuals, power for association can be gained by including partial information, such as phenotype data of ungenotyped relatives, in the analysis, and this should be taken into account when assessing whom to sequence. We propose G-STRATEGY, which uses simulated annealing to choose a subset of individuals for sequencing that maximizes the expected power for association. In simulations, G-STRATEGY performs extremely well for a range of complex disease models and outperforms other strategies with, in many cases, relative power increases of 20–40% over the next best strategy, while maintaining correct type 1 error. G-STRATEGY is computationally feasible even for large datasets and complex pedigrees. We apply G-STRATEGY to data on HDL and LDL from the AGES-Reykjavik and REFINE-Reykjavik studies, in which G-STRATEGY is able to closely approximate the power of sequencing the full sample by selecting only a small subset of the individuals for sequencing. PMID:27256766
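G-STRATEGY's simulated-annealing subset search can be illustrated on a toy problem. The additive per-individual "power" score below is a hypothetical stand-in for the expected association power the method actually evaluates (which depends jointly on the pedigree and phenotypes, not on individuals in isolation); cooling schedule and iteration count are illustrative.

```python
import math
import random

def sa_select(items, k, score, iters=2000, t0=1.0, cooling=0.995):
    """Simulated annealing over fixed-size subsets: propose swapping one
    selected item for an unselected one; accept worse moves with
    probability exp(delta / T), cooling T geometrically."""
    current = set(random.sample(items, k))
    cur_val = score(current)
    best, best_val = set(current), cur_val
    t = t0
    for _ in range(iters):
        out = random.choice(sorted(current))
        inn = random.choice(sorted(set(items) - current))
        cand = (current - {out}) | {inn}
        val = score(cand)
        if val > cur_val or random.random() < math.exp((val - cur_val) / t):
            current, cur_val = cand, val
            if val > best_val:
                best, best_val = set(cand), val
        t *= cooling
    return best, best_val

# Hypothetical per-individual "power contribution" weights for 6 candidates;
# choose the 3 individuals maximizing the summed score.
random.seed(0)
weights = {i: w for i, w in enumerate([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])}
subset, val = sa_select(list(weights), 3,
                        lambda s: sum(weights[i] for i in s))
```

The early high-temperature phase lets the search escape local optima that a greedy swap rule would get stuck in, which matters when the real score is non-additive over relatives.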
Yoo, Jin Eun
2018-01-01
A substantial body of research has been conducted on variables relating to students' mathematics achievement with TIMSS. However, most studies have employed conventional statistical methods and have focused on a select few indicators instead of utilizing the hundreds of variables TIMSS provides. This study aimed to find a prediction model for students' mathematics achievement using as many TIMSS student and teacher variables as possible. Elastic net, the machine learning technique selected in this study, takes advantage of both LASSO and ridge in terms of variable selection and multicollinearity, respectively. A logistic regression model was also employed to predict TIMSS 2011 Korean 4th graders' mathematics achievement. Ten-fold cross-validation with mean squared error was employed to determine the elastic net regularization parameter. Among the 162 TIMSS variables explored, 12 student and 5 teacher variables were selected in the elastic net model, and the prediction accuracy, sensitivity, and specificity were 76.06, 70.23, and 80.34%, respectively. This study showed that the elastic net method can be successfully applied to educational large-scale data by selecting a subset of variables with reasonable prediction accuracy and finding new variables to predict students' mathematics achievement. Newly found variables via machine learning can shed light on the existing theories from a totally different perspective, which in turn can prompt the creation of new theories or complement existing ones. This study also examined the current scale development convention from a machine learning perspective.
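The elastic net's variable-selection behavior can be shown with a stdlib-only sketch. The study fits a regularized logistic regression to TIMSS variables; here a least-squares toy with one irrelevant feature shows how the combined L1/L2 penalty drives its coefficient to zero. All data and hyperparameters are illustrative, and the solver is a plain proximal-gradient loop, not the study's software.

```python
import random

def elastic_net(X, y, alpha=0.05, l1_ratio=0.9, lr=0.1, iters=4000):
    """Proximal-gradient elastic net for least squares: minimize
    0.5/n * ||y - Xw||^2 + alpha*(l1_ratio*||w||_1
                                  + 0.5*(1 - l1_ratio)*||w||_2^2).
    The L1 part is handled by soft-thresholding after each gradient step."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    thr = lr * alpha * l1_ratio           # soft-threshold width per step
    for _ in range(iters):
        resid = [sum(X[i][j] * w[j] for j in range(p)) - y[i]
                 for i in range(n)]
        grad = [sum(resid[i] * X[i][j] for i in range(n)) / n
                + alpha * (1 - l1_ratio) * w[j] for j in range(p)]
        for j in range(p):
            z = w[j] - lr * grad[j]
            w[j] = 0.0 if abs(z) <= thr else (z - thr if z > 0 else z + thr)
    return w

# Toy data: y depends on features 0 and 1 only; feature 2 is irrelevant.
random.seed(0)
X = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(40)]
y = [2 * row[0] - row[1] for row in X]
w = elastic_net(X, y)
```

The L1 component zeroes out the irrelevant coefficient (sparse selection, as with the 17 variables kept out of 162), while the L2 component keeps correlated informative features from being arbitrarily dropped.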
Circulating B cells in type 1 diabetics exhibit fewer maturation-associated phenotypes.
Hanley, Patrick; Sutter, Jennifer A; Goodman, Noah G; Du, Yangzhu; Sekiguchi, Debora R; Meng, Wenzhao; Rickels, Michael R; Naji, Ali; Luning Prak, Eline T
2017-10-01
Although autoantibodies have been used for decades as diagnostic and prognostic markers in type 1 diabetes (T1D), further analysis of developmental abnormalities in B cells could reveal tolerance checkpoint defects that could improve individualized therapy. To evaluate B cell developmental progression in T1D, immunophenotyping was used to classify circulating B cells into transitional, mature naïve, mature activated, and resting memory subsets. Then each subset was analyzed for the expression of additional maturation-associated markers. While the frequencies of B cell subsets did not differ significantly between patients and controls, some T1D subjects exhibited reduced proportions of B cells that expressed transmembrane activator and CAML interactor (TACI) and Fas receptor (FasR). Furthermore, some T1D subjects had B cell subsets with lower frequencies of class switching. These results suggest circulating B cells exhibit variable maturation phenotypes in T1D. These phenotypic variations may correlate with differences in B cell selection in individual T1D patients. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Feature Selection and Pedestrian Detection Based on Sparse Representation.
Yao, Shihong; Wang, Tao; Shen, Weiming; Pan, Shaoming; Chong, Yanwen; Ding, Fei
2015-01-01
Pedestrian detection research is currently devoted to the extraction of effective pedestrian features, which has become one of the main obstacles to pedestrian detection applications owing to the variety of pedestrian features and their large dimension. Based on a theoretical analysis of six frequently used features, SIFT, SURF, Haar, HOG, LBP and LSS, and their comparison with experimental results, this paper screens sparse feature subsets via sparse representation to investigate whether the sparse subsets retain the same description abilities and contain the most stable features. When any two of the six features are fused, the fusion feature is sparsely represented to obtain its important components. Sparse subsets of the fusion features can be generated rapidly by avoiding calculation of the corresponding index of dimension numbers of these feature descriptors; thus, the calculation speed of the feature dimension reduction is improved and the pedestrian detection time is reduced. Experimental results show that sparse feature subsets are capable of keeping the important components of these six feature descriptors. The sparse features of HOG and LSS possess the same description ability as, and consume less time than, their full features. The ratios of the sparse feature subsets of HOG and LSS to their full sets are the highest among the six, and thus these two features best describe the characteristics of the pedestrian; moreover, the sparse feature subsets of the combined HOG-LSS feature show better distinguishing ability and parsimony.
VizieR Online Data Catalog: RR Lyraes in SDSS stripe 82 (Watkins+, 2009)
NASA Astrophysics Data System (ADS)
Watkins, L. L.; Evans, N. W.; Belokurov, V.; Smith, M. C.; Hewett, P. C.; Bramich, D. M.; Gilmore, G. F.; Irwin, M. J.; Vidrih, S.; Wyrzykowski, L.; Zucker, D. B.
2015-10-01
In this paper, we first select the variable objects in Stripe 82 and then the subset of RR Lyraes, using the Bramich et al. (2008MNRAS.386..887B, Cat. V/141) light-motion curve catalogue (LMCC) and HLC. We make a selection of the variable objects and an identification of RR Lyrae stars. (2 data files).
ERIC Educational Resources Information Center
Reynolds, Gemma; Reed, Phil
2013-01-01
Stimulus over-selectivity refers to the phenomenon whereby behavior is controlled by a subset of elements in the environment at the expense of other equally salient aspects of the environment. The experiments explored whether this cue interference effect was reduced following a surprising downward shift in reinforcer value. Experiment 1 revealed…
Ensemble Sparse Classification of Alzheimer’s Disease
Liu, Manhua; Zhang, Daoqiang; Shen, Dinggang
2012-01-01
The high-dimensional pattern classification methods, e.g., support vector machines (SVM), have been widely investigated for the analysis of structural and functional brain images (such as magnetic resonance imaging (MRI)) to assist the diagnosis of Alzheimer’s disease (AD), including its prodromal stage, i.e., mild cognitive impairment (MCI). Most existing classification methods extract features from neuroimaging data and then construct a single classifier to perform classification. However, due to noise and the small sample size of neuroimaging data, it is challenging to train a single global classifier that is robust enough to achieve good classification performance. In this paper, instead of building a single global classifier, we propose a local patch-based subspace ensemble method which builds multiple individual classifiers based on different subsets of local patches and then combines them for more accurate and robust classification. Specifically, to capture the local spatial consistency, each brain image is partitioned into a number of local patches and a subset of patches is randomly selected from the patch pool to build a weak classifier. Here, the sparse representation-based classification (SRC) method, which has been shown to be effective for classification of image data (e.g., faces), is used to construct each weak classifier. Then, multiple weak classifiers are combined to make the final decision. We evaluate our method on 652 subjects (including 198 AD patients, 225 MCI and 229 normal controls) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database using MR images. The experimental results show that our method achieves an accuracy of 90.8% and an area under the ROC curve (AUC) of 94.86% for AD classification, and an accuracy of 87.85% and an AUC of 92.90% for MCI classification, demonstrating a very promising performance compared with the state-of-the-art methods for AD/MCI classification using MR images. PMID:22270352
Li, Yiqing; Wang, Yu; Zi, Yanyang; Zhang, Mingquan
2015-10-21
The various multi-sensor signal features from a diesel engine constitute a complex high-dimensional dataset. The non-linear dimensionality reduction method t-distributed stochastic neighbor embedding (t-SNE) provides an effective way to implement data visualization for complex high-dimensional data. However, irrelevant features can deteriorate the performance of data visualization and thus should be eliminated a priori. This paper proposes a feature subset score based t-SNE (FSS-t-SNE) data visualization method to deal with the high-dimensional data collected from multi-sensor signals. In this method, the optimal feature subset is constructed by a feature subset score criterion, and the high-dimensional data are then visualized in 2-dimensional space. In tests on UCI datasets, FSS-t-SNE effectively improves classification accuracy. An experiment was performed with a large power marine diesel engine to validate the proposed method for diesel engine malfunction classification. Multi-sensor signals were collected by a cylinder vibration sensor and a cylinder pressure sensor. Compared with other conventional data visualization methods, the proposed method shows good visualization performance and high classification accuracy in multi-malfunction classification of a diesel engine.
Yu, Haiying; Kühne, Ralph; Ebert, Ralf-Uwe; Schüürmann, Gerrit
2010-11-22
For 1143 organic compounds comprising 580 oxygen acids and 563 nitrogen bases that span more than 17 experimental pK(a) units (from -5.00 to 12.23), the pK(a) prediction performances of ACD, SPARC, and two calibrations of a semiempirical quantum chemical (QC) AM1 approach have been analyzed. The overall root-mean-square errors (rms) for the acids are 0.41, 0.58 (0.42 without ortho-substituted phenols with intramolecular H-bonding), and 0.55 and for the bases are 0.65, 0.70, 1.17, and 1.27 for ACD, SPARC, and both QC methods, respectively. Method-specific performances are discussed in detail for six acid subsets (phenols and aromatic and aliphatic carboxylic acids with different substitution patterns) and nine base subsets (anilines, primary, secondary and tertiary amines, meta/para-substituted and ortho-substituted pyridines, pyrimidines, imidazoles, and quinolines). The results demonstrate an overall better performance for acids than for bases but also a substantial variation across subsets. For the overall best-performing ACD, rms ranges from 0.12 to 1.11 and 0.40 to 1.21 pK(a) units for the acid and base subsets, respectively. With regard to the squared correlation coefficient r², the results are 0.86 to 0.96 (acids) and 0.79 to 0.95 (bases) for ACD, 0.77 to 0.95 (acids) and 0.85 to 0.97 (bases) for SPARC, and 0.64 to 0.87 (acids) and 0.43 to 0.83 (bases) for the QC methods, respectively. Attention is paid to structural and method-specific causes for observed pitfalls. The significant subset dependence of the prediction performances suggests a consensus modeling approach.
Sample selection via angular distance in the space of the arguments of an artificial neural network
NASA Astrophysics Data System (ADS)
Fernández Jaramillo, J. M.; Mayerle, R.
2018-05-01
In the construction of an artificial neural network (ANN), a proper splitting of the available samples plays a major role in the training process. This selection of subsets for training, testing and validation affects the generalization ability of the neural network. The number of samples also has an impact on the time required for the design of the ANN and the training. This paper introduces an efficient and simple method for reducing the set of samples used for training a neural network. The method reduces the time required to calculate the network coefficients, while keeping the diversity and avoiding overtraining of the ANN due to the presence of similar samples. The proposed method is based on the calculation of the angle between two vectors, each one representing one input of the neural network. When the angle formed between samples is smaller than a defined threshold, only one input is accepted for the training. The accepted inputs are scattered throughout the sample space. Tidal records are used to demonstrate the proposed method. The results of a cross-validation show that with few inputs the outputs are inaccurate and depend on the selection of the first sample, but as the number of inputs increases the accuracy improves and the differences among scenarios with different starting samples are markedly reduced. A comparison with the K-means clustering algorithm shows that, for this application, the proposed method produces a more accurate network with a smaller number of samples.
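The angular-distance selection rule above can be sketched as a greedy filter: a sample is accepted only if its angle to every already-accepted sample exceeds the threshold. This is a minimal sketch under the stated rule; the function name and threshold value are illustrative, not from the paper.

```python
import numpy as np

def select_by_angle(samples, threshold_deg):
    # Greedily accept a sample only if its angle to every previously
    # accepted sample exceeds the threshold, keeping a diverse subset.
    accepted = []
    for i, v in enumerate(samples):
        ok = True
        for j in accepted:
            u = samples[j]
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
            if angle < threshold_deg:
                ok = False
                break
        if ok:
            accepted.append(i)
    return accepted
```

Note that the result depends on the order of the samples (and hence on the first sample), which is exactly the starting-sample sensitivity the cross-validation in the paper examines.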
Longato, Enrico; Garrido, Maria; Saccardo, Desy; Montesinos Guevara, Camila; Mani, Ali R; Bolognesi, Massimo; Amodio, Piero; Facchinetti, Andrea; Sparacino, Giovanni; Montagnese, Sara
2017-01-01
A popular method to estimate proximal/distal temperature (TPROX and TDIST) consists of calculating a weighted average of nine wireless sensors placed on pre-defined skin locations. Specifically, TPROX is derived from five sensors placed on the infra-clavicular and mid-thigh area (left and right) and abdomen, and TDIST from four sensors located on the hands and feet. In clinical practice, the loss/removal of one or more sensors is a common occurrence, but limited information is available on how this affects the accuracy of temperature estimates. The aim of this study was to determine the accuracy of temperature estimates in relation to the number/position of sensors removed. Thirteen healthy subjects wore all nine sensors for 24 hours, and reference TPROX and TDIST time-courses were calculated using all sensors. Then, all possible combinations of reduced subsets of sensors were simulated and suitable weights for each sensor calculated. The accuracy of TPROX and TDIST estimates resulting from the reduced subsets of sensors, compared to reference values, was assessed by the mean squared error, the mean absolute error (MAE), the cross-validation error and the 25th and 75th percentiles of the reconstruction error. Tables of the accuracy and sensor weights for all possible combinations of sensors are provided. For instance, in relation to TPROX, a subset of three sensors placed in any combination of three non-homologous areas (abdominal, right or left infra-clavicular, right or left mid-thigh) produced an error of 0.13°C MAE, while the loss/removal of the abdominal sensor resulted in an error of 0.25°C MAE, having the greatest impact on the quality of the reconstruction. This information may help researchers/clinicians: i) evaluate the expected goodness of their TPROX and TDIST estimates based on the number of available sensors; ii) select the most appropriate subset of sensors, depending on goals and operational constraints.
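The reduced-subset evaluation above can be sketched as follows. For simplicity this sketch assigns equal weights to the remaining sensors, whereas the study recomputes suitable weights for every subset; the function names and the toy data are assumptions.

```python
import numpy as np

def weighted_estimate(temps, weights):
    # Weighted average over the available sensors; weights are
    # renormalized so they always sum to one.
    w = np.asarray(weights, dtype=float)
    return np.sum(np.asarray(temps) * w / w.sum(), axis=-1)

def mae_of_subset(T, full_weights, keep):
    # Compare the reduced-subset estimate (equal weights here, purely
    # illustrative; the paper derives suitable weights per subset)
    # against the reference computed from all sensors.
    ref = weighted_estimate(T, full_weights)
    sub = weighted_estimate(T[:, keep], np.ones(len(keep)))
    return np.abs(sub - ref).mean()
```

Running this over every subset of sensor columns reproduces the kind of accuracy table the study provides, one MAE per sensor combination.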
NASA Astrophysics Data System (ADS)
Ashok, Praveen C.; Praveen, Bavishna B.; Campbell, Elaine C.; Dholakia, Kishan; Powis, Simon J.
2014-03-01
Leucocytes in the blood of mammals form a powerful protective system against a wide range of dangerous pathogens. There are several types of immune cells, each with a specific role in the overall immune system. The number and type of immune cells alter in the disease state, and identifying the type of immune cell provides information about a person's state of health. Several immune cell subsets are essentially morphologically identical and require external labeling to enable discrimination. Here we demonstrate the feasibility of using Wavelength Modulated Raman Spectroscopy (WMRS) with suitable machine learning algorithms as a label-free method to distinguish between closely related immune cell subsets. Principal Component Analysis (PCA) was performed on WMRS data from single cells, obtained using confocal Raman microscopy, for feature reduction, followed by a Support Vector Machine (SVM) for binary discrimination of the various cell subsets, which yielded an accuracy >85%. The method successfully discriminated between untouched and unfixed purified populations of CD4+CD3+ and CD8+CD3+ T lymphocyte subsets, and CD56+CD3- natural killer cells, with a high degree of specificity. It also proved sensitive enough to identify unique Raman signatures that allow clear discrimination between dendritic cell subsets, comprising CD303+CD45+ plasmacytoid and CD1c+CD141+ myeloid dendritic cells. The results of this study clearly show that WMRS is highly sensitive and can distinguish between cell types that are morphologically identical.
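The PCA feature-reduction step in the pipeline above can be sketched with a plain SVD; a binary classifier (the paper uses an SVM) would then be trained on the reduced scores. The component count and data shapes are illustrative assumptions.

```python
import numpy as np

def pca_reduce(X, n_components):
    # Centre the spectra and project onto the top principal components
    # via SVD; a downstream binary classifier (e.g., a linear SVM)
    # would then be trained on these reduced scores.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

Reducing each high-dimensional Raman spectrum to a handful of principal-component scores is what makes the subsequent binary discrimination tractable with modest numbers of cells.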
Varvil-Weld, Lindsey; Mallett, Kimberly A; Turrisi, Rob; Abar, Caitlin C
2012-05-01
Previous research identified a high-risk subset of college students experiencing a disproportionate number of alcohol-related consequences at the end of their first year. With the goal of identifying pre-college predictors of membership in this high-risk subset, the present study used a prospective design to identify latent profiles of student-reported maternal and paternal parenting styles and alcohol-specific behaviors and to determine whether these profiles were associated with membership in the high-risk consequences subset. A sample of 370 randomly selected incoming first-year students at a large public university reported on their mothers' and fathers' communication quality, monitoring, approval of alcohol use, and modeling of drinking behaviors and on consequences experienced across the first year of college. Students in the high-risk subset comprised 15.5% of the sample but accounted for almost half (46.6%) of the total consequences reported by the entire sample. Latent profile analyses identified four parental profiles: positive pro-alcohol, positive anti-alcohol, negative mother, and negative father. Logistic regression analyses revealed that students in the negative-father profile were at greatest odds of being in the high-risk consequences subset at a follow-up assessment 1 year later, even after drinking at baseline was controlled for. Students in the positive pro-alcohol profile also were at increased odds of being in the high-risk subset, although this association was attenuated after baseline drinking was controlled for. These findings have important implications for the improvement of existing parent- and individual-based college student drinking interventions designed to reduce alcohol-related consequences.
Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows
NASA Astrophysics Data System (ADS)
Ivanov, Mark V.; Levitsky, Lev I.; Gorshkov, Mikhail V.
2016-09-01
A number of proteomic database search engines implement multi-stage strategies aiming at increasing the sensitivity of proteome analysis. These approaches often employ a subset of the original database for the secondary stage of analysis. However, if the target-decoy approach (TDA) is used for false discovery rate (FDR) estimation, the multi-stage strategies may violate the underlying assumption of TDA that false matches are distributed uniformly across the target and decoy databases. This violation occurs if the numbers of target and decoy proteins selected for the second search are not equal. Here, we propose a method of decoy database generation based on the previously reported decoy fusion strategy. This method allows unbiased TDA-based FDR estimation in multi-stage searches and can be easily integrated into existing workflows utilizing popular search engines and post-search algorithms.
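For context, the classic target-decoy FDR estimate that the decoy fusion strategy is designed to keep unbiased can be sketched as follows: at each score threshold the FDR is estimated as the ratio of decoy to target matches above it. This is the textbook TDA calculation, not the paper's decoy fusion method itself.

```python
def tda_fdr(psms):
    # psms: iterable of (score, is_decoy) peptide-spectrum matches.
    # Classic target-decoy FDR at each descending score threshold:
    # estimated FDR = (#decoy matches) / (#target matches) so far.
    targets = decoys = 0
    fdrs = []
    for score, is_decoy in sorted(psms, key=lambda p: -p[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    return fdrs
```

The uniformity assumption enters here: the decoy count is a valid proxy for false target matches only if false matches land on targets and decoys equally often, which is exactly what an unbalanced second-stage database breaks.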
Facial Affect Recognition Using Regularized Discriminant Analysis-Based Algorithms
NASA Astrophysics Data System (ADS)
Lee, Chien-Cheng; Huang, Shin-Sheng; Shih, Cheng-Yuan
2010-12-01
This paper presents a novel and effective method for facial expression recognition, including happiness, disgust, fear, anger, sadness, surprise, and the neutral state. The proposed method utilizes a regularized discriminant analysis-based boosting algorithm (RDAB) with effective Gabor features to recognize the facial expressions. An entropy criterion is applied to select effective Gabor features, a subset of informative and nonredundant Gabor features. The proposed RDAB algorithm uses RDA as a learner in the boosting algorithm. The RDA combines strengths of linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). It solves the small sample size and ill-posed problems suffered by QDA and LDA through a regularization technique. Additionally, this study uses the particle swarm optimization (PSO) algorithm to estimate optimal parameters in RDA. Experimental results demonstrate that our approach can accurately and robustly recognize facial expressions.
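The RDA regularization that blends LDA and QDA can be sketched as the usual two-parameter covariance estimator: lambda pulls each class covariance toward the pooled covariance (QDA toward LDA), and gamma shrinks the result toward a scaled identity to fix ill-posedness. This follows the standard Friedman-style formulation; the paper estimates the two parameters with PSO, which is not shown here.

```python
import numpy as np

def rda_covariance(S_k, n_k, S_pool, n, lam, gamma):
    # Regularized class covariance for RDA:
    #   lam   blends the class covariance toward the pooled covariance,
    #   gamma shrinks the blended result toward a scaled identity,
    # stabilizing estimation when the class sample size is small.
    S = ((1 - lam) * n_k * S_k + lam * n * S_pool) / ((1 - lam) * n_k + lam * n)
    d = S.shape[0]
    return (1 - gamma) * S + gamma * (np.trace(S) / d) * np.eye(d)
```

At (lam, gamma) = (0, 0) this reduces to QDA's per-class covariance, and at (1, 0) to LDA's pooled covariance, which is the sense in which RDA interpolates between the two.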
Analysis of high-throughput biological data using their rank values.
Dembélé, Doulaye
2018-01-01
High-throughput biological technologies are routinely used to generate gene expression profiling or cytogenetics data. To achieve high performance, methods available in the literature become more specialized and often require high computational resources. Here, we propose a new versatile method based on the data-ordering rank values. We use linear algebra, the Perron-Frobenius theorem, and also extend a method presented earlier for searching differentially expressed genes to the detection of recurrent copy number aberrations. A result derived from the proposed method is a one-sample Student's t-test based on rank values. The proposed method is, to our knowledge, the only one that applies to both gene expression profiling and cytogenetics data sets. This new method is fast, deterministic, and requires a low computational load. Probabilities are associated with genes to allow a statistically significant subset selection in the data set. Stability scores are also introduced as quality parameters. The performance and comparative analyses were carried out using real data sets. The proposed method can be accessed through an R package available from the CRAN (Comprehensive R Archive Network) website: https://cran.r-project.org/web/packages/fcros.
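The rank-based one-sample t-test mentioned above can be sketched as follows. This is a generic illustration, not the fcros implementation: it assumes a gene's rank values have been scaled into (0, 1) across several pairwise comparisons and tests whether their mean departs from the null value 0.5.

```python
import math

def one_sample_t_on_ranks(rank_values, mu0=0.5):
    # rank_values: a gene's scaled rank values (in (0, 1)) across
    # several comparisons; test whether their mean departs from mu0,
    # the value expected for a gene with no consistent change.
    n = len(rank_values)
    mean = sum(rank_values) / n
    var = sum((r - mean) ** 2 for r in rank_values) / (n - 1)
    t = (mean - mu0) / math.sqrt(var / n)
    return t  # compare against a t distribution with n - 1 d.f.
```

Working on rank values rather than raw intensities is what makes the statistic fast, deterministic, and robust to the scale of the original measurements.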
Graves, Tabitha A.; Royle, J. Andrew; Kendall, Katherine C.; Beier, Paul; Stetz, Jeffrey B.; Macleod, Amy C.
2012-01-01
Using multiple detection methods can increase the number, kind, and distribution of individuals sampled, which may increase accuracy and precision and reduce cost of population abundance estimates. However, when variables influencing abundance are of interest, if individuals detected via different methods are influenced by the landscape differently, separate analysis of multiple detection methods may be more appropriate. We evaluated the effects of combining two detection methods on the identification of variables important to local abundance using detections of grizzly bears with hair traps (systematic) and bear rubs (opportunistic). We used hierarchical abundance models (N-mixture models) with separate model components for each detection method. If both methods sample the same population, the use of either data set alone should (1) lead to the selection of the same variables as important and (2) provide similar estimates of relative local abundance. We hypothesized that the inclusion of 2 detection methods versus either method alone should (3) yield more support for variables identified in single method analyses (i.e. fewer variables and models with greater weight), and (4) improve precision of covariate estimates for variables selected in both separate and combined analyses because sample size is larger. As expected, joint analysis of both methods increased precision as well as certainty in variable and model selection. However, the single-method analyses identified different variables and the resulting predicted abundances had different spatial distributions. We recommend comparing single-method and jointly modeled results to identify the presence of individual heterogeneity between detection methods in N-mixture models, along with consideration of detection probabilities, correlations among variables, and tolerance to risk of failing to identify variables important to a subset of the population. 
The benefits of increased precision should be weighed against those risks. The analysis framework presented here will be useful for other species exhibiting heterogeneity by detection method.
Zhang, Zhao; Zhao, Mingbo; Chow, Tommy W S
2012-12-01
In this work, the semi-supervised dimensionality reduction (DR) problem of learning sub-manifold projections from partially constrained data is discussed. Two semi-supervised DR algorithms termed Marginal Semi-Supervised Sub-Manifold Projections (MS³MP) and orthogonal MS³MP (OMS³MP) are proposed. MS³MP in the singular case is also discussed. We also present the weighted least squares view of MS³MP. Based on specifying the types of neighborhoods with pairwise constraints (PC) and the defined manifold scatters, our methods can preserve the local properties of all points and the discriminant structures embedded in the localized PC. The sub-manifolds of different classes can also be separated. In PC-guided methods, exploring and selecting the informative constraints is challenging, and random constraint subsets significantly affect the performance of the algorithms. This paper also introduces an effective technique to select the informative constraints for DR with consistent constraints. The analytic form of the projection axes can be obtained by eigen-decomposition. The connections between this work and other related work are also elaborated. The validity of the proposed constraint selection approach and DR algorithms is evaluated on benchmark problems. Extensive simulations show that our algorithms can deliver promising results over some widely used state-of-the-art semi-supervised DR techniques. Copyright © 2012 Elsevier Ltd. All rights reserved.
Graph Learning in Knowledge Bases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Goldberg, Sean; Wang, Daisy Zhe
The amount of text data has been growing exponentially in recent years, giving rise to automatic information extraction methods that store text annotations in a database. The current state-of-the-art structured prediction methods, however, are likely to contain errors, and it is important to be able to manage the overall uncertainty of the database. On the other hand, the advent of crowdsourcing has enabled humans to aid machine algorithms at scale. As part of this project we introduced pi-CASTLE, a system that optimizes and integrates human and machine computing as applied to a complex structured prediction problem involving conditional random fields (CRFs). We proposed strategies grounded in information theory to select a token subset, formulate questions for the crowd to label, and integrate these labelings back into the database using a method of constrained inference. On both a text segmentation task over academic citations and a named entity recognition task over tweets, we showed an order of magnitude improvement in accuracy gain over baseline methods.
Cohen, Alan A; Milot, Emmanuel; Yong, Jian; Seplaki, Christopher L; Fülöp, Tamàs; Bandeen-Roche, Karen; Fried, Linda P
2013-03-01
Previous studies have identified many biomarkers that are associated with aging and related outcomes, but the relevance of these markers for underlying processes and their relationship to hypothesized systemic dysregulation is not clear. We address this gap by presenting a novel method for measuring dysregulation via the joint distribution of multiple biomarkers and assessing associations of dysregulation with age and mortality. Using longitudinal data from the Women's Health and Aging Study, we selected a 14-marker subset from 63 blood measures: those that diverged from the baseline population mean with age. For the 14 markers and all combinatorial sub-subsets we calculated a multivariate distance called the Mahalanobis distance (MHBD) for all observations, indicating how "strange" each individual's biomarker profile was relative to the baseline population mean. In most models, MHBD correlated positively with age, MHBD increased within individuals over time, and higher MHBD predicted higher risk of subsequent mortality. Predictive power increased as more variables were incorporated into the calculation of MHBD. Biomarkers from multiple systems were implicated. These results support hypotheses of simultaneous dysregulation in multiple systems and confirm the need for longitudinal, multivariate approaches to understanding biomarkers in aging. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
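The multivariate distance used above can be sketched directly: the Mahalanobis distance of each individual's biomarker profile from the baseline population mean, scaled by the baseline covariance. The function name and toy data are illustrative.

```python
import numpy as np

def mahalanobis(X, ref):
    # Distance of each biomarker profile (row of X) from the reference
    # (baseline) population mean, scaled by the reference covariance,
    # so correlated markers are not double-counted.
    mu = ref.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(ref, rowvar=False))
    diff = X - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
```

A profile sitting exactly at the baseline mean scores zero, and the score grows as the profile becomes "stranger" relative to the baseline correlation structure, which is the sense in which MHBD measures dysregulation.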
Selection of core animals in the Algorithm for Proven and Young using a simulation model.
Bradford, H L; Pocrnić, I; Fragomeni, B O; Lourenco, D A L; Misztal, I
2017-12-01
The Algorithm for Proven and Young (APY) enables the implementation of single-step genomic BLUP (ssGBLUP) in large, genotyped populations by separating genotyped animals into core and non-core subsets and creating a computationally efficient inverse for the genomic relationship matrix (G). As APY has become the method of choice for large-scale genomic evaluations in BLUP-based methods, a common question is how to choose the animals in the core subset. We compared several core definitions to answer this question. Simulations comprised a moderately heritable trait for 95,010 animals and 50,000 genotypes for animals across five generations. Genotypes consisted of 25,500 SNPs distributed across 15 chromosomes. Genotyping errors and missing pedigrees were also mimicked. Core animals were defined based on individual generations, equal representation across generations, and at random. For a sufficiently large core size, all core definitions gave the same accuracies and biases, even if the core animals had imperfect genotypes. When genotyped animals had unknown parents, accuracy and bias were significantly better (p ≤ .05) for the random and across-generation core definitions. © 2017 The Authors. Journal of Animal Breeding and Genetics Published by Blackwell Verlag GmbH.
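The core/non-core construction behind APY can be sketched as follows: non-core animals are modelled as conditionally independent given the core, which makes the non-core block of the inverse diagonal and cheap to compute. This is a dense numpy illustration of the standard APY inverse formula, not production breeding-evaluation code.

```python
import numpy as np

def apy_inverse(G, core):
    # APY inverse of the genomic relationship matrix G:
    # non-core animals are treated as conditionally independent given
    # the core, so only the core block is inverted directly and the
    # non-core block of the inverse is diagonal.
    n = G.shape[0]
    core_set = set(core)
    noncore = [i for i in range(n) if i not in core_set]
    Gcc = G[np.ix_(core, core)]
    Gcn = G[np.ix_(core, noncore)]
    Gcc_inv = np.linalg.inv(Gcc)
    P = (Gcc_inv @ Gcn).T  # regression of non-core on core animals
    # Residual variances of non-core animals given the core:
    m = np.diag(G)[noncore] - np.einsum('ij,ji->i', P, Gcn)
    Minv = np.diag(1.0 / m)
    Ginv = np.zeros_like(G)
    Ginv[np.ix_(core, core)] = Gcc_inv + P.T @ Minv @ P
    Ginv[np.ix_(core, noncore)] = -P.T @ Minv
    Ginv[np.ix_(noncore, core)] = -Minv @ P
    Ginv[np.ix_(noncore, noncore)] = Minv
    return Ginv
```

When the conditional-independence assumption holds exactly, this reproduces the true inverse; in practice it is an approximation whose quality depends on the core size and composition, which is what the study's core-definition comparison probes.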
An Iris Segmentation Algorithm based on Edge Orientation for Off-angle Iris Recognition
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karakaya, Mahmut; Barstow, Del R; Santos-Villalobos, Hector J
Iris recognition is known as one of the most accurate and reliable biometrics. However, the accuracy of iris recognition systems depends on the quality of data capture and is negatively affected by several factors such as angle, occlusion, and dilation. In this paper, we present a segmentation algorithm for off-angle iris images that uses edge detection, edge elimination, edge classification, and ellipse fitting techniques. In our approach, we first detect all candidate edges in the iris image by using the Canny edge detector; this collection contains edges from the iris and pupil boundaries as well as eyelashes, eyelids, iris texture, etc. Edge orientation is used to eliminate the edges that cannot be part of the iris or pupil. Then, we classify the remaining edge points into two sets: pupil edges and iris edges. Finally, we randomly generate subsets of iris and pupil edge points, fit ellipses for each subset, select ellipses with similar parameters, and average them to form the resultant ellipses. Based on the results of real experiments, the proposed method shows effective segmentation of off-angle iris images.
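The random-subset fit-and-average step above can be sketched with circles instead of ellipses to keep the least-squares fit short; the structure (fit many random subsets, keep mutually similar fits, average them) is the same. All parameter values and the similarity gate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_circle(pts):
    # Algebraic least-squares circle fit: x^2 + y^2 = a*x + b*y + c.
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    b = (pts ** 2).sum(axis=1)
    a, bb, c = np.linalg.lstsq(A, b, rcond=None)[0]
    cx, cy = a / 2, bb / 2
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    return cx, cy, r

def consensus_circle(edge_pts, n_subsets=20, subset_size=5):
    # Fit circles to random subsets of edge points, keep fits whose
    # parameters agree with the median fit, and average the survivors;
    # outlier edge points only corrupt a minority of the subset fits.
    fits = []
    for _ in range(n_subsets):
        idx = rng.choice(len(edge_pts), size=subset_size, replace=False)
        fits.append(fit_circle(edge_pts[idx]))
    fits = np.array(fits)
    med = np.median(fits, axis=0)
    keep = np.all(np.abs(fits - med) < 1.0, axis=1)  # crude similarity gate
    return fits[keep].mean(axis=0)
```

The paper fits ellipses (five parameters) rather than circles, but the robustness argument is identical: averaging only the mutually consistent subset fits suppresses the influence of stray eyelash or texture edges.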
Gutiérrez-López-Franca, Carlos; Hervás, Ramón; Johnson, Esperanza
2018-01-01
This paper aims to improve activity recognition systems based on skeletal tracking through the study of two different strategies (and their combination): (a) specialized body-part analysis and (b) stricter restrictions for the most easily detectable activities. The study was performed using the Extended Body-Angles Algorithm, which is able to analyze activities using only a single key sample. This system allows selecting, for each considered activity, its relevant joints, which makes it possible to monitor only a subset of the joints of the user's body. This feature has both advantages and disadvantages; as a consequence, in the past we had some difficulties recognizing activities for which only a small subset of the body's joints is relevant. The goal of this work, therefore, is to analyze the effect produced by applying several strategies to an activity recognition system based on skeletal-tracking, joint-oriented devices, with the purpose of improving the recognition rates of activities with a small subset of relevant joints. Through the results of this work, we aim to give the scientific community first indications about which of the considered strategies is better. PMID:29789478
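The body-angle comparison restricted to an activity's relevant joints can be sketched as follows. This is a minimal illustration, not the Extended Body-Angles Algorithm itself: joint names, the key-sample format, and the tolerance value are assumptions.

```python
import numpy as np

def joint_angle(a, b, c):
    # Angle at joint b (in degrees) formed by segments b->a and b->c,
    # e.g., the elbow angle from shoulder, elbow, and wrist positions.
    u = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def matches_key_sample(angles, key_angles, relevant, tol_deg):
    # Activity check restricted to the activity's relevant joints; a
    # stricter (smaller) tol_deg mimics the paper's tighter restrictions
    # for the most easily detectable activities.
    return all(abs(angles[j] - key_angles[j]) <= tol_deg for j in relevant)
```

Restricting the comparison to the relevant joints is what lets one key sample describe an activity, and tightening tol_deg for easily detectable activities is one way to reduce false matches against activities with few relevant joints.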
Moon, James J; Dash, Pradyot; Oguin, Thomas H; McClaren, Jennifer L; Chu, H Hamlet; Thomas, Paul G; Jenkins, Marc K
2011-08-30
It is currently thought that T cells with specificity for self-peptide/MHC (pMHC) ligands are deleted during thymic development, thereby preventing autoimmunity. In the case of CD4(+) T cells, what is unclear is the extent to which self-peptide/MHC class II (pMHCII)-specific T cells are deleted or become Foxp3(+) regulatory T cells. We addressed this issue by characterizing a natural polyclonal pMHCII-specific CD4(+) T-cell population in mice that either lacked or expressed the relevant antigen in a ubiquitous pattern. Mice expressing the antigen contained one-third the number of pMHCII-specific T cells as mice lacking the antigen, and the remaining cells exhibited low TCR avidity. In mice lacking the antigen, the pMHCII-specific T-cell population was dominated by phenotypically naive Foxp3(-) cells, but also contained a subset of Foxp3(+) regulatory cells. Both Foxp3(-) and Foxp3(+) pMHCII-specific T-cell numbers were reduced in mice expressing the antigen, but the Foxp3(+) subset was more resistant to changes in number and TCR repertoire. Therefore, thymic selection of self-pMHCII-specific CD4(+) T cells results in incomplete deletion within the normal polyclonal repertoire, especially among regulatory T cells.