ERIC Educational Resources Information Center
Brusco, Michael J.; Singh, Renu; Steinley, Douglas
2009-01-01
The selection of a subset of variables from a pool of candidates is an important problem in several areas of multivariate statistics. Within the context of principal component analysis (PCA), a number of authors have argued that subset selection is crucial for identifying those variables that are required for correct interpretation of the…
Tang, Rongnian; Chen, Xupeng; Li, Chuang
2018-05-01
Near-infrared spectroscopy is an efficient, low-cost technology that has potential as an accurate method for detecting the nitrogen content of natural rubber leaves. The successive projections algorithm (SPA) is a widely used variable selection method for multivariate calibration, which uses projection operations to select a variable subset with minimum multi-collinearity. However, because the correlation between variables fluctuates, high collinearity may still exist among non-adjacent variables of the subset obtained by basic SPA. Based on an analysis of the correlation matrix of the spectral data, this paper proposes a correlation-based SPA (CB-SPA) that applies the successive projections algorithm within regions of consistent correlation. The results show that CB-SPA selects variable subsets with more valuable variables and less multi-collinearity. Meanwhile, models established with the CB-SPA subset outperform those based on basic SPA subsets in predicting nitrogen content, in terms of both cross-validation and external prediction. Moreover, CB-SPA is more efficient: the time cost of its selection procedure is one-twelfth that of basic SPA.
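For readers unfamiliar with SPA, the following is a minimal sketch of the basic algorithm in Python/NumPy: starting from one wavelength, every remaining column is repeatedly projected onto the orthogonal complement of the most recently selected column, and the column with the largest residual norm is picked next. The CB-SPA variant described above additionally restricts this procedure to regions of consistent correlation, which is not shown here.

```python
import numpy as np

def spa(X, n_select, start=0):
    """Basic successive projections algorithm (SPA).

    Greedily picks columns of X with minimal collinearity: at each
    step every remaining column is projected onto the orthogonal
    complement of the last selected column, and the column with the
    largest residual norm is chosen next.
    """
    Xp = X.astype(float).copy()
    selected = [start]
    for _ in range(n_select - 1):
        v = Xp[:, selected[-1]]
        # project every column onto the complement of v
        Xp = Xp - np.outer(v, (v @ Xp) / (v @ v))
        # already-selected columns now have ~zero norm, so argmax
        # naturally avoids re-selecting them
        norms = np.linalg.norm(Xp, axis=0)
        selected.append(int(np.argmax(norms)))
    return selected

# toy usage: 50 spectra with 200 wavelength channels
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
print(spa(X, n_select=5))
```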
NASA Astrophysics Data System (ADS)
Chen, Jie; Brissette, François P.; Lucas-Picher, Philippe
2016-11-01
Given the ever-increasing number of climate change simulations being carried out, it has become impractical to use all of them to cover the uncertainty of climate change impacts. Various methods have been proposed to optimally select subsets of a large ensemble of climate simulations for impact studies. However, the behaviour of optimally-selected subsets of climate simulations for climate change impacts is unknown, since the transfer process from climate projections to the impact study world is usually highly non-linear. Consequently, this study investigates the transferability of optimally-selected subsets of climate simulations in the case of hydrological impacts. Two different methods were used for the optimal selection of subsets of climate scenarios, and both were found to be capable of adequately representing the spread of selected climate model variables contained in the original large ensemble. However, in both cases, the optimal subsets had limited transferability to hydrological impacts. To capture a similar variability in the impact model world, many more simulations have to be used than are needed to simply cover variability from the climate model variables' perspective. Overall, both optimal subset selection methods were better than random selection when small subsets were selected from a large ensemble for impact studies. However, as the number of selected simulations increased, random selection often performed better than the two optimal methods. To ensure adequate uncertainty coverage, the results of this study imply that selecting as many climate change simulations as possible is the best avenue. Where this is not possible, the two optimal methods were found to perform adequately.
A non-linear data mining parameter selection algorithm for continuous variables
Razavi, Marianne; Brady, Sean
2017-01-01
In this article, we propose a new data mining algorithm that both captures the non-linearity in data and finds the best subset model. To produce an enhanced subset of the original variables, a preferred selection method should have the potential of adding a supplementary level of regression analysis that captures complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method presented here has the potential to produce an optimal subset of variables, rendering the overall process of model selection more efficient. The algorithm introduces interpretable parameters by transforming the original inputs and also provides a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least squares regression framework. This new automatic variable transformation and model selection method can offer an optimal and stable model that minimizes the mean square error and variability, while combining all-possible-subsets selection methodology with the inclusion of variable transformations and interactions. Moreover, the method controls multicollinearity, leading to an optimal set of explanatory variables. PMID:29131829
Vanderhaeghe, F; Smolders, A J P; Roelofs, J G M; Hoffmann, M
2012-03-01
Selecting an appropriate variable subset in linear multivariate methods is an important methodological issue for ecologists. Interest often exists in obtaining general predictive capacity or in finding causal inferences from predictor variables. Because of a lack of solid knowledge on a studied phenomenon, scientists explore predictor variables in order to find the most meaningful (i.e. discriminating) ones. As an example, we modelled the response of the amphibious softwater plant Eleocharis multicaulis using canonical discriminant function analysis. We asked how variables can be selected through comparison of several methods: univariate Pearson chi-square screening, principal components analysis (PCA) and step-wise analysis, as well as combinations of some methods. We expected PCA to perform best. The selected methods were evaluated through fit and stability of the resulting discriminant functions and through correlations between these functions and the predictor variables. The chi-square subset, at P < 0.05, followed by a step-wise sub-selection, gave the best results. Contrary to expectations, PCA performed poorly, as did step-wise analysis. The different chi-square subset methods all yielded ecologically meaningful variables, while probable noise variables were also selected by PCA and step-wise analysis. We advise against the simple use of PCA or step-wise discriminant analysis to obtain an ecologically meaningful variable subset: the former because it does not take into account the response variable, the latter because noise variables are likely to be selected. We suggest that univariate screening techniques are a worthwhile alternative for variable selection in ecology.
Ballabio, Davide; Consonni, Viviana; Mauri, Andrea; Todeschini, Roberto
2010-01-11
In multivariate regression and classification problems, variable selection is an important procedure used to select an optimal subset of variables, with the aim of producing more parsimonious and potentially more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and highly dimensional analytical procedures. In this paper a novel method for variable selection for classification purposes is introduced. This method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). The CMC index is here calculated for two specific sets of variables, the former comprising the independent variables and the latter the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to give a ranking of the variables on the basis of their class discrimination capability. Alternatively, the CMC index can be calculated for all possible combinations of variables and the variable subset with the maximal CMC can be selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks' Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. A variable forward selection based on the CMC index was finally used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets, with encouraging results.
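The CMC index itself is defined in the cited paper; as an illustration of the one-variable-at-a-time ranking it enables, the sketch below scores each variable by its squared canonical correlation with the unfolded class matrix, which for a single variable reduces to the R² of regressing that variable on the class indicators. This is a simplification for illustration, not the authors' exact formula.

```python
import numpy as np

def class_unfold(labels):
    """Unfold a label vector into a binary class-indicator matrix."""
    classes = np.unique(labels)
    return (labels[:, None] == classes[None, :]).astype(float)

def single_variable_canonical_r2(x, Y):
    # canonical correlation between one variable and the indicator
    # matrix reduces to the R^2 of regressing x on Y
    Yc = Y - Y.mean(axis=0)
    xc = x - x.mean()
    beta, *_ = np.linalg.lstsq(Yc, xc, rcond=None)
    fitted = Yc @ beta
    return float((fitted @ fitted) / (xc @ xc))

def rank_variables(X, labels):
    """Rank columns of X by class-discrimination score, descending."""
    Y = class_unfold(labels)
    scores = np.array([single_variable_canonical_r2(X[:, j], Y)
                       for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]
    return order, scores[order]
```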
Gunter, Lacey; Zhu, Ji; Murphy, Susan
2012-01-01
For many years, subset analysis has been a popular topic in the biostatistics and clinical trials literature. More recently, the discussion has focused on finding subsets of genomes which play a role in the effect of treatment, often referred to as stratified or personalized medicine. Though highly sought after, methods for detecting subsets with altered treatment effects are limited and lacking in power. In this article we discuss variable selection for qualitative interactions with the aim of discovering these critical patient subsets. We propose a new technique designed specifically to find these interaction variables among a large set of variables while still controlling the number of false discoveries. We compare this new method against standard qualitative interaction tests using simulations and give an example of its use on data from a randomized controlled trial for the treatment of depression. PMID:22023676
Variable selection with stepwise and best subset approaches
2016-01-01
While purposeful selection is performed partly by software and partly by hand, the stepwise and best subset approaches are automatically performed by software. Two R functions, stepAIC() and bestglm(), are well designed for stepwise and best subset regression, respectively. The stepAIC() function begins with a full or null model, and the method for stepwise regression can be specified in the direction argument with character values “forward”, “backward” and “both”. The bestglm() function begins with a data frame containing explanatory variables and the response variable; the response variable should be in the last column. A variety of goodness-of-fit criteria can be specified in the IC argument. The Bayesian information criterion (BIC) usually results in a more parsimonious model than the Akaike information criterion. PMID:27162786
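The two R functions above are the canonical tools; as a rough cross-language illustration of what stepAIC() does with direction = "forward", here is a minimal forward-stepwise-by-AIC loop in Python using statsmodels (a sketch of the idea, not a port of the R implementations).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_aic(df: pd.DataFrame, response: str):
    """Greedy forward selection: add the variable that lowers AIC
    most, and stop when no addition improves it."""
    y = df[response]
    remaining = [c for c in df.columns if c != response]
    selected = []

    def aic_of(cols):
        # intercept-only design when no variables are selected yet
        X = np.hstack([np.ones((len(df), 1))] +
                      ([df[cols].to_numpy()] if cols else []))
        return sm.OLS(y, X).fit().aic

    current = aic_of(selected)
    while remaining:
        trial = {c: aic_of(selected + [c]) for c in remaining}
        best = min(trial, key=trial.get)
        if trial[best] >= current:
            break
        selected.append(best)
        remaining.remove(best)
        current = trial[best]
    return selected
```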
NASA Astrophysics Data System (ADS)
Seo, Seung Beom; Kim, Young-Oh; Kim, Youngil; Eum, Hyung-Il
2018-04-01
When selecting a subset of climate change scenarios (GCM models), the priority is to ensure that the subset reflects the comprehensive range of possible model results for all variables concerned. Though many studies have attempted to improve scenario selection, there is a lack of studies that discuss methods to ensure that the results from a subset of climate models contain the same range of uncertainty in hydrologic variables as when all models are considered. We applied the Katsavounidis-Kuo-Zhang (KKZ) algorithm to select a subset of climate change scenarios and demonstrated its ability to reduce the number of GCM models in an ensemble while preserving the ranges of multiple climate extremes indices. First, we analyzed the role of 27 ETCCDI climate extremes indices in scenario selection and selected the representative climate extremes indices. Before selecting a subset, we excluded a few deficient GCM models that could not represent the observed climate regime. Subsequently, we discovered that a subset of GCM models selected by the KKZ algorithm with the representative climate extremes indices could not capture the full potential range of changes in hydrologic extremes (e.g., 3-day peak flow and 7-day low flow) in some regional case studies. However, applying the KKZ algorithm with a different set of climate indices, ones correlated with the hydrologic extremes, overcame this limitation. Key climate indices, dependent on the hydrologic extremes to be projected, must therefore be determined prior to the selection of a subset of GCM models.
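The KKZ algorithm referenced here originated as a codebook-initialization method in vector quantization; in scenario selection it is typically run on standardized climate indices, starting near the ensemble centroid and then repeatedly adding the model farthest (in minimum-distance terms) from those already chosen. A minimal sketch of one common formulation, not the authors' exact code:

```python
import numpy as np

def kkz_select(Z, n_select):
    """KKZ subset selection over model-output vectors.

    Z: (n_models, n_indices) array, one row per GCM, columns are
    climate indices. Standardizes the indices, starts from the model
    closest to the ensemble centroid, then repeatedly adds the model
    whose minimum distance to the selected set is largest, so the
    subset spreads across the ensemble's range.
    """
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    selected = [int(np.argmin(np.linalg.norm(Z, axis=1)))]
    while len(selected) < n_select:
        dmin = np.min(
            np.linalg.norm(Z[:, None, :] - Z[selected][None, :, :], axis=2),
            axis=1)
        dmin[selected] = -np.inf        # never re-select a member
        selected.append(int(np.argmax(dmin)))
    return selected
```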
Variable screening via quantile partial correlation
Ma, Shujie; Tsai, Chih-Ling
2016-01-01
In quantile linear regression with ultra-high-dimensional data, we propose an algorithm for screening all candidate variables and subsequently selecting relevant predictors. Specifically, we first employ quantile partial correlation for screening, and then we apply the extended Bayesian information criterion (EBIC) for best subset selection. Our proposed method can successfully select predictors when the variables are highly correlated, and it can also identify variables that make a contribution to the conditional quantiles but are marginally uncorrelated or weakly correlated with the response. Theoretical results show that the proposed algorithm can yield the sure screening set. By controlling the false selection rate, model selection consistency can be achieved theoretically. In practice, we propose using EBIC for best subset selection so that the resulting model is screening consistent. Simulation studies demonstrate that the proposed algorithm performs well, and an empirical example is presented. PMID:28943683
Variable Selection through Correlation Sifting
NASA Astrophysics Data System (ADS)
Huang, Jim C.; Jojic, Nebojsa
Many applications of computational biology require a variable selection procedure to sift through a large number of input variables and select some smaller number that influence a target variable of interest. For example, in virology, only some small number of viral protein fragments influence the nature of the immune response during viral infection. Due to the large number of variables to be considered, a brute-force search for the subset of variables is in general intractable. To approximate this, methods based on ℓ1-regularized linear regression have been proposed and have been found to be particularly successful. It is well understood, however, that such methods fail to choose the correct subset of variables if these are highly correlated with other "decoy" variables. We present a method for sifting through sets of highly correlated variables which leads to higher accuracy in selecting the correct variables. The main innovation is a filtering step that reduces correlations among variables to be selected, making the ℓ1-regularization effective for datasets on which many methods for variable selection fail. The filtering step changes both the values of the predictor variables and the output values by projections onto components obtained through a computationally inexpensive principal components analysis. In this paper we demonstrate the usefulness of our method on synthetic datasets and on novel applications in virology. These include HIV viral load analysis based on patients' HIV sequences and immune types, as well as the analysis of seasonal variation in influenza death rates based on the regions of the influenza genome that undergo diversifying selection in the previous season.
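A rough sketch of the sifting idea described above: remove the leading principal components, which carry the broad correlations among variables, from both the predictors and the response, then run an ℓ1-regularized fit on the filtered data. The number of components removed and the scaling choices here are placeholders, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sift_then_lasso(X, y, n_remove=2, alpha=0.1):
    """Project out the top principal components from X and y, then
    fit an L1-penalized regression on the filtered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # sample-space principal directions of X (columns of U)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = U[:, :n_remove]
    Xf = Xc - P @ (P.T @ Xc)   # filter the predictors
    yf = yc - P @ (P.T @ yc)   # filter the response too
    return Lasso(alpha=alpha).fit(Xf, yf).coef_
```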
Generating a Simulated Fluid Flow over a Surface Using Anisotropic Diffusion
NASA Technical Reports Server (NTRS)
Rodriguez, David L. (Inventor); Sturdza, Peter (Inventor)
2016-01-01
A fluid-flow simulation over a computer-generated surface is generated using a diffusion technique. The surface is comprised of a surface mesh of polygons. A boundary-layer fluid property is obtained for a subset of the polygons of the surface mesh. A gradient vector is determined for a selected polygon, the selected polygon belonging to the surface mesh but not one of the subset of polygons. A maximum and minimum diffusion rate is determined along directions determined using the gradient vector corresponding to the selected polygon. A diffusion-path vector is defined between a point in the selected polygon and a neighboring point in a neighboring polygon. An updated fluid property is determined for the selected polygon using a variable diffusion rate, the variable diffusion rate based on the minimum diffusion rate, maximum diffusion rate, and the gradient vector.
Mujalli, Randa Oqab; de Oña, Juan
2011-10-01
This study describes a method for reducing the number of variables frequently considered in modeling the severity of traffic accidents. The method's efficiency is assessed by constructing Bayesian networks (BNs). It is based on a two-stage selection process. Several variable selection algorithms commonly used in data mining are applied in order to select subsets of variables. BNs are built using the selected subsets, and their performance is compared with that of the original BN (with all the variables) using five indicators. The BNs that improve the indicators' values are further analyzed to identify the most significant variables (accident type, age, atmospheric factors, gender, lighting, number of injured, and occupant involved). A new BN is built using these variables, and the resulting indicators show, in most cases, a statistically significant improvement with respect to the original BN. It is thus possible to reduce the number of variables used to model traffic accident injury severity through BNs without reducing the performance of the model. The study provides safety analysts with a methodology that can be used to minimize the number of variables needed to determine efficiently the injury severity of traffic accidents without reducing the performance of the model.
Design and Application of Drought Indexes in Highly Regulated Mediterranean Water Systems
NASA Astrophysics Data System (ADS)
Castelletti, A.; Zaniolo, M.; Giuliani, M.
2017-12-01
Costs of drought are progressively increasing due to the ongoing alteration of hydro-meteorological regimes induced by climate change. Although drought management is widely studied in the literature, most traditional drought indexes fail to detect critical events in highly regulated systems; such systems generally rely on ad hoc index formulations that cannot be generalized to different contexts. In this study, we contribute a novel framework for the design of a basin-customized drought index. This index represents a surrogate of the state of the basin and is computed by combining the available information about the water available in the system to reproduce a representative target variable for the drought condition of the basin (e.g., water deficit). To select the relevant variables and the combination thereof, we use an advanced feature extraction algorithm called Wrapper for Quasi Equally Informative Subset Selection (W-QEISS). W-QEISS relies on a multi-objective evolutionary algorithm to find Pareto-efficient subsets of variables by maximizing the wrapper accuracy, minimizing the number of selected variables, and optimizing relevance and redundancy of the subset. The accuracy objective is evaluated through the calibration of an extreme learning machine of the water deficit for each candidate subset of variables, with the index selected from the resulting solutions identifying a suitable compromise between accuracy, cardinality, relevance, and redundancy. The approach is tested on Lake Como, Italy, a regulated lake mainly operated for irrigation supply. In the absence of an institutional drought monitoring system, we constructed the combined index using all the hydrological variables from the existing monitoring system as well as common drought indicators at multiple time aggregations. The soil moisture deficit in the root zone, computed by a distributed-parameter water balance model of the agricultural districts, is used as the target variable. Numerical results show that our combined drought index successfully reproduces the deficit. The index provides valuable information for supporting appropriate drought management strategies, including the possibility of directly informing the lake operations about drought conditions and improving the overall reliability of the irrigation supply system.
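W-QEISS itself couples an evolutionary search with four objectives; as a much-reduced illustration of the wrapper idea, the sketch below scores small variable subsets with a tiny extreme learning machine (random hidden layer, least-squares readout) and keeps the Pareto front of (cross-validated error, cardinality). The relevance and redundancy objectives, and the evolutionary search needed for large subset spaces, are omitted.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def elm_cv_error(X, y, n_hidden=30, folds=5):
    """Tiny extreme learning machine: fixed random hidden layer,
    per-fold least-squares readout; returns cross-validated RMSE."""
    idx = rng.permutation(len(y))
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)          # hidden layer is data-independent
    errs = []
    for f in np.array_split(idx, folds):
        train = np.setdiff1d(idx, f)
        beta, *_ = np.linalg.lstsq(H[train], y[train], rcond=None)
        errs.append(np.sqrt(np.mean((H[f] @ beta - y[f]) ** 2)))
    return float(np.mean(errs))

def pareto_subsets(X, y, max_size=3):
    """Exhaustively score small subsets and keep the Pareto front of
    (error, cardinality)."""
    scored = [(elm_cv_error(X[:, cols], y), k, cols)
              for k in range(1, max_size + 1)
              for cols in combinations(range(X.shape[1]), k)]
    front = [s for s in scored
             if not any(o[0] <= s[0] and o[1] <= s[1]
                        and (o[0] < s[0] or o[1] < s[1]) for o in scored)]
    return sorted(front, key=lambda t: t[1])
```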
Generating a Simulated Fluid Flow Over an Aircraft Surface Using Anisotropic Diffusion
NASA Technical Reports Server (NTRS)
Rodriguez, David L. (Inventor); Sturdza, Peter (Inventor)
2013-01-01
A fluid-flow simulation over a computer-generated aircraft surface is generated using a diffusion technique. The surface is comprised of a surface mesh of polygons. A boundary-layer fluid property is obtained for a subset of the polygons of the surface mesh. A pressure-gradient vector is determined for a selected polygon, the selected polygon belonging to the surface mesh but not one of the subset of polygons. A maximum and minimum diffusion rate is determined along directions determined using a pressure gradient vector corresponding to the selected polygon. A diffusion-path vector is defined between a point in the selected polygon and a neighboring point in a neighboring polygon. An updated fluid property is determined for the selected polygon using a variable diffusion rate, the variable diffusion rate based on the minimum diffusion rate, maximum diffusion rate, and angular difference between the diffusion-path vector and the pressure-gradient vector.
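As a toy illustration only (the two patent abstracts above do not specify the blending function), the following shows one way a variable diffusion rate could interpolate between the minimum rate along the pressure gradient and the maximum rate across it, driven by the angle between the diffusion-path vector and the gradient:

```python
import numpy as np

def variable_diffusion_rate(path_vec, grad_vec, rate_min, rate_max):
    """Toy blend: the minimum rate applies along the gradient, the
    maximum rate across it. A cosine^2 interpolation is used here
    purely for illustration; the patented blending is not specified
    in the abstract."""
    cosang = (path_vec @ grad_vec) / (
        np.linalg.norm(path_vec) * np.linalg.norm(grad_vec))
    w = cosang ** 2            # 1 along the gradient, 0 across it
    return rate_min * w + rate_max * (1.0 - w)

def update_fluid_property(phi_self, phi_neighbor, rate, dt=1.0):
    # explicit diffusion step toward the neighboring polygon's value
    return phi_self + dt * rate * (phi_neighbor - phi_self)
```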
Selecting climate change scenarios using impact-relevant sensitivities
Julie A. Vano; John B. Kim; David E. Rupp; Philip W. Mote
2015-01-01
Climate impact studies often require the selection of a small number of climate scenarios. Ideally, a subset would have simulations that both (1) appropriately represent the range of possible futures for the variable/s most important to the impact under investigation and (2) come from global climate models (GCMs) that provide plausible results for future climate in the...
Minimization search method for data inversion
NASA Technical Reports Server (NTRS)
Fymat, A. L.
1975-01-01
A technique has been developed for determining the values of selected subsets of independent variables in mathematical formulations. The required computation time increases with the first power of the number of variables. This is in contrast with classical minimization methods, for which computation time increases with the third power of the number of variables.
Cheng, Qiang; Zhou, Hongbo; Cheng, Jie
2011-06-01
Selecting features for multiclass classification is a critically important task for pattern recognition and machine learning applications. Especially challenging is selecting an optimal subset of features from high-dimensional data, which typically have many more variables than observations and contain significant noise, missing components, or outliers. Existing methods either cannot handle high-dimensional data efficiently or scalably, or can only obtain a local optimum instead of the global optimum. Toward selecting the globally optimal subset of features efficiently, we introduce a new selector, which we call the Fisher-Markov selector, to identify those features that are the most useful in describing essential differences among the possible groups. In particular, in this paper we present a way to represent essential discriminating characteristics together with sparsity as an optimization objective. With properly identified measures for sparseness and discriminativeness in possibly high-dimensional settings, we take a systematic approach to optimizing the measures to choose the best feature subset. We use Markov random field optimization techniques to solve the formulated objective functions for simultaneous feature selection. Our results are noncombinatorial, and they can achieve the exact global optimum of the objective function for some special kernels. The method is fast; in particular, it can be linear in the number of features and quadratic in the number of observations. We apply our procedure to a variety of real-world data, including a mid-dimensional optical handwritten digit data set and high-dimensional microarray gene expression data sets. The effectiveness of our method is confirmed by experimental results. In pattern recognition terms and from a model selection viewpoint, our procedure shows that it is possible to select the most discriminating subset of variables by solving a very simple unconstrained objective function, which in fact can be obtained in an explicit expression.
Automatic design of basin-specific drought indexes for highly regulated water systems
NASA Astrophysics Data System (ADS)
Zaniolo, Marta; Giuliani, Matteo; Castelletti, Andrea Francesco; Pulido-Velazquez, Manuel
2018-04-01
Socio-economic costs of drought are progressively increasing worldwide due to ongoing alterations of hydro-meteorological regimes induced by climate change. Although drought management is widely studied in the literature, traditional drought indexes often fail to detect critical events in highly regulated systems, where natural water availability is conditioned by the operation of water infrastructures such as dams, diversions, and pumping wells. Here, ad hoc index formulations are usually adopted based on empirical combinations of several, supposed-to-be-significant, hydro-meteorological variables. These customized formulations, however, while effective in the basin for which they were designed, can hardly be generalized and transferred to different contexts. In this study, we contribute FRIDA (FRamework for Index-based Drought Analysis), a novel framework for the automatic design of basin-customized drought indexes. In contrast to ad hoc empirical approaches, FRIDA is fully automated, generalizable, and portable across different basins. FRIDA builds an index representing a surrogate of the drought conditions of the basin, computed by combining all the relevant available information about the water circulating in the system, identified by means of a feature extraction algorithm. We used the Wrapper for Quasi-Equally Informative Subset Selection (W-QEISS), which features a multi-objective evolutionary algorithm that finds Pareto-efficient subsets of variables by maximizing the wrapper accuracy, minimizing the number of selected variables, and optimizing relevance and redundancy of the subset. The preferred variable subset is selected among the efficient solutions and used to formulate the final index according to alternative model structures. We apply FRIDA to the case study of the Jucar river basin (Spain), a drought-prone and highly regulated Mediterranean water resource system, where an advanced drought management plan relies on an ad hoc state index to trigger drought management measures. The state index was constructed empirically through a trial-and-error process begun in the 1980s and finalized in 2007, guided by experts from the Confederación Hidrográfica del Júcar (CHJ). Our results show that the automated variable selection outcomes align with CHJ's 25-year-long empirical refinement. In addition, the resulting FRIDA index outperforms the official state index in terms of accuracy in reproducing the target variable and cardinality of the selected input set.
VizieR Online Data Catalog: RR Lyraes in SDSS stripe 82 (Watkins+, 2009)
NASA Astrophysics Data System (ADS)
Watkins, L. L.; Evans, N. W.; Belokurov, V.; Smith, M. C.; Hewett, P. C.; Bramich, D. M.; Gilmore, G. F.; Irwin, M. J.; Vidrih, S.; Wyrzykowski, L.; Zucker, D. B.
2015-10-01
In this paper, we first select the variable objects in Stripe 82 and then the subset of RR Lyraes, using the Bramich et al. (2008MNRAS.386..887B, Cat. V/141) light-motion curve catalogue (LMCC) and HLC. We make a selection of the variable objects and an identification of RR Lyrae stars. (2 data files).
Creating a non-linear total sediment load formula using polynomial best subset regression model
NASA Astrophysics Data System (ADS)
Okcu, Davut; Pektas, Ali Osman; Uyumaz, Ali
2016-08-01
The aim of this study is to derive a new total sediment load formula which is more accurate and has fewer application constraints than the well-known formulae in the literature. The five best-known stream-power-concept sediment formulae approved by ASCE are used for benchmarking on a wide range of datasets that include both field and flume (lab) observations. The dimensionless parameters of these widely used formulae are used as inputs in a new regression approach, called Polynomial Best Subset Regression (PBSR) analysis. The aim of the PBSR analysis is to fit and test all possible combinations of the input variables and select the best subset. All the input variables, together with their second and third powers, are included in the regression to test possible relations between the explanatory variables and the dependent variable. In selecting the best subset, a multistep approach is used that depends on significance values and on the degree of multicollinearity among inputs. The new formula is compared to the others on a holdout dataset, and detailed performance investigations are conducted for the field and lab subsets within this holdout data. Different goodness-of-fit statistics are used, as they represent different perspectives on model accuracy. After the detailed comparisons, the most accurate equation that is also applicable to both flume and river data is identified. In particular, on the field dataset the prediction performance of the proposed formula outperformed the benchmark formulations.
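A minimal sketch of the PBSR idea, assuming a predictor matrix X and target y: build first-, second- and third-power terms, enumerate small subsets, drop candidates with strong multicollinearity, and keep the best-scoring fit. BIC is used as the score and a VIF threshold as the collinearity filter here; the paper's multistep procedure works with significance values and its own multicollinearity checks.

```python
import numpy as np
from itertools import combinations

def vif_ok(Z, limit=10.0):
    """Reject subsets whose variance inflation factors exceed limit."""
    for j in range(Z.shape[1]):
        A = np.column_stack([np.ones(len(Z)), np.delete(Z, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, Z[:, j], rcond=None)
        r2 = 1 - (Z[:, j] - A @ coef).var() / Z[:, j].var()
        if r2 > 1 - 1e-12 or 1.0 / (1.0 - r2) > limit:
            return False
    return True

def polynomial_best_subset(X, y, max_terms=4):
    """Enumerate subsets of {x_j, x_j^2, x_j^3} terms and pick the
    lowest-BIC fit among subsets that pass the VIF filter. Returns
    (bic, column indices into the stacked term matrix)."""
    terms = np.column_stack([X ** p for p in (1, 2, 3)])
    n, m = terms.shape
    best = (np.inf, None)
    for k in range(1, max_terms + 1):
        for cols in combinations(range(m), k):
            Z = terms[:, cols]
            if k > 1 and not vif_ok(Z):
                continue
            A = np.column_stack([np.ones(n), Z])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(((y - A @ coef) ** 2).sum())
            bic = n * np.log(rss / n) + (k + 1) * np.log(n)
            if bic < best[0]:
                best = (bic, cols)
    return best
```

Note that the enumeration grows combinatorially with the number of stacked terms, so this brute-force version is only practical for a handful of input variables.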
Designing basin-customized combined drought indices via feature extraction
NASA Astrophysics Data System (ADS)
Zaniolo, Marta; Giuliani, Matteo; Castelletti, Andrea
2017-04-01
The socio-economic costs of drought are progressively increasing worldwide due to the ongoing alteration of hydro-meteorological regimes induced by climate change. Although drought management is widely studied in the literature, most traditional drought indexes fail to detect critical events in highly regulated systems; such systems generally rely on ad hoc formulations that cannot be generalized to different contexts. In this study, we contribute a novel framework for the design of a basin-customized drought index. This index represents a surrogate of the state of the basin and is computed by combining the available information about the water available in the system to reproduce a representative target variable for the drought condition of the basin (e.g., water deficit). To select the relevant variables and how to combine them, we use an advanced feature extraction algorithm called Wrapper for Quasi Equally Informative Subset Selection (W-QEISS). The W-QEISS algorithm relies on a multi-objective evolutionary algorithm to find Pareto-efficient subsets of variables by maximizing the wrapper accuracy, minimizing the number of selected variables (cardinality), and optimizing relevance and redundancy of the subset. The accuracy objective is evaluated through the calibration of a pre-defined model (i.e., an extreme learning machine) of the water deficit for each candidate subset of variables, with the index selected from the resulting solutions identifying a suitable compromise between accuracy, cardinality, relevance, and redundancy. The proposed methodology is tested on the case study of Lake Como in northern Italy, a regulated lake mainly operated for irrigation supply to four downstream agricultural districts. In the absence of an institutional drought monitoring system, we constructed the combined index using all the hydrological variables from the existing monitoring system as well as the most common drought indicators at multiple time aggregations. The soil moisture deficit in the root zone, computed by a distributed-parameter water balance model of the agricultural districts, is used as the target variable. Numerical results show that our framework succeeds in constructing a combined drought index that reproduces the soil moisture deficit. Moreover, this index provides valuable information for supporting appropriate drought management strategies, including the possibility of directly informing the lake operations about drought conditions and improving the overall reliability of the irrigation supply system.
2012-01-01
Background: An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data.
Results: We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We show an example of priors for pathway- and network-based information and illustrate our proposed method on synthetic response data and in an application to cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information.
Conclusions: The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with a competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge. PMID:22578440
Predictive equations for the estimation of body size in seals and sea lions (Carnivora: Pinnipedia)
Churchill, Morgan; Clementz, Mark T; Kohno, Naoki
2014-01-01
Body size plays an important role in pinniped ecology and life history. However, body size data are often absent for historical, archaeological, and fossil specimens. To estimate the body size of pinnipeds (seals, sea lions, and walruses) today and in the past, we used 14 commonly preserved cranial measurements to develop sets of single-variable and multivariate predictive equations for pinniped body mass and total length. Principal components analysis (PCA) was used to test whether separate family-specific regressions were more appropriate than single predictive equations for Pinnipedia. The influence of phylogeny was tested with phylogenetic independent contrasts (PIC). The accuracy of these regressions was then assessed using a combination of the coefficient of determination, percent prediction error, and standard error of estimation. Three different methods of multivariate analysis were examined: bidirectional stepwise model selection using the Akaike information criterion; all-subsets model selection using the Bayesian information criterion (BIC); and partial least squares regression. The PCA showed clear discrimination between Otariidae (fur seals and sea lions) and Phocidae (earless seals) for the 14 measurements, indicating the need for family-specific regression equations. The PIC analysis found that phylogeny had a minor influence on the relationship between morphological variables and body size. The regressions for total length were more accurate than those for body mass, and equations specific to Otariidae were more accurate than those for Phocidae. Of the three multivariate methods, the all-subsets approach required the fewest variables to estimate body size accurately. We then used the single-variable predictive equations and the all-subsets approach to estimate the body size of two recently extinct pinniped taxa, the Caribbean monk seal (Monachus tropicalis) and the Japanese sea lion (Zalophus japonicus). Body size estimates using single-variable regressions generally under- or over-estimated body size; however, the all-subsets regressions produced body size estimates that were close to historically recorded body lengths for these two species. This indicates that the all-subsets regression equations developed in this study can estimate body size accurately. PMID:24916814
Lee, Kyu Ha; Tadesse, Mahlet G; Baccarelli, Andrea A; Schwartz, Joel; Coull, Brent A
2017-03-01
The analysis of multiple outcomes is becoming increasingly common in modern biomedical studies. It is well-known that joint statistical models for multiple outcomes are more flexible and more powerful than fitting a separate model for each outcome; they yield more powerful tests of exposure or treatment effects by taking into account the dependence among outcomes and pooling evidence across outcomes. It is, however, unlikely that all outcomes are related to the same subset of covariates. Therefore, there is interest in identifying exposures or treatments associated with particular outcomes, which we term outcome-specific variable selection. In this work, we propose a variable selection approach for multivariate normal responses that incorporates not only information on the mean model, but also information on the variance-covariance structure of the outcomes. The approach effectively leverages evidence from all correlated outcomes to estimate the effect of a particular covariate on a given outcome. To implement this strategy, we develop a Bayesian method that builds a multivariate prior for the variable selection indicators based on the variance-covariance of the outcomes. We show via simulation that the proposed variable selection strategy can boost power to detect subtle effects without increasing the probability of false discoveries. We apply the approach to the Normative Aging Study (NAS) epigenetic data and identify a subset of five genes in the asthma pathway for which gene-specific DNA methylations are associated with exposures to either black carbon, a marker of traffic pollution, or sulfate, a marker of particles generated by power plants.
NASA Astrophysics Data System (ADS)
Wang, Lijuan; Yan, Yong; Wang, Xue; Wang, Tao
2017-03-01
Input variable selection is an essential step in the development of data-driven models for environmental, biological and industrial applications. Through input variable selection to eliminate irrelevant or redundant variables, a suitable subset of variables is identified as the input of a model, the complexity of the model structure is simplified, and the computational efficiency is improved. This paper describes the procedures of input variable selection for data-driven models for the measurement of liquid mass flowrate and gas volume fraction under two-phase flow conditions using Coriolis flowmeters. Three advanced input variable selection methods, including partial mutual information (PMI), genetic algorithm-artificial neural network (GA-ANN) and tree-based iterative input selection (IIS), are applied in this study. Typical data-driven models incorporating support vector machines (SVMs) are established individually based on the input candidates resulting from the selection methods. The validity of the selection outcomes is assessed through an output performance comparison of the SVM-based data-driven models and through sensitivity analysis. The validation and analysis results suggest that the input variables selected by the PMI algorithm provide more effective information for the models to measure liquid mass flowrate, while the IIS algorithm provides fewer but more effective variables for the models to predict gas volume fraction.
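A simplified rendering of the PMI idea (the PMI literature uses kernel regression estimators for the conditioning step; plain linear regression and a k-NN mutual-information estimator stand in here): at each step, condition the output and each remaining candidate on the inputs already selected, then pick the candidate whose residual shares the most mutual information with the output residual.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import mutual_info_regression

def pmi_select(X, y, n_select):
    """Greedy partial-mutual-information-style input selection."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        if selected:
            Z = X[:, selected]
            # residualize output and candidates on the selected inputs
            ry = y - LinearRegression().fit(Z, y).predict(Z)
            RX = X[:, remaining] - LinearRegression().fit(
                Z, X[:, remaining]).predict(Z)
        else:
            ry, RX = y, X[:, remaining]
        mi = mutual_info_regression(RX, ry)
        best = remaining[int(np.argmax(mi))]
        selected.append(best)
        remaining.remove(best)
    return selected
```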
A Selective Overview of Variable Selection in High Dimensional Feature Space
Fan, Jianqing
2010-01-01
High-dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied to simultaneously selecting important variables and estimating their effects in high-dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high-dimensional variable selection. Questions such as what limits of dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are, rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high-dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high-dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
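As a concrete miniature of the screening-plus-penalization recipe surveyed here, the sketch below ranks variables by absolute marginal correlation (sure independence screening), keeps the top d ≈ n/log n, and hands the survivors to a cross-validated lasso. The article also treats non-concave penalties such as SCAD; the lasso stands in for simplicity.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def sis_then_lasso(X, y, d=None):
    """Sure independence screening, then an L1-penalized fit."""
    n, p = X.shape
    d = d or max(1, int(n / np.log(n)))
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    keep = np.argsort(corr)[::-1][:d]        # top-d marginal screen
    model = LassoCV(cv=5).fit(X[:, keep], y)
    return keep[model.coef_ != 0]            # indices surviving both stages
```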
NASA Astrophysics Data System (ADS)
Treloar, W. J.; Taylor, G. E.; Flenley, J. R.
2004-12-01
This is the first of a series of papers on the theme of automated pollen analysis. The automation of pollen analysis could result in numerous advantages for the reconstruction of past environments, with larger data sets made practical, objectivity, and fine-resolution sampling. There are also applications in apiculture and medicine. Previous work on the classification of pollen using texture measures has been successful with small numbers of pollen taxa. However, as the number of pollen taxa to be identified increases, more features may be required to achieve a successful classification. This paper describes the use of simple geometric measures to augment the texture measures. The feasibility of this new approach is tested using scanning electron microscope (SEM) images of 12 taxa of fresh pollen taken from reference material collected on Henderson Island, Polynesia. Pollen images were captured directly from a SEM connected to a PC. A threshold grey-level was set and binary images were then generated. Pollen edges were then located and the boundaries were traced using a chain coding system. A number of simple geometric variables were calculated directly from the chain code of the pollen, and a variable selection procedure was used to choose the optimal subset to be used for classification. The efficiency of these variables was tested using a leave-one-out classification procedure. The system successfully split the original 12-taxon sample into five sub-samples containing no more than six pollen taxa each. The further subdivision of echinate pollen types was then attempted with a subset of four pollen taxa. A set of difference codes was constructed for a range of displacements along the chain code, and from these difference codes probability variables were calculated. A variable selection procedure was again used to choose the optimal subset of probabilities to be used for classification, and the efficiency of these variables was again tested using a leave-one-out classification procedure. The proportion of correctly classified pollen ranged from 81% to 100% depending on the subset of variables used. The best set of variables had an overall classification rate averaging about 95%. This is comparable with the classification rates from the earlier texture analysis work for other types of pollen.
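The chain-code machinery described above is standard image-analysis fare; a small sketch of the pipeline (Freeman chain code of a traced boundary, difference codes over several displacements, and their normalized histograms as probability variables) might look like the following. The exact geometric variables and displacements used in the paper differ.

```python
import numpy as np

# 8-direction Freeman chain code offsets (0 = east, counter-clockwise)
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]

def chain_code(boundary):
    """Freeman chain code of an ordered list of (row, col) boundary
    pixels; consecutive pixels must be 8-connected."""
    lut = {off: k for k, off in enumerate(OFFSETS)}
    return [lut[(r2 - r1, c2 - c1)]
            for (r1, c1), (r2, c2) in zip(boundary, boundary[1:])]

def difference_code(code, displacement=1):
    """Difference code over a given displacement along the (closed)
    chain; rotation of the outline shifts codes but largely
    preserves these differences."""
    return [(code[(i + displacement) % len(code)] - c) % 8
            for i, c in enumerate(code)]

def probability_features(code, displacements=(1, 2, 4)):
    """Normalized difference-code histograms as probability variables."""
    feats = []
    for d in displacements:
        diff = difference_code(code, d)
        feats.extend(np.bincount(diff, minlength=8) / len(diff))
    return np.array(feats)
```

Feature vectors built this way can then be fed to any classifier and scored with a leave-one-out procedure, as in the paper.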
Do bioclimate variables improve performance of climate envelope models?
Watling, James I.; Romañach, Stephanie S.; Bucklin, David N.; Speroterra, Carolina; Brandt, Laura A.; Pearlstine, Leonard G.; Mazzotti, Frank J.
2012-01-01
Climate envelope models are widely used to forecast potential effects of climate change on species distributions. A key issue in climate envelope modeling is the selection of predictor variables that most directly influence species. To determine whether model performance and spatial predictions were related to the selection of predictor variables, we compared models using bioclimate variables with models constructed from monthly climate data for twelve terrestrial vertebrate species in the southeastern USA using two different algorithms (random forests or generalized linear models), and two model selection techniques (using uncorrelated predictors or a subset of user-defined biologically relevant predictor variables). There were no differences in performance between models created with bioclimate or monthly variables, but one metric of model performance was significantly greater using the random forest algorithm compared with generalized linear models. Spatial predictions between maps using bioclimate and monthly variables were very consistent using the random forest algorithm with uncorrelated predictors, whereas we observed greater variability in predictions using generalized linear models.
Arruti, Andoni; Cearreta, Idoia; Álvarez, Aitor; Lazkano, Elena; Sierra, Basilio
2014-01-01
The study of emotions in human–computer interaction is a growing research area. This paper shows an attempt to select the most significant features for emotion recognition in spoken Basque and Spanish using different methods for feature selection. The RekEmozio database was used as the experimental data set. Several machine learning paradigms were used for the emotion classification task. Experiments were executed in three phases, using different sets of features as classification variables in each phase. Moreover, feature subset selection was applied at each phase in order to seek the most relevant feature subset. The three-phase approach was selected to check the validity of the proposed approach. The results show that an instance-based learning algorithm using feature subset selection techniques based on evolutionary algorithms is the best machine learning paradigm for automatic emotion recognition across all feature sets, obtaining a mean emotion recognition rate of 80.05% in Basque and 74.82% in Spanish. In order to check the goodness of the proposed process, a greedy searching approach (FSS-Forward) was also applied, and a comparison between the two is provided. Based on the results achieved, a set of the most relevant non-speaker-dependent features is proposed for both languages, and new perspectives are suggested. PMID:25279686
Comparative study of feature selection with ensemble learning using SOM variants
NASA Astrophysics Data System (ADS)
Filali, Ameni; Jlassi, Chiraz; Arous, Najet
2017-03-01
Ensemble learning has brought gains in stability and clustering accuracy, but runtime prohibits ensemble methods from scaling up to real-world applications. This study addresses the problem of selecting a subset of the most pertinent features for every cluster of a dataset. The proposed method is an extension of the Random Forests approach using self-organizing map (SOM) variants on unlabeled data, estimating out-of-bag feature importance from a set of partitions. Every partition is created using a different bootstrap sample and a random subset of the features. We then show that the internal estimates used to measure variable pertinence in Random Forests are also applicable to feature selection in unsupervised learning. The approach aims at dimensionality reduction, visualization and cluster characterization at the same time. We provide empirical results on nineteen benchmark data sets indicating that RFS can lead to significant improvement in clustering accuracy, over several state-of-the-art unsupervised methods, with a very limited subset of features. The approach shows promise for very broad domains.
Genotyping variability of computationally categorized peach microsatellite markers
USDA-ARS's Scientific Manuscript database
Numerous expressed sequence tag (EST) simple sequence repeat (SSR) primers can easily be mined out. The obstacle to developing them into usable markers is how to optimally select downsized subsets of the primers for genotyping, which accordingly reduces amplification failure and monomorphism often occu...
NASA Technical Reports Server (NTRS)
Miles, R. F., Jr.
1986-01-01
A research and development (R&D) project often involves a number of decisions that must be made concerning which subset of systems or tasks is to be undertaken to achieve the goal of the R&D project. To help in this decision making, SIMRAND (SIMulation of Research ANd Development Projects) is a methodology for the selection of the optimal subset of systems or tasks to be undertaken on an R&D project. Using alternative networks, the SIMRAND methodology models the alternative subsets of systems or tasks under consideration. Each path through an alternative network represents one way of satisfying the project goals. Equations are developed that relate the system or task variables to the measure of preference. Uncertainty is incorporated by treating the variables of the equations probabilistically as random variables, with cumulative distribution functions assessed by technical experts. Analytical techniques of probability theory are used to reduce the complexity of the alternative networks. Cardinal utility functions over the measure of preference are assessed for the decision makers. A run of the SIMRAND I computer program combines, in a Monte Carlo simulation model, the network structure, the equations, the cumulative distribution functions, and the utility functions.
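A miniature of the SIMRAND idea, with hypothetical task networks and triangular distributions standing in for the expert-assessed CDFs: sample the task variables, total each alternative path, and rank the paths by expected utility. All names and numbers below are illustrative, not drawn from the original report.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical alternative networks: each path is a set of tasks whose
# costs are random variables given as (low, mode, high) triples
paths = {
    "path_A": [(2.0, 4.0, 9.0), (1.0, 2.0, 3.0)],
    "path_B": [(3.0, 5.0, 6.0)],
}

def utility(cost, risk_aversion=0.15):
    # cardinal (negative-exponential) utility over total cost
    return -np.expm1(risk_aversion * cost)

def simrand_style_rank(paths, n_draws=10_000):
    """Monte Carlo comparison of alternative networks: sample each
    task variable, total the path cost, and rank paths by expected
    utility."""
    results = {}
    for name, tasks in paths.items():
        totals = sum(rng.triangular(lo, mode, hi, n_draws)
                     for lo, mode, hi in tasks)
        results[name] = float(np.mean(utility(totals)))
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

print(simrand_style_rank(paths))
```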
Frank, Laurence E; Heiser, Willem J
2008-05-01
A set of features is the basis for the network representation of proximity data achieved by feature network models (FNMs). Features are binary variables that characterize the objects in an experiment, with some measure of proximity as the response variable. Sometimes features are provided by theory and play an important role in the construction of the experimental conditions. In other research settings, the features are not known a priori. This paper shows how to generate features in this situation and how to select an adequate subset of features that represents a good compromise between model fit and model complexity, using a new version of least angle regression that restricts coefficients to be non-negative, called the Positive Lasso. It is shown that features can be generated efficiently with Gray codes that are naturally linked to the FNMs. The model selection strategy makes use of the fact that an FNM can be considered a univariate multiple regression model. A simulation study shows that the proposed strategy leads to satisfactory results if the number of objects is less than or equal to 22; if the number of objects is larger than 22, the number of features selected by our method exceeds the true number of features in some conditions.
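The non-negativity constraint is easy to experiment with: scikit-learn's Lasso exposes a positive=True option that restricts coefficients to be non-negative, a coordinate-descent analogue of the least-angle-regression variant used in the paper. A toy run on a random binary feature matrix with hypothetical weights (not the FNM design matrix from the paper, which codes feature mismatches between object pairs):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
F = rng.integers(0, 2, size=(60, 12)).astype(float)  # binary feature matrix
true_w = np.array([0.0, 0.8, 0.0, 0.3, 0.0, 0.0,
                   1.2, 0.0, 0.0, 0.5, 0.0, 0.0])    # hypothetical weights
prox = F @ true_w + rng.normal(scale=0.05, size=60)  # proximity-like response

# positive=True forces non-negative coefficients, as in the Positive Lasso
model = Lasso(alpha=0.01, positive=True).fit(F, prox)
print(np.round(model.coef_, 2))   # sparse, non-negative estimates
```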
Improving the use of environmental diversity as a surrogate for species representation.
Albuquerque, Fabio; Beier, Paul
2018-01-01
The continuous p-median approach to environmental diversity (ED) is a reliable way to identify sites that efficiently represent species. A recently developed maximum dispersion (maxdisp) approach to ED is computationally simpler, does not require the user to reduce environmental space to two dimensions, and performed better than continuous p-median for datasets of South African animals. We tested whether maxdisp performs as well as continuous p-median for 12 datasets that included plants and other continents, and whether particular types of environmental variables produced consistently better models of ED. We selected 12 species inventories and atlases to span a broad range of taxa (plants, birds, mammals, reptiles, and amphibians), spatial extents, and resolutions. For each dataset, we used continuous p-median ED and maxdisp ED in combination with five sets of environmental variables (five combinations of temperature, precipitation, insolation, NDVI, and topographic variables) to select environmentally diverse sites. We used the species accumulation index (SAI) to evaluate the efficiency of ED in representing species for each approach and set of environmental variables. Maxdisp ED represented species better than continuous p-median ED in five of 12 biodiversity datasets, and about the same for the other seven biodiversity datasets. Efficiency of ED also varied with type of variables used to define environmental space, but no particular combination of variables consistently performed best. We conclude that maxdisp ED performs at least as well as continuous p-median ED, and has the advantage of faster and simpler computation. Surprisingly, using all 38 environmental variables was not consistently better than using subsets of variables, nor did any subset emerge as consistently best or worst; further work is needed to identify the best variables to define environmental space. Results can help ecologists and conservationists select sites for species representation and assist in conservation planning.
Sarkar, Mohosin; Liu, Yun; Qi, Junpeng; Peng, Haiyong; Morimoto, Jumpei; Rader, Christoph; Chiorazzi, Nicholas; Kodadek, Thomas
2016-04-01
Chronic lymphocytic leukemia (CLL) is a disease in which a single B-cell clone proliferates relentlessly in peripheral lymphoid organs, bone marrow, and blood. DNA sequencing experiments have shown that about 30% of CLL patients have stereotyped antigen-specific B-cell receptors (BCRs) with a high level of sequence homology in the variable domains of the heavy and light chains. These include many of the most aggressive cases that have IGHV-unmutated BCRs whose sequences have not diverged significantly from the germ line. This suggests a personalized therapy strategy in which a toxin or immune effector function is delivered selectively to the pathogenic B-cells but not to healthy B-cells. To execute this strategy, serum-stable, drug-like compounds able to target the antigen-binding sites of most or all patients in a stereotyped subset are required. We demonstrate here the feasibility of this approach with the discovery of selective, high-affinity ligands for CLL BCRs of the aggressive, stereotyped subset 7p that cross-react with the BCRs of several CLL patients in subset 7p, but not with BCRs from patients outside this subset.
Parker, T H; Wilkin, T A; Barr, I R; Sheldon, B C; Rowe, L; Griffith, S C
2011-07-01
Avian plumage colours are some of the most conspicuous sexual ornaments, and yet standardized selection gradients for plumage colour have rarely been quantified. We examined patterns of fecundity selection on plumage colour in blue tits (Cyanistes caeruleus L.). When not accounting for environmental heterogeneity, we detected relatively few cases of selection. We found significant disruptive selection on adult male crown colour and yearling female chest colour and marginally nonsignificant positive linear selection on adult female crown colour. We discovered no new significant selection gradients with canonical rotation of the matrix of nonlinear selection. Next, using a long-term data set, we identified territory-level environmental variables that predicted fecundity to determine whether these variables influenced patterns of plumage selection. The first of these variables, the density of oaks within 50 m of the nest, influenced selection gradients only for yearling males. The second variable, an inverse function of nesting density, interacted with a subset of plumage selection gradients for yearling males and adult females, although the strength and direction of selection did not vary predictably with population density across these analyses. Overall, fecundity selection on plumage colour in blue tits appeared rare and inconsistent among sexes and age classes.
Modulation Depth Estimation and Variable Selection in State-Space Models for Neural Interfaces
Hochberg, Leigh R.; Donoghue, John P.; Brown, Emery N.
2015-01-01
Rapid developments in neural interface technology are making it possible to record increasingly large signal sets of neural activity. Various factors such as asymmetrical information distribution and across-channel redundancy may, however, limit the benefit of high-dimensional signal sets, and the increased computational complexity may not yield corresponding improvement in system performance. High-dimensional system models may also lead to overfitting and lack of generalizability. To address these issues, we present a generalized modulation depth measure using the state-space framework that quantifies the tuning of a neural signal channel to relevant behavioral covariates. For a dynamical system, we develop computationally efficient procedures for estimating modulation depth from multivariate data. We show that this measure can be used to rank neural signals and select an optimal channel subset for inclusion in the neural decoding algorithm. We present a scheme for choosing the optimal subset based on model order selection criteria. We apply this method to neuronal ensemble spike-rate decoding in neural interfaces, using our framework to relate motor cortical activity with intended movement kinematics. With offline analysis of intracortical motor imagery data obtained from individuals with tetraplegia using the BrainGate neural interface, we demonstrate that our variable selection scheme is useful for identifying and ranking the most information-rich neural signals. We demonstrate that our approach offers several orders of magnitude lower complexity but virtually identical decoding performance compared to greedy search and other selection schemes. Our statistical analysis shows that the modulation depth of human motor cortical single-unit signals is well characterized by the generalized Pareto distribution. Our variable selection scheme has wide applicability in problems involving multisensor signal modeling and estimation in biomedical engineering systems. PMID:25265627
NASA Astrophysics Data System (ADS)
Chen, Hui; Tan, Chao; Lin, Zan; Wu, Tong
2018-01-01
Milk is one of the most popular nutrient sources worldwide and is of great interest due to its beneficial properties. The feasibility of classifying milk powder samples with respect to their brands and of determining protein concentration is investigated by NIR spectroscopy along with chemometrics. Two datasets were prepared for the experiments. One contains 179 samples of four brands for classification and the other contains 30 samples for quantitative analysis. Principal component analysis (PCA) was used for exploratory analysis. Based on an effective model-independent variable selection method, i.e., minimal-redundancy maximal-relevance (MRMR), only 18 variables were selected to construct a partial least-square discriminant analysis (PLS-DA) model. On the test set, the PLS-DA model based on the selected variable set was compared with the full-spectrum PLS-DA model, both of which achieved 100% accuracy. In quantitative analysis, the partial least-square regression (PLSR) model constructed on the selected subset of 260 variables significantly outperforms the full-spectrum model. It seems that the combination of NIR spectroscopy, MRMR and PLS-DA or PLSR is a powerful tool for classifying different brands of milk and determining the protein content.
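MRMR is a greedy, model-independent filter; a minimal sketch follows, assuming mutual information for relevance and, as a simplification of the usual MI-based redundancy term, mean absolute correlation for redundancy. The names `spectra` and `brands` in the usage comment are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr(X, y, k):
    """Greedy minimal-redundancy maximal-relevance selection.
    Relevance: mutual information with the class label.
    Redundancy: mean |correlation| with already-chosen variables."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            if relevance[j] - redundancy > best_score:
                best, best_score = j, relevance[j] - redundancy
        selected.append(best)
    return selected

# Hypothetical usage: pick 18 wavelengths before fitting PLS-DA.
# idx = mrmr(spectra, brands, k=18)
```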
Use of internal control T-cell populations in the flow cytometric evaluation for T-cell neoplasms.
Hunt, Alicia M; Shallenberger, Wendy; Ten Eyck, Stephen P; Craig, Fiona E
2016-09-01
Flow cytometry is an important tool for identification of neoplastic T-cells, but immunophenotypic abnormalities are often subtle and must be distinguished from nonneoplastic subsets. Use of internal control (IC) T-cells in the evaluation for T-cell neoplasms was explored, both as a quality measure and as a reference for evaluating abnormal antigen expression. All peripheral blood specimens (3-month period), or those containing abnormal T-cells (29-month period), stained with CD45 V500, CD2 V450, CD3 PE-Cy7, CD7 PE, CD4 Per-CP-Cy5.5, CD8 APC-H7, CD56 APC, CD16&57 FITC, were evaluated. IC T-cells were identified (DIVA, BD Biosciences) and median fluorescence intensity (MFI) recorded. Selected files were merged and reference templates generated (Infinicyt, Cytognos). IC T-cells were present in all specimens, including those with abnormal T-cells, but subsets were less well-represented. IC T-cell CD3 MFI differed between instruments (p = 0.0007) and subsets (p < 0.001), but not specimen categories, and served as a longitudinal process control. Merged files highlighted small unusual IC T-cell subsets: CD2+(dim) (0.25% of total), CD2- (0.03% of total). An IC reference template highlighted neoplastic T-cells but was limited by staining variability (IC CD3 MFI differed between reference and test samples, p = 0.003). IC T-cells present in the majority of specimens can serve as positive and longitudinal process controls. Use of IC T-cells as an internal reference is limited by variable representation of subsets. Analysis of merged IC T-cells from previously analyzed patient samples can alert the interpreter to less-well-recognized non-neoplastic subsets. However, application of a merged-file IC reference template was limited by staining variability. © 2016 Clinical Cytometry Society. © 2016 International Clinical Cytometry Society.
Variability of Root Traits in Spring Wheat Germplasm
Narayanan, Sruthi; Mohan, Amita; Gill, Kulvinder S.; Prasad, P. V. Vara
2014-01-01
Root traits influence the amount of water and nutrient absorption, and are important for maintaining crop yield under drought conditions. The objectives of this research were to characterize the variability of root traits among spring wheat genotypes and to determine whether root traits are related to shoot traits (plant height, tiller number per plant, shoot dry weight, and coleoptile length), regions of origin, and market classes. Plants were grown in 150-cm columns for 61 days in a greenhouse under optimal growth conditions. Rooting depth, root dry weight, root:shoot ratio, and shoot traits were determined for 297 genotypes of the Cultivated Wheat Collection (CWC) germplasm. The remaining root traits, such as total root length and surface area, were measured for a subset of 30 genotypes selected based on rooting depth. Significant genetic variability was observed for root traits among spring wheat genotypes in the CWC germplasm and its subset. Genotypes Sonora and Currawa ranked high, and genotype Vandal ranked low, for most root traits. A positive relationship (R² ≥ 0.35) was found between root and shoot dry weights within the CWC germplasm, and between total root surface area and tiller number, total root surface area and shoot dry weight, and total root length and coleoptile length within the subset. No correlations were found between plant height and most root traits within the CWC germplasm or its subset. Region of origin had a significant impact on rooting depth in the CWC germplasm. Wheat genotypes collected from Australia, the Mediterranean, and west Asia had greater rooting depth than those from south Asia, Latin America, Mexico, and Canada. Soft wheat had greater rooting depth than hard wheat in the CWC germplasm. The genetic variability identified in this research for root traits can be exploited to improve drought tolerance and/or resource capture in wheat. PMID:24945438
Intraclonal Cell Expansion and Selection Driven by B Cell Receptor in Chronic Lymphocytic Leukemia
Colombo, Monica; Cutrona, Giovanna; Reverberi, Daniele; Fabris, Sonia; Neri, Antonino; Fabbi, Marina; Quintana, Giovanni; Quarta, Giovanni; Ghiotto, Fabio; Fais, Franco; Ferrarini, Manlio
2011-01-01
The mutational status of the immunoglobulin heavy-chain variable region (IGHV) genes utilized by chronic lymphocytic leukemia (CLL) clones defines two disease subgroups. Patients with unmutated IGHV have a more aggressive disease and a worse outcome than patients with cells having somatic IGHV gene mutations. Moreover, up to 30% of the unmutated CLL clones exhibit very similar or identical B cell receptors (BcR), often encoded by the same IG genes. These “stereotyped” BcRs have been classified into defined subsets. The presence of an IGHV gene somatic mutation and the utilization of a skewed gene repertoire compared with normal B cells together with the expression of stereotyped receptors by unmutated CLL clones may indicate stimulation/selection by antigenic epitopes. This antigenic stimulation may occur prior to or during neoplastic transformation, but it is unknown whether this stimulation/selection continues after leukemogenesis has ceased. In this study, we focused on seven CLL cases with stereotyped BcR Subset #8 found among a cohort of 700 patients; in six, the cells expressed IgG and utilized IGHV4-39 and IGKV1-39/IGKV1D-39 genes, as reported for Subset #8 BcR. One case exhibited special features, including expression of IgM or IgG by different subclones consequent to an isotype switch, allelic inclusion at the IGH locus in the IgM-expressing cells and a particular pattern of cytogenetic lesions. Collectively, the data indicate a process of antigenic stimulation/selection of the fully transformed CLL cells leading to the expansion of the Subset #8 IgG-bearing subclone. PMID:21541442
Zuellig, Robert E.; Bruce, James F.; Evans, Erin E.; Stogner, Sr., Robert W.
2007-01-01
In 2003, the U.S. Geological Survey, in cooperation with Colorado Springs City Engineering, began a study to evaluate the influence of urbanization on stream ecosystems. To accomplish this task, invertebrate, fish, stream discharge, habitat, water-chemistry, and land-use data were collected from 13 sites in the Fountain Creek basin from 2003 to 2005. The Hydrologic Index Tool was used to calculate hydrologic indices known to be related to urbanization. Response of stream hydrology to urbanization was evident among hydrologic variables that described stormflow. These indices included one measurement of high-flow magnitude, two measurements of high-flow frequency, and one measurement of stream flashiness. Habitat and selected nonstormflow water chemistry were characterized at each site. Land-use data were converted to estimates of impervious surface cover and used as the annual measure of urbanization. Correlation analysis (Spearman's rho) was used to identify a suite of nonredundant streamflow, habitat, and water-chemistry variables that were strongly associated (rho > 0.60) with impervious surface cover but not strongly related to elevation (rho < 0.60). An exploratory multivariate analysis (BIO-ENV, PRIMER ver. 6.1, Plymouth, UK) was used to create subsets of eight urban-related environmental variables that described patterns in biological community structure. The strongest and most parsimonious subset of variables describing patterns in invertebrate community structure included high flood pulse count, lower bank capacity, and nutrients. Several other combinations of environmental variables resulted in competing subsets, but these subsets always included the three variables found in the most parsimonious list. This study found that patterns in invertebrate community structure from 2003 to 2005 in the Fountain Creek basin were associated with a variety of environmental characteristics influenced by urbanization. These patterns were explained by a combination of hydrologic, habitat, and water-chemistry variables. Fish community structure showed weaker links between urban-related environmental variables and biological patterns. A conceptual model was developed that showed the influence of urban-related environmental variables and their relation to fish and invertebrate assemblages. This model should prove helpful in guiding future studies on the impacts of urbanization on aquatic systems. Long-term monitoring efforts may be needed in other drainages along the Front Range of Colorado to link urban-related variables to aquatic communities in transition zone streams.
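The screening step described here, keeping variables strongly associated with impervious cover but not with elevation, maps directly onto a few lines of SciPy; a sketch using the thresholds quoted in the abstract (variable names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def screen_variables(env, impervious, elevation, hi=0.60, lo=0.60):
    """Keep variables with |rho| > hi against impervious surface cover
    but |rho| < lo against elevation. env: dict of name -> array over sites."""
    keep = []
    for name, values in env.items():
        r_urb, _ = spearmanr(values, impervious)
        r_elev, _ = spearmanr(values, elevation)
        if abs(r_urb) > hi and abs(r_elev) < lo:
            keep.append(name)
    return keep
```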
Fast Solution in Sparse LDA for Binary Classification
NASA Technical Reports Server (NTRS)
Moghaddam, Baback
2010-01-01
An algorithm that performs sparse linear discriminant analysis (Sparse-LDA) finds near-optimal solutions in far less time than the prior art when specialized to binary classification (of 2 classes). Sparse-LDA is a type of feature- or variable-selection problem with numerous applications in statistics, machine learning, computer vision, computational finance, operations research, and bioinformatics. Because of their combinatorial nature, feature- or variable-selection problems are NP-hard or computationally intractable in cases involving more than 30 variables or features. Therefore, one typically seeks approximate solutions by means of greedy search algorithms. The prior Sparse-LDA algorithm was a greedy algorithm that considered the best variable or feature to add/delete to/from its subsets in order to maximally discriminate between multiple classes of data. The present algorithm is designed for the special but prevalent case of 2-class or binary classification (e.g., 1 vs. 0, functioning vs. malfunctioning, or change vs. no change). The present algorithm provides near-optimal solutions on large real-world datasets having hundreds or even thousands of variables or features (e.g., selecting the fewest wavelength bands in a hyperspectral sensor to do terrain classification) and does so in typical computation times of minutes, as compared to days or weeks taken by the prior art. Sparse-LDA requires solving generalized eigenvalue problems for a large number of variable subsets (represented by the submatrices of the input within-class and between-class covariance matrices). In the general (full-rank) case, the amount of computation scales at least cubically with the number of variables, and thus the size of the problems that can be solved is limited accordingly. However, in binary classification, the principal eigenvalues can be found using a special analytic formula, without resorting to costly iterative techniques. The present algorithm exploits this analytic form along with the inherent sequential nature of greedy search itself. Together this enables the use of highly efficient partitioned-matrix-inverse techniques that yield large speedups in both the forward-selection and backward-elimination stages of greedy algorithms in general.
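A bare-bones sketch of the binary case, without the partitioned-matrix-inverse speedups the algorithm exploits: in two-class LDA the principal generalized eigenvalue reduces (up to scale) to the separation d(S) = Δμ_S' Sw_S⁻¹ Δμ_S on the candidate subset, so forward selection only needs one small linear solve per candidate.

```python
import numpy as np

def binary_sparse_lda(X0, X1, k):
    """Forward-select k features maximizing the two-class LDA criterion
    d(S) = dmu_S' inv(Sw_S) dmu_S, evaluated analytically per subset.
    X0, X1: (n0, p) and (n1, p) samples of the two classes."""
    dmu = X1.mean(axis=0) - X0.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for j in range(len(dmu)):
            if j in selected:
                continue
            S = selected + [j]
            d = dmu[S]
            val = d @ np.linalg.solve(Sw[np.ix_(S, S)], d)
            if val > best_val:
                best, best_val = j, val
        selected.append(best)
    return selected
```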
Wang, Xiaorong; Kang, Yu; Luo, Chunxiong; Zhao, Tong; Liu, Lin; Jiang, Xiangdan; Fu, Rongrong; An, Shuchang; Chen, Jichao; Jiang, Ning; Ren, Lufeng; Wang, Qi; Baillie, J Kenneth; Gao, Zhancheng; Yu, Jun
2014-02-11
Heteroresistance refers to phenotypic heterogeneity of microbial clonal populations under antibiotic stress, and it has been thought to reflect the allocation of a subset of "resistant" cells that survive higher concentrations of antibiotic. The assumption fits the so-called bet-hedging strategy, where a bacterial population "hedges" its "bet" on different phenotypes to be selected by unpredicted environmental stresses. To test this hypothesis, we constructed a heteroresistance model by introducing a blaCTX-M-14 gene (coding for a cephalosporin hydrolase) into a sensitive Escherichia coli strain. We confirmed heteroresistance in this clone and that a subset of the cells expressed more hydrolase and formed more colonies in the presence of ceftriaxone (exhibited stronger "resistance"). However, subsequent single-cell-level investigation using a microfluidic device showed that a subset of cells with a distinguishable phenotype of slowed growth and intensified hydrolase expression emerged, and these cells were not positively selected but increased their proportion in the population with ascending antibiotic concentrations. Therefore, heteroresistance, the gradually decreased colony-forming capability in the presence of antibiotic, was a result of a decreased growth rate rather than of selection for resistant cells. Using a mock strain without the resistance gene, we further demonstrated the existence of two nested growth-centric feedback loops that control the expression of the hydrolase and maximize population growth in various antibiotic concentrations. In conclusion, phenotypic heterogeneity is a population-based strategy beneficial for bacterial survival and propagation through task allocation and interphenotypic collaboration, and the growth rate provides a critical control for the expression of stress-related genes and an essential mechanism in responding to environmental stresses. Heteroresistance is essentially phenotypic heterogeneity, where a population-based strategy is thought to be at work, assumed to consist of variable cell-to-cell resistance to be selected under antibiotic stress. Exact mechanisms of heteroresistance and its roles in adaptation to antibiotic stress have yet to be fully understood at the molecular and single-cell levels. In our study, we were not able to detect any apparent subset of "resistant" cells selected by antibiotics; on the contrary, cell populations differentiated into phenotypic subsets with variable growth statuses and hydrolase expression. The growth rate appears to be sensitive to stress intensity and plays a key role in controlling hydrolase expression at both the bulk-population and single-cell levels. We have shown here, for the first time, that phenotypic heterogeneity can be beneficial to a growing bacterial population through task allocation and interphenotypic collaboration rather than through partitioning cells into different categories of selective advantage.
Selection of Representative Models for Decision Analysis Under Uncertainty
NASA Astrophysics Data System (ADS)
Meira, Luis A. A.; Coelho, Guilherme P.; Santos, Antonio Alberto S.; Schiozer, Denis J.
2016-03-01
The decision-making process in oil fields includes a step of risk analysis associated with the uncertainties present in the variables of the problem. Such uncertainties lead to hundreds, even thousands, of possible scenarios that must be analyzed so that an effective production strategy can be selected. Given this high number of scenarios, a technique to reduce this set to a smaller, feasible subset of representative scenarios is imperative. The selected scenarios must be representative of the original set and also free of optimistic and pessimistic bias. This paper proposes an assisted methodology to identify representative models in oil fields. To do so, first a mathematical function was developed to model the representativeness of a subset of models with respect to the full set that characterizes the problem. Then, an optimization tool was implemented to identify the representative models of any problem, considering not only the cross-plots of the main output variables, but also the risk curves and the probability distribution of the attribute-levels of the problem. The proposed technique was applied to two benchmark cases and the results, evaluated by experts in the field, indicate that the obtained solutions are richer than those identified by previously adopted manual approaches. The program bytecode is available upon request.
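The paper's representativeness function covers cross-plots, risk curves, and attribute probabilities jointly; as a toy stand-in for the idea only, the sketch below greedily picks scenarios whose empirical quantiles of a single output variable (say, NPV) best match the full ensemble's, which also discourages optimistic or pessimistic bias.

```python
import numpy as np

def mismatch(subset_vals, full_vals, qs=np.linspace(0.1, 0.9, 9)):
    """Distance between subset and full-ensemble quantiles."""
    return np.abs(np.quantile(subset_vals, qs)
                  - np.quantile(full_vals, qs)).sum()

def select_representative(values, k):
    """Greedy: add the scenario that most reduces the quantile mismatch.
    values: 1-D array of one output variable over all scenarios."""
    chosen = []
    for _ in range(k):
        scores = [(mismatch(values[chosen + [i]], values), i)
                  for i in range(len(values)) if i not in chosen]
        chosen.append(min(scores)[1])
    return chosen
```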
Water quality parameter measurement using spectral signatures
NASA Technical Reports Server (NTRS)
White, P. E.
1973-01-01
Regression analysis is applied to the problem of measuring water quality parameters from remote sensing spectral signature data. The equations necessary to perform regression analysis are presented and methods of testing the strength and reliability of a regression are described. An efficient algorithm for selecting an optimal subset of the independent variables available for a regression is also presented.
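The abstract does not specify the selection algorithm; as a baseline for what "selecting an optimal subset of the independent variables" can mean, here is a brute-force search over small subsets scored by adjusted R², feasible only for modest variable pools.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, max_size=3):
    """Exhaustively score subsets of columns of X by adjusted R^2."""
    n, p = X.shape
    best = (-np.inf, None)
    tss = ((y - y.mean()) ** 2).sum()
    for k in range(1, max_size + 1):
        for S in combinations(range(p), k):
            A = np.column_stack([np.ones(n), X[:, S]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = ((y - A @ beta) ** 2).sum()
            adj = 1 - (rss / (n - k - 1)) / (tss / (n - 1))
            best = max(best, (adj, S))
    return best  # (adjusted R^2, selected column indices)
```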
Sex determination of the Acadian Flycatcher using discriminant analysis
Wilson, R.R.
1999-01-01
I used five morphometric variables from 114 individuals captured in Arkansas to develop a discriminant model to predict the sex of Acadian Flycatchers (Empidonax virescens). Stepwise discriminant function analyses selected wing chord and tail length as the most parsimonious subset of variables for discriminating sex. This two-variable model correctly classified 80% of females and 97% of males used to develop the model. Validation of the model using 19 individuals from Louisiana and Virginia resulted in 100% correct classification of males and females. This model provides criteria for sexing monomorphic Acadian Flycatchers during the breeding season and possibly during the winter.
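The resulting two-variable rule is easy to reproduce with any discriminant-analysis routine; a sketch with scikit-learn, using made-up wing chord and tail length measurements (mm) purely for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical training data: columns are wing chord and tail length (mm).
X = np.array([[72.1, 58.3], [68.0, 54.9], [73.4, 59.0], [67.2, 53.8]])
y = np.array(["M", "F", "M", "F"])

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[70.5, 57.0]]))        # predicted sex
print(lda.predict_proba([[70.5, 57.0]]))  # class probabilities
```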
Yoo, Jin Eun
2018-01-01
A substantial body of research has been conducted on variables relating to students' mathematics achievement with TIMSS. However, most studies have employed conventional statistical methods and have focused on a select few indicators instead of utilizing the hundreds of variables TIMSS provides. This study aimed to find a prediction model for students' mathematics achievement using as many TIMSS student and teacher variables as possible. Elastic net, the machine learning technique selected in this study, takes advantage of both LASSO and ridge in terms of variable selection and multicollinearity, respectively. A logistic regression model was employed to predict TIMSS 2011 Korean 4th graders' mathematics achievement. Ten-fold cross-validation with mean squared error was employed to determine the elastic net regularization parameter. Among 162 TIMSS variables explored, 12 student and 5 teacher variables were selected in the elastic net model, and the prediction accuracy, sensitivity, and specificity were 76.06, 70.23, and 80.34%, respectively. This study showed that the elastic net method can be successfully applied to educational large-scale data by selecting a subset of variables with reasonable prediction accuracy and by finding new variables to predict students' mathematics achievement. Newly found variables via machine learning can shed light on the existing theories from a totally different perspective, which in turn can prompt the creation of new theories or complement existing ones. This study also examined the current scale development convention from a machine learning perspective. PMID:29599736
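An elastic net of the kind described, a logistic model with mixed L1/L2 penalty tuned by ten-fold cross-validation, can be sketched as follows; the data here are synthetic stand-ins for the 162 TIMSS variables, and the grid values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 162))          # stand-in for 162 TIMSS variables
y = (X[:, :12].sum(axis=1) + rng.normal(size=400) > 0).astype(int)

# l1_ratio mixes LASSO (1.0) and ridge (0.0); 10-fold CV picks the penalty.
model = LogisticRegressionCV(
    Cs=5, cv=10, penalty="elasticnet", solver="saga",
    l1_ratios=[0.1, 0.5, 0.9], max_iter=5000).fit(X, y)
print(np.flatnonzero(model.coef_).size, "variables kept")
```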
Ni, Ai; Cai, Jianwen
2018-07-01
Case-cohort designs are commonly used in large epidemiological studies to reduce the cost associated with covariate measurement. In many such studies the number of covariates is very large, so an efficient variable selection method is needed for case-cohort studies where the covariates are only observed in a subset of the sample. Current literature on this topic has focused on the proportional hazards model. However, in many studies the additive hazards model is preferred, either because the proportional hazards assumption is violated or because the additive hazards model provides more relevant information for the research question. Motivated by one such study, the Atherosclerosis Risk in Communities (ARIC) study, we investigate the properties of a regularized variable selection procedure in stratified case-cohort design under an additive hazards model with a diverging number of parameters. We establish the consistency and asymptotic normality of the penalized estimator and prove its oracle property. Simulation studies are conducted to assess the finite sample performance of the proposed method with a modified cross-validation tuning parameter selection method. We apply the variable selection procedure to the ARIC study to demonstrate its practical use.
Biochemical Sensors Using Carbon Nanotube Arrays
NASA Technical Reports Server (NTRS)
Meyyappan, Meyya (Inventor); Cassell, Alan M. (Inventor); Li, Jun (Inventor)
2011-01-01
Method and system for detecting presence of biomolecules in a selected subset, or in each of several selected subsets, in a fluid. Each of an array of two or more carbon nanotubes ("CNTs") is connected at a first CNT end to one or more electronics devices, each of which senses a selected electrochemical signal that is generated when a target biomolecule in the selected subset becomes attached to a functionalized second end of the CNT, which is covalently bonded with a probe molecule. This approach indicates when target biomolecules in the selected subset are present, and it indicates the presence or absence of target biomolecules in two or more selected subsets. Alternatively, presence or absence of an analyte can be detected.
Spectral Band Selection for Urban Material Classification Using Hyperspectral Libraries
NASA Astrophysics Data System (ADS)
Le Bris, A.; Chehata, N.; Briottet, X.; Paparoditis, N.
2016-06-01
In urban areas, information concerning very high resolution land cover, and especially material maps, is necessary for several city modelling or monitoring applications. That is to say, knowledge concerning the roofing materials or the different kinds of ground areas is required. Airborne remote sensing techniques appear to be convenient for providing such information at a large scale. However, results obtained using most traditional processing methods based on usual red-green-blue-near infrared multispectral images remain limited for such applications. A possible way to improve classification results is to enhance the imagery spectral resolution using superspectral or hyperspectral sensors. In this study, it is intended to design a superspectral sensor dedicated to urban materials classification, and this work particularly focused on the selection of the optimal spectral band subsets for such a sensor. First, reflectance spectral signatures of urban materials were collected from 7 spectral libraries. Then, spectral optimization was performed using this data set. The band selection workflow included two steps, optimising first the number of spectral bands using an incremental method and then examining several possible optimised band subsets using a stochastic algorithm. The same wrapper relevance criterion, relying on a confidence measure of a Random Forests classifier, was used at both steps. To cope with the limited number of available spectra for several classes, additional synthetic spectra were generated from the collection of reference spectra: intra-class variability was simulated by multiplying reference spectra by a random coefficient. Finally, the selected band subsets were evaluated considering the classification quality reached using an RBF SVM classifier. It was confirmed that a limited band subset was sufficient to classify common urban materials. The important contribution of bands from the Short Wave Infra-Red (SWIR) spectral domain (1000-2400 nm) to material classification was also shown.
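A sketch of two of the ingredients named here, under simplifying assumptions: intra-class variability simulated by multiplying reference spectra by a random coefficient, and incremental (forward) band selection with a Random Forest wrapper, using cross-validated accuracy in place of the paper's RF confidence measure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def augment(spectra, n, scale=0.1, seed=0):
    """Synthetic spectra: reference spectra times a random coefficient."""
    rng = np.random.default_rng(seed)
    coefs = 1 + scale * rng.standard_normal((n, 1))
    return spectra[rng.integers(len(spectra), size=n)] * coefs

def forward_band_selection(X, y, n_bands, cv=3):
    """Incrementally add the band that most improves the wrapper score."""
    selected = []
    for _ in range(n_bands):
        best, best_score = None, -np.inf
        for b in range(X.shape[1]):
            if b in selected:
                continue
            score = cross_val_score(
                RandomForestClassifier(n_estimators=100, random_state=0),
                X[:, selected + [b]], y, cv=cv).mean()
            if score > best_score:
                best, best_score = b, score
        selected.append(best)
    return selected
```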
Biagiotti, R; Desii, C; Vanzi, E; Gacci, G
1999-02-01
To compare the performance of artificial neural networks (ANNs) with that of multiple logistic regression (MLR) models for predicting ovarian malignancy in patients with adnexal masses by using transvaginal B-mode and color Doppler flow ultrasonography (US). A total of 226 adnexal masses were examined before surgery: fifty-one were malignant and 175 were benign. The data were divided into training and testing subsets by using a leave-n-out method. The training subsets were used to compute the optimum MLR equations and to train the ANNs. The cross-validation subsets were used to estimate the performance of each of the two models in predicting ovarian malignancy. At testing, three-layer back-propagation networks, based on the same input variables selected by using MLR (i.e., women's ages, papillary projections, random echogenicity, peak systolic velocity, and resistance index), had a significantly higher sensitivity than did MLR (96% vs. 84%; McNemar test, p = .04). The Brier scores for ANNs were significantly lower than those calculated for MLR (Student t test for paired samples, p = .004). ANNs might have potential for categorizing adnexal masses as either malignant or benign on the basis of multiple variables related to demographic and US features.
Data-driven confounder selection via Markov and Bayesian networks.
Häggström, Jenny
2018-06-01
To unbiasedly estimate a causal effect on an outcome unconfoundedness is often assumed. If there is sufficient knowledge on the underlying causal structure then existing confounder selection criteria can be used to select subsets of the observed pretreatment covariates, X, sufficient for unconfoundedness, if such subsets exist. Here, estimation of these target subsets is considered when the underlying causal structure is unknown. The proposed method is to model the causal structure by a probabilistic graphical model, for example, a Markov or Bayesian network, estimate this graph from observed data and select the target subsets given the estimated graph. The approach is evaluated by simulation both in a high-dimensional setting where unconfoundedness holds given X and in a setting where unconfoundedness only holds given subsets of X. Several common target subsets are investigated and the selected subsets are compared with respect to accuracy in estimating the average causal effect. The proposed method is implemented with existing software that can easily handle high-dimensional data, in terms of large samples and large number of covariates. The results from the simulation study show that, if unconfoundedness holds given X, this approach is very successful in selecting the target subsets, outperforming alternative approaches based on random forests and LASSO, and that the subset estimating the target subset containing all causes of outcome yields smallest MSE in the average causal effect estimation. © 2017, The International Biometric Society.
Torija, Antonio J; Ruiz, Diego P
2015-02-01
The prediction of environmental noise in urban environments requires the solution of a complex and non-linear problem, since there are complex relationships among the multitude of variables involved in the characterization and modelling of environmental noise and environmental-noise magnitudes. Moreover, the inclusion of the great spatial heterogeneity characteristic of urban environments seems to be essential in order to achieve an accurate environmental-noise prediction in cities. This problem is addressed in this paper, where a procedure based on feature-selection techniques and machine-learning regression methods is proposed and applied to this environmental problem. Three machine-learning regression methods, which are considered very robust in solving non-linear problems, are used to estimate the energy-equivalent sound-pressure level descriptor (LAeq). These three methods are: (i) multilayer perceptron (MLP), (ii) sequential minimal optimisation (SMO), and (iii) Gaussian processes for regression (GPR). In addition, because of the high number of input variables involved in environmental-noise modelling and estimation in urban environments, which make LAeq prediction models quite complex and costly in terms of time and resources for application to real situations, three different techniques are used to approach feature selection or data reduction. The feature-selection techniques used are: (i) correlation-based feature-subset selection (CFS), (ii) wrapper for feature-subset selection (WFS), and the data reduction technique is principal-component analysis (PCA). The subsequent analysis leads to a proposal of different schemes, depending on the needs regarding data collection and accuracy. The use of WFS as the feature-selection technique with the implementation of SMO or GPR as regression algorithm provides the best LAeq estimation (R² = 0.94 and mean absolute error (MAE) = 1.14-1.16 dB(A)). Copyright © 2014 Elsevier B.V. All rights reserved.
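Of the techniques listed, CFS is the most self-contained to illustrate; a sketch of Hall's merit heuristic (an assumption: the paper does not spell out its CFS variant), which rewards feature-target correlation and penalizes feature-feature correlation, with forward search stopping once the merit no longer improves.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit: k*mean|corr(f,y)| / sqrt(k + k(k-1)*mean|corr(f,f')|)."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward(X, y, max_k):
    """Greedy forward search on the merit function."""
    subset = []
    while len(subset) < max_k:
        scores = [(cfs_merit(X, y, subset + [j]), j)
                  for j in range(X.shape[1]) if j not in subset]
        best_score, best_j = max(scores)
        if subset and best_score <= cfs_merit(X, y, subset):
            break  # merit no longer improves
        subset.append(best_j)
    return subset
```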
Genetic Algorithms Applied to Multi-Objective Aerodynamic Shape Optimization
NASA Technical Reports Server (NTRS)
Holst, Terry L.
2005-01-01
A genetic algorithm approach suitable for solving multi-objective problems is described and evaluated using a series of aerodynamic shape optimization problems. Several new features including two variations of a binning selection algorithm and a gene-space transformation procedure are included. The genetic algorithm is suitable for finding Pareto optimal solutions in search spaces that are defined by any number of genes and that contain any number of local extrema. A new masking array capability is included allowing any gene or gene subset to be eliminated as decision variables from the design space. This allows determination of the effect of a single gene or gene subset on the Pareto optimal solution. Results indicate that the genetic algorithm optimization approach is flexible in application and reliable. The binning selection algorithms generally provide Pareto front quality enhancements and moderate convergence efficiency improvements for most of the problems solved.
Surface Estimation, Variable Selection, and the Nonparametric Oracle Property
Storlie, Curtis B.; Bondell, Howard D.; Reich, Brian J.; Zhang, Hao Helen
2010-01-01
Variable selection for multivariate nonparametric regression is an important, yet challenging, problem due, in part, to the infinite dimensionality of the function space. An ideal selection procedure should be automatic, stable, easy to use, and have desirable asymptotic properties. In particular, we define a selection procedure to be nonparametric oracle (np-oracle) if it consistently selects the correct subset of predictors and at the same time estimates the smooth surface at the optimal nonparametric rate, as the sample size goes to infinity. In this paper, we propose a model selection procedure for nonparametric models, and explore the conditions under which the new method enjoys the aforementioned properties. Developed in the framework of smoothing spline ANOVA, our estimator is obtained via solving a regularization problem with a novel adaptive penalty on the sum of functional component norms. Theoretical properties of the new estimator are established. Additionally, numerous simulated and real examples further demonstrate that the new approach substantially outperforms other existing methods in the finite sample setting. PMID:21603586
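In rough form, the estimator solves a problem of this shape (a sketch of a COSSO-style adaptive penalty; the paper's exact norms and weights may differ):

```latex
\hat{f} \;=\; \operatorname*{arg\,min}_{f \,=\, b + \sum_{j} f_j}
\;\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^{2}
\;+\; \tau \sum_{j=1}^{p} w_j \,\lVert f_j \rVert ,
\qquad w_j = \lVert \tilde{f}_j \rVert^{-\gamma},
```

where the f_j are components of the smoothing spline ANOVA decomposition and the weights come from a pilot estimate \tilde{f}. Because the penalty is on component norms rather than squared norms, weak components are shrunk exactly to zero (variable selection), while the adaptive weights relax the shrinkage on strong components enough to retain the optimal estimation rate.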
Bayesian Ensemble Trees (BET) for Clustering and Prediction in Heterogeneous Data
Duan, Leo L.; Clancy, John P.; Szczesniak, Rhonda D.
2016-01-01
We propose a novel “tree-averaging” model that utilizes the ensemble of classification and regression trees (CART). Each constituent tree is estimated with a subset of similar data. We treat this grouping of subsets as Bayesian Ensemble Trees (BET) and model them as a Dirichlet process. We show that BET determines the optimal number of trees by adapting to the data heterogeneity. Compared with the other ensemble methods, BET requires much fewer trees and shows equivalent prediction accuracy using weighted averaging. Moreover, each tree in BET provides variable selection criterion and interpretation for each subset. We developed an efficient estimating procedure with improved estimation strategies in both CART and mixture models. We demonstrate these advantages of BET with simulations and illustrate the approach with a real-world data example involving regression of lung function measurements obtained from patients with cystic fibrosis. Supplemental materials are available online. PMID:27524872
Inthachot, Montri; Boonjing, Veera; Intakosum, Sarun
2016-01-01
This study investigated the use of Artificial Neural Network (ANN) and Genetic Algorithm (GA) for prediction of Thailand's SET50 index trend. ANN is a widely accepted machine learning method that uses past data to predict future trends, while GA is an algorithm that can find better subsets of input variables for importing into ANN, enabling more accurate prediction through efficient feature selection. The input data were technical indicators highly regarded by stock analysts, each represented by 4 variables based on past time spans of 4 different lengths: 3-, 5-, 10-, and 15-day spans before the day of prediction. This generated a large set of diverse input variables with an exponentially higher number of possible subsets, which GA culled down to a manageable number of more effective ones. SET50 index data of the past 6 years, from 2009 to 2014, were used to evaluate the prediction accuracy of this hybrid intelligence, and its prediction results were found to be more accurate than those made by a method using only one input variable for one fixed length of past time span. PMID:27974883
Olivera, André Rodrigues; Roesler, Valter; Iochpe, Cirano; Schmidt, Maria Inês; Vigo, Álvaro; Barreto, Sandhi Maria; Duncan, Bruce Bartholow
2017-01-01
Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task. Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil. After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest. The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step. Most of the predictive models produced similar results, and demonstrated the feasibility of identifying individuals with the highest probability of having undiagnosed diabetes through easily-obtained clinical data.
Morales, Dinora Araceli; Bengoetxea, Endika; Larrañaga, Pedro; García, Miguel; Franco, Yosu; Fresnada, Mónica; Merino, Marisa
2008-05-01
In vitro fertilization (IVF) is a medically assisted reproduction technique that enables infertile couples to achieve successful pregnancy. Given the uncertainty of the treatment, we propose an intelligent decision support system, based on supervised classification by Bayesian classifiers, to aid in the selection of the most promising embryos that will form the batch to be transferred to the woman's uterus. The aim of the supervised classification system is to improve the overall success rate of each IVF treatment in which a batch of embryos is transferred each time, where success is achieved when implantation (i.e. pregnancy) is obtained. Due to ethical reasons, different legislative restrictions apply in every country on this technique. In Spain, legislation allows a maximum of three embryos to form each transfer batch. As a result, clinicians prefer to select the embryos by non-invasive embryo examination based on simple methods and observation focused on the morphology and dynamics of embryo development after fertilization. This paper proposes the application of Bayesian classifiers to this embryo selection problem in order to provide a decision support system that allows a more accurate selection than the current procedures, which rely fully on the expertise and experience of embryologists. For this, we propose to take into consideration a reduced subset of feature variables related to embryo morphology and clinical data of patients, and to induce Bayesian classification models from these data. Results obtained applying a filter technique to choose the subset of variables, and the performance of Bayesian classifiers using them, are presented.
Circulating B cells in type 1 diabetics exhibit fewer maturation-associated phenotypes.
Hanley, Patrick; Sutter, Jennifer A; Goodman, Noah G; Du, Yangzhu; Sekiguchi, Debora R; Meng, Wenzhao; Rickels, Michael R; Naji, Ali; Luning Prak, Eline T
2017-10-01
Although autoantibodies have been used for decades as diagnostic and prognostic markers in type 1 diabetes (T1D), further analysis of developmental abnormalities in B cells could reveal tolerance checkpoint defects that could improve individualized therapy. To evaluate B cell developmental progression in T1D, immunophenotyping was used to classify circulating B cells into transitional, mature naïve, mature activated, and resting memory subsets. Then each subset was analyzed for the expression of additional maturation-associated markers. While the frequencies of B cell subsets did not differ significantly between patients and controls, some T1D subjects exhibited reduced proportions of B cells that expressed transmembrane activator and CAML interactor (TACI) and Fas receptor (FasR). Furthermore, some T1D subjects had B cell subsets with lower frequencies of class switching. These results suggest circulating B cells exhibit variable maturation phenotypes in T1D. These phenotypic variations may correlate with differences in B cell selection in individual T1D patients. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Gene selection heuristic algorithm for nutrigenomics studies.
Valour, D; Hue, I; Grimard, B; Valour, B
2013-07-15
Large datasets from -omics studies need to be deeply investigated. The aim of this paper is to provide a new method (the LEM method) for the search of transcriptome and metabolome connections. The heuristic algorithm described here extends classical canonical correlation analysis (CCA) to a high number of variables (without regularization) and combines good conditioning with fast computation in R. Reduced CCA models are summarized in PageRank matrices, the product of which gives a stochastic matrix that summarizes the self-avoiding walk covered by the algorithm. Then, a homogeneous Markov process applied to this stochastic matrix converges the probabilities of interconnection between genes, providing a selection of disjoint subsets of genes. This is an alternative to regularized generalized CCA for the determination of blocks within the structure matrix. Each gene subset is thus linked to the whole metabolic or clinical dataset that represents the biological phenotype of interest. Moreover, this selection process reaches the aim of biologists, who often need small sets of genes for further validation or extended phenotyping. The algorithm is shown to work efficiently on three published datasets, resulting in meaningfully broadened gene networks.
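The final step, running a homogeneous Markov process on the product stochastic matrix until the interconnection probabilities converge, amounts to iterating the chain; a minimal sketch (the construction of P from the reduced CCA models is not reproduced here, and the cutoff is a hypothetical parameter):

```python
import numpy as np

def stationary_probabilities(P, tol=1e-10, max_iter=10000):
    """Iterate a homogeneous Markov chain on a row-stochastic matrix P
    until the gene-interconnection probabilities converge."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).max() < tol:
            break
        pi = nxt
    return pi

# Genes whose converged probability exceeds a threshold form one block:
# block = np.flatnonzero(stationary_probabilities(P) > cutoff)
```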
2009-01-01
...selection and uncertainty sampling significantly. Index Terms: transcription, labeling, submodularity, submodular selection, active learning, sequence... [under the] name of batch active learning, where a subset of data that is most informative and representative of the whole is selected for labeling. Often... representative subset. Note that our Fisher kernel is over an unsupervised generative model, which enables us to bootstrap our active learning approach...
Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE.
Chen, Qi; Meng, Zhaopeng; Liu, Xinyi; Jin, Qianguo; Su, Ran
2018-06-15
Feature selection, which identifies a set of most informative features from the original feature space, has been widely used to simplify the predictor. Recursive feature elimination (RFE), as one of the most popular feature selection approaches, is effective in data dimension reduction and efficiency increase. A ranking of features, as well as candidate subsets with the corresponding accuracy, is produced through RFE. The subset with highest accuracy (HA) or a preset number of features (PreNum) are often used as the final subset. However, this may lead to a large number of features being selected, or if there is no prior knowledge about this preset number, it is often ambiguous and subjective regarding final subset selection. A proper decision variant is in high demand to automatically determine the optimal subset. In this study, we conduct pioneering work to explore the decision variant after obtaining a list of candidate subsets from RFE. We provide a detailed analysis and comparison of several decision variants to automatically select the optimal feature subset. Random forest (RF)-recursive feature elimination (RF-RFE) algorithm and a voting strategy are introduced. We validated the variants on two totally different molecular biology datasets, one for a toxicogenomic study and the other one for protein sequence analysis. The study provides an automated way to determine the optimal feature subset when using RF-RFE.
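With scikit-learn (1.0 or later for cv_results_), RF-RFE plus one example decision variant can be sketched as follows; the one-standard-error rule shown here is our illustration of an automatic variant that prefers smaller subsets, not necessarily one of the paper's variants, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # 5 informative features

rfe = RFECV(RandomForestClassifier(n_estimators=200, random_state=0),
            step=1, cv=5).fit(X, y)

# Decision variant: instead of the highest-accuracy (HA) subset, take the
# smallest subset whose CV score is within one standard error of the best.
scores = rfe.cv_results_["mean_test_score"]
ses = rfe.cv_results_["std_test_score"] / np.sqrt(5)
best = scores.argmax()
n_keep = (np.flatnonzero(scores >= scores[best] - ses[best])[0]
          + rfe.min_features_to_select)
print("HA size:", rfe.n_features_, "| 1-SE size:", n_keep)
```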
Qian, Chongsheng; Wang, Yingying; Cai, Huili; Laroye, Caroline; De Carvalho Bittencourt, Marcelo; Clement, Laurence; Stoltz, Jean-François; Decot, Véronique; Reppel, Loïc; Bensoussan, Danièle
2016-01-01
Adoptive antiviral cellular immunotherapy by infusion of virus-specific T cells (VSTs) is becoming an alternative treatment for viral infection after hematopoietic stem cell transplantation. The T memory stem cell (TSCM) subset was recently described as exhibiting self-renewal and multipotency properties which are required for sustained efficacy in vivo. We wondered if such a crucial subset for immunotherapy was present in VSTs. We identified, by flow cytometry, TSCM in adenovirus (ADV)-specific interferon (IFN)-γ+ T cells before and after IFN-γ-based immunomagnetic selection, and analyzed the distribution of the main T-cell subsets in VSTs: naive T cells (TN), TSCM, T central memory cells (TCM), T effector memory cells (TEM), and effector T cells (TEFF). In this study, all of the different T-cell subsets were observed in blood samples from healthy donor ADV-VSTs, both before and after IFN-γ-based immunomagnetic selection. As the IFN-γ-based immunomagnetic selection system sorts mainly the most differentiated T-cell subsets, we observed that TEM was always the major T-cell subset of ADV-specific T cells after immunomagnetic isolation, and especially after expansion in vitro. Comparing T-cell subpopulation profiles before and after in vitro expansion, we observed that in vitro cell culture with interleukin-2 resulted in a significant expansion of TN-like, TCM, TEM, and TEFF subsets in CD4+ IFN-γ+ T cells and of TCM and TEM subsets only in CD8+ IFN-γ+ T cells. We demonstrated the presence of all T-cell subsets in IFN-γ+ VSTs, including the TSCM subpopulation, although this was weakly selected by the IFN-γ-based immunomagnetic selection system.
Zhou, Fuqun; Zhang, Aining
2016-01-01
Nowadays, various time-series Earth Observation data with multiple bands are freely available, such as Moderate Resolution Imaging Spectroradiometer (MODIS) datasets including 8-day composites from NASA, and 10-day composites from the Canada Centre for Remote Sensing (CCRS). It is challenging to efficiently use these time-series MODIS datasets for long-term environmental monitoring due to their vast volume and information redundancy. This challenge will be greater when Sentinel 2–3 data become available. Another challenge that researchers face is the lack of in-situ data for supervised modelling, especially for time-series data analysis. In this study, we attempt to tackle the two important issues with a case study of land cover mapping using CCRS 10-day MODIS composites with the help of Random Forests’ features: variable importance, outlier identification. The variable importance feature is used to analyze and select optimal subsets of time-series MODIS imagery for efficient land cover mapping, and the outlier identification feature is utilized for transferring sample data available from one year to an adjacent year for supervised classification modelling. The results of the case study of agricultural land cover classification at a regional scale show that using only about a half of the variables we can achieve land cover classification accuracy close to that generated using the full dataset. The proposed simple but effective solution of sample transferring could make supervised modelling possible for applications lacking sample data. PMID:27792152
Armañanzas, Rubén; Bielza, Concha; Chaudhuri, Kallol Ray; Martinez-Martin, Pablo; Larrañaga, Pedro
2013-07-01
Is it possible to predict the severity staging of a Parkinson's disease (PD) patient using scores of non-motor symptoms? This is the kickoff question for a machine learning approach to classify two widely known PD severity indexes using individual tests from a broad set of non-motor PD clinical scales only. The Hoehn & Yahr index and clinical impression of severity index are global measures of PD severity. They constitute the labels to be assigned in two supervised classification problems using only non-motor symptom tests as predictor variables. Such predictors come from a wide range of PD symptoms, such as cognitive impairment, psychiatric complications, autonomic dysfunction or sleep disturbance. The classification was coupled with a feature subset selection task using an advanced evolutionary algorithm, namely an estimation of distribution algorithm. Results show how five different classification paradigms using a wrapper feature selection scheme are capable of predicting each of the class variables with estimated accuracy in the range of 72-92%. In addition, classification into the main three severity categories (mild, moderate and severe) was split into dichotomic problems where binary classifiers perform better and select different subsets of non-motor symptoms. The number of jointly selected symptoms throughout the whole process was low, suggesting a link between the selected non-motor symptoms and the general severity of the disease. Quantitative results are discussed from a medical point of view, reflecting a clear translation to the clinical manifestations of PD. Moreover, results include a brief panel of non-motor symptoms that could help clinical practitioners to identify patients who are at different stages of the disease from a limited set of symptoms, such as hallucinations, fainting, inability to control body sphincters or believing in unlikely facts. Copyright © 2013 Elsevier B.V. All rights reserved.
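The wrapper coupling of a classifier with an estimation of distribution algorithm can be sketched with the simplest member of the family, UMDA, over binary feature masks; Gaussian naive Bayes is a stand-in here (the study couples the EDA with five classification paradigms), and all parameter values are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def umda_feature_selection(X, y, pop=40, keep=20, iters=15, rng=None):
    """Univariate marginal EDA: sample feature masks, keep the best half
    by CV accuracy, re-estimate per-feature inclusion probabilities."""
    if rng is None:
        rng = np.random.default_rng(0)
    p = np.full(X.shape[1], 0.5)                  # inclusion probabilities
    for _ in range(iters):
        masks = rng.random((pop, X.shape[1])) < p
        masks[~masks.any(axis=1), 0] = True       # avoid empty subsets
        fitness = [cross_val_score(GaussianNB(), X[:, m], y, cv=3).mean()
                   for m in masks]
        elite = masks[np.argsort(fitness)[-keep:]]
        p = 0.9 * elite.mean(axis=0) + 0.05       # smooth away from 0/1
    return p > 0.5                                # final feature mask
```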
Nuutinen, Mikko; Leskelä, Riikka-Leena; Suojalehto, Ella; Tirronen, Anniina; Komssi, Vesa
2017-04-13
In previous years a substantial number of studies have identified statistically important predictors of nursing home admission (NHA). However, as far as we know, the analyses have been done at the population level. No prior research has analysed the prediction accuracy of an NHA model for individuals. This study is an analysis of 3056 longer-term home care customers in the city of Tampere, Finland. Data were collected from the records of social and health service usage and the RAI-HC (Resident Assessment Instrument - Home Care) assessment system between January 2011 and September 2015. The aim was to find the most efficient variable subsets for predicting NHA for individuals and to validate their accuracy. Variable subsets for predicting NHA were searched by the sequential forward selection (SFS) method, a variable ranking metric and the classifiers of logistic regression (LR), support vector machine (SVM) and Gaussian naive Bayes (GNB). Validation of the results was ensured using randomly balanced data sets and cross-validation. The primary performance metrics for the classifiers were prediction accuracy and AUC (average area under the curve). The LR and GNB classifiers achieved 78% accuracy for predicting NHA. The most important variables were RAI MAPLE (Method for Assigning Priority Levels), functional impairment (RAI IADL, Activities of Daily Living), cognitive impairment (RAI CPS, Cognitive Performance Scale), memory disorders (diagnoses G30-G32 and F00-F03) and the use of community-based health services and prior hospital use (emergency visits and periods of care). The accuracy of the classifier for individuals was high enough to convince the officials of the city of Tampere to integrate the predictive model based on the findings of this study as a part of the home care information system. Further work needs to be done to evaluate variables that are modifiable and responsive to interventions.
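Sequential forward selection with a logistic-regression wrapper and cross-validation maps directly onto scikit-learn; a sketch on synthetic stand-ins for the RAI and service-usage variables (the subset size of 5 is illustrative):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))        # stand-in RAI/service-usage variables
y = (X[:, :4].sum(axis=1) + rng.normal(size=500) > 0).astype(int)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5, direction="forward", cv=10).fit(X, y)
print(np.flatnonzero(sfs.get_support()))   # chosen variable indices
```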
Minimal ensemble based on subset selection using ECG to diagnose categories of CAN.
Abawajy, Jemal; Kelarev, Andrei; Yi, Xun; Jelinek, Herbert F
2018-07-01
Early diagnosis of cardiac autonomic neuropathy (CAN) is critical for reversing or slowing its progression and preventing complications. Diagnostic accuracy or precision is one of the core requirements of CAN detection. As the standard Ewing battery tests suffer from a number of shortcomings, research in automating and improving the early detection of CAN has recently received serious attention, both in identifying additional clinical variables and in designing advanced ensembles of classifiers to improve the accuracy or precision of CAN diagnostics. Although large ensembles are commonly proposed for the automated diagnosis of CAN, they are characterized by slow processing speed and high computational complexity. This paper applies ECG features and proposes a new ensemble-based approach for diagnosing CAN progression. We introduce a Minimal Ensemble Based On Subset Selection (MEBOSS) for the diagnosis of all categories of CAN, including early, definite and atypical CAN. MEBOSS is based on a novel multi-tier architecture applying classifier subset selection as well as training subset selection during several steps of its operation. Our experiments determined the diagnostic accuracy or precision obtained in 5 × 2 cross-validation for various options employed in MEBOSS and other classification systems. The experiments demonstrate the operation of the MEBOSS procedure invoking the most effective classifiers available in the open-source software environment SageMath. The results of our experiments show that for the large DiabHealth database of CAN-related parameters, MEBOSS outperformed other classification systems available in SageMath and achieved 94% to 97% precision in 5 × 2 cross-validation, correctly distinguishing any two CAN categories among a maximum of five categorizations including control, early, definite, severe and atypical CAN. These results show that the MEBOSS architecture is effective and can be recommended for practical implementation in systems for the diagnosis of CAN progression. Copyright © 2018 Elsevier B.V. All rights reserved.
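For intuition, a generic single-tier version of classifier subset selection: greedily grow a majority-vote ensemble from a pool, keeping an addition only if cross-validated precision improves. This is not MEBOSS itself, which is multi-tier, also selects training subsets, and runs on SageMath classifiers; the pool and data below are assumptions.

```python
# Greedy classifier subset selection scored by cross-validated precision.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=2)
pool = [("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=2)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=2))]

def vote_precision(members):
    ens = VotingClassifier(estimators=members, voting="hard")
    return cross_val_score(ens, X, y, cv=5, scoring="precision").mean()

chosen = [pool[0]]
best = vote_precision(chosen)
for name, est in pool[1:]:
    trial = chosen + [(name, est)]          # keep the classifier only if it helps
    score = vote_precision(trial)
    if score > best:
        chosen, best = trial, score
print([n for n, _ in chosen], f"precision={best:.3f}")
```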
Variable Selection for Regression Models of Percentile Flows
NASA Astrophysics Data System (ADS)
Fouad, G.
2017-12-01
Percentile flows describe the flow magnitude equaled or exceeded for a given percent of time, and are widely used in water resource management. However, these statistics are normally unavailable since most basins are ungauged. Percentile flows of ungauged basins are often predicted using regression models based on readily observable basin characteristics, such as mean elevation. The number of these independent variables is too large to evaluate all possible models. A subset of models is typically evaluated using automatic procedures, like stepwise regression. This ignores a large variety of methods from the field of feature (variable) selection and physical understanding of percentile flows. A study of 918 basins in the United States was conducted to compare an automatic regression procedure to the following variable selection methods: (1) principal component analysis, (2) correlation analysis, (3) random forests, (4) genetic programming, (5) Bayesian networks, and (6) physical understanding. The automatic regression procedure only performed better than principal component analysis. Poor performance of the regression procedure was due to a commonly used filter for multicollinearity, which rejected the strongest models because they had cross-correlated independent variables. Multicollinearity did not decrease model performance in validation because of a representative set of calibration basins. Variable selection methods based strictly on predictive power (numbers 2-5 from above) performed similarly, likely indicating a limit to the predictive power of the variables. Similar performance was also reached using variables selected based on physical understanding, a finding that substantiates recent calls to emphasize physical understanding in modeling for predictions in ungauged basins. The strongest variables highlighted the importance of geology and land cover, whereas widely used topographic variables were the weakest predictors. Variables suffered from a high degree of multicollinearity, possibly illustrating the co-evolution of climatic and physiographic conditions. Given the ineffectiveness of many variables used here, future work should develop new variables that target specific processes associated with percentile flows.
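One of the compared selection routes can be sketched as follows: rank basin attributes by random-forest importance, then regress a percentile flow on the top-ranked attributes. The synthetic data and the choice of five variables are placeholders for the study's 918-basin dataset.

```python
# Random-forest variable ranking feeding a linear percentile-flow model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, q50 = make_regression(n_samples=300, n_features=25, n_informative=5,
                         noise=10.0, random_state=3)
rf = RandomForestRegressor(n_estimators=200, random_state=3).fit(X, q50)
top = np.argsort(rf.feature_importances_)[-5:]       # five strongest variables
r2 = cross_val_score(LinearRegression(), X[:, top], q50,
                     cv=5, scoring="r2").mean()
print(f"variables {sorted(top)} give cross-validated R^2 = {r2:.2f}")
```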
Linear and nonlinear pattern selection in Rayleigh-Benard stability problems
NASA Technical Reports Server (NTRS)
Davis, Sanford S.
1993-01-01
A new algorithm is introduced to compute finite-amplitude states using primitive variables for Rayleigh-Benard convection on relatively coarse meshes. The algorithm is based on a finite-difference matrix-splitting approach that separates all physical and dimensional effects into one-dimensional subsets. The nonlinear pattern selection process for steady convection in an air-filled square cavity with insulated side walls is investigated for Rayleigh numbers up to 20,000. The internalization of disturbances that evolve into coherent patterns is investigated and transient solutions from linear perturbation theory are compared with and contrasted to the full numerical simulations.
Systematic wavelength selection for improved multivariate spectral analysis
Thomas, Edward V.; Robinson, Mark R.; Haaland, David M.
1995-01-01
Methods and apparatus for determining in a biological material one or more unknown values of at least one known characteristic (e.g. the concentration of an analyte such as glucose in blood, or the concentration of one or more blood gas parameters) with a model based on a set of samples with known values of the known characteristics and a multivariate algorithm using several wavelength subsets. The method includes selecting multiple wavelength subsets, from the electromagnetic spectral region appropriate for determining the known characteristic, for use by an algorithm wherein the selection of wavelength subsets improves the fitness of the model's determination of the unknown values of the known characteristic. The selection process utilizes multivariate search methods that select both predictive and synergistic wavelengths within the range of wavelengths utilized. The fitness of the wavelength subsets is determined by a fitness function F = f(cost, performance). The method includes the steps of: (1) using one or more applications of a genetic algorithm to produce one or more count spectra, with multiple count spectra then combined to produce a combined count spectrum; (2) smoothing the count spectrum; (3) selecting a threshold count from a count spectrum to select those wavelength subsets which optimize the fitness function; and (4) eliminating a portion of the selected wavelength subsets. The determination of the unknown values can be made: (1) noninvasively and in vivo; (2) invasively and in vivo; or (3) in vitro.
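A toy rendering of the genetic-algorithm search with a fitness of the form F = f(cost, performance): here performance is the cross-validated R^2 of a linear calibration model on the selected wavelengths and cost is a penalty on subset size. The synthetic spectra, penalty weight, and GA settings are assumptions, and the patent's count-spectrum and smoothing steps are omitted.

```python
# Genetic algorithm over binary wavelength masks; F = performance - cost.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n_samples, n_wl = 120, 60
X = rng.normal(size=(n_samples, n_wl))                 # stand-in spectra
y = X[:, [5, 12, 30]] @ np.array([1.0, -2.0, 1.5]) + rng.normal(0, 0.3, n_samples)

def fitness(mask):
    if mask.sum() == 0:
        return -1.0
    perf = cross_val_score(LinearRegression(), X[:, mask], y,
                           cv=5, scoring="r2").mean()
    return perf - 0.01 * mask.sum()                    # cost term: subset size

pop = rng.random((30, n_wl)) < 0.2
for _ in range(40):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]            # truncation selection
    children = []
    while len(children) < len(pop):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, n_wl)
        child = np.concatenate([a[:cut], b[cut:]])     # one-point crossover
        child ^= rng.random(n_wl) < 0.02               # bit-flip mutation
        children.append(child)
    pop = np.array(children)
best = pop[np.argmax([fitness(m) for m in pop])]
print("selected wavelengths:", np.flatnonzero(best))
```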
Sparse Zero-Sum Games as Stable Functional Feature Selection
Sokolovska, Nataliya; Teytaud, Olivier; Rizkalla, Salwa; Clément, Karine; Zucker, Jean-Daniel
2015-01-01
In large-scale systems biology applications, features are structured in hidden functional categories whose predictive power is identical. Feature selection, therefore, can lead not only to a problem with a reduced dimensionality, but also reveal some knowledge on functional classes of variables. In this contribution, we propose a framework based on a sparse zero-sum game which performs a stable functional feature selection. In particular, the approach is based on feature subsets ranking by a thresholding stochastic bandit. We provide a theoretical analysis of the introduced algorithm. We illustrate by experiments on both synthetic and real complex data that the proposed method is competitive from the predictive and stability viewpoints. PMID:26325268
Bell, Andrew S; Bradley, Joseph; Everett, Jeremy R; Loesel, Jens; McLoughlin, David; Mills, James; Peakman, Marie-Claire; Sharp, Robert E; Williams, Christine; Zhu, Hongyao
2016-11-01
High-throughput screening (HTS) is an effective method for lead and probe discovery that is widely used in industry and academia to identify novel chemical matter and to initiate the drug discovery process. However, HTS can be time-consuming and costly, and the use of subsets as an efficient alternative to screening entire compound collections has been investigated. Subsets may be selected on the basis of chemical diversity, molecular properties, biological activity diversity or biological target focus. Previously, we described a novel form of subset screening: plate-based diversity subset (PBDS) screening, in which the screening subset is constructed by plate selection (rather than individual compound cherry-picking), using algorithms that select for compound quality and chemical diversity on a plate basis. In this paper, we describe a second-generation approach to the construction of an updated subset, PBDS2, using both plate and individual compound selection, that has improved coverage of the chemical space of the screening file whilst selecting only the same number of plates for screening. We describe the validation of PBDS2 and its successful use in hit and lead discovery. PBDS2 screening became the default mode of singleton (one compound per well) HTS for lead discovery at Pfizer.
NASA Technical Reports Server (NTRS)
1979-01-01
A nonlinear, maximum likelihood, parameter identification computer program (NLSCIDNT) is described which evaluates rotorcraft stability and control coefficients from flight test data. The optimal estimates of the parameters (stability and control coefficients) are determined (identified) by minimizing the negative log likelihood cost function. The minimization technique is the Levenberg-Marquardt method, which behaves like the steepest descent method when it is far from the minimum and behaves like the modified Newton-Raphson method when it is nearer the minimum. Twenty-one states and 40 measurement variables are modeled, and any subset may be selected. States which are not integrated may be fixed at an input value, or time history data may be substituted for the state in the equations of motion. Any aerodynamic coefficient may be expressed as a nonlinear polynomial function of selected 'expansion variables'.
NASA Astrophysics Data System (ADS)
Khehra, Baljit Singh; Pharwaha, Amar Partap Singh
2017-04-01
Ductal carcinoma in situ (DCIS) is one type of breast cancer. Clusters of microcalcifications (MCCs) are symptoms of DCIS that are recognized by mammography. Selection of a robust feature vector is the process of selecting an optimal subset of features from a large number of available features in a given problem domain, after feature extraction and before any classification scheme. Feature selection reduces the feature space, which improves the performance of the classifier and decreases the computational burden imposed by using many features. Selecting an optimal subset of features from a large number of available features in a given problem domain is a difficult search problem: for n features, the total number of possible feature subsets is 2^n. Thus, the selection of an optimal subset of features belongs to the category of NP-hard problems. In this paper, an attempt is made to find the optimal subset of MCCs features from all possible subsets of features using a genetic algorithm (GA), particle swarm optimization (PSO) and biogeography-based optimization (BBO). For simulation, a total of 380 benign and malignant MCCs samples have been selected from mammogram images of the DDSM database. A total of 50 features extracted from benign and malignant MCCs samples are used in this study. In these algorithms, the fitness function is the correct classification rate of the classifier. A support vector machine is used as the classifier. From the experimental results, it is observed that the performance of the PSO-based and BBO-based algorithms in selecting an optimal subset of features for classifying MCCs as benign or malignant is better than that of the GA-based algorithm.
AVC: Selecting discriminative features on basis of AUC by maximizing variable complementarity.
Sun, Lei; Wang, Jun; Wei, Jinmao
2017-03-14
The Receiver Operating Characteristic (ROC) curve is well known for evaluating classification performance in the biomedical field. Owing to its superiority in dealing with imbalanced and cost-sensitive data, the ROC curve has been exploited as a popular metric to evaluate and find disease-related genes (features). The existing ROC-based feature selection approaches are simple and effective in evaluating individual features. However, these approaches may fail to find the real target feature subset due to their lack of an effective means to reduce the redundancy between features, which is essential in machine learning. In this paper, we propose to assess feature complementarity by measuring the distances between the misclassified instances and their nearest misses on the dimensions of pairwise features. If a misclassified instance and its nearest miss on one feature dimension are far apart on another feature dimension, the two features are regarded as complementary to each other. Subsequently, we propose a novel filter feature selection approach on the basis of the ROC analysis. The new approach employs an efficient heuristic search strategy to select optimal features with the highest complementarities. The experimental results on a broad range of microarray data sets validate that the classifiers built on the feature subset selected by our approach can achieve the minimal balanced error rate with a small number of significant features. Compared with other ROC-based feature selection approaches, our new approach can select fewer features and effectively improve the classification performance.
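The single-feature ROC baseline that such filters start from can be sketched as below; the nearest-miss complementarity measure that constitutes the paper's contribution is not reproduced here.

```python
# Rank features by the AUC each achieves alone, orientation-free.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, n_features=50, n_informative=6,
                           random_state=5)
aucs = np.array([max(roc_auc_score(y, X[:, j]), roc_auc_score(y, -X[:, j]))
                 for j in range(X.shape[1])])
top10 = np.argsort(aucs)[-10:]
print("top features by individual AUC:", sorted(top10))
```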
Atlas ranking and selection for automatic segmentation of the esophagus from CT scans
NASA Astrophysics Data System (ADS)
Yang, Jinzhong; Haas, Benjamin; Fang, Raymond; Beadle, Beth M.; Garden, Adam S.; Liao, Zhongxing; Zhang, Lifei; Balter, Peter; Court, Laurence
2017-12-01
In radiation treatment planning, the esophagus is an important organ-at-risk that should be spared in patients with head and neck cancer or thoracic cancer who undergo intensity-modulated radiation therapy. However, automatic segmentation of the esophagus from CT scans is extremely challenging because of the structure’s inconsistent intensity, low contrast against the surrounding tissues, complex and variable shape and location, and random air bubbles. The goal of this study is to develop an online atlas selection approach to choose a subset of optimal atlases for multi-atlas segmentation to delineate the esophagus automatically. We performed atlas selection in two phases. In the first phase, we used the correlation coefficient of the image content in a cubic region between each atlas and the new image to evaluate their similarity and to rank the atlases in an atlas pool. A subset of atlases based on this ranking was selected, and deformable image registration was performed to generate deformed contours and deformed images in the new image space. In the second phase of atlas selection, we used Kullback-Leibler divergence to measure the similarity of local-intensity histograms between the new image and each of the deformed images, and the measurements were used to rank the previously selected atlases. Deformed contours were overlapped sequentially, from the most to the least similar, and the overlap ratio was examined. We further identified a subset of optimal atlases by analyzing the variation of the overlap ratio versus the number of atlases. The deformed contours from these optimal atlases were fused together using a modified simultaneous truth and performance level estimation algorithm to produce the final segmentation. The approach was validated with promising results using both internal data sets (21 head and neck cancer patients and 15 thoracic cancer patients) and external data sets (30 thoracic patients).
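The second-phase similarity measure can be sketched with SciPy, whose entropy function returns the Kullback-Leibler divergence when given two distributions; the random arrays are placeholders for the CT subvolumes, and the binning is an assumption.

```python
# KL divergence between intensity histograms of a new image and a
# deformed atlas; smaller divergence means a more similar atlas.
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(6)
new_img = rng.normal(40, 12, size=(32, 32, 32))        # placeholder intensities
deformed_atlas = rng.normal(45, 14, size=(32, 32, 32))

bins = np.linspace(-100, 200, 64)
p, _ = np.histogram(new_img, bins=bins, density=True)
q, _ = np.histogram(deformed_atlas, bins=bins, density=True)
eps = 1e-10                                            # avoid zero bins
kl = entropy(p + eps, q + eps)                         # KL(p || q)
print(f"KL divergence = {kl:.3f}")
```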
On Detecting Influential Data and Selecting Regression Variables
1989-10-01
...subset of the data. The empirical influence function for β̂ computed from a subset A of the data, IF_A, is defined to be IF_A = β̂_A − β̂. For a given positive definite matrix M and a nonzero... Cook and Weisberg (1980) treated the measurement of influence on the fitted values Xβ̂, using the empirical influence function for... [References: Cook, R. D. and Weisberg, S. (1980). Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22, 495-508; Gray, J. B. and Ling, R. F. ...]
Reducing seed dependent variability of non-uniformly sampled multidimensional NMR data
NASA Astrophysics Data System (ADS)
Mobli, Mehdi
2015-07-01
The application of NMR spectroscopy to study the structure, dynamics and function of macromolecules requires the acquisition of several multidimensional spectra. The one-dimensional NMR time-response from the spectrometer is extended to additional dimensions by introducing incremented delays in the experiment that cause oscillation of the signal along "indirect" dimensions. For a given dimension the delay is incremented at twice the rate of the maximum frequency (Nyquist rate). To achieve high-resolution requires acquisition of long data records sampled at the Nyquist rate. This is typically a prohibitive step due to time constraints, resulting in sub-optimal data records to the detriment of subsequent analyses. The multidimensional NMR spectrum itself is typically sparse, and it has been shown that in such cases it is possible to use non-Fourier methods to reconstruct a high-resolution multidimensional spectrum from a random subset of non-uniformly sampled (NUS) data. For a given acquisition time, NUS has the potential to improve the sensitivity and resolution of a multidimensional spectrum, compared to traditional uniform sampling. The improvements in sensitivity and/or resolution achieved by NUS are heavily dependent on the distribution of points in the random subset acquired. Typically, random points are selected from a probability density function (PDF) weighted according to the NMR signal envelope. In extreme cases as little as 1% of the data is subsampled. The heavy under-sampling can result in poor reproducibility, i.e. when two experiments are carried out where the same number of random samples is selected from the same PDF but using different random seeds. Here, a jittered sampling approach is introduced that is shown to improve random seed dependent reproducibility of multidimensional spectra generated from NUS data, compared to commonly applied NUS methods. It is shown that this is achieved due to the low variability of the inherent sensitivity of the random subset chosen from a given PDF. Finally, it is demonstrated that metrics used to find optimal NUS distributions are heavily dependent on the inherent sensitivity of the random subset, and such optimisation is therefore less critical when using the proposed sampling scheme.
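A sketch of the jittered selection idea under simple assumptions: an exponential envelope PDF over the indirect-dimension grid, with one sample drawn per equal-probability stratum through the inverse CDF. The paper's exact jittering scheme may differ in detail.

```python
# Jittered non-uniform sampling schedule: stratify the unit interval,
# jitter within each stratum, map through the inverse CDF of the PDF.
import numpy as np

rng = np.random.default_rng(7)
n_grid, n_samples = 256, 32                 # indirect-dimension grid and budget
t = np.arange(n_grid)
pdf = np.exp(-t / 80.0)                     # matched to a decaying signal envelope
cdf = np.cumsum(pdf) / pdf.sum()

u = (np.arange(n_samples) + rng.random(n_samples)) / n_samples  # jittered strata
schedule = np.unique(np.searchsorted(cdf, u))                   # inverse-CDF map
print(f"{schedule.size} unique increments:", schedule)
```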
Liu, Hai Lun; Garzoni, Luca; Herry, Christophe; Durosier, Lucien Daniel; Cao, Mingju; Burns, Patrick; Fecteau, Gilles; Desrochers, André; Patey, Natalie; Seely, Andrew J E; Faure, Christophe; Frasch, Martin G
2016-04-01
Necrotizing enterocolitis of the neonate is an acute inflammatory intestinal disease that can cause necrosis and sepsis. Chorioamnionitis is a risk factor of necrotizing enterocolitis. The gut represents the biggest vagus-innervated organ. Vagal activity can be measured via fetal heart rate variability. We hypothesized that fetal heart rate variability can detect fetuses with incipient gut inflammation. Prospective animal study. University research laboratory. Chronically instrumented near-term fetal sheep (n = 21). Animals were surgically instrumented with vascular catheters and electrocardiogram to allow manipulation and recording from nonanesthetized animals. In 14 fetal sheep, inflammation was induced with lipopolysaccharide (IV) to mimic chorioamnionitis. Fetal arterial blood samples were drawn at selected time points over 54 hours post lipopolysaccharide for blood gas and cytokines (interleukin-6 and tumor necrosis factor-α enzyme-linked immunosorbent assays). Fetal heart rate variability was quantified throughout the experiment. The time-matched fetal heart rate variability measures were correlated to the levels of interleukin-6 and tumor necrosis factor-α. Upon necropsy, ionized calcium binding adaptor molecule 1+ (Iba1+), CD11c+ (M1), CD206+ (M2 macrophages), and occludin (leakiness marker) immunofluorescence in the terminal ileum was quantified along with regional Iba1+ signal in the brain (microglia). Interleukin-6 peaked at 3 hours post lipopolysaccharide accompanied by mild cardiovascular signs of sepsis. At 54 hours, we identified an increase in Iba1+ and, specifically, M1 macrophages in the ileum accompanied by increased leakiness, with no change in Iba1 signal in the brain. Preceding this change on the tissue level, at 24 hours, a subset of nine fetal heart rate variability measures correlated exclusively to the Iba1+ markers of ileal, but not brain, inflammation. An additional fetal heart rate variability measure, the mean of the differences of R-R intervals, correlated uniquely to M1 ileum macrophages increasing due to lipopolysaccharide. We identified a unique subset of fetal heart rate variability measures reflecting, 1.5 days ahead of time, the levels of macrophage activation and increased leakiness in the terminal ileum. We propose that this subset of fetal heart rate variability measures reflects brain-gut communication via the vagus nerve. Detecting such a noninvasively obtainable, organ-specific fetal heart rate variability signature of inflammation would alert neonatologists to neonates at risk of developing necrotizing enterocolitis and sepsis. Clinical validation studies are required.
Variable Selection for Road Segmentation in Aerial Images
NASA Astrophysics Data System (ADS)
Warnke, S.; Bulatov, D.
2017-05-01
For the extraction of road pixels from combined image and elevation data, Wegner et al. (2015) proposed classification of superpixels into road and non-road, after which the classification results were refined using minimum-cost paths and non-local optimization methods. We believed that the variable set used for classification was to a certain extent suboptimal, because many variables were redundant while several features known to be useful in Photogrammetry and Remote Sensing were missing. This motivated us to implement a variable selection approach which builds a model for classification using portions of the training data and subsets of features, evaluates this model, updates the feature set, and terminates when a stopping criterion is satisfied. The choice of classifier is flexible; however, we tested the approach with Logistic Regression and Random Forests, and tailored the evaluation module to the chosen classifier. To guarantee a fair comparison, we kept the segment-based approach and most of the variables from the related work, but extended them with additional, mostly higher-level features. Applying these superior features, removing the redundant ones, and using more accurately acquired 3D data allowed us to keep the misclassification error stable, or even to reduce it, on a challenging dataset.
Oliveira, Roberta B; Pereira, Aledir S; Tavares, João Manuel R S
2017-10-01
The number of deaths worldwide due to melanoma has risen in recent times, in part because melanoma is the most aggressive type of skin cancer. Computational systems have been developed to assist dermatologists in the early diagnosis of skin cancer, or even to monitor skin lesions. However, improving classifiers for the diagnosis of such skin lesions remains a challenge. The main objective of this article is to evaluate different ensemble classification models based on input feature manipulation to diagnose skin lesions. The input feature manipulation processes are based on feature subset selection from shape properties, colour variation and texture analysis to generate diversity for the ensemble models. Three subset selection models are presented here: (1) a subset selection model based on specific feature groups, (2) a correlation-based subset selection model, and (3) a subset selection model based on feature selection algorithms. Each ensemble classification model is generated using an optimum-path forest classifier and integrated with a majority voting strategy. The proposed models were applied to a set of 1104 dermoscopic images using a cross-validation procedure. The best results were obtained by the first ensemble classification model, which generates a feature subset ensemble based on specific feature groups. The skin lesion diagnosis computational system achieved 94.3% accuracy, 91.8% sensitivity and 96.7% specificity. The input feature manipulation process based on specific feature subsets generated the greatest diversity for the ensemble classification model, with very promising results. Copyright © 2017 Elsevier B.V. All rights reserved.
Automatic identification of variables in epidemiological datasets using logic regression.
Lorenz, Matthias W; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L; Kiechl, Stefan; Orth, Andreas
2017-04-13
For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software which, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was sought for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combination rules were validated. In the construction sample, the 41 target variables were allocated with an average positive predictive value (PPV) of 34% and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less for 63% of all variables; in the validation sample, for 71% of all variables. We demonstrated that the application of logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
Diagnosis of Chronic Kidney Disease Based on Support Vector Machine by Feature Selection Methods.
Polat, Huseyin; Danaei Mehr, Homay; Cetin, Aydin
2017-04-01
Because Chronic Kidney Disease progresses slowly, early detection and effective treatment are the only means of reducing the mortality rate. Machine learning techniques are gaining significance in medical diagnosis because of their classification ability with high accuracy rates. The accuracy of classification algorithms depends on the use of correct feature selection algorithms to reduce the dimension of datasets. In this study, the Support Vector Machine classification algorithm was used to diagnose Chronic Kidney Disease. To diagnose the disease, two essential types of feature selection methods, namely wrapper and filter approaches, were chosen to reduce the dimension of the Chronic Kidney Disease dataset. In the wrapper approach, the classifier subset evaluator with the greedy stepwise search engine and the wrapper subset evaluator with the Best First search engine were used. In the filter approach, the correlation feature selection subset evaluator with the greedy stepwise search engine and the filtered subset evaluator with the Best First search engine were used. The results showed that the Support Vector Machine classifier using the filtered subset evaluator with the Best First search engine has a higher accuracy rate (98.5%) in the diagnosis of Chronic Kidney Disease than the other selected methods.
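The filter route can be sketched with scikit-learn, with SelectKBest standing in for the correlation-based and filtered subset evaluators named above (which are components of other toolkits); the synthetic data replace the CKD dataset.

```python
# Univariate filter followed by an RBF support vector machine.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=24, n_informative=8,
                           random_state=8)
model = make_pipeline(SelectKBest(f_classif, k=10), SVC(kernel="rbf"))
acc = cross_val_score(model, X, y, cv=10).mean()
print(f"10-fold accuracy with filtered features: {acc:.3f}")
```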
An evaluation of exact methods for the multiple subset maximum cardinality selection problem.
Brusco, Michael J; Köhn, Hans-Friedrich; Steinley, Douglas
2016-05-01
The maximum cardinality subset selection problem requires finding the largest possible subset from a set of objects, such that one or more conditions are satisfied. An important extension of this problem is to extract multiple subsets, where the addition of one more object to a larger subset would always be preferred to increases in the size of one or more smaller subsets. We refer to this as the multiple subset maximum cardinality selection problem (MSMCSP). A recently published branch-and-bound algorithm solves the MSMCSP as a partitioning problem. Unfortunately, the computational requirement associated with the algorithm is often enormous, thus rendering the method infeasible from a practical standpoint. In this paper, we present an alternative approach that successively solves a series of binary integer linear programs to obtain a globally optimal solution to the MSMCSP. Computational comparisons of the methods using published similarity data for 45 food items reveal that the proposed sequential method is computationally far more efficient than the branch-and-bound approach. © 2016 The British Psychological Society.
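A single-subset reading of the problem can be written as a binary integer linear program, sketched here with the PuLP modeller: select as many items as possible while no selected pair exceeds a similarity threshold. The similarity matrix and threshold are invented; the paper's sequential multi-subset procedure would solve a series of programs of this flavour.

```python
# Maximum cardinality subset as a binary ILP with pairwise constraints.
import numpy as np
import pulp

rng = np.random.default_rng(9)
n = 20
sim = rng.random((n, n))
sim = (sim + sim.T) / 2                      # symmetric pairwise similarities
threshold = 0.8                              # condition: no very similar pair

prob = pulp.LpProblem("max_cardinality_subset", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]
prob += pulp.lpSum(x)                        # objective: subset size
for i in range(n):
    for j in range(i + 1, n):
        if sim[i, j] > threshold:
            prob += x[i] + x[j] <= 1         # forbid incompatible pairs
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("selected items:", [i for i in range(n) if x[i].value() == 1])
```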
Machine learning search for variable stars
NASA Astrophysics Data System (ADS)
Pashchenko, Ilya N.; Sokolovsky, Kirill V.; Gavras, Panagiotis
2018-04-01
Photometric variability detection is often considered as a hypothesis testing problem: an object is variable if the null hypothesis that its brightness is constant can be ruled out given the measurements and their uncertainties. The practical applicability of this approach is limited by uncorrected systematic errors. We propose a new variability detection technique sensitive to a wide range of variability types while being robust to outliers and underestimated measurement uncertainties. We consider variability detection as a classification problem that can be approached with machine learning. Logistic Regression (LR), Support Vector Machines (SVM), k Nearest Neighbours (kNN), Neural Nets (NN), Random Forests (RF), and Stochastic Gradient Boosting classifier (SGB) are applied to 18 features (variability indices) quantifying scatter and/or correlation between points in a light curve. We use a subset of Optical Gravitational Lensing Experiment phase two (OGLE-II) Large Magellanic Cloud (LMC) photometry (30 265 light curves) that was searched for variability using traditional methods (168 known variable objects) as the training set and then apply the NN to a new test set of 31 798 OGLE-II LMC light curves. Among 205 candidates selected in the test set, 178 are real variables, while 13 low-amplitude variables are new discoveries. The machine learning classifiers considered are found to be more efficient (select more variables and fewer false candidates) compared to traditional techniques using individual variability indices or their linear combination. The NN, SGB, SVM, and RF show a higher efficiency compared to LR and kNN.
Stochastic subset selection for learning with kernel machines.
Rhinelander, Jason; Liu, Xiaoping P
2012-06-01
Kernel machines have gained much popularity in applications of machine learning. Support vector machines (SVMs) are a subset of kernel machines and generalize well for classification, regression, and anomaly detection tasks. The training procedure for traditional SVMs involves solving a quadratic programming (QP) problem. The QP problem scales superlinearly in computational effort with the number of training samples and is often used for the offline batch processing of data. Kernel machines operate by retaining a subset of observed data during training. The data vectors contained within this subset are referred to as support vectors (SVs). The work presented in this paper introduces a subset selection method for the use of kernel machines in online, changing environments. Our algorithm works by using a stochastic indexing technique to select a subset of SVs when computing the kernel expansion. The work described here is novel because it separates the selection of kernel basis functions from the training algorithm used. The subset selection algorithm presented here can be used in conjunction with any online training technique. It is important for online kernel machines to be computationally efficient due to the real-time requirements of online environments. Our algorithm is an important contribution because it scales linearly with the number of training samples and is compatible with current training techniques. Our algorithm outperforms standard techniques in terms of computational efficiency and provides increased recognition accuracy in our experiments. We provide results from experiments using both simulated and real-world data sets to verify our algorithm.
Articular cartilage degeneration classification by means of high-frequency ultrasound.
Männicke, N; Schöne, M; Oelze, M; Raum, K
2014-10-01
To date, only single ultrasound parameters have been considered in statistical analyses to characterize osteoarthritic changes in articular cartilage, and the potential benefit of using parameter combinations for characterization remains unclear. Therefore, the aim of this work was to utilize feature selection and classification of a Mankin subset score (i.e., cartilage surface and cell sub-scores) using ultrasound-based parameter pairs and to investigate both classification accuracy and sensitivity towards different degeneration stages. 40 punch biopsies of human cartilage had previously been scanned ex vivo with a 40-MHz transducer. Ultrasound-based surface parameters, as well as backscatter and envelope statistics parameters, were available. Logistic regression was performed with each unique US parameter pair as predictor and different degeneration stages as response variables. The best ultrasound-based parameter pair for each Mankin subset score value was assessed by highest classification accuracy and utilized in receiver operating characteristic (ROC) analysis. The classifications discriminating between early degenerations yielded area under the ROC curve (AUC) values of 0.94-0.99 (mean ± SD: 0.97 ± 0.03). In contrast, classifications among higher Mankin subset scores resulted in lower AUC values: 0.75-0.91 (mean ± SD: 0.84 ± 0.08). Variable sensitivities of the different ultrasound features were observed with respect to different degeneration stages. Our results strongly suggest that combinations of high-frequency ultrasound-based parameters have the potential to characterize different, particularly very early, degeneration stages of hyaline cartilage. The variable sensitivities towards different degeneration stages suggest that concurrent estimation of multiple ultrasound-based parameters is diagnostically valuable. In-vivo application of the present findings is conceivable in both minimally invasive arthroscopic ultrasound and high-frequency transcutaneous ultrasound. Copyright © 2014 Osteoarthritis Research Society International. Published by Elsevier Ltd. All rights reserved.
Schrom, Edward C; Graham, Andrea L
2017-12-01
Over recent years, extensive phenotypic variability and plasticity have been revealed among the T-helper cells of the mammalian adaptive immune system, even within clonal lineages of identical antigen specificity. This challenges the conventional view that T-helper cells assort into functionally distinct subsets following differential instruction by the innate immune system. We argue that the adaptive value of coping with uncertainty can reconcile the 'instructed subset' framework with T-helper variability and plasticity. However, we also suggest that T-helper cells might better be understood as agile swarms engaged in collective decision-making to promote host fitness. With rigorous testing, the 'agile swarms' framework may illuminate how variable and plastic individual T-helper cells interact to create coherent immunity. Copyright © 2017 Elsevier Ltd. All rights reserved.
Negotiating Multicollinearity with Spike-and-Slab Priors.
Ročková, Veronika; George, Edward I
2014-08-01
In multiple regression under the normal linear model, the presence of multicollinearity is well known to lead to unreliable and unstable maximum likelihood estimates. This can be particularly troublesome for the problem of variable selection where it becomes more difficult to distinguish between subset models. Here we show how adding a spike-and-slab prior mitigates this difficulty by filtering the likelihood surface into a posterior distribution that allocates the relevant likelihood information to each of the subset model modes. For identification of promising high posterior models in this setting, we consider three EM algorithms, the fast closed form EMVS version of Rockova and George (2014) and two new versions designed for variants of the spike-and-slab formulation. For a multimodal posterior under multicollinearity, we compare the regions of convergence of these three algorithms. Deterministic annealing versions of the EMVS algorithm are seen to substantially mitigate this multimodality. A single simple running example is used for illustration throughout.
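For reference, the continuous spike-and-slab formulation that EMVS operates on, in the notation of Ročková and George (2014), with a small spike variance v_0 and a large slab variance v_1:

```latex
% Continuous spike-and-slab prior underlying EMVS (Rockova & George, 2014):
% v_0 > 0 small (spike), v_1 >> v_0 (slab), theta the prior inclusion rate.
\beta_j \mid \gamma_j, \sigma^2 \;\sim\;
  (1-\gamma_j)\,\mathcal{N}(0,\sigma^2 v_0) \;+\; \gamma_j\,\mathcal{N}(0,\sigma^2 v_1),
\qquad
\gamma_j \mid \theta \;\sim\; \mathrm{Bernoulli}(\theta).
```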
Shamshirband, Shahaboddin; Banjanovic-Mehmedovic, Lejla; Bosankic, Ivan; Kasapovic, Suad; Abdul Wahab, Ainuddin Wahid Bin
2016-01-01
Intelligent Transportation Systems rely on understanding, predicting and affecting the interactions between vehicles. The goal of this paper is to choose a small subset from a larger set of variables so that the resulting regression model is simple, yet has good predictive ability for the speed of the vehicle agent relative to the vehicle intruder. The method of ANFIS (adaptive neuro-fuzzy inference system) was applied to the data resulting from these measurements. The ANFIS process for variable selection was implemented in order to detect the predominant variables affecting the prediction of agent speed relative to the intruder. This process includes several ways to discover a subset of the total set of recorded parameters showing good predictive capability. The ANFIS network was used to perform a variable search. It was then used to determine how nine parameters (intruder front sensors active (Boolean), intruder rear sensors active (Boolean), agent front sensors active (Boolean), agent rear sensors active (Boolean), RSSI signal intensity/strength (integer), elapsed time (seconds), distance between agent and intruder (m), angle of agent relative to intruder (°), and altitude difference between agent and intruder (m)) influence the prediction of agent speed relative to the intruder. The results indicated that the distance between the vehicle agent and the vehicle intruder (m) and the angle of the vehicle agent relative to the vehicle intruder (°) are the most influential parameters for the vehicle agent speed relative to the vehicle intruder.
NASA Technical Reports Server (NTRS)
Ruane, Alex C.; Mcdermid, Sonali P.
2017-01-01
We present the Representative Temperature and Precipitation (T&P) GCM Subsetting Approach developed within the Agricultural Model Intercomparison and Improvement Project (AgMIP) to select a practical subset of global climate models (GCMs) for regional integrated assessment of climate impacts when resource limitations do not permit the full ensemble of GCMs to be evaluated given the need to also focus on impacts sector and economics models. Subsetting inherently leads to a loss of information but can free up resources to explore important uncertainties in the integrated assessment that would otherwise be prohibitive. The Representative T&P GCM Subsetting Approach identifies five individual GCMs that capture a profile of the full ensemble of temperature and precipitation change within the growing season while maintaining information about the probability that basic classes of climate changes (relatively cool/wet, cool/dry, middle, hot/wet, and hot/dry) are projected in the full GCM ensemble. We demonstrate the selection methodology for maize impacts in Ames, Iowa, and discuss limitations and situations when additional information may be required to select representative GCMs. We then classify 29 GCMs over all land areas to identify regions and seasons with characteristic diagonal skewness related to surface moisture as well as extreme skewness connected to snow-albedo feedbacks and GCM uncertainty. Finally, we employ this basic approach to recognize that GCM projections demonstrate coherence across space, time, and greenhouse gas concentration pathway. The Representative T&P GCM Subsetting Approach provides a quantitative basis for the determination of useful GCM subsets, provides a practical and coherent approach where previous assessments selected solely on availability of scenarios, and may be extended for application to a range of scales and sectoral impacts.
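The quadrant-style selection can be sketched as follows; the percentile anchors, the z-score standardization, and the synthetic temperature and precipitation changes are illustrative assumptions rather than AgMIP's published recipe.

```python
# Pick one GCM nearest each of five class anchors in the (dT, dP) plane:
# cool/wet, cool/dry, middle, hot/wet, hot/dry.
import numpy as np

rng = np.random.default_rng(10)
dT = rng.normal(2.0, 0.8, 29)               # growing-season warming per GCM (deg C)
dP = rng.normal(0.0, 8.0, 29)               # growing-season precipitation change (%)

def zscore(v):
    return (v - v.mean()) / v.std()

zT, zP = zscore(dT), zscore(dP)
loT, midT, hiT = np.percentile(zT, [25, 50, 75])
loP, midP, hiP = np.percentile(zP, [25, 50, 75])
anchors = {"cool/wet": (loT, hiP), "cool/dry": (loT, loP), "middle": (midT, midP),
           "hot/wet": (hiT, hiP), "hot/dry": (hiT, loP)}

for label, (aT, aP) in anchors.items():
    pick = int(np.argmin((zT - aT) ** 2 + (zP - aP) ** 2))  # nearest GCM
    print(f"{label:9s} -> GCM index {pick}")
```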
NASA Astrophysics Data System (ADS)
Hassanzadeh, S.; Hosseinibalam, F.; Omidvari, M.
2008-04-01
Data on seven meteorological variables (relative humidity, wet temperature, dry temperature, maximum temperature, minimum temperature, ground temperature and sun radiation time) and ozone values have been used for statistical analysis. Meteorological variables and ozone values were analyzed using both multiple linear regression and principal component methods. Data for the period 1999-2004 were analyzed jointly using both methods. For all periods, the temperature-dependent variables were highly correlated with each other, but all were negatively correlated with relative humidity. Multiple regression analysis was used to fit the ozone values using the meteorological variables as predictors. A variable selection method based on high loadings of varimax-rotated principal components was used to obtain subsets of the predictor variables to be included in the linear regression model of the ozone data. In 1999, 2001 and 2002, ozone was weakly but predominantly influenced by one of the meteorological variables. However, the model did not indicate that ozone for the year 2000 was predominantly influenced by the meteorological variables, which points to variation in sun radiation. This could be due to other factors that were not explicitly considered in this study.
NASA Technical Reports Server (NTRS)
Rabitz, Herschel
1987-01-01
The use of parametric and functional gradient sensitivity analysis techniques is considered for models described by partial differential equations. By interchanging appropriate dependent and independent variables, questions of inverse sensitivity may be addressed to gain insight into the inversion of observational data for parameter and function identification in mathematical models. It may be argued that the presence of a subset of dominant, strongly coupled dependent variables will result in the overall system sensitivity behavior collapsing into a simple set of scaling and self-similarity relations amongst elements of the entire matrix of sensitivity coefficients. These general tools are generic in nature, but herein their application to problems arising in selected areas of physics and chemistry is presented.
Using learning automata to determine proper subset size in high-dimensional spaces
NASA Astrophysics Data System (ADS)
Seyyedi, Seyyed Hossein; Minaei-Bidgoli, Behrouz
2017-03-01
In this paper, we offer a new method called FSLA (Finding the best candidate Subset using Learning Automata), which combines the filter and wrapper approaches for feature selection in high-dimensional spaces. Considering the difficulties of dimension reduction in high-dimensional spaces, FSLA's multi-objective goal is to determine, in an efficient manner, a feature subset that leads to an appropriate tradeoff between the learning algorithm's accuracy and its efficiency. First, using an existing weighting function, the feature list is sorted and subsets of the list of different sizes are considered. Then, a learning automaton verifies the performance of each subset when it is used as the input space of the learning algorithm and estimates its fitness based on the algorithm's accuracy and the subset size, which determines the algorithm's efficiency. Finally, FSLA introduces the fittest subset as the best choice. We tested FSLA in the framework of text classification. The results confirm its promising performance in attaining the identified goal.
Two-stage atlas subset selection in multi-atlas based image segmentation.
Zhao, Tingting; Ruan, Dan
2015-06-01
Fast-growing access to large databases and cloud-stored data presents a unique opportunity for multi-atlas based image segmentation, but also presents challenges in heterogeneous atlas quality and computation burden. This work aims to develop a novel two-stage method tailored to the special needs that arise in the face of a large atlas collection of varied quality, so that high-accuracy segmentation can be achieved at low computational cost. An atlas subset selection scheme is proposed to substitute a significant portion of the computationally expensive full-fledged registration in the conventional scheme with a low-cost alternative. More specifically, the authors introduce a two-stage atlas subset selection method. In the first stage, an augmented subset is obtained based on a low-cost registration configuration and a preliminary relevance metric; in the second stage, the subset is further narrowed down to a fusion set of the desired size, based on full-fledged registration and a refined relevance metric. An inference model is developed to characterize the relationship between the preliminary and refined relevance metrics, and a proper augmented subset size is derived to ensure that the desired atlases survive the preliminary selection with high probability. The performance of the proposed scheme has been assessed with cross-validation based on two clinical datasets consisting of manually segmented prostate and brain magnetic resonance images, respectively. The proposed scheme demonstrates end-to-end segmentation performance comparable to the conventional single-stage selection method, but with a significant reduction in computation. Compared with the alternative computation reduction method, their scheme improves the mean and median Dice similarity coefficient values from (0.74, 0.78) to (0.83, 0.85) and from (0.82, 0.84) to (0.95, 0.95) for prostate and corpus callosum segmentation, respectively, with statistical significance. The authors have developed a novel two-stage atlas subset selection scheme for multi-atlas based segmentation. It achieves good segmentation accuracy with significantly reduced computation cost, making it a suitable configuration in the presence of extensive heterogeneous atlases.
Bayesian block-diagonal variable selection and model averaging
Papaspiliopoulos, O.; Rossell, D.
2018-01-01
We propose a scalable algorithmic framework for exact Bayesian variable selection and model averaging in linear models under the assumption that the Gram matrix is block-diagonal, and as a heuristic for exploring the model space for general designs. In block-diagonal designs our approach returns the most probable model of any given size without resorting to numerical integration. The algorithm also provides a novel and efficient solution to the frequentist best subset selection problem for block-diagonal designs. Posterior probabilities for any number of models are obtained by evaluating a single one-dimensional integral, and other quantities of interest such as variable inclusion probabilities and model-averaged regression estimates are obtained by an adaptive, deterministic one-dimensional numerical integration. The overall computational cost scales linearly with the number of blocks, which can be processed in parallel, and exponentially with the block size, rendering it most adequate in situations where predictors are organized in many moderately-sized blocks. For general designs, we approximate the Gram matrix by a block-diagonal matrix using spectral clustering and propose an iterative algorithm that capitalizes on the block-diagonal algorithms to explore efficiently the model space. All methods proposed in this paper are implemented in the R library mombf. PMID:29861501
Adaptive feature selection using v-shaped binary particle swarm optimization.
Teng, Xuyang; Dong, Hongbin; Zhou, Xiurong
2017-01-01
Feature selection is an important preprocessing method in machine learning and data mining. This process can be used not only to reduce the amount of data to be analyzed but also to build models with stronger interpretability based on fewer features. Traditional feature selection methods evaluate the dependency and redundancy of features separately, which leads to a lack of measurement of their combined effect. Moreover, a greedy search considers only the optimization of the current round and thus cannot be a global search. To evaluate the combined effect of different subsets in the entire feature space, an adaptive feature selection method based on V-shaped binary particle swarm optimization is proposed. In this method, the fitness function is constructed using the correlation information entropy. Feature subsets are regarded as individuals in a population, and the feature space is searched using V-shaped binary particle swarm optimization. The above procedure overcomes the hard constraint on the number of features, enables the combined evaluation of each subset as a whole, and improves the search ability of conventional binary particle swarm optimization. The proposed algorithm is an adaptive method with respect to the number of feature subsets. The experimental results show the advantages of optimizing the feature subsets using the V-shaped transfer function and confirm the effectiveness and efficiency of the feature subsets obtained under different classifiers.
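A minimal V-shaped binary PSO for feature selection, assuming |tanh(v)| as the transfer function and a plain accuracy-minus-size fitness with a k-nearest-neighbour wrapper; the paper's correlation-information-entropy fitness is not reproduced.

```python
# V-shaped binary PSO: velocities stay real-valued, the transfer function
# |tanh(v)| gives the probability of flipping each bit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(11)
X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           random_state=11)
n_particles, n_feat, n_iter = 20, X.shape[1], 30

def fitness(mask):
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=5).mean()
    return acc - 0.02 * mask.mean()          # small penalty on subset size

pos = rng.random((n_particles, n_feat)) < 0.5       # binary positions
vel = rng.normal(0.0, 0.1, (n_particles, n_feat))   # real-valued velocities
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((2, n_particles, n_feat))
    vel = (0.7 * vel
           + 1.5 * r1 * (pbest.astype(int) - pos.astype(int))
           + 1.5 * r2 * (gbest.astype(int) - pos.astype(int)))
    flip = rng.random((n_particles, n_feat)) < np.abs(np.tanh(vel))  # V-shaped
    pos = pos ^ flip                                  # flip selected bits
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest))
```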
Fizzy: feature subset selection for metagenomics.
Ditzler, Gregory; Morrison, J Calvin; Lan, Yemin; Rosen, Gail L
2015-11-04
Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- & β-diversity. Feature subset selection--a sub-field of machine learning--can also provide a unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome. We have developed Fizzy, a new Python command line tool for microbial ecologists that implements information-theoretic subset selection methods for biological data formats and is compatible with the widely adopted BIOM format. We demonstrate the software tool's capabilities on publicly available datasets. We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.
Temporal change in biological community structure in the Fountain Creek basin, Colorado, 2001-2008
Zuellig, Robert E.; Bruce, James F.; Stogner, Sr., Robert W.
2010-01-01
In 2001, the U.S. Geological Survey, in cooperation with Colorado Springs City Engineering, began a study to better understand the relations between environmental characteristics and biological communities in the Fountain Creek basin in order to aid water-resource management and guide future monitoring activities. To accomplish this task, environmental (streamflow, habitat, and water chemistry) and biological (fish and macroinvertebrate) data were collected annually at 24 sites over a 6- or 8-year period (fish, 2003 to 2008; macroinvertebrates, 2001 to 2008). For this report, these data were first analyzed to determine the presence of temporal change in macroinvertebrate and fish community structure among years using nonparametric multivariate statistics. Where temporal change in the biological communities was found, these data were further analyzed using additional nonparametric multivariate techniques to determine which subset of selected streamflow, habitat, or water-chemistry variables best described site-specific changes in community structure relative to a gradient of urbanization. This study identified significant directional patterns of temporal change in macroinvertebrate and fish community structure at 15 of 24 sites in the Fountain Creek basin. At four of these sites, changes in environmental variables were significantly correlated with the concurrent temporal change identified in macroinvertebrate and fish community structure (Monument Creek above Woodmen Road at Colorado Springs, Colo.; Monument Creek at Bijou Street at Colorado Springs, Colo.; Bear Creek near Colorado Springs, Colo.; Fountain Creek at Security, Colo.). Combinations of environmental variables describing directional temporal change in the biota appeared to be site specific, as no single variable dominated the results; however, substrate composition variables (percent substrate composition composed of sand, gravel, or cobble) collectively were present in 80 percent of the environmental variable subsets that were significantly correlated with temporal change in the macroinvertebrate and fish community structure. Other important environmental variables related to temporal change in the biological community structure included those describing channel form (streambank height) and streamflow (normalized annual mean daily flow, high flood-pulse count). Site-specific results from this study were derived from a relatively small number of observations (6 or 8 years of data); therefore, additional years of data may reveal other sites with temporal change in biological community structure, or could define stronger and more consistent linkages between environmental variables and observed temporal change. Likewise, current variable subsets could become weaker. Nonetheless, there were several sites where temporal change was detected in this study that could not be explained by the available environmental variables studied herein. Modification of current data-collection activities may be necessary to better understand site-specific temporal relations between biological communities and environmental variables.
Alvarez, George A; Gill, Jonathan; Cavanagh, Patrick
2012-01-01
Previous studies have shown independent attentional selection of targets in the left and right visual hemifields during attentional tracking (Alvarez & Cavanagh, 2005) but not during a visual search (Luck, Hillyard, Mangun, & Gazzaniga, 1989). Here we tested whether multifocal spatial attention is the critical process that operates independently in the two hemifields. It is explicitly required in tracking (attend to a subset of object locations, suppress the others) but not in the standard visual search task (where all items are potential targets). We used a modified visual search task in which observers searched for a target within a subset of display items, where the subset was selected based on location (Experiments 1 and 3A) or based on a salient feature difference (Experiments 2 and 3B). The results show hemifield independence in this subset visual search task with location-based selection but not with feature-based selection; this effect cannot be explained by general difficulty (Experiment 4). Combined, these findings suggest that hemifield independence is a signature of multifocal spatial attention and highlight the need for cognitive and neural theories of attention to account for anatomical constraints on selection mechanisms. PMID:22637710
Core Hunter 3: flexible core subset selection.
De Beukelaer, Herman; Davenport, Guy F; Fack, Veerle
2018-05-31
Core collections provide genebank curators and plant breeders a way to reduce the size of their collections and populations, while minimizing impact on genetic diversity and allele frequency. Many methods have been proposed to generate core collections, often using distance metrics to quantify the similarity of two accessions, based on genetic marker data or phenotypic traits. Core Hunter is a multi-purpose core subset selection tool that uses local search algorithms to generate subsets relying on one or more metrics, including several distance metrics and allelic richness. In version 3 of Core Hunter (CH3) we have incorporated two new, improved methods for summarizing distances to quantify diversity or representativeness of the core collection. A comparison of CH3 and Core Hunter 2 (CH2) showed that these new metrics can be effectively optimized with less complex algorithms, as compared to those used in CH2. CH3 is more effective at maximizing the improved diversity metric than CH2, still ensures a high average and minimum distance, and is faster for large datasets. Using CH3, a simple stochastic hill-climber is able to find highly diverse core collections, and the more advanced parallel tempering algorithm further increases the quality of the core and further reduces variability across independent samples. We also evaluate the ability of CH3 to simultaneously maximize diversity, and either representativeness or allelic richness, and compare the results with those of the GDOpt and SimEli methods. CH3 can sample cores as representative as those of GDOpt, which was specifically designed for this purpose, and is able to construct cores that are simultaneously more diverse, and either are more representative or have higher allelic richness, than those obtained by SimEli. In version 3, Core Hunter has been updated to include two new core subset selection metrics that construct cores for representativeness or diversity, with improved performance. It combines the strengths of other methods and outperforms them, as it (simultaneously) optimizes a variety of metrics. In addition, CH3 is an improvement over CH2, with the option to use genetic marker data or phenotypic traits, or both, and improved speed. Core Hunter 3 is freely available at http://www.corehunter.org.
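A minimal sketch of the stochastic hill-climbing idea behind core subset selection follows. It is not the Core Hunter implementation (a Java tool), and it assumes mean pairwise distance as the diversity objective, whereas CH3 uses refined entry-to-nearest-entry summaries.

```python
# Stochastic hill-climber: repeatedly swap one accession in/out of the core
# and keep the swap whenever the diversity objective improves.
import numpy as np

def core_hill_climb(D, k, iters=2000, seed=0):
    """D: (n, n) precomputed distance matrix; k: core size."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    core = rng.choice(n, size=k, replace=False)

    def diversity(idx):
        sub = D[np.ix_(idx, idx)]
        return sub.sum() / (k * (k - 1))   # mean pairwise distance

    best = diversity(core)
    for _ in range(iters):
        cand = core.copy()
        out = rng.integers(k)                      # position to swap out
        pool = np.setdiff1d(np.arange(n), cand)
        cand[out] = rng.choice(pool)               # accession to swap in
        score = diversity(cand)
        if score > best:                           # accept improving moves only
            core, best = cand, score
    return core, best
```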
NASA Astrophysics Data System (ADS)
Djallel Dilmi, Mohamed; Mallet, Cécile; Barthes, Laurent; Chazottes, Aymeric
2017-04-01
Rain time series records are generally studied using rainfall rate or accumulation parameters, which are estimated for a fixed duration (typically 1 min, 1 h or 1 day). In this study we use the concept of rain events. The aim of the first part of this paper is to establish a parsimonious characterization of rain events, using a minimal set of variables selected among those normally used for the characterization of these events. A methodology is proposed, based on the combined use of a genetic algorithm (GA) and self-organizing maps (SOMs). It can be advantageous to use an SOM, since it allows a high-dimensional data space to be mapped onto a two-dimensional space while preserving, in an unsupervised manner, most of the information contained in the initial space topology. The 2-D maps obtained in this way allow the relationships between variables to be determined and redundant variables to be removed, thus leading to a minimal subset of variables. We verify that such 2-D maps make it possible to determine the characteristics of all events, on the basis of only five features (the event duration, the peak rain rate, the rain event depth, the standard deviation of the rain rate event and the absolute rain rate variation of the order of 0.5). From this minimal subset of variables, hierarchical cluster analyses were carried out. We show that clustering into two classes allows the conventional convective and stratiform classes to be determined, whereas classification into five classes allows this convective-stratiform classification to be further refined. Finally, our study made it possible to reveal the presence of some specific relationships between these five classes and the microphysics of their associated rain events.
Prognostic scores in oesophageal or gastric variceal bleeding.
Ohmann, C; Stöltzing, H; Wins, L; Busch, E; Thon, K
1990-05-01
Numerous scoring systems have been developed for the prediction of outcome of variceal bleeding; however, only a few have been evaluated adequately. The object of this study was to improve the classical Child-Pugh score (CPS) and to test other scores from the literature. Patients (n = 82) with endoscopically confirmed variceal bleeding and long-term sclerotherapy were included in the study. Linear logistic regression (LR) was applied to different sets of prognostic variables with regard to 30-day mortality. In addition, scores from the literature were evaluated on the data set. Performance was measured by the accuracy and receiver-operating characteristic curves. The application of LR to all five CPS variables (accuracy, 80%) was superior to the classical CPS (70%). LR with selection from the CPS variables or from other sets of variables resulted in no improvement. Compared with CPS only three scores from the literature, mainly based on subsets of the CPS variables, showed an improved accuracy. It is concluded that CPS is still a good scoring system; however, it can be improved by statistical analysis using the same variables.
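The study's core step, fitting a logistic regression of 30-day mortality on the five Child-Pugh variables, can be sketched as below. The variable names and data are hypothetical placeholders, not the study's dataset.

```python
# Logistic regression of 30-day mortality on five Child-Pugh variables.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
# assumed columns: bilirubin, albumin, prothrombin time, ascites grade,
# encephalopathy grade (synthetic values for illustration)
X = rng.normal(size=(82, 5))
y = rng.integers(0, 2, size=82)          # 1 = death within 30 days (toy labels)

model = LogisticRegression().fit(X, y)
print("in-sample accuracy:", accuracy_score(y, model.predict(X)))
```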
Kamińska, Joanna A
2018-07-01
Random forests, an advanced data mining method, are used here to model the regression relationships between concentrations of the pollutants NO2, NOx and PM2.5, and nine variables describing meteorological conditions, temporal conditions and traffic flow. The study was based on hourly values of wind speed, wind direction, temperature, air pressure and relative humidity, temporal variables, and finally traffic flow, in the two years 2015 and 2016. An air quality measurement station was selected on a main road, located a short distance (40 m) from a large intersection equipped with a traffic flow measurement system. Nine different time subsets were defined, based among other things on the climatic conditions in Wrocław. An analysis was made of the fit of models created for those subsets, and of the importance of the predictors. Both the fit and the importance of particular predictors were found to be dependent on season. The best fit was obtained for models created for the six-month warm season (April-September) and for the summer season (June-August). The most important explanatory variable in the models of concentrations of nitrogen oxides was traffic flow, while in the case of PM2.5 the most important were meteorological conditions, in particular temperature, wind speed and wind direction. Temporal variables (except for month in the case of PM2.5) were found to have no significant effect on the concentrations of the studied pollutants. Copyright © 2018 Elsevier Ltd. All rights reserved.
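A hedged sketch of this kind of analysis follows: a random forest regression of an hourly pollutant concentration on meteorological, temporal and traffic predictors, with the importance ranking the study relies on. The data and the linear generating rule are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 1000
X = pd.DataFrame({
    "wind_speed":   rng.gamma(2, 1.5, n),
    "wind_dir":     rng.uniform(0, 360, n),
    "temperature":  rng.normal(10, 8, n),
    "pressure":     rng.normal(1013, 8, n),
    "humidity":     rng.uniform(20, 100, n),
    "hour":         rng.integers(0, 24, n),
    "month":        rng.integers(1, 13, n),
    "weekday":      rng.integers(0, 7, n),
    "traffic_flow": rng.poisson(800, n),
})
# synthetic NOx-like response driven by traffic and wind speed
y = 0.05 * X["traffic_flow"] - 1.2 * X["wind_speed"] + rng.normal(0, 5, n)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
for name, imp in sorted(zip(X.columns, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:>12s}: {imp:.3f}")   # importance ranking of the predictors
```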
Unbiased feature selection in learning random forests for high-dimensional data.
Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi
2015-01-01
Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.
AbdelRahman, Samir E; Zhang, Mingyuan; Bray, Bruce E; Kawamoto, Kensaku
2014-05-27
The aim of this study was to propose an analytical approach to develop high-performing predictive models for congestive heart failure (CHF) readmission using an operational dataset with incomplete records and changing data over time. Our analytical approach involves three steps: pre-processing, systematic model development, and risk factor analysis. For pre-processing, variables that were absent in >50% of records were removed. Moreover, the dataset was divided into a validation dataset and derivation datasets, which were separated into three temporal subsets based on changes to the data over time. For systematic model development, using the different temporal datasets and the remaining explanatory variables, the models were developed by combining the use of various (i) statistical analyses to explore the relationships between the validation and the derivation datasets; (ii) adjustment methods for handling missing values; (iii) classifiers; (iv) feature selection methods; and (v) discretization methods. We then selected the best derivation dataset and the models with the highest predictive performance. For risk factor analysis, factors in the highest-performing predictive models were analyzed and ranked using (i) statistical analyses of the best derivation dataset, (ii) feature rankers, and (iii) a newly developed algorithm to categorize risk factors as being strong, regular, or weak. The analysis dataset consisted of 2,787 CHF hospitalizations at University of Utah Health Care from January 2003 to June 2013. In this study, we used the complete-case analysis and mean-based imputation adjustment methods; the wrapper subset feature selection method; and four ranking strategies based on information gain, gain ratio, symmetrical uncertainty, and wrapper subset feature evaluators. The best-performing models resulted from the use of a complete-case analysis derivation dataset combined with the Class-Attribute Contingency Coefficient discretization method and a voting classifier which averaged the results of multinomial logistic regression and voting feature intervals classifiers. Of 42 final model risk factors, discharge disposition, discretized age, and indicators of anemia were the most significant. This model achieved a c-statistic of 86.8%. The proposed three-step analytical approach enhanced predictive model performance for CHF readmissions. It could potentially be leveraged to improve predictive model performance in other areas of clinical medicine.
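A schematic version of the modelling step is sketched below: impute missing values, then average two probabilistic classifiers in a soft-voting ensemble. The paper combined multinomial logistic regression with a voting-feature-intervals classifier; VFI is not available in scikit-learn, so naive Bayes stands in for it here, and the data are synthetic.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 12))
X[rng.random(X.shape) < 0.1] = np.nan    # simulate incomplete records
y = rng.integers(0, 2, size=500)         # 1 = 30-day readmission (toy labels)

clf = make_pipeline(
    SimpleImputer(strategy="mean"),      # mean-based imputation, as in the study
    VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("nb", GaussianNB())],
        voting="soft"),                  # average the predicted probabilities
)
clf.fit(X, y)
print("predicted risk, first 5 patients:", clf.predict_proba(X[:5])[:, 1])
```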
Effects of Sample Selection on Estimates of Economic Impacts of Outdoor Recreation
Donald B.K. English
1997-01-01
Estimates of the economic impacts of recreation often come from spending data provided by a self-selected subset of a random sample of site visitors. The subset is frequently less than half the onsite sample. Biased vectors of per trip spending and impact estimates can result if self-selection is related to spending pattctns, and proper corrective procedures arc not...
Visual analytics in cheminformatics: user-supervised descriptor selection for QSAR methods.
Martínez, María Jimena; Ponzoni, Ignacio; Díaz, Mónica F; Vazquez, Gustavo E; Soto, Axel J
2015-01-01
The design of QSAR/QSPR models is a challenging problem, where the selection of the most relevant descriptors constitutes a key step of the process. Several feature selection methods that address this step concentrate on statistical associations among descriptors and target properties, whereas chemical knowledge is left out of the analysis. For this reason, the interpretability and generality of the QSAR/QSPR models obtained by these feature selection methods are drastically affected. Therefore, an approach for integrating domain experts' knowledge in the selection process is needed to increase confidence in the final set of descriptors. In this paper we propose a software tool, named Visual and Interactive DEscriptor ANalysis (VIDEAN), that combines statistical methods with interactive visualizations for choosing a set of descriptors for predicting a target property. Domain expertise can be added to the feature selection process by means of an interactive visual exploration of data, aided by statistical tools and metrics based on information theory. Coordinated visual representations are presented for capturing different relationships and interactions among descriptors, target properties, and candidate subsets of descriptors. The capabilities of the proposed software were assessed through different scenarios. These scenarios reveal how an expert can use this tool to choose one subset of descriptors from a group of candidate subsets, or how to modify existing descriptor subsets and even incorporate new descriptors according to his or her own knowledge of the target property. The reported experiences showed the suitability of our software for selecting sets of descriptors with low cardinality, high interpretability, low redundancy, and high statistical performance in a visual exploratory way. It is therefore possible to conclude that the resulting tool allows the integration of a chemist's expertise in the descriptor selection process with a low cognitive effort, in contrast with the alternative of an ad hoc manual analysis of the selected descriptors. Graphical abstract: VIDEAN allows the visual analysis of candidate subsets of descriptors for QSAR/QSPR. In the two panels on the top, users can interactively explore numerical correlations as well as co-occurrences in the candidate subsets through two interactive graphs.
Clustering and variable selection in the presence of mixed variable types and missing data.
Storlie, C B; Myers, S M; Katusic, S K; Weaver, A L; Voigt, R G; Croarkin, P E; Stoeckel, R E; Port, J D
2018-05-17
We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines. Copyright © 2018 John Wiley & Sons, Ltd.
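The Dirichlet-process mixture idea, clustering with an unknown number of components, can be sketched with scikit-learn's truncated stick-breaking approximation. The paper's model additionally handles discrete variables, missing data, and variable selection, all of which this toy example omits.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (200, 5)),
               rng.normal(4, 1, (150, 5)),
               rng.normal(-3, 1, (136, 5))])  # 486 "patients", 5 test scores

dpgmm = BayesianGaussianMixture(
    n_components=10,                          # truncation level, not final K
    weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(X)
labels = dpgmm.predict(X)
print("occupied clusters:", np.unique(labels))  # far fewer than 10 survive
```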
Engine with exhaust gas recirculation system and variable geometry turbocharger
Keating, Edward J.
2015-11-03
An engine assembly includes an intake assembly, an internal combustion engine defining a plurality of cylinders and configured to combust a fuel and produce exhaust gas, and an exhaust assembly. Each of the plurality of cylinders is provided in fluid communication with the intake assembly. The exhaust assembly is provided in fluid communication with a first subset of the plurality of cylinders, and a dedicated exhaust gas recirculation system is provided in fluid communication with both a second subset of the plurality of cylinders and the intake assembly. The dedicated exhaust gas recirculation system is configured to route all of the exhaust gas from the second subset of the plurality of cylinders to the intake assembly. Finally, the engine assembly includes a turbocharger having a variable geometry turbine in fluid communication with the exhaust assembly.
Using abiotic variables to predict importance of sites for species representation.
Albuquerque, Fabio; Beier, Paul
2015-10-01
In systematic conservation planning, species distribution data for all sites in a planning area are used to prioritize each site in terms of the site's importance toward meeting the goal of species representation. But comprehensive species data are not available in most planning areas and would be expensive to acquire. As a shortcut, ecologists use surrogates, such as occurrences of birds or another well-surveyed taxon, or land types defined from remotely sensed data, in the hope that sites that represent the surrogates also represent biodiversity. Unfortunately, surrogates have not performed reliably. We propose a new type of surrogate, predicted importance, that can be developed from species data for a q% subset of sites. With species data from this subset of sites, importance can be modeled as a function of abiotic variables available at no charge for all terrestrial areas on Earth. Predicted importance can then be used as a surrogate to prioritize all sites. We tested this surrogate with 8 sets of species data. For each data set, we used a q% subset of sites to model importance as a function of abiotic variables, used the resulting function to predict importance for all sites, and evaluated the number of species in the sites with highest predicted importance. Sites with the highest predicted importance represented species efficiently for all data sets when q = 25% and for 7 of 8 data sets when q = 20%. Predicted importance requires less survey effort than direct selection for species representation and meets representation goals well compared with other surrogates currently in use. This less expensive surrogate may be useful in those areas of the world that need it most, namely tropical regions with the highest biodiversity, greatest biodiversity loss, most severe lack of inventory data, and poorly developed protected area networks. © 2015 Society for Conservation Biology.
Biowaste home composting: experimental process monitoring and quality control.
Tatàno, Fabio; Pagliaro, Giacomo; Di Giovanni, Paolo; Floriani, Enrico; Mangani, Filippo
2015-04-01
Because home composting is a prevention option in managing biowaste at local levels, the objective of the present study was to contribute to the knowledge of the process evolution and compost quality that can be expected and obtained, respectively, in this decentralized option. In this study, organized as the research portion of a provincial project on home composting in the territory of Pesaro-Urbino (Central Italy), four experimental composters were first initiated and temporally monitored. Second, two small sub-sets of selected provincial composters (directly operated by households involved in the project) underwent quality control on their compost products at two different temporal steps. The monitored experimental composters showed overall decreasing profiles versus composting time for moisture, organic carbon, and C/N, as well as overall increasing profiles for electrical conductivity and total nitrogen, which represented qualitative indications of progress in the process. Comparative evaluations of the monitored experimental composters also suggested some interactions in home composting, i.e., high C/N ratios limiting organic matter decomposition rates and final humification levels; high moisture contents restricting the internal temperature regime; nearly horizontal phosphorus and potassium evolutions contributing to limit the rates of increase in electrical conductivity; and prolonged biowaste additions contributing to limit the rate of decrease in moisture. The measures of parametric data variability in the two sub-sets of controlled provincial composters showed decreased variability in moisture, organic carbon, and C/N from the seventh to fifteenth month of home composting, as well as increased variability in electrical conductivity, total nitrogen, and humification rate, which could be considered compatible with the respective nature of decreasing and increasing parameters during composting. The modeled parametric kinetics in the monitored experimental composters, along with the evaluation of the parametric central tendencies in the sub-sets of controlled provincial composters, all indicate that 12-15 months is a suitable duration for the appropriate development of home composting in final and simultaneous compliance with typical reference limits. Copyright © 2014 Elsevier Ltd. All rights reserved.
Graves, Tabitha A.; Royle, J. Andrew; Kendall, Katherine C.; Beier, Paul; Stetz, Jeffrey B.; Macleod, Amy C.
2012-01-01
Using multiple detection methods can increase the number, kind, and distribution of individuals sampled, which may increase accuracy and precision and reduce cost of population abundance estimates. However, when variables influencing abundance are of interest, if individuals detected via different methods are influenced by the landscape differently, separate analysis of multiple detection methods may be more appropriate. We evaluated the effects of combining two detection methods on the identification of variables important to local abundance using detections of grizzly bears with hair traps (systematic) and bear rubs (opportunistic). We used hierarchical abundance models (N-mixture models) with separate model components for each detection method. If both methods sample the same population, the use of either data set alone should (1) lead to the selection of the same variables as important and (2) provide similar estimates of relative local abundance. We hypothesized that the inclusion of 2 detection methods versus either method alone should (3) yield more support for variables identified in single method analyses (i.e. fewer variables and models with greater weight), and (4) improve precision of covariate estimates for variables selected in both separate and combined analyses because sample size is larger. As expected, joint analysis of both methods increased precision as well as certainty in variable and model selection. However, the single-method analyses identified different variables and the resulting predicted abundances had different spatial distributions. We recommend comparing single-method and jointly modeled results to identify the presence of individual heterogeneity between detection methods in N-mixture models, along with consideration of detection probabilities, correlations among variables, and tolerance to risk of failing to identify variables important to a subset of the population. The benefits of increased precision should be weighed against those risks. The analysis framework presented here will be useful for other species exhibiting heterogeneity by detection method.
Kay, Jeremy N; De la Huerta, Irina; Kim, In-Jung; Zhang, Yifeng; Yamagata, Masahito; Chu, Monica W; Meister, Markus; Sanes, Joshua R
2011-05-25
The retina contains ganglion cells (RGCs) that respond selectively to objects moving in particular directions. Individual members of a group of ON-OFF direction-selective RGCs (ooDSGCs) detect stimuli moving in one of four directions: ventral, dorsal, nasal, or temporal. Despite this physiological diversity, little is known about subtype-specific differences in structure, molecular identity, and projections. To seek such differences, we characterized mouse transgenic lines that selectively mark ooDSGCs preferring ventral or nasal motion as well as a line that marks both ventral- and dorsal-preferring subsets. We then used the lines to identify cell surface molecules, including Cadherin 6, CollagenXXVα1, and Matrix metalloprotease 17, that are selectively expressed by distinct subsets of ooDSGCs. We also identify a neuropeptide, CART (cocaine- and amphetamine-regulated transcript), that distinguishes all ooDSGCs from other RGCs. Together, this panel of endogenous and transgenic markers distinguishes the four ooDSGC subsets. Patterns of molecular diversification occur before eye opening and are therefore experience independent. They may help to explain how the four subsets obtain distinct inputs. We also demonstrate differences among subsets in their dendritic patterns within the retina and their axonal projections to the brain. Differences in projections indicate that information about motion in different directions is sent to different destinations.
Prediction of Malaysian monthly GDP
NASA Astrophysics Data System (ADS)
Hin, Pooi Ah; Ching, Soo Huei; Yeing, Pan Wei
2015-12-01
The paper attempts to use a method based on the multivariate power-normal distribution to predict the Malaysian Gross Domestic Product (GDP) next month. Letting r(t) be the vector consisting of the month-t values of m selected macroeconomic variables and GDP, we model the month-(t+1) GDP to be dependent on the present and l-1 past values r(t), r(t-1), …, r(t-l+1) via a conditional distribution which is derived from a [(m+1)l+1]-dimensional power-normal distribution. The 100(α/2)% and 100(1-α/2)% points of the conditional distribution may be used to form an out-of-sample prediction interval. This interval, together with the mean of the conditional distribution, may be used to predict the month-(t+1) GDP. The mean absolute percentage error (MAPE), estimated coverage probability, and average length of the prediction interval are used as the criteria for selecting the suitable lag value l-1 and the subset from a pool of 17 macroeconomic variables. It is found that the relatively better models are those for which 2 ≤ l ≤ 3, involving one or two of the macroeconomic variables given by Market Indicative Yield, Oil Prices, Exchange Rate and Import Trade.
Two-stage atlas subset selection in multi-atlas based image segmentation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhao, Tingting, E-mail: tingtingzhao@mednet.ucla.edu; Ruan, Dan, E-mail: druan@mednet.ucla.edu
2015-06-15
Purpose: Fast growing access to large databases and cloud stored data presents a unique opportunity for multi-atlas based image segmentation and also presents challenges in heterogeneous atlas quality and computation burden. This work aims to develop a novel two-stage method tailored to the special needs in the face of a large atlas collection with varied quality, so that high-accuracy segmentation can be achieved with low computational cost. Methods: An atlas subset selection scheme is proposed to substitute a significant portion of the computationally expensive full-fledged registration in the conventional scheme with a low-cost alternative. More specifically, the authors introduce a two-stage atlas subset selection method. In the first stage, an augmented subset is obtained based on a low-cost registration configuration and a preliminary relevance metric; in the second stage, the subset is further narrowed down to a fusion set of desired size, based on full-fledged registration and a refined relevance metric. An inference model is developed to characterize the relationship between the preliminary and refined relevance metrics, and a proper augmented subset size is derived to ensure that the desired atlases survive the preliminary selection with high probability. Results: The performance of the proposed scheme has been assessed with cross validation based on two clinical datasets consisting of manually segmented prostate and brain magnetic resonance images, respectively. The proposed scheme demonstrates comparable end-to-end segmentation performance as the conventional single-stage selection method, but with significant computation reduction. Compared with the alternative computation reduction method, their scheme improves the mean and median Dice similarity coefficient value from (0.74, 0.78) to (0.83, 0.85) and from (0.82, 0.84) to (0.95, 0.95) for prostate and corpus callosum segmentation, respectively, with statistical significance. Conclusions: The authors have developed a novel two-stage atlas subset selection scheme for multi-atlas based segmentation. It achieves good segmentation accuracy with significantly reduced computation cost, making it a suitable configuration in the presence of extensive heterogeneous atlases.
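The two-stage idea can be sketched in a few lines: a cheap relevance metric prunes the atlas pool to an augmented subset, and an expensive full registration refines that subset down to the fusion set. The score functions below are placeholder stand-ins for coarse and full-fledged registration plus similarity, not the authors' actual metrics.

```python
import numpy as np

def two_stage_select(atlases, target, cheap_score, costly_score,
                     augmented_size, fusion_size):
    # Stage 1: rank all atlases with the low-cost metric.
    prelim = np.array([cheap_score(a, target) for a in atlases])
    augmented = np.argsort(prelim)[::-1][:augmented_size]
    # Stage 2: run the expensive metric only on the survivors.
    refined = np.array([costly_score(atlases[i], target) for i in augmented])
    return augmented[np.argsort(refined)[::-1][:fusion_size]]

# Toy usage: "atlases" are feature vectors, scores are negative distances.
rng = np.random.default_rng(5)
atlases = rng.normal(size=(500, 64))
target = rng.normal(size=64)
cheap = lambda a, t: -np.abs(a - t).mean()       # coarse, low-cost proxy
costly = lambda a, t: -np.linalg.norm(a - t)     # stands in for registration
print(two_stage_select(atlases, target, cheap, costly, 40, 10))
```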
Minimally buffered data transfers between nodes in a data communications network
Miller, Douglas R.
2015-06-23
Methods, apparatus, and products for minimally buffered data transfers between nodes in a data communications network are disclosed that include: receiving, by a messaging module on an origin node, a storage identifier, an origin data type, and a target data type, the storage identifier specifying application storage containing data, the origin data type describing a data subset contained in the origin application storage, the target data type describing an arrangement of the data subset in application storage on a target node; creating, by the messaging module, origin metadata describing the origin data type; selecting, by the messaging module from the origin application storage in dependence upon the origin metadata and the storage identifier, the data subset; and transmitting, by the messaging module to the target node, the selected data subset for storing in the target application storage in dependence upon the target data type without temporarily buffering the data subset.
Negotiating Multicollinearity with Spike-and-Slab Priors
Ročková, Veronika
2014-01-01
In multiple regression under the normal linear model, the presence of multicollinearity is well known to lead to unreliable and unstable maximum likelihood estimates. This can be particularly troublesome for the problem of variable selection where it becomes more difficult to distinguish between subset models. Here we show how adding a spike-and-slab prior mitigates this difficulty by filtering the likelihood surface into a posterior distribution that allocates the relevant likelihood information to each of the subset model modes. For identification of promising high posterior models in this setting, we consider three EM algorithms, the fast closed form EMVS version of Rockova and George (2014) and two new versions designed for variants of the spike-and-slab formulation. For a multimodal posterior under multicollinearity, we compare the regions of convergence of these three algorithms. Deterministic annealing versions of the EMVS algorithm are seen to substantially mitigate this multimodality. A single simple running example is used for illustration throughout. PMID:25419004
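The E-step at the heart of EMVS-style algorithms is easy to state: given current coefficient estimates, compute the posterior probability that each coefficient comes from the "slab" (large variance) rather than the "spike" (small variance). The fragment below is for intuition only, not the full EM of Rockova and George (2014); the variance and prior-weight values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def inclusion_probs(beta, v0=0.01, v1=10.0, theta=0.5):
    """Posterior probability each coefficient belongs to the slab component."""
    slab = theta * norm.pdf(beta, scale=np.sqrt(v1))
    spike = (1 - theta) * norm.pdf(beta, scale=np.sqrt(v0))
    return slab / (slab + spike)

beta = np.array([0.02, -0.5, 1.3, 0.001])
print(inclusion_probs(beta))  # small coefficients get low inclusion probability
```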
A Cancer Gene Selection Algorithm Based on the K-S Test and CFS.
Su, Qiang; Wang, Yina; Jiang, Xiaobing; Chen, Fuxue; Lu, Wen-Cong
2017-01-01
To address the challenging problem of selecting distinguished genes from cancer gene expression datasets, this paper presents a gene subset selection algorithm based on the Kolmogorov-Smirnov (K-S) test and correlation-based feature selection (CFS) principles. The algorithm selects distinguished genes first using the K-S test, and then, it uses CFS to select genes from those selected by the K-S test. We adopted support vector machines (SVM) as the classification tool and used accuracy as the criterion to evaluate the performance of the classifiers on the selected gene subsets. We compared the proposed gene subset selection algorithm with the K-S test, CFS, minimum-redundancy maximum-relevancy (mRMR), and ReliefF algorithms. The average experimental results of the aforementioned gene selection algorithms for 5 gene expression datasets demonstrate that, based on accuracy, the performance of the new K-S and CFS-based algorithm is better than those of the K-S test, CFS, mRMR, and ReliefF algorithms. The experimental results show that the K-S test-CFS gene selection algorithm is a very effective and promising approach compared to the K-S test, CFS, mRMR, and ReliefF algorithms.
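A sketch of the two-step filter follows: (1) keep genes whose expression distributions differ between classes by the two-sample K-S test, then (2) score candidate subsets with a CFS-style merit (high feature-class correlation, low feature-feature correlation). The merit formula is the standard simplified CFS expression and the data are synthetic; the paper's exact thresholds are assumptions here.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_filter(X, y, alpha=0.01):
    """Indices of genes whose two-class distributions differ by the K-S test."""
    return [j for j in range(X.shape[1])
            if ks_2samp(X[y == 0, j], X[y == 1, j]).pvalue < alpha]

def cfs_merit(X, y, subset):
    """CFS merit: k*rcf / sqrt(k + k(k-1)*rff)."""
    k = len(subset)
    rcf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return rcf
    rff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                   for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)
X[y == 1, :5] += 1.0                      # make 5 genes informative
candidates = ks_filter(X, y)
print("K-S survivors:", candidates)
if candidates:
    print("CFS merit of survivors:", round(cfs_merit(X, y, candidates), 3))
```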
A Study of Quasar Selection in the Supernova Fields of the Dark Energy Survey
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tie, S. S.; Martini, P.; Mudd, D.
2017-02-15
In this paper, we present a study of quasar selection using the supernova fields of the Dark Energy Survey (DES). We used a quasar catalog from an overlapping portion of the SDSS Stripe 82 region to quantify the completeness and efficiency of selection methods involving color, probabilistic modeling, variability, and combinations of color/probabilistic modeling with variability. In all cases, we considered only objects that appear as point sources in the DES images. We examine color selection methods based on the Wide-field Infrared Survey Explorer (WISE) mid-IR W1-W2 color, a mixture of WISE and DES colors (g - i and i - W1), and a mixture of Vista Hemisphere Survey and DES colors (g - i and i - K). For probabilistic quasar selection, we used XDQSO, an algorithm that employs an empirical multi-wavelength flux model of quasars to assign quasar probabilities. Our variability selection uses the multi-band χ2-probability that sources are constant in the DES Year 1 griz-band light curves. The completeness and efficiency are calculated relative to an underlying sample of point sources that are detected in the required selection bands and pass our data quality and photometric error cuts. We conduct our analyses at two magnitude limits, i < 19.8 mag and i < 22 mag. For the subset of sources with W1 and W2 detections, the W1-W2 color or XDQSOz method combined with variability gives the highest completenesses of >85% for both i-band magnitude limits and efficiencies of >80% to the bright limit and >60% to the faint limit; however, the giW1 and giW1+variability methods give the highest quasar surface densities. The XDQSOz method and combinations of W1W2/giW1/XDQSOz with variability are among the better selection methods when both high completeness and high efficiency are desired. We also present the OzDES Quasar Catalog of 1263 spectroscopically confirmed quasars from three years of OzDES observation in the 30 deg2 of the DES supernova fields. The catalog includes quasars with redshifts up to z ~ 4 and brighter than i = 22 mag, although the catalog is not complete up to this magnitude limit.
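The variability cut rests on a standard statistic: the chi-squared probability that a light curve is consistent with a constant source. A minimal single-band sketch, with synthetic light curves, is shown below; the survey combines this across bands.

```python
import numpy as np
from scipy.stats import chi2

def prob_constant(flux, err):
    """P-value for the hypothesis that the light curve is constant."""
    w = 1.0 / err**2
    mean = np.sum(w * flux) / np.sum(w)           # inverse-variance weighted mean
    chisq = np.sum(((flux - mean) / err) ** 2)
    return chi2.sf(chisq, df=len(flux) - 1)

rng = np.random.default_rng(7)
t = 20                                            # epochs in one band
quasar = 1.0 + 0.3 * np.sin(np.arange(t)) + rng.normal(0, 0.05, t)
star = 1.0 + rng.normal(0, 0.05, t)
err = np.full(t, 0.05)
print("quasar-like:", prob_constant(quasar, err))  # ~0: variable, selected
print("star-like:  ", prob_constant(star, err))    # large: consistent with constant
```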
The Cross-Entropy Based Multi-Filter Ensemble Method for Gene Selection.
Sun, Yingqiang; Lu, Chengbo; Li, Xiaobo
2018-05-17
The gene expression profile has the characteristics of high dimensionality, small sample size, and continuous values, and it is a great challenge to use gene expression profile data for the classification of tumor samples. This paper proposes a cross-entropy based multi-filter ensemble (CEMFE) method for microarray data classification. Firstly, multiple filters are used to pre-select features from the microarray data, yielding several feature subsets with different classification abilities. The top N genes with the highest rank in each subset are integrated to form a new data set. Secondly, the cross-entropy algorithm is used to remove redundant data from the data set. Finally, the wrapper method, based on forward feature selection, is used to select the best feature subset. The experimental results show that the proposed method is more efficient than other gene selection methods and that it achieves a higher classification accuracy with fewer characteristic genes.
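A schematic pipeline in the same spirit is sketched below: several filters each rank the genes, the top-N lists are merged, and a forward-selection wrapper finishes the job. The cross-entropy redundancy-removal step is replaced here by the simple union of filter outputs for brevity, so this illustrates the multi-filter-plus-wrapper structure rather than CEMFE itself.

```python
import numpy as np
from sklearn.feature_selection import (f_classif, mutual_info_classif,
                                       SequentialFeatureSelector)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
X = rng.normal(size=(80, 300))
y = rng.integers(0, 2, size=80)
X[y == 1, :10] += 1.0                     # 10 genes carry the class signal

N = 20
f_rank = np.argsort(f_classif(X, y)[0])[::-1][:N]                 # filter 1
mi_rank = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:N]  # filter 2
merged = np.unique(np.concatenate([f_rank, mi_rank]))             # union of top-N lists

# Wrapper: forward feature selection over the merged pre-selection.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="forward").fit(X[:, merged], y)
print("final genes:", merged[sfs.get_support()])
```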
SCI model structure determination program (OSR) user's guide. [optimal subset regression
NASA Technical Reports Server (NTRS)
1979-01-01
The computer program OSR (Optimal Subset Regression), which estimates models for rotorcraft body and rotor force and moment coefficients, is described. The technique used is based on the subset regression algorithm. Given time histories of aerodynamic coefficients, aerodynamic variables, and control inputs, the program computes correlations between the various time histories. The model structure determination is based on these correlations. Inputs and outputs of the program are given.
On the reliable and flexible solution of practical subset regression problems
NASA Technical Reports Server (NTRS)
Verhaegen, M. H.
1987-01-01
A new algorithm for solving subset regression problems is described. The algorithm performs a QR decomposition with a new column-pivoting strategy, which permits subset selection directly from the originally defined regression parameters. This, in combination with a number of extensions of the new technique, makes the method a very flexible tool for analyzing subset regression problems in which the parameters have a physical meaning.
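A minimal illustration of subset selection via QR with column pivoting is given below. SciPy's routine orders the regressor columns so that the leading pivots best span the data, and the first k pivots give a candidate subset; the paper's algorithm uses a modified pivoting strategy, whereas this shows the standard one.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 8))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=100)   # near-duplicate regressor

# pivoting=True returns the column permutation in decreasing importance
Q, R, piv = qr(X, mode="economic", pivoting=True)
k = 4
print("selected regressors:", piv[:k])   # the near-duplicate column tends to pivot late
```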
Analysis of Information Content in High-Spectral Resolution Sounders using Subset Selection Analysis
NASA Technical Reports Server (NTRS)
Velez-Reyes, Miguel; Joiner, Joanna
1998-01-01
In this paper, we summarize the results of the sensitivity analysis and data reduction carried out to determine the information content of AIRS and IASI channels. The analysis and data reduction were based on the use of subset selection techniques developed in the linear algebra and statistics communities to study linear dependencies in high dimensional data sets. We applied the subset selection method to study dependency among channels by studying the dependency among their weighting functions. Also, we applied the technique to study the information provided by the different levels into which the atmosphere is discretized for retrievals and analysis. Results from the method correlate well with intuition in many respects and point to possible modifications for band selection in sensor design and for the number and location of levels in the analysis process.
Genetic Algorithms and Classification Trees in Feature Discovery: Diabetes and the NHANES database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Heredia-Langner, Alejandro; Jarman, Kristin H.; Amidan, Brett G.
2013-09-01
This paper presents a feature selection methodology that can be applied to datasets containing a mixture of continuous and categorical variables. Using a Genetic Algorithm (GA), this method explores a dataset and selects a small set of features relevant for the prediction of a binary (1/0) response. Binary classification trees and an objective function based on conditional probabilities are used to measure the fitness of a given subset of features. The method is applied to health data in order to find factors useful for the prediction of diabetes. Results show that our algorithm is capable of narrowing down the set of predictors to around 8 factors that can be validated using reputable medical and public health resources.
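A toy genetic algorithm over binary feature masks with a classification-tree fitness is sketched below, mirroring the GA-plus-tree setup; the objective is simplified to cross-validated accuracy rather than the paper's conditional-probability score, and the data are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 30))
y = (X[:, 0] + X[:, 5] > 0).astype(int)        # two truly relevant factors

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.random((20, 30)) < 0.2               # sparse initial population of masks
for gen in range(25):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]       # truncation selection
    a = parents[rng.integers(10, size=10)]
    b = parents[rng.integers(10, size=10)]
    cross = rng.random(a.shape) < 0.5                  # uniform crossover
    children = np.where(cross, a, b)
    children ^= rng.random(children.shape) < 0.05      # bit-flip mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.where(best)[0])
```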
Prediction of solvation enthalpy of gaseous organic compounds in propanol
NASA Astrophysics Data System (ADS)
Golmohammadi, Hassan; Dashtbozorgi, Zahra
2016-09-01
The purpose of this paper is to present a novel way for developing quantitative structure-property relationship (QSPR) models to predict the gas-to-propanol solvation enthalpy (ΔHsolv) of 95 organic compounds. Different kinds of descriptors were calculated for each compound using the Dragon software package. The variable selection technique of the replacement method (RM) was employed to select the optimal subset of solute descriptors. Our investigation reveals that the dependence of the solvation enthalpy on the solution's physicochemical properties is nonlinear, so the linear RM model is unable to predict the solvation enthalpy accurately; a support vector machine (SVM) was therefore used to build a nonlinear model on the selected descriptors. The results established that the ΔHsolv values calculated by SVM were in good agreement with the experimental ones, and the performance of the SVM model was superior to that of the RM model.
Fink, Herbert; Panne, Ulrich; Niessner, Reinhard
2002-09-01
An experimental setup for direct elemental analysis of recycled thermoplasts from consumer electronics by laser-induced plasma spectroscopy (LIPS, or laser-induced breakdown spectroscopy, LIBS) was realized. The combination of an echelle spectrograph, featuring high resolution with broad spectral coverage, with multivariate methods, such as PLS, PCR, and variable subset selection via a genetic algorithm, resulted in considerable improvements in selectivity and sensitivity for this complex matrix. With normalization to carbon as internal standard, the limits of detection were in the ppm range. A preliminary pattern recognition study points to the possibility of polymer recognition via the line-rich echelle spectra. Several experiments at an extruder within a recycling plant successfully demonstrated the capability of LIPS for different kinds of routine on-line process analysis.
van der Aa, Lieke M; Levraud, Jean-Pierre; Yahmi, Malika; Lauret, Emilie; Briolat, Valérie; Herbomel, Philippe; Benmansour, Abdenour; Boudinot, Pierre
2009-01-01
Background In mammals, the members of the tripartite motif (TRIM) protein family are involved in various cellular processes including innate immunity against viral infection. Viruses exert strong selective pressures on the defense system. Accordingly, antiviral TRIMs have diversified highly through gene expansion, positive selection and alternative splicing. Characterizing immune TRIMs in other vertebrates may enlighten their complex evolution. Results We describe here a large new subfamily of TRIMs in teleosts, called finTRIMs, identified in rainbow trout as virus-induced transcripts. FinTRIMs are formed of nearly identical RING/B-box regions and C-termini of variable length; the long variants include a B30.2 domain. The zebrafish genome harbors a striking diversity of finTRIMs, with 84 genes distributed in clusters on different chromosomes. A phylogenetic analysis revealed different subsets suggesting lineage-specific diversification events. Accordingly, the number of fintrim genes varies greatly among fish species. Conserved syntenies were observed only for the oldest fintrims. The closest mammalian relatives are trim16 and trim25, but they are not true orthologs. The B30.2 domain of zebrafish finTRIMs evolved under strong positive selection. The positions under positive selection are remarkably congruent in finTRIMs and in mammalian antiviral TRIM5α, concentrated within a viral recognition motif in mammals. The B30.2 domains most closely related to finTRIM are found among NOD-like receptors (NLR), indicating that the evolution of TRIMs and NLRs was intertwined by exon shuffling. Conclusion The diversity, evolution, and features of finTRIMs suggest an important role in fish innate immunity; this would make them the first TRIMs involved in immunity identified outside mammals. PMID:19196451
Müller, Christian; Schillert, Arne; Röthemeier, Caroline; Trégouët, David-Alexandre; Proust, Carole; Binder, Harald; Pfeiffer, Norbert; Beutel, Manfred; Lackner, Karl J.; Schnabel, Renate B.; Tiret, Laurence; Wild, Philipp S.; Blankenberg, Stefan
2016-01-01
Technical variation plays an important role in microarray-based gene expression studies, and batch effects explain a large proportion of this noise. It is therefore mandatory to eliminate technical variation while maintaining biological variability. Several strategies have been proposed for the removal of batch effects, although they have not been evaluated in large-scale longitudinal gene expression data. In this study, we aimed at identifying a suitable method for batch effect removal in a large study of microarray-based longitudinal gene expression. Monocytic gene expression was measured in 1092 participants of the Gutenberg Health Study at baseline and 5-year follow up. Replicates of selected samples were measured at both time points to identify technical variability. Deming regression, Passing-Bablok regression, linear mixed models, non-linear models as well as ReplicateRUV and ComBat were applied to eliminate batch effects between replicates. In a second step, quantile normalization prior to batch effect correction was performed for each method. Technical variation between batches was evaluated by principal component analysis. Associations between body mass index and transcriptomes were calculated before and after batch removal. Results from association analyses were compared to evaluate maintenance of biological variability. Quantile normalization, separately performed in each batch, combined with ComBat successfully reduced batch effects and maintained biological variability. ReplicateRUV performed perfectly in the replicate data subset of the study, but failed when applied to all samples. All other methods did not substantially reduce batch effects in the replicate data subset. Quantile normalization plus ComBat appears to be a valuable approach for batch correction in longitudinal gene expression data. PMID:27272489
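The preprocessing step of the winning strategy, quantile normalization applied per batch before ComBat, is compact enough to sketch in full: every sample (column) is forced to share the distribution of the mean sorted expression profile. This is a generic implementation (ties handled crudely), not the exact routine used in the study.

```python
import numpy as np

def quantile_normalize(expr):
    """expr: (genes, samples) expression matrix."""
    order = np.argsort(expr, axis=0)              # per-sample rank order
    ref = np.sort(expr, axis=0).mean(axis=1)      # mean of the sorted columns
    out = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        out[order[:, j], j] = ref                 # assign reference value by rank
    return out

rng = np.random.default_rng(11)
batch = rng.lognormal(mean=0, sigma=1, size=(1000, 12))
normed = quantile_normalize(batch)
# after normalization, every sample has an identical value distribution
print(np.allclose(np.sort(normed[:, 0]), np.sort(normed[:, 1])))  # True
```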
Star Products with Separation of Variables Admitting a Smooth Extension
NASA Astrophysics Data System (ADS)
Karabegov, Alexander
2012-08-01
Given a complex manifold M with an open dense subset Ω endowed with a pseudo-Kähler form ω which cannot be smoothly extended to a larger open subset, we consider various examples where the corresponding Kähler-Poisson structure and a star product with separation of variables on (Ω, ω) admit smooth extensions to M. We give a simple criterion of the existence of a smooth extension of a star product and apply it to these examples.
Applications For Real Time NOMADS At NCEP To Disseminate NOAA's Operational Model Data Base
NASA Astrophysics Data System (ADS)
Alpert, J. C.; Wang, J.; Rutledge, G.
2007-05-01
A wide range of environmental information, in digital form, with metadata descriptions and supporting infrastructure is contained in the NOAA Operational Modeling Archive Distribution System (NOMADS) and its Real Time (RT) project prototype at the National Centers for Environmental Prediction (NCEP). NOMADS is now delivering on its goal of a seamless framework, from archival to real time data dissemination, for NOAA's operational model data holdings. A process is under way to make NOMADS part of NCEP's operational production of products. A goal is to foster collaborations among the research and education communities, value added retailers, and public access for science and development. In the National Research Council's "Completing the Forecast", Recommendation 3.4 states: "NOMADS should be maintained and extended to include (a) long-term archives of the global and regional ensemble forecasting systems at their native resolution, and (b) re-forecast datasets to facilitate post-processing." As one of many participants of NOMADS, NCEP serves the operational model data base using the data access protocol OPeNDAP (also known as DODS) and other services for participants to serve their data sets and users to obtain them. Using the NCEP global ensemble data as an example, we show an OPeNDAP client application that provides a request-and-fulfill mechanism for access to the complex ensemble matrix of holdings. As an example of the DAP service, we show a client application which accesses the Global or Regional Ensemble data set to produce user-selected weather element event probabilities. The event probabilities are easily extended over model forecast time to show probability histograms defining the future trend of user-selected events. This approach ensures an efficient use of computer resources because users transmit only the data necessary for their tasks. Data sets are served by OPeNDAP, allowing commercial clients such as MATLAB or IDL as well as freeware clients such as GrADS to access the NCEP real time database. We will demonstrate how users can use NOMADS services to repackage area subsets and select levels and variables that are sent to a user-selected FTP site. NOMADS can also display plots on demand for area subsets, selected levels, time series, and selected variables.
Algorithm For Solution Of Subset-Regression Problems
NASA Technical Reports Server (NTRS)
Verhaegen, Michel
1991-01-01
Reliable and flexible algorithm for solution of subset-regression problem performs QR decomposition with new column-pivoting strategy, enables selection of subset directly from originally defined regression parameters. This feature, in combination with number of extensions, makes algorithm very flexible for use in analysis of subset-regression problems in which parameters have physical meanings. Also extended to enable joint processing of columns contaminated by noise with those free of noise, without using scaling techniques.
Mori, K
1986-02-19
To examine differential carbohydrate expression among different subsets of primary afferent fibers, several fluorescein-isothiocyanate conjugated lectins were used in a histochemical study of the dorsal root ganglion (DRG) and spinal cord of the rabbit. The lectin Ulex europaeus agglutinin I specifically labeled a subset of DRG cells and primary afferent fibers which projected to the superficial laminae of the dorsal horn. These results suggest that specific carbohydrates containing the L-fucosyl residue are expressed selectively in small-diameter primary afferent fibers which subserve nociception or thermoception.
Žuvela, Petar; Liu, J Jay; Macur, Katarzyna; Bączek, Tomasz
2015-10-06
In this work, performance of five nature-inspired optimization algorithms, genetic algorithm (GA), particle swarm optimization (PSO), artificial bee colony (ABC), firefly algorithm (FA), and flower pollination algorithm (FPA), was compared in molecular descriptor selection for development of quantitative structure-retention relationship (QSRR) models for 83 peptides that originate from eight model proteins. The matrix with 423 descriptors was used as input, and QSRR models based on selected descriptors were built using partial least squares (PLS), whereas root mean square error of prediction (RMSEP) was used as a fitness function for their selection. Three performance criteria, prediction accuracy, computational cost, and the number of selected descriptors, were used to evaluate the developed QSRR models. The results show that all five variable selection methods outperform interval PLS (iPLS), sparse PLS (sPLS), and the full PLS model, whereas GA is superior because of its lowest computational cost and higher accuracy (RMSEP of 5.534%) with a smaller number of variables (nine descriptors). The GA-QSRR model was validated initially through Y-randomization. In addition, it was successfully validated with an external testing set out of 102 peptides originating from Bacillus subtilis proteomes (RMSEP of 22.030%). Its applicability domain was defined, from which it was evident that the developed GA-QSRR exhibited strong robustness. All the sources of the model's error were identified, thus allowing for further application of the developed methodology in proteomics.
NASA Astrophysics Data System (ADS)
Mohd. Rijal, Omar; Mohd. Noor, Norliza; Teng, Shee Lee
A statistical method for comparing two digital chest radiographs of pulmonary tuberculosis (PTB) patients is proposed. After applying appropriate image registration procedures, a selected subset of each image is converted to an image histogram (or box plot). Comparing two chest X-ray images is then equivalent to directly comparing the two corresponding histograms. From each histogram, eleven percentiles (of image intensity) are calculated. The number of percentiles that shift to the left (NLSP) when the second image is compared to the first has been shown to be an indicator of a patient's progress. In this study, the values of NLSP are compared with the actual diagnoses (Y) of several medical practitioners. A logistic regression model is used to study the relationship between NLSP and Y. This study showed that NLSP may be used as an alternative or second opinion for Y. The proposed regression model also shows that important explanatory variables such as the outcome of the sputum test (Z) and the degree of image registration (W) may be omitted when estimating Y-values.
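The NLSP statistic itself reduces to a few lines: compute matching intensity percentiles for the registered region of the two radiographs and count how many decrease between visits. The eleven evenly spaced percentile levels below are an assumed choice for illustration, not necessarily those of the paper, and the images are synthetic.

```python
import numpy as np

def nlsp(img1, img2, levels=np.linspace(5, 95, 11)):
    """Count percentiles of image intensity that shift left (decrease)."""
    p1 = np.percentile(img1.ravel(), levels)
    p2 = np.percentile(img2.ravel(), levels)
    return int(np.sum(p2 < p1))

rng = np.random.default_rng(12)
visit1 = rng.normal(120, 30, (256, 256))
visit2 = visit1 - 8 + rng.normal(0, 5, (256, 256))  # generally darker second image
print("NLSP =", nlsp(visit1, visit2))               # close to 11 in this toy case
```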
A catalog of galaxy morphology and photometric redshift
NASA Astrophysics Data System (ADS)
Paul, Nicholas; Shamir, Lior
2018-01-01
Morphology carries important information about the physical characteristics of a galaxy. Here we used machine learning to produce a catalog of ~3,000,000 SDSS galaxies classified by their broad morphology into spiral and elliptical galaxies. Comparison of the catalog to Galaxy Zoo shows that the catalog contains a subset of 1.7*10^6 galaxies classified with the same level of consistency as the debiased "superclean" sub-sample. In addition to the morphology, we also computed the photometric redshifts of the galaxies. Several pattern recognition algorithms and variable selection strategies were tested, and the best accuracy, a mean absolute error of ~0.0062, was achieved by using random forest with a combination of manually and automatically selected variables. The catalog shows that for redshifts lower than 0.085, galaxies that visually look spiral become more prevalent as the redshift gets higher. For redshifts greater than 0.085, galaxies that visually look elliptical become more prevalent. The catalog as well as the source code used to produce it is publicly available at https://figshare.com/articles/Morphology_and_photometric_redshift_catalog/4833593 .
Scott, J.C.
1990-01-01
Computer software was written to randomly select sites for a ground-water-quality sampling network. The software uses digital cartographic techniques and subroutines from a proprietary geographic information system. The report presents the approaches, computer software, and sample applications. It is often desirable to collect ground-water-quality samples from various areas in a study region that have different values of a spatial characteristic, such as land-use or hydrogeologic setting. A stratified network can be used for testing hypotheses about relations between spatial characteristics and water quality, or for calculating statistical descriptions of water-quality data that account for variations that correspond to the spatial characteristic. In the software described, a study region is subdivided into areal subsets that have a common spatial characteristic to stratify the population into several categories from which sampling sites are selected. Different numbers of sites may be selected from each category of areal subsets. A population of potential sampling sites may be defined by either specifying a fixed population of existing sites, or by preparing an equally spaced population of potential sites. In either case, each site is identified with a single category, depending on the value of the spatial characteristic of the areal subset in which the site is located. Sites are selected from one category at a time. One of two approaches may be used to select sites. Sites may be selected randomly, or the areal subsets in the category can be grouped into cells and sites selected randomly from each cell.
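A minimal sketch of the stratified selection logic described above, with invented categories and per-category site counts:

```python
# Stratified random selection of sampling sites: each potential site carries
# the category of the areal subset containing it, and a different number of
# sites is drawn at random from each category.
import random

random.seed(0)
sites = [{"id": i,
          "category": random.choice(["urban", "agricultural", "forest"])}
         for i in range(500)]

sites_per_category = {"urban": 10, "agricultural": 15, "forest": 5}

network = []
for category, n in sites_per_category.items():
    candidates = [s for s in sites if s["category"] == category]
    network.extend(random.sample(candidates, n))

print(len(network), "sites selected")
```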
Feature selection for the classification of traced neurons.
López-Cabrera, José D; Lorenzo-Ginori, Juan V
2018-06-01
The great availability of computational tools to calculate the properties of traced neurons leads to the existence of many descriptors which allow the automated classification of neurons from these reconstructions. This situation creates the need to eliminate irrelevant features, as well as to select the most appropriate among them, in order to improve the quality of the classification obtained. The dataset used contains a total of 318 traced neurons, classified by human experts into 192 GABAergic interneurons and 126 pyramidal cells. The features were extracted by means of the L-measure software, which is one of the most widely used computational tools in neuroinformatics for quantifying traced neurons. We review some current feature selection techniques, namely filter, wrapper, embedded and ensemble methods. The stability of the feature selection methods was measured. For the ensemble methods, several aggregation methods based on different metrics were applied to combine the subsets obtained during the feature selection process. The subsets obtained by applying the feature selection methods were evaluated using supervised classifiers; Random Forest, C4.5, SVM, Naïve Bayes, Knn, Decision Table and the Logistic classifier were used as classification algorithms. Feature selection methods of the filter, embedded, wrapper and ensemble types were compared, and the subsets returned were tested in classification tasks with the different classification algorithms. The L-measure features EucDistanceSD, PathDistanceSD, Branch_pathlengthAve, Branch_pathlengthSD and EucDistanceAve were present in more than 60% of the selected subsets, which provides evidence of their importance in the classification of these neurons.
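Selection stability, mentioned above, can be quantified by rerunning a selector on resamples and averaging pairwise Jaccard similarity between the returned subsets; the sketch below uses a univariate F-test selector and synthetic data as stand-ins for the methods and L-measure descriptors actually studied.

```python
# Feature-selection stability via pairwise Jaccard over bootstrap resamples.
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(3)
X = rng.normal(size=(318, 40))
y = rng.integers(0, 2, 318)
X[:, :5] += y[:, None]                 # make the first five features informative

subsets = []
for _ in range(20):
    idx = rng.integers(0, len(y), len(y))               # bootstrap resample
    sel = SelectKBest(f_classif, k=8).fit(X[idx], y[idx])
    subsets.append(frozenset(np.flatnonzero(sel.get_support())))

jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print(f"mean stability (Jaccard): {np.mean(jaccard):.2f}")
```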
Kindergarten predictors of second versus eighth grade reading comprehension impairments.
Adlof, Suzanne M; Catts, Hugh W; Lee, Jaehoon
2010-01-01
Multiple studies have shown that kindergarten measures of phonological awareness and alphabet knowledge are good predictors of reading achievement in the primary grades. However, less attention has been given to the early predictors of later reading achievement. This study used a modified best-subsets variable-selection technique to examine kindergarten predictors of early versus later reading comprehension impairments. Participants included 433 children involved in a longitudinal study of language and reading development. The kindergarten test battery assessed various language skills in addition to phonological awareness, alphabet knowledge, naming speed, and nonverbal cognitive ability. Reading comprehension was assessed in second and eighth grades. Results indicated that different combinations of variables were required to optimally predict second versus eighth grade reading impairments. Although some variables effectively predicted reading impairments in both grades, their relative contributions shifted over time. These results are discussed in light of the changing nature of reading comprehension over time. Further research will help to improve the early identification of later reading disabilities.
NASA Astrophysics Data System (ADS)
Milovančević, Miloš; Nikolić, Vlastimir; Anđelković, Boban
2017-01-01
Vibration-based structural health monitoring is widely recognized as an attractive strategy for early damage detection in civil structures. Vibration monitoring and prediction are important for any system, since they can avert many unpredictable behaviors of the system. If vibration monitoring is properly managed, it can ensure economic and safe operation. Potential for further improvement of vibration monitoring lies in the improvement of current control strategies. One option is the introduction of model predictive control. Multistep-ahead predictive models of vibration are a starting point for creating a successful model predictive strategy. For the purpose of this article, predictive models are created for vibration monitoring of planetary power transmissions in pellet mills. The models were developed using a novel method based on ANFIS (adaptive neuro fuzzy inference system). The aim of this study is to investigate the potential of ANFIS for selecting the most relevant variables for predictive models of vibration monitoring of pellet mill power transmissions. The vibration data are collected by PIC (Programmable Interface Controller) microcontrollers. The goal of the predictive vibration monitoring of planetary power transmissions in pellet mills is to indicate deterioration in the vibration of the power transmissions before the actual failure occurs. The ANFIS process for variable selection was implemented in order to detect the predominant variables affecting the prediction of vibration monitoring. It was also used to select the minimal input subset of variables from the initial set of input variables, comprising current and lagged variables (up to 11 steps) of vibration. The obtained results could be used to simplify predictive methods so as to avoid multiple input variables. Models with fewer inputs were preferred because they are less prone to overfitting between training and testing data. While the obtained results are promising, further work is required to obtain results that could be directly applied in practice.
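The input layout behind the lag-selection experiment can be sketched as follows: current and lagged vibration values (up to 11 steps) are assembled as candidate inputs for a multistep-ahead target. The series is synthetic and the downstream ANFIS model is not shown.

```python
# Build current and lagged variables from a vibration series as candidate
# inputs for a multistep-ahead predictive model.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
t = np.arange(2000)
vibration = np.sin(0.05 * t) + 0.1 * rng.normal(size=t.size)  # toy sensor data

df = pd.DataFrame({"v": vibration})
for lag in range(1, 12):                     # lagged variables, up to 11 steps
    df[f"v_lag{lag}"] = df["v"].shift(lag)
df["target"] = df["v"].shift(-3)             # 3-steps-ahead prediction target
df = df.dropna()
print(df.columns.tolist())
```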
Yashin, Anatoliy I.; Arbeev, Konstantin G.; Wu, Deqing; Arbeeva, Liubov; Kulminski, Alexander; Kulminskaya, Irina; Akushevich, Igor; Ukraintseva, Svetlana V.
2016-01-01
Background and Objective: To clarify mechanisms of genetic regulation of human aging and longevity traits, a number of genome-wide association studies (GWAS) of these traits have been performed. However, the results of these analyses did not meet the expectations of the researchers. Most detected genetic associations have not reached a genome-wide level of statistical significance and have suffered from a lack of replication in studies of independent populations. The reasons for slow progress in this research area include the low efficiency of the statistical methods used in data analyses, the genetic heterogeneity of aging- and longevity-related traits, the possibility of pleiotropic (e.g., age-dependent) effects of genetic variants on such traits, underestimation of the effects of (i) mortality selection in genetically heterogeneous cohorts and (ii) external factors and differences in the genetic backgrounds of individuals in the populations under study, and the weakness of a conceptual biological framework that does not fully account for the above-mentioned factors. One more limitation of the conducted studies is that they did not fully realize the potential of longitudinal data, which allow for evaluating how genetic influences on life span are mediated by physiological variables and other biomarkers during the life course. The objective of this paper is to address these issues. Data and Methods: We performed GWAS of human life span using different subsets of data from the original Framingham Heart Study cohort corresponding to different quality control (QC) procedures, and used one subset of selected genetic variants for further analyses. We used a simulation study to show that this approach to combining data improves the quality of GWAS. We used FHS longitudinal data to compare average age trajectories of physiological variables in carriers and non-carriers of selected genetic variants. We used a stochastic process model of human mortality and aging to investigate genetic influence on hidden biomarkers of aging and on the dynamic interaction between aging and longevity. We investigated properties of genes related to the selected variants and their roles in signaling and metabolic pathways. Results: We showed that the use of different QC procedures results in different sets of genetic variants associated with life span. We selected 24 genetic variants negatively associated with life span. We showed that joint analyses of genetic data at the time of bio-specimen collection and follow-up data substantially improved the significance of the associations of the selected 24 SNPs with life span. We also showed that aging-related changes in physiological variables and in hidden biomarkers of aging differ between the groups of carriers and non-carriers of the selected variants. Conclusions: The results of these analyses demonstrated the benefits of using biodemographic models and methods in genetic association studies of these traits. Our findings showed that the absence of a large number of genetic variants with deleterious effects may make a substantial contribution to exceptional longevity. These effects are dynamically mediated by a number of physiological variables and hidden biomarkers of aging. The results of this research demonstrated the benefits of using integrative statistical models of mortality risks in genetic studies of human aging and longevity. PMID:27773987
Reiter, Harold I; Lockyer, Jocelyn; Ziola, Barry; Courneya, Carol-Ann; Eva, Kevin
2012-04-01
Traditional medical school admissions assessment tools may be limiting diversity. This study investigates whether the Multiple Mini-Interview (MMI) is diversity-neutral and, if so, whether applying it with greater weight would dilute the anticipated negative impact of diversity-limiting admissions measures. Interviewed applicants to six medical schools in 2008 and 2009 underwent MMI. Predictor variables of MMI scores, grade point average (GPA), and Medical College Admission Test (MCAT) scores were correlated with diversity measures of age, gender, size of community of origin, income level, and self-declared aboriginal status. A subset of the data was then combined with variable weight assigned to predictor variables to determine whether weighting during the applicant selection process would affect diversity among chosen applicants. MMI scores were unrelated to gender, size of community of origin, and income level. They correlated positively with age and negatively with aboriginal status. GPA and MCAT correlated negatively with age and aboriginal status, GPA correlated positively with income level, and MCAT correlated positively with size of community of origin. Even extreme combinations of MMI and GPA weightings failed to increase diversity among applicants who would be selected on the basis of weighted criteria. MMI could not neutralize the diversity-limiting properties of academic scores as selection criteria to interview. Using academic scores in this way causes range restriction, counteracting attempts to enhance diversity using downstream admissions selection measures such as MMI. Diversity efforts should instead be focused upstream. These results lend further support for the development of pipeline programs.
Approximate error conjugate gradient minimization methods
Kallman, Jeffrey S
2013-05-21
In one embodiment, a method includes selecting a subset of rays from a set of all rays to use in an error calculation for a constrained conjugate gradient minimization problem, calculating an approximate error using the subset of rays, and calculating a minimum in a conjugate gradient direction based on the approximate error. In another embodiment, a system includes a processor for executing logic, logic for selecting a subset of rays from a set of all rays to use in an error calculation for a constrained conjugate gradient minimization problem, logic for calculating an approximate error using the subset of rays, and logic for calculating a minimum in a conjugate gradient direction based on the approximate error. In other embodiments, computer program products, methods, and systems are described capable of using approximate error in constrained conjugate gradient minimization problems.
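A numerical sketch of the claimed idea, under the assumption that the rays form the rows of a least-squares system: the error and its gradient are evaluated on a random ray subset, and the conjugate-gradient line minimization uses that approximate error. Problem sizes and the subset fraction are invented.

```python
# Conjugate-gradient minimization against an approximate error computed on a
# random subset of rays (rows) of a least-squares tomography-style problem.
import numpy as np

rng = np.random.default_rng(5)
n_rays, n_vox = 2000, 100
A = rng.normal(size=(n_rays, n_vox))          # one row per ray
x_true = rng.normal(size=n_vox)
b = A @ x_true

x = np.zeros(n_vox)
d = None
g_prev = None
for it in range(50):
    S = rng.choice(n_rays, size=n_rays // 10, replace=False)   # ray subset
    As, bs = A[S], b[S]
    g = 2 * As.T @ (As @ x - bs)              # gradient of the approximate error
    if d is None:
        d = -g
    else:
        beta = (g @ g) / (g_prev @ g_prev)    # Fletcher-Reeves update
        d = -g + beta * d
    # Exact line minimum of the quadratic approximate error along direction d:
    alpha = -(g @ d) / (2 * d @ As.T @ As @ d)
    x = x + alpha * d
    g_prev = g

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```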
Complexities in Subsetting Satellite Level 2 Data
NASA Astrophysics Data System (ADS)
Huwe, P.; Wei, J.; Albayrak, A.; Silberstein, D. S.; Alfred, J.; Savtchenko, A. K.; Johnson, J. E.; Hearty, T.; Meyer, D. J.
2017-12-01
Satellite Level 2 data presents unique challenges for tools and services. From nonlinear spatial geometry to inhomogeneous file data structure to inconsistent temporal variables to complex data variable dimensionality to multiple file formats, there are many difficulties in creating general tools for Level 2 data support. At NASA Goddard Earth Sciences Data and Information Services Center (GES DISC), we are implementing a general Level 2 Subsetting service for Level 2 data. In this presentation, we will unravel some of the challenges faced in creating this service and the strategies we used to surmount them.
Choice: 36 band feature selection software with applications to multispectral pattern recognition
NASA Technical Reports Server (NTRS)
Jones, W. C.
1973-01-01
Feature selection software was developed at the Earth Resources Laboratory that is capable of inputting up to 36 channels and selecting channel subsets according to several criteria based on divergence. One of the criteria used is compatible with the table look-up classifier requirements. The software indicates which channel subset best separates (based on average divergence) each class from all other classes. The software employs an exhaustive search technique, and computer time is not prohibitive. A typical task to select the best 4 of 22 channels for 12 classes takes 9 minutes on a Univac 1108 computer.
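A scaled-down sketch of such an exhaustive search (8 channels and 3 classes rather than 22 and 12), scoring each candidate subset by the average pairwise symmetric KL divergence between class-conditional Gaussians:

```python
# Exhaustive channel-subset search maximizing average pairwise divergence.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n_ch, classes = 8, 3
means = rng.normal(size=(classes, n_ch))
covs = np.stack([np.eye(n_ch) * rng.uniform(0.5, 2.0, n_ch)
                 for _ in range(classes)])

def divergence(i, j, ch):
    """Symmetric KL divergence between class Gaussians on chosen channels."""
    m = means[i][list(ch)] - means[j][list(ch)]
    Ci, Cj = covs[i][np.ix_(ch, ch)], covs[j][np.ix_(ch, ch)]
    Ci_inv, Cj_inv = np.linalg.inv(Ci), np.linalg.inv(Cj)
    return 0.5 * (np.trace(Ci_inv @ Cj + Cj_inv @ Ci) - 2 * len(ch)
                  + m @ (Ci_inv + Cj_inv) @ m)

best = max(combinations(range(n_ch), 4),
           key=lambda ch: np.mean([divergence(i, j, ch)
                                   for i, j in combinations(range(classes), 2)]))
print("best 4-channel subset:", best)
```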
NASA Astrophysics Data System (ADS)
Hou, Yanqing; Verhagen, Sandra; Wu, Jie
2016-12-01
Ambiguity Resolution (AR) is a key technique in GNSS precise positioning. In case of weak models (i.e., low precision of data), however, the success rate of AR may be low, which may consequently introduce large errors into the baseline solution in cases of wrong fixing. Partial Ambiguity Resolution (PAR) has therefore been proposed so that the baseline precision can be improved by fixing only a subset of ambiguities with a high success rate. This contribution proposes a new PAR strategy that selects the subset such that the expected precision gain is maximized among a set of pre-selected subsets, while at the same time the failure rate is controlled. These pre-selected subsets are chosen to have the highest success rate among those of the same subset size. The strategy is called the Two-step Success Rate Criterion (TSRC), as it first tries to fix a relatively large subset, using the fixed failure rate ratio test (FFRT) to decide on acceptance or rejection. In case of rejection, a smaller subset is fixed and validated by the ratio test so as to fulfill the overall failure rate criterion. It is shown how the method can be used in practice without introducing a large additional computational effort, and, more importantly, how it can improve (or at least not deteriorate) the availability in terms of baseline precision compared to the classical Success Rate Criterion (SRC) PAR strategy, based on a simulation validation. In the simulation validation, significant improvements are obtained for single-GNSS on short baselines with dual-frequency observations. For dual-constellation GNSS, the improvement for single-frequency observations on short baselines is very significant, on average 68%. For medium to long baselines with dual-constellation GNSS, the average improvement is around 20-30%.
A Parameter Subset Selection Algorithm for Mixed-Effects Models
Schmidt, Kathleen L.; Smith, Ralph C.
2016-01-01
Mixed-effects models are commonly used to statistically model phenomena that include attributes associated with a population or general underlying mechanism as well as effects specific to individuals or components of the general mechanism. This can include individual effects associated with data from multiple experiments. However, the parameterizations used to incorporate the population and individual effects are often unidentifiable in the sense that parameters are not uniquely specified by the data. As a result, the current literature focuses on model selection, by which insensitive parameters are fixed or removed from the model. Model selection methods that employ information criteria are applicable to both linear and nonlinear mixed-effects models, but such techniques are limited in that they are computationally prohibitive for large problems due to the number of possible models that must be tested. To limit the scope of possible models for model selection via information criteria, we introduce a parameter subset selection (PSS) algorithm for mixed-effects models, which orders the parameters by their significance. In conclusion, we provide examples to verify the effectiveness of the PSS algorithm and to test the performance of mixed-effects model selection that makes use of parameter subset selection.
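One common ingredient of parameter subset selection, ordering parameters via column-pivoted QR on a sensitivity matrix, can be sketched as below; this is an illustrative stand-in, not the paper's PSS algorithm for mixed-effects models.

```python
# Rank parameters by significance with pivoted QR on a sensitivity matrix:
# parameters to which the output is least sensitive (or nearly redundant)
# appear last and are candidates to fix or remove.
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(7)
n_obs, n_par = 200, 6
S = rng.normal(size=(n_obs, n_par))                 # synthetic sensitivities
S[:, 5] = S[:, 0] + 1e-6 * rng.normal(size=n_obs)   # parameter 5: unidentifiable

_, _, piv = qr(S, pivoting=True)                    # column-pivoted QR
print("parameters ordered by significance:", piv)   # last entries: fix/remove
```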
Capela, Nicole A; Lemaire, Edward D; Baddour, Natalie
2015-01-01
Human activity recognition (HAR), using wearable sensors, is a growing area with the potential to provide valuable information on patient mobility to rehabilitation specialists. Smartphones with accelerometer and gyroscope sensors are a convenient, minimally invasive, and low cost approach for mobility monitoring. HAR systems typically pre-process raw signals, segment the signals, and then extract features to be used in a classifier. Feature selection is a crucial step in the process to reduce potentially large data dimensionality and provide viable parameters to enable activity classification. Most HAR systems are customized to an individual research group, including a unique data set, classes, algorithms, and signal features. These data sets are obtained predominantly from able-bodied participants. In this paper, smartphone accelerometer and gyroscope sensor data were collected from populations that can benefit from human activity recognition: able-bodied, elderly, and stroke patients. Data from a consecutive sequence of 41 mobility tasks (18 different tasks) were collected for a total of 44 participants. Seventy-six signal features were calculated and subsets of these features were selected using three filter-based, classifier-independent, feature selection methods (Relief-F, Correlation-based Feature Selection, Fast Correlation Based Filter). The feature subsets were then evaluated using three generic classifiers (Naïve Bayes, Support Vector Machine, j48 Decision Tree). Common features were identified for all three populations, although the stroke population subset had some differences from both able-bodied and elderly sets. Evaluation with the three classifiers showed that the feature subsets produced similar or better accuracies than classification with the entire feature set. Therefore, since these feature subsets are classifier-independent, they should be useful for developing and improving HAR systems across and within populations.
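The shape of this pipeline, filter selection followed by evaluation with generic classifiers, might look like the following; mutual information stands in for Relief-F/CFS/FCBF (which ship outside scikit-learn), and the data are synthetic.

```python
# Classifier-independent filter selection of signal features, then evaluation
# of the selected subset with several generic classifiers.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(440, 76))                  # 76 signal features
y = rng.integers(0, 18, 440)                    # 18 mobility-task classes
X[:, :10] += y[:, None] * 0.3                   # plant some informative features

X_sel = SelectKBest(mutual_info_classif, k=15).fit_transform(X, y)
for clf in (GaussianNB(), SVC(), DecisionTreeClassifier()):
    full = cross_val_score(clf, X, y, cv=5).mean()
    sel = cross_val_score(clf, X_sel, y, cv=5).mean()
    print(type(clf).__name__, f"full={full:.2f} selected={sel:.2f}")
```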
Defining an Essence of Structure Determining Residue Contacts in Proteins
Sathyapriya, R.; Duarte, Jose M.; Stehr, Henning; Filippis, Ioannis; Lappe, Michael
2009-01-01
The network of native non-covalent residue contacts determines the three-dimensional structure of a protein. However, not all contacts are of equal structural significance, and little knowledge exists about a minimal, yet sufficient, subset required to define the global features of a protein. Characterisation of this “structural essence” has remained elusive so far: no algorithmic strategy has been devised to-date that could outperform a random selection in terms of 3D reconstruction accuracy (measured as the Ca RMSD). It is not only of theoretical interest (i.e., for design of advanced statistical potentials) to identify the number and nature of essential native contacts—such a subset of spatial constraints is very useful in a number of novel experimental methods (like EPR) which rely heavily on constraint-based protein modelling. To derive accurate three-dimensional models from distance constraints, we implemented a reconstruction pipeline using distance geometry. We selected a test-set of 12 protein structures from the four major SCOP fold classes and performed our reconstruction analysis. As a reference set, series of random subsets (ranging from 10% to 90% of native contacts) are generated for each protein, and the reconstruction accuracy is computed for each subset. We have developed a rational strategy, termed “cone-peeling” that combines sequence features and network descriptors to select minimal subsets that outperform the reference sets. We present, for the first time, a rational strategy to derive a structural essence of residue contacts and provide an estimate of the size of this minimal subset. Our algorithm computes sparse subsets capable of determining the tertiary structure at approximately 4.8 Å Ca RMSD with as little as 8% of the native contacts (Ca-Ca and Cb-Cb). At the same time, a randomly chosen subset of native contacts needs about twice as many contacts to reach the same level of accuracy. This “structural essence” opens new avenues in the fields of structure prediction, empirical potentials and docking. PMID:19997489
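A minimal sketch of the contact-extraction and random-subset steps that underpin the reference-set experiment, with a toy chain in place of real coordinates and the distance-geometry reconstruction itself omitted:

```python
# Extract native contacts with a distance cutoff, then draw a random subset
# of a given fraction for a reconstruction trial.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
n_res = 120
coords = np.cumsum(rng.normal(scale=2.0, size=(n_res, 3)), axis=0)  # toy chain

cutoff = 8.0     # a commonly used Ca-Ca contact cutoff, in Angstroms
contacts = [(i, j) for i, j in combinations(range(n_res), 2)
            if j > i + 1                          # skip chain neighbors
            and np.linalg.norm(coords[i] - coords[j]) < cutoff]

fraction = 0.08  # the ~8% regime discussed above
k = max(1, int(fraction * len(contacts)))
subset = [contacts[i] for i in rng.choice(len(contacts), size=k, replace=False)]
print(f"{len(contacts)} native contacts, subset of {k} drawn")
```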
Longobardi, F; Ventrella, A; Bianco, A; Catucci, L; Cafagna, I; Gallo, V; Mastrorilli, P; Agostiano, A
2013-12-01
In this study, non-targeted ¹H NMR fingerprinting was used in combination with multivariate statistical techniques for the classification of Italian sweet cherries based on their different geographical origins (Emilia Romagna and Puglia). As classification techniques, Soft Independent Modelling of Class Analogy (SIMCA), Partial Least Squares Discriminant Analysis (PLS-DA), and Linear Discriminant Analysis (LDA) were carried out and the results were compared. For LDA, before performing a refined selection of the number/combination of variables, two different strategies for a preliminary reduction of the variable number were tested. The best average recognition and CV prediction abilities (both 100.0%) were obtained for all the LDA models, although PLS-DA also showed remarkable performance (94.6%). All the statistical models were validated by observing the prediction abilities with respect to an external set of cherry samples. The best result (94.9%) was obtained with LDA by performing a best subset selection procedure on a set of 30 principal components previously selected by stepwise decorrelation. The metabolites that contributed most to the classification performance of this LDA model were found to be malate, glucose, fructose, glutamine and succinate.
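A hedged sketch of the winning pipeline's shape (PCA to 30 components, then a stepwise subset search for LDA), using synthetic fingerprints and scikit-learn's SequentialFeatureSelector (available from version 0.24) as a stand-in for the authors' best-subset procedure:

```python
# PCA compression followed by forward selection of principal components for LDA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
X = rng.normal(size=(120, 300))                # toy NMR fingerprints
y = rng.integers(0, 2, 120)                    # two geographical origins
X[:, :20] += y[:, None] * 0.8

scores = PCA(n_components=30).fit_transform(X)
lda = LinearDiscriminantAnalysis()
sfs = SequentialFeatureSelector(lda, n_features_to_select=5, cv=5).fit(scores, y)
acc = cross_val_score(lda, sfs.transform(scores), y, cv=5).mean()
print("selected PCs:", np.flatnonzero(sfs.get_support()), f"CV accuracy={acc:.2f}")
```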
Lin, Kuan-Cheng; Hsieh, Yi-Hsiu
2015-10-01
The classification and analysis of data is an important issue in today's research. Selecting a suitable set of features makes it possible to classify an enormous quantity of data quickly and efficiently. Feature selection is generally framed as a feature subset selection task, i.e., a combinatorial optimization problem. Evolutionary algorithms using random search methods have proven highly effective in obtaining solutions to optimization problems in a diversity of applications. In this study, we developed a hybrid evolutionary algorithm based on endocrine-based particle swarm optimization (EPSO) and artificial bee colony (ABC) algorithms in conjunction with a support vector machine (SVM) for the selection of optimal feature subsets for the classification of datasets. The results of experiments using specific UCI medical datasets demonstrate that the proposed hybrid evolutionary algorithm achieves higher classification accuracy than the basic PSO, EPSO and ABC algorithms, using subsets with a reduced number of features.
A hybrid feature selection method using multiclass SVM for diagnosis of erythemato-squamous disease
NASA Astrophysics Data System (ADS)
Maryam, Setiawan, Noor Akhmad; Wahyunggoro, Oyas
2017-08-01
The diagnosis of erythemato-squamous disease is a complex problem, and the disease is difficult to detect in dermatology. Besides that, it is a major cause of skin cancer. Data mining implementation in the medical field helps experts to diagnose precisely, accurately, and inexpensively. In this research, we use data mining techniques to develop a diagnosis model based on multiclass SVM with a novel hybrid feature selection method to diagnose erythemato-squamous disease. Our hybrid feature selection method, named ChiGA (Chi Square and Genetic Algorithm), combines the advantages of filter and wrapper methods to select the optimal feature subset from the original features. Chi square is used as the filter method to remove redundant features, and GA as the wrapper method to select the ideal feature subset, with SVM used as the classifier. Experiments were performed with 10-fold cross-validation on the erythemato-squamous disease dataset taken from the University of California Irvine (UCI) machine learning database. The experimental results show that the proposed multiclass SVM model with Chi Square and GA yields an optimal feature subset: 18 optimum features with 99.18% accuracy.
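The filter stage of ChiGA might look like the sketch below; the GA wrapper stage is omitted, and the data are synthetic (sized like the UCI dermatology dataset: 366 instances, 34 attributes).

```python
# Chi-square filter ranking before a wrapper stage (not shown) refines the
# subset with an SVM. chi2 requires non-negative features, hence count-like data.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(11)
X = rng.integers(0, 4, size=(366, 34)).astype(float)   # 34 clinical attributes
y = rng.integers(0, 6, 366)                            # six disease classes
X[:, :6] += y[:, None]                                 # informative attributes

X_filt = SelectKBest(chi2, k=18).fit_transform(X, y)
print("accuracy:", cross_val_score(SVC(), X_filt, y, cv=10).mean().round(4))
```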
NASA Astrophysics Data System (ADS)
Beguet, Benoit; Guyon, Dominique; Boukir, Samia; Chehata, Nesrine
2014-10-01
The main goal of this study is to design a method to describe the structure of forest stands from Very High Resolution satellite imagery, relying on some typical variables such as crown diameter, tree height, trunk diameter, tree density and tree spacing. The emphasis is placed on automating the identification of the most relevant image features for the forest structure retrieval task, exploiting both spectral and spatial information. Our approach is based on linear regressions between the forest structure variables to be estimated and various spectral and Haralick's texture features. The main drawback of this well-known texture representation is its underlying parameters, which are extremely difficult to set due to the spatial complexity of the forest structure. To tackle this major issue, an automated feature selection process is proposed, based on statistical modeling and exploring a wide range of parameter values. It provides texture measures of diverse spatial parameters, hence implicitly inducing a multi-scale texture analysis. A new feature selection technique, which we call Random PRiF, is proposed. It relies on random sampling in feature space, carefully addresses the multicollinearity issue in multiple-linear regression, and ensures accurate prediction of forest variables. Our automated forest variable estimation scheme was tested on Quickbird and Pléiades panchromatic and multispectral images, acquired at different periods on the maritime pine stands of two sites in South-Western France. It outperforms two well-established variable subset selection techniques. It has been successfully applied to identify the best texture features in modeling the five considered forest structure variables. The RMSE of all predicted forest variables is improved by combining multispectral and panchromatic texture features, with various parameterizations, highlighting the potential of a multi-resolution approach for retrieving forest structure variables from VHR satellite images. Thus an average prediction error of ~1.1 m is expected on crown diameter, ~0.9 m on tree spacing, ~3 m on height and ~0.06 m on diameter at breast height.
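The multi-parameter texture extraction that this selection explores can be sketched with scikit-image (version >= 0.19 for these function names); the image is a random stand-in for the VHR data, and the distance/angle grid is illustrative.

```python
# Haralick texture features over a grid of co-occurrence distances and angles,
# implicitly producing a multi-scale candidate feature set for selection.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

rng = np.random.default_rng(12)
img = rng.integers(0, 64, size=(128, 128), dtype=np.uint8)  # quantized image

features = {}
for dist in (1, 2, 4, 8):                 # co-occurrence distances (pixels)
    for angle in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        glcm = graycomatrix(img, [dist], [angle], levels=64,
                            symmetric=True, normed=True)
        for prop in ("contrast", "homogeneity", "correlation", "energy"):
            features[(prop, dist, round(angle, 2))] = graycoprops(glcm, prop)[0, 0]

print(len(features), "candidate texture features")   # input to feature selection
```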
Deng, Changjian; Lv, Kun; Shi, Debo; Yang, Bo; Yu, Song; He, Zhiyi; Yan, Jia
2018-06-12
In this paper, a novel feature selection and fusion framework is proposed to enhance the discrimination ability of gas sensor arrays for odor identification. Firstly, we put forward an efficient feature selection method based on separability and dissimilarity to determine the feature selection order for each type of feature as the dimension of the selected feature subsets increases. Secondly, the K-nearest neighbor (KNN) classifier is applied to determine the dimensions of the optimal feature subsets for the different types of features. Finally, in establishing the feature fusion, we propose a classification-dominance feature fusion strategy that constructs an effective basic feature. Experimental results on two datasets show that the recognition rates on Database I and Database II reach 97.5% and 80.11%, respectively, when k = 1 for the KNN classifier and the distance metric is correlation distance (COR), which demonstrates the superiority of the proposed feature selection and fusion framework in representing signal features. The novel feature selection method proposed in this paper can effectively select feature subsets that are conducive to classification, while the feature fusion framework can fuse various features describing the different characteristics of sensor signals, enhancing the discrimination ability of gas sensors and, to a certain extent, suppressing the drift effect.
Estuary Data Mapper: A coastal information system to propel ...
The Estuary Data Mapper (EDM) is a free, interactive virtual gateway to coastal data aimed to promote research and aid in environmental management. The graphical user interface allows users to custom select and subset data based on their spatial and temporal interests giving them easy access to visualize, retrieve, and save data for further analysis. Data are accessible across estuarine systems of the Atlantic, Gulf of Mexico and Pacific regions of the United States and includes: (1) time series data including tidal, hydrologic, and weather, (2) water and sediment quality, (3) atmospheric deposition, (4) habitat, (5) coastal exposure indices, (6) historic and projected land-use and population, (7) historic and projected nitrogen and phosphorous sources and load summaries. EDM issues Web Coverage Service Interface Standard queries (WCS; simple, standard one-line text strings) to a public web service to quickly obtain data subsets by variable, for a date-time range and area selected by user. EDM is continuously being enhanced with updated data and new options. Recent additions include a comprehensive suite of nitrogen source and loading data, and inputs for supporting a modeling approach of seagrass habitat. Additions planned for the near future include 1) support for Integrated Water Resources Management cost-benefit analysis, specifically the Watershed Management Optimization Support Tool and 2) visualization of the combined effects of climate change, land-use a
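For illustration, a one-line WCS 2.0 GetCoverage query of the kind described has the following shape; the endpoint and coverage identifier are placeholders, and only the parameter names follow the OGC WCS standard.

```python
# Build an illustrative WCS GetCoverage subset query (one-line text string).
from urllib.parse import urlencode

base = "https://example.gov/wcs"          # hypothetical EDM-backing service
params = [
    ("service", "WCS"),
    ("version", "2.0.1"),
    ("request", "GetCoverage"),
    ("coverageId", "nitrogen_load_summary"),    # hypothetical variable name
    ("subset", "Lat(27.5,30.0)"),               # user-selected area...
    ("subset", "Long(-83.5,-80.0)"),
    ("subset", 'time("2010-01-01","2015-12-31")'),  # ...and date-time range
]
print(base + "?" + urlencode(params))
```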
Hall, S. A.; Burke, I.C.; Box, D. O.; Kaufmann, M. R.; Stoker, Jason M.
2005-01-01
The ponderosa pine forests of the Colorado Front Range, USA, have historically been subjected to wildfires. Recent large burns have increased public interest in fire behavior and effects, and scientific interest in the carbon consequences of wildfires. Remote sensing techniques can provide spatially explicit estimates of stand structural characteristics. Some of these characteristics can be used as inputs to fire behavior models, increasing our understanding of the effect of fuels on fire behavior. Others provide estimates of carbon stocks, allowing us to quantify the carbon consequences of fire. Our objective was to use discrete-return lidar to estimate such variables, including stand height, total aboveground biomass, foliage biomass, basal area, tree density, canopy base height and canopy bulk density. We developed 39 metrics from the lidar data, and used them in limited combinations in regression models, which we fit to field estimates of the stand structural variables. We used an information-theoretic approach to select the best model for each variable, and to select the subset of lidar metrics with most predictive potential. Observed versus predicted values of stand structure variables were highly correlated, with r2 ranging from 57% to 87%. The most parsimonious linear models for the biomass structure variables, based on a restricted dataset, explained between 35% and 58% of the observed variability. Our results provide us with useful estimates of stand height, total aboveground biomass, foliage biomass and basal area. There is promise for using this sensor to estimate tree density, canopy base height and canopy bulk density, though more research is needed to generate robust relationships. We selected 14 lidar metrics that showed the most potential as predictors of stand structure. We suggest that the focus of future lidar studies should broaden to include low density forests, particularly systems where the vertical structure of the canopy is important, such as fire prone forests.
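The information-theoretic selection step can be sketched as an AIC comparison over small candidate models, with synthetic lidar metrics and field estimates:

```python
# Fit candidate linear models from small combinations of lidar metrics and
# keep the one with the lowest AIC.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(13)
n_plots, n_metrics = 60, 10
L = rng.normal(size=(n_plots, n_metrics))             # lidar metrics per plot
biomass = 5 + L[:, 0] * 3 + L[:, 3] * 2 + rng.normal(0, 0.5, n_plots)

def aic(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])        # add intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    k = X1.shape[1] + 1                               # coefficients + variance
    return len(y) * np.log(rss / len(y)) + 2 * k

best = min((combo for r in (1, 2, 3)
            for combo in combinations(range(n_metrics), r)),
           key=lambda combo: aic(L[:, list(combo)], biomass))
print("best metric subset by AIC:", best)
```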
Criteria to Extract High-Quality Protein Data Bank Subsets for Structure Users.
Carugo, Oliviero; Djinović-Carugo, Kristina
2016-01-01
It is often necessary to build subsets of the Protein Data Bank to extract structural trends and average values. For this purpose it is mandatory that the subsets be non-redundant and of high quality. The first problem can be solved relatively easily at the sequence level or at the structural level. The second, on the contrary, needs special attention. It is not sufficient, in fact, to consider the crystallographic resolution; other features must be taken into account: the absence of strings of residues from the electron density maps and from the files deposited in the Protein Data Bank; the B-factor values; the appropriate validation of the structural models; the quality of the electron density maps, which is not uniform; and the temperature of the diffraction experiments. More stringent criteria produce smaller subsets, which can be enlarged with more tolerant selection criteria. The incessant growth of the Protein Data Bank, and especially of the number of high-resolution structures, is allowing the use of more stringent selection criteria, with a consequent improvement in the quality of the subsets of the Protein Data Bank.
Robust model selection and the statistical classification of languages
NASA Astrophysics Data System (ADS)
García, J. E.; González-López, V. A.; Viola, M. L. L.
2012-10-01
In this paper we address the problem of model selection for the set of finite memory stochastic processes with finite alphabet, when the data is contaminated. We consider m independent samples, with more than half of them being realizations of the same stochastic process with law Q, which is the one we want to retrieve. We devise a model selection procedure such that, for a sample size large enough, the selected process is the one with law Q. Our model selection strategy is based on estimating relative entropies to select a subset of samples that are realizations of the same law. Although the procedure is valid for any family of finite order Markov models, we focus on the family of variable length Markov chain models, which includes the fixed order Markov chain model family. We define the asymptotic breakdown point (ABDP) for a model selection procedure, and we derive the ABDP for our procedure. This means that if the proportion of contaminated samples is smaller than the ABDP, then, as the sample size grows, our procedure selects a model for the process with law Q. We also use our procedure in a setting where we have one sample formed by the concatenation of sub-samples of two or more stochastic processes, with most of the sub-samples having law Q. We conducted a simulation study. In the application section we address the question of the statistical classification of languages according to their rhythmic features using speech samples. This is an important open problem in phonology. A persistent difficulty with this problem is that the speech samples correspond to several sentences produced by diverse speakers, corresponding to a mixture of distributions. The usual procedure to deal with this problem has been to choose a subset of the original sample which seems to best represent each language, with the selection made by listening to the samples. In our application we use the full dataset without any preselection of samples. We apply our robust methodology, estimating a model which represents the main law for each language. Our findings agree with the linguistic conjecture related to the rhythm of the languages included in our dataset.
Compact cancer biomarkers discovery using a swarm intelligence feature selection algorithm.
Martinez, Emmanuel; Alvarez, Mario Moises; Trevino, Victor
2010-08-01
Biomarker discovery is a typical application of functional genomics. Due to the large number of genes studied simultaneously in microarray data, feature selection is a key step. Swarm intelligence has emerged as a solution to the feature selection problem. However, typical swarm intelligence settings for feature selection fail to select small feature subsets. We have proposed a swarm intelligence feature selection algorithm based on the initialization and update of only a subset of particles in the swarm. In this study, we tested our algorithm on 11 microarray datasets for brain, leukemia, lung, prostate, and other cancers. We show that the proposed swarm intelligence algorithm successfully increases the classification accuracy and decreases the number of selected features compared to other swarm intelligence methods.
Herrgård, Markus J.
2014-01-01
High-cell-density fermentation for industrial production of chemicals can impose numerous stresses on cells due to high substrate, product, and by-product concentrations; high osmolarity; reactive oxygen species; and elevated temperatures. There is a need to develop platform strains of industrial microorganisms that are more tolerant toward these typical processing conditions. In this study, the growth of six industrially relevant strains of Escherichia coli was characterized under eight stress conditions representative of fed-batch fermentation, and strains W and BL21(DE3) were selected as platforms for transposon (Tn) mutagenesis due to favorable resistance characteristics. Selection experiments, followed by either targeted or genome-wide next-generation-sequencing-based Tn insertion site determination, were performed to identify mutants with improved growth properties under a subset of three stress conditions and two combinations of individual stresses. A subset of the identified loss-of-function mutants was selected for a combinatorial approach, where strains with combinations of two and three gene deletions were systematically constructed and tested for single and multistress resistance. These approaches allowed identification of (i) strain-background-specific stress resistance phenotypes, (ii) novel gene deletion mutants in E. coli that confer single and multistress resistance in a strain-background-dependent manner, and (iii) synergistic effects of multiple gene deletions that confer improved resistance over single deletions. The results of this study underscore the suboptimality and strain-specific variability of the genetic network regulating growth under stressful conditions and suggest that further exploration of the combinatorial gene deletion space in multiple strain backgrounds is needed for optimizing strains for microbial bioprocessing applications. PMID:25085490
He, S C; Qiao, N; Sheng, W
2003-01-01
The purpose of our study was to determine the alteration of neurobehavioral parameters, autonomic nervous function and lymphocyte subsets in aluminum electrolytic workers with long-term aluminum exposure. The exposed group comprised 33 men, aged 35.16 +/- 2.95 years (mean +/- S.D.), occupationally exposed to aluminum for 14.91 +/- 6.31 years (mean +/- S.D.). Air Al levels and urinary aluminum concentrations were measured by graphite furnace atomic absorption spectrophotometry. A normal reference group was selected from a flour plant. The neurobehavioral core test battery (NCTB) recommended by the WHO was utilized. The autonomic nervous function test battery recommended by Ewing DJ was conducted on the subjects. FACScan was used to measure the lymphocyte subsets of peripheral blood. The mean air aluminum level in the workshop was 6.36 mg/m3, ranging from 2.90 to 11.38 mg/m3. Urinary aluminum of the Al electrolytic workers (40.08 +/- 9.36 microgram/mg.cre) was obviously higher than that of the control group (26.84 +/- 8.93 microgram/mg.cre). Neurobehavioral results showed that the scores of DSY, PAC and PA in Al electrolytic workers were significantly lower than those of the control group, while the scores of POMSC, POMSF and SRT among Al-exposed workers were significantly higher than those of the control group. Autonomic nervous function tests showed that the R-R interval variability of the maximum ratio on immediately standing up was decreased in Al electrolytic workers compared with the control group, while BP-IS, HR-V, HR-DB and R30:15 showed no significant change. Peripheral blood lymphocyte subset tests showed that the CD4-CD8+ T lymphocyte subset in Al electrolytic workers was increased. This study suggests that Al exposure exerts adverse effects on neurobehavioral performance, especially movement coordination and negative mood, and on parasympathetic nervous function; moreover, it increases the CD4-CD8+ T lymphocyte subset.
Decoys Selection in Benchmarking Datasets: Overview and Perspectives
Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu
2018-01-01
Virtual Screening (VS) is designed to prospectively help identify potential hits, i.e., compounds capable of interacting with a given target and potentially modulating its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is the most adapted to the query/target system under study and that yields the most reliable output. To this aim, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compounds subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds that has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoys selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509
An improved wrapper-based feature selection method for machinery fault diagnosis
2017-01-01
A major issue in machinery fault diagnosis using vibration signals is that it is over-reliant on personnel knowledge and experience in interpreting the signal. Thus, machine learning has been adopted for machinery fault diagnosis. The quantity and quality of the input features, however, influence the fault classification performance. Feature selection plays a vital role in selecting the most representative feature subset for the machine learning algorithm. However, a trade-off between the capability of selecting the best feature subset and the computational effort is inevitable in the wrapper-based feature selection (WFS) method. This paper proposes an improved WFS technique integrated with a support vector machine (SVM) classifier as a complete fault diagnosis system for a rolling element bearing case study. The bearing vibration dataset made available by the Case Western Reserve University Bearing Data Centre was processed using the proposed WFS, and its performance has been analysed and discussed. The results reveal that the proposed WFS secures the best feature subset with a lower computational effort by eliminating the redundancy of re-evaluation. The proposed WFS has therefore been found to be capable and efficient in carrying out feature selection tasks. PMID:29261689
Zawbaa, Hossam M; Szlȩk, Jakub; Grosan, Crina; Jachowicz, Renata; Mendyk, Aleksander
2016-01-01
Poly-lactide-co-glycolide (PLGA) is a copolymer of lactic and glycolic acid. Drug release from PLGA microspheres depends not only on polymer properties but also on drug type, particle size, morphology of microspheres, release conditions, etc. Selecting a subset of relevant properties for PLGA is a challenging machine learning task as there are over three hundred features to consider. In this work, we formulate the selection of critical attributes for PLGA as a multiobjective optimization problem with the aim of minimizing the error of predicting the dissolution profile while reducing the number of attributes selected. Four bio-inspired optimization algorithms: antlion optimization, binary version of antlion optimization, grey wolf optimization, and social spider optimization are used to select the optimal feature set for predicting the dissolution profile of PLGA. Besides these, LASSO algorithm is also used for comparisons. Selection of crucial variables is performed under the assumption that both predictability and model simplicity are of equal importance to the final result. During the feature selection process, a set of input variables is employed to find minimum generalization error across different predictive models and their settings/architectures. The methodology is evaluated using predictive modeling for which various tools are chosen, such as Cubist, random forests, artificial neural networks (monotonic MLP, deep learning MLP), multivariate adaptive regression splines, classification and regression tree, and hybrid systems of fuzzy logic and evolutionary computations (fugeR). The experimental results are compared with the results reported by Szlȩk. We obtain a normalized root mean square error (NRMSE) of 15.97% versus 15.4%, and the number of selected input features is smaller, nine versus eleven.
Which products are available for subsetting?
Atmospheric Science Data Center
2014-12-08
Allows users to create smaller files (subsets) of the original data by selecting desired parameters, parameter criteria, or latitude and longitude ranges. Among the subsettable products are fluxes where the net flux is constrained to the global heat storage (in netCDF format) and Single Scanner Footprint TOA/Surface Fluxes.
New TES Search and Subset Application
Atmospheric Science Data Center
2017-08-23
Wednesday, September 19, 2012: The Atmospheric Science Data Center (ASDC) at NASA Langley Research Center, in collaboration ..., is pleased to announce the release of the TES Search and Subset Web Application for select TES Level 2 products.
Complexities in Subsetting Level 2 Data
NASA Technical Reports Server (NTRS)
Huwe, Paul; Wei, Jennifer; Meyer, David; Silberstein, David S.; Alfred, Jerome; Savtchenko, Andrey K.; Johnson, James E.; Albayrak, Arif; Hearty, Thomas
2017-01-01
Satellite Level 2 data presents unique challenges for tools and services. From nonlinear spatial geometry to inhomogeneous file data structure to inconsistent temporal variables to complex data variable dimensionality to multiple file formats, there are many difficulties in creating general tools for Level 2 data support. At NASA Goddard Earth Sciences Data and Information Services Center (GES DISC), we are implementing a general Level 2 Subsetting service for Level 2 data to a user-specified spatio-temporal region of interest (ROI). In this presentation, we will unravel some of the challenges faced in creating this service and the strategies we used to surmount them.
NASA Astrophysics Data System (ADS)
Kwon, Ki-Won; Cho, Yongsoo
This letter presents a simple joint estimation method for residual frequency offset (RFO) and sampling frequency offset (SFO) in OFDM-based digital video broadcasting (DVB) systems. The proposed method selects a continual pilot (CP) subset from an unsymmetrically and non-uniformly distributed CP set to obtain an unbiased estimator. Simulation results show that the proposed method using a properly selected CP subset is unbiased and performs robustly.
Cai, Jun; Wang, Hua; Zhou, Sheng; Wu, Bin; Song, Hua-Rong; Xuan, Zheng-Rong
2008-01-01
To observe the effect of perioperative application of Sijunzi Decoction and enteral nutrition on T-cell subsets and nutritional status in patients with gastric cancer after operation. In this prospective, single-blinded, controlled clinical trial, fifty-nine patients with gastric cancer were randomly divided into three groups: control group (n=20) and two study groups (group A, n=21; group B, n=18). Sijunzi Decoction (100 ml) was administered via nasogastric tube to the patients in study group B from the second to the ninth postoperative day. Patients in the two study groups were given an isocaloric and isonitrogenous enteral diet, which was started on the second day after operation and continued for eight days. Patients in the control group were given an isocaloric and isonitrogenous parenteral diet for 9 days. All variables of nutritional status such as serum albumin (ALB), prealbumin (PA), transferrin (TRF) and T-cell subsets were measured one day before operation, and one day and 10 days after operation. All the nutritional variables and the levels of CD3(+), CD4(+), CD4(+)/CD8(+) were decreased significantly after operation. Ten days after operation, T-cell subsets and nutritional variables in the two study groups were increased as compared with the control group. The levels of ALB, TRF and T-cell subsets in study group B were increased significantly as compared with study group A (P<0.05). Enteral nutrition assisted with Sijunzi Decoction can positively improve and optimize cellular immune function and nutritional status in patients with gastric cancer after operation.
SU-E-J-128: Two-Stage Atlas Selection in Multi-Atlas-Based Image Segmentation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhao, T; Ruan, D
2015-06-15
Purpose: In the new era of big data, multi-atlas-based image segmentation is challenged by heterogeneous atlas quality and the high computation burden from extensive atlas collections, demanding efficient identification of the most relevant atlases. This study aims to develop a two-stage atlas selection scheme to achieve computational economy with a performance guarantee. Methods: We develop a low-cost fusion set selection scheme by introducing a preliminary selection to trim the full atlas collection into an augmented subset, alleviating the need for extensive full-fledged registrations. More specifically, fusion set selection is performed in two successive steps: preliminary selection and refinement. An augmented subset is first roughly selected from the whole atlas collection with a simple registration scheme and the corresponding preliminary relevance metric; the augmented subset is further refined into the desired fusion set size, using full-fledged registration and the associated relevance metric. The main novelty of this work is the introduction of an inference model to relate the preliminary and refined relevance metrics, based on which the augmented subset size is rigorously derived to ensure the desired atlases survive the preliminary selection with high probability. Results: The performance and complexity of the proposed two-stage atlas selection method were assessed using a collection of 30 prostate MR images. It achieved comparable segmentation accuracy to the conventional one-stage method with full-fledged registration, but significantly reduced computation time to 1/3 (from 30.82 to 11.04 min per segmentation). Compared with an alternative one-stage cost-saving approach, the proposed scheme yielded superior performance, with mean and median DSC of (0.83, 0.85) compared to (0.74, 0.78). Conclusion: This work has developed a model-guided two-stage atlas selection scheme to achieve significant cost reduction while guaranteeing high segmentation accuracy. The benefit in both complexity and performance is expected to be most pronounced with large-scale heterogeneous data.
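The two-stage idea above can be summarized in a few lines: score everything cheaply, keep an augmented subset, then spend the expensive metric only on that subset. The sketch below assumes hypothetical cheap_score and costly_score functions in place of the study's registration-based relevance metrics.

```python
# Illustrative two-stage selection: preliminary ranking with a cheap
# score, refinement of the augmented subset with an expensive score.
import numpy as np

def two_stage_select(atlases, target, cheap_score, costly_score,
                     augmented_size=10, fusion_size=3):
    # Stage 1: preliminary selection with the low-cost metric.
    prelim = np.array([cheap_score(a, target) for a in atlases])
    augmented = np.argsort(prelim)[::-1][:augmented_size]
    # Stage 2: full-fledged scoring restricted to the augmented subset.
    refined = np.array([costly_score(atlases[i], target) for i in augmented])
    fusion = augmented[np.argsort(refined)[::-1][:fusion_size]]
    return fusion  # indices of atlases retained for label fusion
```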
Feature Selection for Nonstationary Data: Application to Human Recognition Using Medical Biometrics.
Komeili, Majid; Louis, Wael; Armanfard, Narges; Hatzinakos, Dimitrios
2018-05-01
Electrocardiogram (ECG) and transient evoked otoacoustic emission (TEOAE) are among the physiological signals that have attracted significant interest in the biometric community due to their inherent robustness to replay and falsification attacks. However, they are time-dependent signals, and this makes them hard to deal with in an across-session human recognition scenario where only one session is available for enrollment. This paper presents a novel feature selection method to address this issue. It is based on an auxiliary dataset with multiple sessions, from which it selects a subset of features that are more persistent across different sessions. It uses local information in terms of sample margins while enforcing an across-session measure. This makes it a perfect fit for the aforementioned biometric recognition problem. Comprehensive experiments on ECG and TEOAE variability due to time lapse and body posture are performed. The performance of the proposed method is compared against seven state-of-the-art feature selection algorithms as well as another six approaches in the area of ECG and TEOAE biometric recognition. Experimental results demonstrate that the proposed method performs noticeably better than the other algorithms.
GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets.
Jeong, Seongmun; Kim, Jae-Yoon; Jeong, Soon-Chun; Kang, Sung-Taeg; Moon, Jung-Kyung; Kim, Namshin
2017-01-01
Selecting core subsets from plant genotype datasets is important for enhancing cost-effectiveness and shortening the time required for analyses in genome-wide association studies (GWAS), genomics-assisted breeding of crop species, etc. Recently, a large number of genetic markers (>100,000 single nucleotide polymorphisms) have been identified from high-density single nucleotide polymorphism (SNP) arrays and next-generation sequencing (NGS) data. However, no software has been available for picking out an efficient and consistent core subset from such a huge dataset. It is necessary to develop software that can coherently extract the genetically important samples in a population. We here present a new program, GenoCore, which can quickly and efficiently find a core subset representing the entire population. We introduce simple measures of coverage and diversity scores, which reflect genotype errors and genetic variations, and can help to select samples rapidly and accurately from crop genotype datasets. Comparisons of our method to other core collection software using example datasets are performed to validate the performance according to genetic distance, diversity, coverage, required system resources, and the number of selected samples. GenoCore selects the smallest, most consistent, and most representative core collection from all samples, using less memory with more efficient scores, and shows greater genetic coverage compared to the other software tested. GenoCore was written in the R language, and can be accessed online with an example dataset and test results at https://github.com/lovemun/Genocore.
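As a hedged illustration of the coverage-score idea (not GenoCore's exact R implementation), the following greedy loop adds, at each step, the sample that most increases the fraction of per-marker genotype classes already represented in the core.

```python
# Greedy core-subset selection driven by a simple coverage score.
import numpy as np

def coverage(core_rows, geno):
    """Fraction of per-marker genotype classes in `geno` covered by the core."""
    covered, total = 0, 0
    for m in range(geno.shape[1]):
        classes = np.unique(geno[:, m])
        total += len(classes)
        covered += len(np.intersect1d(classes, geno[core_rows, m]))
    return covered / total

def greedy_core(geno, target_coverage=0.99):
    core, remaining = [], list(range(geno.shape[0]))
    while remaining and (not core or coverage(core, geno) < target_coverage):
        gains = [coverage(core + [i], geno) for i in remaining]
        best = remaining[int(np.argmax(gains))]
        core.append(best)
        remaining.remove(best)
    return core  # row indices of the selected core samples
```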
Performance Analysis of Relay Subset Selection for Amplify-and-Forward Cognitive Relay Networks
Qureshi, Ijaz Mansoor; Malik, Aqdas Naveed; Zubair, Muhammad
2014-01-01
Cooperative communication is regarded as a key technology in wireless networks, including cognitive radio networks (CRNs); it increases the diversity order of the signal to combat the unfavorable effects of fading channels by allowing distributed terminals to collaborate through sophisticated signal processing. Underlay CRNs place strict interference constraints on the secondary users (SUs) active in the frequency band of the primary users (PUs), which limits their transmit power and their coverage area. Relay selection offers a potential solution to the challenges faced by underlay networks, by selecting either the single best relay or a subset of the potential relay set under different design requirements and assumptions. The best relay selection schemes proposed in the literature for amplify-and-forward (AF) based underlay cognitive relay networks have been very well studied in terms of outage probability (OP) and bit error rate (BER), whereas such analysis is still lacking for multiple relay selection schemes. The novelty of this work is to study the outage behavior of multiple relay selection in the underlay CRN and derive closed-form expressions for the OP and BER through the cumulative distribution function (CDF) of the SNR received at the destination. The effectiveness of relay subset selection is shown through simulation results. PMID:24737980
NASA Technical Reports Server (NTRS)
Prater, T.; Tilson, W.; Jones, Z.
2015-01-01
The absence of an economy of scale in spaceflight hardware makes additive manufacturing an immensely attractive option for propulsion components. As additive manufacturing techniques are increasingly adopted by government and industry to produce propulsion hardware in human-rated systems, significant development efforts are needed to establish these methods as reliable alternatives to conventional subtractive manufacturing. One of the critical challenges facing powder bed fusion techniques in this application is variability between machines used to perform builds. Even with implementation of robust process controls, it is possible for two machines operating at identical parameters with equivalent base materials to produce specimens with slightly different material properties. The machine variability study presented here evaluates 60 specimens of identical geometry built using the same parameters. 30 samples were produced on machine 1 (M1) and the other 30 samples were built on machine 2 (M2). Each of the 30-sample sets was further subdivided into three subsets (with 10 specimens in each subset) to assess the effect of progressive heat treatment on machine variability. The three categories for post-processing were: stress relief; stress relief followed by hot isostatic press (HIP); and stress relief followed by HIP followed by heat treatment per AMS 5664. Each specimen (a round, smooth tensile specimen) was mechanically tested per ASTM E8. Two formal statistical techniques, hypothesis testing for equivalency of means and one-way analysis of variance (ANOVA), were applied to characterize the impact of machine variability and heat treatment on five material properties: tensile stress, yield stress, modulus of elasticity, fracture elongation, and reduction of area. This work represents the type of development effort that is critical as NASA, academia, and the industrial base work collaboratively to establish a path to certification for additively manufactured parts. For future flight programs, NASA and its commercial partners will procure parts from vendors who will use a diverse range of machines to produce parts and, as such, it is essential that the AM community develop a sound understanding of the degree to which machine variability impacts material properties.
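The two statistical techniques named above are standard and easy to reproduce; the sketch below applies a two-one-sided-tests (TOST) equivalence check between machines and a one-way ANOVA across heat-treatment subsets to synthetic data. The equivalence margin and the numbers are illustrative, not values from the study.

```python
# TOST equivalence of machine means plus one-way ANOVA across subsets.
# Requires scipy >= 1.6 for the `alternative` keyword of ttest_ind.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m1 = rng.normal(1400, 20, 30)   # e.g., a strength property, machine 1
m2 = rng.normal(1405, 20, 30)   # machine 2

# TOST with an illustrative +/- 25 unit margin:
# reject both one-sided nulls => means are equivalent within the margin.
low = stats.ttest_ind(m1, m2 - 25, alternative="greater").pvalue
high = stats.ttest_ind(m1, m2 + 25, alternative="less").pvalue
print("equivalent within margin:", max(low, high) < 0.05)

# One-way ANOVA across three post-processing subsets of machine 1.
sr, hip, ht = m1[:10], m1[10:20], m1[20:30]
print(stats.f_oneway(sr, hip, ht))
```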
Low-resolution Australasian palaeoclimate records of the last 2000 years
NASA Astrophysics Data System (ADS)
Dixon, Bronwyn C.; Tyler, Jonathan J.; Lorrey, Andrew M.; Goodwin, Ian D.; Gergis, Joëlle; Drysdale, Russell N.
2017-10-01
Non-annually resolved palaeoclimate records in the Australasian region were compiled to facilitate investigations of decadal to centennial climate variability over the past 2000 years. A total of 675 lake and wetland, geomorphic, marine, and speleothem records were identified. The majority of records are located near population centres in southeast Australia, in New Zealand, and across the maritime continent, and there are few records from the arid regions of central and western Australia. Each record was assessed against a set of a priori criteria based on temporal resolution, record length, dating methods, and confidence in the proxy-climate relationship over the Common Era. A subset of 22 records met the criteria and were endorsed for subsequent analyses. Chronological uncertainty was the primary reason why records did not meet the selection criteria. New chronologies based on Bayesian techniques were constructed for the high-quality subset to ensure a consistent approach to age modelling and quantification of age uncertainties. The primary reasons for differences between published and reconstructed age-depth models were the consideration of the non-singular distribution of ages in calibrated 14C dates and the use of estimated autocorrelation between sampled depths as a constraint for changes in accumulation rate. Existing proxies and reconstruction techniques that successfully capture climate variability in the region show potential to address spatial gaps and expand the range of climate variables covering the last 2000 years in the Australasian region. Future palaeoclimate research and records in Australasia could be greatly improved through three main actions: (i) greater data availability through the public archiving of published records; (ii) thorough characterisation of proxy-climate relationships through site monitoring and climate sensitivity tests; and (iii) improvement of chronologies through core-top dating, inclusion of tephra layers where possible, and increased date density during the Common Era.
Baldini, Luca; Goldaniga, Maria; Guffanti, Andrea; Broglia, Chiara; Cortelazzo, Sergio; Rossi, Andrea; Morra, Enrica; Colombi, Mariangela; Callea, Vincenzo; Pogliani, Enrico; Ilariucci, Fiorella; Luminari, Stefano; Morel, Pierre; Merlini, Giampaolo; Gobbi, Paolo
2005-07-20
To evaluate the clinicohematologic variables at diagnosis that are prognostically related to neoplastic progression in patients with immunoglobulin M (IgM) monoclonal gammopathies of undetermined significance (MGUS), and indolent Waldenström's macroglobulinemia (IWM), and propose a scoring system to identify subsets of patients at different risk. We evaluated 217 patients with IgM MGUS and 201 with IWM (male-female ratio, 131:86 and 117:84; mean age, 63.7 and 63.6 years, respectively) diagnosed on the basis of serum monoclonal component (MC) levels and bone marrow lymphoplasmacytic infiltration degree. The variables selected by univariate analyses were multivariately investigated; on the basis of their individual relative hazards, a scoring system was devised to identify subsets of patients at different risk of evolution. After a median follow-up of 56.1 and 60.2 months, 15 of 217 MGUS and 45 of 201 IWM patients, respectively, required chemotherapy for symptomatic WM (13 and 36), non-Hodgkin's lymphoma (2 and 6) and amyloidosis (0 and 3). The median time to evolution (TTE) was not reached for MGUS and was 141.5 months for IWM. The variables adversely related to evolution were qualitatively the same in both groups: MC levels, Hb concentrations and sex. A scoring system based on these parameters identified three risk groups with highly significant differences in TTE in both groups (P < .0001). MGUS and IWM identify disease entities with different propensities for symptomatic neoplastic evolution. As both have the same prognostic determinants of progression, we propose a practical scoring system that, identifying different risks of malignant evolution, may allow an individualized clinical approach.
In Silico Syndrome Prediction for Coronary Artery Disease in Traditional Chinese Medicine
Lu, Peng; Chen, Jianxin; Zhao, Huihui; Gao, Yibo; Luo, Liangtao; Zuo, Xiaohan; Shi, Qi; Yang, Yiping; Yi, Jianqiang; Wang, Wei
2012-01-01
Coronary artery disease (CAD) is one of the leading causes of death in the world. The differentiation of syndrome (ZHENG) is the criterion of diagnosis and treatment in TCM. Therefore, in silico syndrome prediction can improve the performance of treatment. In this paper, we present a Bayesian network framework to construct a high-confidence syndrome predictor based on the optimum symptom subset, which is collected by Support Vector Machine (SVM) feature selection. Syndromes of CAD can be divided into asthenia and sthenia syndromes. According to the hierarchical characteristics of syndrome, we first label every case with one of three types of syndrome (asthenia, sthenia, or both) to accommodate patients presenting with several syndromes. On the basis of these three syndrome classes, we apply SVM feature selection to obtain the optimum symptom subset and compare this subset with Markov blanket feature selection using ROC. Using this subset, six predictors of CAD's syndromes are constructed by the Bayesian network technique. We also evaluate Naïve Bayes, C4.5, Logistic, and Radial basis function (RBF) network classifiers for comparison with the Bayesian network. In conclusion, the Bayesian network method based on the optimum symptom subset provides a practical way to predict the six syndromes of CAD in TCM. PMID:22567030
Transferable Output ASCII Data (TOAD) editor version 1.0 user's guide
NASA Technical Reports Server (NTRS)
Bingel, Bradford D.; Shea, Anne L.; Hofler, Alicia S.
1991-01-01
The Transferable Output ASCII Data (TOAD) editor is an interactive software tool for manipulating the contents of TOAD files. The TOAD editor is specifically designed to work with tabular data. Selected subsets of data may be displayed to the user's screen, sorted, exchanged, duplicated, removed, replaced, inserted, or transferred to and from external files. It also offers a number of useful features including on-line help, macros, a command history, an 'undo' option, variables, and a full complement of mathematical functions and conversion factors. Written in ANSI FORTRAN 77 and completely self-contained, the TOAD editor is very portable and has already been installed on SUN, SGI/IRIS, and CONVEX hosts.
Forina, M; Oliveri, P; Bagnasco, L; Simonetti, R; Casolino, M C; Nizzi Grifi, F; Casale, M
2015-11-01
An authentication study of the Italian PDO (Protected Designation of Origin) olive oil Chianti Classico, based on artificial nose, near-infrared and UV-visible spectroscopy, with a set of samples representative of the whole Chianti Classico production area and a considerable number of samples from other Italian PDO regions was performed. The signals provided by the three analytical techniques were used both individually and jointly, after fusion of the respective variables, in order to build a model for the Chianti Classico PDO olive oil. Different signal pre-treatments were performed in order to investigate their importance and their effects in enhancing and extracting information from experimental data, correcting backgrounds or removing baseline variations. Stepwise-Linear Discriminant Analysis (STEP-LDA) was used as a feature selection technique and, afterward, Linear Discriminant Analysis (LDA) and the class-modelling technique Quadratic Discriminant Analysis-UNEQual dispersed classes (QDA-UNEQ) were applied to sub-sets of selected variables, in order to obtain efficient models capable of characterising the extra virgin olive oils produced in the Chianti Classico PDO area.
Genetic component of flammability variation in a Mediterranean shrub.
Moreira, B; Castellanos, M C; Pausas, J G
2014-03-01
Recurrent fires impose a strong selection pressure in many ecosystems worldwide. In such ecosystems, plant flammability is of paramount importance because it enhances population persistence, particularly in non-resprouting species. Indeed, there is evidence of phenotypic divergence of flammability under different fire regimes. Our general hypothesis is that flammability-enhancing traits are adaptive; here, we test whether they have a genetic component. To test this hypothesis, we used the postfire obligate seeder Ulex parviflorus from sites historically exposed to different fire recurrence. We associated molecular variation in potentially adaptive loci detected with a genomic scan (using AFLP markers) with individual phenotypic variability in flammability across fire regimes. We found that at least 42% of the phenotypic variation in flammability was explained by the genetic divergence in a subset of AFLP loci. In spite of generalized gene flow, the genetic variability was structured by differences in fire recurrence. Our results provide the first field evidence that traits enhancing plant flammability have a genetic component and thus can be responding to natural selection driven by fire. These results highlight the importance of flammability as an adaptive trait in fire-prone ecosystems.
Libbrecht, Maxwell W; Bilmes, Jeffrey A; Noble, William Stafford
2018-04-01
Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
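A minimal sketch of the greedy algorithm that underlies this kind of approach, here with a facility-location objective over an arbitrary pairwise similarity matrix; for nonnegative monotone submodular objectives the greedy solution is within a (1 - 1/e) factor of optimal. This is a generic illustration, not the authors' mixture objective.

```python
# Greedy maximization of a facility-location submodular objective.
import numpy as np

def facility_location(sim, subset):
    # Each sequence is "represented" by its most similar selected sequence.
    if not subset:
        return 0.0
    return sim[:, subset].max(axis=1).sum()

def greedy_select(sim, k):
    selected = []
    for _ in range(k):
        candidates = [j for j in range(sim.shape[1]) if j not in selected]
        gains = [facility_location(sim, selected + [j]) for j in candidates]
        selected.append(candidates[int(np.argmax(gains))])
    return selected  # indices of the representative subset
```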
An Active RBSE Framework to Generate Optimal Stimulus Sequences in a BCI for Spelling
NASA Astrophysics Data System (ADS)
Moghadamfalahi, Mohammad; Akcakaya, Murat; Nezamfar, Hooman; Sourati, Jamshid; Erdogmus, Deniz
2017-10-01
A class of brain computer interfaces (BCIs) employs noninvasive recordings of electroencephalography (EEG) signals to enable users with severe speech and motor impairments to interact with their environment and social network. For example, EEG-based BCIs for typing popularly utilize event related potentials (ERPs) for inference. Presentation paradigms in current ERP-based letter-by-letter typing BCIs typically query the user with an arbitrary subset of characters. However, typing accuracy and typing speed can potentially be enhanced with more informed subset selection and flash assignment. In this manuscript, we introduce the active recursive Bayesian state estimation (active-RBSE) framework for inference and sequence optimization. Prior to presentation in each iteration, rather than showing a subset of randomly selected characters, the developed framework optimally selects a subset based on a query function. Selected queries are made adaptively specialized for users during each intent detection. Through a simulation-based study, we assess the effect of active-RBSE on the performance of a language-model assisted typing BCI in terms of typing speed and accuracy. To provide a baseline for comparison, we also utilize standard presentation paradigms, namely the row-and-column matrix presentation paradigm and random rapid serial visual presentation paradigms. The results show that utilization of active-RBSE can enhance the online performance of the system, both in terms of typing accuracy and speed.
Deploying a quantum annealing processor to detect tree cover in aerial imagery of California
Basu, Saikat; Ganguly, Sangram; Michaelis, Andrew; Mukhopadhyay, Supratik; Nemani, Ramakrishna R.
2017-01-01
Quantum annealing is an experimental and potentially breakthrough computational technology for handling hard optimization problems, including problems of computer vision. We present a case study in training a production-scale classifier of tree cover in remote sensing imagery, using early-generation quantum annealing hardware built by D-Wave Systems, Inc. Beginning within a known boosting framework, we train decision stumps on texture features and vegetation indices extracted from four-band, one-meter-resolution aerial imagery from the state of California. We then impose a regulated quadratic training objective to select an optimal voting subset from among these stumps. The votes of the subset define the classifier. For optimization, the logical variables in the objective function map to quantum bits in the hardware device, while quadratic couplings are encoded as the strength of physical interactions between the quantum bits. Hardware design limits the number of couplings between these basic physical entities to five or six. To account for this limitation in mapping large problems to the hardware architecture, we propose a truncation and rescaling of the training objective through a trainable metaparameter. The boosting process on our basic 108- and 508-variable problems, thus constituted, returns classifiers that incorporate a diverse range of color- and texture-based metrics and discriminate tree cover with accuracies as high as 92% in validation and 90% on a test scene encompassing the open space preserves and dense suburban build of Mill Valley, CA. PMID:28241028
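The QUBO structure described above (linear terms for stump quality, quadratic couplings between stumps) can be stated compactly; the sketch below solves a toy instance by classical exhaustive search in place of the annealer, and h and J are illustrative stand-ins for the paper's regulated training objective.

```python
# Toy QUBO for voting-subset selection, solved by brute force.
# Only feasible for small n (2^n states); the annealer handles larger n.
import itertools
import numpy as np

def qubo_energy(z, h, J):
    # E(z) = sum_i h_i z_i + sum_{i<j} J_ij z_i z_j,  z_i in {0, 1}
    return h @ z + z @ np.triu(J, 1) @ z

def brute_force_qubo(h, J):
    n = len(h)
    best_z, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        z = np.array(bits)
        e = qubo_energy(z, h, J)
        if e < best_e:
            best_z, best_e = z, e
    return best_z, best_e  # selected stumps and objective value
```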
NASA Astrophysics Data System (ADS)
Daliakopoulos, Ioannis; Tsanis, Ioannis
2017-04-01
Mitigating the vulnerability of Mediterranean rangelands against degradation is limited by our ability to understand and accurately characterize those impacts in space and time. The Normalized Difference Vegetation Index (NDVI) is a radiometric measure of the photosynthetically active radiation absorbed by green vegetation canopy chlorophyll and is therefore a good surrogate measure of vegetation dynamics. On the other hand, meteorological indices such as the drought-assessing Standardised Precipitation Index (SPI) can be easily estimated from historical and projected datasets at the global scale. This work investigates the potential of driving Random Forest (RF) models with meteorological indices to approximate NDVI-based vegetation dynamics. A sufficiently large number of RF models are trained using random subsets of the dataset as predictors, in a bootstrapping approach to account for the uncertainty introduced by the subset selection. The updated E-OBS-v13.1 dataset of the ENSEMBLES EU FP6 program provides observed monthly meteorological input to estimate SPI over the Mediterranean rangelands. RF models are trained to depict vegetation dynamics using the latest version (3g.v1) of the third generation GIMMS NDVI generated from NOAA's Advanced Very High Resolution Radiometer (AVHRR) sensors. Analysis is conducted for the period 1981-2015 at a gridded spatial resolution of 25 km. Preliminary results demonstrate the potential of machine learning algorithms to effectively mimic the underlying physical relationship between drought and Earth Observation vegetation indices and to provide estimates based on precipitation variability.
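A hedged sketch of the bootstrapping setup described above: many RF models trained on random subsets, with the spread of their predictions serving as a crude uncertainty band. Variable names and the subset fraction are assumptions for illustration.

```python
# Bootstrapped random forests predicting NDVI from SPI predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bootstrap_rf_ndvi(spi, ndvi, n_models=50, frac=0.7, seed=0):
    """spi: (n_samples, n_features) predictors; ndvi: (n_samples,) target."""
    rng = np.random.default_rng(seed)
    preds = []
    n = len(ndvi)
    for m in range(n_models):
        idx = rng.choice(n, size=int(frac * n), replace=True)  # random subset
        rf = RandomForestRegressor(n_estimators=200, random_state=m)
        rf.fit(spi[idx], ndvi[idx])
        preds.append(rf.predict(spi))
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)  # estimate and spread
```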
Lerma, Claudia; Wessel, Niels; Schirdewan, Alexander; Kurths, Jürgen; Glass, Leon
2008-07-01
The objective was to determine the characteristics of heart rate variability and ventricular arrhythmias prior to the onset of ventricular tachycardia (VT) in patients with an implantable cardioverter defibrillator (ICD). Sixty-eight beat-to-beat time series from 13 patients with an ICD were analyzed to quantify heart rate variability and ventricular arrhythmias. The episodes of VT were classified into one of two groups depending on whether the sinus rate in the 1 min preceding the VT was greater or less than 90 beats per minute. In a subset of patients, increased heart rate and reduced heart rate variability were often observed up to 20 min prior to the VT. There was a non-significant trend toward a higher incidence of premature ventricular complexes (PVCs) before VT compared to control recordings. The patterns of the ventricular arrhythmias were highly heterogeneous among different patients and even within the same patient. Analysis of the changes of heart rate and heart rate variability may have predictive value regarding the onset of VT in selected patients. The patterns of ventricular arrhythmia could not be used to predict onset of VT in this group of patients.
2013-01-01
Background Gene expression data can be of great help in the development of proficient cancer diagnosis and classification platforms. Recently, many researchers have analyzed gene expression data using diverse computational intelligence methods to select a small subset of informative genes from the data for cancer classification. Many computational methods face difficulties in selecting small subsets due to the small number of samples compared to the huge number of genes (high dimensionality), irrelevant genes, and noisy genes. Methods We propose an enhanced binary particle swarm optimization to perform the selection of small subsets of informative genes, which is significant for cancer classification. Particle speed, a new rule, and a modified sigmoid function are introduced in the proposed method to increase the probability that the bits in a particle's position are zero. The method was empirically applied to a suite of ten well-known benchmark gene expression data sets. Results The performance of the proposed method proved to be superior to that of previous related works, including the conventional version of binary particle swarm optimization (BPSO), in terms of classification accuracy and the number of selected genes. The proposed method also requires lower computational time compared to BPSO. PMID:23617960
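The key modification, biasing the sigmoid so that bits are more likely to be zero, can be sketched as below; the exact rule and constants of the paper may differ, so this is an illustrative variant rather than the published update.

```python
# Bit-update rule for a binary PSO with a modified sigmoid that favors
# zeros, shrinking the selected gene subset. Illustrative variant only.
import numpy as np

def update_bits(velocity, rng, shrink=2.0):
    # Standard BPSO: P(bit = 1) = sigmoid(v). Dividing by `shrink`
    # caps the probability at 1/shrink, pushing bits toward zero.
    p_one = 1.0 / (1.0 + np.exp(-velocity)) / shrink
    return (rng.random(velocity.shape) < p_one).astype(int)

rng = np.random.default_rng(0)
v = rng.normal(0, 1, size=20)      # velocities for a 20-gene particle
position = update_bits(v, rng)     # sparse 0/1 gene-selection mask
print(position.sum(), "genes selected")
```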
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ditzler, Gregory; Morrison, J. Calvin; Lan, Yemin
Background: Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- and β-diversity. Feature subset selection – a sub-field of machine learning – can also provide a unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we have used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome. Results: We have developed a new Python command line tool, which is compatible with the widely adopted BIOM format, for microbial ecologists that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets. Conclusions: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.
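As a rough illustration of information-theoretic OTU selection in this spirit (using scikit-learn's mutual information ranking rather than Fizzy's own implementations):

```python
# Rank OTUs by mutual information with a phenotype label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_k_otus(otu_table, phenotype, k=25):
    """otu_table: (samples x OTUs) counts; phenotype: class label per sample."""
    mi = mutual_info_classif(otu_table, phenotype, discrete_features=True)
    return np.argsort(mi)[::-1][:k]   # indices of the k most informative OTUs
```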
Subset selective search on the basis of color and preview.
Donk, Mieke
2017-01-01
In the preview paradigm observers are presented with one set of elements (the irrelevant set) followed by the addition of a second set among which the target is presented (the relevant set). Search efficiency in such a preview condition has been demonstrated to be higher than that in a full-baseline condition in which both sets are simultaneously presented, suggesting that a preview of the irrelevant set reduces its influence on the search process. However, numbers of irrelevant and relevant elements are typically not independently manipulated. Moreover, subset selective search also occurs when both sets are presented simultaneously but differ in color. The aim of the present study was to investigate how numbers of irrelevant and relevant elements contribute to preview search in the absence and presence of a color difference between subsets. In two experiments it was demonstrated that a preview reduced the influence of the number of irrelevant elements in the absence but not in the presence of a color difference between subsets. In the presence of a color difference, a preview lowered the effect of the number of relevant elements but only when the target was defined by a unique feature within the relevant set (Experiment 1); when the target was defined by a conjunction of features (Experiment 2), search efficiency as a function of the number of relevant elements was not modulated by a preview. Together the results are in line with the idea that subset selective search is based on different simultaneously operating mechanisms.
NASA Astrophysics Data System (ADS)
Vidal, Jean-Philippe; Hingray, Benoît
2014-05-01
In order to better understand the uncertainties in the climate of the next decades, an increasingly large number of increasingly diverse climate projections is being produced by the climate research community through coordinated initiatives (e.g., CMIP5, CORDEX), but also through more specific experiments at both the global scale (perturbed parameter ensembles) and the regional-to-local scale (empirical statistical downscaling ensembles). While significant efforts are put into making such projections available online, very few works focus on how to make such an enormous amount of information actually usable by the impact and adaptation community. Climate services should therefore include guidelines and recommendations for identifying subsets of climate projections that would have (1) a size manageable by downstream modelling approaches and (2) the relevant properties for informing adaptation strategies. This work proposes a generic framework for identifying tailored subsets of climate projections that would meet both the objectives and the constraints of a specific impact / adaptation study in a typical top-down approach. This decision framework builds on two main preliminary tasks that lead to critical choices in the selection strategy: (1) understanding the requirements of the specific impact / adaptation study, and (2) characterizing the (downscaled) climate projections dataset available. An impact / adaptation study has two types of requirements. First, the study may aim at various outcomes for a given climate-related feature: the best estimate of the future, the range of possible futures, a set of representative futures, or a statistically interpretable ensemble of futures. Second, impact models may come with specific constraints on climate input variables, like spatio-temporal and between-variables coherence. Additionally, when concurrent impact models are used, the most restrictive constraints have to be considered in order to be able to assess the uncertainty associated with this modelling step. Moreover, the climate projection dataset available for a given study has several characteristics that will heavily condition the type of conclusions that can be reached. Indeed, the dataset at hand may or may not sample different types of uncertainty (socio-economic, structural, parametric, along with internal variability). Moreover, these types are present at different steps in the well-known cascade of uncertainty, from the emission / concentration scenarios and the global climate to the regional-to-local climate. Critical choices for the selection are therefore conditioned on all features above. The type of selection (picking out, culling, or statistical sampling) is closely related to the study objectives and the uncertainty types present in the dataset. Moreover, grounds for picking out or culling projections may stem from global, regional or feature-specific present-day performance, representativeness, or covered range. An example use of this framework is a hierarchical selection for 3 classes of impact models among 3000 transient climate projections from different runs of 4 GCMs, statistically downscaled by 3 probabilistic methods, and made available for an integrated water resource adaptation study in the Durance catchment (southern French Alps). This work is part of the GICC R2D2-2050 project (Risk, water Resources and sustainable Development of the Durance catchment in 2050) and the EU FP7 COMPLEX project (Knowledge Based Climate Mitigation Systems for a Low Carbon Economy).
2012-01-01
Background Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving the classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which is of great benefit to not only tumor molecular diagnosis but also drug development. Results This paper proposes a novel gene selection method with rich biomedical meaning based on a Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method could suffer from over-fitting and selection bias problems. To address these potential problems, an HBSA-based ensemble classifier is constructed using a majority voting strategy from individual classifiers constructed by the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes using their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets, including three pairs of cross-platform datasets, indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtypes and even to hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. The selected genes are further justified by analyzing them in the context of individual gene function, biological pathways, and protein-protein interaction networks. PMID:22830977
The application of cat swarm optimisation algorithm in classifying small loan performance
NASA Astrophysics Data System (ADS)
Kencana, Eka N.; Kiswanti, Nyoman; Sari, Kartika
2017-10-01
It is common for banking systems to analyse the feasibility of a credit application before its approval. Although this process is done carefully, there is no guarantee that all credits will be repaid smoothly. This study aimed to assess the accuracy of the Cat Swarm Optimisation (CSO) algorithm in classifying the performance of small loans approved by Bank Rakyat Indonesia (BRI), one of several public banks in Indonesia. Data collected from 200 borrowers were used in this work. The data matrix consists of 9 independent variables that represent the profile of the credit, and one categorical dependent variable that reflects the credit's performance. Prior to the analyses, the data were divided into two subsets of equal size. An ordinal logistic regression (OLR) procedure applied to the first subset showed that 3 out of the 9 independent variables, i.e. the amount of credit, the credit's period, and the borrower's monthly income, significantly affect credit performance. Using the significant parameter estimates from the OLR procedure as initial values for the observations in the second subset, the CSO procedure was run. This procedure gave 76 percent classification accuracy for credit performance, slightly better than the 64 percent resulting from the OLR procedure.
Fisher, Charles K; Mehta, Pankaj
2015-06-01
Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Here, we introduce a new approach, the Bayesian Ising Approximation (BIA), to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model with weak couplings. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high-dimensional regression by analyzing a gene expression dataset with nearly 30,000 features. These results also highlight the impact of correlations between features on Bayesian feature selection. An implementation of the BIA in C++, along with data for reproducing our gene expression analyses, is freely available at http://physics.bu.edu/~pankajm/BIACode.
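The mean-field step at the heart of this approach is the self-consistency iteration m_i = tanh(h_i + sum_j J_ij m_j). The sketch below shows that iteration together with a schematic construction of fields and couplings from standardized data; the exact BIA expressions are in the paper, so treat the mapping here as an assumption.

```python
# Mean-field magnetizations for an Ising model with weak couplings;
# the field/coupling construction is schematic, not the exact BIA.
import numpy as np

def mean_field_magnetizations(h, J, n_iter=200, damping=0.5):
    m = np.zeros_like(h)
    for _ in range(n_iter):
        m_new = np.tanh(h + J @ m)
        m = damping * m + (1 - damping) * m_new   # damped fixed-point update
    return m  # m_i near +1 => feature i likely relevant

def bia_like_scores(X, y, epsilon=0.1):
    """X: standardized (n x p) data; y: standardized response."""
    h = epsilon * (X.T @ y) / len(y)   # fields ~ feature-response correlations
    J = epsilon * (X.T @ X) / len(y)   # couplings ~ feature-feature correlations
    np.fill_diagonal(J, 0.0)
    return mean_field_magnetizations(h, J)
```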
Identification of features in indexed data and equipment therefor
Jarman, Kristin H [Richland, WA; Daly, Don Simone [Richland, WA; Anderson, Kevin K [Richland, WA; Wahl, Karen L [Richland, WA
2002-04-02
Embodiments of the present invention provide methods of identifying a feature in an indexed dataset. Such embodiments encompass selecting an initial subset of indices, the initial subset of indices being encompassed by an initial window-of-interest and comprising at least one beginning index and at least one ending index; computing an intensity weighted measure of dispersion for the subset of indices using a subset of responses corresponding to the subset of indices; and comparing the intensity weighted measure of dispersion to a dispersion critical value determined from an expected value of the intensity weighted measure of dispersion under a null hypothesis of no transient feature present. Embodiments of the present invention also encompass equipment configured to perform the methods of the present invention.
Efficient feature subset selection with probabilistic distance criteria. [pattern recognition
NASA Technical Reports Server (NTRS)
Chittineni, C. B.
1979-01-01
Recursive expressions are derived for efficiently computing the commonly used probabilistic distance measures as a change in the criteria both when a feature is added to and when a feature is deleted from the current feature subset. A combinatorial algorithm for generating all possible r-feature combinations from a given set of s features in C(s, r) steps, with a change of a single feature at each step, is presented. These expressions can also be used for both forward and backward sequential feature selection.
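A direct (non-recursive) companion to this idea is easy to write down: forward sequential selection that recomputes a probabilistic distance, here the Bhattacharyya distance between two Gaussian classes, each time a feature is added. The recursive update formulas of the paper are replaced by plain recomputation for clarity.

```python
# Forward sequential feature selection with the Bhattacharyya distance.
import numpy as np

def bhattacharyya(X0, X1, feats):
    """Distance between Gaussian fits of two classes on features `feats`."""
    a, b = X0[:, feats], X1[:, feats]
    m = np.mean(a, 0) - np.mean(b, 0)
    C0 = np.atleast_2d(np.cov(a.T))
    C1 = np.atleast_2d(np.cov(b.T))
    C = (C0 + C1) / 2
    term1 = 0.125 * m @ np.linalg.solve(C, m)
    term2 = 0.5 * np.log(np.linalg.det(C) /
                         np.sqrt(np.linalg.det(C0) * np.linalg.det(C1)))
    return term1 + term2

def forward_select(X0, X1, r):
    chosen = []
    while len(chosen) < r:
        rest = [f for f in range(X0.shape[1]) if f not in chosen]
        scores = [bhattacharyya(X0, X1, chosen + [f]) for f in rest]
        chosen.append(rest[int(np.argmax(scores))])
    return chosen  # r selected feature indices
```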
NASA Astrophysics Data System (ADS)
Li, Yun; Zhang, Ji; Li, Tao; Liu, Honggao; Li, Jieqing; Wang, Yuanzhong
2017-04-01
In this work, the data fusion strategy of Fourier transform mid infrared (FT-MIR) spectroscopy and inductively coupled plasma-atomic emission spectrometry (ICP-AES) was used in combination with Support Vector Machine (SVM) to determine the geographic origin of Boletus edulis collected from nine regions of Yunnan Province in China. Firstly, competitive adaptive reweighted sampling (CARS) was used for selecting an optimal combination of key wavenumbers of second derivative FT-MIR spectra, and thirteen elements were sorted with variable importance in projection (VIP) scores. Secondly, thirteen subsets of multi-elements with the best VIP score were generated and each subset was used to fuse with FT-MIR. Finally, the classification models were established by SVM, and the combination of parameter C and γ (gamma) of SVM models was calculated by the approaches of grid search (GS) and genetic algorithm (GA). The results showed that both GS-SVM and GA-SVM models achieved good performances based on the #9 subset and the prediction accuracy in calibration and validation sets of the two models were 81.40% and 90.91%, correspondingly. In conclusion, it indicated that the data fusion strategy of FT-MIR and ICP-AES coupled with the algorithm of SVM can be used as a reliable tool for accurate identification of B. edulis, and it can provide a useful way of thinking for the quality control of edible mushrooms.
Melzer, Nina; Wittenburg, Dörte; Repsilber, Dirk
2013-01-01
In this study the benefit of metabolome-level analysis for the prediction of genetic value of three traditional milk traits was investigated. Our proposed approach consists of three steps: First, milk metabolite profiles are used to predict three traditional milk traits of 1,305 Holstein cows. Two regression methods, both enabling variable selection, are applied to identify important milk metabolites in this step. Second, the prediction of these important milk metabolites from single nucleotide polymorphisms (SNPs) enables the detection of SNPs with significant genetic effects. Finally, these SNPs are used to predict milk traits. The observed precision of predicted genetic values was compared to the results observed for the classical genotype-phenotype prediction using all SNPs or a reduced SNP subset (reduced classical approach). To enable a comparison between SNP subsets, a special invariable evaluation design was implemented. SNPs close to or within known quantitative trait loci (QTL) were determined. This enabled us to determine whether the detected important SNP subsets were enriched in these regions. The results show that our approach can lead to genetic value prediction but requires less than 1% of the total amount of (40,317) SNPs. Moreover, significantly more important SNPs in known QTL regions were detected using our approach compared to the reduced classical approach. In conclusion, our approach allows a deeper insight into the associations between the different levels of the genotype-phenotype map (genotype-metabolome, metabolome-phenotype, genotype-phenotype). PMID:23990900
2015-01-01
Complex RNA structures are constructed from helical segments connected by flexible loops that move spontaneously and in response to binding of small molecule ligands and proteins. Understanding the conformational variability of RNA requires the characterization of the coupled time evolution of interconnected flexible domains. To elucidate the collective molecular motions and explore the conformational landscape of the HIV-1 TAR RNA, we describe a new methodology that utilizes energy-minimized structures generated by the program “Fragment Assembly of RNA with Full-Atom Refinement (FARFAR)”. We apply structural filters in the form of experimental residual dipolar couplings (RDCs) to select a subset of discrete energy-minimized conformers and carry out principal component analyses (PCA) to corroborate the choice of the filtered subset. We use this subset of structures to calculate solution T1 and T1ρ relaxation times for 13C spins in multiple residues in different domains of the molecule using two simulation protocols that we previously published. We match the experimental T1 times to within 2% and the T1ρ times to within less than 10% for helical residues. These results introduce a protocol to construct viable dynamic trajectories for RNA molecules that accord well with experimental NMR data and support the notion that the motions of the helical portions of this small RNA can be described by a relatively small number of discrete conformations exchanging over time scales longer than 1 μs. PMID:24479561
Reducing I/O variability using dynamic I/O path characterization in petascale storage systems
Son, Seung Woo; Sehrish, Saba; Liao, Wei-keng; ...
2016-11-01
In petascale systems with a million CPU cores, scalable and consistent I/O performance is becoming increasingly difficult to sustain, mainly because of I/O variability. This I/O variability is caused by concurrently running processes/jobs competing for I/O, or by a RAID rebuild when a disk drive fails. We present a mechanism that stripes across a selected subset of I/O nodes with the lightest workload at runtime to achieve the highest I/O bandwidth available in the system. In this paper, we propose a probing mechanism to enable application-level dynamic file striping to mitigate I/O variability. We also implement the proposed mechanism in the high-level I/O library that enables memory-to-file data layout transformation and allows transparent file partitioning using subfiling. Subfiling is a technique that partitions data into a set of files of smaller size and manages file access to them, making the data appear as a single, normal file to users. Here, we demonstrate that our bandwidth probing mechanism can successfully identify temporally slower I/O nodes without noticeable runtime overhead. Experimental results on NERSC's systems also show that our approach isolates I/O variability effectively on shared systems and improves overall collective I/O performance with less variation.
Generalized Lotka-Volterra systems connected with simple Lie algebras
NASA Astrophysics Data System (ADS)
Charalambides, Stelios A.; Damianou, Pantelis A.; Evripidou, Charalambos A.
2015-06-01
We devise a new method for producing Hamiltonian systems by constructing the corresponding Lax pairs. This is achieved by considering a larger subset of the positive roots than the simple roots of the root system of a simple Lie algebra. We classify all subsets of the positive roots of the root system of type A_n for which the corresponding Hamiltonian systems are transformed, via a simple change of variables, to Lotka-Volterra systems. For some special cases of subsets of the positive roots of the root system of type A_n, we produce new integrable Hamiltonian systems.
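For reference, the generic Lotka-Volterra form into which these Hamiltonian systems are transformed is the standard one from the literature (the paper's specific change of variables is not reproduced here):

```latex
% Generic Lotka-Volterra system in n variables.
\dot{x}_i = x_i \Big( \lambda_i + \sum_{j=1}^{n} a_{ij}\, x_j \Big),
\qquad i = 1, \dots, n.
```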
Mas, Sergi; Gassó, Patricia; Morer, Astrid; Calvo, Anna; Bargalló, Nuria; Lafuente, Amalia; Lázaro, Luisa
2016-01-01
We propose an integrative approach that combines structural magnetic resonance imaging (MRI) data, diffusion tensor imaging (DTI) data, neuropsychological data, and genetic data to predict early-onset obsessive compulsive disorder (OCD) severity. From a cohort of 87 patients, 56 with complete information were used in the present analysis. First, we performed a multivariate genetic association analysis of OCD severity with 266 genetic polymorphisms. This association analysis was used to select and prioritize the SNPs that would be included in the model. Second, we split the sample into a training set (N = 38) and a validation set (N = 18). Third, entropy-based measures of information gain were used for feature selection with the training subset. Fourth, the selected features were fed into two supervised methods of class prediction based on machine learning, using the leave-one-out procedure with the training set. Finally, the resulting model was validated with the validation set. Nine variables were used for the creation of the OCD severity predictor, including six genetic polymorphisms and three variables from the neuropsychological data. The developed model classified child and adolescent patients with OCD by disease severity with an accuracy of 0.90 in the testing set and 0.70 in the validation sample. Beyond its clinical applicability, the combination of particular neuropsychological, neuroimaging, and genetic characteristics could enhance our understanding of the neurobiological basis of the disorder. PMID:27093171
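The entropy-based feature-selection step mentioned above reduces to ranking features by information gain, gain(f) = H(severity) - H(severity | f). A minimal sketch for discretized features (names illustrative):

```python
# Information gain of a discrete feature with respect to class labels.
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    total = entropy(labels)
    cond = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        cond += len(subset) / len(labels) * entropy(subset)
    return total - cond  # larger => more informative feature
```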
Predicting degree of benefit from adjuvant trastuzumab in NSABP trial B-31.
Pogue-Geile, Katherine L; Kim, Chungyeul; Jeong, Jong-Hyeon; Tanaka, Noriko; Bandos, Hanna; Gavin, Patrick G; Fumagalli, Debora; Goldstein, Lynn C; Sneige, Nour; Burandt, Eike; Taniyama, Yusuke; Bohn, Olga L; Lee, Ahwon; Kim, Seung-Il; Reilly, Megan L; Remillard, Matthew Y; Blackmon, Nicole L; Kim, Seong-Rim; Horne, Zachary D; Rastogi, Priya; Fehrenbacher, Louis; Romond, Edward H; Swain, Sandra M; Mamounas, Eleftherios P; Wickerham, D Lawrence; Geyer, Charles E; Costantino, Joseph P; Wolmark, Norman; Paik, Soonmyung
2013-12-04
National Surgical Adjuvant Breast and Bowel Project (NSABP) trial B-31 suggested the efficacy of adjuvant trastuzumab, even in HER2-negative breast cancer. This finding prompted us to develop a predictive model for degree of benefit from trastuzumab using archived tumor blocks from B-31. Case subjects with tumor blocks were randomly divided into discovery (n = 588) and confirmation cohorts (n = 991). A predictive model was built from the discovery cohort through gene expression profiling of 462 genes with nCounter assay. A predefined cut point for the predictive model was tested in the confirmation cohort. Gene-by-treatment interaction was tested with Cox models, and correlations between variables were assessed with Spearman correlation. Principal component analysis was performed on the final set of selected genes. All statistical tests were two-sided. Eight predictive genes associated with HER2 (ERBB2, c17orf37, GRB7) or ER (ESR1, NAT1, GATA3, CA12, IGF1R) were selected for model building. Three-dimensional subset treatment effect pattern plot using two principal components of these genes was used to identify a subset with no benefit from trastuzumab, characterized by intermediate-level ERBB2 and high-level ESR1 mRNA expression. In the confirmation set, the predefined cut points for this model classified patients into three subsets with differential benefit from trastuzumab with hazard ratios of 1.58 (95% confidence interval [CI] = 0.67 to 3.69; P = .29; n = 100), 0.60 (95% CI = 0.41 to 0.89; P = .01; n = 449), and 0.28 (95% CI = 0.20 to 0.41; P < .001; n = 442; P(interaction) between the model and trastuzumab < .001). We developed a gene expression-based predictive model for degree of benefit from trastuzumab and demonstrated that HER2-negative tumors belong to the moderate benefit group, thus providing justification for testing trastuzumab in HER2-negative patients (NSABP B-47).
Identification of selection signatures in cattle breeds selected for dairy production.
Stella, Alessandra; Ajmone-Marsan, Paolo; Lazzari, Barbara; Boettcher, Paul
2010-08-01
The genomics revolution has spurred the undertaking of HapMap studies of numerous species, allowing for population genomics to increase the understanding of how selection has created genetic differences between subspecies populations. The objectives of this study were to (1) develop an approach to detect signatures of selection in subsets of phenotypically similar breeds of livestock by comparing single nucleotide polymorphism (SNP) diversity between the subset and a larger population, (2) verify this method in breeds selected for simply inherited traits, and (3) apply this method to the dairy breeds in the International Bovine HapMap (IBHM) study. The data consisted of genotypes for 32,689 SNPs of 497 animals from 19 breeds. For a given subset of breeds, the test statistic was the parametric composite log likelihood (CLL) of the differences in allelic frequencies between the subset and the IBHM for a sliding window of SNPs. The null distribution was obtained by calculating CLL for 50,000 random subsets (per chromosome) of individuals. The validity of this approach was confirmed by obtaining extremely large CLLs at the sites of causative variation for polled (BTA1) and black-coat-color (BTA18) phenotypes. Across the 30 bovine chromosomes, 699 putative selection signatures were detected. The largest CLL was on BTA6 and corresponded to KIT, which is responsible for the piebald phenotype present in four of the five dairy breeds. Potassium channel-related genes were at the site of the largest CLL on three chromosomes (BTA14, -16, and -25) whereas integrins (BTA18 and -19) and serine/arginine rich splicing factors (BTA20 and -23) each had the largest CLL on two chromosomes. On the basis of the results of this study, the application of population genomics to farm animals seems quite promising. Comparisons between breed groups have the potential to identify genomic regions influencing complex traits with no need for complex equipment and the collection of extensive phenotypic records and can contribute to the identification of candidate genes and to the understanding of the biological mechanisms controlling complex traits.
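The windowed test statistic with a permutation null described above can be illustrated as follows. This is a deliberately simplified composite log-likelihood (binomial likelihood of subset allele counts under whole-population frequencies, 1,000 random subsets rather than 50,000), on synthetic genotypes; the published CLL differs in detail.

```python
# Simplified illustration of a sliding-window test for selection signatures:
# composite log-likelihood of a breed subset's allele counts under overall
# allele frequencies, with a permutation null from random subsets.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)
n_animals, n_snps, win = 497, 2000, 9
geno = rng.integers(0, 3, size=(n_animals, n_snps))      # 0/1/2 allele counts

def window_cll(rows, start):
    """-log L of the subset's allele counts given overall frequencies."""
    sub = geno[rows, start:start + win]
    p_all = geno[:, start:start + win].sum(0) / (2 * n_animals)
    k, n = sub.sum(0), 2 * len(rows)
    return -binom.logpmf(k, n, p_all).sum()

dairy = np.arange(100)                                   # hypothetical breed subset
obs = window_cll(dairy, 0)
null = [window_cll(rng.choice(n_animals, 100, replace=False), 0)
        for _ in range(1000)]
print("empirical p:", np.mean(np.array(null) >= obs))
```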
Fernández-Varela, R; Andrade, J M; Muniategui, S; Prada, D; Ramírez-Villalobos, F
2010-04-01
Identifying petroleum-related products released into the environment is a complex and difficult task. To achieve this, polycyclic aromatic hydrocarbons (PAHs) are of outstanding importance nowadays. Although traditional quantitative fingerprinting uses straightforward univariate statistical analyses to differentiate among oils and to assess their sources, a multivariate strategy based on Procrustes rotation (PR) was applied in this paper. The aim of PR is to select a reduced subset of PAHs still capable of performing a satisfactory identification of petroleum-related hydrocarbons. PR selected two subsets of three (C2-naphthalene, C2-dibenzothiophene and C2-phenanthrene) and five (C1-decahydronaphthalene, naphthalene, C2-phenanthrene, C3-phenanthrene and C2-fluoranthene) PAHs for each of the two datasets studied here. The classification abilities of each subset of PAHs were tested using principal components analysis, hierarchical cluster analysis and Kohonen neural networks, and it was demonstrated that they unraveled the same patterns as the overall set of PAHs. (c) 2009 Elsevier Ltd. All rights reserved.
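The core check here, that a reduced PAH subset reproduces the sample patterns of the full set, can be sketched with a Procrustes comparison of PCA score configurations. The data and subset below are invented, and the PR search itself is not reproduced; only the agreement measure is shown.

```python
# Checking that a reduced PAH subset preserves the sample patterns of the full
# set, via Procrustes analysis of the two PCA score configurations.
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import PCA

rng = np.random.default_rng(16)
pah = np.abs(rng.normal(size=(60, 25)))        # 60 oil samples x 25 PAH variables
subset = [2, 7, 11]                            # e.g. a 3-PAH candidate subset

scores_full = PCA(n_components=2).fit_transform(pah)
scores_sub = PCA(n_components=2).fit_transform(pah[:, subset])
_, _, disparity = procrustes(scores_full, scores_sub)
print(f"Procrustes disparity (0 = identical patterns): {disparity:.3f}")
```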
Accuracy of direct genomic values in Holstein bulls and cows using subsets of SNP markers
2010-01-01
Background At the current price, the use of high-density single nucleotide polymorphism (SNP) genotyping assays in genomic selection of dairy cattle is limited to applications involving elite sires and dams. The objective of this study was to evaluate the use of low-density assays to predict direct genomic value (DGV) on five milk production traits, an overall conformation trait, a survival index, and two profit index traits (APR, ASI). Methods Dense SNP genotypes were available for 42,576 SNP for 2,114 Holstein bulls and 510 cows. A subset of 1,847 bulls born between 1955 and 2004 was used as a training set to fit models with various sets of pre-selected SNP. A group of 297 bulls born between 2001 and 2004 and all cows born between 1992 and 2004 were used to evaluate the accuracy of DGV prediction. Ridge regression (RR) and partial least squares regression (PLSR) were used to derive prediction equations and to rank SNP based on the absolute value of the regression coefficients. Four alternative strategies were applied to select subsets of SNP, namely: subsets of the highest ranked SNP for each individual trait, or a single subset of evenly spaced SNP, where SNP were selected based on their rank for ASI, APR or minor allele frequency within intervals of approximately equal length. Results RR and PLSR performed very similarly to predict DGV, with PLSR performing better for low-density assays and RR for higher-density SNP sets. When using all SNP, DGV predictions for production traits, which have a higher heritability, were more accurate (0.52-0.64) than for survival (0.19-0.20), which has a low heritability. The gain in accuracy using subsets that included the highest ranked SNP for each trait was marginal (5-6%) over a common set of evenly spaced SNP when at least 3,000 SNP were used. Subsets containing 3,000 SNP provided more than 90% of the accuracy that could be achieved with a high-density assay for cows, and 80% of the high-density assay for young bulls. Conclusions Accurate genomic evaluation of the broader bull and cow population can be achieved with a single genotyping assay containing ~3,000 to 5,000 evenly spaced SNP. PMID:20950478
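A minimal sketch of the RR-versus-PLSR comparison on coefficient-ranked marker subsets follows, using scikit-learn on simulated SNP dosages; the penalty, component count, and subset sizes are illustrative assumptions, not the study's tuned settings.

```python
# Sketch comparing ridge regression and PLS regression for direct genomic
# value prediction on subsets of top-ranked markers (synthetic data).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(2000, 5000)).astype(float)   # animals x SNP dosages
beta = np.zeros(5000); beta[rng.choice(5000, 50)] = rng.normal(size=50)
y = X @ beta + rng.normal(scale=2.0, size=2000)           # trait value, say ASI
X_tr, y_tr, X_te, y_te = X[:1700], y[:1700], X[1700:], y[1700:]

ridge = Ridge(alpha=100.0).fit(X_tr, y_tr)
rank = np.argsort(np.abs(ridge.coef_))[::-1]              # rank SNP by |coefficient|
for k in (300, 1000, 3000):
    cols = rank[:k]
    r_acc = Ridge(alpha=100.0).fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)
    pls = PLSRegression(n_components=10).fit(X_tr[:, cols], y_tr)
    p_acc = r2_score(y_te, pls.predict(X_te[:, cols]).ravel())
    print(k, round(r_acc, 3), round(p_acc, 3))
```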
Weigel, K A; de los Campos, G; González-Recio, O; Naya, H; Wu, X L; Long, N; Rosa, G J M; Gianola, D
2009-10-01
The objective of the present study was to assess the predictive ability of subsets of single nucleotide polymorphism (SNP) markers for development of low-cost, low-density genotyping assays in dairy cattle. Dense SNP genotypes of 4,703 Holstein bulls were provided by the USDA Agricultural Research Service. A subset of 3,305 bulls born from 1952 to 1998 was used to fit various models (training set), and a subset of 1,398 bulls born from 1999 to 2002 was used to evaluate their predictive ability (testing set). After editing, data included genotypes for 32,518 SNP and August 2003 and April 2008 predicted transmitting abilities (PTA) for lifetime net merit (LNM$), the latter resulting from progeny testing. The Bayesian least absolute shrinkage and selection operator method was used to regress August 2003 PTA on marker covariates in the training set to arrive at estimates of marker effects and direct genomic PTA. The coefficient of determination (R2) from regressing the April 2008 progeny test PTA of bulls in the testing set on their August 2003 direct genomic PTA was 0.375. Subsets of 300, 500, 750, 1,000, 1,250, 1,500, and 2,000 SNP were created by choosing equally spaced and highly ranked SNP, with the latter based on the absolute value of their estimated effects obtained from the training set. The SNP effects were re-estimated from the training set for each subset of SNP, and the 2008 progeny test PTA of bulls in the testing set were regressed on corresponding direct genomic PTA. The R2 values for subsets of 300, 500, 750, 1,000, 1,250, 1,500, and 2,000 SNP with largest effects (evenly spaced SNP) were 0.184 (0.064), 0.236 (0.111), 0.269 (0.190), 0.289 (0.179), 0.307 (0.228), 0.313 (0.268), and 0.322 (0.291), respectively. These results indicate that a low-density assay comprising selected SNP could be a cost-effective alternative for selection decisions and that significant gains in predictive ability may be achieved by increasing the number of SNP allocated to such an assay from 300 or fewer to 1,000 or more.
Overlapping meta-analyses on the same topic: survey of published studies.
Siontis, Konstantinos C; Hernandez-Boussard, Tina; Ioannidis, John P A
2013-07-19
To assess how common it is to have multiple overlapping meta-analyses of randomized trials published on the same topic. Survey of published meta-analyses. PubMed. Meta-analyses published in 2010 were identified, and 5% of them were randomly selected. We further selected those that included randomized trials and examined effectiveness of any medical intervention. For eligible meta-analyses, we searched for other meta-analyses on the same topic (covering the same comparisons, indications/settings, and outcomes or overlapping subsets of them) published until February 2013. Of 73 eligible meta-analyses published in 2010, 49 (67%) had at least one other overlapping meta-analysis (median two meta-analyses per topic, interquartile range 1-4, maximum 13). In 17 topics at least one author was involved in at least two of the overlapping meta-analyses. No characteristics of the index meta-analyses were associated with the potential for overlapping meta-analyses. Among pairs of overlapping meta-analyses in 20 randomly selected topics, 13 of the more recent meta-analyses did not include any additional outcomes. In three of the four topics with eight or more published meta-analyses, many meta-analyses examined only a subset of the eligible interventions or indications/settings covered by the index meta-analysis. Conversely, for statins in the prevention of atrial fibrillation after cardiac surgery, 11 meta-analyses were published with similar eligibility criteria for interventions and setting: there was still variability in which studies were included, but the results were always similar or even identical across meta-analyses. While some independent replication of meta-analyses by different teams is possibly useful, the overall picture suggests that there is a waste of effort, with many topics covered by multiple overlapping meta-analyses.
NASA Astrophysics Data System (ADS)
Pande, Saket; Sharma, Ashish
2014-05-01
This study is motivated by the need to robustly specify, identify, and forecast runoff generation processes for hydroelectricity production. At a minimum, this requires identifying the significant predictors of runoff generation and the influence of each such predictor on the runoff response. To this end, we compare two non-parametric algorithms of predictor subset selection. One is based on information theory and assesses predictor significance (and hence selection) based on the Partial Information (PI) rationale of Sharma and Mehrotra (2014). The other algorithm is based on a frequentist approach that uses the bounds on probability of error concept of Pande (2005), assesses all possible predictor subsets on-the-go, and converges to a predictor subset in a computationally efficient manner. Both algorithms approximate the underlying system by locally constant functions and select predictor subsets corresponding to these functions. The performance of the two algorithms is compared on a set of synthetic case studies as well as a real-world case study of inflow forecasting. References: Sharma, A., and R. Mehrotra (2014), An information theoretic alternative to model a natural system using observational information alone, Water Resources Research, 49, doi:10.1002/2013WR013845. Pande, S. (2005), Generalized local learning in water resource management, PhD dissertation, Utah State University, UT-USA, 148p.
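A toy greedy selector in the spirit of these information-theoretic approaches is sketched below: it adds, at each step, the predictor most informative about the residual of a locally constant (nearest-neighbour) approximation. This is an illustration under stated assumptions, not the PI algorithm or the Pande (2005) procedure.

```python
# Toy greedy predictor-subset selector: at each step add the candidate whose
# mutual information with the current residual is largest, approximating the
# system with a locally constant (k-nearest-neighbour) function.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8))                 # candidate runoff predictors
y = np.sin(X[:, 0]) + 0.5 * X[:, 3] + 0.1 * rng.normal(size=500)

selected, residual = [], y.copy()
for _ in range(3):
    mi = mutual_info_regression(X, residual, random_state=0)
    mi[selected] = -np.inf                    # don't reselect
    best = int(np.argmax(mi))
    selected.append(best)
    # locally constant approximation of y on the current predictor subset
    model = KNeighborsRegressor(n_neighbors=10).fit(X[:, selected], y)
    residual = y - model.predict(X[:, selected])
print("selected predictors:", selected)
```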
Centanni, T M; Pantazis, D; Truong, D T; Gruen, J R; Gabrieli, J D E; Hogan, T P
2018-05-26
Individuals with dyslexia exhibit increased brainstem variability in response to sound. It is unknown whether this increased variability extends to neocortical regions associated with audition and reading, whether it extends to visual stimuli, and whether it characterizes all children with dyslexia or only a specific subset. We evaluated the consistency of stimulus-evoked neural responses in children with (N = 20) or without dyslexia (N = 12) as measured by magnetoencephalography (MEG). Approximately half of the children with dyslexia had significantly higher levels of variability in cortical responses to both auditory and visual stimuli in multiple nodes of the reading network. There was a significant and positive relationship between the number of risk alleles at rs6935076 in the dyslexia-susceptibility gene KIAA0319 and the degree of neural variability in primary auditory cortex across all participants. This gene has been linked with neural variability in rodents and in typical readers. These findings indicate that unstable representations of auditory and visual stimuli in auditory and other reading-related neocortical regions are present in a subset of children with dyslexia and support the link between the gene KIAA0319 and auditory neural variability across children with or without dyslexia. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
Ordering Elements and Subsets: Examples for Student Understanding
ERIC Educational Resources Information Center
Mellinger, Keith E.
2004-01-01
Teaching the art of counting can be quite difficult. Many undergraduate students have difficulty separating the ideas of permutation, combination, repetition, etc. This article develops some examples to help explain some of the underlying theory while looking carefully at the selection of various subsets of objects from a larger collection. The…
McParland, D; Phillips, C M; Brennan, L; Roche, H M; Gormley, I C
2017-12-10
The LIPGENE-SU.VI.MAX study, like many others, recorded high-dimensional continuous phenotypic data and categorical genotypic data. LIPGENE-SU.VI.MAX focuses on the need to account for both phenotypic and genetic factors when studying the metabolic syndrome (MetS), a complex disorder that can lead to higher risk of type 2 diabetes and cardiovascular disease. Interest lies in clustering the LIPGENE-SU.VI.MAX participants into homogeneous groups or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high dimensional, mixed data is developed to cluster LIPGENE-SU.VI.MAX participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated, estimation is via a Gibbs sampling algorithm, and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes ('healthy' and 'at risk') are uncovered. A small subset of variables is deemed discriminatory, which notably includes both phenotypic and genotypic variables, highlighting the need to jointly consider both factors. Further, 7 years after the LIPGENE-SU.VI.MAX data were collected, participants underwent follow-up analysis to diagnose presence or absence of the MetS. The two uncovered sub-phenotypes strongly correspond to the 7-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to define the uncertainty in sub-phenotype membership at the participant level is synonymous with the concepts of precision medicine and nutrition. Copyright © 2017 John Wiley & Sons, Ltd.
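A much-simplified analogue of criterion-driven, model-based clustering can be sketched with scikit-learn: fit Gaussian mixtures over a range of component counts and keep the lowest-BIC model. The published method is a Bayesian latent variable model for mixed data with built-in variable selection; the sketch below is continuous-only and purely illustrative.

```python
# Simplified analogue of model-based clustering with criterion-driven model
# choice: fit Gaussian mixtures and keep the model with the lowest BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1, size=(150, 10)),    # 'healthy'-like group
               rng.normal(1.5, 1, size=(150, 10))])   # 'at risk'-like group

models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))
print("chosen number of sub-phenotypes:", best.n_components)
# Posterior membership probabilities quantify uncertainty per participant.
print(best.predict_proba(X)[:3].round(2))
```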
Monocyte Subset Dynamics in Human Atherosclerosis Can Be Profiled with Magnetic Nano-Sensors
Wildgruber, Moritz; Lee, Hakho; Chudnovskiy, Aleksey; Yoon, Tae-Jong; Etzrodt, Martin; Pittet, Mikael J.; Nahrendorf, Matthias; Croce, Kevin; Libby, Peter; Weissleder, Ralph; Swirski, Filip K.
2009-01-01
Monocytes are circulating macrophage and dendritic cell precursors that populate healthy and diseased tissue. In humans, monocytes consist of at least two subsets whose proportions in the blood fluctuate in response to coronary artery disease, sepsis, and viral infection. Animal studies have shown that specific shifts in the monocyte subset repertoire either exacerbate or attenuate disease, suggesting a role for monocyte subsets as biomarkers and therapeutic targets. Assays are therefore needed that can selectively and rapidly enumerate monocytes and their subsets. This study shows that two major human monocyte subsets express similar levels of the receptor for macrophage colony stimulating factor (MCSFR) but differ in their phagocytic capacity. We exploit these properties and custom-engineer magnetic nanoparticles for ex vivo sensing of monocytes and their subsets. We present a two-dimensional enumerative mathematical model that simultaneously reports number and proportion of monocyte subsets in a small volume of human blood. Using a recently described diagnostic magnetic resonance (DMR) chip with 1 µl sample size and high throughput capabilities, we then show that application of the model accurately quantifies subset fluctuations that occur in patients with atherosclerosis. PMID:19461894
Estimation of the probability of success in petroleum exploration
Davis, J.C.
1977-01-01
A probabilistic model for oil exploration can be developed by assessing the conditional relationship between perceived geologic variables and the subsequent discovery of petroleum. Such a model includes two probabilistic components, the first reflecting the association between a geologic condition (structural closure, for example) and the occurrence of oil, and the second reflecting the uncertainty associated with the estimation of geologic variables in areas of limited control. Estimates of the conditional relationship between geologic variables and subsequent production can be found by analyzing the exploration history of a "training area" judged to be geologically similar to the exploration area. The geologic variables are assessed over the training area using an historical subset of the available data, whose density corresponds to the present control density in the exploration area. The success or failure of wells drilled in the training area subsequent to the time corresponding to the historical subset provides empirical estimates of the probability of success conditional upon geology. Uncertainty in perception of geological conditions may be estimated from the distribution of errors made in geologic assessment using the historical subset of control wells. These errors may be expressed as a linear function of distance from available control. Alternatively, the uncertainty may be found by calculating the semivariogram of the geologic variables used in the analysis: the two procedures will yield approximately equivalent results. The empirical probability functions may then be transferred to the exploration area and used to estimate the likelihood of success of specific exploration plays. These estimates will reflect both the conditional relationship between the geological variables used to guide exploration and the uncertainty resulting from lack of control. The technique is illustrated with case histories from the mid-Continent area of the U.S.A. © 1977 Plenum Publishing Corp.
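The first component of the model, the empirical probability of success conditional on a perceived geologic condition, reduces to simple conditional frequencies over the training area's post-cutoff wells. The counts below are invented for illustration.

```python
# Minimal numeric illustration of the first model component: empirical
# probability of a discovery conditional on a perceived geologic variable,
# estimated from a training area's drilling history (invented counts).
import numpy as np

# wells drilled in the training area after the historical cutoff:
# rows = structural closure perceived (yes/no), cols = outcome (producer/dry)
counts = np.array([[18, 42],    # closure perceived:    18 producers, 42 dry
                   [ 4, 96]])   # no closure perceived:  4 producers, 96 dry

p_success_given_geology = counts[:, 0] / counts.sum(axis=1)
print("P(oil | closure)    =", round(p_success_given_geology[0], 3))
print("P(oil | no closure) =", round(p_success_given_geology[1], 3))
# The second component would discount these by the probability that 'closure'
# was perceived correctly, estimated from assessment errors as a function of
# distance to control wells (or from a semivariogram of the variable).
```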
Cohen, Alan A; Milot, Emmanuel; Yong, Jian; Seplaki, Christopher L; Fülöp, Tamàs; Bandeen-Roche, Karen; Fried, Linda P
2013-03-01
Previous studies have identified many biomarkers that are associated with aging and related outcomes, but the relevance of these markers for underlying processes and their relationship to hypothesized systemic dysregulation is not clear. We address this gap by presenting a novel method for measuring dysregulation via the joint distribution of multiple biomarkers and assessing associations of dysregulation with age and mortality. Using longitudinal data from the Women's Health and Aging Study, we selected a 14-marker subset from 63 blood measures: those that diverged from the baseline population mean with age. For the 14 markers and all combinatorial sub-subsets we calculated a multivariate distance called the Mahalanobis distance (MHBD) for all observations, indicating how "strange" each individual's biomarker profile was relative to the baseline population mean. In most models, MHBD correlated positively with age, MHBD increased within individuals over time, and higher MHBD predicted higher risk of subsequent mortality. Predictive power increased as more variables were incorporated into the calculation of MHBD. Biomarkers from multiple systems were implicated. These results support hypotheses of simultaneous dysregulation in multiple systems and confirm the need for longitudinal, multivariate approaches to understanding biomarkers in aging. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
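Computing a Mahalanobis-distance (MHBD) dysregulation score relative to a baseline population, as described above, is straightforward with scipy; the biomarker data below are synthetic stand-ins for the 14-marker panel.

```python
# Mahalanobis-distance dysregulation score relative to a baseline population.
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(6)
baseline = rng.normal(size=(1000, 14))              # baseline population biomarkers
mean = baseline.mean(axis=0)
VI = np.linalg.inv(np.cov(baseline, rowvar=False))  # inverse covariance matrix

individual = rng.normal(loc=0.5, size=14)           # one person's biomarker profile
mhbd = mahalanobis(individual, mean, VI)
print(f"MHBD = {mhbd:.2f}")                         # larger = 'stranger' profile
```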
Oregon ground-water quality and its relation to hydrogeological factors; a statistical approach
Miller, T.L.; Gonthier, J.B.
1984-01-01
An appraisal of Oregon ground-water quality was made using existing data accessible through the U.S. Geological Survey computer system. The data available for about 1,000 sites were separated by aquifer units and hydrologic units. Selected statistical moments were described for 19 constituents including major ions. About 96 percent of all sites in the data base were sampled only once. The sample data were classified by aquifer unit and hydrologic unit and analysis of variance was run to determine if significant differences exist between the units within each of these two classifications for the same 19 constituents on which statistical moments were determined. Results of the analysis of variance indicated both classification variables performed about the same, but aquifer unit did provide more separation for some constituents. Samples from the Rogue River basin were classified by location within the flow system and type of flow system. The samples were then analyzed using analysis of variance on 14 constituents to determine if there were significant differences between subsets classified by flow path. Results of this analysis were not definitive, but classification as to the type of flow system did indicate potential for segregating water-quality data into distinct subsets. (USGS)
Mono-, Di-, or Trimorphism in Black Sea Ammonia sp.
NASA Astrophysics Data System (ADS)
Altenbach, Alexander V.; Bassler, Barbara
2014-05-01
For the genus Ammonia, the size of proloculi was considered one of the valuable taxonomic landmarks, although it may split in first alternating generations. We analysed 140 living (stained) tests of Ammonia sp. from the outer shelf of the Black Sea, collected from 5 stations on a depth gradient (138 to 206 m water depth). Samples were treated by standard technologies, such as live staining, wet sieving, volume detection, counts, and measures by light microscopy. The size of the proloculi was detected, extended by biometric characterisations of 11 measures, 5 qualitative characters, and 4 numerical ratios. Surprisingly, the multitude of test parameters allows the definition of either one highly variable taxon, several distinct species, or di- or trimorphism, depending exclusively on which parameters or parameter subsets are defined as 'decisive' or 'neglectable'. We followed the general taxonomic definition for species of the genus, and applied, discussed and rejected published criteria considered taxonomically important. In the end, none of the species described hitherto fully matches the observed morphological roundup: it is a new species. This conclusion results mainly from the balance of all morphologies, and not from the selection of an ultimate subset.
Factors affecting plant species composition of hedgerows: relative importance and hierarchy
NASA Astrophysics Data System (ADS)
Deckers, Bart; Hermy, Martin; Muys, Bart
2004-07-01
Although there has been a clear quantitative and qualitative decline in traditional hedgerow network landscapes during the last century, hedgerows are crucial for the conservation of rural biodiversity, functioning as an important habitat, refuge and corridor for numerous species. To safeguard this conservation function, insight into the basic organizing principles of hedgerow plant communities is needed. The vegetation composition of 511 individual hedgerows situated within an ancient hedgerow network landscape in Flanders, Belgium was recorded, in combination with a wide range of explanatory variables, including a selection of spatial variables. Non-parametric statistics in combination with multivariate data analysis techniques were used to study the effect of individual explanatory variables. Next, variables were grouped into five distinct subsets and the relative importance of these variable groups was assessed by two related variation partitioning techniques, partial regression and partial canonical correspondence analysis, explicitly taking into account the existence of intercorrelations between variables of different factor groups. Most explanatory variables significantly affected hedgerow species richness and composition. Multivariate analysis showed that, besides adjacent land use, hedgerow management, soil conditions, and hedgerow type and origin, the role of other factors such as hedge dimensions and intactness could certainly not be neglected. Furthermore, both methods revealed the same overall ranking of the five distinct factor groups. Besides a predominant impact of abiotic environmental conditions, it was found that management variables and structural aspects have a relatively larger influence on the distribution of plant species in hedgerows than their historical background or spatial configuration.
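The partial-regression side of variation partitioning reduces to comparing R² values of nested linear models; a compact sketch with two explanatory groups (instead of the study's five) follows, on invented data.

```python
# Compact variation partitioning by partial regression: split the explained
# variance in species richness between two explanatory groups into unique
# and shared fractions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 511
env = rng.normal(size=(n, 3))                       # abiotic soil variables
mgmt = 0.5 * env[:, :2] + rng.normal(size=(n, 2))   # management, correlated with env
richness = env[:, 0] + 0.8 * mgmt[:, 1] + rng.normal(size=n)

def r2(X):
    return LinearRegression().fit(X, richness).score(X, richness)

r2_env, r2_mgmt, r2_both = r2(env), r2(mgmt), r2(np.hstack([env, mgmt]))
print("unique env:  ", round(r2_both - r2_mgmt, 3))
print("unique mgmt: ", round(r2_both - r2_env, 3))
print("shared:      ", round(r2_env + r2_mgmt - r2_both, 3))
```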
Moghimi, Saba; Schudlo, Larissa; Chau, Tom; Guerguerian, Anne-Marie
2015-01-01
Music-induced brain activity modulations in areas involved in emotion regulation may be useful in achieving therapeutic outcomes. Clinical applications of music may involve prolonged or repeated exposures to music. However, the variability of the observed brain activity patterns in repeated exposures to music is not well understood. We hypothesized that multiple exposures to the same music would elicit more consistent activity patterns than exposure to different music. In this study, the temporal and spatial variability of cerebral prefrontal hemodynamic response was investigated across multiple exposures to self-selected musical excerpts in 10 healthy adults. The hemodynamic changes were measured using prefrontal cortex near infrared spectroscopy and represented by instantaneous phase values. Based on spatial and temporal characteristics of these observed hemodynamic changes, we defined a consistency index to represent variability across these domains. The consistency index across repeated exposures to the same piece of music was compared to the consistency index corresponding to prefrontal activity from randomly matched non-identical musical excerpts. Consistency indexes were significantly different for identical versus non-identical musical excerpts when comparing a subset of repetitions. When all four exposures were compared, no significant difference was observed between the consistency indexes of randomly matched non-identical musical excerpts and the consistency index corresponding to repetitions of the same musical excerpts. This observation suggests the existence of only partial consistency between repeated exposures to the same musical excerpt, which may stem from the role of the prefrontal cortex in regulating other cognitive and emotional processes. PMID:25837268
Selecting informative subsets of sparse supermatrices increases the chance to find correct trees.
Misof, Bernhard; Meyer, Benjamin; von Reumont, Björn Marcus; Kück, Patrick; Misof, Katharina; Meusemann, Karen
2013-12-03
Character matrices with extensive missing data are frequently used in phylogenomics, with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. The drawback of these selections is their exclusive reliance on data coverage without consideration of actual signal in the data, so they might not deliver optimal data matrices in terms of potential phylogenetic signal. In order to circumvent this problem, we have developed a heuristic, implemented in a software tool called mare, which (1) assesses the information content of genes in supermatrices using a measure of potential signal combined with data coverage, and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10-30%. With these matrices, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. Selecting a data subset with the herein proposed approach increased the chance to recover correct partial trees more than 10-fold. The selection of data subsets with the proposed simple hill climbing procedure performed well whether it considered the information content or just simple presence/absence information of genes. We also applied our approach to an empirical data set addressing questions of vertebrate systematics. With this empirical dataset, selecting a data subset with high information content that supported a tree with high average bootstrap support was most successful when the information content of genes was considered. Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis, outperforming the usual simple selections of taxa and genes with high data coverage.
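A toy hill climber in the spirit of the described procedure is sketched below: greedily drop the taxon or gene whose removal most improves a score combining signal and coverage. The score function is an invented stand-in; mare's actual information measure is more sophisticated.

```python
# Toy hill climber over a sparse 50 x 50 supermatrix: each cell carries a
# signal value in [0, 1], missing cells are NaN; score = total signal x coverage.
import numpy as np

rng = np.random.default_rng(8)
M = rng.random((50, 50))
M[rng.random((50, 50)) > 0.25] = np.nan            # ~75% missing data

def score(taxa, genes):
    sub = M[np.ix_(taxa, genes)]
    info = np.nansum(sub)                          # total potential signal
    coverage = 1.0 - np.isnan(sub).mean()
    return info * coverage

taxa, genes = list(range(50)), list(range(50))
improved = True
while improved:
    improved, base = False, score(taxa, genes)
    for pool in (taxa, genes):
        for item in list(pool):
            pool.remove(item)
            if score(taxa, genes) > base:
                improved = True                    # accept first improving move
                break
            pool.append(item)                      # undo non-improving removal
        if improved:
            break
print(f"kept {len(taxa)} taxa x {len(genes)} genes")
```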
Saavedra, Laura M; Romanelli, Gustavo P; Rozo, Ciro E; Duchowicz, Pablo R
2018-01-01
The insecticidal activity of a series of 62 plant derived molecules against the chikungunya, dengue and zika vector, the Aedes aegypti (Diptera:Culicidae) mosquito, is subjected to a Quantitative Structure-Activity Relationships (QSAR) analysis. The Replacement Method (RM) variable subset selection technique based on Multivariable Linear Regression (MLR) proves to be successful for exploring 4885 molecular descriptors calculated with Dragon 6. The predictive capability of the obtained models is confirmed through an external test set of compounds, Leave-One-Out (LOO) cross-validation and Y-Randomization. The present study constitutes a first necessary computational step for designing less toxic insecticides. Copyright © 2017 Elsevier B.V. All rights reserved.
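The Replacement Method can be sketched as follows: starting from a random descriptor subset of fixed size, repeatedly swap single descriptors for whichever candidate most reduces the residual standard deviation of the MLR fit, until no swap improves. The data are synthetic stand-ins for the Dragon descriptor matrix, and the stopping rule is a simplification.

```python
# Minimal sketch of Replacement Method (RM) variable subset selection for MLR.
import numpy as np

rng = np.random.default_rng(9)
n_mol, n_desc, d = 62, 300, 4                  # molecules, descriptors, model size
X = rng.normal(size=(n_mol, n_desc))
y = X[:, [3, 40, 77, 150]] @ np.array([1.0, -0.5, 0.8, 0.3]) \
    + 0.1 * rng.normal(size=n_mol)

def resid_sd(cols):
    """Residual standard deviation of an MLR fit on the given descriptors."""
    A = np.column_stack([np.ones(n_mol), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.std(y - A @ coef)

subset = list(rng.choice(n_desc, d, replace=False))
improved = True
while improved:                                # loop until a full pass swaps nothing
    improved = False
    for pos in range(d):
        trial = subset.copy()
        for cand in range(n_desc):
            if cand in subset:
                continue
            trial[pos] = cand
            if resid_sd(trial) < resid_sd(subset):
                subset, improved = trial.copy(), True
print("selected descriptors:", sorted(subset))
```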
Marcus, Bernd; Wagner, Uwe
2007-04-01
In the present research, we investigated the joint impact of selected antecedents of counterproductive work behavior (CWB). A sample of German apprentices reported on their CWB and completed measures of situational evaluations (vocational preference, level and constructiveness of job satisfaction) believed to trigger CWB and of dispositional motivators (measured by integrity test subscales) and controls (self-control and another subset of integrity scales) of CWB. All predictors investigated showed the expected bivariate relationships with CWB. Multivariate analyses revealed that the triggering effect of an unfavorable vocational choice on CWB was fully mediated by job satisfaction. When predictors were aggregated, a composite of dispositional control variables had the largest effect on CWB and moderated the effects of motivational dispositions and situational evaluations. These results extend the knowledge on antecedents of CWB by investigating previously overlooked variables and samples and partially replicate recent findings on the joint impact of dispositions and work-related evaluations on CWB. Copyright (c) 2007 APA, all rights reserved.
1976-07-01
PURDUE UNIVERSITY, DEPARTMENT OF STATISTICS, DIVISION OF MATHEMATICAL SCIENCES. ON SUBSET SELECTION PROCEDURES FOR POISSON PROCESSES AND SOME... Mimeograph Series #457, July 1976. This research was supported by the Office of Naval Research under Contract N00014-75-C-0455 at Purdue University.
Efficient least angle regression for identification of linear-in-the-parameters models
Beach, Thomas H.; Rezgui, Yacine
2017-01-01
Least angle regression, as a promising model selection method, differentiates itself from conventional stepwise and stagewise methods, in that it is neither too greedy nor too slow. It is closely related to L1 norm optimization, which has the advantage of low prediction variance through sacrificing part of model bias property in order to enhance model generalization capability. In this paper, we propose an efficient least angle regression algorithm for model selection for a large class of linear-in-the-parameters models with the purpose of accelerating the model selection process. The entire algorithm works completely in a recursive manner, where the correlations between model terms and residuals, the evolving directions and other pertinent variables are derived explicitly and updated successively at every subset selection step. The model coefficients are only computed when the algorithm finishes. The direct involvement of matrix inversions is thereby relieved. A detailed computational complexity analysis indicates that the proposed algorithm possesses significant computational efficiency, compared with the original approach where the well-known efficient Cholesky decomposition is involved in solving least angle regression. Three artificial and real-world examples are employed to demonstrate the effectiveness, efficiency and numerical stability of the proposed algorithm. PMID:28293140
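The piecewise-linear coefficient path and term-entry order that least angle regression produces can be inspected with scikit-learn's lars_path; the data below are synthetic, and this is the standard LAR solver rather than the paper's recursive variant.

```python
# Exploring least angle regression: full coefficient path and the order in
# which model terms enter the active set.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 30))                        # candidate model terms
y = X[:, 5] - 2.0 * X[:, 12] + 0.1 * rng.normal(size=200)

alphas, active, coefs = lars_path(X, y, method="lar")
print("terms entered in order:", active[:5])          # subset selection order
print("path shape (terms x steps):", coefs.shape)
```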
Frisch-Daiello, Jessica L; Williams, Mary R; Waddell, Erin E; Sigman, Michael E
2014-03-01
The unsupervised artificial neural networks method of self-organizing feature maps (SOFMs) is applied to spectral data of ignitable liquids to visualize the grouping of similar ignitable liquids with respect to their American Society for Testing and Materials (ASTM) class designations and to determine the ions associated with each group. The spectral data consists of extracted ion spectra (EIS), defined as the time-averaged mass spectrum across the chromatographic profile for select ions, where the selected ions are a subset of ions from Table 2 of the ASTM standard E1618-11. Utilization of the EIS allows for inter-laboratory comparisons without the concern of retention time shifts. The trained SOFM demonstrates clustering of the ignitable liquid samples according to designated ASTM classes. The EIS of select samples designated as miscellaneous or oxygenated as well as ignitable liquid residues from fire debris samples are projected onto the SOFM. The results indicate the similarities and differences between the variables of the newly projected data compared to those of the data used to train the SOFM. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
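Training a self-organizing feature map on extracted ion spectra can be sketched with the third-party MiniSom package (pip install minisom); the EIS vectors, grid size, and training settings below are assumptions for illustration, not the study's configuration.

```python
# SOFM sketch: train on normalized extracted ion spectra, then map each sample
# (and, later, projected casework samples) to its winning node; samples of the
# same ASTM class should land on nearby nodes.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(11)
eis = rng.random((120, 40))                 # 120 liquids x 40 selected ions
eis /= eis.sum(axis=1, keepdims=True)       # normalize each spectrum

som = MiniSom(8, 8, input_len=40, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(eis, 5000)                 # unsupervised training

cells = [som.winner(x) for x in eis]        # grid coordinates per sample
print(cells[:5])
```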
GES DAAC HDF Data Processing and Visualization Tools
NASA Astrophysics Data System (ADS)
Ouzounov, D.; Cho, S.; Johnson, J.; Li, J.; Liu, Z.; Lu, L.; Pollack, N.; Qin, J.; Savtchenko, A.; Teng, B.
2002-12-01
The Goddard Earth Sciences (GES) Distributed Active Archive Center (DAAC) plays a major role in enabling basic scientific research and providing access to scientific data to the general user community. Several GES DAAC Data Support Teams provide expert assistance to users in accessing data, including information on visualization tools and documentation for data products. To provide easy access to the science data, the data support teams have additionally developed many online and desktop tools for data processing and visualization. This presentation is an overview of major HDF tools implemented at the GES DAAC and aimed at optimizing access to EOS data for the Earth Sciences community.

GES DAAC ONLINE TOOLS: MODIS and AIRS on-demand Channel/Variable Subsetter are web-based, on-the-fly/on-demand subsetters that perform channel/variable subsetting and restructuring for Level 1B and Level 2 data products. Users can specify criteria to subset data files with desired channels and variables and then download the subsetted file. AIRS QuickLook is a CGI/IDL combo package that allows users to view AIRS/HSB/AMSU Level-1B data online by specifying a channel prior to obtaining data. A global map is also provided along with the image to show geographic coverage of the granule and flight direction of the spacecraft. OASIS (Online data AnalySIS) is an IDL-based HTML/CGI interface for search, selection, and simple analysis of earth science data. It supports binary and GRIB formatted data, such as TOVS, Data Assimilation products, and some NCEP operational products. TRMM Online Analysis System is designed for quick exploration, analyses, and visualization of TRMM Level-3 and other precipitation products. The products consist of the daily (3B42), monthly (3B43), near-real-time (3B42RT), and Willmott's climate data. The system is also designed to be simple and easy to use - users can plot the average or accumulated rainfall over their region of interest for a given time period, or plot the time series of regional rainfall average. WebGIS is an online web software that implements the Open GIS Consortium (OGC) standards for mapping requests and rendering. It allows users access to TRMM, MODIS, SeaWiFS, and AVHRR data from several DAAC map servers, as well as externally served data such as political boundaries, population centers, lakes, rivers, and elevation.

GES DAAC DESKTOP TOOLS: HDFLook-MODIS is a new, multifunctional, data processing and visualization tool for Radiometric and Geolocation, Atmosphere, Ocean, and Land MODIS HDF-EOS data. Features include (1) accessing and visualization of all swath (Levels 1 and 2) MODIS and AIRS products, and gridded (Levels 3 and 4) MODIS products; (2) re-mapping of swath data to world map; (3) geo-projection conversion; (4) interactive and batch mode capabilities; (5) subsetting and multi-granule processing; and (6) data conversion. SIMAP is an IDL-based script that is designed to read and map MODIS Level 1B (L1B) and Level 2 (L2) Ocean and Atmosphere products. It is a non-interactive, command line executed tool. The resulting maps are scaled to physical units (e.g., radiances, concentrations, brightness temperatures) and saved in binary files. TRMM HDF (in C and Fortran) reads in TRMM HDF data files and writes out user-selected SDS arrays and Vdata tables as separate flat binary files.
Profiling dendritic cell subsets in head and neck squamous cell tonsillar cancer and benign tonsils.
Abolhalaj, Milad; Askmyr, David; Sakellariou, Christina Alexandra; Lundberg, Kristina; Greiff, Lennart; Lindstedt, Malin
2018-05-23
Dendritic cells (DCs) have a key role in orchestrating immune responses and are considered important targets for immunotherapy against cancer. In order to develop effective cancer vaccines, detailed knowledge of the micromilieu in cancer lesions is warranted. In this study, flow cytometry and human transcriptome arrays were used to characterize subsets of DCs in head and neck squamous cell tonsillar cancer and compare them to their counterparts in benign tonsils to evaluate subset-selective biomarkers associated with tonsillar cancer. We describe, for the first time, four subsets of DCs in tonsillar cancer: CD123+ plasmacytoid DCs (pDC) and CD1c+, CD141+, and CD1c-CD141- myeloid DCs (mDC). An increased frequency of DCs and an elevated mDC/pDC ratio were shown in malignant compared to benign tonsillar tissue. The microarray data demonstrate characteristics specific to tonsil cancer DC subsets, including expression of immunosuppressive molecules and lower expression levels of genes involved in the development of effector immune responses in DCs in malignant, compared to benign, tonsillar tissue. Finally, we present target candidates selectively expressed by different DC subsets in malignant tonsils and confirm expression of CD206/MRC1 and CD207/Langerin on CD1c+ DCs at the protein level. This study describes DC characteristics in the context of head and neck cancer and adds valuable steps towards future DC-based therapies against tonsillar cancer.
Metabolomics analysis was performed on the supernatant of human embryonic stem (hES) cell cultures exposed to a blinded subset of 11 chemicals selected from the chemical library of EPA's ToxCast™ chemical screening and prioritization research project. Metabolites from hES cultur...
Wang, Ching-Yun; Song, Xiao
2017-01-01
Biomedical researchers are often interested in estimating the effect of an environmental exposure in relation to a chronic disease endpoint. However, the exposure variable of interest may be measured with errors. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies an additive measurement error model, but it may not have repeated measurements. The subset in which the surrogate variables are available is called a calibration sample. In addition to the surrogate variables that are available among the subjects in the calibration sample, we consider the situation when there is an instrumental variable available for all study subjects. An instrumental variable is correlated with the unobserved true exposure variable, and hence can be useful in the estimation of the regression coefficients. In this paper, we propose a nonparametric method for Cox regression using the observed data from the whole cohort. The nonparametric estimator is the best linear combination of a nonparametric correction estimator from the calibration sample and the difference of the naive estimators from the calibration sample and the whole cohort. The asymptotic distribution is derived, and the finite sample performance of the proposed estimator is examined via intensive simulation studies. The methods are applied to the Nutritional Biomarkers Study of the Women's Health Initiative. PMID:27546625
Mapping tropical rainforest canopies using multi-temporal spaceborne imaging spectroscopy
NASA Astrophysics Data System (ADS)
Somers, Ben; Asner, Gregory P.
2013-10-01
The use of imaging spectroscopy for floristic mapping of forests is complicated by the spectral similarity among coexisting species. Here we evaluated an alternative spectral unmixing strategy combining a time series of EO-1 Hyperion images and an automated feature selection strategy in MESMA. Instead of using the same spectral subset to unmix each image pixel, our modified approach allowed the spectral subsets to vary on a per pixel basis such that each pixel is evaluated using a spectral subset tuned towards maximal separability of its specific endmember class combination or species mixture. The potential of the new approach for floristic mapping of tree species in Hawaiian rainforests was quantitatively demonstrated using both simulated and actual hyperspectral image time-series. With a Cohen's Kappa coefficient of 0.65, our approach provided a more accurate tree species map compared to MESMA (Kappa = 0.54). In addition, by the selection of spectral subsets our approach was about 90% faster than MESMA. The flexible or adaptive use of band sets in spectral unmixing as such provides an interesting avenue to address spectral similarities in complex vegetation canopies.
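The effect of unmixing a pixel with a pixel-specific band subset can be sketched with non-negative least squares; real MESMA additionally iterates over endmember models, and the "separability" band ranking below is an invented proxy for the paper's feature selection.

```python
# Per-pixel unmixing with a pixel-specific band subset, via non-negative
# least squares on a synthetic 3-endmember spectral library.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(12)
n_bands = 50
E = np.abs(rng.normal(size=(n_bands, 3)))        # 3 candidate species spectra
true = np.array([0.6, 0.4, 0.0])
pixel = E @ true + 0.01 * rng.normal(size=n_bands)

full, _ = nnls(E, pixel)                          # all bands
bands = np.argsort(np.var(E, axis=1))[-20:]       # subset 'tuned' for separability
sub, _ = nnls(E[bands], pixel[bands])             # per-pixel band subset
print("full-band abundances:  ", full.round(2))
print("subset-band abundances:", sub.round(2))
```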
Application of machine learning on brain cancer multiclass classification
NASA Astrophysics Data System (ADS)
Panca, V.; Rustam, Z.
2017-07-01
Classification of brain cancer is a problem of multiclass classification. One approach to solve this problem is by first transforming it into several binary problems. The microarray gene expression dataset has the two main characteristics of medical data: extremely many features (genes) and only a small number of samples. The application of machine learning on a microarray gene expression dataset mainly consists of two steps: feature selection and classification. In this paper, the features are selected using a method based on the support vector machine recursive feature elimination (SVM-RFE) principle, improved to solve multiclass classification, called multiple multiclass SVM-RFE. Instead of using only the selected features on a single classifier, this method combines the results of multiple classifiers. The features are divided into subsets and SVM-RFE is used on each subset. Then, the selected features of each subset are put on separate classifiers. This method enhances the feature selection ability of each single SVM-RFE. Twin support vector machine (TWSVM) is used as the classification method to reduce computational complexity. While an ordinary SVM finds a single optimal hyperplane, the main objective of Twin SVM is to find two non-parallel optimal hyperplanes. The experiment on the brain cancer microarray gene expression dataset shows this method could classify 71.4% of the overall test data correctly, using 100 and 1000 genes selected by the multiple multiclass SVM-RFE feature selection method. Furthermore, the per-class results show that this method could classify data of the normal and MD classes with 100% accuracy.
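The SVM-RFE building block underlying the method can be sketched with scikit-learn's RFE and a linear SVM on synthetic microarray-like data; the paper's multiple multiclass variant runs this per feature subset and per binary subproblem and then pools the selections, which is not reproduced here.

```python
# Baseline SVM-RFE: iteratively eliminate the lowest-weighted genes of a
# linear SVM until a target feature count remains.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(13)
X = rng.normal(size=(90, 2000))                 # samples x genes (microarray-like)
y = rng.integers(0, 3, size=90)                 # three tumour classes

rfe = RFE(LinearSVC(), n_features_to_select=100, step=0.1)
rfe.fit(X, y)
selected_genes = np.where(rfe.support_)[0]
print("kept", selected_genes.size, "genes; first few:", selected_genes[:5])
```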
Local sensitivity analyses and identifiable parameter subsets were used to describe numerical constraints of a hypoxia model for bottom waters of the northern Gulf of Mexico. The sensitivity of state variables differed considerably with parameter changes, although most variables ...
Lorber, M.; Johnson, Kevin; Kross, B.; Pinsky, P.; Burmeister, L.; Thurman, M.; Wilkins, A.; Hallberg, G.
1997-01-01
In 1988, the Iowa Department of Natural Resources, along with the University of Iowa conducted the Statewide Rural Well Water Survey, commonly known as SWRL. A total of 686 private rural drinking water wells was selected by use of a probability sample and tested for pesticides and nitrates. Sixty-eight of these wells, the '10% repeat' wells, were additionally sampled in October, 1990 and June, 1991. Starting in November, 1991, the University of Iowa, with sponsorship from the United States Environmental Protection Agency, revisited these wells to begin a study of the temporal variability of atrazine and nitrates in wells. Other wells, which had originally tested positive for atrazine in SWRL but were not in the 10% repeat population, were added to the study population. Temporal sampling for a year-long period began in February of 1992 and concluded in January of 1993. All wells were sampled monthly, one subset was sampled weekly, and a second subset was sampled for 14-day consecutive periods. Two unique aspects of this study were the use of an immunoassay technique to screen for triazines before gas chromatography/mass spectrometry (GC/MS) analysis and quantification of atrazine, and the use of well owners to sample the wells. A total of 1771 samples from 83 wells are in the final data base for this study. This paper reviews the study design, the analytical methodologies, and development of the data base. A companion paper discusses the analysis of the data from this survey.
A study of the temporal variability of atrazine in private well water. part ii: analysis of data
Pinsky, Paul; Lorber, Matthew; Johnson, Kent; Kross, Burton; Burmeister, Leon; Wilkins, Amina; Hallberg, George
1997-01-01
In 1988, the Iowa Department of Natural Resources, along with the University of Iowa, conducted the Statewide Rural Well Water Survey, commonly known as SWRL. A total of 686 private rural drinking water wells was selected by use of a probability sample and tested for pesticides and nitrate. A subset of these wells, the 10% repeat wells, were additionally sampled in October, 1990 and June, 1991. Starting in November, 1991, the University of Iowa, with sponsorship from the United States Environmental Protection Agency, revisited the 10% repeat wells to begin a study of the temporal variability of atrazine and nitrate in wells. Other wells, which had originally tested positive for atrazine in SWRL but were not in the 10% population, were added to the study population. Temporal sampling for a year-long period began in February of 1992 and concluded in January of 1993. All wells were sampled monthly, a subset was sampled weekly, and a second subset was sampled for 14-day consecutive periods. Of the 67 wells in the 10% population tested monthly, 7 (10.4%) tested positive for atrazine at least once during the year, and 3 (4%) were positive each of the 12 months. The average concentration in the 7 wells was 0.10 µg/L. For nitrate, 15 (22%) wells in the 10% repeat population monthly sampling were above the Maximum Contaminant Level of 10 mg/L at least once. This paper, the second of two papers on this study, describes the analysis of data from the survey. The first paper (Lorber et al., 1997) reviews the study design, the analytical methodologies, and development of the data base.
Belcher, C.N.; Jennings, Cecil A.
2010-01-01
We examined the effects of selected water quality variables on the presence of subadult sharks in six of nine Georgia estuaries. During 231 longline sets, we captured 415 individuals representing nine species. Atlantic sharpnose shark (Rhizoprionodon terranovae), bonnethead (Sphyrna tiburo), blacktip shark (Carcharhinus limbatus) and sandbar shark (C. plumbeus) comprised 96.1% of the catch. Canonical correlation analysis (CCA) was used to assess environmental influences on the assemblage of the four common species. Results of the CCA indicated that bonnethead shark and sandbar shark were correlated with each other and with a subset of environmental variables. When the species occurred singly, depth was the defining environmental variable; whereas, when the two co-occurred, dissolved oxygen and salinity were the defining variables. Discriminant analyses (DA) were used to assess environmental influences on individual species. Results of the discriminant analyses supported the general CCA findings that bonnethead and sandbar shark were the only two species whose presence correlated with environmental variables. In addition to depth and dissolved oxygen, turbidity influenced the presence of sandbar shark. The presence of bonnethead shark was influenced primarily by salinity and turbidity. Significant relationships existed for both the CCA and DA analyses; however, environmental variables accounted for <16% of the total variation in each. Compared to the environmental variables we measured, macrohabitat features (e.g., substrate type), prey availability, and susceptibility to predation may have stronger influences on the presence and distribution of subadult shark species among sites.
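Canonical correlation between an environmental matrix and a species presence matrix, as used above, can be sketched with scikit-learn; the variables and effect structure below are invented for illustration.

```python
# Canonical correlation between environmental variables and species presence.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(14)
n_sets = 231
env = rng.normal(size=(n_sets, 4))        # depth, DO, salinity, turbidity
species = (rng.random((n_sets, 4)) <      # presence/absence of 4 common species
           0.3 + 0.1 * (env > 0)).astype(float)

cca = CCA(n_components=2).fit(env, species)
U, V = cca.transform(env, species)
r1 = np.corrcoef(U[:, 0], V[:, 0])[0, 1]  # first canonical correlation
print(f"first canonical correlation: {r1:.2f}")
```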
Optimizing data collection for public health decisions: a data mining approach
2014-01-01
Background Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test if the careful removal of items from two community nutrition surveys guided by a data mining technique called feature selection, can (a) identify a reduced dataset, while (b) not damaging the signal inside that data. Methods The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test. Results Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R2 values of 92% and 94% for restaurant and grocery store data, respectively. Conclusions While there are many potentially important variables in any domain, the most useful set may only be a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed thereby reducing cost. PMID:24919484
Optimizing data collection for public health decisions: a data mining approach.
Partington, Susan N; Papakroni, Vasil; Menzies, Tim
2014-06-12
Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test whether the careful removal of items from two community nutrition surveys, guided by a data mining technique called feature selection, can (a) identify a reduced dataset while (b) not damaging the signal inside that data. The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test. Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R² values of 92% and 94% for restaurant and grocery store data, respectively. While there are many potentially important variables in any domain, the most useful set may only be a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed, thereby reducing cost.
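A minimal sketch of the workflow the preceding abstract describes, on synthetic data (the item counts and scoring are invented, not the NEMS data): rank survey items by how well they predict the full-survey score, keep a small subset, weight the kept items with linear-regression coefficients, and compare reduced and full scores with a rank-sum test.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(885, 50)).astype(float)      # 885 outlets x 50 survey items
full_score = X @ rng.normal(size=50) + rng.normal(scale=2.0, size=885)

selector = SelectKBest(f_regression, k=9).fit(X, full_score)  # keep 9 items, as for NEMS-S
X_red = selector.transform(X)

model = LinearRegression().fit(X_red, full_score)
reduced_score = model.predict(X_red)                      # weighted sum of kept items + intercept
print("R^2 of reduced-item model:", model.score(X_red, full_score))

# Do the two score distributions differ? A large p-value suggests the
# reduction did not detectably damage the signal.
stat, p = mannwhitneyu(full_score, reduced_score)
print("rank-sum p-value:", p)
```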
A direct-gradient multivariate index of biotic condition
Miranda, Leandro E.; Aycock, J.N.; Killgore, K. J.
2012-01-01
Multimetric indexes constructed by summing metric scores have been criticized despite many of their merits. A leading criticism is the potential for investigator bias involved in metric selection and scoring. Often there is a large number of competing metrics equally well correlated with environmental stressors, requiring a judgment call by the investigator to select the most suitable metrics to include in the index and how to score them. Data-driven procedures for multimetric index formulation published during the last decade have reduced this limitation, yet apprehension remains. Multivariate approaches that select metrics with statistical algorithms may reduce the level of investigator bias and alleviate a weakness of multimetric indexes. We investigated the suitability of a direct-gradient multivariate procedure to derive an index of biotic condition for fish assemblages in oxbow lakes in the Lower Mississippi Alluvial Valley. Although this multivariate procedure also requires that the investigator identify a set of suitable metrics potentially associated with a set of environmental stressors, it is different from multimetric procedures because it limits investigator judgment in selecting a subset of biotic metrics to include in the index and because it produces metric weights suitable for computation of index scores. The procedure, applied to a sample of 35 competing biotic metrics measured at 50 oxbow lakes distributed over a wide geographical region in the Lower Mississippi Alluvial Valley, selected 11 metrics that adequately indexed the biotic condition of five test lakes. Because the multivariate index includes only metrics that explain the maximum variability in the stressor variables rather than a balanced set of metrics chosen to reflect various fish assemblage attributes, it is fundamentally different from multimetric indexes of biotic integrity with advantages and disadvantages. As such, it provides an alternative to multimetric procedures.
Zhou, Shang-Ming; Lyons, Ronan A.; Brophy, Sinead; Gravenor, Mike B.
2012-01-01
The Takagi-Sugeno (TS) fuzzy rule system is a widely used data mining technique, and is of particular use in the identification of non-linear interactions between variables. However the number of rules increases dramatically when applied to high dimensional data sets (the curse of dimensionality). Few robust methods are available to identify important rules while removing redundant ones, and this results in limited applicability in fields such as epidemiology or bioinformatics where the interaction of many variables must be considered. Here, we develop a new parsimonious TS rule system. We propose three statistics: R, L, and ω-values, to rank the importance of each TS rule, and a forward selection procedure to construct a final model. We use our method to predict how key components of childhood deprivation combine to influence educational achievement outcome. We show that a parsimonious TS model can be constructed, based on a small subset of rules, that provides an accurate description of the relationship between deprivation indices and educational outcomes. The selected rules shed light on the synergistic relationships between the variables, and reveal that the effect of targeting specific domains of deprivation is crucially dependent on the state of the other domains. Policy decisions need to incorporate these interactions, and deprivation indices should not be considered in isolation. The TS rule system provides a basis for such decision making, and has wide applicability for the identification of non-linear interactions in complex biomedical data. PMID:23272108
Variable importance in nonlinear kernels (VINK): classification of digitized histopathology.
Ginsburg, Shoshana; Ali, Sahirzeeshan; Lee, George; Basavanhally, Ajay; Madabhushi, Anant
2013-01-01
Quantitative histomorphometry is the process of modeling appearance of disease morphology on digitized histopathology images via image-based features (e.g., texture, graphs). Due to the curse of dimensionality, building classifiers with large numbers of features requires feature selection (which may require a large training set) or dimensionality reduction (DR). DR methods map the original high-dimensional features in terms of eigenvectors and eigenvalues, which limits the potential for feature transparency or interpretability. Although methods exist for variable selection and ranking on embeddings obtained via linear DR schemes (e.g., principal components analysis (PCA)), similar methods do not yet exist for nonlinear DR (NLDR) methods. In this work we present a simple yet elegant method for approximating the mapping between the data in the original feature space and the transformed data in the kernel PCA (KPCA) embedding space; this mapping provides the basis for quantification of variable importance in nonlinear kernels (VINK). We show how VINK can be implemented in conjunction with the popular Isomap and Laplacian eigenmap algorithms. VINK is evaluated in the contexts of three different problems in digital pathology: (1) predicting five year PSA failure following radical prostatectomy, (2) predicting Oncotype DX recurrence risk scores for ER+ breast cancers, and (3) distinguishing good and poor outcome p16+ oropharyngeal tumors. We demonstrate that subsets of features identified by VINK provide similar or better classification or regression performance compared to the original high dimensional feature sets.
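The paper's exact VINK mapping is not reproduced here; the sketch below only illustrates the general idea of scoring original variables against a nonlinear embedding, using an assumption of my own: fit a linear surrogate from the original features to the kernel PCA coordinates and aggregate its absolute weights, with the data and weighting scheme invented.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LinearRegression

X, _ = make_classification(n_samples=300, n_features=20, random_state=0)

kpca = KernelPCA(n_components=5, kernel="rbf", gamma=0.05)
Z = kpca.fit_transform(X)                  # nonlinear embedding coordinates

# Approximate the feature -> embedding map with a linear surrogate, then
# score each original feature by its aggregate absolute weight across
# embedding dimensions, weighted by the variance each dimension carries.
surrogate = LinearRegression().fit(X, Z)
dim_weight = Z.var(axis=0)
importance = np.abs(surrogate.coef_).T @ dim_weight    # one score per original feature
ranking = np.argsort(importance)[::-1]
print("top 5 features:", ranking[:5])
```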
Gorodeski, Eiran Z.; Ishwaran, Hemant; Kogalur, Udaya B.; Blackstone, Eugene H.; Hsich, Eileen; Zhang, Zhu-ming; Vitolins, Mara Z.; Manson, JoAnn E.; Curb, J. David; Martin, Lisa W.; Prineas, Ronald J.; Lauer, Michael S.
2013-01-01
Background Simultaneous contribution of hundreds of electrocardiographic biomarkers to prediction of long-term mortality in post-menopausal women with clinically normal resting electrocardiograms (ECGs) is unknown. Methods and Results We analyzed ECGs and all-cause mortality in 33,144 women enrolled in Women’s Health Initiative trials, who were without baseline cardiovascular disease or cancer, and had normal ECGs by Minnesota and Novacode criteria. Four hundred and seventy-seven ECG biomarkers, encompassing global and individual ECG findings, were measured using computer algorithms. During a median follow-up of 8.1 years (range for survivors 0.5–11.2 years), 1,229 women died. For analyses, the cohort was randomly split into derivation (n=22,096, deaths=819) and validation (n=11,048, deaths=410) subsets. ECG biomarkers, demographic, and clinical characteristics were simultaneously analyzed using both traditional Cox regression and Random Survival Forest (RSF), a novel algorithmic machine-learning approach. Regression modeling failed to converge. RSF variable selection yielded 20 variables that were independently predictive of long-term mortality, 14 of which were ECG biomarkers related to autonomic tone, atrial conduction, and ventricular depolarization and repolarization. Conclusions We identified 14 ECG biomarkers from amongst hundreds that were associated with long-term prognosis using a novel random forest variable selection methodology. These were related to autonomic tone, atrial conduction, ventricular depolarization, and ventricular repolarization. Quantitative ECG biomarkers have prognostic importance, and may be markers of subclinical disease in apparently healthy post-menopausal women. PMID:21862719
Stekolnikov, Alexandr A; Klimov, Pavel B
2010-09-01
We revise chiggers belonging to the minuta-species group (genus Neotrombicula Hirst, 1925) from the Palaearctic using size-free multivariate morphometrics. This approach allowed us to resolve several diagnostic problems. We show that the widely distributed Neotrombicula scrupulosa Kudryashova, 1993 forms three spatially and ecologically isolated groups different from each other in size or shape (morphometric property) only: specimens from the Caucasus are distinct from those from Asia in shape, whereas the Asian specimens from plains and mountains are different from each other in size. We developed a multivariate classification model to separate three closely related species: N. scrupulosa, N. lubrica Kudryashova, 1993 and N. minuta Schluger, 1966. This model is based on five shape variables selected from an initial 17 variables by a best subset analysis using a custom size-correction subroutine. The variable selection procedure slightly improved the predictive power of the model, suggesting that it not only removed redundancy but also reduced 'noise' in the dataset. The overall classification accuracy of this model is 96.2, 96.2 and 95.5%, as estimated by internal validation, external validation and jackknife statistics, respectively. Our analyses resulted in one new synonymy: N. dimidiata Stekolnikov, 1995 is considered to be a synonym of N. lubrica. Both N. scrupulosa and N. lubrica are recorded from new localities. A key to species of the minuta-group incorporating results from our multivariate analyses is presented.
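A hedged sketch of a best-subset analysis like the one the preceding abstract mentions (the size-correction subroutine and the real morphometric data are not reproduced; the data here are synthetic stand-ins): exhaustively score every 5-variable subset of 17 candidates with cross-validated linear discriminant analysis, which is feasible at this size (C(17,5) = 6188 subsets).

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for 17 size-corrected shape variables, 3 species classes.
X, y = make_classification(n_samples=240, n_features=17, n_informative=6,
                           n_classes=3, random_state=1)

best_acc, best_subset = 0.0, None
for subset in combinations(range(17), 5):          # exhaustive over 6188 candidates
    acc = cross_val_score(LinearDiscriminantAnalysis(),
                          X[:, list(subset)], y, cv=5).mean()
    if acc > best_acc:
        best_acc, best_subset = acc, subset

print(f"best 5-variable subset: {best_subset}, CV accuracy: {best_acc:.3f}")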
Julian, Mark C.; Li, Lijuan; Garde, Shekhar; Wilen, Rebecca; Tessier, Peter M.
2017-01-01
The ability of antibodies to accumulate affinity-enhancing mutations in their complementarity-determining regions (CDRs) without compromising thermodynamic stability is critical to their natural function. However, it is unclear if affinity mutations in the hypervariable CDRs generally impact antibody stability and to what extent additional compensatory mutations are required to maintain stability during affinity maturation. Here we have experimentally and computationally evaluated the functional contributions of mutations acquired by a human variable (VH) domain that was evolved using strong selections for enhanced stability and affinity for the Alzheimer’s Aβ42 peptide. Interestingly, half of the key affinity mutations in the CDRs were destabilizing. Moreover, the destabilizing effects of these mutations were compensated for by a subset of the affinity mutations that were also stabilizing. Our findings demonstrate that the accumulation of both affinity and stability mutations is necessary to maintain thermodynamic stability during extensive mutagenesis and affinity maturation in vitro, which is similar to findings for natural antibodies that are subjected to somatic hypermutation in vivo. These findings for diverse antibodies and antibody fragments specific for unrelated antigens suggest that the formation of the antigen-binding site is generally a destabilizing process and that co-enrichment for compensatory mutations is critical for maintaining thermodynamic stability. PMID:28349921
The Performance of Short-Term Heart Rate Variability in the Detection of Congestive Heart Failure
Barros, Allan Kardec; Ohnishi, Noboru
2016-01-01
Congestive heart failure (CHF) is a cardiac disease associated with decreasing cardiac output. CHF has been shown to be the main cause of cardiac death around the world. Some works have proposed to discriminate CHF subjects from healthy subjects using either the electrocardiogram (ECG) or heart rate variability (HRV) from long-term recordings. In this work, we propose an alternative framework to discriminate CHF from healthy subjects by using short-term HRV intervals based on 256 continuous RR samples. Our framework uses a matching pursuit algorithm based on Gabor functions. From the selected Gabor functions, we derived a set of features that are input into a hybrid framework which uses a genetic algorithm and a k-nearest neighbour classifier to select the subset of features with the best classification performance. The performance of the framework is analyzed using both the Fantasia and the CHF databases from the Physionet archives, which are composed of 40 healthy volunteers and 29 CHF subjects, respectively. From a set of 16 nonstandard features, the proposed framework reaches an overall accuracy of 100% with five features. Our results suggest that hybrid frameworks in which a genetic algorithm drives feature selection for the classifier can outperform well-known classifier methods. PMID:27891509
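A minimal sketch of the genetic-algorithm/kNN hybrid the preceding abstract describes, on synthetic data (the Gabor feature extraction is not reproduced, and the population size, crossover, and mutation settings are my own assumptions): each chromosome is a binary feature mask, and its fitness is the cross-validated kNN accuracy on the masked features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=120, n_features=16, n_informative=5,
                           random_state=0)

def fitness(mask):
    """CV accuracy of a 3-NN classifier restricted to the masked features."""
    if not mask.any():
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(knn, X[:, mask], y, cv=5).mean()

pop = rng.random((20, 16)) < 0.5                      # 20 random feature masks
for generation in range(30):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)][-10:]           # keep the 10 fittest masks
    cut = rng.integers(1, 15, size=10)                # one-point crossover
    children = np.array([np.concatenate((parents[i][:c], parents[(i + 1) % 10][c:]))
                         for i, c in enumerate(cut)])
    children ^= rng.random(children.shape) < 0.05     # bit-flip mutation
    pop = np.vstack((parents, children))

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best), "accuracy:", fitness(best))
```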
Domino: Extracting, Comparing, and Manipulating Subsets across Multiple Tabular Datasets
Gratzl, Samuel; Gehlenborg, Nils; Lex, Alexander; Pfister, Hanspeter; Streit, Marc
2016-01-01
Answering questions about complex issues often requires analysts to take into account information contained in multiple interconnected datasets. A common strategy in analyzing and visualizing large and heterogeneous data is dividing it into meaningful subsets. Interesting subsets can then be selected and the associated data and the relationships between the subsets visualized. However, neither the extraction and manipulation nor the comparison of subsets is well supported by state-of-the-art techniques. In this paper we present Domino, a novel multiform visualization technique for effectively representing subsets and the relationships between them. By providing comprehensive tools to arrange, combine, and extract subsets, Domino allows users to create both common visualization techniques and advanced visualizations tailored to specific use cases. In addition to the novel technique, we present an implementation that enables analysts to manage the wide range of options that our approach offers. Innovative interactive features such as placeholders and live previews support rapid creation of complex analysis setups. We introduce the technique and the implementation using a simple example and demonstrate scalability and effectiveness in a use case from the field of cancer genomics. PMID:26356916
Arambula, Diego; Wong, Wenge; Medhekar, Bob A; Guo, Huatao; Gingery, Mari; Czornyj, Elizabeth; Liu, Minghsun; Dey, Sanghamitra; Ghosh, Partho; Miller, Jeff F
2013-05-14
Diversity-generating retroelements (DGRs) are a unique family of retroelements that confer selective advantages to their hosts by facilitating localized DNA sequence evolution through a specialized error-prone reverse transcription process. We characterized a DGR in Legionella pneumophila, an opportunistic human pathogen that causes Legionnaires' disease. The L. pneumophila DGR is found within a horizontally acquired genomic island, and it can theoretically generate 10²⁶ unique nucleotide sequences in its target gene, legionella determinant target A (ldtA), creating a repertoire of 10¹⁹ distinct proteins. Expression of the L. pneumophila DGR resulted in transfer of DNA sequence information from a template repeat to a variable repeat (VR) accompanied by adenine-specific mutagenesis of progeny VRs at the 3′ end of ldtA. ldtA encodes a twin-arginine translocated lipoprotein that is anchored in the outer leaflet of the outer membrane, with its C-terminal variable region surface exposed. Related DGRs were identified in L. pneumophila clinical isolates that encode unique target proteins with homologous VRs, demonstrating the adaptability of DGR components. This work characterizes a DGR that diversifies a bacterial protein and confirms the hypothesis that DGR-mediated mutagenic homing occurs through a conserved mechanism. Comparative bioinformatics predicts that surface display of massively variable proteins is a defining feature of a subset of bacterial DGRs.
Baliakas, Panagiotis; Hadzidimitriou, Anastasia; Sutton, Lesley-Ann; Minga, Eva; Agathangelidis, Andreas; Nichelatti, Michele; Tsanousa, Athina; Scarfò, Lydia; Davis, Zadie; Yan, Xiao-Jie; Shanafelt, Tait; Plevova, Karla; Sandberg, Yorick; Vojdeman, Fie Juhl; Boudjogra, Myriam; Tzenou, Tatiana; Chatzouli, Maria; Chu, Charles C; Veronese, Silvio; Gardiner, Anne; Mansouri, Larry; Smedby, Karin E; Pedersen, Lone Bredo; van Lom, Kirsten; Giudicelli, Véronique; Francova, Hana Skuhrova; Nguyen-Khac, Florence; Panagiotidis, Panagiotis; Juliusson, Gunnar; Angelis, Lefteris; Anagnostopoulos, Achilles; Lefranc, Marie-Paule; Facco, Monica; Trentin, Livio; Catherwood, Mark; Montillo, Marco; Geisler, Christian H; Langerak, Anton W; Pospisilova, Sarka; Chiorazzi, Nicholas; Oscier, David; Jelinek, Diane F; Darzentas, Nikos; Belessi, Chrysoula; Davi, Frederic; Rosenquist, Richard; Ghia, Paolo; Stamatopoulos, Kostas
2014-11-01
About 30% of cases of chronic lymphocytic leukaemia (CLL) carry quasi-identical B-cell receptor immunoglobulins and can be assigned to distinct stereotyped subsets. Although preliminary evidence suggests that B-cell receptor immunoglobulin stereotypy is relevant from a clinical viewpoint, this aspect has never been explored in a systematic manner or in a cohort of adequate size that would enable clinical conclusions to be drawn. For this retrospective, multicentre study, we analysed 8593 patients with CLL for whom immunogenetic data were available. These patients were followed up in 15 academic institutions throughout Europe (in Czech Republic, Denmark, France, Greece, Italy, Netherlands, Sweden, and the UK) and the USA, and data were collected between June 1, 2012, and June 7, 2013. We retrospectively assessed the clinical implications of CLL B-cell receptor immunoglobulin stereotypy, with a particular focus on 14 major stereotyped subsets comprising cases expressing unmutated (U-CLL) or mutated (M-CLL) immunoglobulin heavy chain variable genes. The primary outcome of our analysis was time to first treatment, defined as the time between diagnosis and date of first treatment. 2878 patients were assigned to a stereotyped subset, of which 1122 patients belonged to one of 14 major subsets. Stereotyped subsets showed significant differences in terms of age, sex, disease burden at diagnosis, CD38 expression, and cytogenetic aberrations of prognostic significance. Patients within a specific subset generally followed the same clinical course, whereas patients in different stereotyped subsets, despite having the same immunoglobulin heavy variable gene and displaying similar immunoglobulin mutational status, showed substantially different times to first treatment. By integrating B-cell receptor immunoglobulin stereotypy (for subsets 1, 2, and 4) into the well established Döhner cytogenetic prognostic model, we showed that these subsets, which collectively account for around 7% of all cases of CLL and represent both U-CLL and M-CLL, constitute separate clinical entities, ranging from very indolent (subset 4) to aggressive disease (subsets 1 and 2). The molecular classification of chronic lymphocytic leukaemia based on B-cell receptor immunoglobulin stereotypy improves the Döhner hierarchical model and refines prognostication beyond immunoglobulin mutational status, with potential implications for clinical decision making, especially within prospective clinical trials. European Union; General Secretariat for Research and Technology of Greece; AIRC; Italian Ministry of Health; AIRC Regional Project with Fondazione CARIPARO and CARIVERONA; Regione Veneto on Chronic Lymphocytic Leukemia; Nordic Cancer Union; Swedish Cancer Society; Swedish Research Council; and National Cancer Institute (NIH).
Chelcy Miniat
2013-01-01
The EcoTrends Editorial Committee sorted through vast amounts of historical and ongoing data from 50 ecological sites in the continental United States, Alaska, several islands, and Antarctica to present in a logical format the variables commonly collected. This report presents a subset of data and variables from these sites and illustrates through detailed...
Neuroepileptic Correlates of Autistic Symptomatology in Tuberous Sclerosis
ERIC Educational Resources Information Center
Bolton, Patrick F.
2004-01-01
Tuberous sclerosis is a genetic condition that is strongly associated with the development of an autism spectrum disorder. However, there is marked variability in expression, and only a subset of children with tuberous sclerosis develop autism spectrum disorder. Clarification of the mechanisms that underlie the association and variability in…
System and method for progressive band selection for hyperspectral images
NASA Technical Reports Server (NTRS)
Fisher, Kevin (Inventor)
2013-01-01
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for progressive band selection for hyperspectral images. A system having a module configured to control a processor to practice the method calculates the virtual dimensionality of a hyperspectral image having multiple bands to determine a quantity Q of how many bands are needed for a threshold level of information, ranks each band based on a statistical measure, selects Q bands from the multiple bands to generate a subset of bands based on the virtual dimensionality, and generates a reduced image based on the subset of bands. This approach can create reduced datasets of full hyperspectral images tailored for individual applications. The system uses a metric specific to a target application to rank the image bands, and then selects the most useful bands. The number of bands selected can be specified manually or calculated from the hyperspectral image's virtual dimensionality.
Efficient Simulation Budget Allocation for Selecting an Optimal Subset
NASA Technical Reports Server (NTRS)
Chen, Chun-Hung; He, Donghai; Fu, Michael; Lee, Loo Hay
2008-01-01
We consider a class of the subset selection problem in ranking and selection. The objective is to identify the top m out of k designs based on simulated output. Traditional procedures are conservative and inefficient. Using the optimal computing budget allocation framework, we formulate the problem as that of maximizing the probability of correctly selecting all of the top-m designs subject to a constraint on the total number of samples available. For an approximation of this correct selection probability, we derive an asymptotically optimal allocation and propose an easy-to-implement heuristic sequential allocation procedure. Numerical experiments indicate that the resulting allocations are superior to other methods in the literature that we tested, and the relative efficiency increases for larger problems. In addition, preliminary numerical results indicate that the proposed new procedure has the potential to enhance computational efficiency for simulation optimization.
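The paper's asymptotically optimal allocation rule is not reproduced here; the sketch below shows only the sequential skeleton of such a procedure, with a simple variance-over-gap heuristic of my own standing in for the optimal rule: designs that are noisy relative to their distance from the top-m boundary receive the next batch of replications.

```python
import numpy as np

rng = np.random.default_rng(7)
true_means = np.array([0.0, 0.2, 0.5, 0.9, 1.0, 1.4, 1.8, 2.0])  # 8 designs, top-3 sought
m, n0, batch, budget = 3, 10, 5, 800

samples = [list(rng.normal(mu, 1.0, n0)) for mu in true_means]   # initial sampling
while sum(len(s) for s in samples) < budget:
    means = np.array([np.mean(s) for s in samples])
    stds = np.array([np.std(s, ddof=1) for s in samples])
    # Boundary between the current top-m designs and the rest.
    order = np.argsort(means)[::-1]
    boundary = (means[order[m - 1]] + means[order[m]]) / 2.0
    # Heuristic: the hardest-to-classify designs get the next batch.
    priority = stds / np.maximum(np.abs(means - boundary), 1e-6)
    for i in np.argsort(priority)[-batch:]:
        samples[i].append(rng.normal(true_means[i], 1.0))

final_top_m = np.argsort([np.mean(s) for s in samples])[::-1][:m]
print("selected top-m designs:", sorted(final_top_m))
```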
Enhancing the Performance of LibSVM Classifier by Kernel F-Score Feature Selection
NASA Astrophysics Data System (ADS)
Sarojini, Balakrishnan; Ramaraj, Narayanasamy; Nickolas, Savarimuthu
Medical data mining is the search for relationships and patterns within medical datasets that could provide useful knowledge for effective clinical decisions. The inclusion of irrelevant, redundant and noisy features in the process model results in poor predictive accuracy. Much research work in data mining has gone into improving the predictive accuracy of classifiers by applying the techniques of feature selection. Feature selection in medical data mining is valuable because the diagnosis of a disease can then be carried out in this patient-care activity with a minimum number of significant features. The objective of this work is to show that selecting the more significant features improves the performance of the classifier. We empirically evaluate the classification effectiveness of the LibSVM classifier on the reduced feature subset of a diabetes dataset. The evaluations suggest that the selected feature subset improves the predictive accuracy of the classifier and reduces false negatives and false positives.
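A minimal sketch of plain F-score feature ranking followed by an SVM, in the spirit of the preceding abstract (the kernelized F-score variant it names is not reproduced, and a bundled dataset stands in for the diabetes data): the F-score measures between-class separation over within-class spread, one feature at a time.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer   # stand-in for the diabetes data
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def f_score(X, y):
    """Per-feature F-score for a binary target: between-class separation
    of the class means over the summed within-class variances."""
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / den

top = np.argsort(f_score(X, y))[::-1][:10]        # keep the 10 highest-scoring features
for features, label in [(slice(None), "all features"), (top, "top-10 F-score")]:
    acc = cross_val_score(SVC(kernel="rbf"), X[:, features], y, cv=5).mean()
    print(f"{label}: CV accuracy = {acc:.3f}")
```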
NASA Astrophysics Data System (ADS)
Peerbhay, Kabir Yunus; Mutanga, Onisimo; Ismail, Riyad
2013-05-01
Discriminating commercial tree species using hyperspectral remote sensing techniques is critical in monitoring the spatial distributions and compositions of commercial forests. However, issues related to data dimensionality and multicollinearity limit the successful application of the technology. The aim of this study was to examine the utility of the partial least squares discriminant analysis (PLS-DA) technique in accurately classifying six exotic commercial forest species (Eucalyptus grandis, Eucalyptus nitens, Eucalyptus smithii, Pinus patula, Pinus elliotii and Acacia mearnsii) using airborne AISA Eagle hyperspectral imagery (393-900 nm). Additionally, the variable importance in the projection (VIP) method was used to identify subsets of bands that could successfully discriminate the forest species. Results indicated that the PLS-DA model that used all the AISA Eagle bands (n = 230) produced an overall accuracy of 80.61% and a kappa value of 0.77, with user's and producer's accuracies ranging from 50% to 100%. In comparison, incorporating the optimal subset of VIP selected wavebands (n = 78) in the PLS-DA model resulted in an improved overall accuracy of 88.78% and a kappa value of 0.87, with user's and producer's accuracies ranging from 70% to 100%. Bands located predominantly within the visible region of the electromagnetic spectrum (393-723 nm) showed the most capability in terms of discriminating between the six commercial forest species. Overall, the research has demonstrated the potential of using PLS-DA for reducing the dimensionality of hyperspectral datasets as well as determining the optimal subset of bands to produce the highest classification accuracies.
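A hedged sketch of the PLS-DA plus VIP band-selection pipeline the preceding abstract describes, using the standard VIP formula on synthetic spectra (the AISA Eagle data and the paper's exact settings are not reproduced; the number of components and the "VIP > 1" cutoff are common conventions, not the authors' choices).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_bands, n_classes = 230, 6
X = rng.normal(size=(300, n_bands))                  # stand-in spectra
y = rng.integers(0, n_classes, size=300)
Y = label_binarize(y, classes=range(n_classes))     # PLS-DA: one-hot class targets

pls = PLSRegression(n_components=10).fit(X, Y)

# Variable importance in the projection (VIP) per band.
T = pls.transform(X)                                # latent scores
W, Q = pls.x_weights_, pls.y_loadings_
p = W.shape[0]
ssy = (T ** 2).sum(axis=0) * (Q ** 2).sum(axis=0)   # Y-variance explained per component
wnorm = (W / np.linalg.norm(W, axis=0)) ** 2
vip = np.sqrt(p * (wnorm @ ssy) / ssy.sum())

selected = np.flatnonzero(vip > 1.0)                # common "VIP > 1" rule of thumb
print(f"{selected.size} of {n_bands} bands retained")
```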
Sherman, Stephen E; Kuljanin, Miljan; Cooper, Tyler T; Putman, David M; Lajoie, Gilles A; Hess, David A
2017-06-01
During culture expansion, multipotent mesenchymal stromal cells (MSCs) differentially express aldehyde dehydrogenase (ALDH), an intracellular detoxification enzyme that protects long-lived cells against oxidative stress. Thus, MSC selection based on ALDH-activity may be used to reduce heterogeneity and distinguish MSC subsets with improved regenerative potency. After expansion of human bone marrow-derived MSCs, cell progeny was purified based on low (ALDHlo) versus high (ALDHhi) ALDH-activity by fluorescence-activated cell sorting, and each subset was compared for multipotent stromal and provascular regenerative functions. Both ALDHlo and ALDHhi MSC subsets demonstrated similar expression of stromal cell (>95% CD73+, CD90+, CD105+) and pericyte (>95% CD146+) surface markers and showed multipotent differentiation into bone, cartilage, and adipose cells in vitro. Conditioned media (CDM) generated by ALDHhi MSCs demonstrated a potent proliferative and prosurvival effect on human microvascular endothelial cells (HMVECs) under serum-free conditions and augmented HMVEC tube-forming capacity in growth factor-reduced matrices. After subcutaneous transplantation within directed in vivo angiogenesis assay implants into immunodeficient mice, ALDHhi MSC or CDM produced by ALDHhi MSC significantly augmented murine vascular cell recruitment and perfused vessel infiltration compared with ALDHlo MSC. Although both subsets demonstrated strikingly similar mRNA expression patterns, quantitative proteomic analyses performed on subset-specific CDM revealed that the ALDHhi MSC subset uniquely secreted multiple proangiogenic cytokines (vascular endothelial growth factor beta, platelet derived growth factor alpha, and angiogenin) and actively produced multiple factors with chemoattractant (transforming growth factor-β; C-X-C motif chemokine ligands 1, 2, and 3 (GRO); C-C motif chemokine ligand 5 (RANTES); monocyte chemotactic protein 1 (MCP-1); interleukin [IL]-6; IL-8) and matrix-modifying functions (tissue inhibitors of metalloproteinase 1 and 2 (TIMP1/2)). Collectively, MSCs selected for high ALDH-activity demonstrated enhanced proangiogenic secretory functions and represent a purified MSC subset amenable for vascular regenerative applications.
Massof, Robert W
2014-10-01
A simple theoretical framework explains patient responses to items in rating scale questionnaires. Fixed latent variables position each patient and each item on the same linear scale. Item responses are governed by a set of fixed category thresholds, one for each ordinal response category. A patient's item responses are magnitude estimates of the difference between the patient variable and the patient's estimate of the item variable, relative to his/her personally defined response category thresholds. Differences between patients in their personal estimates of the item variable and in their personal choices of category thresholds are represented by random variables added to the corresponding fixed variables. Effects of intervention correspond to changes in the patient variable, the patient's response bias, and/or latent item variables for a subset of items. Intervention effects on patients' item responses were simulated by assuming the random variables are normally distributed with a constant scalar covariance matrix. Rasch analysis was used to estimate latent variables from the simulated responses. The simulations demonstrate that changes in the patient variable and changes in response bias produce indistinguishable effects on item responses and manifest as changes only in the estimated patient variable. Changes in a subset of item variables manifest as intervention-specific differential item functioning and as changes in the estimated person variable that equals the average of changes in the item variables. Simulations demonstrate that intervention-specific differential item functioning produces inefficiencies and inaccuracies in computer adaptive testing.
Kim, Hyun O; Oh, Hyun Jin; Lee, Jae Wook; Jang, Pil-Sang; Chung, Nack-Gyun; Cho, Bin; Kim, Hack-Ki
2013-01-01
Lymphocyte subset recovery is an important factor that determines the success of hematopoietic stem cell transplantation (HSCT). Temporal differences in the recovery of lymphocyte subsets and the factors influencing this recovery are important variables that affect a patient's post-transplant immune reconstitution, and therefore require investigation. The time taken to achieve lymphocyte subset recovery and the factors influencing this recovery were investigated in 59 children who had undergone HSCT at the Department of Pediatrics, The Catholic University of Korea Seoul St. Mary's Hospital, and who had an uneventful follow-up period of at least 1 year. Analyses were carried out at 3 and 12 months post-transplant. An additional study was performed 1 month post-transplant to evaluate natural killer (NK) cell recovery. The impact of pre- and post-transplant variables, including diagnosis of Epstein-Barr virus (EBV) DNAemia post-transplant, on lymphocyte recovery was evaluated. The lymphocyte subsets recovered in the following order: NK cells, cytotoxic T cells, B cells, and helper T cells. At 1 month post-transplant, acute graft-versus-host disease was found to contribute significantly to the delay of CD16(+)/56(+) cell recovery. Younger patients showed delayed recovery of both CD3(+)/CD8(+) and CD19(+) cells. EBV DNAemia had a deleterious impact on the recovery of both CD3(+) and CD3(+)/CD4(+) lymphocytes at 1 year post-transplant. In our pediatric allogeneic HSCT cohort, helper T cells were the last subset to recover. Younger age and EBV DNAemia had a negative impact on the post-transplant recovery of T cells and B cells.
NASA Astrophysics Data System (ADS)
Zhang, Aizhu; Sun, Genyun; Wang, Zhenjie
2015-12-01
The serious information redundancy in hyperspectral images (HIs) does not contribute to data analysis accuracy; instead, it requires expensive computational resources. Consequently, to identify the most useful and valuable information in HIs and thereby improve the accuracy of data analysis, this paper proposes a novel hyperspectral band selection method using a hybrid genetic algorithm and gravitational search algorithm (GA-GSA). In the proposed method, the GA-GSA is first mapped to the binary space. Then, the accuracy of a support vector machine (SVM) classifier and the number of selected spectral bands are used together to measure the discriminative capability of a band subset. Finally, the band subset that covers the most useful and valuable information with the smallest number of spectral bands is obtained. To verify the effectiveness of the proposed method, studies conducted on an AVIRIS image against two recently proposed state-of-the-art GSA variants are presented. The experimental results revealed the superiority of the proposed method and indicated that it can considerably reduce data storage costs and efficiently identify band subsets with stable and high classification precision.
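The GA-GSA hybrid itself is not reproduced below; this sketch only illustrates the two-part fitness the abstract describes (SVM accuracy penalized by band count) driving a much simpler stand-in search, a greedy bit-flip hill climb over binary band masks, on synthetic data with an invented penalty weight.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 100))          # stand-in for 100 hyperspectral bands
y = rng.integers(0, 4, size=200)

def fitness(mask, alpha=0.002):
    """Two-part criterion: SVM accuracy, penalized by the number of bands."""
    if not mask.any():
        return -np.inf
    acc = cross_val_score(SVC(), X[:, mask], y, cv=3).mean()
    return acc - alpha * mask.sum()

mask = rng.random(100) < 0.5             # random initial band subset
best = fitness(mask)
for step in range(300):
    j = rng.integers(100)
    mask[j] = ~mask[j]                   # try flipping one band in or out
    new = fitness(mask)
    if new >= best:
        best = new
    else:
        mask[j] = ~mask[j]               # revert the flip

print(f"{mask.sum()} bands kept, penalized CV accuracy = {best:.3f}")
```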
Darmann, Andreas; Nicosia, Gaia; Pferschy, Ulrich; Schauer, Joachim
2014-01-01
In this work we address a game theoretic variant of the Subset Sum problem, in which two decision makers (agents/players) compete for the usage of a common resource represented by a knapsack capacity. Each agent owns a set of integer weighted items and wants to maximize the total weight of its own items included in the knapsack. The solution is built as follows: Each agent, in turn, selects one of its items (not previously selected) and includes it in the knapsack if there is enough capacity. The process ends when the remaining capacity is too small for including any item left. We look at the problem from a single agent point of view and show that finding an optimal sequence of items to select is an NP-hard problem. Therefore we propose two natural heuristic strategies and analyze their worst-case performance when (1) the opponent is able to play optimally and (2) the opponent adopts a greedy strategy. From a centralized perspective we observe that some known results on the approximation of the classical Subset Sum can be effectively adapted to the multi-agent version of the problem. PMID:25844012
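A toy simulation of the game described above, under my own reading of the rules (strictly alternating turns, and a player who cannot fit any item simply passes): both agents play the natural greedy strategy of always picking their largest item that still fits the remaining capacity.

```python
def greedy_pick(items, capacity):
    """Largest item that still fits, or None if nothing fits."""
    fitting = [w for w in items if w <= capacity]
    return max(fitting) if fitting else None

def play(items_a, items_b, capacity):
    items = {0: list(items_a), 1: list(items_b)}
    totals, turn = [0, 0], 0
    while True:
        pick = greedy_pick(items[turn], capacity)
        other = greedy_pick(items[1 - turn], capacity)
        if pick is None and other is None:   # nobody can move: game over
            return totals
        if pick is not None:
            items[turn].remove(pick)
            totals[turn] += pick
            capacity -= pick
        turn = 1 - turn

print(play([9, 5, 4, 1], [8, 7, 2], capacity=20))   # -> [10, 10]
```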
McFarland, Kent P.; Rimmer, Christopher C.; Goetz, James E.; Aubry, Yves; Wunderle, Joseph M.; Sutton, Anne; Townsend, Jason M.; Sosa, Alejandro Llanes; Kirkconnell, Arturo
2013-01-01
Conservation planning and implementation require identifying pertinent habitats and locations where protection and management may improve viability of targeted species. The winter range of Bicknell’s Thrush (Catharus bicknelli), a threatened Nearctic-Neotropical migratory songbird, is restricted to the Greater Antilles. We analyzed winter records from the mid-1970s to 2009 to quantitatively evaluate winter distribution and habitat selection. Additionally, we conducted targeted surveys in Jamaica (n = 433), Cuba (n = 363), Dominican Republic (n = 1,000), Haiti (n = 131) and Puerto Rico (n = 242), yielding 179 sites with thrush presence. We modeled Bicknell’s Thrush winter habitat selection and distribution in the Greater Antilles in Maxent version 3.3.1 using environmental predictors represented in 30 arc second study area rasters. These included nine landform, land cover and climatic variables that were thought a priori to have potentially high predictive power. We used the average training gain from ten model runs to select the best subset of predictors. Total winter precipitation, aspect and land cover, particularly broadleaf forests, emerged as important variables. A five-variable model that contained land cover, winter precipitation, aspect, slope, and elevation was the most parsimonious and not significantly different from the models with more variables. We used the best fitting model to depict potential winter habitat. Using the 10 percentile threshold (>0.25), we estimated winter habitat to cover 33,170 km², nearly 10% of the study area. The Dominican Republic contained half of all potential habitat (51%), followed by Cuba (15.1%), Jamaica (13.5%), Haiti (10.6%), and Puerto Rico (9.9%). Nearly one-third of the range was found to be in protected areas. By providing the first detailed predictive map of Bicknell’s Thrush winter distribution, our study provides a useful tool to prioritize and direct conservation planning for this and other wet, broadleaf forest specialists in the Greater Antilles. PMID:23326554
Genomic selection for slaughter age in pigs using the Cox frailty model.
Santos, V S; Martins Filho, S; Resende, M D V; Azevedo, C F; Lopes, P S; Guimarães, S E F; Glória, L S; Silva, F F
2015-10-19
The aim of this study was to compare genomic selection methodologies using a linear mixed model and the Cox survival model. We used data from an F2 population of pigs, in which the response variable was the time in days from birth to the culling of the animal and the covariates were 238 markers [237 single nucleotide polymorphisms (SNPs) plus the halothane gene]. The data were corrected for fixed effects, and the accuracy of each method was determined based on the correlation of the ranks of predicted genomic breeding values (GBVs) in both models with the corrected phenotypic values. The analysis was repeated with a subset of the SNP markers with the largest absolute effects. For uncensored data under normality, the two models agreed in GBV prediction and in the estimation of marker effects. However, when considering censored data, the Cox model with a normal random effect (S1) was more appropriate. Since there was no agreement between the linear mixed model applied to imputed data (L2) and the other approaches in the prediction of genomic values and the estimation of marker effects, model S1 was considered superior, as it accounts for the latent variable and the censored data. Marker selection increased the correlations between the ranks of GBVs predicted by the linear and Cox frailty models and the corrected phenotypic values, and 120 markers were required to increase the predictive ability for the trait analyzed.
Hemmateenejad, Bahram; Yazdani, Mahdieh
2009-02-16
Steroids are widely distributed in nature and are found in abundance in plants, animals, and fungi. A data set consisting of a diverse set of steroids was used to develop quantitative structure-electrochemistry relationship (QSER) models for their half-wave reduction potential. Modeling was performed by means of multiple linear regression (MLR) and principal component regression (PCR) analyses. In the MLR analysis, the QSPR models were constructed either by first grouping descriptors and then stepwise selecting variables from each group (MLR1) or by stepwise selecting predictor variables from the pool of all calculated descriptors (MLR2). A similar procedure was used in the PCR analysis, so that the principal components (or features) were extracted from different groups of descriptors (PCR1) or from the entire set of descriptors (PCR2). The resulting models were evaluated using cross-validation, chance correlation, prediction of the reduction potential of test samples, and assessment of the applicability domain. Both MLR approaches gave accurate results; however, the QSPR model found by MLR1 was statistically more significant. The PCR1 approach produced a model as accurate as the MLR approaches, whereas less accurate results were obtained with the PCR2 approach. Overall, the cross-validation and prediction correlation coefficients of the QSPR models resulting from the MLR1, MLR2 and PCR1 approaches were higher than 90%, which shows the high ability of the models to predict the reduction potential of the studied steroids.
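A minimal sketch of stepwise descriptor selection for an MLR model like MLR2 above (forward selection from the full descriptor pool; the grouped MLR1 variant and the real descriptor matrix are not reproduced, and the feature counts are invented):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in for a descriptor matrix and half-wave reduction potentials.
X, y = make_regression(n_samples=60, n_features=40, n_informative=6,
                       noise=5.0, random_state=2)

# Forward stepwise selection of 6 descriptors by cross-validated fit.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=6,
                                direction="forward", cv=5).fit(X, y)
X_sel = sfs.transform(X)
q2 = cross_val_score(LinearRegression(), X_sel, y, cv=5, scoring="r2").mean()
print("selected descriptors:", np.flatnonzero(sfs.get_support()))
print(f"cross-validated R^2: {q2:.2f}")
```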
Wang, Nanyi; Wang, Lirong; Xie, Xiang-Qun
2017-11-27
Molecular docking is widely applied to computer-aided drug design and has become relatively mature in recent decades. Applications of docking in modeling vary from single lead compound optimization to large-scale virtual screening. The performance of molecular docking is highly dependent on the protein structures selected. It is especially challenging for large-scale target prediction research when multiple structures are available for a single target. Therefore, we have established ProSelection, a docking preferred-protein selection algorithm, in order to generate the proper structure subset(s). By the ProSelection algorithm, protein structures of "weak selectors" are filtered out whereas structures of "strong selectors" are kept. Specifically, a structure which has a good statistical performance in distinguishing active from inactive ligands is defined as a strong selector. In this study, 249 protein structures of 14 autophagy-related targets were investigated. Surflex-dock was used as the docking engine to distinguish active and inactive compounds against these protein structures. Both the t test and the Mann-Whitney U test were used to distinguish the strong from the weak selectors based on the normality of the docking score distribution. The suggested docking score threshold for active ligands (SDA) was generated for each strong selector structure according to the receiver operating characteristic (ROC) curve. The performance of ProSelection was further validated by predicting the potential off-targets of 43 U.S. Food and Drug Administration approved small molecule antineoplastic drugs. Overall, ProSelection will accelerate the computational work in protein structure selection and could be a useful tool for molecular docking, target prediction, and protein-chemical database establishment research.
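A hedged sketch of the strong-selector test and SDA threshold described above, on synthetic docking scores (the paper's exact decision rules are not reproduced; the significance level and the Youden's J threshold choice are my assumptions):

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
# Docking scores for one protein structure: actives vs. inactives (synthetic).
active = rng.normal(7.5, 1.2, 150)
inactive = rng.normal(6.0, 1.5, 400)

# "Strong selector" test: do actives score significantly higher?
stat, p = mannwhitneyu(active, inactive, alternative="greater")
print(f"strong selector: {p < 0.05} (p = {p:.2e})")

# Suggested docking-score threshold for actives (SDA) from the ROC curve,
# here taken at the point maximizing Youden's J statistic.
scores = np.concatenate([active, inactive])
labels = np.concatenate([np.ones_like(active), np.zeros_like(inactive)])
fpr, tpr, thresholds = roc_curve(labels, scores)
best = np.argmax(tpr - fpr)
print(f"SDA threshold: {thresholds[best]:.2f}")
```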
Delineation of soil temperature regimes from HCMM data
NASA Technical Reports Server (NTRS)
Day, R. L.; Petersen, G. W. (Principal Investigator)
1981-01-01
Supplementary data including photographs as well as topographic, geologic, and soil maps were obtained and evaluated for ground truth purposes and control point selection. A study area (approximately 450 by 450 pixels) was subset from LANDSAT scene No. 2477-17142. Geometric corrections and scaling were performed. Initial enhancement techniques were initiated to aid control point selection and soils interpretation. The SUBSET program was modified to read HCMM tapes, and HCMM data were reformatted to be compatible with the ORSER system. Initial NMAP products of geometrically corrected and scaled raw data tapes (unregistered) of the study area were produced.
Decision Aids for Airborne Intercept Operations in Advanced Aircraft
NASA Technical Reports Server (NTRS)
Madni, A.; Freedy, A.
1981-01-01
A tactical decision aid (TDA) for the F-14 aircrew, i.e., the naval flight officer and pilot, in conducting a multitarget attack during the performance of a Combat Air Patrol (CAP) role is presented. The TDA employs hierarchical multiattribute utility models for characterizing mission objectives in operationally measurable terms, rule based AI-models for tactical posture selection, and fast time simulation for maneuver consequence prediction. The TDA makes aspect maneuver recommendations, selects and displays the optimum mission posture, evaluates attackable and potentially attackable subsets, and recommends the 'best' attackable subset along with the required course perturbation.
Selection of optimal complexity for ENSO-EMR model by minimum description length principle
NASA Astrophysics Data System (ADS)
Loskutov, E. M.; Mukhin, D.; Mukhina, A.; Gavrilov, A.; Kondrashov, D. A.; Feigin, A. M.
2012-12-01
One of the main problems arising in modeling data taken from a natural system is finding a phase space suitable for construction of the evolution operator model. Since we usually deal with strongly high-dimensional behavior, we are forced to construct a model working in some projection of the system phase space corresponding to the time scales of interest. Selection of an optimal projection is a non-trivial problem, since there are many ways to reconstruct phase variables from a given time series, especially in the case of a spatio-temporal data field. Actually, finding an optimal projection is a significant part of model selection because, on the one hand, the transformation of data to some phase variable vector can be considered a required component of the model. On the other hand, such an optimization of the phase space makes sense only in relation to the parametrization of the model we use, i.e. the representation of the evolution operator, so we should find an optimal structure of the model together with the phase variable vector. In this paper we propose to use the minimum description length principle (Molkov et al., 2009) for selecting models of optimal complexity. The proposed method is applied to optimization of the Empirical Model Reduction (EMR) of the ENSO phenomenon (Kravtsov et al., 2005; Kondrashov et al., 2005). This model operates within a subset of leading EOFs constructed from the spatio-temporal field of SST in the Equatorial Pacific, and has the form of multi-level stochastic differential equations (SDEs) with polynomial parameterization of the right-hand side. Optimal values for the number of EOFs, the order of the polynomial, and the number of levels are estimated from the Equatorial Pacific SST dataset. References: Molkov, Ya., Mukhin, D., Loskutov, E., Fidelin, G., and Feigin, A., 2009: Using the minimum description length principle for global reconstruction of dynamic systems from noisy time series. Phys. Rev. E, 80, 046207. Kravtsov, S., Kondrashov, D., and Ghil, M., 2005: Multilevel regression modeling of nonlinear processes: derivation and applications to climatic variability. J. Climate, 18(21), 4404-4424. Kondrashov, D., Kravtsov, S., Robertson, A. W., and Ghil, M., 2005: A hierarchy of data-based ENSO models. J. Climate, 18, 4425-4444.
NASA Astrophysics Data System (ADS)
Chang, Kai-Wei; L'Ecuyer, Tristan S.; Kahn, Brian H.; Natraj, Vijay
2017-05-01
Hyperspectral instruments such as the Atmospheric Infrared Sounder (AIRS) have spectrally dense observations effective for ice cloud retrievals. However, due to the large number of channels, only a small subset is typically used. It is crucial that this subset of channels be chosen to contain the maximum possible information about the retrieved variables. This study describes an information content analysis designed to select optimal channels for ice cloud retrievals. To account for variations in ice cloud properties, we perform channel selection over an ensemble of cloud regimes, extracted with a clustering algorithm, from a multiyear database at a tropical Atmospheric Radiation Measurement site. Multiple satellite viewing angles over land and ocean surfaces are considered to simulate the variations in observation scenarios. The results suggest that AIRS channels near wavelengths of 14, 10.4, 4.2, and 3.8 μm contain the most information. With an eye toward developing a joint AIRS-MODIS (Moderate Resolution Imaging Spectroradiometer) retrieval, the analysis is also applied to combined measurements from both instruments. While application of this method to MODIS yields results consistent with previous channel sensitivity studies, the analysis shows that this combination may yield substantial improvement in cloud retrievals. MODIS provides most information on optical thickness and particle size, aided by a better constraint on cloud vertical placement from AIRS. An alternate scenario where cloud top boundaries are supplied by the active sensors in the A-train is also explored. The more robust cloud placement afforded by active sensors shifts the optimal channels toward the window region and shortwave infrared, further constraining optical thickness and particle size.
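A hedged sketch of sequential information-content channel selection of the kind such analyses typically build on (Rodgers-style): pick, one at a time, the channel whose measurement adds the most bits about the state, updating the state covariance after each pick. The Jacobians, noise variances, and the three-variable state below are synthetic stand-ins, not the paper's radiative transfer calculations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chan, n_state = 200, 3      # e.g. state = (optical thickness, particle size, cloud top)
K = rng.normal(size=(n_chan, n_state))        # toy Jacobians, one row per channel
noise = 0.5 + rng.random(n_chan)              # per-channel noise variance
S = np.eye(n_state)                           # prior state covariance

selected = []
for _ in range(10):                           # pick the 10 most informative channels
    k_S_k = np.einsum("ij,jk,ik->i", K, S, K)           # k_i' S k_i for every channel
    info = 0.5 * np.log2(1.0 + k_S_k / noise)           # bits gained per channel
    info[selected] = -np.inf                            # never pick a channel twice
    i = int(np.argmax(info))
    selected.append(i)
    k = K[i:i + 1].T                                    # column vector of the winner
    S = S - (S @ k @ k.T @ S) / (noise[i] + float(k.T @ S @ k))  # posterior update

print("selected channels:", selected)
```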
Organizational commitment as a predictor variable in nursing turnover research: literature review.
Wagner, Cheryl M
2007-11-01
This paper is a report of a literature review to (1) demonstrate the predictability of organizational commitment as a variable, (2) compare organizational commitment and job satisfaction as predictor variables and (3) determine the usefulness of organizational commitment in nursing turnover research. Organizational commitment is not routinely selected as a predictor variable in nursing studies, although the evidence suggests that it is a reliable predictor. Findings from turnover studies can help determine the previous performance of organizational commitment, and be compared to those of studies using the more conventional variable of job satisfaction. Published research studies in English were accessed for the period 1960-2006 using the CINAHL, EBSCO Healthsource Nursing, ERIC, PROQUEST, Journals@OVID, PubMed, PsycINFO, Health and Psychosocial Instruments (HAPI) and COCHRANE library databases and Business Source Premier. The search terms included nursing turnover, organizational commitment or job satisfaction. Only studies reporting mean comparisons, R² or beta values related to organizational commitment and turnover or turnover antecedents were included in the review. There were 25 studies in the final data set, with a subset of 23 studies generated to compare the variables of organizational commitment and job satisfaction. Results indicated robust indirect predictability of organizational commitment overall, with greater predictability by organizational commitment vs job satisfaction. Organizational commitment is a useful predictor of turnover in nursing research, and effective as a variable with the most direct impact on antecedents of turnover such as intent to stay. The organizational commitment variable should be routinely employed in nursing turnover research studies.
Wang, Ching-Yun; Song, Xiao
2016-11-01
Biomedical researchers are often interested in estimating the effect of an environmental exposure in relation to a chronic disease endpoint. However, the exposure variable of interest may be measured with errors. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies an additive measurement error model, but it may not have repeated measurements. The subset in which the surrogate variables are available is called a calibration sample. In addition to the surrogate variables that are available among the subjects in the calibration sample, we consider the situation when there is an instrumental variable available for all study subjects. An instrumental variable is correlated with the unobserved true exposure variable, and hence can be useful in the estimation of the regression coefficients. In this paper, we propose a nonparametric method for Cox regression using the observed data from the whole cohort. The nonparametric estimator is the best linear combination of a nonparametric correction estimator from the calibration sample and the difference of the naive estimators from the calibration sample and the whole cohort. The asymptotic distribution is derived, and the finite sample performance of the proposed estimator is examined via intensive simulation studies. The methods are applied to the Nutritional Biomarkers Study of the Women's Health Initiative.
USDA-ARS Scientific Manuscript database
The USDA rice (Oryza sativa L.) core subset (RCS) was assembled to represent the genetic diversity of the entire USDA-ARS National Small Grains Collection and consists of 1,794 accessions from 114 countries. The USDA rice mini-core (MC) is a subset of 217 accessions from the RCS and was selected to ...
Effects of sample size and sampling frequency on studies of brown bear home ranges and habitat use
Arthur, Steve M.; Schwartz, Charles C.
1999-01-01
We equipped 9 brown bears (Ursus arctos) on the Kenai Peninsula, Alaska, with collars containing both conventional very-high-frequency (VHF) transmitters and global positioning system (GPS) receivers programmed to determine an animal's position at 5.75-hr intervals. We calculated minimum convex polygon (MCP) and fixed and adaptive kernel home ranges for randomly-selected subsets of the GPS data to examine the effects of sample size on accuracy and precision of home range estimates. We also compared results obtained by weekly aerial radiotracking versus more frequent GPS locations to test for biases in conventional radiotracking data. Home ranges based on the MCP were 20-606 km² (x̄ = 201) for aerial radiotracking data (n = 12-16 locations/bear) and 116-1,505 km² (x̄ = 522) for the complete GPS data sets (n = 245-466 locations/bear). Fixed kernel home ranges were 34-955 km² (x̄ = 224) for radiotracking data and 16-130 km² (x̄ = 60) for the GPS data. Differences between means for radiotracking and GPS data were due primarily to the larger samples provided by the GPS data. Means did not differ between radiotracking data and equivalent-sized subsets of GPS data (P > 0.10). For the MCP, home range area increased and variability decreased asymptotically with number of locations. For the kernel models, both area and variability decreased with increasing sample size. Simulations suggested that the MCP and kernel models required >60 and >80 locations, respectively, for estimates to be both accurate (change in area <1%/additional location) and precise (CV < 50%). Although the radiotracking data appeared unbiased, except for the relationship between area and sample size, these data failed to indicate some areas that likely were important to bears. Our results suggest that the usefulness of conventional radiotracking data may be limited by potential biases and variability due to small samples. Investigators that use home range estimates in statistical tests should consider the effects of variability of those estimates. Use of GPS-equipped collars can facilitate obtaining larger samples of unbiased data and improve accuracy and precision of home range estimates.
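A minimal sketch of the MCP sample-size experiment described above, on simulated fixes (the bear data, fix counts, and movement model are invented): the MCP home range is the area of the convex hull of the locations, and resampling subsets of different sizes shows the area growing and its variability shrinking with sample size.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(5)
fixes = rng.normal(scale=5.0, size=(400, 2))      # ~400 simulated GPS fixes (km), one animal

def mcp_area(points):
    """Minimum convex polygon home range = area of the convex hull.
    (In 2-D, ConvexHull.volume is the enclosed area.)"""
    return ConvexHull(points).volume

for n in (15, 60, 120, 400):                      # radiotracking-sized up to the full set
    areas = [mcp_area(fixes[rng.choice(400, n, replace=False)]) for _ in range(50)]
    print(f"n={n:3d}: mean MCP area = {np.mean(areas):6.1f} km^2, "
          f"CV = {np.std(areas) / np.mean(areas):.2f}")
```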
Two-Way Regularized Fuzzy Clustering of Multiple Correspondence Analysis.
Kim, Sunmee; Choi, Ji Yeh; Hwang, Heungsun
2017-01-01
Multiple correspondence analysis (MCA) is a useful tool for investigating the interrelationships among dummy-coded categorical variables. MCA has been combined with clustering methods to examine whether there exist heterogeneous subclusters of a population that exhibit cluster-level heterogeneity. These combined approaches aim to classify either observations only (one-way clustering of MCA) or both observations and variable categories (two-way clustering of MCA). The latter approach is favored because its solutions are easier to interpret, showing explicitly which subgroup of observations is associated with which subset of variable categories. Nonetheless, the two-way approach has been built on hard classification, which assumes that observations and/or variable categories belong to only one cluster. To relax this assumption, we propose two-way fuzzy clustering of MCA. Specifically, we combine MCA with fuzzy k-means to simultaneously classify a subgroup of observations and a subset of variable categories into a common cluster, while allowing both observations and variable categories to belong partially to multiple clusters. Importantly, we adopt regularized fuzzy k-means, thereby enabling us to decide the degree of fuzziness in cluster memberships automatically. We evaluate the performance of the proposed approach through the analysis of simulated and real data, in comparison with existing two-way clustering approaches.
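For readers unfamiliar with the fuzzy-membership idea the method builds on, here is a minimal standard fuzzy k-means (c-means) sketch; the MCA coupling and the regularization that chooses the fuzziness degree are the paper's contributions and are not reproduced here.

```python
# Standard fuzzy k-means only: each row of U holds graded memberships.
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))      # soft memberships, rows sum to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1)))            # inverse-distance weights
        U /= U.sum(axis=1, keepdims=True)           # normalized membership update
    return U, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
U, centers = fuzzy_kmeans(X, k=2)
print(U[:3].round(2))  # each row: partial membership in both clusters
```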
Hernandez, Maria Eugenia; Martinez-Fong, Daniel; Perez-Tapia, Mayra; Estrada-Garcia, Iris; Estrada-Parra, Sergio; Pavón, Lenin
2010-02-01
To date, only the effect of a short-term antidepressant treatment (<12 weeks) on neuroendocrinoimmune alterations in patients with a major depressive disorder has been evaluated. Our objective was to determine the effect of a 52-week treatment with selective serotonin-reuptake inhibitors on lymphocyte subsets. The participants were thirty-one patients and twenty-two healthy volunteers. The final number of patients (10) resulted from selection and course, as detailed in the enrollment scheme. Methods used to psychiatrically analyze the participants included the Mini-International Neuropsychiatric Interview, Hamilton Depression Scale and Beck Depression Inventory. The lymphocyte subsets were measured in peripheral blood using flow cytometry. Before treatment, increased counts of natural killer (NK) cells in patients were statistically significant when compared with those of healthy volunteers (312±29 versus 158±30 cells/mL), but no differences in the populations of T and B cells were found. The patients showed remission of depressive episodes after 20 weeks of treatment along with an increase in NK cell and B cell populations, which remained increased until the end of the study. At the 52nd week of treatment, patients showed an increase in the counts of NK cells (396±101 cells/mL) and B cells (268±64 cells/mL) compared to healthy volunteers (NK, 159±30 cells/mL; B cells, 179±37 cells/mL). We conclude that long-term treatment with selective serotonin-reuptake inhibitors not only causes remission of depressive symptoms, but also affects lymphocyte subset populations. The physiopathological consequence of these changes remains to be determined.
Updating estimates of low streamflow statistics to account for possible trends
NASA Astrophysics Data System (ADS)
Blum, A. G.; Archfield, S. A.; Hirsch, R. M.; Vogel, R. M.; Kiang, J. E.; Dudley, R. W.
2017-12-01
Given evidence of both increasing and decreasing trends in low flows in many streams, methods are needed to update estimators of low flow statistics used in water resources management. One such metric is the 10-year annual low-flow statistic (7Q10), calculated as the annual minimum seven-day streamflow that is exceeded in nine out of ten years on average. Historical streamflow records may not be representative of current conditions at a site if environmental conditions are changing. We present a new approach to frequency estimation under nonstationary conditions that applies a stationary nonparametric quantile estimator to a subset of the annual minimum flow record. Monte Carlo simulation experiments were used to evaluate this approach across a range of trend and no-trend scenarios. Relative to the standard practice of using the entire available streamflow record, use of a nonparametric quantile estimator combined with selection of the most recent 30 or 50 years for 7Q10 estimation was found to improve accuracy and reduce bias. Benefits of data subset selection approaches were greater for higher-magnitude trends and for annual minimum flow records with lower coefficients of variation. A nonparametric trend test approach for subset selection did not significantly improve upon always selecting the last 30 years of record. At 174 stream gages in the Chesapeake Bay region, 7Q10 estimators based on the most recent 30 years of flow record were compared to estimators based on the entire period of record. Given the availability of long records of low streamflow, a subset of the flow record (the most recent ~30 years) can be used to update 7Q10 estimators to better reflect current streamflow conditions.
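A hedged sketch of the subset estimator: compute annual minimum 7-day flows, keep only the most recent 30 years, and take the 0.1 nonparametric quantile. The synthetic flow series and numpy's default quantile interpolation are assumptions, not the paper's exact estimator.

```python
# 7Q10 from the most recent 30 years of annual minimum 7-day flows.
import numpy as np
import pandas as pd

dates = pd.date_range("1950-01-01", "2016-12-31", freq="D")
daily_flow = pd.Series(np.random.lognormal(2.0, 0.5, len(dates)), index=dates)

seven_day = daily_flow.rolling(7).mean()                      # 7-day mean flow
annual_min = seven_day.groupby(seven_day.index.year).min().dropna()

recent = annual_min.tail(30)                  # subset: last 30 years only
q7q10 = np.quantile(recent.values, 0.10)      # nonparametric 0.1 quantile
print(f"7Q10 (last 30 yr): {q7q10:.2f}")
```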
Collectively loading an application in a parallel computer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.
Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
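The patent text is implementation-agnostic; a minimal sketch of the pattern using mpi4py (an assumed stand-in for the parallel computer's messaging layer) might look like this, with rank 0 playing the job leader.

```python
# Collective load: one leader reads the application image, all nodes
# in the subset receive it via a broadcast.
from mpi4py import MPI

comm = MPI.COMM_WORLD        # stands in for the subset chosen to run the job
leader = 0                   # job leader selected by the control system

if comm.Get_rank() == leader:
    # In a real system the leader would read the application image from
    # storage; a byte string stands in for it here.
    image = b"application-image-bytes"
else:
    image = None

image = comm.bcast(image, root=leader)  # one read, N receivers
print(f"rank {comm.Get_rank()} received {len(image)} bytes")
```

The point of the design is that only the leader touches storage, so the file system sees one read regardless of how many compute nodes execute the job.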
Nishi, Kanae; Kewley-Port, Diane
2008-01-01
Purpose Nishi and Kewley-Port (2007) trained Japanese listeners to perceive nine American English monophthongs and showed that a protocol using all nine vowels (fullset) produced better results than one using only the three more difficult vowels (subset). The present study extended the target population to Koreans and examined whether protocols combining the two stimulus sets would provide more effective training. Method Three groups of five Korean listeners were trained on American English vowels for nine days using one of three protocols: fullset only, first three days on subset then six days on fullset, or first six days on fullset then three days on subset. Participants' performance was assessed by pre- and post-training tests, as well as by a mid-training test. Results 1) Fullset training was also effective for Koreans; 2) no advantage was found for the two combined protocols over the fullset-only protocol; and 3) sustained “non-improvement” was observed for training using one of the combined protocols. Conclusions In using subsets for training American English vowels, care should be taken not only in the selection of subset vowels, but also in the training order of subsets. PMID:18664694
Catera, Rosa; Hatzi, Katerina; Yan, Xiao-Jie; Zhang, Lu; Wang, Xiao Bo; Fales, Henry M.; Allen, Steven L.; Kolitz, Jonathan E.; Rai, Kanti R.; Chiorazzi, Nicholas
2008-01-01
Leukemic B lymphocytes of a large group of unrelated chronic lymphocytic leukemia (CLL) patients express an unmutated heavy chain immunoglobulin variable (V) region encoded by IGHV1-69, IGHD3-16, and IGHJ3 with nearly identical heavy and light chain complementarity-determining region 3 sequences. The likelihood that these patients developed CLL clones with identical antibody V regions randomly is highly improbable and suggests selection by a common antigen. Monoclonal antibodies (mAbs) from this stereotypic subset strongly bind cytoplasmic structures in HEp-2 cells. Therefore, HEp-2 cell extracts were immunoprecipitated with recombinant stereotypic subset-specific CLL mAbs, revealing a major protein band at approximately 225 kDa that was identified by mass spectrometry as nonmuscle myosin heavy chain IIA (MYHIIA). Reactivity of the stereotypic mAbs with MYHIIA was confirmed by Western blot and immunofluorescence colocalization with anti-MYHIIA antibody. Treatments that alter MYHIIA amounts and cytoplasmic localization resulted in a corresponding change in binding to these mAbs. The appearance of MYHIIA on the surface of cells undergoing stress or apoptosis suggests that CLL mAb may generally bind molecules exposed as a consequence of these events. Binding of CLL mAb to MYHIIA could promote the development, survival, and expansion of these leukemic cells. PMID:18812466
NASA Astrophysics Data System (ADS)
Tan, Chao; Chen, Hui; Wang, Chao; Zhu, Wanping; Wu, Tong; Diao, Yuanbo
2013-03-01
Near- and mid-infrared (NIR/MIR) spectroscopy techniques have gained great acceptance in the industry due to their multiple applications and versatility. However, successful application often depends heavily on the construction of accurate and stable calibration models. For this purpose, a simple multi-model fusion strategy is proposed. It combines the Kohonen self-organizing map (KSOM), mutual information (MI) and partial least squares (PLS), and is therefore named KMICPLS. It works as follows: first, the original training set is fed into a KSOM for unsupervised clustering of samples, from which a series of training subsets are constructed. Thereafter, on each of the training subsets, a MI spectrum is calculated and only the variables with MI values higher than the mean value are retained, based on which a candidate PLS model is constructed. Finally, a fixed number of PLS models are selected to produce a consensus model. Two NIR/MIR spectral datasets from the brewing industry are used for experiments. The results confirm its superior performance to two reference algorithms, i.e., the conventional PLS and genetic algorithm-PLS (GAPLS). It can build more accurate and stable calibration models without increasing the complexity, and can be generalized to other NIR/MIR applications.
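A rough sketch of the fusion idea with scikit-learn stand-ins: k-means replaces the Kohonen map, mutual information is computed per training subset, and the member PLS models are averaged into a consensus. All of these substitutions are assumptions; this is not the authors' exact KMICPLS procedure.

```python
# Cluster -> MI filter per subset -> one PLS per subset -> consensus.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import mutual_info_regression

def kmicpls_like(X, y, n_subsets=5, n_components=5, seed=0):
    # Assumes each cluster keeps a reasonable number of samples.
    labels = KMeans(n_clusters=n_subsets, random_state=seed, n_init=10).fit_predict(X)
    models = []
    for c in range(n_subsets):
        Xc, yc = X[labels == c], y[labels == c]
        mi = mutual_info_regression(Xc, yc, random_state=seed)
        keep = mi > mi.mean()                 # retain variables above mean MI
        pls = PLSRegression(n_components=min(n_components, int(keep.sum()), len(yc) - 1))
        models.append((keep, pls.fit(Xc[:, keep], yc)))
    return models

def consensus_predict(models, X):
    preds = [m.predict(X[:, keep]).ravel() for keep, m in models]
    return np.mean(preds, axis=0)             # consensus = average of members

X = np.random.randn(120, 50)
y = X[:, :5].sum(axis=1) + 0.1 * np.random.randn(120)
print(consensus_predict(kmicpls_like(X, y), X[:5]))
```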
[Regression analysis to select native-like structures from decoys of antigen-antibody docking].
Chen, Zhengshan; Chi, Xiangyang; Fan, Pengfei; Zhang, Guanying; Wang, Meirong; Yu, Changming; Chen, Wei
2018-06-25
Given the increasing exploitation of antibodies in different contexts such as molecular diagnostics and therapeutics, it would be beneficial to unravel properties of antigen-antibody interaction with modeling of computational protein-protein docking, especially in the absence of a cocrystal structure. However, obtaining a native-like antigen-antibody structure remains challenging, due in part to the failure of existing scoring functions to reliably discriminate accurate from inaccurate structures among the tens of thousands of decoys produced by computational docking. We hypothesized that some important physicochemical and energetic features could be used to describe antigen-antibody interfaces and identify native-like antigen-antibody structures. We prepared a dataset, a subset of Protein-Protein Docking Benchmark Version 4.0, comprising 37 nonredundant 3D structures of antigen-antibody complexes, and used it to train and test a multivariate logistic regression equation that took several important physicochemical and energetic features of decoys as predictor variables. Our results indicate that the ability of our method to identify native-like structures is superior to the ZRANK and ZDOCK scores for the subset of antigen-antibody complexes. We then used our method in a workflow for predicting the epitope of the anti-Ebola glycoprotein monoclonal antibody 4G7 and identified three residues in its epitope.
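A minimal sketch of the rescoring idea: logistic regression over a handful of interface features, trained to separate native-like from incorrect decoys and used to rank them. The feature columns here are simulated placeholders, not the paper's descriptor set.

```python
# Rank docking decoys by a logistic-regression probability of being native-like.
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per decoy; hypothetical columns such as buried surface area,
# shape complementarity, electrostatic and van der Waals terms.
X = np.random.randn(200, 4)
y = (X @ np.array([1.5, -0.8, 0.6, 0.3]) + 0.3 * np.random.randn(200)) > 0

clf = LogisticRegression(max_iter=1000).fit(X, y)
native_like_prob = clf.predict_proba(X)[:, 1]    # score per decoy
top10 = np.argsort(native_like_prob)[::-1][:10]  # best-scoring decoys
print(top10)
```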
Rough sets and Laplacian score based cost-sensitive feature selection
Yu, Shenglong; Zhao, Hong
2018-01-01
Cost-sensitive feature selection learning is an important preprocessing step in machine learning and data mining. Recently, most existing cost-sensitive feature selection algorithms are heuristic algorithms, which evaluate the importance of each feature individually and select features one by one. Obviously, these algorithms do not consider the relationship among features. In this paper, we propose a new algorithm for minimal cost feature selection called the rough sets and Laplacian score based cost-sensitive feature selection. The importance of each feature is evaluated by both rough sets and Laplacian score. Compared with heuristic algorithms, the proposed algorithm takes into consideration the relationship among features with locality preservation of Laplacian score. We select a feature subset with maximal feature importance and minimal cost when cost is undertaken in parallel, where the cost is given by three different distributions to simulate different applications. Different from existing cost-sensitive feature selection algorithms, our algorithm simultaneously selects out a predetermined number of “good” features. Extensive experimental results show that the approach is efficient and able to effectively obtain the minimum cost subset. In addition, the results of our method are more promising than the results of other cost-sensitive feature selection algorithms. PMID:29912884
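The Laplacian score component can be sketched as follows, using its classic definition over a kNN graph; the rough-set term and the cost distributions from the paper are not reproduced. Lower scores indicate features that better preserve local structure.

```python
# Laplacian score per feature over a symmetrized kNN connectivity graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X, k=5):
    S = kneighbors_graph(X, k, mode="connectivity", include_self=False).toarray()
    S = np.maximum(S, S.T)                    # symmetrize the kNN graph
    D = np.diag(S.sum(axis=1))
    L = D - S                                 # graph Laplacian
    ones = np.ones(len(X))
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones  # center by degree
        scores.append((f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde + 1e-12))
    return np.array(scores)

X = np.random.randn(100, 8)
print(np.argsort(laplacian_scores(X)))  # features ranked, best (smallest) first
```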
An Improved Search Approach for Solving Non-Convex Mixed-Integer Non Linear Programming Problems
NASA Astrophysics Data System (ADS)
Sitopu, Joni Wilson; Mawengkang, Herman; Syafitri Lubis, Riri
2018-01-01
The nonlinear mathematical programming problem addressed in this paper has a structure characterized by a subset of variables restricted to assume discrete values, which are linear and separable from the continuous variables. The strategy of releasing nonbasic variables from their bounds, combined with the “active constraint” method, has been developed. This strategy is used to force the appropriate non-integer basic variables to move to neighbouring integer points. Successful implementation of these algorithms was achieved on various test problems.
Comparison of Different EHG Feature Selection Methods for the Detection of Preterm Labor
Alamedine, D.; Khalil, M.; Marque, C.
2013-01-01
Numerous types of linear and nonlinear features have been extracted from the electrohysterogram (EHG) in order to classify labor and pregnancy contractions. As a result, the number of available features is now very large. The goal of this study is to reduce the number of features by selecting only the relevant ones which are useful for solving the classification problem. This paper presents three methods for feature subset selection that can be applied to choose the best subsets for classifying labor and pregnancy contractions: an algorithm using the Jeffrey divergence (JD) distance, a sequential forward selection (SFS) algorithm, and a binary particle swarm optimization (BPSO) algorithm. The last two methods are based on a classifier and were tested with three types of classifiers. These methods have allowed us to identify common features which are relevant for contraction classification. PMID:24454536
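Of the three methods, SFS is the most direct to illustrate; below is a sketch with scikit-learn's sequential selector and a stand-in classifier (the EHG features here are simulated placeholders).

```python
# Sequential forward selection with a classifier in the loop.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X = np.random.randn(150, 20)                                # stand-in EHG features
y = (X[:, 3] - X[:, 7] + 0.5 * np.random.randn(150)) > 0    # labor vs pregnancy

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",
    cv=5,
).fit(X, y)
print("selected feature indices:", np.flatnonzero(sfs.get_support()))
```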
HIV-1 protease cleavage site prediction based on two-stage feature selection method.
Niu, Bing; Yuan, Xiao-Cheng; Roeper, Preston; Su, Qiang; Peng, Chun-Rong; Yin, Jing-Yuan; Ding, Juan; Li, HaiPeng; Lu, Wen-Cong
2013-03-01
Knowledge of the mechanism of HIV protease cleavage specificity is critical to the design of specific and effective HIV inhibitors. Searching for an accurate, robust, and rapid method to correctly predict the cleavage sites in proteins is crucial when searching for possible HIV inhibitors. In this article, HIV-1 protease specificity was studied using the correlation-based feature subset (CfsSubset) selection method combined with a genetic algorithm. Thirty important biochemical features were found based on a jackknife test from the original data set containing 4,248 features. By using the AdaBoost method with the thirty selected features, the prediction model yields an accuracy of 96.7% for the jackknife test and 92.1% for an independent set test, increasing accuracy over the original feature set by 6.7% and 77.4%, respectively. Our feature selection scheme could be a useful technique for finding effective competitive inhibitors of HIV protease.
USDA-ARS?s Scientific Manuscript database
Selective principal component regression analysis (SPCR) uses a subset of the original image bands for principal component transformation and regression. For optimal band selection before the transformation, this paper used genetic algorithms (GA). In this case, the GA process used the regression co...
Hammerstrom, Donald J.
2013-10-15
A method for managing the charging and discharging of batteries wherein at least one battery is connected to a battery charger, and the battery charger is connected to a power supply. A plurality of controllers in communication with one another are provided, each of the controllers monitoring a subset of input variables. A set of charging constraints may then be generated for each controller as a function of the subset of input variables. A set of objectives for each controller may also be generated. A preferred charge rate for each controller is generated as a function of the set of objectives, the charging constraints, or both, using an algorithm that accounts for each of the preferred charge rates for each of the controllers and/or that does not violate any of the charging constraints. A current flow between the battery and the battery charger is then provided at the resulting actual charge rate.
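One possible reading of the aggregation step, sketched as a toy function (an interpretation of the patent text, not its actual algorithm): blend the controllers' preferred rates, then clip the result so no constraint is violated.

```python
# Toy charge-rate aggregation honoring every controller's constraint.
def actual_charge_rate(preferred_rates, max_rate_constraints):
    """Blend controller preferences without violating any constraint."""
    blended = sum(preferred_rates) / len(preferred_rates)   # simple average
    return min([blended] + list(max_rate_constraints))      # clip to constraints

print(actual_charge_rate([5.0, 7.0, 6.0], [6.5, 8.0]))      # -> 6.0
```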
What we talk about when we talk about access deficits
Mirman, Daniel; Britt, Allison E.
2014-01-01
Semantic impairments have been divided into storage deficits, in which the semantic representations themselves are damaged, and access deficits, in which the representations are intact but access to them is impaired. The behavioural phenomena that have been associated with access deficits include sensitivity to cueing, sensitivity to presentation rate, performance inconsistency, negative serial position effects, sensitivity to number and strength of competitors, semantic blocking effects, disordered selection between strong and weak competitors, correlation between semantic deficits and executive function deficits and reduced word frequency effects. Four general accounts have been proposed for different subsets of these phenomena: abnormal refractoriness, too much activation, impaired competitive selection and deficits of semantic control. A combination of abnormal refractoriness and impaired competitive selection can account for most of the behavioural phenomena, but there remain several open questions. In particular, it remains unclear whether access deficits represent a single syndrome, a syndrome with multiple subtypes or a variable collection of phenomena, whether the underlying deficit is domain-general or domain-specific, whether it is owing to disorders of inhibition, activation or selection, and the nature of the connection (if any) between access phenomena in aphasia and in neurologically intact controls. Computational models offer a promising approach to answering these questions. PMID:24324232
Visuomotor Transformations Underlying Hunting Behavior in Zebrafish
Bianco, Isaac H.; Engert, Florian
2015-01-01
Visuomotor circuits filter visual information and determine whether or not to engage downstream motor modules to produce behavioral outputs. However, the circuit mechanisms that mediate and link perception of salient stimuli to execution of an adaptive response are poorly understood. We combined a virtual hunting assay for tethered larval zebrafish with two-photon functional calcium imaging to simultaneously monitor neuronal activity in the optic tectum during naturalistic behavior. Hunting responses showed mixed selectivity for combinations of visual features, specifically stimulus size, speed, and contrast polarity. We identified a subset of tectal neurons with similar highly selective tuning, which show non-linear mixed selectivity for visual features and are likely to mediate the perceptual recognition of prey. By comparing neural dynamics in the optic tectum during response versus non-response trials, we discovered premotor population activity that specifically preceded initiation of hunting behavior and exhibited anatomical localization that correlated with motor variables. In summary, the optic tectum contains non-linear mixed selectivity neurons that are likely to mediate reliable detection of ethologically relevant sensory stimuli. Recruitment of small tectal assemblies appears to link perception to action by providing the premotor commands that release hunting responses. These findings allow us to propose a model circuit for the visuomotor transformations underlying a natural behavior. PMID:25754638
Estimation of fat-free mass in Asian neonates using bioelectrical impedance analysis
Tint, Mya-Thway; Ward, Leigh C; Soh, Shu E; Aris, Izzuddin M; Chinnadurai, Amutha; Saw, Seang Mei; Gluckman, Peter D; Godfrey, Keith M; Chong, Yap-Seng; Kramer, Michael S; Yap, Fabian; Lingwood, Barbara; Lee, Yung Seng
2016-01-01
The aims of this study were to develop and validate a prediction equation of fat-free mass (FFM) based on bioelectrical impedance analysis (BIA) and anthropometry, using air displacement plethysmography (ADP) as a reference in Asian neonates, and to test the applicability of the prediction equations in an independent Western cohort. A total of 173 neonates at birth and 140 at week 2 of age were included. Multiple linear regression analysis was performed to develop the prediction equations in a two-thirds randomly selected subset and validated on the remaining one-third subset at each time point and in an independent Queensland cohort. FFM measured by ADP was the dependent variable, and anthropometric measures, sex and impedance quotient (L2/R50) were independent variables in the model. Accuracy of the prediction equations was assessed using intra-class correlation and Bland-Altman analyses. L2/R50 was a significant predictor of FFM at week 2 but not at birth. Compared to the model using weight, sex and length, including L2/R50 slightly improved the prediction, with a bias of 0.01 kg and 2SD limits of agreement (LOA) (0.18, −0.20). The prediction explained 88.9% of variation but not beyond that of anthropometry. Applying these equations to the Queensland cohort provided similar performance at the appropriate age. However, when the Queensland equations were applied to our cohort, the bias increased slightly but with similar LOA. BIA appears to have limited use in predicting FFM in the first few weeks of life compared to simple anthropometry in Asian populations. There is a need for population- and age-appropriate FFM prediction equations. PMID:26856420
NASA Astrophysics Data System (ADS)
Pal, I.; Lall, U.; Robertson, A. W.; Cane, M. A.; Bansal, R.
2013-06-01
Snowmelt-dominated streamflow of the Western Himalayan rivers is an important water resource during the dry pre-monsoon spring months to meet the irrigation and hydropower needs in northern India. Here we study the seasonal prediction of melt-dominated total inflow into the Bhakra Dam in northern India based on statistical relationships with meteorological variables during the preceding winter. Total inflow into the Bhakra Dam includes the Satluj River flow together with a flow diversion from its tributary, the Beas River. Both are tributaries of the Indus River that originate from the Western Himalayas, which is an under-studied region. Average measured winter snow volume at the upper-elevation stations and corresponding lower-elevation rainfall and temperature of the Satluj River basin were considered as empirical predictors. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used to select the best subset of inputs from all the possible combinations of predictors for a multiple linear regression framework. To test for potential issues arising due to multicollinearity of the predictor variables, cross-validated prediction skills of the best subset were also compared with the prediction skills of principal component regression (PCR) and partial least squares regression (PLSR) techniques, which yielded broadly similar results. As a whole, the forecasts issued at the end of winter, as the melt season commences, were shown to have potential skill for guiding the development of stochastic optimization models to manage the trade-off between irrigation and hydropower releases versus flood control during the annual fill cycle of the Bhakra Reservoir, a major energy and irrigation source in the region.
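The best-subset search by AIC can be sketched with statsmodels; the exhaustive enumeration below uses placeholder predictor names standing in for the snow, rainfall, and temperature series.

```python
# Exhaustive best-subset selection by AIC over all predictor combinations.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(40, 3)), columns=["snow", "rain", "temp"])
y = 2.0 * df["snow"] - 0.5 * df["temp"] + rng.normal(scale=0.5, size=40)

best = None
for r in range(1, 4):
    for combo in itertools.combinations(df.columns, r):
        X = sm.add_constant(df[list(combo)])
        fit = sm.OLS(y, X).fit()
        if best is None or fit.aic < best[0]:
            best = (fit.aic, combo)             # keep lowest-AIC subset
print(f"best subset by AIC: {best[1]} (AIC={best[0]:.1f})")
```

With three predictors the search is trivial (7 models); for larger predictor pools the same loop grows exponentially, which is why criteria like AIC/BIC are typically paired with small candidate sets as in this study.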
Kennedy, Curtis E; Turley, James P
2011-10-24
Thousands of children experience cardiac arrest events every year in pediatric intensive care units. Most of these children die. Cardiac arrest prediction tools are used as part of medical emergency team evaluations to identify patients in standard hospital beds that are at high risk for cardiac arrest. There are no models to predict cardiac arrest in pediatric intensive care units though, where the risk of an arrest is 10 times higher than for standard hospital beds. Current tools are based on a multivariable approach that does not characterize deterioration, which often precedes cardiac arrests. Characterizing deterioration requires a time series approach. The purpose of this study is to propose a method that will allow for time series data to be used in clinical prediction models. Successful implementation of these methods has the potential to bring arrest prediction to the pediatric intensive care environment, possibly allowing for interventions that can save lives and prevent disabilities. We reviewed prediction models from nonclinical domains that employ time series data, and identified the steps that are necessary for building predictive models using time series clinical data. We illustrate the method by applying it to the specific case of building a predictive model for cardiac arrest in a pediatric intensive care unit. Time course analysis studies from genomic analysis provided a modeling template that was compatible with the steps required to develop a model from clinical time series data. The steps include: 1) selecting candidate variables; 2) specifying measurement parameters; 3) defining data format; 4) defining time window duration and resolution; 5) calculating latent variables for candidate variables not directly measured; 6) calculating time series features as latent variables; 7) creating data subsets to measure model performance effects attributable to various classes of candidate variables; 8) reducing the number of candidate features; 9) training models for various data subsets; and 10) measuring model performance characteristics in unseen data to estimate their external validity. We have proposed a ten step process that results in data sets that contain time series features and are suitable for predictive modeling by a number of methods. We illustrated the process through an example of cardiac arrest prediction in a pediatric intensive care setting.
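Steps 4-6 of the process (window duration/resolution and time series features as latent variables) can be sketched with pandas; the vital sign, window length, and features below are illustrative assumptions, not the study's specification.

```python
# Rolling-window features as latent variables from a clinical time series.
import numpy as np
import pandas as pd

idx = pd.date_range("2011-01-01", periods=720, freq="min")
vitals = pd.DataFrame({"heart_rate": 120 + np.random.randn(720).cumsum()}, index=idx)

window = vitals["heart_rate"].rolling("60min")      # step 4: window duration
features = pd.DataFrame({
    "hr_mean": window.mean(),                       # steps 5-6: latent variables
    "hr_std": window.std(),                         # variability within window
    "hr_slope": window.mean().diff(60),             # crude trend over the hour
})
print(features.dropna().tail(3))                    # rows become model inputs
```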
Sample size determination for bibliographic retrieval studies
Yao, Xiaomei; Wilczynski, Nancy L; Walter, Stephen D; Haynes, R Brian
2008-01-01
Background Research for developing search strategies to retrieve high-quality clinical journal articles from MEDLINE is expensive and time-consuming. The objective of this study was to determine the minimal number of high-quality articles in a journal subset that would need to be hand-searched to update or create new MEDLINE search strategies for treatment, diagnosis, and prognosis studies. Methods The desired width of the 95% confidence intervals (W) for the lowest sensitivity among existing search strategies was used to calculate the number of high-quality articles needed to reliably update search strategies. New search strategies were derived in journal subsets formed by 2 approaches: random sampling of journals and top journals (having the most high-quality articles). The new strategies were tested in both the original large journal database and in a low-yielding journal (having few high-quality articles) subset. Results For treatment studies, if W was 10% or less for the lowest sensitivity among our existing search strategies, a subset of 15 randomly selected journals or 2 top journals were adequate for updating search strategies, based on each approach having at least 99 high-quality articles. The new strategies derived in 15 randomly selected journals or 2 top journals performed well in the original large journal database. Nevertheless, the new search strategies developed using the random sampling approach performed better than those developed using the top journal approach in a low-yielding journal subset. For studies of diagnosis and prognosis, no journal subset had enough high-quality articles to achieve the expected W (10%). Conclusion The approach of randomly sampling a small subset of journals that includes sufficient high-quality articles is an efficient way to update or create search strategies for high-quality articles on therapy in MEDLINE. The concentrations of diagnosis and prognosis articles are too low for this approach. PMID:18823538
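The sample-size logic follows from the normal-approximation confidence interval for a proportion: the 95% CI for a sensitivity p estimated from n articles has width roughly 2·1.96·√(p(1−p)/n), so requiring width ≤ W gives n ≥ (2·1.96)²·p(1−p)/W². A small sketch (the p value below is illustrative, not taken from the paper):

```python
# Articles needed so the 95% CI for a sensitivity p has width <= W.
import math

def articles_needed(p: float, width: float, z: float = 1.96) -> int:
    """Smallest n with CI width 2*z*sqrt(p*(1-p)/n) <= width."""
    return math.ceil((2 * z) ** 2 * p * (1 - p) / width ** 2)

print(articles_needed(p=0.93, width=0.10))  # roughly 100 high-quality articles
```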
Improving permafrost distribution modelling using feature selection algorithms
NASA Astrophysics Data System (ADS)
Deluigi, Nicola; Lambiel, Christophe; Kanevski, Mikhail
2016-04-01
The availability of an increasing number of spatial data on the occurrence of mountain permafrost allows the employment of machine learning (ML) classification algorithms for modelling the distribution of the phenomenon. One of the major problems when dealing with high-dimensional datasets is the number of input features (variables) involved. Applying ML classification algorithms to this large number of variables leads to the risk of overfitting, with the consequence of poor generalization/prediction. For this reason, applying feature selection (FS) techniques helps simplify the number of factors required and improves the knowledge of the adopted features and their relation to the studied phenomenon. Moreover, removing irrelevant or redundant variables from the dataset effectively improves the quality of the ML prediction. This research deals with a comparative analysis of permafrost distribution models supported by FS variable importance assessment. The input dataset (dimension = 20-25, 10 m spatial resolution) was constructed using landcover maps, climate data and DEM-derived variables (altitude, aspect, slope, terrain curvature, solar radiation, etc.). It was completed with permafrost evidence (geophysical and thermal data and rock glacier inventories) that serves as training permafrost data. The FS algorithms used indicated which variables appeared less statistically important for permafrost presence/absence. Three different algorithms were compared: Information Gain (IG), Correlation-based Feature Selection (CFS) and Random Forest (RF). IG is a filter technique that evaluates the worth of a predictor by measuring the information gain with respect to permafrost presence/absence. CFS, in contrast, evaluates the worth of a subset of predictors by considering the individual predictive ability of each variable along with the degree of redundancy between them. Finally, RF is a ML algorithm that performs FS as part of its overall operation. It operates by constructing a large collection of decorrelated classification trees, and then predicts the permafrost occurrence through a majority vote. With the so-called out-of-bag (OOB) error estimate, the classification of permafrost data can be validated and the contribution of each predictor assessed. The performance of the compared permafrost distribution models (computed on independent testing sets) increased when the FS algorithms were applied to the original dataset and irrelevant or redundant variables were removed. As a consequence, the process provided faster and more cost-effective predictors and a better understanding of the underlying structures residing in the permafrost data. Our work demonstrates the usefulness of a feature selection step prior to applying a machine learning algorithm. In fact, permafrost predictors could be ranked not only based on their heuristic and subjective importance (expert knowledge), but also based on their statistical relevance in relation to the permafrost distribution.
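Two of the three approaches are straightforward to sketch with scikit-learn: an information-gain (mutual information) filter ranking, and random forest importances with the OOB score as a built-in validation estimate. The simulated predictors stand in for the DEM and climate variables.

```python
# IG-style filter ranking vs. random forest importance ranking.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

X = np.random.randn(500, 10)                                # DEM/climate stand-ins
y = (X[:, 0] + 0.5 * X[:, 3] + np.random.randn(500)) > 0    # presence/absence

ig = mutual_info_classif(X, y, random_state=0)              # filter: information gain
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=0).fit(X, y)

print("IG ranking :", np.argsort(ig)[::-1])
print("RF ranking :", np.argsort(rf.feature_importances_)[::-1])
print("OOB score  :", round(rf.oob_score_, 3))              # validation estimate
```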
Christopher M. Taylor; Melvin L. Warren
2001-01-01
Stream landscapes are highly variable in space and time and, like terrestrial landscapes, the resources they contain are patchily distributed. Organisms may disperse among patches to fulfill life-history requirements, but biotic and abiotic factors may limit patch or locality occupancy. Thus, the dynamics of immigration and extinction determine, in part, the local...
Five Guidelines for Selecting Hydrological Signatures
NASA Astrophysics Data System (ADS)
McMillan, H. K.; Westerberg, I.; Branger, F.
2017-12-01
Hydrological signatures are index values derived from observed or modeled series of hydrological data such as rainfall, flow or soil moisture. They are designed to extract relevant information about hydrological behavior, such as to identify dominant processes, and to determine the strength, speed and spatiotemporal variability of the rainfall-runoff response. Hydrological signatures play an important role in model evaluation. They allow us to test whether particular model structures or parameter sets accurately reproduce the runoff generation processes within the watershed of interest. Most modeling studies use a selection of different signatures to capture different aspects of the catchment response, for example evaluating overall flow distribution as well as high and low flow extremes and flow timing. Such studies often choose their own set of signatures, or may borrow subsets of signatures used in multiple other works. The link between signature values and hydrological processes is not always straightforward, leading to uncertainty and variability in hydrologists' signature choices. In this presentation, we aim to encourage a more rigorous approach to hydrological signature selection, which considers the ability of signatures to represent hydrological behavior and underlying processes for the catchment and application in question. To this end, we propose a set of guidelines for selecting hydrological signatures. We describe five criteria that any hydrological signature should conform to: Identifiability, Robustness, Consistency, Representativeness, and Discriminatory Power. We describe an example of the design process for a signature, assessing possible signature designs against the guidelines above. Due to their ubiquity, we chose a signature related to the Flow Duration Curve, selecting the FDC mid-section slope as a proposed signature to quantify catchment overall behavior and flashiness. We demonstrate how assessment against each guideline could be used to compare or choose between alternative signature definitions. We believe that reaching a consensus on selection criteria for hydrological signatures will assist modelers to choose between competing signatures, facilitate comparison between hydrological studies, and help hydrologists to fully evaluate their models.
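One concrete formulation of the chosen signature (an assumption; the authors may define it differently) is the slope of the flow duration curve between the 33% and 66% exceedance quantiles, computed in log space.

```python
# FDC mid-section slope: (ln Q33 - ln Q66) / (0.66 - 0.33).
import numpy as np

def fdc_midslope(flows: np.ndarray) -> float:
    q33 = np.quantile(flows, 1 - 0.33)   # flow exceeded 33% of the time
    q66 = np.quantile(flows, 1 - 0.66)   # flow exceeded 66% of the time
    return (np.log(q33) - np.log(q66)) / (0.66 - 0.33)

flows = np.random.lognormal(1.0, 0.8, 3650)   # ten years of synthetic daily flow
print(f"FDC mid-section slope: {fdc_midslope(flows):.2f}")  # flashier -> steeper
```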
Regionalization of land-use impacts on streamflow using a network of paired catchments
NASA Astrophysics Data System (ADS)
Ochoa-Tocachi, Boris F.; Buytaert, Wouter; De Bièvre, Bert
2016-09-01
Quantifying the impact of land use and cover (LUC) change on catchment hydrological response is essential for land-use planning and management. Yet hydrologists are often not able to present consistent and reliable evidence to support such decision-making. The issue tends to be twofold: a scarcity of relevant observations, and the difficulty of regionalizing any existing observations. This study explores the potential of a paired catchment monitoring network to provide statistically robust, regionalized predictions of LUC change impact in an environment of high hydrological variability. We test the importance of LUC variables to explain hydrological responses and to improve regionalized predictions using 24 catchments distributed along the Tropical Andes. For this, we first calculate 50 physical catchment properties, and then select a subset based on correlation analysis. The reduced set is subsequently used to regionalize a selection of hydrological indices using multiple linear regression. Contrary to earlier studies, we find that incorporating LUC variables in the regional model structures significantly increases regression performance and predictive capacity for 66% of the indices. For the runoff ratio, baseflow index, and slope of the flow duration curve, the mean absolute error reduces by 53% and the variance of the residuals by 79%, on average. We attribute the explanatory capacity of LUC in the regional model to the pairwise monitoring setup, which increases the contrast of the land-use signal in the data set. As such, it may be a useful strategy to optimize data collection to support watershed management practices and improve decision-making in data-scarce regions.
Using Impact-Relevant Sensitivities to Efficiently Evaluate and Select Climate Change Scenarios
NASA Astrophysics Data System (ADS)
Vano, J. A.; Kim, J. B.; Rupp, D. E.; Mote, P.
2014-12-01
We outline an efficient approach to help researchers and natural resource managers more effectively use global climate model information in their long-term planning. The approach provides an estimate of the magnitude of change of a particular impact (e.g., summertime streamflow) from a large ensemble of climate change projections prior to detailed analysis. These estimates provide both qualitative information as an end unto itself (e.g., the distribution of future changes between emissions scenarios for the specific impact) and a judicious, defensible evaluation structure that can be used to qualitatively select a subset of climate models for further analysis. More specifically, the evaluation identifies global climate model scenarios that both (1) span the range of possible futures for the variable(s) most important to the impact under investigation, and (2) come from global climate models that adequately simulate historical climate, providing plausible results for the future climate in the region of interest. To identify how an ecosystem process responds to projected future changes, we methodically sample, using a simple sensitivity analysis, how an impact variable (e.g., streamflow magnitude, vegetation carbon) responds locally to projected regional temperature and precipitation changes. We demonstrate our technique over the Pacific Northwest, focusing on two types of impacts each in three distinct geographic settings: (a) changes in streamflow magnitudes in critical seasons for water management in the Willamette, Yakima, and Upper Columbia River basins; and (b) changes in annual vegetation carbon in the Oregon and Washington Coast Ranges, Western Cascades, and Columbia Basin ecoregions.
Curtis, Alexandra M; VanBuren, John; Cavanaugh, Joseph E; Warren, John J; Marshall, Teresa A; Levy, Steven M
2018-05-12
To assess longitudinal associations between permanent tooth caries increment and both modifiable and non-modifiable risk factors, using best subsets model selection. The Iowa Fluoride Study has followed a birth cohort with standardized caries exams without radiographs of the permanent dentition conducted at about ages 9, 13, and 17 years. Questionnaires were sent semi-annually to assess fluoride exposures and intakes, select food and beverage intakes, and tooth brushing frequency. Exposure variables were averaged over ages 7-9, 11-13, and 15-17, reflecting exposure 2 years prior to the caries exam. Longitudinal models were used to relate period-specific averaged exposures and demographic variables to adjusted decayed and filled surface increments (ADJCI) (n = 392). The Akaike Information Criterion (AIC) was used to assess optimal explanatory variable combinations. From birth to age 9, 9-13, and 13-17 years, 24, 30, and 55 percent of subjects had positive permanent ADJCI, respectively. Ten models had AIC values within two units of the lowest AIC model and were deemed optimal based on AIC. Younger age, being male, higher mother's education, and higher brushing frequency were associated with lower caries increment in all 10 models, while milk intake was included in 3 of 10 models. Higher milk intakes were slightly associated with lower ADJCI. With the exception of brushing frequency, modifiable risk factors under study were not significantly associated with ADJCI. When possible, researchers should consider presenting multiple models if fit criteria cannot discern among a group of optimal models.
What MISR data are available for field experiments?
Atmospheric Science Data Center
2014-12-08
MISR data and imagery are available for many field campaigns. Select data products are subset for the region and dates of interest. Special gridded regional products may be available as well as Local Mode data for select targets...
NASA Astrophysics Data System (ADS)
Araújo, J. P. C.; DA Silva, L. M.; Dourado, F. A. D.; Fernandes, N.
2015-12-01
Landslides are the most damaging natural hazard in the mountainous region of Rio de Janeiro State in Brazil, responsible for thousands of deaths and important financial and environmental losses. However, this region currently has few landslide susceptibility maps implemented at an adequate scale. Identification of landslide-susceptible areas is fundamental for successful land use planning and management practices to reduce risk. This paper applied the weight-of-evidence (WoE) method, based on Bayes' theorem, with 8 landslide-related factors in a geographic information system (GIS) for landslide susceptibility mapping. 378 landslide locations, triggered by the January 2011 rainfall event, were identified and mapped in a selected basin in the city of Nova Friburgo. The landslide scars were divided into two subsets: a training and a validation subset. A chi-square test was applied to the 8 WoE-weighted factors to indicate which variables are conditionally independent of each other and could be used in the final map. Finally, the maps of weighted factors were summed to construct the landslide susceptibility map, which was validated with the validation landslide subset. According to the results, slope, aspect and contributing area showed the highest positive spatial correlation with landslides. In the landslide susceptibility map, 21% of the area presented very low and low susceptibility with 3% of the validation scars, 41% presented medium susceptibility with 22% of the validation scars and 38% presented high and very high susceptibility with 75% of the validation scars. The very high susceptibility class accounts for 16% of the basin area and contains 54% of all scars. The approach used in this study can be considered very useful since 75% of the area affected by landslides was included in the high and very high susceptibility classes.
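The core WoE computation for one binary evidence layer (e.g. "slope above some threshold") can be sketched as follows; the cell counts are invented for illustration.

```python
# Weight-of-evidence weights for a binary factor vs. landslide presence.
import math

def woe_weights(n_factor_slide, n_factor_total, n_slide, n_total):
    """W+ and W- for a binary evidence layer vs. landslide occurrence."""
    p_b_given_l = n_factor_slide / n_slide
    p_b_given_not_l = (n_factor_total - n_factor_slide) / (n_total - n_slide)
    w_plus = math.log(p_b_given_l / p_b_given_not_l)          # factor present
    w_minus = math.log((1 - p_b_given_l) / (1 - p_b_given_not_l))  # factor absent
    return w_plus, w_minus

w_plus, w_minus = woe_weights(250, 20000, 378, 500000)
print(f"W+={w_plus:.2f}  W-={w_minus:.2f}  contrast={w_plus - w_minus:.2f}")
```

Summing the appropriate weight per factor over all evidence layers, as the abstract describes, yields the susceptibility score for each map cell.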
O'Gorman, William E.; Hsieh, Elena W.Y.; Savig, Erica S.; Gherardini, Pier Federico; Hernandez, Joseph D.; Hansmann, Leo; Balboni, Imelda M.; Utz, Paul J.; Bendall, Sean C.; Fantl, Wendy J.; Lewis, David B.; Nolan, Garry P.; Davis, Mark M.
2015-01-01
Background Activation of Toll-Like Receptors (TLRs) induces inflammatory responses involved in immunity to pathogens and autoimmune pathogenesis, such as in Systemic Lupus Erythematosus (SLE). Although TLRs are differentially expressed across the immune system, a comprehensive analysis of how multiple immune cell subsets respond in a system-wide manner has previously not been described. Objective To characterize TLR activation across multiple immune cell subsets and individuals, with the goal of establishing a reference framework against which to compare pathological processes. Methods Peripheral whole blood samples were stimulated with TLR ligands, and analyzed by mass cytometry simultaneously for surface marker expression, activation states of intracellular signaling proteins, and cytokine production. We developed a novel data visualization tool to provide an integrated view of TLR signaling networks with single-cell resolution. We studied seventeen healthy volunteer donors and eight newly diagnosed untreated SLE patients. Results Our data revealed the diversity of TLR-induced responses within cell types, with TLR ligand specificity. Subsets of NK and T cells selectively induced NF-κB in response to TLR2 ligands. CD14hi monocytes exhibited the most polyfunctional cytokine expression patterns, with over 80 distinct cytokine combinations. Monocytic TLR-induced cytokine patterns were shared amongst a group of healthy donors, with minimal intra- and inter-individual variability. Furthermore, autoimmune disease altered baseline cytokine production, as newly diagnosed untreated SLE patients shared a distinct monocytic chemokine signature, despite clinical heterogeneity. Conclusion Mass cytometry analysis defined a systems-level reference framework for human TLR activation, which can be applied to study perturbations in inflammatory disease, such as SLE. PMID:26037552
Feature selection with harmony search.
Diao, Ren; Shen, Qiang
2012-12-01
Many search strategies have been exploited for the task of feature selection (FS), in an effort to identify more compact and better quality subsets. Such work typically involves the use of greedy hill climbing (HC), or nature-inspired heuristics, in order to discover the optimal solution without going through exhaustive search. In this paper, a novel FS approach based on harmony search (HS) is presented. It is a general approach that can be used in conjunction with many subset evaluation techniques. The simplicity of HS is exploited to reduce the overall complexity of the search process. The proposed approach is able to escape from local solutions and identify multiple solutions owing to the stochastic nature of HS. Additional parameter control schemes are introduced to reduce the effort and impact of parameter configuration. These can be further combined with the iterative refinement strategy, tailored to enforce the discovery of quality subsets. The resulting approach is compared with those that rely on HC, genetic algorithms, and particle swarm optimization, accompanied by in-depth studies of the suggested improvements.
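A toy binary harmony search for feature selection might look like the sketch below; the HMCR/PAR values and the cross-validated logistic regression used as the subset evaluator are illustrative choices, not the paper's configuration.

```python
# Binary harmony search wrapped around a cross-validated classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.random.randn(200, 15)
y = (X[:, 1] + X[:, 5] - X[:, 9] + np.random.randn(200)) > 0

def fitness(mask):
    if not mask.any():
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, mask], y, cv=3).mean()

HMS, HMCR, PAR, iters = 10, 0.9, 0.3, 100
memory = rng.integers(0, 2, size=(HMS, X.shape[1])).astype(bool)  # harmony memory
scores = np.array([fitness(m) for m in memory])

for _ in range(iters):
    new = np.empty(X.shape[1], dtype=bool)
    for j in range(X.shape[1]):
        if rng.random() < HMCR:                     # memory consideration
            new[j] = memory[rng.integers(HMS), j]
            if rng.random() < PAR:                  # pitch adjustment: flip bit
                new[j] = ~new[j]
        else:                                       # random consideration
            new[j] = rng.random() < 0.5
    s = fitness(new)
    worst = scores.argmin()
    if s > scores[worst]:                           # replace worst harmony
        memory[worst], scores[worst] = new, s

print("best subset:", np.flatnonzero(memory[scores.argmax()]))
```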
Bourion, Virginie; Heulin-Gotty, Karine; Aubert, Véronique; Tisseyre, Pierre; Chabert-Martinello, Marianne; Pervent, Marjorie; Delaitre, Catherine; Vile, Denis; Siol, Mathieu; Duc, Gérard; Brunel, Brigitte; Burstin, Judith; Lepetit, Marc
2018-01-01
Pea forms symbiotic nodules with Rhizobium leguminosarum sv. viciae (Rlv). In the field, pea roots can be exposed to multiple compatible Rlv strains. Little is known about the mechanisms underlying the competitiveness for nodulation of Rlv strains and the ability of pea to choose between diverse compatible Rlv strains. The variability of pea-Rlv partner choice was investigated by co-inoculating a 104-pea collection, representative of the variability encountered in the genus Pisum, with a mixture of five diverse Rlv strains. The nitrogen fixation efficiency conferred by each strain was determined in additional mono-inoculation experiments on a subset of 18 pea lines displaying contrasted Rlv choice. Differences in Rlv choice were observed within the pea collection according to their genetic or geographical diversities. The competitiveness for nodulation of a given pea-Rlv association evaluated in the multi-inoculated experiment was poorly correlated with its nitrogen fixation efficiency determined in mono-inoculation. Both plant and bacterial genetic determinants contribute to pea-Rlv partner choice. No evidence was found for co-selection of competitiveness for nodulation and nitrogen fixation efficiency. Plant and inoculant for an improved symbiotic association in the field must be selected not only on nitrogen fixation efficiency but also on competitiveness for nodulation. PMID:29367857
Robarts, Daniel W H; Wolfe, Andrea D
2014-07-01
In the past few decades, many investigations in the field of plant biology have employed selectively neutral, multilocus, dominant markers such as inter-simple sequence repeat (ISSR), random-amplified polymorphic DNA (RAPD), and amplified fragment length polymorphism (AFLP) to address hypotheses at lower taxonomic levels. More recently, sequence-related amplified polymorphism (SRAP) markers have been developed, which are used to amplify coding regions of DNA with primers targeting open reading frames. These markers have proven to be robust and highly variable, on par with AFLP, and are attained through a significantly less technically demanding process. SRAP markers have been used primarily for agronomic and horticultural purposes, developing quantitative trait loci in advanced hybrids and assessing genetic diversity of large germplasm collections. Here, we suggest that SRAP markers should be employed for research addressing hypotheses in plant systematics, biogeography, conservation, ecology, and beyond. We provide an overview of the SRAP literature to date, review descriptive statistics of SRAP markers in a subset of 171 publications, and present relevant case studies to demonstrate the applicability of SRAP markers to the diverse field of plant biology. Results of these selected works indicate that SRAP markers have the potential to enhance the current suite of molecular tools in a diversity of fields by providing an easy-to-use, highly variable marker with inherent biological significance.
Bootstrap Enhanced Penalized Regression for Variable Selection with Neuroimaging Data.
Abram, Samantha V; Helwig, Nathaniel E; Moodie, Craig A; DeYoung, Colin G; MacDonald, Angus W; Waller, Niels G
2016-01-01
Recent advances in fMRI research highlight the use of multivariate methods for examining whole-brain connectivity. Complementary data-driven methods are needed for determining the subset of predictors related to individual differences. Although commonly used for this purpose, ordinary least squares (OLS) regression may not be ideal due to multi-collinearity and over-fitting issues. Penalized regression is a promising and underutilized alternative to OLS regression. In this paper, we propose a nonparametric bootstrap quantile (QNT) approach for variable selection with neuroimaging data. We use real and simulated data, as well as annotated R code, to demonstrate the benefits of our proposed method. Our results illustrate the practical potential of our proposed bootstrap QNT approach. Our real data example demonstrates how our method can be used to relate individual differences in neural network connectivity with an externalizing personality measure. Also, our simulation results reveal that the QNT method is effective under a variety of data conditions. Penalized regression yields more stable estimates and sparser models than OLS regression in situations with large numbers of highly correlated neural predictors. Our results demonstrate that penalized regression is a promising method for examining associations between neural predictors and clinically relevant traits or behaviors. These findings have important implications for the growing field of functional connectivity research, where multivariate methods produce numerous, highly correlated brain networks.
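The authors provide annotated R code; purely for orientation, here is a hedged Python sketch of the core bootstrap-quantile idea under stated assumptions: a cross-validated lasso stands in for the penalized regression, and a predictor is retained when its 95% bootstrap coefficient interval excludes zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
# synthetic stand-in for correlated neural predictors and a trait measure
X, y = make_regression(n_samples=120, n_features=30, n_informative=5,
                       noise=5.0, random_state=1)

B = 100                                            # bootstrap replicates
coefs = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, len(y), len(y))          # nonparametric bootstrap resample
    coefs[b] = LassoCV(cv=5).fit(X[idx], y[idx]).coef_

lo, hi = np.percentile(coefs, [2.5, 97.5], axis=0) # bootstrap quantile interval
selected = np.flatnonzero((lo > 0) | (hi < 0))     # interval excludes zero
print("selected predictors:", selected)
```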
Genetic Divergence and Chemotype Diversity in the Fusarium Head Blight Pathogen Fusarium poae.
Vanheule, Adriaan; De Boevre, Marthe; Moretti, Antonio; Scauflaire, Jonathan; Munaut, Françoise; De Saeger, Sarah; Bekaert, Boris; Haesaert, Geert; Waalwijk, Cees; van der Lee, Theo; Audenaert, Kris
2017-08-23
Fusarium head blight is a disease caused by a complex of Fusarium species. F. poae is omnipresent throughout Europe in spite of its low virulence. In this study, we assessed a geographically diverse collection of F. poae isolates for its genetic diversity using AFLP (Amplified Fragment Length Polymorphism). Furthermore, studying the mating type locus and chromosomal insertions, we identified hallmarks of both sexual recombination and clonal spread of successful genotypes in the population. Despite the large genetic variation found, all F. poae isolates possess the nivalenol chemotype based on Tri7 sequence analysis. Nevertheless, Tri gene clusters showed two layers of genetic variability. Firstly, the Tri1 locus was highly variable, with mostly synonymous mutations and mutations in introns pointing to a strong purifying selection pressure. Secondly, in a subset of isolates, the main trichothecene gene cluster was invaded by a transposable element between Tri5 and Tri6. To investigate the impact of these variations on the phenotypic chemotype, mycotoxin production was assessed on artificial medium. Complex blends of type A and type B trichothecenes were produced, but neither genetic variability in the Tri genes nor variability in the genome or geography accounted for the divergence in trichothecene production. In view of its complex chemotype, it will be of utmost interest to uncover the role of trichothecenes in virulence, spread and survival of F. poae.
Terrill, Philip I; Wilson, Stephen J; Suresh, Sadasivam; Cooper, David M; Dakin, Carolyn
2012-08-01
Previous work has identified that non-linear variables calculated from respiratory data vary between sleep states, and that variables derived from the non-linear analytical tool recurrence quantification analysis (RQA) are accurate infant sleep state discriminators. This study aims to apply these discriminators to automatically classify 30 s epochs of infant sleep as REM, non-REM and wake. Polysomnograms were obtained from 25 healthy infants at 2 weeks, 3, 6 and 12 months of age, and manually sleep staged as wake, REM and non-REM. Inter-breath interval data were extracted from the respiratory inductive plethysmograph, and RQA applied to calculate radius, determinism and laminarity. Time-series statistic and spectral analysis variables were also calculated. A nested cross-validation method was used to identify the optimal feature subset, and to train and evaluate a linear discriminant analysis-based classifier. The RQA features radius and laminarity were reliably selected. Mean agreement was 79.7, 84.9, 84.0 and 79.2 % at 2 weeks, 3, 6 and 12 months, and the classifier performed better than a comparison classifier not including RQA variables. The performance of this sleep-staging tool compares favourably with inter-human agreement rates, and improves upon previous systems using only respiratory data. Applications include diagnostic screening and population-based sleep research.
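For orientation, a small numpy sketch of the RQA quantities named above (recurrence rate, determinism, laminarity) on a synthetic inter-breath-interval series; this is not the authors' pipeline, and line-counting conventions differ slightly across the RQA literature.

```python
import numpy as np

def _run_points(bits, lmin):
    """Points lying in runs of consecutive recurrences of length >= lmin."""
    total, run = 0, 0
    for b in bits:
        if b:
            run += 1
        else:
            if run >= lmin:
                total += run
            run = 0
    return total + (run if run >= lmin else 0)

def rqa(x, radius, lmin=2):
    n = len(x)
    R = np.abs(x[:, None] - x[None, :]) <= radius       # recurrence matrix
    rec = R.sum() - n                                   # recurrences off the main diagonal
    diag = 2 * sum(_run_points(np.diagonal(R, k), lmin) for k in range(1, n))
    vert = sum(_run_points(R[:, j], lmin) for j in range(n))
    return {"recurrence_rate": rec / (n * n - n),
            "determinism": diag / rec,                  # share of points on diagonal lines
            "laminarity": vert / (rec + n)}             # share on vertical lines

ibi = np.random.default_rng(2).normal(1.0, 0.1, 300)    # stand-in inter-breath intervals
print({k: round(v, 3) for k, v in rqa(ibi, radius=0.05).items()})
```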
Texture analysis of tissues in Gleason grading of prostate cancer
NASA Astrophysics Data System (ADS)
Alexandratou, Eleni; Yova, Dido; Gorpas, Dimitris; Maragos, Petros; Agrogiannis, George; Kavantzas, Nikolaos
2008-02-01
Prostate cancer is a common malignancy among older men and the second leading cause of cancer death in the USA. Histopathological grading of prostate cancer is based on tissue structural abnormalities. The Gleason grading system is the gold standard and is based on the organization features of prostatic glands. Although the Gleason score has contributed to cancer prognosis and treatment planning, its accuracy is about 58%, and it is lower still for GG2, GG3 and GG5 grading. It is also strongly affected by inter- and intra-observer variation, making the whole process very subjective. There is therefore a need for grading tools based on imaging and computer vision techniques for a more accurate prostate cancer prognosis. The aim of this paper is the development of a novel method for objective grading of biopsy specimens in order to support histopathological prognosis of the tumor. This new method is based on texture analysis techniques, in particular the Gray Level Co-occurrence Matrix (GLCM), which estimates image properties related to second-order statistics. Histopathological images of prostate cancer, from Gleason grade 2 to Gleason grade 5, were acquired and subjected to image texture analysis. Thirteen texture characteristics, as proposed by Haralick, were calculated from this matrix. Using stepwise variable selection, a subset of four characteristics was selected and used for the description and classification of each image field. The selected characteristics profile was used for grading the specimen with the multiparameter statistical method of multiple logistic discrimination analysis. The subset of these characteristics provided 87% correct grading of the specimens. The addition of any of the remaining characteristics did not significantly improve the diagnostic ability of the method. This study demonstrated that texture analysis techniques could provide valuable grading decision support to pathologists concerning prostate cancer prognosis.
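To make the GLCM step concrete, a minimal numpy sketch computing a symmetric, normalised co-occurrence matrix for one horizontal offset and four Haralick-style measures; the study used the full set of thirteen characteristics and, presumably, multiple offsets.

```python
import numpy as np

def haralick_subset(img, levels=8):
    """Four GLCM measures from a symmetric, normalised horizontal-offset GLCM."""
    q = np.clip(img.astype(float) / (img.max() + 1e-9) * levels,
                0, levels - 1).astype(int)          # quantize to grey levels
    i, j = q[:, :-1].ravel(), q[:, 1:].ravel()      # horizontally adjacent pixel pairs
    P = np.zeros((levels, levels))
    np.add.at(P, (i, j), 1)                         # accumulate co-occurrence counts
    P = P + P.T                                     # make the matrix symmetric
    P /= P.sum()                                    # normalise to joint probabilities
    r, c = np.indices(P.shape)
    nz = P[P > 0]
    return {"contrast": ((r - c) ** 2 * P).sum(),
            "homogeneity": (P / (1 + np.abs(r - c))).sum(),
            "energy": (P ** 2).sum(),
            "entropy": -(nz * np.log2(nz)).sum()}

rng = np.random.default_rng(3)
print({k: round(float(v), 3)
       for k, v in haralick_subset(rng.integers(0, 256, (64, 64))).items()})
```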
Adoptive therapy with chimeric antigen receptor-modified T cells of defined subset composition.
Riddell, Stanley R; Sommermeyer, Daniel; Berger, Carolina; Liu, Lingfeng Steven; Balakrishnan, Ashwini; Salter, Alex; Hudecek, Michael; Maloney, David G; Turtle, Cameron J
2014-01-01
The ability to engineer T cells to recognize tumor cells through genetic modification with a synthetic chimeric antigen receptor has ushered in a new era in cancer immunotherapy. The most advanced clinical applications are in targeting CD19 on B-cell malignancies. The clinical trials of CD19 chimeric antigen receptor therapy have thus far not attempted to select defined subsets before transduction or imposed uniformity of the CD4 and CD8 cell composition of the cell products. This review will discuss the rationale for and challenges to using adoptive therapy with genetically modified T cells of defined subset and phenotypic composition.
Larrañaga, Ana; Bielza, Concha; Pongrácz, Péter; Faragó, Tamás; Bálint, Anna; Larrañaga, Pedro
2015-03-01
Barking is perhaps the most characteristic form of vocalization in dogs; however, very little is known about its role in the intraspecific communication of this species. Besides the obvious need for ethological research, both in the field and in the laboratory, the possible information content of barks can also be explored by computerized acoustic analyses. This study compares four different supervised learning methods (naive Bayes, classification trees, k-nearest neighbors and logistic regression) combined with three strategies for selecting variables (all variables, filter and wrapper feature subset selections) to classify Mudi dogs by sex, age, context and individual from their barks. The classification accuracy of the models obtained was estimated by means of k-fold cross-validation. Percentages of correct classifications were 85.13 % for determining sex, 80.25 % for predicting age (recodified as young, adult and old), 55.50 % for classifying contexts (seven situations) and 67.63 % for recognizing individuals (8 dogs), so the results are encouraging. The best-performing method was k-nearest neighbors following a wrapper feature selection approach. The results for classifying contexts and recognizing individual dogs were better with this method than they were for other approaches reported in the specialized literature. This is the first time that the sex and age of domestic dogs have been predicted with the help of sound analysis. This study shows that dog barks carry ample information regarding the caller's indexical features. Our computerized analysis provides indirect proof that barks may serve as an important source of information for dogs as well.
Column Subset Selection, Matrix Factorization, and Eigenvalue Optimization
2008-07-01
…Pietsch and Grothendieck, which are regarded as basic instruments in modern functional analysis [Pis86]. The methods for computing these… Pietsch factorization and the maxcut semidefinite program [GW95]. Overview: we focus on the algorithmic version of the Kashin–Tzafriri theorem… will see that the desired subset is exposed by factoring the random submatrix. This factorization, which was invented by Pietsch, is regarded as a basic…
Selection-Fusion Approach for Classification of Datasets with Missing Values
Ghannad-Rezaie, Mostafa; Soltanian-Zadeh, Hamid; Ying, Hao; Dong, Ming
2010-01-01
This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. Then, it combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade off, a numerical criterion is proposed for the prediction of the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (total of eight datasets). Experimental results show that classification accuracy of the proposed method is superior to those of the widely used multiple imputations method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values. PMID:20212921
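A toy sketch of the pattern-based idea, with assumed details throughout: missingness masks are clustered (k-means standing in for the paper's clustering formulation), one logistic-regression classifier is trained per subset on its mostly-observed columns, and a new sample is routed to the classifier whose pattern it matches; the paper additionally combines the classifiers' outputs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(240, 12))
y = (X[:, 0] + X[:, 6] > 0).astype(int)
blocks = [slice(0, 4), slice(4, 8), slice(8, 12)]   # structured missingness patterns
pattern = rng.integers(0, 3, len(X))
for r, p in enumerate(pattern):
    X[r, blocks[p]] = np.nan                        # each sample loses one block

mask = np.isnan(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(mask.astype(float))

models, cols = {}, {}
for g in range(3):
    rows = km.labels_ == g
    keep = mask[rows].mean(axis=0) < 0.5            # columns mostly observed in this subset
    models[g] = LogisticRegression(max_iter=1000).fit(
        np.nan_to_num(X[rows][:, keep]), y[rows])
    cols[g] = keep

x_new = X[0]
g = km.predict(np.isnan(x_new)[None].astype(float))[0]   # route by missingness pattern
p_hat = models[g].predict_proba(np.nan_to_num(x_new[cols[g]])[None])[0, 1]
print(f"routed to subset {g}, P(class 1) = {p_hat:.2f}")
```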
Bliss, Sarah A; Paul, Sunirmal; Pobiarzyn, Piotr W; Ayer, Seda; Sinha, Garima; Pant, Saumya; Hilton, Holly; Sharma, Neha; Cunha, Maria F; Engelberth, Daniel J; Greco, Steven J; Bryan, Margarette; Kucia, Magdalena J; Kakar, Sham S; Ratajczak, Mariusz Z; Rameshwar, Pranela
2018-01-10
This study proposes that a novel developmental hierarchy of breast cancer (BC) cells (BCCs) could predict treatment response and outcome. The continued challenge of treating BC requires stratification of BCCs into distinct subsets. This would provide insights into how BCCs evade treatment and adapt dormancy for decades. We selected three subsets, based on the relative expression of octamer-binding transcription factor 4A (Oct4A), and then analysed each with an Affymetrix gene chip. Oct4A is a stem cell gene and would separate subsets based on maturation. Data analyses and gene validation identified three membrane proteins, TMEM98, GPR64 and FAT4. BCCs from cell lines and blood from BC patients were analysed for these three membrane proteins by flow cytometry, along with known markers of cancer stem cells (CSCs), CD44, CD24 and Oct4, aldehyde dehydrogenase 1 (ALDH1) activity and telomere length. A novel working hierarchy of BCCs was established, with the most immature subset as CSCs. This group was further subdivided into long- and short-term CSCs. Analyses of 20 post-treatment blood samples indicated that circulating CSCs and early BC progenitors may be associated with recurrence or early death. These results suggest that the novel hierarchy may predict treatment response and prognosis.
Morikawa, Masatoshi; Tsujibe, Satoshi; Kiyoshima-Shibata, Junko; Watanabe, Yohei; Kato-Nagaoka, Noriko; Shida, Kan; Matsumoto, Satoshi
2016-01-01
Phagocytes such as dendritic cells and macrophages, which are distributed in the small intestinal mucosa, play a crucial role in maintaining mucosal homeostasis by sampling the luminal gut microbiota. However, there is limited information regarding microbial uptake in a steady state. We investigated the composition of murine gut microbiota that is engulfed by phagocytes of specific subsets in the small intestinal lamina propria (SILP) and Peyer’s patches (PP). Analysis of bacterial 16S rRNA gene amplicon sequences revealed that: 1) all the phagocyte subsets in the SILP primarily engulfed Lactobacillus (the most abundant microbe in the small intestine), whereas CD11bhi and CD11bhiCD11chi cell subsets in PP mostly engulfed segmented filamentous bacteria (indigenous bacteria in rodents that are reported to adhere to intestinal epithelial cells); and 2) among the Lactobacillus species engulfed by the SILP cell subsets, L. murinus was engulfed more frequently than L. taiwanensis, although both these Lactobacillus species were abundant in the small intestine under physiological conditions. These results suggest that small intestinal microbiota is selectively engulfed by phagocytes that localize in the adjacent intestinal mucosa in a steady state. These observations may provide insight into the crucial role of phagocytes in immune surveillance of the small intestinal mucosa. PMID:27701454
Standardizing Flow Cytometry Immunophenotyping Analysis from the Human ImmunoPhenotyping Consortium
Finak, Greg; Langweiler, Marc; Jaimes, Maria; Malek, Mehrnoush; Taghiyar, Jafar; Korin, Yael; Raddassi, Khadir; Devine, Lesley; Obermoser, Gerlinde; Pekalski, Marcin L.; Pontikos, Nikolas; Diaz, Alain; Heck, Susanne; Villanova, Federica; Terrazzini, Nadia; Kern, Florian; Qian, Yu; Stanton, Rick; Wang, Kui; Brandes, Aaron; Ramey, John; Aghaeepour, Nima; Mosmann, Tim; Scheuermann, Richard H.; Reed, Elaine; Palucka, Karolina; Pascual, Virginia; Blomberg, Bonnie B.; Nestle, Frank; Nussenblatt, Robert B.; Brinkman, Ryan Remy; Gottardo, Raphael; Maecker, Holden; McCoy, J Philip
2016-01-01
Standardization of immunophenotyping requires careful attention to reagents, sample handling, instrument setup, and data analysis, and is essential for successful cross-study and cross-center comparison of data. Experts developed five standardized, eight-color panels for identification of major immune cell subsets in peripheral blood. These were produced as pre-configured, lyophilized, reagents in 96-well plates. We present the results of a coordinated analysis of samples across nine laboratories using these panels with standardized operating procedures (SOPs). Manual gating was performed by each site and by a central site. Automated gating algorithms were developed and tested by the FlowCAP consortium. Centralized manual gating can reduce cross-center variability, and we sought to determine whether automated methods could streamline and standardize the analysis. Within-site variability was low in all experiments, but cross-site variability was lower when central analysis was performed in comparison with site-specific analysis. It was also lower for clearly defined cell subsets than those based on dim markers and for rare populations. Automated gating was able to match the performance of central manual analysis for all tested panels, exhibiting little to no bias and comparable variability. Standardized staining, data collection, and automated gating can increase power, reduce variability, and streamline analysis for immunophenotyping. PMID:26861911
Remote Sensing Information Gateway
Remote Sensing Information Gateway, a tool that allows scientists, researchers and decision makers to access a variety of multi-terabyte, environmental datasets and to subset the data and obtain only needed variables, greatly improving the download time.
Simulation Of Research And Development Projects
NASA Technical Reports Server (NTRS)
Miles, Ralph F.
1987-01-01
Measures of preference for alternative project plans calculated. Simulation of Research and Development Projects (SIMRAND) program aids in optimal allocation of research and development resources needed to achieve project goals. Models system subsets or project tasks as various network paths to final goal. Each path described in terms of such task variables as cost per hour, cost per unit, and availability of resources. Uncertainty incorporated by treating task variables as probabilistic random variables. Written in Microsoft FORTRAN 77.
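In the SIMRAND spirit (toy numbers, not the program itself), a short Monte Carlo sketch: each candidate path to the goal is a chain of tasks with random costs, and paths are compared on mean cost, risk of overrunning a budget, and an exponential (risk-averse) utility.

```python
import numpy as np

rng = np.random.default_rng(5)

# two alternative task networks (paths) to the same goal; (mean, std) cost per task
paths = {"path_A": [(100, 20), (250, 60)],
         "path_B": [(180, 10), (120, 90), (40, 5)]}

N = 100_000
for name, tasks in paths.items():
    total = sum(rng.normal(m, s, N) for m, s in tasks)   # sampled total path cost
    eu = -np.mean(np.exp(total / 500))                   # exponential risk-averse utility
    print(f"{name}: mean={total.mean():7.1f}  "
          f"P(cost>400)={np.mean(total > 400):.3f}  EU={eu:.3f}")
```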
The continuum spectral characteristics of gamma-ray bursts observed by BATSE
NASA Technical Reports Server (NTRS)
Pendleton, Geoffrey N.; Paciesas, William S.; Briggs, Michael S.; Mallozzi, Robert S.; Koshut, Tom M.; Fishman, Gerald J.; Meegan, Charles A.; Wilson, Robert B.; Harmon, Alan B.; Kouveliotou, Chryssa
1994-01-01
Distributions of the continuum spectral characteristics of 260 bursts in the first Burst And Transient Source Experiment (BATSE) catalog are presented. The data are derived from fluxes calculated from BATSE Large Area Detector (LAD) four-channel discriminator data. The data are converted from counts to photons using a direct spectral inversion technique to remove the effects of atmospheric scattering and the energy dependence of the detector angular response. Although there are intriguing clusters of bursts in the spectral hardness ratio distributions, no evidence for the presence of distinct burst classes based on spectral hardness ratios alone is found. All subsets of bursts selected for their spectral characteristics in this analysis exhibit spatial distributions consistent with isotropy. The spectral diversity of the burst population appears to be caused largely by the highly variable nature of the burst production mechanisms themselves.
Barbet, Anthony F.; Al-Khedery, Basima; Stuen, Snorre; Granquist, Erik G.; Felsheim, Roderick F.; Munderloh, Ulrike G.
2013-01-01
The prevalence of tick-borne diseases is increasing worldwide. One such emerging disease is human anaplasmosis. The causative organism, Anaplasma phagocytophilum, is known to infect multiple animal species and cause human fatalities in the U.S., Europe and Asia. Although long known to infect ruminants, it is unclear why there are increasing numbers of human infections. We analyzed the genome sequences of strains infecting humans, animals and ticks from diverse geographic locations. Despite extensive variability amongst these strains, those infecting humans had conserved genome structure including the pfam01617 superfamily that encodes the major, neutralization-sensitive, surface antigen. These data provide potential targets to identify human-infective strains and have significance for understanding the selective pressures that lead to emergence of disease in new species. PMID:25437207
Climate change, extreme weather events, and us health impacts: what can we say?
Mills, David M
2009-01-01
Address how climate change impacts on a group of extreme weather events could affect US public health. A literature review summarizes arguments for, and evidence of, a climate change signal in select extreme weather event categories, projections for future events, and potential trends in adaptive capacity and vulnerability in the United States. Western US wildfires already exhibit a climate change signal. The variability within hurricane and extreme precipitation/flood data complicates identifying a similar climate change signal. Health impacts of extreme events are not equally distributed and are very sensitive to a subset of exceptional extreme events. Cumulative uncertainty in forecasting climate change driven characteristics of extreme events and adaptation prevents confidently projecting the future health impacts from hurricanes, wildfires, and extreme precipitation/floods in the United States attributable to climate change.
Evaluation of trends in wheat yield models
NASA Technical Reports Server (NTRS)
Ferguson, M. C.
1982-01-01
Trend terms in models for wheat yield in the U.S. Great Plains for the years 1932 to 1976 are evaluated. The subset of meteorological variables yielding the largest adjusted R² is selected using the method of leaps and bounds. Latent root regression is used to eliminate multicollinearities, and generalized ridge regression is used to introduce bias to provide stability in the data matrix. The regression model used provides for two trends in each of two models: a dependent model in which the trend line is piece-wise continuous, and an independent model in which the trend line is discontinuous at the year of the slope change. It was found that the trend lines best describing the wheat yields consisted of combinations of increasing, decreasing, and constant trend: four combinations for the dependent model and seven for the independent model.
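The two trend forms translate directly into least-squares design matrices; a sketch on synthetic yields, with the slope-change year assumed known (the study presumably estimated or scanned it).

```python
import numpy as np

rng = np.random.default_rng(6)
years = np.arange(1932, 1977)
yields = (10 + 0.05 * (years - 1932) + 0.15 * np.maximum(years - 1955, 0)
          + rng.normal(0, 0.5, years.size))          # synthetic wheat yields

t0 = 1955                                            # assumed slope-change year
t = (years - years[0]).astype(float)
hinge = np.maximum(years - t0, 0).astype(float)

# dependent model: piecewise linear trend, continuous at t0
Xd = np.column_stack([np.ones_like(t), t, hinge])
beta_d, *_ = np.linalg.lstsq(Xd, yields, rcond=None)

# independent model: level and slope may both jump at t0
step = (years >= t0).astype(float)
Xi = np.column_stack([np.ones_like(t), t, step, step * (years - t0)])
beta_i, *_ = np.linalg.lstsq(Xi, yields, rcond=None)

print("continuous trend:   ", beta_d.round(3))
print("discontinuous trend:", beta_i.round(3))
```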
Borai, Anwar; Livingstone, Callum; Alsobhi, Enaam; Al Sofyani, Abeer; Balgoon, Dalal; Farzal, Anwar; Almohammadi, Mohammed; Al-Amri, Abdulafattah; Bahijri, Suhad; Alrowaili, Daad; Bassiuni, Wafaa; Saleh, Ayman; Alrowaili, Norah; Abdelaal, Mohamed
2017-04-01
Whole blood donation has immunomodulatory effects, and most of these have been observed at short intervals following blood donation. This study aimed to investigate the impact of whole blood donation on lymphocyte subsets over a typical inter-donation interval. Healthy male subjects were recruited to study changes in complete blood count (CBC) (n = 42) and lymphocyte subsets (n = 16) before and at four intervals up to 106 days following blood donation. Repeated measures ANOVA were used to compare quantitative variables between different visits. Following blood donation, changes in CBC and erythropoietin were as expected. The neutrophil count increased by 11.3% at 8 days (p < .001). Novel changes were observed in lymphocyte subsets as the CD4/CD8 ratio increased by 9.2% (p < .05) at 8 days and 13.7% (p < .05) at 22 days. CD16-56 cells decreased by 16.2% (p < .05) at 8 days. All the subsets had returned to baseline by 106 days. Regression analysis showed that the changes in CD16-56 cells and CD4/CD8 ratio were not significant (Wilk's lambda = 0.15 and 0.94, respectively) when adjusted for BMI. In conclusion, following whole blood donation, there are transient changes in lymphocyte subsets. The effect of BMI on lymphocyte subsets and the effect of this immunomodulation on the immune response merit further investigation.
Mosayebi, Ghasem; Rizgar, Mageed; Gharagozloo, Soheila; Shokri, Fazel
2007-01-01
High levels of rheumatoid factors (RF) are detectable in serum of the majority of patients with rheumatoid arthritis (RA), but 5-10% of patients remain seronegative (SN). Despite clinical and genetic similarities between these two subsets of RA, it has been proposed that they may be regarded as distinct clinical entities. In the present study, a panel of monoclonal antibodies (mAb) recognizing RF-associated cross-reactive idiotypes (CRI) linked to VH1 (G8), VH4 (LC1) and VK3b (17-109), together with a mAb recognizing the VK3 subgroup (C7) of immunoglobulin variable region (IgV) gene products, was used to quantitate the level of expression of these gene products in serum and synovial fluid of 35 seropositive (SP) and 8 SN RA patients by capture ELISA. The concentrations and relative proportions of the IgV products recognized by the mAb G8, 17-109 and C7 were significantly higher in serum and synovial fluid of the SP-RA patients compared with the SN-RA patients (G8, p = 0.009; 17-109, p = 0.0001; C7, p = 0.001). The CRI recognized by the mAb LC1 was highly represented in serum and synovial fluid of the SN-RA patients. There were no significant differences in the level of expression of these IgV gene products (other than the product recognized by the C7 mAb in SP patients) between serum and synovial fluid in either group of patients. Our results suggest that the expressed repertoire of Ig VH and VK genes in these two subsets of RA is differentially regulated and may be influenced by selective mechanisms leading to positive or negative selection of certain genes.
Parameterizing the Variability and Uncertainty of Wind and Solar in CEMs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Frew, Bethany
We present current and improved methods for estimating the capacity value and curtailment impacts from variable generation (VG) in capacity expansion models (CEMs). The ideal calculation of these variability metrics is through an explicit co-optimized investment-dispatch model using multiple years of VG and load data. Because of data and computational limitations, existing CEMs typically approximate these metrics using a subset of all hours from a single year and/or using statistical methods, which often do not capture the tail-event impacts or the broader set of interactions between VG, storage, and conventional generators. In our proposed new methods, we use hourly generation and load values across all hours of the year to characterize (1) the contribution of VG to system capacity during high load hours, (2) the curtailment level of VG, and (3) the reduction in VG curtailment due to storage and shutdown of select thermal generators. Using CEM model outputs from a preceding model solve period, we apply these methods to exogenously calculate capacity value and curtailment metrics for the subsequent model solve period. Preliminary results suggest that these hourly methods offer improved capacity value and curtailment representations of VG in the CEM over existing approximation methods without additional computational burdens.
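A stylized sketch of the hourly bookkeeping with invented numbers throughout: capacity value as average VG output (relative to nameplate) over the highest-load hours of the year, and curtailment as VG energy in excess of load net of an assumed must-run fleet.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(8760)                                 # every hour of the year
load = 800 + 200 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 50, t.size)
vg = np.clip(600 * np.sin(2 * np.pi * t / 24 - 1.0) + rng.normal(0, 30, t.size),
             0, None)                               # hourly wind+solar output (MW)

nameplate = 600.0
top = np.argsort(load)[-100:]                       # 100 highest-load hours
capacity_value = vg[top].mean() / nameplate

must_run = 200.0                                    # assumed inflexible thermal floor
surplus = np.clip(vg - (load - must_run), 0, None)  # VG that cannot be absorbed
curtailment = surplus.sum() / vg.sum()

print(f"capacity value ~ {capacity_value:.2f}, curtailment ~ {curtailment:.1%}")
```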
Nanolaminate microfluidic device for mobility selection of particles
Surh, Michael P [Livermore, CA; Wilson, William D [Pleasanton, CA; Barbee, Jr., Troy W.; Lane, Stephen M [Oakland, CA
2006-10-10
A microfluidic device made from nanolaminate materials that is capable of electrophoretic selection of particles on the basis of their mobility. Nanolaminate materials are generally alternating layers of two materials (one conducting, one insulating) made by sputter-coating a flat substrate with a large number of layers. Specific subsets of the conducting layers are coupled together to form a single, extended electrode, interleaved with other similar electrodes. Thereby, the subsets of conducting layers may be dynamically charged to create time-dependent potential fields that can trap or transport charged colloidal particles. The addition of time dependence is applicable to all geometries of nanolaminate electrophoretic and electrochemical designs, from sinusoidal to nearly step-like.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gomez, L.S.; Marietta, M.G.; Jackson, D.W.
1987-04-01
The US Subseabed Disposal Program has compiled an extensive concentration factor and biological half-life data base from the international marine radioecological literature. A microcomputer-based data management system has been implemented to provide statistical and graphic summaries of these data. The data base is constructed in a manner which allows subsets to be sorted using a number of interstudy variables such as organism category, tissue/organ category, geographic location (for in situ studies), and several laboratory-related conditions (e.g., exposure time and exposure concentration). This report updates earlier reviews and provides summaries of the tabulated data. In addition to the concentration factor/biological half-life data base, we provide an outline of other published marine radioecological works. Our goal is to present these data in a form that enables those concerned with predictive assessment of radiation dose in the marine environment to make a more judicious selection of data for a given application. 555 refs., 19 figs., 7 tabs.
Selecting sequence variants to improve genomic predictions for dairy cattle
USDA-ARS?s Scientific Manuscript database
Millions of genetic variants have been identified by population-scale sequencing projects, but subsets are needed for routine genomic predictions or to include on genotyping arrays. Methods of selecting sequence variants were compared using both simulated sequence genotypes and actual data from run ...
Classification of urine sediment based on convolution neural network
NASA Astrophysics Data System (ADS)
Pan, Jingjing; Jiang, Cunbo; Zhu, Tiantian
2018-04-01
By designing a new convolution neural network framework, this paper removes the constraints of the original framework, which requires large training samples of identical size. The input images are shifted and cropped to generate sub-graphs of equal size, and dropout is then applied to the generated sub-graphs, increasing sample diversity and helping to prevent overfitting. Proper subsets are randomly drawn from the sub-graph set such that every subset contains the same number of elements and no two subsets are identical. These proper subsets serve as input layers for the convolution neural network. Through the convolution layers, pooling, the fully connected layer and the output layer, the classification loss rates on the test and training sets are obtained. In an experiment classifying red blood cells, white blood cells and calcium oxalate crystals, the classification accuracy reached 97% or more.
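A numpy sketch of the data-handling steps described (window cropping to equalize sample size, then drawing equal-sized, mutually distinct proper subsets); the network itself and the dropout rate are omitted, and all sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(8)

def crop_subgraphs(img, size=32, stride=16):
    """Slide a fixed window over an arbitrarily sized image so that every
    sample fed to the network has the same shape."""
    h, w = img.shape
    return np.array([img[r:r + size, c:c + size]
                     for r in range(0, h - size + 1, stride)
                     for c in range(0, w - size + 1, stride)])

def sample_proper_subsets(crops, n_subsets=5, frac=0.6):
    """Equal-sized, mutually distinct proper subsets of the crops, raising
    sample diversity (the paper additionally applies dropout to the crops)."""
    k = max(1, int(frac * len(crops)))
    seen, subsets = set(), []
    while len(subsets) < n_subsets:
        idx = tuple(sorted(rng.choice(len(crops), size=k, replace=False)))
        if idx not in seen:                          # no two subsets identical
            seen.add(idx)
            subsets.append(crops[list(idx)])
    return subsets

cell_image = rng.random((96, 96))                    # stand-in urine-sediment image
batches = sample_proper_subsets(crop_subgraphs(cell_image))
print(len(batches), "subsets of shape", batches[0].shape)
```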
Eradication of melanomas by targeted elimination of a minor subset of tumor cells
Schmidt, Patrick; Kopecky, Caroline; Hombach, Andreas; Zigrino, Paola; Mauch, Cornelia; Abken, Hinrich
2011-01-01
Proceeding on the assumption that all cancer cells have equal malignant capacities, current regimens in cancer therapy attempt to eradicate all malignant cells of a tumor lesion. Using in vivo targeting of tumor cell subsets, we demonstrate that selective elimination of a definite, minor tumor cell subpopulation is particularly effective in eradicating established melanoma lesions irrespective of the bulk of cancer cells. Tumor cell subsets were specifically eliminated in a tumor lesion by adoptive transfer of engineered cytotoxic T cells redirected in an antigen-restricted manner via a chimeric antigen receptor. Targeted elimination of less than 2% of the tumor cells that coexpress high molecular weight melanoma-associated antigen (HMW-MAA) (melanoma-associated chondroitin sulfate proteoglycan, MCSP) and CD20 lastingly eradicated melanoma lesions, whereas targeting of any random 10% tumor cell subset was not effective. Our data challenge the biological therapy and current drug development paradigms in the treatment of cancer. PMID:21282657
Echocardiographic agreement in the diagnostic evaluation for infective endocarditis.
Lauridsen, Trine Kiilerich; Selton-Suty, Christine; Tong, Steven; Afonso, Luis; Cecchi, Enrico; Park, Lawrence; Yow, Eric; Barnhart, Huiman X; Paré, Carlos; Samad, Zainab; Levine, Donald; Peterson, Gail; Stancoven, Amy Butler; Johansson, Magnus Carl; Dickerman, Stuart; Tamin, Syahidah; Habib, Gilbert; Douglas, Pamela S; Bruun, Niels Eske; Crowley, Anna Lisa
2016-07-01
Echocardiography is essential for the diagnosis and management of infective endocarditis (IE). However, the reproducibility for the echocardiographic assessment of variables relevant to IE is unknown. Objectives of this study were: (1) to define the reproducibility for IE echocardiographic variables and (2) to describe a methodology for assessing quality in an observational cohort containing site-interpreted data. IE reproducibility was assessed on a subset of echocardiograms from subjects enrolled in the International Collaboration on Endocarditis registry. Specific echocardiographic case report forms were used. Intra-observer agreement was assessed from six site readers on ten randomly selected echocardiograms. Inter-observer agreement between sites and an echocardiography core laboratory was assessed on a separate random sample of 110 echocardiograms. Agreement was determined using intraclass correlation (ICC), coverage probability (CP), and limits of agreement for continuous variables and kappa statistics (κ_weighted) and CP for categorical variables. Intra-observer agreement for LVEF was excellent [ICC = 0.93 ± 0.1 and all pairwise differences for LVEF (CP) were within 10 %]. For IE categorical echocardiographic variables, intra-observer agreement was best for aortic abscess (κ_weighted = 1.0, CP = 1.0 for all readers). Highest inter-observer agreement for IE categorical echocardiographic variables was obtained for vegetation location (κ_weighted = 0.95; 95 % CI 0.92-0.99) and lowest agreement was found for vegetation mobility (κ_weighted = 0.69; 95 % CI 0.62-0.86). Moderate to excellent intra- and inter-observer agreement is observed for echocardiographic variables in the diagnostic assessment of IE. A pragmatic approach for determining echocardiographic data reproducibility in a large, multicentre, site interpreted observational cohort is feasible.
Use of ancillary data to improve the analysis of forest health indicators
Dave Gartner
2013-01-01
In addition to its standard suite of mensuration variables, the Forest Inventory and Analysis (FIA) program of the U.S. Forest Service also collects data on forest health variables formerly measured by the Forest Health Monitoring program. FIA obtains forest health information on a subset of the base sample plots. Due to the sample size differences, the two sets of...
Vescovini, Rosanna; Fagnoni, Francesco Fausto; Telera, Anna Rita; Bucci, Laura; Pedrazzoni, Mario; Magalini, Francesca; Stella, Adriano; Pasin, Federico; Medici, Maria Cristina; Calderaro, Adriana; Volpi, Riccardo; Monti, Daniela; Franceschi, Claudio; Nikolich-Žugich, Janko; Sansoni, Paolo
2014-04-01
Alterations in the circulating CD8+ T cell pool, with a loss of naïve and accumulation of effector/effector memory cells, are pronounced in older adults. However, homeostatic forces that dictate such changes remain incompletely understood. This observational cross-sectional study explored the basis for variability of CD8+ T cell number and composition of its main subsets: naïve, central memory and effector memory T cells, in 131 cytomegalovirus (CMV) seropositive subjects aged over 60 years. We found great heterogeneity of CD8+ T cell numbers, which was mainly due to variability of the CD8+CD28- T cell subset regardless of age. Analysis, by multiple regression, of distinct factors revealed that age was a predictor for the loss in absolute number of naïve T cells, but was not associated with changes in central or effector memory CD8+ T cell subsets. By contrast, the size of CD8+ T cells specific to pp65 and IE-1 antigens of CMV predicted CD28-CD8+ T cell, antigen-experienced CD8+ T cell, and even total CD8+ T cell numbers, but not naïve CD8+ T cell loss. These results indicate a clear dichotomy between the homeostasis of naïve and antigen-experienced subsets of CD8+ T cells, which are independently affected, in later human life, by age and antigen-specific responses to CMV, respectively.
Chemical library subset selection algorithms: a unified derivation using spatial statistics.
Hamprecht, Fred A; Thiel, Walter; van Gunsteren, Wilfred F
2002-01-01
If similar compounds have similar activity, rational subset selection becomes superior to random selection in screening for pharmacological lead discovery programs. Traditional approaches to this experimental design problem fall into two classes: (i) a linear or quadratic response function is assumed; (ii) some space-filling criterion is optimized. The assumptions underlying the first approach are clear but not always defendable; the second approach yields more intuitive designs but lacks a clear theoretical foundation. We model activity in a bioassay as realization of a stochastic process and use the best linear unbiased estimator to construct spatial sampling designs that optimize the integrated mean square prediction error, the maximum mean square prediction error, or the entropy. We argue that our approach constitutes a unifying framework encompassing most proposed techniques as limiting cases and sheds light on their underlying assumptions. In particular, vector quantization is obtained, in dimensions up to eight, in the limiting case of very smooth response surfaces for the integrated mean square error criterion. Closest packing is obtained for very rough surfaces under the integrated mean square error and entropy criteria. We suggest using either the integrated mean square prediction error or the entropy as optimization criteria rather than approximations thereof, and propose a scheme for direct iterative minimization of the integrated mean square prediction error. Finally, we discuss how the quality of chemical descriptors manifests itself and clarify the assumptions underlying the selection of diverse or representative subsets.
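A compact sketch of the integrated mean square prediction error criterion under an assumed Gaussian-process (kriging) model with a unit-variance squared-exponential kernel: candidates are added greedily, each step picking the compound whose inclusion most reduces the average prediction variance over the descriptor space. This is simple greedy selection, not the direct iterative minimization scheme the paper proposes.

```python
import numpy as np

rng = np.random.default_rng(9)

def kernel(A, B, ls=0.3):
    """Squared-exponential covariance between two point sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

cand = rng.random((200, 2))        # candidate compounds in a 2-D descriptor space
grid = rng.random((500, 2))        # integration points approximating the space

def imse(design):
    """Average kriging variance over the grid for a given design."""
    K = kernel(design, design) + 1e-8 * np.eye(len(design))
    k = kernel(design, grid)
    var = 1.0 - np.einsum('ij,ij->j', k, np.linalg.solve(K, k))
    return var.mean()

chosen = []
for _ in range(10):                # greedy: add the point that most reduces IMSE
    best = min((i for i in range(len(cand)) if i not in chosen),
               key=lambda i: imse(cand[chosen + [i]]))
    chosen.append(best)
print("selected subset:", chosen)
```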
Niu, Xiaoping; Qi, Jianmin; Zhang, Gaoyang; Xu, Jiantang; Tao, Aifen; Fang, Pingping; Su, Jianguang
2015-01-01
To accurately measure gene expression using quantitative reverse transcription PCR (qRT-PCR), reliable reference gene(s) are required for data normalization. Corchorus capsularis, an annual herbaceous fiber crop with predominant biodegradability and renewability, has not been investigated for the stability of reference genes with qRT-PCR. In this study, 11 candidate reference genes were selected and their expression levels were assessed using qRT-PCR. To account for the influence of experimental approach and tissue type, 22 different jute samples were selected from abiotic and biotic stress conditions as well as three different tissue types. The stability of the candidate reference genes was evaluated using geNorm, NormFinder, and BestKeeper programs, and the comprehensive rankings of gene stability were generated by aggregate analysis. For the biotic stress and NaCl stress subsets, ACT7 and RAN were suitable as stable reference genes for gene expression normalization. For the PEG stress subset, UBC, and DnaJ were sufficient for accurate normalization. For the tissues subset, four reference genes TUBβ, UBI, EF1α, and RAN were sufficient for accurate normalization. The selected genes were further validated by comparing expression profiles of WRKY15 in various samples, and two stable reference genes were recommended for accurate normalization of qRT-PCR data. Our results provide researchers with appropriate reference genes for qRT-PCR in C. capsularis, and will facilitate gene expression study under these conditions. PMID:26528312
Tan, Chao; Chen, Hui; Wang, Chao; Zhu, Wanping; Wu, Tong; Diao, Yuanbo
2013-03-15
Near and mid-infrared (NIR/MIR) spectroscopy techniques have gained great acceptance in industry due to their multiple applications and versatility. However, successful application often depends heavily on the construction of accurate and stable calibration models. For this purpose, a simple multi-model fusion strategy is proposed. It combines the Kohonen self-organizing map (KSOM), mutual information (MI) and partial least squares (PLS), and is therefore named KMICPLS. It works as follows: first, the original training set is fed into a KSOM for unsupervised clustering of samples, from which a series of training subsets is constructed. Thereafter, on each training subset, an MI spectrum is calculated and only the variables with MI values above the mean are retained, and a candidate PLS model is built on them. Finally, a fixed number of PLS models is selected to produce a consensus model. Two NIR/MIR spectral datasets from the brewing industry are used for the experiments. The results confirm its superior performance against two reference algorithms, the conventional PLS and genetic algorithm-PLS (GAPLS). It builds more accurate and stable calibration models without increasing complexity, and can be generalized to other NIR/MIR applications.
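A scaled-down sketch of the fusion recipe on synthetic spectra, with k-means openly standing in for the Kohonen map and scikit-learn supplying MI and PLS; each training subset keeps only variables above its mean MI and contributes one PLS member to the consensus.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(10)
X = rng.normal(size=(120, 80))                       # stand-in NIR spectra
y = X[:, 10] - 0.5 * X[:, 40] + rng.normal(0, 0.1, 120)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

members = []
for g in range(4):                                   # one candidate model per subset
    Xg, yg = X[labels == g], y[labels == g]
    if len(yg) < 15:                                 # skip subsets too small to fit
        continue
    mi = mutual_info_regression(Xg, yg, random_state=0)
    keep = mi > mi.mean()                            # retain variables above mean MI
    members.append((keep, PLSRegression(n_components=3).fit(Xg[:, keep], yg)))

pred = np.mean([m.predict(X[:, keep]).ravel()        # consensus = average of members
                for keep, m in members], axis=0)
print("corr(consensus, y) =", np.corrcoef(pred, y)[0, 1].round(3))
```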
Karayianni, Katerina N; Grimaldi, Keith A; Nikita, Konstantina S; Valavanis, Ioannis K
2015-01-01
This paper aims to shed light on the complex etiology of obesity by analysing data from a large nutrigenetics study, in which nutritional and genetic factors associated with obesity were recorded for around two thousand individuals. In our previous work, these data were analysed using artificial neural network methods, which identified optimised subsets of factors for predicting obesity status. Those methods did not, however, reveal how the selected factors interact with each other in the resulting predictive models. For that reason, parallel Multifactor Dimensionality Reduction (pMDR) was used here to further analyse the pre-selected subsets of nutrigenetic factors. Within pMDR, predictive models using up to eight factors were constructed, further reducing the input dimensionality, and rules describing the interactive effects of the selected factors were derived. In this way, it was possible to identify specific genetic variations and their interactive effects with particular nutritional factors, which are now under further study.
Optimisation algorithms for ECG data compression.
Haugland, D; Heber, J G; Husøy, J H
1997-07-01
The use of exact optimisation algorithms for compressing digital electrocardiograms (ECGs) is demonstrated. As opposed to traditional time-domain methods, which use heuristics to select a small subset of representative signal samples, the problem of selecting the subset is formulated in rigorous mathematical terms. This approach makes it possible to derive algorithms guaranteeing the smallest possible reconstruction error when a bounded selection of signal samples is interpolated. The proposed model resembles well-known network models and is solved by a cubic dynamic programming algorithm. When applied to standard test problems, the algorithm produces a compressed representation for which the distortion is about one-half of that obtained by traditional time-domain compression techniques at reasonable compression ratios. This illustrates that, in terms of the accuracy of decoded signals, existing time-domain heuristics for ECG compression may be far from what is theoretically achievable. The paper is an attempt to bridge this gap.
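The optimisation can be written down in a few lines: with err(i, j) the squared error of linearly interpolating the samples strictly between kept samples i and j, a dynamic programme picks the best n_keep samples. A cubic-time numpy sketch follows; the paper's network formulation is equivalent in spirit but not reproduced here.

```python
import numpy as np

def compress(signal, n_keep):
    """Pick n_keep samples (first and last included) minimising the total
    squared error of linear interpolation -- a cubic dynamic programme."""
    n = len(signal)
    err = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 2, n):
            t = np.arange(i + 1, j)
            interp = signal[i] + (signal[j] - signal[i]) * (t - i) / (j - i)
            err[i, j] = ((signal[t] - interp) ** 2).sum()

    D = np.full((n_keep, n), np.inf)
    choice = np.zeros((n_keep, n), dtype=int)
    D[0, 0] = 0.0                                  # the first sample is always kept
    for k in range(1, n_keep):
        for j in range(1, n):
            costs = D[k - 1, :j] + err[:j, j]
            i = int(np.argmin(costs))
            D[k, j], choice[k, j] = costs[i], i

    idx, j = [n - 1], n - 1                        # backtrack from the last sample
    for k in range(n_keep - 1, 0, -1):
        j = choice[k, j]
        idx.append(j)
    return np.array(idx[::-1]), D[n_keep - 1, n - 1]

rng = np.random.default_rng(11)
ecg = np.sin(np.linspace(0, 6 * np.pi, 120)) + 0.05 * rng.normal(size=120)
kept, sse = compress(ecg, n_keep=20)
print(kept, round(float(sse), 4))
```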
NASA Astrophysics Data System (ADS)
Alvarez, Diego A.; Uribe, Felipe; Hurtado, Jorge E.
2018-02-01
Random set theory is a general framework which comprises uncertainty in the form of probability boxes, possibility distributions, cumulative distribution functions, Dempster-Shafer structures or intervals; in addition, the dependence between the input variables can be expressed using copulas. In this paper, the lower and upper bounds on the probability of failure are calculated by means of random set theory. In order to accelerate the calculation, a well-known and efficient probability-based reliability method known as subset simulation is employed. This method is especially useful for finding small failure probabilities in both low- and high-dimensional spaces, disjoint failure domains and nonlinear limit state functions. The proposed methodology represents a drastic reduction of the computational labor implied by plain Monte Carlo simulation for problems defined with a mixture of representations for the input variables, while delivering similar results. Numerical examples illustrate the efficiency of the proposed approach.
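The random-set bounding layer (propagating interval or p-box inputs to lower and upper failure probabilities) is too long to sketch here, but the subset simulation engine fits in a few lines. A toy version with a Gaussian limit state and a single crude Metropolis pass per level; production implementations run full Markov chains per level.

```python
import numpy as np

rng = np.random.default_rng(13)

def g(u):
    """Limit state: failure when g(u) <= 0 (toy rare event, pf ~ 3.4e-6)."""
    return 4.5 - (u[:, 0] + u[:, 1]) / np.sqrt(2.0)

def subset_simulation(n=2000, p0=0.1, max_levels=12):
    u = rng.normal(size=(n, 2))          # level 0: plain Monte Carlo
    pf = 1.0
    for _ in range(max_levels):
        vals = g(u)
        thresh = np.quantile(vals, p0)   # intermediate failure threshold
        if thresh <= 0:
            return pf * np.mean(vals <= 0)
        pf *= p0
        seeds = u[vals <= thresh]
        u = seeds[rng.integers(len(seeds), size=n)]        # repopulate from seeds
        prop = u + 0.8 * rng.normal(size=u.shape)          # random-walk proposal
        ratio = np.exp(-(prop ** 2 - u ** 2).sum(1) / 2)   # N(0, I) density ratio
        ok = (rng.random(n) < ratio) & (g(prop) <= thresh) # stay in the failure set
        u[ok] = prop[ok]
    return pf

print(f"estimated pf ~ {subset_simulation():.2e}")
```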
2011-06-01
…implementing, and evaluating many feature selection algorithms. Mucciardi and Gose compared seven different techniques for choosing subsets of pattern… [Reference: A. Mucciardi and E. Gose, "A comparison of seven techniques…"]
Wong, J T; Pinto, C E; Gifford, J D; Kurnick, J T; Kradin, R L
1989-11-15
To study the CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) in the antitumor response, we propagated these subsets directly from tumor tissues with anti-CD3:anti-CD8 (CD3,8) and anti-CD3:anti-CD4 (CD3,4) bispecific mAb (BSMAB). CD3,8 BSMAB cause selective cytolysis of CD8+ lymphocytes by bridging the CD8 molecules of target lymphocytes to the CD3 molecular complex of cytolytic T lymphocytes with concurrent activation and proliferation of residual CD3+CD4+ T lymphocytes. Similarly, CD3,4 BSMAB cause selective lysis of CD4+ lymphocytes whereas concurrently activating the residual CD3+CD8+ T cells. Small tumor fragments from four malignant melanoma and three renal cell carcinoma patients were cultured in medium containing CD3,8 + IL-2, CD3,4 + IL-2, or IL-2 alone. CD3,8 led to selective propagation of the CD4+ TIL whereas CD3,4 led to selective propagation of the CD8+ TIL from each of the tumors. The phenotypes of the TIL subset cultures were generally stable when assayed over a 1 to 3 months period and after further expansion with anti-CD3 mAb or lectins. Specific 51Cr release of labeled target cells that were bridged to the CD3 molecular complexes of TIL suggested that both CD4+ and CD8+ TIL cultures have the capacity of mediating cytolysis via their Ti/CD3 TCR complexes. In addition, both CD4+ and CD8+ TIL cultures from most patients caused substantial (greater than 20%) lysis of the NK-sensitive K562 cell line. The majority of CD4+ but not CD8+ TIL cultures also produced substantial lysis of the NK-resistant Daudi cell line. Lysis of the autologous tumor by the TIL subsets was assessed in two patients with malignant melanoma. The CD8+ TIL from one tumor demonstrated cytotoxic activity against the autologous tumor but negligible lysis of allogeneic melanoma targets. In conclusion, immunocompetent CD4+ and CD8+ TIL subsets can be isolated and expanded directly from small tumor fragments of malignant melanoma and renal cell carcinoma using BSMAB. The resultant TIL subsets can be further expanded for detailed studies or for adoptive immunotherapy.
Wen, Xiaotong; Rangarajan, Govindan; Ding, Mingzhou
2013-01-01
Granger causality is increasingly being applied to multi-electrode neurophysiological and functional imaging data to characterize directional interactions between neurons and brain regions. For a multivariate dataset, one might be interested in different subsets of the recorded neurons or brain regions. According to the current estimation framework, for each subset, one conducts a separate autoregressive model fitting process, introducing the potential for unwanted variability and uncertainty. In this paper, we propose a multivariate framework for estimating Granger causality. It is based on spectral density matrix factorization and offers the advantage that the estimation of such a matrix needs to be done only once for the entire multivariate dataset. For any subset of recorded data, Granger causality can be calculated through factorizing the appropriate submatrix of the overall spectral density matrix. PMID:23858479
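Their contribution is the spectral-factorization estimator; as background only, here is the classical time-domain definition of Granger causality for a single pair, fit by ordinary least squares: the log ratio of residual variances between a restricted model (own lags) and a full model (own plus the other series' lags).

```python
import numpy as np

def granger(x, y, p=2):
    """GC from x to y: ln(var restricted / var full), OLS AR(p) fits."""
    n = len(y)
    Yl = np.column_stack([y[p - k - 1:n - k - 1] for k in range(p)])  # own lags
    Xl = np.column_stack([x[p - k - 1:n - k - 1] for k in range(p)])  # other series' lags
    target = y[p:]
    def rvar(design):
        A = np.column_stack([np.ones(len(target)), design])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.var(target - A @ beta)
    return np.log(rvar(Yl) / rvar(np.column_stack([Yl, Xl])))

rng = np.random.default_rng(12)
n = 2000
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + 0.1 * rng.normal()   # x drives y

print("GC x->y:", round(granger(x, y), 3), " GC y->x:", round(granger(y, x), 3))
```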
NASA Astrophysics Data System (ADS)
Kukunda, Collins B.; Duque-Lazo, Joaquín; González-Ferreiro, Eduardo; Thaden, Hauke; Kleinn, Christoph
2018-03-01
Distinguishing tree species is relevant in many contexts of remote sensing assisted forest inventory. Accurate tree species maps support management and conservation planning, pest and disease control and biomass estimation. This study evaluated the performance of applying ensemble techniques with the goal of automatically distinguishing Pinus sylvestris L. and Pinus uncinata Mill. ex Mirb. within a 1.3 km^2 mountainous area in Barcelonnette (France). Three modelling schemes were examined, based on: (1) high-density LiDAR data (160 returns m^-2), (2) Worldview-2 multispectral imagery, and (3) Worldview-2 and LiDAR in combination. Variables related to the crown structure and height of individual trees were extracted from the normalized LiDAR point cloud at individual-tree level, after performing individual tree crown (ITC) delineation. Vegetation indices and the Haralick texture indices were derived from Worldview-2 images and served as independent spectral variables. Selection of the best predictor subset was done after a comparison of three variable selection procedures: (1) Random Forests with cross validation (AUCRFcv), (2) Akaike Information Criterion (AIC) and (3) Bayesian Information Criterion (BIC). To classify the species, 9 regression techniques were combined using ensemble models. Predictions were evaluated using cross validation and an independent dataset. Integration of datasets and models improved individual tree species classification (True Skills Statistic, TSS; from 0.67 to 0.81) over individual techniques and maintained strong predictive power (Relative Operating Characteristic, ROC = 0.91). Assemblage of regression models and integration of the datasets provided more reliable species distribution maps and associated tree-scale mapping uncertainties. Our study highlights the potential of model and data assemblage at improving species classifications needed in present-day forest planning and management.
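One of the selection routes named above, Random Forests with cross-validated AUC, can be sketched as a wrapper search over importance-ranked predictors. The snippet below is an illustrative stand-in on synthetic data, not the study's LiDAR/Worldview-2 pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]   # strongest first

best_auc, best_k = -np.inf, None
for k in range(1, X.shape[1] + 1):
    auc = cross_val_score(
        RandomForestClassifier(n_estimators=200, random_state=0),
        X[:, order[:k]], y, cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_auc, best_k = auc, k
print(f"best subset: top {best_k} predictors, CV AUC = {best_auc:.3f}")
```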
Cai, Hongmin; Peng, Yanxia; Ou, Caiwen; Chen, Minsheng; Li, Li
2014-01-01
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is increasingly used for breast cancer diagnosis as a supplement to conventional imaging techniques. Combining morphological features from diffusion-weighted imaging (DWI) with kinetic features from DCE-MRI to improve the power to discriminate malignant from benign breast masses is rarely reported. The study comprised 234 female patients with 85 benign and 149 malignant lesions. Four distinct groups of features, coupled with pathological tests, were estimated to comprehensively characterize the pictorial properties of each lesion, which was obtained by a semi-automated segmentation method. A classical machine learning scheme, including feature subset selection and various classification schemes, was employed to build a prognostic model, which served as a foundation for evaluating the combined effects of the multi-sided features in predicting the type of lesion. Various measurements, including cross validation and receiver operating characteristics, were used to quantify the diagnostic performance of each feature as well as of their combination. All seven features were found to be statistically different between the malignant and the benign groups, and their combination achieved the highest classification accuracy. The seven features include one pathological variable of age, one morphological variable of slope, three texture features of entropy, inverse difference and information correlation, one kinetic feature of SER and one DWI feature of apparent diffusion coefficient (ADC). Together with the selected diagnostic features, various classical classification schemes were used to test their discrimination power through a cross validation scheme. The averaged measurements of sensitivity, specificity, AUC and accuracy are 0.85, 0.89, 90.9% and 0.93, respectively. Multi-sided variables which characterize the morphological, kinetic, pathological properties and the DWI measurement of ADC can dramatically improve the discriminatory power for breast lesions.
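The general scheme of embedding feature subset selection inside cross-validation can be illustrated as follows; the data, the value k=7 (echoing the seven selected features), and the scorers are placeholders rather than the study's actual variables. Keeping the selection step inside the pipeline ensures it is re-run within each fold, avoiding optimistic error estimates.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=234, n_features=12, random_state=1)
model = Pipeline([("select", SelectKBest(f_classif, k=7)),  # keep 7 features
                  ("clf", LogisticRegression(max_iter=1000))])
scores = cross_validate(model, X, y, cv=5,
                        scoring=["roc_auc", "accuracy", "recall"])
for key in ("test_roc_auc", "test_accuracy", "test_recall"):
    print(key, round(scores[key].mean(), 3))
```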
Badia, Jordi; Raspopovic, Stanisa; Carpaneto, Jacopo; Micera, Silvestro; Navarro, Xavier
2016-01-01
The selection of suitable peripheral nerve electrodes for biomedical applications implies a trade-off between invasiveness and selectivity. The optimal design should provide the highest selectivity for targeting a large number of nerve fascicles with the least invasiveness and potential damage to the nerve. The transverse intrafascicular multichannel electrode (TIME), transversally inserted in the peripheral nerve, has been shown to be useful for the selective activation of subsets of axons, both at inter- and intra-fascicular levels, in the small sciatic nerve of the rat. In this study we assessed the capabilities of TIME for the selective recording of neural activity, considering the topographical selectivity and the distinction of neural signals corresponding to different sensory types. Topographical recording selectivity was proved by the differential recording of CNAPs from different subsets of nerve fibers, such as those innervating toes 2 and 4 of the hindpaw of the rat. Neural signals elicited by sensory stimuli applied to the rat paw were successfully recorded. Signal processing allowed distinguishing three different types of sensory stimuli such as tactile, proprioceptive and nociceptive ones with high performance. These findings further support the suitability of TIMEs for neuroprosthetic applications, by exploiting the transversal topographical structure of the peripheral nerves.
Global Tree Range Shifts Under Forecasts from Two Alternative GCMs Using Two Future Scenarios
NASA Astrophysics Data System (ADS)
Hargrove, W. W.; Kumar, J.; Potter, K. M.; Hoffman, F. M.
2013-12-01
Global shifts in the environmentally suitable ranges of 215 tree species were predicted under forecasts from two GCMs (the Parallel Climate Model (PCM), and the Hadley Model), each under two IPCC future climatic scenarios (A1 and B1), each at two future dates (2050 and 2100). The analysis considers all global land surface at a resolution of 4 km^2. A statistical multivariate clustering procedure was used to quantitatively delineate 30 thousand environmentally homogeneous ecoregions across present and 8 potential future global locations at once, using global maps of 17 environmental characteristics describing temperature, precipitation, soils, topography and solar insolation. Presence of each tree species on Forest Inventory Analysis (FIA) plots and in Global Biodiversity Information Facility (GBIF) samples was used to select a subset of suitable ecoregions from the full set of 30 thousand. Once identified, this suitable subset of ecoregions was compared to the known current range of the tree species under present conditions. Predicted present ranges correspond well with current understanding for all but a few of the 215 tree species. The subset of suitable ecoregions for each tree species can then be tracked into the future to determine whether the suitable home range for this species remains the same, moves, grows, shrinks, or disappears under each model/scenario combination. Occurrence and growth performance measurements for various tree species across the U.S. are limited to FIA plots. We present a new, general-purpose empirical imputation method which associates sparse measurements of dependent variables with particular multivariate clustered combinations of the independent variables, and then estimates values for unmeasured clusters, based on directional proximity in multidimensional data space, at both the cluster and map-cell levels of resolution. Using Associative Clustering, we scaled up the FIA point measurements into continuous maps that show the expected growth and suitability for individual tree species across the continental US. Maps were generated for each tree species showing the Minimum Required Movement (MRM) straight-line distance from each currently suitable location to the geographically nearest "lifeboat" location having suitable conditions in the future. Locations that are the closest "lifeboats" for many MRM propagules originating from wide surrounding areas may constitute high-priority preservation targets as a refugium against climatic change.
Better physical activity classification using smartphone acceleration sensor.
Arif, Muhammad; Bilal, Mohsin; Kattan, Ahmed; Ahamed, S Iqbal
2014-09-01
Obesity is becoming one of the serious problems for the health of the worldwide population. Social interactions on mobile phones and computers via the internet through social e-networks are one of the major causes of a lack of physical activity. For the health specialist, it is important to track the record of physical activities of obese or overweight patients to supervise weight loss control. In this study, the acceleration sensor present in the smartphone is used to monitor the physical activity of the user. Physical activities including Walking, Jogging, Sitting, Standing, Walking upstairs and Walking downstairs are classified. Time domain features are extracted from the acceleration data recorded by the smartphone during different physical activities. The time and space complexity of the whole framework is reduced by optimal feature subset selection and pruning of instances. Classification results of six physical activities are reported in this paper. Using simple time domain features, 99% classification accuracy is achieved. Furthermore, attribute subset selection is used to remove redundant features and to minimize the time complexity of the algorithm. A subset of 30 features produced more than 98% classification accuracy for the six physical activities.
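A hedged sketch of the windowed time-domain feature extraction this kind of study relies on; the sampling rate, window length, and feature set below are assumptions, not the paper's exact configuration.

```python
import numpy as np

def time_domain_features(window):
    """Per-axis time-domain features for one window of shape (n, 3)."""
    feats = []
    for axis in range(window.shape[1]):
        a = window[:, axis]
        feats += [a.mean(),                                    # average level
                  a.std(),                                     # variability
                  np.abs(np.diff(a)).mean(),                   # jerkiness
                  np.percentile(a, 75) - np.percentile(a, 25)] # spread
    return np.array(feats)

fs = 20                                     # assumed 20 Hz sampling rate
rng = np.random.default_rng(0)
acc = rng.standard_normal((fs * 60, 3))     # one minute of synthetic x/y/z
win = fs * 10                               # 10-second non-overlapping windows
features = np.array([time_domain_features(acc[i:i + win])
                     for i in range(0, len(acc) - win + 1, win)])
print(features.shape)                       # (6 windows, 12 features)
```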
On Some Multiple Decision Problems
1976-08-01
parameter space. Some recent results in the area of subset selection formulation are Gnanadesikan and Gupta [28], Gupta and Studden [43], Gupta and...
USDA-ARS?s Scientific Manuscript database
The phytopathogen Ralstonia solanacearum is a species complex that contains a subset of strains that are quarantined or select agent pathogens. An unidentified R. solanacearum strain is considered a select agent in the US until proven otherwise, which can be done by phylogenetic analysis of a partia...
Wang, Shuaiqun; Aorigele; Kong, Wei; Zeng, Weiming; Hong, Xiaomin
2016-01-01
Gene expression data composed of thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features over a large number of gene expression data. Lately, many researchers devote themselves to feature selection using diverse computational intelligence methods. However, in the progress of selecting informative genes, many computational methods face difficulties in selecting small subsets for cancer classification due to the huge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm HICATS incorporating imperialist competition algorithm (ICA) which performs global search and tabu search (TS) that conducts fine-tuned search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved to be superior to other related works including the conventional version of binary optimization algorithm in terms of classification accuracy and the number of selected genes.
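A greatly simplified sketch of the tabu-search half of such a hybrid wrapper (the imperialist-competition global stage is omitted); the data, the kNN classifier, and the subset-size penalty are illustrative assumptions, not the HICATS implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=80, n_features=40, n_informative=6,
                           random_state=0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(3),
                          X[:, mask.astype(bool)], y, cv=3).mean()
    return acc - 0.001 * mask.sum()          # mild pressure toward small sets

mask = rng.integers(0, 2, X.shape[1])
best_mask, best_fit = mask.copy(), fitness(mask)
tabu, tabu_len = [], 10
for _ in range(15):                          # tabu-search refinement loop
    moves = [j for j in range(X.shape[1]) if j not in tabu]
    scores = [None] * len(moves)
    for i, j in enumerate(moves):
        mask[j] ^= 1                         # try flipping gene j in/out
        scores[i] = fitness(mask)
        mask[j] ^= 1
    i_best = int(np.argmax(scores))
    mask[moves[i_best]] ^= 1                 # accept best non-tabu flip
    tabu = (tabu + [moves[i_best]])[-tabu_len:]
    if scores[i_best] > best_fit:
        best_fit, best_mask = scores[i_best], mask.copy()
print(best_mask.sum(), "genes selected; fitness =", round(best_fit, 3))
```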
Eliason, Michele J; Streed, Carl G
2017-10-01
Researchers struggle to find effective ways to measure sexual and gender identities to determine whether there are health differences among subsets of the LGBTQ+ population. This study examines responses on the National Health Interview Survey (NHIS) sexual identity questions among 277 LGBTQ+ healthcare providers. Eighteen percent indicated that their sexual identity was "something else" on the first question, and 57% of those also selected "something else" on the second question. Half of the genderqueer/gender variant participants and 100% of transgender-identified participants selected "something else" as their sexual identity. The NHIS question does not allow all respondents in LGBTQ+ populations to be categorized, thus we are potentially missing vital health disparity information about subsets of the LGBTQ+ population.
Distinguishing Different Strategies of Across-Dimension Attentional Selection
ERIC Educational Resources Information Center
Huang, Liqiang; Pashler, Harold
2012-01-01
Selective attention in multidimensional displays has usually been examined using search tasks requiring the detection of a single target. We examined the ability to perceive a spatial structure in multi-item subsets of a display that were defined either conjunctively or disjunctively. Observers saw two adjacent displays and indicated whether the…
Testing Different Model Building Procedures Using Multiple Regression.
ERIC Educational Resources Information Center
Thayer, Jerome D.
The stepwise regression method of selecting predictors for computer assisted multiple regression analysis was compared with forward, backward, and best subsets regression, using 16 data sets. The results indicated the stepwise method was preferred because of its practical nature, when the models chosen by different selection methods were similar…
Yee, Lydia T S
2017-01-01
Words are frequently used as stimuli in cognitive psychology experiments, for example, in recognition memory studies. In these experiments, it is often desirable to control for the words' psycholinguistic properties because differences in such properties across experimental conditions might introduce undesirable confounds. In order to avoid confounds, studies typically check to see if various affective and lexico-semantic properties are matched across experimental conditions, and so databases that contain values for these properties are needed. While word ratings for these variables exist in English and other European languages, ratings for Chinese words are not comprehensive. In particular, while ratings for single characters exist, ratings for two-character words, which often have different meanings than their constituent characters, are scarce. In this study, ratings for 292 two-character Chinese nouns were obtained from Cantonese speakers in Hong Kong. Affective variables, including valence and arousal, and lexico-semantic variables, including familiarity, concreteness, and imageability, were rated in the study. The words were selected from a film subtitle database containing word frequency information that could be extracted and listed alongside the resulting ratings. Overall, the subjective ratings showed good reliability across all rated dimensions, as well as good reliability within and between the different groups of participants who each rated a subset of the words. Moreover, several well-established relationships between the variables found consistently in other languages were also observed in this study, demonstrating that the ratings are valid. The resulting word database can be used in studies where control for the above psycholinguistic variables is critical to the research design.
Chechlacz, Magdalena; Gillebert, Celine R; Vangkilde, Signe A; Petersen, Anders; Humphreys, Glyn W
2015-07-29
Visuospatial attention allows us to select and act upon a subset of behaviorally relevant visual stimuli while ignoring distraction. Bundesen's theory of visual attention (TVA) (Bundesen, 1990) offers a quantitative analysis of the different facets of attention within a unitary model and provides a powerful analytic framework for understanding individual differences in attentional functions. Visuospatial attention is contingent upon large networks, distributed across both hemispheres, consisting of several cortical areas interconnected by long-association frontoparietal pathways, including three branches of the superior longitudinal fasciculus (SLF I-III) and the inferior fronto-occipital fasciculus (IFOF). Here we examine whether structural variability within human frontoparietal networks mediates differences in attention abilities as assessed by the TVA. Structural measures were based on spherical deconvolution and tractography-derived indices of tract volume and hindrance-modulated orientational anisotropy (HMOA). Individual differences in visual short-term memory (VSTM) were linked to variability in the microstructure (HMOA) of SLF II, SLF III, and IFOF within the right hemisphere. Moreover, VSTM and speed of information processing were linked to hemispheric lateralization within the IFOF. Differences in spatial bias were mediated by both variability in microstructure and volume of the right SLF II. Our data indicate that the microstructural and macrostructural organization of white matter pathways differentially contributes to both the anatomical lateralization of frontoparietal attentional networks and to individual differences in attentional functions. We conclude that individual differences in VSTM capacity, processing speed, and spatial bias, as assessed by TVA, link to variability in structural organization within frontoparietal pathways. Copyright © 2015 Chechlacz et al.
Interaction of PRRS virus with bone marrow monocyte subsets.
Fernández-Caballero, Teresa; Álvarez, Belén; Alonso, Fernando; Revilla, Concepción; Martínez-Lobo, Javier; Prieto, Cinta; Ezquerra, Ángel; Domínguez, Javier
2018-06-01
PRRSV can replicate for months in lymphoid organs, leading to persistent host infections. Porcine bone marrow comprises two major monocyte subsets, one of which expresses CD163 and CD169, two receptors involved in the entry of PRRSV into macrophages. In this study, we investigate the permissiveness of these subsets to PRRSV infection. PRRSV replicates efficiently in BM CD163+ monocytes, reaching titers similar to those obtained in alveolar macrophages, but with delayed kinetics. Infection of BM CD163- monocytes was variable and yielded lower titers. This may be related to the capacity of BM CD163- monocytes to differentiate into CD163+CD169+ cells after culture in the presence of M-CSF. Both subsets secreted IL-8 in response to the virus, but CD163+ cells tended to produce higher amounts. The infection of BM monocytes by PRRSV may contribute to persistence of the virus in this compartment and to hematological disorders found in infected animals, such as the reduction in the number of peripheral blood monocytes. Copyright © 2018 Elsevier B.V. All rights reserved.
2008 Niday Perinatal Database quality audit: report of a quality assurance project.
Dunn, S; Bottomley, J; Ali, A; Walker, M
2011-12-01
This quality assurance project was designed to determine the reliability, completeness and comprehensiveness of the data entered into the Niday Perinatal Database. Quality of the data was measured by comparing data re-abstracted from the patient record to the original data entered into the Niday Perinatal Database. A representative sample of hospitals in Ontario was selected and a random sample of 100 linked mother and newborn charts was audited for each site. A subset of 33 variables (representing 96 data fields) from the Niday dataset was chosen for re-abstraction. Of the data fields for which Cohen's kappa statistic or intraclass correlation coefficient (ICC) was calculated, 44% showed substantial or almost perfect agreement (beyond chance). However, about 17% showed less than 95% agreement and a kappa or ICC value of less than 60%, indicating only slight, fair or moderate agreement (beyond chance). Recommendations to improve the quality of these data fields are presented.
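For readers unfamiliar with the agreement statistic used here, Cohen's kappa (agreement beyond chance between original entry and re-abstraction) can be computed directly; the field values below are invented examples, not audit data.

```python
from sklearn.metrics import cohen_kappa_score

# Agreement between originally entered and re-abstracted values for one field.
original   = ["vaginal", "cesarean", "vaginal", "vaginal", "cesarean"]
reabstract = ["vaginal", "cesarean", "cesarean", "vaginal", "cesarean"]
print(cohen_kappa_score(original, reabstract))  # 1.0 = perfect agreement
```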
The continuum spectral characteristics of gamma ray bursts observed by BATSE
NASA Technical Reports Server (NTRS)
Pendleton, Geoffrey N.; Paciesas, William S.; Briggs, Michael S.; Mallozzi, Robert S.; Koshut, Tom M.; Fishman, Gerald J.; Meegan, Charles A.; Wilson, Robert B.; Harmon, Alan B.; Kouveliotou, Chryssa
1994-01-01
Distributions of the continuum spectral characteristics of 260 bursts in the first Burst and Transient Source Experiment (BATSE) catalog are presented. The data are derived from flux ratios calculated from the BATSE Large Area Detector (LAD) four channel discriminator data. The data are converted from counts to photons using a direct spectral inversion technique to remove the effects of atmospheric scattering and the energy dependence of the detector angular response. Although there are intriguing clusterings of bursts in the spectral hardness ratio distributions, no evidence for the presence of distinct burst classes based on spectral hardness ratios alone is found. All subsets of bursts selected for their spectral characteristics in this analysis exhibit spatial distributions consistent with isotropy. The spectral diversity of the burst population appears to be caused largely by the highly variable nature of the burst production mechanisms themselves.
The Asthma Mobile Health Study, a large-scale clinical observational study using ResearchKit.
Chan, Yu-Feng Yvonne; Wang, Pei; Rogers, Linda; Tignor, Nicole; Zweig, Micol; Hershman, Steven G; Genes, Nicholas; Scott, Erick R; Krock, Eric; Badgeley, Marcus; Edgar, Ron; Violante, Samantha; Wright, Rosalind; Powell, Charles A; Dudley, Joel T; Schadt, Eric E
2017-04-01
The feasibility of using mobile health applications to conduct observational clinical studies requires rigorous validation. Here, we report initial findings from the Asthma Mobile Health Study, a research study, including recruitment, consent, and enrollment, conducted entirely remotely by smartphone. We achieved secure bidirectional data flow between investigators and 7,593 participants from across the United States, including many with severe asthma. Our platform enabled prospective collection of longitudinal, multidimensional data (e.g., surveys, devices, geolocation, and air quality) in a subset of users over the 6-month study period. Consistent trending and correlation of interrelated variables support the quality of data obtained via this method. We detected increased reporting of asthma symptoms in regions affected by heat, pollen, and wildfires. Potential challenges with this technology include selection bias, low retention rates, reporting bias, and data security. These issues require attention to realize the full potential of mobile platforms in research and patient care.
A Catalog of Photometric Redshift and the Distribution of Broad Galaxy Morphologies
NASA Astrophysics Data System (ADS)
Paul, Nicholas; Virag, Nicholas; Shamir, Lior
2018-06-01
We created a catalog of photometric redshift of ~3,000,000 SDSS galaxies annotated by their broad morphology. The photometric redshift was optimized by testing and comparing several pattern recognition algorithms and variable selection strategies, trained and tested on a subset of the galaxies in the catalog that had spectra. The galaxies in the catalog have i magnitude brighter than 18 and Petrosian radius greater than 5.5''. The majority of these objects are not included in previous SDSS photometric redshift catalogs such as the photoz table of SDSS DR12. Analysis of the catalog shows that the number of galaxies in the catalog that are visually spiral increases until redshift of ~0.085, where it peaks and starts to decrease. It also shows that the number of spiral galaxies compared to elliptical galaxies drops as the redshift increases. The catalog is publicly available at https://figshare.com/articles/Morphology_and_photometric_redshift_catalog/4833593
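The train-on-spectra, predict-for-the-rest step described above can be sketched as follows, with invented magnitudes and redshifts standing in for SDSS photometry, and a gradient-boosted regressor standing in for the compared pattern recognition algorithms.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
mags = rng.uniform(14.0, 18.0, (n, 5))           # stand-in magnitudes
z = 0.02 * (mags[:, 3] - 14.0) + rng.normal(0, 0.01, n)   # toy redshift

has_spec = rng.random(n) < 0.2                   # ~20% have spectra
X_tr, X_te, z_tr, z_te = train_test_split(mags[has_spec], z[has_spec],
                                          test_size=0.3, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, z_tr)
rmse = float(np.sqrt(np.mean((model.predict(X_te) - z_te) ** 2)))
photo_z = model.predict(mags[~has_spec])         # catalog for the rest
print(f"held-out RMSE = {rmse:.4f}; {photo_z.size} photometric redshifts")
```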
Technique and cue selection for graphical presentation of generic hyperdimensional data
NASA Astrophysics Data System (ADS)
Howard, Lee M.; Burton, Robert P.
2013-12-01
Several presentation techniques have been created for visualization of data with more than three variables. Packages have been written, each of which implements a subset of these techniques. However, these packages generally fail to provide all the features needed by the user during the visualization process. Further, packages generally limit support for presentation techniques to a few techniques. A new package called Petrichor accommodates all necessary and useful features together in one system. Any presentation technique may be added easily through an extensible plugin system. Features are supported by a user interface that allows easy interaction with data. Annotations allow users to mark up visualizations and share information with others. By providing a hyperdimensional graphics package that easily accommodates presentation techniques and includes a complete set of features, including those that are rarely or never supported elsewhere, the user is provided with a tool that facilitates improved interaction with multivariate data to extract and disseminate information.
Basu, D; Nguyen, T-T K; Montone, K T; Zhang, G; Wang, L-P; Diehl, J A; Rustgi, A K; Lee, J T; Weinstein, G S; Herlyn, M
2010-07-22
Variable drug responses among malignant cells within individual tumors may represent a barrier to their eradication using chemotherapy. Carcinoma cells expressing mesenchymal markers resist conventional and epidermal growth factor receptor (EGFR)-targeted chemotherapy. In this study, we evaluated whether mesenchymal-like sub-populations within human squamous cell carcinomas (SCCs) with predominantly epithelial features contribute to overall therapy resistance. We identified a mesenchymal-like subset expressing low E-cadherin (Ecad-lo) and high vimentin within the upper aerodigestive tract SCCs. This subset was both isolated from the cell lines and was identified in xenografts and primary clinical specimens. The Ecad-lo subset contained more low-turnover cells, correlating with resistance to the conventional chemotherapeutic paclitaxel in vitro. Epidermal growth factor induced less stimulation of the mitogen-activated protein kinase and phosphatidylinositol-3-kinase pathways in Ecad-lo cells, which was likely due to lower EGFR expression in this subset and correlated with in vivo resistance to the EGFR-targeted antibody, cetuximab. The Ecad-lo and high E-cadherin subsets were dynamic in phenotype, showing the capacity to repopulate each other from single-cell clones. Taken together, these results provide evidence for a low-turnover, mesenchymal-like sub-population in SCCs with diminished EGFR pathway function and intrinsic resistance to conventional and EGFR-targeted chemotherapies.
Nurmi, Erika L; Dowd, Michael; Tadevosyan-Leyfer, Ovsanna; Haines, Jonathan L; Folstein, Susan E; Sutcliffe, James S
2003-07-01
Autism displays a remarkably high heritability but a complex genetic etiology. One approach to identifying susceptibility loci under these conditions is to define more homogeneous subsets of families on the basis of genetically relevant phenotypic or biological characteristics that vary from case to case. The authors performed a principal components analysis, using items from the Autism Diagnostic Interview, which resulted in six clusters of variables, five of which showed significant sib-sib correlation. The utility of these phenotypic subsets was tested in an exploratory genetic analysis of the autism candidate region on chromosome 15q11-q13. When the Collaborative Linkage Study of Autism sample was divided, on the basis of mean proband score for the "savant skills" cluster, the heterogeneity logarithm of the odds under a recessive model at D15S511, within the GABRB3 gene, increased from 0.6 to 2.6 in the subset of families in which probands had greater savant skills. These data are consistent with the genetic contribution of a 15q locus to autism susceptibility in a subset of affected individuals exhibiting savant skills. Similar types of skills have been noted in individuals with Prader-Willi syndrome, which results from deletions of this chromosomal region.
Gu, Jiwei; Andreasen, Jan J; Melgaard, Jacob; Lundbye-Christensen, Søren; Hansen, John; Schmidt, Erik B; Thorsteinsson, Kristinn; Graff, Claus
2017-02-01
To investigate if electrocardiogram (ECG) markers from routine preoperative ECGs can be used in combination with clinical data to predict new-onset postoperative atrial fibrillation (POAF) following cardiac surgery. Retrospective observational case-control study. Single-center university hospital. One hundred consecutive adult patients (50 POAF, 50 without POAF) who underwent coronary artery bypass grafting, valve surgery, or combinations. Retrospective review of medical records and registration of POAF. Clinical data and demographics were retrieved from the Western Denmark Heart Registry and patient records. Paper tracings of preoperative ECGs were collected from patient records, and ECG measurements were read by two independent readers blinded to outcome. A subset of four clinical variables (age, gender, body mass index, and type of surgery) was selected to form a multivariate clinical prediction model for POAF, and five ECG variables (QRS duration, PR interval, P-wave duration, left atrial enlargement, and left ventricular hypertrophy) were used in a multivariate ECG model. Adding ECG variables to the clinical prediction model significantly improved the area under the receiver operating characteristic curve from 0.54 to 0.67 (with cross-validation). The best predictive model for POAF was a combined clinical and ECG model with the following four variables: age, PR interval, QRS duration, and left atrial enlargement. ECG markers obtained from a routine preoperative ECG may be helpful in predicting new-onset POAF in patients undergoing cardiac surgery. Copyright © 2017 Elsevier Inc. All rights reserved.
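The clinical-versus-combined comparison reduces to fitting two logistic regressions and comparing cross-validated AUCs. A sketch on simulated data follows; the variable groupings mirror the abstract, but the effect sizes are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 100
clinical = rng.standard_normal((n, 4))   # age, gender, BMI, surgery type
ecg = rng.standard_normal((n, 5))        # QRS, PR, P-wave, LAE, LVH
y = (0.8 * ecg[:, 1] + 0.5 * clinical[:, 0]
     + rng.standard_normal(n) > 0).astype(int)

for name, X in [("clinical", clinical),
                ("clinical + ECG", np.hstack([clinical, ecg]))]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.2f}")
```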
ERIC Educational Resources Information Center
Waller, Niels; Jones, Jeff
2011-01-01
We describe methods for assessing all possible criteria (i.e., dependent variables) and subsets of criteria for regression models with a fixed set of predictors, x (where x is an n x 1 vector of independent variables). Our methods build upon the geometry of regression coefficients (hereafter called regression weights) in n-dimensional space. For a…
A Metacommunity Framework for Enhancing the Effectiveness of Biological Monitoring Strategies
Roque, Fabio O.; Cottenie, Karl
2012-01-01
Because of inadequate knowledge and funding, the use of biodiversity indicators is often suggested as a way to support management decisions. Consequently, many studies have analyzed the performance of certain groups as indicator taxa. However, in addition to knowing whether certain groups can adequately represent the biodiversity as a whole, we must also know whether they show similar responses to the main structuring processes affecting biodiversity. Here we present an application of the metacommunity framework for evaluating the effectiveness of biodiversity indicators. Although the metacommunity framework has contributed to a better understanding of biodiversity patterns, there is still limited discussion about its implications for conservation and biomonitoring. We evaluated the effectiveness of indicator taxa in representing spatial variation in macroinvertebrate community composition in Atlantic Forest streams, and the processes that drive this variation. We focused on analyzing whether some groups conform to environmental processes and other groups are more influenced by spatial processes, and on how this can help in deciding which indicator group or groups should be used. We showed that a relatively small subset of taxa from the metacommunity would represent 80% of the variation in community composition shown by the entire metacommunity. Moreover, this subset does not have to be composed of predetermined taxonomic groups, but rather can be defined based on random subsets. We also found that some random subsets composed of a small number of genera performed better in responding to major environmental gradients. There were also random subsets that seemed to be affected by spatial processes, which could indicate important historical processes. We were able to integrate in the same theoretical and practical framework, the selection of biodiversity surrogates, indicators of environmental conditions, and more importantly, an explicit integration of environmental and spatial processes into the selection approach. PMID:22937068
Wagner, Abram L; Boulton, Matthew L; Gillespie, Brenda W; Zhang, Ying; Ding, Yaxing; Carlson, Bradley F; Luo, Xiaoyan; Montgomery, JoLynn P; Wang, Xiexiu
2017-01-01
Control groups in previous case-control studies of vaccine-preventable diseases have included people immune to disease. This study examines risk factors for measles acquisition among adults 20 to 49 years of age in Tianjin, China, and compares findings using measles IgG antibody-negative controls to all controls, both IgG-negative and IgG-positive. Measles cases were sampled from a disease registry, and controls were enrolled from community registries in Tianjin, China, 2011-2015. Through a best subsets selection procedure, we compared which variables were selected at different model sizes when using IgG-negative controls or all controls. We entered risk factors for measles in two separate logistic regression models: one with measles IgG-negative controls and the other with all controls. The study included 384 measles cases and 1,596 community controls (194 IgG-negative). Visiting a hospital was an important risk factor. For specialty hospitals, the odds ratio (OR) was 4.53 (95% confidence interval (CI): 1.28, 16.03) using IgG-negative controls, and OR = 5.27 (95% CI: 2.73, 10.18) using all controls. Variables, such as age or length of time in Tianjin, were differentially selected depending on the control group. Individuals living in Tianjin ≤3 years had 2.87 (95% CI: 1.46, 5.66) times greater odds of measles case status compared to all controls, but this relationship was not apparent for IgG-negative controls. We recommend that case-control studies examining risk factors for infectious diseases, particularly in the context of transmission dynamics, consider antibody-negative controls as the gold standard.
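A best subsets selection procedure of the kind referenced above can be made concrete with an exhaustive search scored by AIC; the data here are simulated, and sklearn's default regularization is effectively disabled (large C) to approximate a maximum-likelihood fit.

```python
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n, p = 300, 6
X = rng.standard_normal((n, p))
y = (X[:, 0] - 0.7 * X[:, 2] + rng.standard_normal(n) > 0).astype(int)

best = {}
for k in range(1, p + 1):                    # best subset of each size
    for cols in itertools.combinations(range(p), k):
        idx = list(cols)
        m = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, idx], y)
        ll = -log_loss(y, m.predict_proba(X[:, idx]), normalize=False)
        aic = 2 * (k + 1) - 2 * ll           # k coefficients + intercept
        if k not in best or aic < best[k][0]:
            best[k] = (aic, cols)
for k, (aic, cols) in sorted(best.items()):
    print(f"size {k}: variables {cols}, AIC = {aic:.1f}")
```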
Accounting for complementarity to maximize monitoring power for species management.
Tulloch, Ayesha I T; Chadès, Iadine; Possingham, Hugh P
2013-10-01
To choose among conservation actions that may benefit many species, managers need to monitor the consequences of those actions. Decisions about which species to monitor from a suite of different species being managed are hindered by natural variability in populations and uncertainty in several factors: the ability of the monitoring to detect a change, the likelihood of the management action being successful for a species, and how representative species are of one another. However, the literature provides little guidance about how to account for these uncertainties when deciding which species to monitor to determine whether the management actions are delivering outcomes. We devised an approach that applies decision science and selects the best complementary suite of species to monitor to meet specific conservation objectives. We created an index for indicator selection that accounts for the likelihood of successfully detecting a real trend due to a management action and whether that signal provides information about other species. We illustrated the benefit of our approach by analyzing a monitoring program for invasive predator management aimed at recovering 14 native Australian mammals of conservation concern. Our method selected the species that provided more monitoring power at lower cost relative to the current strategy and traditional approaches that consider only a subset of the important considerations. Our benefit function accounted for natural variability in species growth rates, uncertainty in the responses of species to the prescribed action, and how well species represent others. Monitoring programs that ignore uncertainty, likelihood of detecting change, and complementarity between species will be more costly and less efficient and may waste funding that could otherwise be used for management. © 2013 Society for Conservation Biology.
Abdel-Rahman, Susan M.; Preuett, Barry L.
2012-01-01
Background Trichophyton tonsurans is the foremost fungal pathogen of minority children in the U.S. Despite overwhelming infection rates, it does not appear that this fungus infects children in a non-specific manner. Objective This study was designed to identify genes that may predispose or protect a child from T. tonsurans infection. Methods Children participating in an earlier longitudinal study wherein infection rates could be reliably determined were eligible for inclusion. DNA from a subset (n=40) of these children at the population extremes underwent whole genome genotyping (WGG). Allele frequencies between cases and controls were examined and significant SNPs were used to develop a candidate gene list for which the remainder of the cohort (n=115) were genotyped. Cumulative infection rate was examined by genotype and the ability of selected genotypes to predict the likelihood of infection explored by multivariable analysis. Results 23 genes with a putative mechanistic role in cutaneous infection were selected for evaluation. Of these, 21 demonstrated significant differences in infection rate between genotypes. A risk index assigned to genotypes in the 21 genes accounted for over 60% of the variability observed in infection rate (adjusted r2=0.665, p<0.001). Among these, 8 appeared to account for the majority of variability that was observed (r2=0.603, p<0.001). These included genes involved in: leukocyte activation and migration, extracellular matrix integrity and remodeling, epidermal maintenance and wound repair, and cutaneous permeability. Conclusions Applying WGG to individuals at the extremes of phenotype can help to guide the selection of candidate genes in populations of small cohorts where disease etiology is likely polygenic in nature. PMID:22704677
Baranasic, Damir; Oppermann, Timo; Cheaib, Miriam; Cullum, John; Schmidt, Helmut
2014-01-01
Antigenic or phenotypic variation is a widespread phenomenon of expression of variable surface protein coats on eukaryotic microbes. To clarify the mechanism behind mutually exclusive gene expression, we characterized the genetic properties of the surface antigen multigene family in the ciliate Paramecium tetraurelia and the epigenetic factors controlling expression and silencing. Genome analysis indicated that the multigene family consists of intrachromosomal and subtelomeric genes; both classes apparently derive from different gene duplication events: whole-genome and intrachromosomal duplication. Expression analysis provides evidence for telomere position effects, because only subtelomeric genes follow mutually exclusive transcription. Microarray analysis of cultures deficient in Rdr3, an RNA-dependent RNA polymerase, in comparison to serotype-pure wild-type cultures, shows cotranscription of a subset of subtelomeric genes, indicating that the telomere position effect is due to a selective occurrence of Rdr3-mediated silencing in subtelomeric regions. We present a model of surface antigen evolution by intrachromosomal gene duplication involving the maintenance of positive selection of structurally relevant regions. Further analysis of chromosome heterogeneity shows that alternative telomere addition regions clearly affect transcription of closely related genes. Consequently, chromosome fragmentation appears to be of crucial importance for surface antigen expression and evolution. Our data suggest that RNAi-mediated control of this genetic network by trans-acting RNAs allows rapid epigenetic adaptation by phenotypic variation in combination with long-term genetic adaptation by Darwinian evolution of antigen genes. PMID:25389173
Genomic Prediction Accounting for Residual Heteroskedasticity
Ou, Zhining; Tempelman, Robert J.; Steibel, Juan P.; Ernst, Catherine W.; Bates, Ronald O.; Bello, Nora M.
2015-01-01
Whole-genome prediction (WGP) models that use single-nucleotide polymorphism marker information to predict genetic merit of animals and plants typically assume homogeneous residual variance. However, variability is often heterogeneous across agricultural production systems and may subsequently bias WGP-based inferences. This study extends classical WGP models based on normality, heavy-tailed specifications and variable selection to explicitly account for environmentally-driven residual heteroskedasticity under a hierarchical Bayesian mixed-models framework. WGP models assuming homogeneous or heterogeneous residual variances were fitted to training data generated under simulation scenarios reflecting a gradient of increasing heteroskedasticity. Model fit was based on pseudo-Bayes factors and also on prediction accuracy of genomic breeding values computed on a validation data subset one generation removed from the simulated training dataset. Homogeneous vs. heterogeneous residual variance WGP models were also fitted to two quantitative traits, namely 45-min postmortem carcass temperature and loin muscle pH, recorded in a swine resource population dataset prescreened for high and mild residual heteroskedasticity, respectively. Fit of competing WGP models was compared using pseudo-Bayes factors. Predictive ability, defined as the correlation between predicted and observed phenotypes in validation sets of a five-fold cross-validation was also computed. Heteroskedastic error WGP models showed improved model fit and enhanced prediction accuracy compared to homoskedastic error WGP models although the magnitude of the improvement was small (less than two percentage points net gain in prediction accuracy). Nevertheless, accounting for residual heteroskedasticity did improve accuracy of selection, especially on individuals of extreme genetic merit. PMID:26564950
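The effect of modeling residual heteroskedasticity can be caricatured with a two-step weighted ridge fit; this is a crude GLS analogue on toy marker data, not the hierarchical Bayesian WGP models of the study.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.integers(0, 3, (n, p)).astype(float)   # toy 0/1/2 SNP codes
beta = rng.normal(0, 0.2, p)
group = rng.integers(0, 2, n)                  # two "environments"
sd = np.where(group == 0, 0.5, 2.0)            # heteroskedastic residual sd
y = X @ beta + rng.normal(0.0, sd)

# Step 1: rough per-group residual variances from a homoskedastic fit.
res = y - Ridge(alpha=10.0).fit(X, y).predict(X)
var_hat = np.array([res[group == g].var() for g in (0, 1)])[group]

# Step 2: refit with inverse-variance weights, a crude analogue of
# modeling the residual heteroskedasticity explicitly.
for name, model in [("homoskedastic", Ridge(alpha=10.0).fit(X, y)),
                    ("weighted", Ridge(alpha=10.0).fit(
                        X, y, sample_weight=1.0 / var_hat))]:
    print(name, "corr(coef, beta) =",
          round(float(np.corrcoef(model.coef_, beta)[0, 1]), 3))
```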
Test-retest reliability of retinal oxygen saturation measurement.
O'Connell, Rachael A; Anderson, Andrew J; Hosking, Sarah L; Batcha, Abrez H; Bui, Bang V
2014-06-01
To determine intrasession and intersession repeatability of retinal vessel oxygen saturation from the Oxymap Retinal Oximeter using a whole image-based analysis technique and so determine optimal analysis parameters to reduce variability. Ten fundus oximetry images were acquired through dilated pupils from 18 healthy participants (aged 22 to 38) using the Oxymap Retinal Oximeter T1. A further 10 images were obtained 1 to 2 weeks later from each individual. Analysis was undertaken for subsets of images to determine the number of images needed to return a stable coefficient of variation (CoV). Intrasession and intersession variability were quantified by evaluating the CoV and establishing the 95% limits of agreement using Bland and Altman analysis. Retinal oxygenation was derived from the distribution of oxygenation values from all vessels of a given width in an image or set of images, as described by Paul et al. in 2013. Grouped in 10-μm-wide bins, oxygen saturation varied significantly for both arteries and veins (p < 0.01). Between 110 and 150 μm, arteries had the least variability between individuals, with average CoVs less than 5% whose confidence intervals did not overlap with the greater than 10% average CoVs for veins across the same range. Bland and Altman analysis showed that there was no bias within or between recording sessions and that the 95% limits of agreement were generally lower in arteries. Retinal vessel oxygen saturation measurements show variability within and between clinical sessions when the whole image is used, which we believe more accurately reflects the true variability in Oxymap images than previous studies on select image segments. Averaging data from vessels 100 to 150 μm in width may help to minimize such variability.
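The two variability summaries used here, the coefficient of variation and the Bland and Altman 95% limits of agreement, are straightforward to compute; the saturation values below are invented examples, not study data.

```python
import numpy as np

def cov_percent(x):
    """Coefficient of variation of repeated measurements, in percent."""
    x = np.asarray(x, dtype=float)
    return 100.0 * x.std(ddof=1) / x.mean()

def bland_altman(a, b):
    """Bias and 95% limits of agreement between two sessions."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

session1 = np.array([92.1, 93.4, 91.8, 94.0])   # toy arterial saturations (%)
session2 = np.array([91.5, 94.1, 92.3, 93.2])
print("CoV =", round(cov_percent(session1), 2), "%")
print("bias, 95% LoA =", bland_altman(session1, session2))
```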
SIMRAND I- SIMULATION OF RESEARCH AND DEVELOPMENT PROJECTS
NASA Technical Reports Server (NTRS)
Miles, R. F.
1994-01-01
The Simulation of Research and Development Projects program (SIMRAND) aids in the optimal allocation of R&D resources needed to achieve project goals. SIMRAND models the system subsets or project tasks as various network paths to a final goal. Each path is described in terms of task variables such as cost per hour, cost per unit, availability of resources, etc. Uncertainty is incorporated by treating task variables as probabilistic random variables. SIMRAND calculates the measure of preference for each alternative network. The networks yielding the highest utility function (or certainty equivalence) are then ranked as the optimal network paths. SIMRAND has been used in several economic potential studies at NASA's Jet Propulsion Laboratory involving solar dish power systems and photovoltaic array construction. However, any project having tasks which can be reduced to equations and related by measures of preference can be modeled. SIMRAND analysis consists of three phases: reduction, simulation, and evaluation. In the reduction phase, analytical techniques from probability theory and simulation techniques are used to reduce the complexity of the alternative networks. In the simulation phase, a Monte Carlo simulation is used to derive statistics on the variables of interest for each alternative network path. In the evaluation phase, the simulation statistics are compared and the networks are ranked in preference by a selected decision rule. The user must supply project subsystems in terms of equations based on variables (for example, parallel and series assembly line tasks in terms of number of items, cost factors, time limits, etc). The associated cumulative distribution functions and utility functions for each variable must also be provided (allowable upper and lower limits, group decision factors, etc). SIMRAND is written in Microsoft FORTRAN 77 for batch execution and has been implemented on an IBM PC series computer operating under DOS.
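The simulate-then-rank idea behind SIMRAND can be sketched with a toy two-network example; the task costs, risk-aversion parameter, and exponential utility below are illustrative assumptions, not the program's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two alternative task networks, each a chain of tasks with uncertain cost;
# the decision rule is expected exponential (risk-averse) utility.
networks = {
    "A": [(10.0, 2.0), (20.0, 5.0)],   # (mean cost, sd) for each task
    "B": [(12.0, 1.0), (17.0, 8.0)],
}

def utility(cost, risk=0.05):
    return -np.exp(risk * cost)        # higher (less negative) is better

n_draws = 100_000                      # Monte Carlo simulation phase
ranked = sorted(networks,
                key=lambda name: -np.mean(utility(
                    sum(rng.normal(m, s, n_draws)
                        for m, s in networks[name]))))
print("preferred network:", ranked[0])  # evaluation phase: rank by utility
```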
Maria, Merlino Valentina; Danielle, Borra; Tibor, Verduna; Stefano, Massaglia
2017-10-31
Meat consumers around the world are increasingly paying attention to product quality and safety, and are starting to reduce their meat consumption, especially with regard to red meat. This trend is prevalent in households with children who prefer health-certified meat products. Our study compares meat consumption habits in households with and without children or adolescents (0-18 years). A structured questionnaire was distributed to 401 retail purchasers at 12 different points of sale of meat in the Piedmont region in northwest Italy. Socio-demographic variables and quantitative-qualitative meat consumption habits of retail purchasers were investigated. One part of the questionnaire analyzed the relative importance of 12 meat choice purchasing attributes by employing the Best-Worst scaling methodology, a type of choice experiment. Our research found that households without children (subset B) have higher weekly meat consumption than those with children (subset A). Alternatively, the households with children (subset A) have a diet characterized by a greater variety of protein sources, such as legumes and fish. Both of the considered subsets preferred trusted butchers for meat buying, with supermarkets as a second choice. However, only consumers of subset A bought meat from farm butchers. Our team performed a consumer analysis to identify meat consumption patterns in the two considered subsets. Simultaneously, a Best-Worst analysis showed that several choice attributes had different relevance for the two investigated samples, segmenting purchasers into three clusters.
Furlanello, Cesare; Serafini, Maria; Merler, Stefano; Jurman, Giuseppe
2003-11-06
We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process). With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles. Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.
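A compressed sketch of entropy-guided chunk elimination in the spirit of E-RFE follows; the normalized-entropy heuristic and the 0.4 chunk factor are our assumptions for illustration, not the paper's exact rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)
active = np.arange(X.shape[1])
while active.size > 10:
    svm = LinearSVC(C=1.0, max_iter=5000).fit(X[:, active], y)
    w = np.abs(svm.coef_[0])
    p = w / w.sum()
    H = -(p * np.log(p + 1e-12)).sum() / np.log(p.size)  # normalized entropy
    # Near-uniform weights (high H): drop a large chunk of low-weight genes.
    # Peaked weights (low H): fall back toward one-at-a-time elimination.
    n_drop = max(1, int(active.size * 0.4 * H))
    active = active[np.argsort(w)[n_drop:]]
print("remaining genes:", np.sort(active))
```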
Hinkle, S.R.; Kauffman, L.J.; Thomas, M.A.; Brown, C.J.; McCarthy, K.A.; Eberts, S.M.; Rosen, Michael R.; Katz, B.G.
2009-01-01
Flow-model particle-tracking results and geochemical data from seven study areas across the United States were analyzed using three statistical methods to test the hypothesis that these variables can successfully be used to assess public supply well vulnerability to arsenic and uranium. Principal components analysis indicated that arsenic and uranium concentrations were associated with particle-tracking variables that simulate time of travel and water fluxes through aquifer systems and also through specific redox and pH zones within aquifers. Time-of-travel variables are important because many geochemical reactions are kinetically limited, and geochemical zonation can account for different modes of mobilization and fate. Spearman correlation analysis established statistical significance for correlations of arsenic and uranium concentrations with variables derived using the particle-tracking routines. Correlations between uranium concentrations and particle-tracking variables were generally strongest for variables computed for distinct redox zones. Classification tree analysis on arsenic concentrations yielded a quantitative categorical model using time-of-travel variables and solid-phase-arsenic concentrations. The classification tree model accuracy on the learning data subset was 70%, and on the testing data subset, 79%, demonstrating one application in which particle-tracking variables can be used predictively in a quantitative screening-level assessment of public supply well vulnerability. Ground-water management actions that are based on avoidance of young ground water, reflecting the premise that young ground water is more vulnerable to anthropogenic contaminants than is old ground water, may inadvertently lead to increased vulnerability to natural contaminants due to the tendency for concentrations of many natural contaminants to increase with increasing ground-water residence time.
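The classification tree step can be illustrated with two toy predictors standing in for time-of-travel and solid-phase arsenic; the thresholds, label noise, and resulting accuracies below are invented, not the study's values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
travel_time = rng.lognormal(3.0, 1.0, n)     # toy particle time-of-travel
solid_as = rng.lognormal(1.0, 0.5, n)        # toy solid-phase arsenic
y = ((travel_time > 30) & (solid_as > 3)).astype(int)   # "elevated" class
y = np.where(rng.random(n) < 0.1, 1 - y, y)  # label noise for realism
X = np.column_stack([travel_time, solid_as])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(f"learning accuracy: {tree.score(X_tr, y_tr):.2f}, "
      f"testing accuracy: {tree.score(X_te, y_te):.2f}")
```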
Moyé, Lemuel A; Lai, Dejian; Jing, Kaiyan; Baraniuk, Mary Sarah; Kwak, Minjung; Penn, Marc S; Wu, Colon O
2011-01-01
The assumptions that anchor large clinical trials are rooted in smaller, Phase II studies. In addition to specifying the target population, intervention delivery, and patient follow-up duration, physician-scientists who design these Phase II studies must select the appropriate response variables (endpoints). However, endpoint measures can be problematic. If the endpoint assesses the change in a continuous measure over time, then the occurrence of an intervening significant clinical event (SCE), such as death, can preclude the follow-up measurement. Finally, the ideal continuous endpoint measurement may be contraindicated in a fraction of the study patients, a change that requires a less precise substitution in this subset of participants. A score function that is based on the U-statistic can address these issues of 1) intercurrent SCE's and 2) response variable ascertainments that use different measurements of different precision. The scoring statistic is easy to apply, clinically relevant, and provides flexibility for the investigators' prospective design decisions. Sample size and power formulations for this statistic are provided as functions of clinical event rates and effect size estimates that are easy for investigators to identify and discuss. Examples are provided from current cardiovascular cell therapy research.
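A pairwise score of the kind described, in which an SCE decides a pair and otherwise the continuous change does, can be sketched as follows; this is a Finkelstein-Schoenfeld-style illustration, not the paper's exact statistic.

```python
import numpy as np

def win_score(treat_events, treat_change, ctrl_events, ctrl_change):
    """Pairwise score: an SCE decides a pair; otherwise the continuous
    change does, when both members of the pair were measured."""
    score = 0
    for ei, ci in zip(treat_events, treat_change):
        for ej, cj in zip(ctrl_events, ctrl_change):
            if ei != ej:                   # exactly one SCE in the pair
                score += 1 if ej else -1   # control's SCE favors treatment
            elif not ei and not (np.isnan(ci) or np.isnan(cj)):
                score += np.sign(ci - cj)  # larger improvement wins
    return score / (len(treat_events) * len(ctrl_events))

# NaN marks a follow-up measurement precluded by an intervening event.
t_ev = np.array([0, 0, 1]); t_ch = np.array([5.0, 2.0, np.nan])
c_ev = np.array([0, 1, 1]); c_ch = np.array([1.0, np.nan, np.nan])
print(win_score(t_ev, t_ch, c_ev, c_ch))   # ~0.56: treatment favored
```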
Exchange inlet optimization by genetic algorithm for improved RBCC performance
NASA Astrophysics Data System (ADS)
Chorkawy, G.; Etele, J.
2017-09-01
A genetic algorithm based on real parameter representation using a variable selection pressure and variable probability of mutation is used to optimize an annular air breathing rocket inlet called the Exchange Inlet. A rapid and accurate design method which provides estimates for air breathing, mixing, and isentropic flow performance is used as the engine of the optimization routine. Comparison to detailed numerical simulations shows that the design method yields desired exit Mach numbers to within approximately 1% over 75% of the annular exit area and predicts entrained air massflows to between 1% and 9% of numerically simulated values, depending on the flight condition. Optimum designs are shown to be obtained within approximately 8000 fitness function evaluations in a search space on the order of 10^6. The method is also shown to be able to identify beneficial values for particular alleles when they exist, while showing the ability to handle cases where physical and aphysical designs co-exist at particular values of a subset of alleles within a gene. For an air breathing engine based on a hydrogen fuelled rocket, an exchange inlet is designed which yields a predicted air entrainment ratio within 95% of the theoretical maximum.
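A real-coded genetic algorithm with a growing selection pressure and a decaying mutation probability can be sketched on a toy objective; the schedules and operators below are illustrative choices, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(pop):                        # toy objective to minimize
    return (pop ** 2).sum(axis=1)

pop = rng.uniform(-5, 5, (40, 6))       # 40 candidates, 6 real-valued genes
n_gen = 200
for gen in range(n_gen):
    f = sphere(pop)
    # Variable selection pressure: tournament size grows as search matures.
    k = 2 + gen // 50
    idx = rng.integers(0, len(pop), (len(pop), k))
    parents = pop[idx[np.arange(len(pop)), f[idx].argmin(axis=1)]]
    # Blend crossover between randomly paired parents.
    partners = parents[rng.permutation(len(parents))]
    alpha = rng.random((len(pop), 1))
    pop = alpha * parents + (1 - alpha) * partners
    # Variable mutation probability: decays over the generations.
    p_mut = 0.2 * (1 - gen / n_gen)
    mutate = rng.random(pop.shape) < p_mut
    pop = np.where(mutate, pop + rng.normal(0, 0.3, pop.shape), pop)
print("best objective value:", round(float(sphere(pop).min()), 4))
```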
Diaz, Sílvia O; Barros, António S; Goodfellow, Brian J; Duarte, Iola F; Galhano, Eulália; Pita, Cristina; Almeida, Maria do Céu; Carreira, Isabel M; Gil, Ana M
2013-06-07
Given the recognized lack of prenatal clinical methods for the early diagnosis of preterm delivery, intrauterine growth restriction, preeclampsia and gestational diabetes mellitus, and the continuing need for optimized diagnosis methods for specific chromosomal disorders (e.g., trisomy 21) and fetal malformations, this work sought specific metabolic signatures of these conditions in second trimester maternal urine, using ¹H Nuclear Magnetic Resonance (¹H NMR) metabolomics. Several variable importance to the projection (VIP)- and b-coefficient-based variable selection methods were tested, both individually and through their intersection, and the resulting data sets were analyzed by partial least-squares discriminant analysis (PLS-DA) and submitted to Monte Carlo cross validation (MCCV) and permutation tests to evaluate model predictive power. The NMR data subsets produced significantly improved PLS-DA models for all conditions except for pre-premature rupture of membranes. Specific urinary metabolic signatures were unveiled for central nervous system malformations, trisomy 21, preterm delivery, gestational diabetes, intrauterine growth restriction and preeclampsia, and biochemical interpretations were proposed. This work demonstrated, for the first time, the value of maternal urine profiling as a complementary means of prenatal diagnostics and early prediction of several poor pregnancy outcomes.
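For flavor, a short sketch of VIP score computation from a fitted PLS model using the standard VIP formula; the synthetic matrix stands in for the NMR data, and this is not the authors' exact selection pipeline.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    # Standard VIP: weights scaled by the y-variance each component explains.
    t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
    p, h = w.shape
    s = np.diag(t.T @ t @ q.T @ q)            # y-variance explained per component
    wnorm = w / np.linalg.norm(w, axis=0)     # normalize each weight vector
    return np.sqrt(p * (wnorm ** 2 @ s) / s.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 150))                # e.g., 150 spectral variables
y = X[:, 5] - X[:, 40] + rng.normal(0, 0.1, 60)
pls = PLSRegression(n_components=3).fit(X, y)
selected = np.where(vip_scores(pls) > 1.0)[0] # common "VIP > 1" selection rule
print(selected)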
Hercend, T; Griffin, J D; Bensussan, A; Schmidt, R E; Edson, M A; Brennan, A; Murray, C; Daley, J F; Schlossman, S F; Ritz, J
1985-01-01
The initial characterization of two monoclonal antibodies directed at antigens selectively expressed on large granular lymphocytes (LGL) is reported in the present paper. These two reagents, anti-natural killer (NK) H1A and anti-NKH2, were obtained following immunization of mouse spleen cells with a cloned human NK cell line termed JT3. In fresh human peripheral blood, both anti-NKH1A and anti-NKH2 selectively reacted with cells that appeared morphologically as large granular lymphocytes. However, complement lysis studies and two color fluorescence analysis demonstrated that some LGL express both antigens and other cells express only NKH1A or NKH2. Functional analysis of these subsets indicated that the population of NKH1A+ cells contains the entire pool of NK active lymphocytes, whereas expression of NKH2 antigen appeared to delineate a unique subpopulation of LGL which, in a resting state, display a low degree of spontaneous cytotoxicity. Expression of NKH1A and NKH2 was also investigated using a series of nine well characterized human NK clones. All NK clones were found to be NKH1A+ and four out of nine also expressed NKH2. These results strongly supported the view that NKH1A is a "pan-NK" associated antigen, and indicated that at least a fraction of cloned NKH2+ LGL are strongly cytotoxic. Anti-NKH1A was shown to have the same specificity as the previously described N901 antibody and was found here to precipitate a 200,000-220,000-mol wt molecule in SDS-polyacrylamide gel electrophoresis (PAGE) analysis. Anti-NKH2 was specific for a structure that migrates at 60,000 mol wt in SDS-PAGE analysis under reducing conditions. Two color immunofluorescence analysis of NKH1A, NKH2, and other NK-associated antigens (Leu7 and B73.1) demonstrated variable degrees of coexpression of these antigens, which confirmed that NKH1A and NKH2 define distinct cell surface structures. Anti-NKH1A and anti-NKH2 appear to be useful reagents for characterizing LGL present in human peripheral blood and for identifying functionally relevant subsets within this heterogeneous population of cytotoxic lymphocytes. PMID:3884668
Satellite Level 3 & 4 Data Subsetting at NASA GES DISC
NASA Technical Reports Server (NTRS)
Huwe, Paul; Su, Jian; Loeser, Carlee; Ostrenga, Dana; Rui, Hualan; Vollmer, Bruce
2017-01-01
Earth Science data are available in many file formats (NetCDF, HDF, GRB, etc.) and in a wide range of sizes, from kilobytes to gigabytes. These properties have become a challenge to users if they are not familiar with these formats or only want a small region of interest (ROI) from a specific dataset. At NASA Goddard Earth Sciences Data and Information Services Center (GES DISC), we have developed and implemented a multipurpose subset service to ease user access to Earth Science data. Our Level 3 & 4 Regridder is capable of subsetting across multiple parameters (spatially, temporally, by level, and by variable) as well as having additional beneficial features (temporal means, regridding to target grids, and file conversion to other data formats). In this presentation, we will demonstrate how users can use this service to better access only the data they need in the form they require.
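The kind of operation the Level 3 & 4 Regridder performs can be approximated in a few lines of xarray; the file name, variable, and target grid below are assumptions for illustration, not the service's actual interface.

import numpy as np
import xarray as xr

ds = xr.open_dataset("level3_product.nc")     # hypothetical granule

subset = ds["precipitation"].sel(             # variable, spatial, and temporal subset
    lat=slice(30, 45), lon=slice(-110, -90),
    time=slice("2016-01-01", "2016-12-31"))

monthly = subset.resample(time="1M").mean()   # optional temporal means

regridded = monthly.interp(                   # regrid to a coarser target grid
    lat=np.arange(30, 45.5, 0.5), lon=np.arange(-110, -89.5, 0.5))

regridded.to_netcdf("subset.nc")              # or convert via to_dataframe().to_csv(...)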
Li, Juan
2011-03-01
To study the pattern of change in serum IL-6, TNF-α and peripheral blood T lymphocyte subsets in pregnant women during the perinatal period. 100 pregnant women in our hospital from November 2009 to October 2010 were selected as the study population, and serum IL-6, TNF-α and peripheral blood T lymphocyte subsets were analyzed and compared before labor onset, at labor onset, and on the first and third days after delivery. Serum IL-6 and TNF-α at labor onset were higher than before labor onset and on the first and third days after delivery, while CD3(+), CD4(+), CD8(+) and the CD4/CD8 ratio decreased first and then increased; all differences were significant (P < 0.05). The changes in serum IL-6, TNF-α and peripheral blood T lymphocyte subsets in pregnant women during the perinatal period follow a regular pattern and are worthy of clinical attention.
MODIS Interactive Subsetting Tool (MIST)
NASA Astrophysics Data System (ADS)
McAllister, M.; Duerr, R.; Haran, T.; Khalsa, S. S.; Miller, D.
2008-12-01
In response to requests from the user community, NSIDC has teamed with the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) and the Moderate Resolution Data Center (MrDC) to provide time series subsets of satellite data covering stations in the Greenland Climate Network (GC-Net) and the International Arctic Systems for Observing the Atmosphere (IASOA) network. To serve these data NSIDC created the MODIS Interactive Subsetting Tool (MIST). MIST works with 7 km by 7 km subset time series of certain Version 5 (V005) MODIS products over GC-Net and IASOA stations. User-selected data are delivered in a text Comma Separated Value (CSV) file format. MIST also provides online analysis capabilities that include generating time series and scatter plots. Currently, MIST is a beta prototype, and NSIDC intends that user requests will drive future development of the tool. The intent of this poster is to introduce MIST to the MODIS data user audience and illustrate some of the online analysis capabilities.
Fish swarm intelligence to optimize real-time monitoring of chips drying using machine vision
NASA Astrophysics Data System (ADS)
Hendrawan, Y.; Hawa, L. C.; Damayanti, R.
2018-03-01
This study applied a machine vision-based monitoring system intended to optimise the drying process of cassava chips. The objective is to propose a fish swarm intelligence (FSI) optimization algorithm to find the most significant set of image features for predicting the water content of cassava chips during drying with an artificial neural network (ANN) model. Feature selection entails choosing the feature subset that maximizes the prediction accuracy of the ANN. Multi-objective optimization (MOO) was used, consisting of prediction-accuracy maximization and feature-subset size minimization. The results showed that the best feature subset comprised grey mean, L(Lab) mean, a(Lab) energy, red entropy, hue contrast, and grey homogeneity. This subset was tested successfully in the ANN model to describe the relationship between image features and the water content of cassava chips during drying, with an R² between real and predicted data of 0.9.
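A toy version of the wrapper's two-objective fitness, trading ANN prediction accuracy against subset size; the data and feature indices are synthetic placeholders, not the study's image features.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 10))                   # candidate colour/texture features
y = X[:, 0] + X[:, 3] + rng.normal(0, 0.05, 120)  # stand-in for water content

def fitness(subset):
    ann = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
    r2 = cross_val_score(ann, X[:, subset], y, cv=3, scoring="r2").mean()
    return r2, -len(subset)                       # maximize accuracy, minimize size

print(fitness([0, 3]))                            # compact, informative subset
print(fitness(list(range(10))))                   # full feature set for comparison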
McClelland, Shawn; Brennan, Gary P; Dubé, Celine; Rajpara, Seeta; Iyer, Shruti; Richichi, Cristina; Bernard, Christophe; Baram, Tallie Z
2014-01-01
The mechanisms generating epileptic neuronal networks following insults such as severe seizures are unknown. We have previously shown that interfering with the function of the neuron-restrictive silencer factor (NRSF/REST), an important transcription factor that influences neuronal phenotype, attenuated development of this disorder. In this study, we found that epilepsy-provoking seizures increased the low NRSF levels in mature hippocampus several fold yet surprisingly, provoked repression of only a subset (∼10%) of potential NRSF target genes. Accordingly, the repressed gene-set was rescued when NRSF binding to chromatin was blocked. Unexpectedly, genes selectively repressed by NRSF had mid-range binding frequencies to the repressor, a property that rendered them sensitive to moderate fluctuations of NRSF levels. Genes selectively regulated by NRSF during epileptogenesis coded for ion channels, receptors, and other crucial contributors to neuronal function. Thus, dynamic, selective regulation of NRSF target genes may play a role in influencing neuronal properties in pathological and physiological contexts. DOI: http://dx.doi.org/10.7554/eLife.01267.001 PMID:25117540
Dense mesh sampling for video-based facial animation
NASA Astrophysics Data System (ADS)
Peszor, Damian; Wojciechowska, Marzena
2016-06-01
The paper describes an approach for selecting feature points on a three-dimensional triangle mesh obtained using various techniques from several video footages. This approach has a dual purpose. First, it minimizes the data stored for the purpose of facial animation: instead of storing the position of each vertex in each frame, one can store only a small subset of vertices for each frame and calculate the positions of the others from the subset. Second, it selects feature points that can be used for anthropometry-based retargeting of recorded mimicry to another model, with a sampling density beyond what can be achieved using marker-based performance capture techniques. The developed approach was successfully tested on artificial models, models constructed using a structured light scanner, and models constructed from video footage using stereophotogrammetry.
Kraschnewski, Jennifer L; Keyserling, Thomas C; Bangdiwala, Shrikant I; Gizlice, Ziya; Garcia, Beverly A; Johnston, Larry F; Gustafson, Alison; Petrovic, Lindsay; Glasgow, Russell E; Samuel-Hodge, Carmen D
2010-01-01
Studies of type 2 translation, the adaptation of evidence-based interventions to real-world settings, should include representative study sites and staff to improve external validity. Sites for such studies are, however, often selected by convenience sampling, which limits generalizability. We used an optimized probability sampling protocol to select an unbiased, representative sample of study sites to prepare for a randomized trial of a weight loss intervention. We invited North Carolina health departments within 200 miles of the research center to participate (N = 81). Of the 43 health departments that were eligible, 30 were interested in participating. To select a representative and feasible sample of 6 health departments that met inclusion criteria, we generated all combinations of 6 from the 30 health departments that were eligible and interested. From the subset of combinations that met inclusion criteria, we selected 1 at random. Of 593,775 possible combinations of 6 counties, 15,177 (3%) met inclusion criteria. Sites in the selected subset were similar to all eligible sites in terms of health department characteristics and county demographics. Optimized probability sampling improved generalizability by ensuring an unbiased and representative sample of study sites.
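The selection step can be reproduced in a few lines; meets_criteria below is a hypothetical stand-in for the study's actual inclusion rules.

import itertools
import random

departments = list(range(30))                 # the 30 eligible, interested sites

def meets_criteria(combo):
    # Placeholder rule; the real protocol used study-specific feasibility criteria.
    return sum(combo) % 20 == 0

feasible = [c for c in itertools.combinations(departments, 6)  # C(30, 6) = 593,775
            if meets_criteria(c)]

random.seed(1)
selected = random.choice(feasible)            # every feasible subset equally likely
print(len(feasible), selected)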
NASA Astrophysics Data System (ADS)
Borgelt, Christian
In clustering we often face the situation that only a subset of the available attributes is relevant for forming clusters, even though this may not be known beforehand. In such cases it is desirable to have a clustering algorithm that automatically weights attributes or even selects a proper subset. In this paper I study such an approach for fuzzy clustering, based on the idea of transferring an alternative to the fuzzifier (Klawonn and Höppner, What is fuzzy about fuzzy clustering? Understanding and improving the concept of the fuzzifier, In: Proc. 5th Int. Symp. on Intelligent Data Analysis, 254-264, Springer, Berlin, 2003) to attribute-weighting fuzzy clustering (Keller and Klawonn, Int J Uncertain Fuzziness Knowl Based Syst 8:735-746, 2000). In addition, by reformulating Gustafson-Kessel fuzzy clustering, a scheme for weighting and selecting principal axes can be obtained. While in Borgelt (Feature weighting and feature selection in fuzzy clustering, In: Proc. 17th IEEE Int. Conf. on Fuzzy Systems, IEEE Press, Piscataway, NJ, 2008) I already presented such an approach for a global selection of attributes and principal axes, this paper extends it to a cluster-specific selection, thus arriving at a fuzzy subspace clustering algorithm (Parsons, Haque, and Liu, 2004).
Hypergraph Based Feature Selection Technique for Medical Diagnosis.
Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar
2016-11-01
The impact of the internet and information systems across various domains has resulted in the substantial generation of multidimensional datasets. Data mining and knowledge discovery techniques that extract the information contained in these multidimensional datasets play a significant role in exploiting their full benefit. The presence of a large number of features in high-dimensional datasets incurs high computational cost in terms of computing power and time. Hence, feature selection is commonly used to build robust machine learning models by selecting a subset of relevant features that retains the maximal information content of the original dataset. In this paper, a novel Rough Set based K-Helly feature selection technique (RSKHT), which hybridizes Rough Set Theory (RST) and the K-Helly property of hypergraph representation, is designed to identify the optimal feature subset or reduct for medical diagnostic applications. Experiments carried out using medical datasets from the UCI repository demonstrate the dominance of RSKHT over other feature selection techniques with respect to reduct size, classification accuracy and time complexity. The performance of RSKHT was validated using the WEKA tool, showing that RSKHT is computationally attractive and flexible over massive datasets.
The proposal of architecture for chemical splitting to optimize QSAR models for aquatic toxicity.
Colombo, Andrea; Benfenati, Emilio; Karelson, Mati; Maran, Uko
2008-06-01
One of the challenges in the field of quantitative structure-activity relationship (QSAR) analysis is the correct assignment of a chemical compound to an appropriate model for the prediction of activity. Thus, in previous studies, compounds have been divided into distinct groups according to their mode of action or chemical class. In the current study, theoretical molecular descriptors were used to divide 568 organic substances, with toxicity measured as the 96-h median lethal concentration for the fathead minnow (Pimephales promelas), into subsets. Simple constitutional descriptors, such as the number of aliphatic and aromatic rings, and a quantum chemical descriptor, the maximum bond order of a carbon atom, divide the compounds into nine subsets. For each subset of compounds, automatic forward selection of descriptors was applied to construct QSAR models. Significant correlations were achieved for each subset of chemicals, and all models were validated with the leave-one-out internal validation procedure (R²cv ≈ 0.80). The results encourage consideration of this alternative route to the prediction of toxicity, using QSAR subset models without direct reference to the mechanism of toxic action or the traditional chemical classification.
Wang, Jie; Feng, Zuren; Lu, Na; Luo, Jing
2018-06-01
Feature selection plays an important role in motor imagery pattern classification based on EEG signals. It is a process that aims to select an optimal feature subset from the original set. Two significant advantages are involved: lowering the computational burden so as to speed up the learning procedure, and removing redundant and irrelevant features so as to improve classification performance. Therefore, feature selection is widely employed in the classification of EEG signals in practical brain-computer interface systems. In this paper, we present a novel statistical model to select the optimal feature subset based on the Kullback-Leibler divergence measure, and to automatically select the optimal subject-specific time segment. The proposed method comprises four successive stages: broad frequency band filtering and common spatial pattern enhancement as preprocessing; feature extraction by autoregressive model and log-variance; Kullback-Leibler divergence based optimal feature and time segment selection; and linear discriminant analysis classification. More importantly, this paper provides a potential framework for combining other feature extraction models and classification algorithms with the proposed method for EEG signal classification. Experiments on single-trial EEG signals from two public competition datasets not only demonstrate that the proposed method is effective in selecting discriminative features and time segments, but also show that it yields relatively better classification results in comparison with other competitive methods. Copyright © 2018 Elsevier Ltd. All rights reserved.
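A hedged sketch of Kullback-Leibler-based feature ranking: each feature's class-conditional distributions are modeled as univariate Gaussians and scored by the symmetrized KL divergence. The details of the authors' statistical model differ; this only conveys the flavor of the criterion.

import numpy as np

def gauss_kl(m0, s0, m1, s1):
    """KL(N0 || N1) for univariate Gaussians."""
    return np.log(s1 / s0) + (s0**2 + (m0 - m1) ** 2) / (2 * s1**2) - 0.5

def rank_features(X, y, top_k=8):
    scores = []
    for j in range(X.shape[1]):
        a, b = X[y == 0, j], X[y == 1, j]
        kl = gauss_kl(a.mean(), a.std(), b.mean(), b.std()) \
           + gauss_kl(b.mean(), b.std(), a.mean(), a.std())
        scores.append(kl)
    return np.argsort(scores)[::-1][:top_k]   # most discriminative features first

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20)); y = rng.integers(0, 2, 100)
X[y == 1, 3] += 1.5                           # make feature 3 discriminative
print(rank_features(X, y))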
NASA Astrophysics Data System (ADS)
Golay, Jean; Kanevski, Mikhaïl
2013-04-01
The present research deals with the exploration and modeling of a complex dataset of 200 measurement points of sediment pollution by heavy metals in Lake Geneva. The fundamental idea was to use multivariate Artificial Neural Networks (ANN) along with geostatistical models and tools in order to improve the accuracy and the interpretability of data modeling. The results obtained with ANN were compared to those of traditional geostatistical algorithms like ordinary (co)kriging and (co)kriging with an external drift. Exploratory data analysis highlighted a great variety of relationships (i.e. linear, non-linear, independence) between the 11 variables of the dataset (i.e. Cadmium, Mercury, Zinc, Copper, Titanium, Chromium, Vanadium and Nickel as well as the spatial coordinates of the measurement points and their depth). Then, exploratory spatial data analysis (i.e. anisotropic variography, local spatial correlations and moving window statistics) was carried out. It showed that the different phenomena to be modeled were characterized by high spatial anisotropies, complex spatial correlation structures and heteroscedasticity. A feature selection procedure based on General Regression Neural Networks (GRNN) was also applied to create subsets of variables enabling improved predictions during the modeling phase. The basic modeling was conducted using a Multilayer Perceptron (MLP), a workhorse of ANN. MLP models are robust and highly flexible tools which can incorporate, in a nonlinear manner, different kinds of high-dimensional information. In the present research, the input layer was made of either two neurons (spatial coordinates) or three neurons (when depth, as auxiliary information, could capture an underlying trend), and the output layer was composed of one (univariate MLP) to eight neurons corresponding to the heavy metals of the dataset (multivariate MLP). MLP models with three input neurons can be referred to as Artificial Neural Networks with EXternal drift (ANNEX). The exact number of output neurons and the selection of the corresponding variables were based on the subsets created during the exploratory phase. Concerning hidden layers, no restrictions were made and multiple architectures were tested. For each MLP model, the quality of the modeling procedure was assessed by variograms: if the variogram of the residuals demonstrates a pure nugget effect, and if the level of the nugget exactly corresponds to the nugget value of the theoretical variogram of the corresponding variable, all the structured information has been correctly extracted without overfitting. It is worth mentioning that simple MLP models are not always able to remove all the spatial correlation structure from the data; in that case, Neural Network Residual Kriging (NNRK) can be carried out and risk assessment can be conducted with Neural Network Residual Simulations (NNRS). Finally, the results of the ANNEX models were compared to those of ordinary (co)kriging and (co)kriging with an external drift. The ANNEX models performed better than traditional geostatistical algorithms when the relationship between the variable of interest and the auxiliary predictor was not linear.
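As a rough sketch of the ANNEX idea (coordinates plus depth as external drift feeding a multivariate MLP), under synthetic data; the architecture search and the variogram check on the residuals described above are omitted.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))        # x, y coordinates plus depth (external drift)
Y = rng.uniform(size=(200, 8))        # eight heavy-metal concentrations

annex = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(20, 10), max_iter=5000, random_state=0))
annex.fit(X, Y)

residuals = Y - annex.predict(X)      # should show a pure nugget variogram if all
                                      # structured spatial information was extracted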
Enantioselectivity in Candida antarctica lipase B: A molecular dynamics study
Raza, Sami; Fransson, Linda; Hult, Karl
2001-01-01
A major problem in predicting the enantioselectivity of an enzyme toward substrate molecules is that even high selectivity toward one substrate enantiomer over the other corresponds to a very small difference in free energy. However, total free energies in enzyme-substrate systems are very large and fluctuate significantly because of general protein motion. Candida antarctica lipase B (CALB), a serine hydrolase, displays enantioselectivity toward secondary alcohols. Here, we present a modeling study where the aim has been to develop a molecular dynamics-based methodology for the prediction of enantioselectivity in CALB. The substrates modeled (seven in total) were 3-methyl-2-butanol with various aliphatic carboxylic acids and also 2-butanol, as well as 3,3-dimethyl-2-butanol with octanoic acid. The tetrahedral reaction intermediate was used as a model of the transition state. Investigative analyses were performed on ensembles of nonminimized structures and focused on the potential energies of a number of subsets within the modeled systems to determine which specific regions are important for the prediction of enantioselectivity. One category of subset was based on atoms that make up the core structural elements of the transition state. We considered that a more favorable energetic conformation of such a subset should relate to a greater likelihood for catalysis to occur, thus reflecting higher selectivity. The results of this study conveyed that the use of this type of subset was viable for the analysis of structural ensembles and yielded good predictions of enantioselectivity. PMID:11266619
Voils, Corrine I.; Olsen, Maren K.; Williams, John W.; for the IMPACT Study Investigators
2008-01-01
Objective: To determine whether a subset of depressive symptoms could be identified to facilitate diagnosis of depression in older adults in primary care. Method: Secondary analysis was conducted on 898 participants aged 60 years or older with major depressive disorder and/or dysthymic disorder (according to DSM-IV criteria) who participated in the Improving Mood–Promoting Access to Collaborative Treatment (IMPACT) study, a multisite, randomized trial of collaborative care for depression (recruitment from July 1999 to August 2001). Linear regression was used to identify a core subset of depressive symptoms associated with decreased social, physical, and mental functioning. The sensitivity and specificity, adjusting for selection bias, were evaluated for these symptoms. The sensitivity and specificity of a second subset of 4 depressive symptoms previously validated in a midlife sample was also evaluated. Results: Psychomotor changes, fatigue, and suicidal ideation were associated with decreased functioning and served as the core set of symptoms. Adjusting for selection bias, the sensitivity of these 3 symptoms was 0.012 and specificity 0.994. The sensitivity of the 4 symptoms previously validated in a midlife sample was 0.019 and specificity was 0.997. Conclusion: We identified 3 depression symptoms that were highly specific for major depressive disorder in older adults. However, these symptoms and a previously identified subset were too insensitive for accurate diagnosis. Therefore, we recommend a full assessment of DSM-IV depression criteria for accurate diagnosis. PMID:18311416
Heidema, A Geert; Boer, Jolanda M A; Nagelkerke, Nico; Mariman, Edwin C M; van der A, Daphne L; Feskens, Edith J M
2006-04-21
Genetic epidemiologists have taken on the challenge of identifying genetic polymorphisms involved in the development of diseases. Many have collected data on large numbers of genetic markers but are not familiar with available methods to assess their association with complex diseases. Statistical methods have been developed for analyzing the relation between large numbers of genetic and environmental predictors and disease or disease-related variables in genetic association studies. In this commentary we discuss logistic regression analysis; neural networks, including the parameter decreasing method (PDM) and genetic programming optimized neural networks (GPNN); and several non-parametric methods, which include the set association approach, the combinatorial partitioning method (CPM), the restricted partitioning method (RPM), the multifactor dimensionality reduction (MDR) method and the random forests approach. The relative strengths and weaknesses of these methods are highlighted. Logistic regression and neural networks can handle only a limited number of predictor variables, depending on the number of observations in the dataset. Therefore, they are less useful than the non-parametric methods for approaching association studies with large numbers of predictor variables. GPNN, on the other hand, may be a useful approach to select and model important predictors, but its ability to select the important effects in the presence of large numbers of predictors needs to be examined. Both the set association approach and the random forests approach are able to handle a large number of predictors and are useful in reducing these predictors to a subset with an important contribution to disease. The combinatorial methods give more insight into combination patterns for sets of genetic and/or environmental predictor variables that may be related to the outcome variable. As the non-parametric methods have different strengths and weaknesses, we conclude that for genetic association studies using the case-control design, the application of a combination of several methods, including the set association approach, MDR and the random forests approach, will likely be a useful strategy to find the important genes and interaction patterns involved in complex diseases.
Cuq, Benoît; Blois, Shauna L; Wood, R Darren; Monteith, Gabrielle; Abrams-Ogg, Anthony C; Bédard, Christian; Wood, Geoffrey A
2018-06-01
Thrombin plays a central role in hemostasis and thrombosis. Calibrated automated thrombography (CAT), a thrombin generation assay, may be a useful test for hemostatic disorders in dogs. To describe CAT results in a group of healthy dogs, and assess preanalytical variables and biological variability. Forty healthy dogs were enrolled. Lag time (Lag), time to peak (ttpeak), peak thrombin generation (peak), and endogenous thrombin potential (ETP) were measured. Direct jugular venipuncture and winged-needle catheter-assisted saphenous venipuncture were used to collect samples from each dog, and results were compared between methods. Sample stability at -80°C was assessed over 12 months in a subset of samples. Biological variability of CAT was assessed via nested ANOVA using samples obtained weekly from a subset of 9 dogs for 4 consecutive weeks. Samples for CAT were stable at -80°C over 12 months of storage. Samples collected via winged-needle catheter venipuncture showed poor repeatability compared to direct venipuncture samples; there was also poor agreement between the 2 sampling methods. Intra-individual variability of CAT parameters was below 25%; inter-individual variability ranged from 36.9% to 78.5%. Measurement of thrombin generation using CAT appears to be repeatable in healthy dogs, and samples are stable for at least 12 months when stored at -80°C. Direct venipuncture sampling is recommended for CAT. Low indices of individuality suggest that subject-based reference intervals are more suitable when interpreting CAT results. © 2018 American Society for Veterinary Clinical Pathology.
Cao, Xun; Luo, Rong-Zhen; He, Li-Ru; Li, Yong; Lin, Wen-Qian; Chen, You-Fang; Wen, Zhe-Sheng
2011-08-26
Lung metastases arising from nasopharyngeal carcinomas (NPC) have a relatively favourable prognosis. The purpose of this study was to identify prognostic factors and to establish a risk grouping in patients with lung metastases from NPC. A total of 198 patients who developed lung metastases from NPC after primary therapy were retrospectively recruited from January 1982 to December 2000. Univariate and multivariate analyses of clinical variables were performed using Cox proportional hazards regression models. Actuarial survival rates were plotted against time using the Kaplan-Meier method, and log-rank testing was used to compare the differences between the curves. The median overall survival (OS) period and the lung metastasis survival (LMS) period were 51.5 and 20.9 months, respectively. After univariate and multivariate analyses of the clinical variables, age, T classification, N classification, site of metastases, secondary metastases and disease-free interval (DFI) correlated with OS, whereas age, VCA-IgA titre, number of metastases and secondary metastases were related to LMS. The prognoses of the low- (score 0-1), intermediate- (score 2-3) and high-risk (score 4-8) subsets based on these factors were significantly different. The 3-, 5- and 10-year survival rates were 77.3%, 60% and 59% in the low-risk subset; 52.3%, 30% and 27.8% in the intermediate-risk subset; and 20.5%, 7% and 0% in the high-risk subset (P < 0.001). In this study, clinical variables provided prognostic indicators of survival in NPC patients with lung metastases. Risk subsets would help in a more accurate assessment of a patient's prognosis in the clinical setting and could facilitate the establishment of patient-tailored medical strategies and supports.
Federal Register 2010, 2011, 2012, 2013, 2014
2012-07-26
... Proposed Rule Change Amending NYSE Arca Equities Rule 7.31(h) To Add a PL Select Order Type July 20, 2012...(h) to add a PL Select Order type. The proposed rule change was published for comment in the Federal... security at a specified, undisplayed price. The PL Select Order would be a subset of the PL Order that...
Gene selection for microarray data classification via subspace learning and manifold regularization.
Tang, Chang; Cao, Lijuan; Zheng, Xiao; Wang, Minhui
2017-12-19
With the rapid development of DNA microarray technology, a large amount of genomic data has been generated. Classification of these microarray data is a challenging task because gene expression data typically contain thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data, with irrelevant and redundant genes removed. Compared with the original data, the selected gene subset can benefit the classification task. We formulate gene selection as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high-dimensional microarray data into a lower-dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of the original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator for the different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better than other state-of-the-art methods in terms of microarray data classification.
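The paper's iterative solver is beyond a short sketch; as a flavor of graph-regularized gene scoring, here is the classical Laplacian score, a related but simpler criterion (explicitly not the authors' method), computed from a kNN graph over the samples.

import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                 # 60 samples x 200 genes (synthetic)

W = kneighbors_graph(X, n_neighbors=5, mode="connectivity", include_self=False)
W = 0.5 * (W + W.T).toarray()                  # symmetrized adjacency matrix
D = np.diag(W.sum(axis=1))
L = D - W                                      # unnormalized graph Laplacian

scores = []
d = np.diag(D)
for j in range(X.shape[1]):
    f = X[:, j]
    f = f - (f @ d) / d.sum()                  # center w.r.t. the degree weights
    scores.append((f @ L @ f) / (f @ D @ f))   # small score = respects the manifold

top_genes = np.argsort(scores)[:20]
print(top_genes)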
A small number of candidate gene SNPs reveal continental ancestry in African Americans
KODAMAN, NURI; ALDRICH, MELINDA C.; SMITH, JEFFREY R.; SIGNORELLO, LISA B.; BRADLEY, KEVIN; BREYER, JOAN; COHEN, SARAH S.; LONG, JIRONG; CAI, QIUYIN; GILES, JUSTIN; BUSH, WILLIAM S.; BLOT, WILLIAM J.; MATTHEWS, CHARLES E.; WILLIAMS, SCOTT M.
2013-01-01
Using genetic data from an obesity candidate gene study of self-reported African Americans and European Americans, we investigated the number of Ancestry Informative Markers (AIMs) and candidate gene SNPs necessary to infer continental ancestry. Proportions of African and European ancestry were assessed with STRUCTURE (K=2), using 276 AIMs. These reference values were compared to estimates derived using 120, 60, 30, and 15 SNP subsets randomly chosen from the 276 AIMs and from 1144 SNPs in 44 candidate genes. All subsets generated estimates of ancestry consistent with the reference estimates, with mean correlations greater than 0.99 for all subsets of AIMs, and mean correlations of 0.99 ± 0.003, 0.98 ± 0.01, 0.93 ± 0.03, and 0.81 ± 0.11 for subsets of 120, 60, 30, and 15 candidate gene SNPs, respectively. Among African Americans, the median absolute difference from reference African ancestry values ranged from 0.01 to 0.03 for the four AIMs subsets and from 0.03 to 0.09 for the four candidate gene SNP subsets. Furthermore, YRI/CEU Fst values provided a metric to predict the performance of candidate gene SNPs. Our results demonstrate that a small number of SNPs randomly selected from candidate genes can be used to estimate admixture proportions in African Americans reliably. PMID:23278390
Burel, Julie G.; Qian, Yu; Arlehamn, Cecilia Lindestam; Weiskopf, Daniela; Zapardiel-Gonzalo, Jose; Taplitz, Randy; Gilman, Robert H.; Saito, Mayuko; de Silva, Aruna D.; Vijayanand, Pandurangan; Scheuermann, Richard H.; Sette, Alessandro; Peters, Bjoern
2016-01-01
In the context of large-scale human system immunology studies, controlling for technical and biological variability is crucial to ensure that experimental data support research conclusions. Here, we report on a universal workflow to evaluate both technical and biological variation in multiparameter flow cytometry, applied to the development of a 10-color panel to identify all major cell populations and T cell subsets in cryopreserved PBMC. Replicate runs from a control donation and comparison of different gating strategies assessed technical variability associated with each cell population and permitted the calculation of a quality control score. Applying our panel to a large collection of PBMC samples, we found that most cell populations showed low intra-individual variability over time. In contrast, certain subpopulations such as CD56 T cells and Temra CD4 T cells were associated with high inter-individual variability. Age but not gender had a significant effect on the frequency of several populations, with a drastic decrease in naïve T cells observed in older donors. Ethnicity also influenced a significant proportion of immune cell population frequencies, emphasizing the need to account for these co-variates in immune profiling studies. Finally, we exemplify the usefulness of our workflow by identifying a novel cell-subset signature of latent tuberculosis infection. Thus, our study provides a universal workflow to establish and evaluate any flow cytometry panel in systems immunology studies. PMID:28069807
Burel, Julie G; Qian, Yu; Lindestam Arlehamn, Cecilia; Weiskopf, Daniela; Zapardiel-Gonzalo, Jose; Taplitz, Randy; Gilman, Robert H; Saito, Mayuko; de Silva, Aruna D; Vijayanand, Pandurangan; Scheuermann, Richard H; Sette, Alessandro; Peters, Bjoern
2017-02-15
In the context of large-scale human system immunology studies, controlling for technical and biological variability is crucial to ensure that experimental data support research conclusions. In this study, we report on a universal workflow to evaluate both technical and biological variation in multiparameter flow cytometry, applied to the development of a 10-color panel to identify all major cell populations and T cell subsets in cryopreserved PBMC. Replicate runs from a control donation and comparison of different gating strategies assessed the technical variability associated with each cell population and permitted the calculation of a quality control score. Applying our panel to a large collection of PBMC samples, we found that most cell populations showed low intraindividual variability over time. In contrast, certain subpopulations such as CD56 T cells and Temra CD4 T cells were associated with high interindividual variability. Age but not gender had a significant effect on the frequency of several populations, with a drastic decrease in naive T cells observed in older donors. Ethnicity also influenced a significant proportion of immune cell population frequencies, emphasizing the need to account for these covariates in immune profiling studies. We also exemplify the usefulness of our workflow by identifying a novel cell-subset signature of latent tuberculosis infection. Thus, our study provides a universal workflow to establish and evaluate any flow cytometry panel in systems immunology studies. Copyright © 2017 by The American Association of Immunologists, Inc.
Kline, Jeffrey A; Courtney, D Mark; Than, Martin P; Hogg, Kerstin; Miller, Chadwick D; Johnson, Charles L; Smithline, Howard A
2010-02-01
Attribute matching matches an explicit clinical profile of a patient to a reference database to estimate a numeric value for the pretest probability of an acute disease. The authors tested the accuracy of this method for forecasting a very low probability of venous thromboembolism (VTE) in symptomatic emergency department (ED) patients. The authors performed a secondary analysis of five data sets from 15 hospitals in three countries. All patients had data collected at the time of clinical evaluation for suspected pulmonary embolism (PE). The criterion standard to exclude VTE required no evidence of PE or deep venous thrombosis (DVT) within 45 days of enrollment. To estimate pretest probabilities, a computer program selected, from a large reference database of patients previously evaluated for PE, those patients who matched 10 predictor variables recorded for each current test patient. The authors compared the frequency of VTE [VTE(+)] in patients with a pretest probability estimate of <2.5% by attribute matching with that in patients with a Wells score of 0. The five data sets included 10,734 patients, and 747 (7.0%, 95% confidence interval [CI] = 6.5% to 7.5%) were VTE(+) within 45 days. The pretest probability estimate for PE was <2.5% in 2,975 of 10,734 (27.7%) patients, and within this subset, the observed frequency of VTE(+) was 48 of 2,975 (1.6%, 95% CI = 1.2% to 2.1%). The lowest possible Wells score (0) was observed in 3,412 (31.7%) patients, and within this subset, the observed frequency of VTE(+) was 79 of 3,412 (2.3%, 95% CI = 1.8% to 2.9%) patients. Attribute matching categorizes over one-quarter of patients tested for PE as having a pretest probability of <2.5%, and the observed rate of VTE within 45 days in this subset was <2.5%. (c) 2010 by the Society for Academic Emergency Medicine.
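A hedged sketch of attribute matching with pandas: select reference patients who match the test patient on all predictor variables, then use the outcome frequency in the matched subset as the pretest probability estimate. The file and the predictor names are illustrative stand-ins for the study's 10 matched variables.

import pandas as pd

reference = pd.read_csv("reference_patients.csv")     # hypothetical reference database
predictors = ["age_band", "sex", "heart_rate_band",   # illustrative names for
              "pleuritic_pain", "prior_vte"]          # the matched variables

def pretest_probability(patient: dict) -> float:
    mask = pd.Series(True, index=reference.index)
    for var in predictors:
        mask &= reference[var] == patient[var]        # exact match on each attribute
    matched = reference[mask]
    return matched["vte_positive"].mean() if len(matched) else float("nan")

patient = {"age_band": "40-49", "sex": "F", "heart_rate_band": "<100",
           "pleuritic_pain": 0, "prior_vte": 0}
print(pretest_probability(patient))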
Vafaee Sharbaf, Fatemeh; Mosafer, Sara; Moattar, Mohammad Hossein
2016-06-01
This paper proposes an approach for gene selection in microarray data. The proposed approach consists of a primary filter stage using the Fisher criterion, which reduces the initial genes and hence the search space and time complexity. Then, a wrapper approach based on cellular learning automata (CLA) optimized with ant colony optimization (ACO) is used to find the set of features that improves classification accuracy. CLA is applied due to its capability to learn and model complicated relationships. The features selected in the last phase are evaluated using the ROC curve, and the most effective yet smallest feature subset is determined. The classifiers evaluated in the proposed framework are K-nearest neighbor, support vector machine and naïve Bayes. The proposed approach is evaluated on 4 microarray datasets. The evaluations confirm that the proposed approach can find the smallest subset of genes while approaching the maximum accuracy. Copyright © 2016 Elsevier Inc. All rights reserved.
Nohara, Ryuki; Endo, Yui; Murai, Akihiko; Takemura, Hiroshi; Kouchi, Makiko; Tada, Mitsunori
2016-08-01
Individual human models are usually created by direct 3D scanning or by deforming a template model according to measured dimensions. In this paper, we propose a method to estimate all the dimensions necessary for human model individualization (the full set) from a small number of measured dimensions (the subset) and a human dimension database. For this purpose, we solved a multiple regression equation on the dimension database, with the full-set dimensions as the objective variables and the subset dimensions as the explanatory variables. The full-set dimensions are then obtained by simply multiplying the subset dimensions by the coefficient matrix of the regression equation. We verified the accuracy of our method by imputing hand, foot, and whole-body dimensions from their dimension databases, using leave-one-out cross validation. The mean absolute errors (MAE) between measured and estimated dimensions were computed from 4 dimensions (hand length, breadth, middle finger breadth at proximal, and middle finger depth at proximal) in the hand, 3 dimensions (foot length, breadth, and lateral malleolus height) in the foot, and height and weight in the whole body. The average MAE of non-measured dimensions was 4.58% in the hand, 4.42% in the foot, and 3.54% in the whole body, while that of measured dimensions was 0.00%.
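A minimal sketch of the regression step, under the assumption that the database is a matrix with the subset dimensions in some columns and the full-set dimensions in the rest; the data here are synthetic and the column roles are hypothetical.

import numpy as np

def fit_coefficients(db_subset, db_full):
    """Least-squares coefficients mapping subset dims (plus intercept) to the full set."""
    A = np.hstack([db_subset, np.ones((db_subset.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, db_full, rcond=None)
    return coef

def estimate_full(measured_subset, coef):
    a = np.append(measured_subset, 1.0)
    return a @ coef                      # full-set estimate by one matrix multiplication

rng = np.random.default_rng(0)
db_subset = rng.normal(170, 10, size=(500, 2))             # e.g., height and weight
db_full = db_subset @ rng.uniform(0.2, 0.6, size=(2, 30))  # 30 body dimensions
coef = fit_coefficients(db_subset, db_full)
print(estimate_full(np.array([172.0, 65.0]), coef))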
Hendry, G; North, D; Zewotir, T; Naidoo, R N
2014-09-28
Non-response in cross-sectional data is not uncommon and requires careful handling during the analysis stage so as not to bias results. In this paper, we illustrate how subset correspondence analysis can be applied in order to manage the non-response while at the same time retaining all observed data. This variant of correspondence analysis was applied to a set of epidemiological data in which relationships between numerous environmental, genetic, behavioural and socio-economic factors and their association with asthma severity in children were explored. The application of subset correspondence analysis revealed interesting associations between the measured variables that otherwise may not have been exposed. Many of the associations found confirm established theories in the literature regarding factors that exacerbate childhood asthma. Moderate to severe asthma was found to be associated with needing neonatal care, male children, 8- to 9-year-olds, exposure to tobacco smoke in vehicles and living in areas that suffer from extreme air pollution. Associations were found between mild persistent asthma and low birthweight, being exposed to smoke in the home and living in a home with up to four people. The classification of probable asthma was associated with a group of variables that indicate low socio-economic status. Copyright © 2014 John Wiley & Sons, Ltd.
Physiological IgM Class Catalytic Antibodies Selective for Transthyretin Amyloid
Planque, Stephanie A.; Nishiyama, Yasuhiro; Hara, Mariko; Sonoda, Sari; Murphy, Sarah K.; Watanabe, Kenji; Mitsuda, Yukie; Brown, Eric L.; Massey, Richard J.; Primmer, Stanley R.; O'Nuallain, Brian; Paul, Sudhir
2014-01-01
Peptide bond-hydrolyzing catalytic antibodies (catabodies) could degrade toxic proteins, but acquired immunity principles have not provided evidence for beneficial catabodies. Transthyretin (TTR) forms misfolded β-sheet aggregates responsible for age-associated amyloidosis. We describe nucleophilic catabodies from healthy humans without amyloidosis that degraded misfolded TTR (misTTR) without reactivity to the physiological tetrameric TTR (phyTTR). IgM class B cell receptors specifically recognized the electrophilic analog of misTTR but not phyTTR. IgM but not IgG class antibodies hydrolyzed the particulate and soluble misTTR species. No misTTR-IgM binding was detected. The IgMs accounted for essentially all of the misTTR hydrolytic activity of unfractionated human serum. The IgMs did not degrade non-amyloidogenic, non-superantigenic proteins. Individual monoclonal IgMs (mIgMs) expressed variable misTTR hydrolytic rates and differing oligoreactivity directed to amyloid β peptide and microbial superantigen proteins. A subset of the mIgMs was monoreactive for misTTR. Excess misTTR was dissolved by a hydrolytic mIgM. The studies reveal a novel antibody property, the innate ability of IgMs to selectively degrade and dissolve toxic misTTR species as a first line immune function. PMID:24648510
Search for anomalous kinematics in tt̄ dilepton events at CDF II.
Acosta, D; Adelman, J; Affolder, T; Akimoto, T; Albrow, M G; Ambrose, D; Amerio, S; Amidei, D; Anastassov, A; Anikeev, K; Annovi, A; Antos, J; Aoki, M; Apollinari, G; Arisawa, T; Arguin, J-F; Artikov, A; Ashmanskas, W; Attal, A; Azfar, F; Azzi-Bacchetta, P; Bacchetta, N; Bachacou, H; Badgett, W; Barbaro-Galtieri, A; Barker, G J; Barnes, V E; Barnett, B A; Baroiant, S; Barone, M; Bauer, G; Bedeschi, F; Behari, S; Belforte, S; Bellettini, G; Bellinger, J; Ben-Haim, E; Benjamin, D; Beretvas, A; Bhatti, A; Binkley, M; Bisello, D; Bishai, M; Blair, R E; Blocker, C; Bloom, K; Blumenfeld, B; Bocci, A; Bodek, A; Bolla, G; Bolshov, A; Booth, P S L; Bortoletto, D; Boudreau, J; Bourov, S; Brau, B; Bromberg, C; Brubaker, E; Budagov, J; Budd, H S; Burkett, K; Busetto, G; Bussey, P; Byrum, K L; Cabrera, S; Campanelli, M; Campbell, M; Canepa, A; Casarsa, M; Carlsmith, D; Carron, S; Carosi, R; Cavalli-Sforza, M; Castro, A; Catastini, P; Cauz, D; Cerri, A; Cerrito, L; Chapman, J; Chen, C; Chen, Y C; Chertok, M; Chiarelli, G; Chlachidze, G; Chlebana, F; Cho, I; Cho, K; Chokheli, D; Chou, J P; Chu, M L; Chuang, S; Chung, J Y; Chung, W-H; Chung, Y S; Ciobanu, C I; Ciocci, M A; Clark, A G; Clark, D; Coca, M; Connolly, A; Convery, M; Conway, J; Cooper, B; Cordelli, M; Cortiana, G; Cranshaw, J; Cuevas, J; Culbertson, R; Currat, C; Cyr, D; Dagenhart, D; Da Ronco, S; D'Auria, S; de Barbaro, P; De Cecco, S; De Lentdecker, G; Dell'Agnello, S; Dell'Orso, M; Demers, S; Demortier, L; Deninno, M; De Pedis, D; Derwent, P F; Dionisi, C; Dittmann, J R; Dörr, C; Doksus, P; Dominguez, A; Donati, S; Donega, M; Donini, J; D'Onofrio, M; Dorigo, T; Drollinger, V; Ebina, K; Eddy, N; Ehlers, J; Ely, R; Erbacher, R; Erdmann, M; Errede, D; Errede, S; Eusebi, R; Fang, H-C; Farrington, S; Fedorko, I; Fedorko, W T; Feild, R G; Feindt, M; Fernandez, J P; Ferretti, C; Field, R D; Flanagan, G; Flaugher, B; Flores-Castillo, L R; Foland, A; Forrester, S; Foster, G W; Franklin, M; Freeman, J C; Fujii, Y; Furic, I; Gajjar, A; Gallas, A; Galyardt, J; Gallinaro, M; Garcia-Sciveres, M; Garfinkel, A F; Gay, C; Gerberich, H; Gerdes, D W; Gerchtein, E; Giagu, S; Giannetti, P; Gibson, A; Gibson, K; Ginsburg, C; Giolo, K; Giordani, M; Giunta, M; Giurgiu, G; Glagolev, V; Glenzinski, D; Gold, M; Goldschmidt, N; Goldstein, D; Goldstein, J; Gomez, G; Gomez-Ceballos, G; Goncharov, M; González, O; Gorelov, I; Goshaw, A T; Gotra, Y; Goulianos, K; Gresele, A; Griffiths, M; Grosso-Pilcher, C; Grundler, U; Guenther, M; Guimaraes da Costa, J; Haber, C; Hahn, K; Hahn, S R; Halkiadakis, E; Hamilton, A; Han, B-Y; Handler, R; Happacher, F; Hara, K; Hare, M; Harr, R F; Harris, R M; Hartmann, F; Hatakeyama, K; Hauser, J; Hays, C; Hayward, H; Heider, E; Heinemann, B; Heinrich, J; Hennecke, M; Herndon, M; Hill, C; Hirschhbuehl, D; Hocker, A; Hoffman, K D; Holloway, A; Hou, S; Houlden, M A; Huffman, B T; Huang, Y; Hughes, R E; Huston, J; Ikado, K; Incandela, J; Introzzi, G; Iori, M; Ishizawa, Y; Issever, C; Ivanov, A; Iwata, Y; Iyutin, B; James, E; Jang, D; Jarrell, J; Jeans, D; Jensen, H; Jeon, E J; Jones, M; Joo, K K; Jun, S Y; Junk, T; Kamon, T; Kang, J; Karagoz Unel, M; Karchin, P E; Kartal, S; Kato, Y; Kemp, Y; Kephart, R; Kerzel, U; Khotilovich, V; Kilminster, B; Kim, D H; Kim, H S; Kim, J E; Kim, M J; Kim, M S; Kim, S B; Kim, S H; Kim, T H; Kim, Y K; King, B T; Kirby, M; Kirsch, L; Klimenko, S; Knuteson, B; Ko, B R; Kobayashi, H; Koehn, P; Kong, D J; Kondo, K; Konigsberg, J; Kordas, K; Korn, A; Korytov, A; Kotelnikov, K; Kotwal, A V; Kovalev, A; Kraus, J; 
Kravchenko, I; Kreymer, A; Kroll, J; Kruse, M; Krutelyov, V; Kuhlmann, S E; Kwang, S; Laasanen, A T; Lai, S; Lami, S; Lammel, S; Lancaster, J; Lancaster, M; Lander, R; Lannon, K; Lath, A; Latino, G; Lauhakangas, R; Lazzizzera, I; Le, Y; Lecci, C; LeCompte, T; Lee, J; Lee, J; Lee, S W; Lefèvre, R; Leonardo, N; Leone, S; Levy, S; Lewis, J D; Li, K; Lin, C; Lin, C S; Lindgren, M; Liss, T M; Lister, A; Litvintsev, D O; Liu, T; Liu, Y; Lockyer, N S; Loginov, A; Loreti, M; Loverre, P; Lu, R-S; Lucchesi, D; Lujan, P; Lukens, P; Lungu, G; Lyons, L; Lys, J; Lysak, R; MacQueen, D; Madrak, R; Maeshima, K; Maksimovic, P; Malferrari, L; Manca, G; Marginean, R; Marino, C; Martin, A; Martin, M; Martin, V; Martínez, M; Maruyama, T; Matsunaga, H; Mattson, M; Mazzanti, P; McFarland, K S; McGivern, D; McIntyre, P M; McNamara, P; NcNulty, R; Mehta, A; Menzemer, S; Menzione, A; Merkel, P; Mesropian, C; Messina, A; Miao, T; Miladinovic, N; Miller, L; Miller, R; Miller, J S; Miquel, R; Miscetti, S; Mitselmakher, G; Miyamoto, A; Miyazaki, Y; Moggi, N; Mohr, B; Moore, R; Morello, M; Movilla Fernandez, P A; Mukherjee, A; Mulhearn, M; Muller, T; Mumford, R; Munar, A; Murat, P; Nachtman, J; Nahn, S; Nakamura, I; Nakano, I; Napier, A; Napora, R; Naumov, D; Necula, V; Niell, F; Nielsen, J; Nelson, C; Nelson, T; Neu, C; Neubauer, M S; Newman-Holmes, C; Nigmanov, T; Nodulman, L; Norniella, O; Oesterberg, K; Ogawa, T; Oh, S H; Oh, Y D; Ohsugi, T; Okusawa, T; Oldeman, R; Orava, R; Orejudos, W; Pagliarone, C; Palencia, E; Paoletti, R; Papadimitriou, V; Pashapour, S; Patrick, J; Pauletta, G; Paulini, M; Pauly, T; Paus, C; Pellett, D; Penzo, A; Phillips, T J; Piacentino, G; Piedra, J; Pitts, K T; Plager, C; Pompos, A; Pondrom, L; Pope, G; Portell, X; Poukhov, O; Prakoshyn, F; Pratt, T; Pronko, A; Proudfoot, J; Ptohos, F; Punzi, G; Rademachker, J; Rahaman, M A; Rakitine, A; Rappoccio, S; Ratnikov, F; Ray, H; Reisert, B; Rekovic, V; Renton, P; Rescigno, M; Rimondi, F; Rinnert, K; Ristori, L; Robertson, W J; Robson, A; Rodrigo, T; Rolli, S; Rosenson, L; Roser, R; Rossin, R; Rott, C; Russ, J; Rusu, V; Ruiz, A; Ryan, D; Saarikko, H; Sabik, S; Safonov, A; St Denis, R; Sakumoto, W K; Salamanna, G; Saltzberg, D; Sanchez, C; Sansoni, A; Santi, L; Sarkar, S; Sato, K; Savard, P; Savoy-Navarro, A; Schlabach, P; Schmidt, E E; Schmidt, M P; Schmitt, M; Scodellaro, L; Scribano, A; Scuri, F; Sedov, A; Seidel, S; Seiya, Y; Semeria, F; Sexton-Kennedy, L; Sfiligoi, I; Shapiro, M D; Shears, T; Shepard, P F; Sherman, D; Shimojima, M; Shochet, M; Shon, Y; Shreyber, I; Sidoti, A; Siegrist, J; Siket, M; Sill, A; Sinervo, P; Sisakyan, A; Skiba, A; Slaughter, A J; Sliwa, K; Smirnov, D; Smith, J R; Snider, F D; Snihur, R; Soha, A; Somalwar, S V; Spalding, J; Spezziga, M; Spiegel, L; Spinella, F; Spiropulu, M; Squillacioti, P; Stadie, H; Stelzer, B; Stelzer-Chilton, O; Strologas, J; Stuart, D; Sukhanov, A; Sumorok, K; Sun, H; Suzuki, T; Taffard, A; Tafirout, R; Takach, S F; Takano, H; Takashima, R; Takeuchi, Y; Takikawa, K; Tanaka, M; Tanaka, R; Tanimoto, N; Tapprogge, S; Tecchio, M; Teng, P K; Terashi, K; Tesarek, R J; Tether, S; Thom, J; Thompson, A S; Thomson, E; Tipton, P; Tiwari, V; Trkaczyk, S; Toback, D; Tollefson, K; Tomura, T; Tonelli, D; Tönnesmann, M; Torre, S; Torretta, D; Tourneur, S; Trischuk, W; Tseng, J; Tsuchiya, R; Tsuno, S; Tsybychev, D; Turini, N; Turner, M; Ukegawa, F; Unverhau, T; Uozumi, S; Usynin, D; Vacavant, L; Vaiciulis, A; Varganov, A; Vataga, E; Vejcik, S; Velev, G; Veszpremi, V; Veramendi, G; Vickey, T; Vidal, R; Vila, I; 
Vilar, R; Vollrath, I; Volobouev, I; von der Mey, M; Wagner, P; Wagner, R G; Wagner, R L; Wagner, W; Wallny, R; Walter, T; Yamashita, T; Yamamoto, K; Wan, Z; Wang, M J; Wang, S M; Warburton, A; Ward, B; Waschke, S; Waters, D; Watts, T; Weber, M; Wester, W C; Whitehouse, B; Wicklund, A B; Wicklund, E; Williams, H H; Wilson, P; Winer, B L; Wittich, P; Wolbers, S; Wolter, M; Worcester, M; Worm, S; Wright, T; Wu, X; Würthwein, F; Wyatt, A; Yagil, A; Yang, C; Yang, U K; Yao, W; Yeh, G P; Yi, K; Yoh, J; Yoon, P; Yorita, K; Yoshida, T; Yu, I; Yu, S; Yu, Z; Yun, J C; Zanello, L; Zanetti, A; Zaw, I; Zetti, F; Zhou, J; Zsenei, A; Zucchelli, S
2005-07-08
We report on a search for anomalous kinematics of tt̄ dilepton events in pp̄ collisions at √s = 1.96 TeV using 193 pb⁻¹ of data collected with the CDF II detector. We developed a new a priori technique designed to isolate the subset of a data sample revealing the largest deviation from standard model (SM) expectations and to quantify the significance of this departure. In the four-variable space considered, no particular subset shows a significant discrepancy, and we find that the probability of obtaining a data sample less consistent with the SM than what is observed is 1.0%-4.5%.
NASA Technical Reports Server (NTRS)
Rietmeijer, F. J. M.
1989-01-01
Olivine-rich chondritic interplanetary dust particles (IDPs) are an important subset of fluffy chondritic IDPs collected in the earth's stratosphere. Particles in this subset are characterized by a matrix of nonporous, ultrafine-grained granular units. Euhedral single crystals, crystal fragments, and platey single crystals occur dispersed in the matrix. Analytical electron microscopy of granular units reveals predominant magnesium-rich olivines and FeNi-sulfides embedded in amorphous carbonaceous matrix material. The variable ratio of ultrafine-grained minerals to carbonaceous matrix material in granular units supports variable C/Si ratios, and some fraction of sulfur is associated with carbonaceous matrix material. The high Mg/(Mg+Fe) ratio in granular units is similar to the corresponding distribution in comet P/Halley dust. The chondritic composition of fine-grained, polycrystalline IDPs gradually breaks down into nonchondritic, and ultimately single-mineral, compositions as a function of decreasing particle mass. The relationship between particle mass and composition in the matrix of olivine-rich chondritic IDPs is comparable with the relationship inferred for comet P/Halley dust.
Strains, functions, and dynamics in the expanded Human Microbiome Project
Lloyd-Price, Jason; Mahurkar, Anup; Rahnavard, Gholamali; Crabtree, Jonathan; Orvis, Joshua; Hall, A. Brantley; Brady, Arthur; Creasy, Heather H.; McCracken, Carrie; Giglio, Michelle G.; McDonald, Daniel; Franzosa, Eric A.; Knight, Rob; White, Owen; Huttenhower, Curtis
2018-01-01
Summary The characterization of baseline microbial and functional diversity in the human microbiome has enabled studies of microbiome-related disease, microbial population diversity, biogeography, and molecular function. The NIH Human Microbiome Project (HMP) has provided one of the broadest such characterizations to date. Here, we introduce an expanded second phase of the study, abbreviated HMP1-II, comprising 1,631 new metagenomic samples (2,355 total) targeting diverse body sites with multiple time points in 265 individuals. We applied updated profiling and assembly methods to these data to provide new characterizations of microbiome personalization. Strain identification revealed distinct subspecies clades specific to body sites; it also quantified species with phylogenetic diversity under-represented in isolate genomes. Body-wide functional profiling classified pathways into universal, human-enriched, and body site-enriched subsets. Finally, temporal analysis decomposed microbial variation into rapidly variable, moderately variable, and stable subsets. This study furthers our knowledge of baseline human microbial diversity, thus enabling an understanding of personalized microbiome function and dynamics. PMID:28953883
Pitchers, W. R.; Brooks, R.; Jennions, M. D.; Tregenza, T.; Dworkin, I.; Hunt, J.
2013-01-01
Phenotypic integration and plasticity are central to our understanding of how complex phenotypic traits evolve. Evolutionary change in complex quantitative traits can be predicted using the multivariate breeders’ equation, but such predictions are only accurate if the matrices involved are stable over evolutionary time. Recent work, however, suggests that these matrices are temporally plastic, spatially variable and themselves evolvable. The data available on phenotypic variance-covariance matrix (P) stability is sparse, and largely focused on morphological traits. Here we compared P for the structure of the complex sexual advertisement call of six divergent allopatric populations of the Australian black field cricket, Teleogryllus commodus. We measured a subset of calls from wild-caught crickets from each of the populations and then a second subset after rearing crickets under common-garden conditions for three generations. In a second experiment, crickets from each population were reared in the laboratory on high- and low-nutrient diets and their calls recorded. In both experiments, we estimated P for call traits and used multiple methods to compare them statistically (Flury hierarchy, geometric subspace comparisons and random skewers). Despite considerable variation in means and variances of individual call traits, the structure of P was largely conserved among populations, across generations and between our rearing diets. Our finding that P remains largely stable, among populations and between environmental conditions, suggests that selection has preserved the structure of call traits in order that they can function as an integrated unit. PMID:23530814
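Of the three matrix-comparison methods named above, random skewers is the most self-contained: apply the same random unit-length selection gradients to both P matrices and average the vector correlation of the predicted responses. A minimal numpy sketch under that standard definition (toy matrices, not the cricket data):

    import numpy as np

    rng = np.random.default_rng(1)

    def random_skewers(P1, P2, n_skewers=1000):
        # Apply identical random selection vectors (skewers) to both
        # covariance matrices and correlate the predicted responses.
        k = P1.shape[0]
        beta = rng.normal(size=(n_skewers, k))
        beta /= np.linalg.norm(beta, axis=1, keepdims=True)  # unit-length skewers
        r1 = beta @ P1   # predicted response of population 1 to each skewer
        r2 = beta @ P2
        cos = np.sum(r1 * r2, axis=1) / (np.linalg.norm(r1, axis=1)
                                         * np.linalg.norm(r2, axis=1))
        return cos.mean()

    # Toy example: a covariance matrix and a slightly perturbed copy.
    A = rng.normal(size=(5, 5))
    P1 = A @ A.T
    P2 = P1 + 0.1 * np.eye(5)
    print(f"mean vector correlation: {random_skewers(P1, P2):.3f}")

Values near 1 indicate that the two matrices predict nearly identical responses to selection, which is the sense in which the abstract reports P as "largely conserved."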
Spatial Downscaling of Alien Species Presences using Machine Learning
NASA Astrophysics Data System (ADS)
Daliakopoulos, Ioannis N.; Katsanevakis, Stelios; Moustakas, Aristides
2017-07-01
Large scale, high-resolution data on alien species distributions are essential for spatially explicit assessments of their environmental and socio-economic impacts, and management interventions for mitigation. However, these data are often unavailable. This paper presents a method that relies on Random Forest (RF) models to distribute alien species presence counts at a finer resolution grid, thus achieving spatial downscaling. A sufficiently large number of RF models are trained using random subsets of the dataset as predictors, in a bootstrapping approach to account for the uncertainty introduced by the subset selection. The method is tested on an approximately 8×8 km2 grid containing floral alien species presences and several climatic, habitat, and land-use covariates for the Mediterranean island of Crete, Greece. Alien species presence is aggregated at 16×16 km2 and used as a predictor of presence at the original resolution, thus simulating spatial downscaling. Potential explanatory variables included habitat types, land cover richness, endemic species richness, soil type, temperature, precipitation, and freshwater availability. Uncertainty assessment of the spatial downscaling of alien species’ occurrences was also performed and true/false presences and absences were quantified. The approach is promising for downscaling alien species datasets of larger spatial scale but coarse resolution, where the underlying environmental information is available at a finer resolution than the alien species data. Furthermore, the RF architecture allows for tuning towards operationally optimal sensitivity and specificity, thus providing a decision support tool for designing a resource efficient alien species census.
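A sketch of the bootstrapped Random Forest ensemble described above, using scikit-learn; the covariates, counts, and grid sizes are placeholders, and the spread across bootstrap models plays the role of the paper's subset-selection uncertainty:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(2)

    # Toy stand-in for the Crete dataset: fine-grid cells with environmental
    # covariates (X) and alien-species presence counts (y).
    n_cells, n_covariates = 500, 7
    X = rng.normal(size=(n_cells, n_covariates))   # habitat, land cover, climate, ...
    y = rng.poisson(np.exp(0.5 * X[:, 0]))         # presence counts on the fine grid

    # Bootstrap ensemble: each RF sees a random resample of the data, so the
    # spread of its predictions reflects subset-selection uncertainty.
    predictions = []
    for seed in range(50):
        idx = rng.integers(0, n_cells, n_cells)    # bootstrap resample
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(X[idx], y[idx])
        predictions.append(rf.predict(X))

    pred = np.asarray(predictions)
    mean, spread = pred.mean(axis=0), pred.std(axis=0)   # estimate + uncertainty
    print(mean[:5].round(2), spread[:5].round(2))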
An overview of the Columbia Habitat Monitoring Program's (CHaMP) spatial-temporal design framework
We briefly review the concept of a master sample applied to stream networks in which a randomized set of stream sites is selected across a broad region to serve as a list of sites from which a subset of sites is selected to achieve multiple objectives of specific designs. The Col...
ERIC Educational Resources Information Center
Scott, Marcia Strong; Delgado, Christine F.; Tu, Shihfen; Fletcher, Kathryn L.
2005-01-01
In this study, predictive classification accuracy was used to select those tasks from a kindergarten screening battery that best identified children who, three years later, were classified as educable mentally handicapped or as having a specific learning disability. A subset of measures enabled correct classification of 91% of the children in…
[Varicocele and coincidental abacterial prostato-vesiculitis: negative role about the sperm output].
Vicari, Enzo; La Vignera, Sandro; Tracia, Angelo; Cardì, Francesco; Donati, Angelo
2003-03-01
To evaluate the frequency and the role of a coincidentally expressed abacterial prostato-vesiculitis (PV) on sperm output in patients with left varicocele (Vr). We evaluated 143 selected infertile patients (mean age 27 years, range 21-43), with oligo- and/or astheno- and/or teratozoospermia (OAT), subdivided into two groups. Group A included 76 patients with previous varicocelectomy and persistent OAT. Group B included 67 infertile patients (mean age 26 years, range 21-37) with OAT who had not undergone varicocelectomy. Patients with Vr and coincidental didymo-epididymal ultrasound (US) abnormalities were excluded from the study. Following rectal prostato-vesicular ultrasonography, each group was subdivided into two subsets on the basis of the absence (group A: subset Vr-/PV-; group B: subset Vr+/PV-) or the presence of an abacterial PV (group A: subset Vr-/PV+; group B: subset Vr+/PV+). PV was present in 47.4% and 41.8% of patients of groups A and B, respectively. This coincidental pathology was ipsilateral with Vr in 61% of the cases. Semen analysis was performed in all patients. Patients of group A showed a total sperm number significantly higher than that found in group B. In the presence of PV, sperm parameters were not significantly different between matched subsets (Vr-/PV+ vs. Vr+/PV+). In the absence of PV, the sperm density, the total sperm number and the percentage of forward motility in the subset with previous varicocelectomy (Vr-/PV-) were significantly higher than those found in the matched subset (Vr+/PV-). Sperm analysis alone performed in patients with left Vr is not a useful prognostic post-varicocelectomy marker. Since a lack of sperm response following varicocelectomy could mask another coincidental pathology, identification of a possible PV through US scans may be mandatory. On the other hand, an integrated uro-andrological approach, including US scans, makes it possible to delineate subsets of patients with Vr alone, who can be expected to have a better sperm response following Vr repair.
Dunham, Richard M; Cervasi, Barbara; Brenchley, Jason M; Albrecht, Helmut; Weintrob, Amy; Sumpter, Beth; Engram, Jessica; Gordon, Shari; Klatt, Nichole R; Frank, Ian; Sodora, Donald L; Douek, Daniel C; Paiardini, Mirko; Silvestri, Guido
2008-04-15
Decreased CD4(+) T cell counts are the best marker of disease progression during HIV infection. However, CD4(+) T cells are heterogeneous in phenotype and function, and it is unknown how preferential depletion of specific CD4(+) T cell subsets influences disease severity. CD4(+) T cells can be classified into three subsets by the expression of receptors for two T cell-tropic cytokines, IL-2 (CD25) and IL-7 (CD127). The CD127(+)CD25(low/-) subset includes IL-2-producing naive and central memory T cells; the CD127(-)CD25(-) subset includes mainly effector T cells expressing perforin and IFN-gamma; and the CD127(low)CD25(high) subset includes FoxP3-expressing regulatory T cells. Herein we investigated how the proportions of these T cell subsets are changed during HIV infection. When compared with healthy controls, HIV-infected patients show a relative increase in CD4(+)CD127(-)CD25(-) T cells that is related to an absolute decline of CD4(+)CD127(+)CD25(low/-) T cells. Interestingly, this expansion of CD4(+)CD127(-) T cells was not observed in naturally SIV-infected sooty mangabeys. The relative expansion of CD4(+)CD127(-)CD25(-) T cells correlated directly with the levels of total CD4(+) T cell depletion and immune activation. CD4(+)CD127(-)CD25(-) T cells were not selectively resistant to HIV infection as levels of cell-associated virus were similar in all non-naive CD4(+) T cell subsets. These data indicate that, during HIV infection, specific changes in the fraction of CD4(+) T cells expressing CD25 and/or CD127 are associated with disease progression. Further studies will determine whether monitoring the three subsets of CD4(+) T cells defined based on the expression of CD25 and CD127 should be used in the clinical management of HIV-infected individuals.
Jones, Michael P; Arheart, Kristopher L; Cray, Carolyn
2014-06-01
The objectives of this study were to determine reference intervals, perform longitudinal analyses, and determine the index of individuality (IoI) of 8 hematologic, and 13 biochemical and electrophoretic variables for a group of captive bald eagles (Haliaeetus leucocephalus). Reference intervals were determined from blood samples collected during annual wellness examinations for 41 eagles (23 male and 18 female) with ages ranging between 6 and 43 years (18.7 +/- 7.4, mean +/- SD) at the time of sample collection. Longitudinal analyses and IoI were determined for measured hematologic, biochemical, and protein electrophoretic variables, both individually and as a group, for a subset of 16 eagles (10 male and 6 female) during a 12-year period. This smaller group of eagles ranged in age between 2 and 20 years at the start of the study period, and between 14 and 32 years (21.9 +/- 5.0, mean +/- SD) at the end of the study period. Significant increases with age within the group of 16 eagles were observed only for red blood cells, percent heterophils, total protein, and beta-globulin protein fraction, while albumin:globulin decreased significantly with age. Individuality was low (IoI > or = 1.4) for all hematologic and biochemical variables except gamma globulins, which showed high individuality (IoI < or = 0.6) for 3 individuals within the subset of 16.
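For reference, one common clinical-pathology definition of the index of individuality divides within-individual (plus analytical) variation by between-individual variation; whether the original study used exactly this form is an assumption here. A small sketch:

    import numpy as np

    def index_of_individuality(cv_within, cv_between, cv_analytical=0.0):
        # One common definition: sqrt(CV_I^2 + CV_A^2) / CV_G, where CV_I is
        # within-individual, CV_A analytical, and CV_G between-individual variation.
        return np.sqrt(cv_within**2 + cv_analytical**2) / cv_between

    # IoI >= 1.4 (low individuality): population-based reference intervals work well;
    # IoI <= 0.6 (high individuality): subject-based reference values are preferable.
    print(index_of_individuality(cv_within=12.0, cv_between=8.0))    # 1.5 -> low individuality
    print(index_of_individuality(cv_within=4.0, cv_between=10.0))    # 0.4 -> high individuality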
NASA Astrophysics Data System (ADS)
Bailey, John I.; Mateo, Mario; White, Russel J.; Shectman, Stephen A.; Crane, Jeffrey D.
2018-04-01
We present multi-epoch high-dispersion optical spectra obtained with the Michigan/Magellan Fibre System of 126 and 125 Sun-like stars in the young clusters NGC 2516 (141 Myr) and NGC 2422 (73 Myr). We determine stellar properties including radial velocity (RV), Teff, [Fe/H], [α/Fe] and the line-of-sight rotation rate, v_r sin(i), from these spectra. Our median RV precision of 80 m s-1 on individual epochs that span a temporal baseline of 1.1 yr enables us to investigate membership and stellar binarity, and to search for sub-stellar companions. We determine membership probabilities and RV variability probabilities for our sample along with candidate companion orbital periods for a select subset of stars. In NGC 2516, we identified 81 RV members, 27 spectroscopic binaries (17 previously identified as photometric binaries) and 16 other stars that show significant RV variability after accounting for average stellar jitter at the 74 m s-1 level. In NGC 2422, we identify 57 members, 11 spectroscopic binaries and three other stars that show significant RV variability after accounting for an average jitter of 138 m s-1. We use Monte Carlo simulations to verify our stellar jitter measurements, determine the proportion of exoplanets and stellar companions to which we are sensitive, and estimate companion-mass limits for our targets. We also report mean cluster metallicity, velocity and velocity dispersion based on our member targets. We identify 58 non-member stars as RV variables, 24 of which have RV amplitudes that imply stellar or brown-dwarf mass companions. Finally, we note the discovery of a separate RV clustering of stars in our NGC 2422 sample.
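A standard way to flag RV variability after "accounting for stellar jitter" is a chi-square test of the epochs against their weighted mean, with the jitter added in quadrature to each measurement error; whether this matches the paper's exact test is an assumption. A minimal sketch:

    import numpy as np
    from scipy import stats

    def rv_variable(rv, sigma, jitter, alpha=0.01):
        # Chi-square test for RV variability: compare epoch RVs with their
        # weighted mean, inflating each error by the jitter in quadrature.
        err2 = sigma**2 + jitter**2
        mean = np.sum(rv / err2) / np.sum(1.0 / err2)   # weighted mean RV
        chi2 = np.sum((rv - mean) ** 2 / err2)
        p = stats.chi2.sf(chi2, df=len(rv) - 1)
        return p < alpha, p

    # Toy example: four epochs with 80 m/s precision and 74 m/s jitter.
    rv = np.array([10.0, 250.0, -120.0, 60.0])   # m/s, relative RVs
    sigma = np.full(4, 80.0)
    print(rv_variable(rv, sigma, jitter=74.0))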
AOIPS data base management systems support for GARP data sets
NASA Technical Reports Server (NTRS)
Gary, J. P.
1977-01-01
A data base management system, developed to provide flexible access to data sets produced by GARP during its data systems tests, is identified. The content and coverage of the data base are defined, and a computer-aided, interactive information storage and retrieval system, implemented to facilitate access to user specified data subsets, is described. The computer programs developed to provide the capability were implemented on the highly interactive, minicomputer-based AOIPS and are referred to as the data retrieval system (DRS). Implemented as a user interactive but menu guided system, the DRS permits users to inventory the data tape library and create duplicate or subset data sets based on a user selected window defined by time and latitude/longitude boundaries. The DRS permits users to select, display, or produce formatted hard copy of individual data items contained within the data records.
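The DRS-style windowed subsetting reduces to filtering records on time and latitude/longitude bounds. A minimal modern sketch in Python with pandas (the column names and records are illustrative, not from the original system):

    import pandas as pd

    # Toy stand-in for an inventory of GARP data records.
    records = pd.DataFrame({
        "time": pd.to_datetime(["1976-01-05", "1976-01-12", "1976-02-03"]),
        "lat": [10.0, 45.0, -5.0],
        "lon": [120.0, -60.0, 30.0],
    })

    def window_subset(df, t0, t1, lat_min, lat_max, lon_min, lon_max):
        # Select records inside a user-specified time and lat/lon window.
        m = (df["time"].between(pd.Timestamp(t0), pd.Timestamp(t1))
             & df["lat"].between(lat_min, lat_max)
             & df["lon"].between(lon_min, lon_max))
        return df[m]

    print(window_subset(records, "1976-01-01", "1976-01-31", -20, 20, 0, 180))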
Casati, Anna; Varghaei-Nahvi, Azam; Feldman, Steven Alexander; Assenmacher, Mario; Rosenberg, Steven Aaron; Dudley, Mark Edward; Scheffold, Alexander
2013-10-01
The adoptive transfer of lymphocytes genetically engineered to express tumor-specific antigen receptors is a potent strategy to treat cancer patients. T lymphocyte subsets, such as naïve or central memory T cells, selected in vitro prior to genetic engineering have been extensively investigated in preclinical mouse models, where they demonstrated improved therapeutic efficacy. However, so far, this is challenging to realize in the clinical setting, since good manufacturing practices (GMP) procedures for complex cell sorting and genetic manipulation are limited. To be able to directly compare the immunological attributes and therapeutic efficacy of naïve (T(N)) and central memory (T(CM)) CD8(+) T cells, we investigated clinical-scale procedures for their parallel selection and in vitro manipulation. We also evaluated currently available GMP-grade reagents for stimulation of T cell subsets, including a new type of anti-CD3/anti-CD28 nanomatrix. An optimized protocol was established for the isolation of both CD8(+) T(N) cells (CD4(-)CD62L(+)CD45RA(+)) and CD8(+) T(CM) (CD4(-)CD62L(+)CD45RA(-)) from a single patient. The highly enriched T cell subsets can be efficiently transduced and expanded to large cell numbers, sufficient for clinical applications and equivalent to or better than current cell and gene therapy approaches with unselected lymphocyte populations. The GMP protocols for selection of T(N) and T(CM) we reported here will be the basis for clinical trials analyzing safety, in vivo persistence and clinical efficacy in cancer patients and will help to generate a more reliable and efficacious cellular product.
Wang, Yi-Ting; Sung, Pei-Yuan; Lin, Peng-Lin; Yu, Ya-Wen; Chung, Ren-Hua
2015-05-15
Genome-wide association studies (GWAS) have become a common approach to identifying single nucleotide polymorphisms (SNPs) associated with complex diseases. As complex diseases are caused by the joint effects of multiple genes, while the effect of individual gene or SNP is modest, a method considering the joint effects of multiple SNPs can be more powerful than testing individual SNPs. The multi-SNP analysis aims to test association based on a SNP set, usually defined based on biological knowledge such as gene or pathway, which may contain only a portion of SNPs with effects on the disease. Therefore, a challenge for the multi-SNP analysis is how to effectively select a subset of SNPs with promising association signals from the SNP set. We developed the Optimal P-value Threshold Pedigree Disequilibrium Test (OPTPDT). The OPTPDT uses general nuclear families. A variable p-value threshold algorithm is used to determine an optimal p-value threshold for selecting a subset of SNPs. A permutation procedure is used to assess the significance of the test. We used simulations to verify that the OPTPDT has correct type I error rates. Our power studies showed that the OPTPDT can be more powerful than the set-based test in PLINK, the multi-SNP FBAT test, and the p-value based test GATES. We applied the OPTPDT to a family-based autism GWAS dataset for gene-based association analysis and identified MACROD2-AS1 with genome-wide significance (p-value=2.5×10(-6)). Our simulation results suggested that the OPTPDT is a valid and powerful test. The OPTPDT will be helpful for gene-based or pathway association analysis. The method is ideal for the secondary analysis of existing GWAS datasets, which may identify a set of SNPs with joint effects on the disease.
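The core of the OPTPDT idea, optimizing a p-value threshold and then letting a permutation-style null redo the optimization so the selection step does not inflate type I error, can be sketched as follows. This toy uses uniform p-values as the null in place of permuting transmissions within families, and the set-level statistic is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(3)

    def optimal_threshold_stat(pvals, thresholds):
        # For each candidate threshold, average -log10(p) over the SNPs that
        # pass it; return the best (largest) value across thresholds.
        stats_ = []
        for t in thresholds:
            sel = pvals <= t
            if sel.any():
                stats_.append(-np.log10(pvals[sel]).mean())
        return max(stats_) if stats_ else 0.0

    def permutation_test(pvals_obs, n_snps, thresholds, n_perm=1000):
        # The null replicates re-run the threshold optimization, so the
        # reported p-value already accounts for the selection step.
        t_obs = optimal_threshold_stat(pvals_obs, thresholds)
        null = [optimal_threshold_stat(rng.uniform(size=n_snps), thresholds)
                for _ in range(n_perm)]
        return np.mean(np.asarray(null) >= t_obs)

    pvals = np.r_[rng.uniform(size=45), rng.uniform(0, 0.01, size=5)]  # 5 signal SNPs
    thresholds = [0.01, 0.05, 0.1, 0.5, 1.0]
    print(f"set-level p = {permutation_test(pvals, 50, thresholds):.3f}")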
Longitudinal analyses of correlated response efficiencies of fillet traits in Nile tilapia.
Turra, E M; Fernandes, A F A; de Alvarenga, E R; Teixeira, E A; Alves, G F O; Manduca, L G; Murphy, T W; Silva, M A
2018-03-01
Recent studies with Nile tilapia have shown divergent results regarding the possibility of selecting on morphometric measurements to promote indirect genetic gains in fillet yield (FY). The use of indirect selection for fillet traits is important as these traits are only measurable after harvesting. Random regression models are a powerful tool in association studies to identify the best time point to measure and select animals. Random regression models can also be applied in a multiple trait approach to analyze indirect response to selection, which would avoid the need to sacrifice candidate fish. Therefore, the aim of this study was to investigate the genetic relationships between several body measurements, weight and fillet traits throughout the growth period and to evaluate the possibility of indirect selection for fillet traits in Nile tilapia. Data were collected from 2042 fish and were divided into two subsets. The first subset was used to estimate genetic parameters, including the permanent environmental effect for BW and body measurements (8758 records for each body measurement, as each fish was individually weighed and measured a maximum of six times). The second subset (2042 records for each trait) was used to estimate genetic correlations and heritabilities, which enabled the calculation of correlated response efficiencies between body measurements and the fillet traits. Heritability estimates across ages ranged from 0.05 to 0.5 for height, 0.02 to 0.48 for corrected length (CL), 0.05 to 0.68 for width, 0.08 to 0.57 for fillet weight (FW) and 0.12 to 0.42 for FY. All genetic correlation estimates between body measurements and FW were positive and strong (0.64 to 0.98). The estimates of genetic correlation between body measurements and FY were positive (except for CL at some ages), but weak to moderate (-0.08 to 0.68). These estimates resulted in strong and favorable correlated response efficiencies for FW and positive, but moderate, efficiencies for FY. These results indicate the possibility of achieving indirect genetic gains for FW by selecting for morphometric traits, but low efficiency for FY when compared with direct selection.
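The efficiency of indirect selection quoted in studies like this follows from the standard breeder's-equation result CR/R = (h_x / h_y) * r_g under equal selection intensities. A small sketch with illustrative values drawn from the ranges reported above:

    import numpy as np

    def correlated_response_efficiency(h2_indirect, h2_direct, r_g):
        # Efficiency of indirect relative to direct selection,
        # CR/R = (h_x / h_y) * r_g, assuming equal selection intensities.
        return np.sqrt(h2_indirect) * r_g / np.sqrt(h2_direct)

    # Body width as an indirect criterion for fillet weight (illustrative values).
    print(correlated_response_efficiency(h2_indirect=0.5, h2_direct=0.4, r_g=0.9))
    # ~1.0: indirect selection nearly as efficient as direct selection on FW.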
Genomic Prediction Accounting for Residual Heteroskedasticity.
Ou, Zhining; Tempelman, Robert J; Steibel, Juan P; Ernst, Catherine W; Bates, Ronald O; Bello, Nora M
2015-11-12
Whole-genome prediction (WGP) models that use single-nucleotide polymorphism marker information to predict genetic merit of animals and plants typically assume homogeneous residual variance. However, variability is often heterogeneous across agricultural production systems and may subsequently bias WGP-based inferences. This study extends classical WGP models based on normality, heavy-tailed specifications and variable selection to explicitly account for environmentally-driven residual heteroskedasticity under a hierarchical Bayesian mixed-models framework. WGP models assuming homogeneous or heterogeneous residual variances were fitted to training data generated under simulation scenarios reflecting a gradient of increasing heteroskedasticity. Model fit was based on pseudo-Bayes factors and also on prediction accuracy of genomic breeding values computed on a validation data subset one generation removed from the simulated training dataset. Homogeneous vs. heterogeneous residual variance WGP models were also fitted to two quantitative traits, namely 45-min postmortem carcass temperature and loin muscle pH, recorded in a swine resource population dataset prescreened for high and mild residual heteroskedasticity, respectively. Fit of competing WGP models was compared using pseudo-Bayes factors. Predictive ability, defined as the correlation between predicted and observed phenotypes in validation sets of a five-fold cross-validation was also computed. Heteroskedastic error WGP models showed improved model fit and enhanced prediction accuracy compared to homoskedastic error WGP models although the magnitude of the improvement was small (less than two percentage points net gain in prediction accuracy). Nevertheless, accounting for residual heteroskedasticity did improve accuracy of selection, especially on individuals of extreme genetic merit. Copyright © 2016 Ou et al.
Molsberry, Samantha A; Cheng, Yu; Kingsley, Lawrence; Jacobson, Lisa; Levine, Andrew J; Martin, Eileen; Miller, Eric N; Munro, Cynthia A; Ragin, Ann; Sacktor, Ned; Becker, James T
2018-05-11
Mild forms of HIV-associated neurocognitive disorder (HAND) remain prevalent in the combination anti-retroviral therapy (cART) era. This study's objective was to identify neuropsychological subgroups within the Multicenter AIDS Cohort Study (MACS) based on the participant-based latent structure of cognitive function and to identify factors associated with subgroups. The MACS is a four-site longitudinal study of the natural and treated history of HIV disease among gay and bisexual men. Using neuropsychological domain scores, we applied a cluster variable selection algorithm to identify the optimal subset of domains with cluster information. Latent profile analysis was applied using scores from the identified domains. Exploratory and post-hoc analyses were conducted to identify factors associated with cluster membership and the drivers of the observed associations. Cluster variable selection identified all domains as containing cluster information except for Working Memory. A three-profile solution produced the best fit for the data. Profile 1 performed below average on all domains; Profile 2 performed average on executive functioning, motor, and speed and below average on learning and memory; Profile 3 performed at or above average across all domains. Several demographic, cognitive, and social factors were associated with profile membership; these associations were driven by differences between Profile 1 and the other profiles. There is an identifiable pattern of neuropsychological performance among MACS members determined by all domains except Working Memory. Neither HIV nor HIV-related biomarkers were associated with cluster membership, consistent with other findings that cognitive performance patterns do not map directly onto HIV serostatus.
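Latent profile analysis on continuous domain scores is, in practice, a Gaussian mixture with diagonal covariance, with the number of profiles chosen by an information criterion. A sketch with scikit-learn (simulated scores; the domain count, group sizes, and BIC criterion are illustrative assumptions, not the MACS pipeline):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(4)

    # Toy stand-in for neuropsychological domain z-scores of 300 participants
    # (columns: executive, motor, speed, learning, memory).
    scores = np.vstack([
        rng.normal(-1.0, 0.5, size=(60, 5)),    # globally low profile
        rng.normal(0.0, 0.5, size=(140, 5)),    # average profile
        rng.normal(0.7, 0.5, size=(100, 5)),    # above-average profile
    ])

    # Latent profile analysis as a diagonal-covariance Gaussian mixture;
    # the number of profiles is chosen by BIC.
    models = {k: GaussianMixture(n_components=k, covariance_type="diag",
                                 random_state=0).fit(scores)
              for k in range(1, 6)}
    best_k = min(models, key=lambda k: models[k].bic(scores))
    profiles = models[best_k].predict(scores)
    print(f"best number of profiles by BIC: {best_k}")
    print(np.bincount(profiles))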
Consumer satisfaction with primary care provider choice and associated trust
Chu-Weininger, Ming Ying L; Balkrishnan, Rajesh
2006-01-01
Background Development of managed care, characterized by limited provider choice, is believed to undermine trust. Provider choice has been identified as strongly associated with physician trust. Stakeholders in a competitive healthcare market have competing agendas related to choice. The purpose of this study is to analyze variables associated with consumer's satisfaction that they have enough choice when selecting their primary care provider (PCP), and to analyze the importance of these variables on provider trust. Methods A 1999 randomized national cross-sectional telephone survey conducted among United States residential households, who had a telephone, had seen a medical professional at least twice in the past two years, and were aged ≥ 20 years was selected for secondary data analyses. Among 1,117 households interviewed, 564 were selected as the final sample. Subjects responded to a core set of questions related to provider trust, and a subset of questions related to trust in the insurer. A previously developed conceptual framework was adopted. Linear and logistic regressions were performed based on this framework. Results Results affirmed 'satisfaction with amount of PCP choice' was significantly (p < .001) associated with provider trust. 'PCP's care being extremely effective' was strongly associated with 'satisfaction with amount of PCP choice' and 'provider trust'. Having sought a second opinion(s) was associated with lower trust. 'Spoke to the PCP outside the medical office,' 'satisfaction with the insurer' and 'insurer charges less if PCP within network' were all variables associated with 'satisfaction with amount of PCP choice' (all p < .05). Conclusion This study confirmed the association of 'satisfaction with amount of PCP choice' with provider trust. Results affirmed 'enough PCP choice' was a strong predictor of provider trust. 'Second opinion on PCP' may indicate distrust in the provider. Data such as 'trust in providers in general' and 'the role of provider performance information' in choice, though important in PCP choice, were not available for analysis and should be explored in future studies. Results have implications for rethinking the relationships among consumer choice, consumer behaviors in making trade-offs in PCP choice, and the role of healthcare experiences in 'satisfaction with amount of PCP choice' or 'provider trust.' PMID:17059611
Takakusagi, Yoichi; Kuramochi, Kouji; Takagi, Manami; Kusayanagi, Tomoe; Manita, Daisuke; Ozawa, Hiroko; Iwakiri, Kanako; Takakusagi, Kaori; Miyano, Yuka; Nakazaki, Atsuo; Kobayashi, Susumu; Sugawara, Fumio; Sakaguchi, Kengo
2008-11-15
Here, we report an efficient one-cycle affinity selection using a natural-protein or random-peptide T7 phage pool for identification of binding proteins or peptides specific for small-molecules. The screening procedure involved a cuvette type 27-MHz quartz-crystal microbalance (QCM) apparatus with introduction of self-assembled monolayer (SAM) for a specific small-molecule immobilization on the gold electrode surface of a sensor chip. Using this apparatus, we attempted an affinity selection of proteins or peptides against synthetic ligand for FK506-binding protein (SLF) or irinotecan (Iri, CPT-11). An affinity selection using SLF-SAM and a natural-protein T7 phage pool successfully detected FK506-binding protein 12 (FKBP12)-displaying T7 phage after an interaction time of only 10 min. Extensive exploration of time-consuming wash and/or elution conditions together with several rounds of selection was not required. Furthermore, in the selection using a 15-mer random-peptide T7 phage pool and subsequent analysis utilizing receptor ligand contact (RELIC) software, a subset of SLF-selected peptides clearly pinpointed several amino-acid residues within the binding site of FKBP12. Likewise, a subset of Iri-selected peptides pinpointed part of the positive amino-acid region of residues from the Iri-binding site of the well-known direct targets, acetylcholinesterase (AChE) and carboxylesterase (CE). Our findings demonstrate the effectiveness of this method and general applicability for a wide range of small-molecules.
Regad, Leslie; Chéron, Jean-Baptiste; Triki, Dhoha; Senac, Caroline; Flatters, Delphine; Camproux, Anne-Claude
2017-01-01
Protein flexibility is often implied in binding with different partners and is essential for protein function. The growing number of macromolecular structures in the Protein Data Bank entries and their redundancy has become a major source of structural knowledge of the protein universe. The analysis of structural variability through available redundant structures of a target, called multiple target conformations (MTC), obtained using experimental or modeling methods and under different biological conditions or different sources is one way to explore protein flexibility. This analysis is essential to improve the understanding of various mechanisms associated with protein target function and flexibility. In this study, we explored the structural variability of three biological targets by analyzing different MTC sets associated with these targets. To facilitate the study of these MTC sets, we have developed an efficient tool, SA-conf, dedicated to capturing and linking the amino acid and local structure variability and analyzing the target structural variability space. The advantage of SA-conf is that it can be applied to diverse sets composed of MTCs available in the PDB obtained using NMR and crystallography or homology models. This tool can also be applied to analyze MTC sets obtained by dynamics approaches. Our results showed that the SA-conf tool is effective in quantifying the structural variability of an MTC set and in localizing the structurally variable positions and regions of the target. By selecting adapted MTC subsets and comparing their variability detected by SA-conf, we highlighted different sources of target flexibility, such as that induced by a binding partner, that induced by mutation, and intrinsic flexibility. Our results support the value of mining available structures associated with a target to offer valuable insight into target flexibility and interaction mechanisms. The SA-conf executable script, with a set of pre-compiled binaries, is available at http://www.mti.univ-paris-diderot.fr/recherche/plateformes/logiciels. PMID:28817602
Optimal Tuner Selection for Kalman-Filter-Based Aircraft Engine Performance Estimation
NASA Technical Reports Server (NTRS)
Simon, Donald L.; Garg, Sanjay
2011-01-01
An emerging approach in the field of aircraft engine controls and system health management is the inclusion of real-time, onboard models for the inflight estimation of engine performance variations. This technology, typically based on Kalman-filter concepts, enables the estimation of unmeasured engine performance parameters that can be directly utilized by controls, prognostics, and health-management applications. A challenge that complicates this practice is the fact that an aircraft engine's performance is affected by its level of degradation, generally described in terms of unmeasurable health parameters such as efficiencies and flow capacities related to each major engine module. Through Kalman-filter-based estimation techniques, the level of engine performance degradation can be estimated, given that there are at least as many sensors as health parameters to be estimated. However, in an aircraft engine, the number of sensors available is typically less than the number of health parameters, presenting an under-determined estimation problem. A common approach to address this shortcoming is to estimate a subset of the health parameters, referred to as model tuning parameters. The problem/objective is to optimally select the model tuning parameters to minimize Kalman-filter-based estimation error. A tuner selection technique has been developed that specifically addresses the under-determined estimation problem, where there are more unknown parameters than available sensor measurements. A systematic approach is applied to produce a model tuning parameter vector of appropriate dimension to enable estimation by a Kalman filter, while minimizing the estimation error in the parameters of interest. Tuning parameter selection is performed using a multi-variable iterative search routine that seeks to minimize the theoretical mean-squared estimation error of the Kalman filter. This approach can significantly reduce the error in onboard aircraft engine parameter estimation applications such as model-based diagnostic, controls, and life usage calculations. The advantage of the innovation is the significant reduction in estimation errors that it can provide relative to the conventional approach of selecting a subset of health parameters to serve as the model tuning parameter vector. Because this technique needs only to be performed during the system design process, it places no additional computation burden on the onboard Kalman filter implementation. The technique has been developed for aircraft engine onboard estimation applications, as this application typically presents an under-determined estimation problem. However, this generic technique could be applied to other industries using gas turbine engine technology.
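The conventional approach the abstract contrasts against, picking the subset of health parameters whose estimation minimizes overall error, can be illustrated with a static linear analogue of the Kalman problem: an under-determined y = Hp + noise with fewer sensors than parameters. All dimensions and values below are invented; the paper's innovation goes further by searching over linear combinations of all parameters rather than raw subsets:

    import itertools
    import numpy as np

    rng = np.random.default_rng(5)

    n_params, n_sensors = 6, 4                   # more health parameters than sensors
    H = rng.normal(size=(n_sensors, n_params))   # sensor sensitivities (invented)
    sigma_v = 0.05                               # sensor noise level

    def mse_for_subset(subset, n_mc=2000):
        # Monte Carlo estimate of total estimation error when only the chosen
        # "tuner" subset is estimated and the rest are implicitly held at zero.
        Hs = H[:, list(subset)]
        G = np.linalg.pinv(Hs)                   # least-squares estimator for the subset
        err = 0.0
        for _ in range(n_mc):
            p = rng.normal(size=n_params)        # true health-parameter deviations
            y = H @ p + sigma_v * rng.normal(size=n_sensors)
            p_hat = np.zeros(n_params)
            p_hat[list(subset)] = G @ y
            err += np.sum((p - p_hat) ** 2)
        return err / n_mc

    subsets = itertools.combinations(range(n_params), n_sensors)
    best = min(subsets, key=mse_for_subset)
    print("best tuner subset:", best, "-> MSE", round(mse_for_subset(best), 3))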
Vremec, David
2016-01-01
Dendritic cells (DCs) form a complex network of cells that initiate and orchestrate immune responses against a vast array of pathogenic challenges. Developmentally and functionally distinct DC subtypes differentially regulate T-cell function. Importantly it is the ability of DC to capture and process antigen, whether from pathogens, vaccines, or self-components, and present it to naive T cells that is the key to their ability to initiate an immune response. Our typical isolation procedure for DC from murine spleen was designed to efficiently extract all DC subtypes, without bias and without alteration to their in vivo phenotype, and involves a short collagenase digestion of the tissue, followed by selection for cells of light density and finally negative selection for DC. The isolation procedure can accommodate DC numbers that have been artificially increased via administration of fms-like tyrosine kinase 3 ligand (Flt3L), either directly through a series of subcutaneous injections or by seeding with an Flt3L secreting murine melanoma. Flt3L may also be added to bone marrow cultures to produce large numbers of in vitro equivalents of the spleen DC subsets. Total DC, or their subsets, may be further purified using immunofluorescent labeling and flow cytometric cell sorting. Cell sorting may be completely bypassed by separating DC subsets using a combination of fluorescent antibody labeling and anti-fluorochrome magnetic beads. Our procedure enables efficient separation of the distinct DC subsets, even in cases where mouse numbers or flow cytometric cell sorting time is limiting.
G-STRATEGY: Optimal Selection of Individuals for Sequencing in Genetic Association Studies
Wang, Miaoyan; Jakobsdottir, Johanna; Smith, Albert V.; McPeek, Mary Sara
2017-01-01
In a large-scale genetic association study, the number of phenotyped individuals available for sequencing may, in some cases, be greater than the study’s sequencing budget will allow. In that case, it can be important to prioritize individuals for sequencing in a way that optimizes power for association with the trait. Suppose a cohort of phenotyped individuals is available, with some subset of them possibly already sequenced, and one wants to choose an additional fixed-size subset of individuals to sequence in such a way that the power to detect association is maximized. When the phenotyped sample includes related individuals, power for association can be gained by including partial information, such as phenotype data of ungenotyped relatives, in the analysis, and this should be taken into account when assessing whom to sequence. We propose G-STRATEGY, which uses simulated annealing to choose a subset of individuals for sequencing that maximizes the expected power for association. In simulations, G-STRATEGY performs extremely well for a range of complex disease models and outperforms other strategies with, in many cases, relative power increases of 20–40% over the next best strategy, while maintaining correct type 1 error. G-STRATEGY is computationally feasible even for large datasets and complex pedigrees. We apply G-STRATEGY to data on HDL and LDL from the AGES-Reykjavik and REFINE-Reykjavik studies, in which G-STRATEGY is able to closely approximate the power of sequencing the full sample by selecting for sequencing only a small subset of the individuals. PMID:27256766
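Simulated annealing for fixed-size subset selection has a compact generic form: swap one selected individual for an unselected one, accept improvements always and deteriorations with a temperature-controlled probability. The sketch below uses a toy additive objective in place of G-STRATEGY's expected association power, so its true optimum is known and the annealer can be checked against it:

    import numpy as np

    rng = np.random.default_rng(6)

    # Toy surrogate: each individual carries an "information" score; the real
    # G-STRATEGY objective is expected association power given the pedigree.
    n, budget = 200, 30
    info = rng.gamma(2.0, 1.0, size=n)

    def objective(subset):
        return info[subset].sum()

    def anneal(n, budget, n_steps=5000, t0=1.0, cooling=0.999):
        subset = rng.choice(n, size=budget, replace=False)
        best = subset.copy()
        t = t0
        for _ in range(n_steps):
            # Propose swapping one selected individual for an unselected one.
            out = rng.integers(budget)
            candidates = np.setdiff1d(np.arange(n), subset)
            new = subset.copy()
            new[out] = rng.choice(candidates)
            delta = objective(new) - objective(subset)
            if delta > 0 or rng.uniform() < np.exp(delta / t):
                subset = new
                if objective(subset) > objective(best):
                    best = subset.copy()
            t *= cooling
        return best

    best = anneal(n, budget)
    print(f"annealed objective: {objective(best):.1f} "
          f"(true optimum: {np.sort(info)[-budget:].sum():.1f})")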
Feature Selection and Pedestrian Detection Based on Sparse Representation.
Yao, Shihong; Wang, Tao; Shen, Weiming; Pan, Shaoming; Chong, Yanwen; Ding, Fei
2015-01-01
Pedestrian detection research is currently devoted to the extraction of effective pedestrian features, yet the variety of pedestrian features and their large dimension have become one of the obstacles in pedestrian detection applications. Based on the theoretical analysis of six frequently-used features, SIFT, SURF, Haar, HOG, LBP and LSS, and their comparison with experimental results, this paper screens out sparse feature subsets via sparse representation to investigate whether the sparse subsets have the same description abilities and contain the most stable features. When any two of the six features are fused, the fusion feature is sparsely represented to obtain its important components. Sparse subsets of the fusion features can be rapidly generated by avoiding calculation of the corresponding index of dimension numbers of these feature descriptors; thus, the calculation speed of the feature dimension reduction is improved and the pedestrian detection time is reduced. Experimental results show that sparse feature subsets are capable of keeping the important components of these six feature descriptors. The sparse features of HOG and LSS possess the same description ability and consume less time compared with their full features. The ratios of the sparse feature subsets of HOG and LSS to their full sets are the highest among the six, and thus these two features can be used to best describe the characteristics of the pedestrian, and the sparse feature subsets of the combination of HOG-LSS show better distinguishing ability and parsimony.
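Sparse-representation feature screening of this kind can be approximated with an L1-penalized fit: coefficients driven exactly to zero drop dimensions, and the surviving indices form the sparse subset. A scikit-learn sketch (simulated fused descriptors, not the actual HOG/LSS pipeline; the alpha value is an illustrative choice):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)

    # Toy stand-in for a fused descriptor (e.g., HOG + LSS concatenated);
    # labels are pedestrian / non-pedestrian.
    n_samples, d = 400, 300
    X = rng.normal(size=(n_samples, d))
    w_true = np.zeros(d)
    w_true[rng.choice(d, 25, replace=False)] = rng.normal(size=25)  # few informative dims
    y = (X @ w_true + 0.5 * rng.normal(size=n_samples) > 0).astype(float)

    # L1 regression drives most coefficients to exactly zero; the surviving
    # indices form the sparse feature subset used for faster detection.
    model = Lasso(alpha=0.02).fit(X, y)
    subset = np.flatnonzero(model.coef_)
    print(f"kept {subset.size} of {d} dimensions")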
ERIC Educational Resources Information Center
Reynolds, Gemma; Reed, Phil
2013-01-01
Stimulus over-selectivity refers to the phenomenon whereby behavior is controlled by a subset of elements in the environment at the expense of other equally salient aspects of the environment. The experiments explored whether this cue interference effect was reduced following a surprising downward shift in reinforcer value. Experiment 1 revealed…
Long-Term Trends in Ecological Systems: A Basis for Understanding Responses to Global Change
USDA-ARS?s Scientific Manuscript database
The Eco Trends Editorial Committee sorted through vast amounts of historical and ongoing data from 50 ecological sites in the continental United States including Alaska, several islands, and Antarctica to present in a logical format the variables commonly collected. This report presents a subset of...
A Preliminary Analysis of LANDSAT-4 Thematic Mapper Radiometric Performance
NASA Technical Reports Server (NTRS)
Justice, C.; Fusco, L.; Mehl, W.
1984-01-01
Analysis was performed to characterize the radiometry of three Thematic Mapper (TM) digital products of a scene of Arkansas. The three digital products examined were the NASA raw (BT) product, the radiometrically corrected (AT) product and the radiometrically and geometrically corrected (PT) product. The frequency distribution of the digital data; the statistical correlation between the bands; and the variability between the detectors within a band were examined on a series of image subsets from the full scene. The results are presented from one 1024 x 1024 pixel subset of Reelfoot Lake, Tennessee, which displayed a representative range of ground conditions and cover types occurring within the full frame image. Bands 1, 2 and 5 of the sample area are presented. The subsets were extracted from the three digital data products to cover the same geographic area. This analysis provides the first step towards a full appraisal of the TM radiometry being performed as part of the ESA/CEC contribution to the NASA/LIDQA program.
Optimal Frequency-Domain System Realization with Weighting
NASA Technical Reports Server (NTRS)
Juang, Jer-Nan; Maghami, Peiman G.
1999-01-01
Several approaches are presented to identify an experimental system model directly from frequency response data. The formulation uses a matrix-fraction description as the model structure. Frequency weighting such as exponential weighting is introduced to solve a weighted least-squares problem to obtain the coefficient matrices for the matrix-fraction description. A multi-variable state-space model can then be formed using the coefficient matrices of the matrix-fraction description. Three different approaches are introduced to fine-tune the model using nonlinear programming methods to minimize the desired cost function. The first method uses an eigenvalue assignment technique to reassign a subset of system poles to improve the identified model. The second method deals with the model in the real Schur or modal form, reassigns a subset of system poles, and adjusts the columns (rows) of the input (output) influence matrix using a nonlinear optimizer. The third method also optimizes a subset of poles, but the input and output influence matrices are refined at every optimization step through least-squares procedures.
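For a single-input single-output channel, the weighted least-squares step reduces to a Levy-type linearization: minimize the weighted residual |A(s_k)G_k - B(s_k)|^2, which is linear in the polynomial coefficients of the (here scalar) matrix-fraction model. A numpy sketch with an exponential frequency weighting; the toy second-order system and the weighting constant are illustrative assumptions, not the paper's example:

    import numpy as np

    rng = np.random.default_rng(8)

    # Toy frequency-response data from a known second-order system.
    w = np.linspace(0.1, 10.0, 200)
    s = 1j * w
    G = 1.0 / (s**2 + 0.4 * s + 4.0)
    G = G + 0.001 * (rng.normal(size=w.size) + 1j * rng.normal(size=w.size))

    na, nb = 2, 0                        # denominator and numerator orders
    wgt = np.exp(-0.05 * w)              # exponential frequency weighting

    # With A made monic, the residual A(s)G - B(s) is linear in the unknowns:
    # r = s^na G + sum_k a_k s^k G - sum_k b_k s^k = rhs + M @ theta.
    cols = [(s**k) * G for k in range(na)] + [-(s**k) for k in range(nb + 1)]
    M = np.column_stack(cols)
    rhs = (s**na) * G
    Mw = np.vstack([M.real * wgt[:, None], M.imag * wgt[:, None]])
    rw = np.concatenate([rhs.real * wgt, rhs.imag * wgt])
    theta, *_ = np.linalg.lstsq(Mw, -rw, rcond=None)

    a = np.r_[1.0, theta[:na][::-1]]     # recovered denominator, highest power first
    b = theta[na:][::-1]                 # recovered numerator
    print("denominator:", a.round(3), "numerator:", b.round(3))

With the weighting applied, low frequencies dominate the fit; the recovered coefficients should be close to the true (1, 0.4, 4) and (1) used to generate the data.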
Elegido, Ana; Graell, Montserrat; Andrés, Patricia; Gheorghe, Alina; Marcos, Ascensión; Nova, Esther
2017-03-01
Anorexia nervosa (AN) is an atypical form of malnutrition with peculiar changes in the immune system. We hypothesized that different lymphocyte subsets are differentially affected by malnutrition in AN, and thus, our aim was to investigate the influence of body mass loss on the variability of lymphocyte subsets in AN patients. A group of 66 adolescent female patients, aged 12-17 years, referred for their first episode of either AN or feeding or eating disorders not elsewhere classified were studied upon admission (46 AN-restricting subtype, 11 AN-binge/purging subtype, and 9 feeding or eating disorders not elsewhere classified). Ninety healthy adolescents served as controls. White blood cells and lymphocyte subsets were analyzed by flow cytometry. Relationships with the body mass index (BMI) z score were assessed in linear models adjusted by diagnostic subtype and age. Leukocyte numbers were lower in AN patients than in controls, and relative lymphocytosis was observed in AN-restricting subtype. Lower CD8+, NK, and memory CD8+ counts were found in eating disorder patients compared with controls. No differences were found for CD4+ counts or naive and memory CD4+ subsets between the groups. Negative associations between lymphocyte percentage and the BMI z score, as well as between the B cell counts, naive CD4+ percentage and counts, and the BMI z score, were found. In conclusion, increased naive CD4+ and B lymphocyte subsets associated with body mass loss drive the relative lymphocytosis observed in AN patients, which reflects an adaptive mechanism to preserve the adaptive immune response. Copyright © 2017 Elsevier Inc. All rights reserved.
Varvil-Weld, Lindsey; Mallett, Kimberly A.; Turrisi, Rob; Abar, Caitlin C.
2012-01-01
Objective: Previous research identified a high-risk subset of college students experiencing a disproportionate number of alcohol-related consequences at the end of their first year. With the goal of identifying pre-college predictors of membership in this high-risk subset, the present study used a prospective design to identify latent profiles of student-reported maternal and paternal parenting styles and alcohol-specific behaviors and to determine whether these profiles were associated with membership in the high-risk consequences subset. Method: A sample of randomly selected 370 incoming first-year students at a large public university reported on their mothers’ and fathers’ communication quality, monitoring, approval of alcohol use, and modeling of drinking behaviors and on consequences experienced across the first year of college. Results: Students in the high-risk subset comprised 15.5% of the sample but accounted for almost half (46.6%) of the total consequences reported by the entire sample. Latent profile analyses identified four parental profiles: positive pro-alcohol, positive anti-alcohol, negative mother, and negative father. Logistic regression analyses revealed that students in the negative-father profile were at greatest odds of being in the high-risk consequences subset at a follow-up assessment 1 year later, even after drinking at baseline was controlled for. Students in the positive pro-alcohol profile also were at increased odds of being in the high-risk subset, although this association was attenuated after baseline drinking was controlled for. Conclusions: These findings have important implications for the improvement of existing parent- and individual-based college student drinking interventions designed to reduce alcohol-related consequences. PMID:22456248
Qian, Yu; Wei, Chungwen; Lee, F. Eun-Hyung; Campbell, John; Halliley, Jessica; Lee, Jamie A.; Cai, Jennifer; Kong, Megan; Sadat, Eva; Thomson, Elizabeth; Dunn, Patrick; Seegmiller, Adam C.; Karandikar, Nitin J.; Tipton, Chris; Mosmann, Tim; Sanz, Iñaki; Scheuermann, Richard H.
2011-01-01
Background Advances in multi-parameter flow cytometry (FCM) now allow for the independent detection of larger numbers of fluorochromes on individual cells, generating data with increasingly higher dimensionality. The increased complexity of these data has made it difficult to identify cell populations from high-dimensional FCM data using traditional manual gating strategies based on single-color or two-color displays. Methods To address this challenge, we developed a novel program, FLOCK (FLOw Clustering without K), that uses a density-based clustering approach to algorithmically identify biologically relevant cell populations from multiple samples in an unbiased fashion, thereby eliminating operator-dependent variability. Results FLOCK was used to objectively identify seventeen distinct B cell subsets in a human peripheral blood sample and to identify and quantify novel plasmablast subsets responding transiently to tetanus and other vaccinations in peripheral blood. FLOCK has been implemented in the publicly available Immunology Database and Analysis Portal – ImmPort (http://www.immport.org) for open use by the immunology research community. Conclusions FLOCK is able to identify cell subsets in experiments that use multi-parameter flow cytometry through an objective, automated computational approach. The use of algorithms like FLOCK for FCM data analysis obviates the need for subjective and labor intensive manual gating to identify and quantify cell subsets. Novel populations identified by these computational approaches can serve as hypotheses for further experimental study. PMID:20839340
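FLOCK's own grid-based density clustering is not reproduced here; the sketch below substitutes DBSCAN, another density-based method that likewise finds populations without pre-specifying their number or drawing manual gates (two simulated two-marker populations stand in for compensated FCM data):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(9)

    # Toy stand-in for compensated FCM events in two marker dimensions.
    pop1 = rng.normal([2.0, 6.0], 0.3, size=(2000, 2))
    pop2 = rng.normal([6.0, 2.0], 0.4, size=(1500, 2))
    events = np.vstack([pop1, pop2])

    # Density-based clustering finds populations without specifying K and
    # without manual gating; label -1 marks sparse "noise" events.
    labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(events)
    for lab in np.unique(labels):
        print(f"population {lab}: {np.sum(labels == lab)} events")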
Garabedian, C; Sfeir, R; Langlois, C; Bonnard, A; Khen-Dunlop, N; Gelas, T; Michaud, L; Auber, F; Piolat, C; Lemelle, J-L; Fouquet, V; Habonima, É; Becmeur, F; Polimerol, M-L; Breton, A; Petit, T; Podevin, G; Lavrand, F; Allal, H; Lopez, M; Elbaz, F; Merrot, T; Michel, J-L; Buisson, P; Sapin, E; Delagausie, P; Pelatan, C; Gaudin, J; Weil, D; de Vries, P; Jaby, O; Lardy, H; Aubert, D; Borderon, C; Fourcade, L; Geiss, S; Breaud, J; Pouzac, M; Echaieb, A; Laplace, C; Gottrand, F; Houfflin-Debarge, V
2015-11-01
To evaluate neonatal management and outcome of neonates with either a prenatal or a post-natal diagnosis of EA type III. Population-based study using data from the French National Register for EA from 2008 to 2010. We compared children with prenatal versus post-natal diagnosis with regard to prenatal, maternal and neonatal characteristics. We defined a composite variable of morbidity (anastomotic esophageal leaks, recurrent fistula, stenosis) and mortality at 1 year. Four hundred and eight live births with EA type III were recorded, with a prenatal diagnosis rate of 18.1%. Transfer after birth was lower in the prenatal subset (32.4% versus 81.5%, P<0.001). The delay between birth and first intervention was not significantly different. Defect size (2cm vs 1.4cm, P<0.001), gastrostomy rate (21.6% versus 8.7%, P<0.001) and length of stay in neonatal unit care were higher in the prenatal subset (47.9 days versus 33.6 days, P<0.001). The composite morbidity-mortality variable was more frequent in the prenatal diagnosis subset (38.7% vs 26.1%, P=0.044). Despite the excellent survival rate of EA, cases with antenatal detection have a higher morbidity related to the EA type (longer gap). Even if it does not modify neonatal management and 1-year outcome, prenatal diagnosis allows antenatal parental counseling and avoids post-natal transfer. Copyright © 2014 Elsevier Masson SAS. All rights reserved.
Beaulieu-Bonneau, Simon; Fortier-Brochu, Émilie; Ivers, Hans; Morin, Charles M
2017-03-01
The objectives of this study were to compare individuals with traumatic brain injury (TBI) and healthy controls on neuropsychological tests of attention and driving simulation performance, and explore their relationships with participants' characteristics, sleep, sleepiness, and fatigue. Participants were 22 adults with moderate or severe TBI (time since injury ≥ one year) and 22 matched controls. They completed three neuropsychological tests of attention, a driving simulator task, night-time polysomnographic recordings, and subjective ratings of sleepiness and fatigue. Results showed that participants with TBI exhibited poorer performance compared to controls on measures tapping speed of information processing and sustained attention, but not on selective attention measures. On the driving simulator task, a greater variability of the vehicle lateral position was observed in the TBI group. Poorer performance on specific subsets of neuropsychological variables was associated with poorer sleep continuity in the TBI group, and with a greater increase in subjective sleepiness in both groups. No significant relationship was found between cognitive performance and fatigue. These findings add to the existing evidence that speed of information processing is still impaired several years after moderate to severe TBI. Sustained attention could also be compromised. Attention seems to be associated with sleep continuity and daytime sleepiness; this interaction needs to be explored further.
Olson, Dawn M; Prescott, Kristina K; Zeilinger, Adam R; Hou, Suqin; Coffin, Alisa W; Smith, Coby M; Ruberson, John R; Andow, David A
2018-06-06
Landscape factors can significantly influence arthropod populations. The economically important brown stink bug, Euschistus servus (Say) (Hemiptera: Pentatomidae), is a native mobile, polyphagous and multivoltine pest of many crops in southeastern United States and understanding the relative influence of local and landscape factors on their reproduction may facilitate population management. Finite rate of population increase (λ) was estimated in four major crop hosts (maize, peanut, cotton, and soybean) over 3 yr in 16 landscapes of southern Georgia. A geographic information system (GIS) was used to characterize the surrounding landscape structure. LASSO regression was used to identify the subset of local and landscape characteristics and predator densities that account for variation in λ. The percentage area of maize, peanut and woodland and pasture in the landscape and the connectivity of cropland had no influence on E. servus λ. The best model for explaining variation in λ included only four predictor variables: whether or not the sampled field was a soybean field, mean natural enemy density in the field, percentage area of cotton in the landscape and the percentage area of soybean in the landscape. Soybean was the single most important variable for determining E. servus λ, with much greater reproduction in soybean fields than in other crop species. Penalized regression and post-selection inference provide conservative estimates of the landscape-scale determinants of E. servus reproduction and indicate that a relatively simple set of in-field and landscape variables influences reproduction in this species.
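LASSO-based selection of landscape predictors, as used above, is straightforward with cross-validated regularization; the sketch below simulates a small field-year dataset with invented predictor names mirroring the abstract, and omits the paper's post-selection inference step:

    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(10)

    # Toy stand-in for the landscape dataset: rows are field-year observations.
    names = ["is_soybean_field", "natural_enemy_density", "pct_cotton",
             "pct_soybean", "pct_maize", "pct_woodland", "crop_connectivity"]
    X = rng.normal(size=(64, len(names)))
    beta = np.array([0.8, -0.4, 0.3, 0.5, 0.0, 0.0, 0.0])   # only 4 real effects
    lam = X @ beta + 0.3 * rng.normal(size=64)               # log growth rate

    lasso = LassoCV(cv=5, random_state=0).fit(X, lam)
    kept = [n for n, c in zip(names, lasso.coef_) if abs(c) > 1e-8]
    print("selected predictors:", kept)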
Hybrid modeling of spatial continuity for application to numerical inverse problems
Friedel, Michael J.; Iwashita, Fabio
2013-01-01
A novel two-step modeling approach is presented to obtain optimal starting values and geostatistical constraints for numerical inverse problems otherwise characterized by spatially-limited field data. First, a type of unsupervised neural network, called the self-organizing map (SOM), is trained to recognize nonlinear relations among environmental variables (covariates) occurring at various scales. The values of these variables are then estimated at random locations across the model domain by iterative minimization of SOM topographic error vectors. Cross-validation is used to ensure unbiasedness and compute prediction uncertainty for select subsets of the data. Second, analytical functions are fit to experimental variograms derived from original plus resampled SOM estimates producing model variograms. Sequential Gaussian simulation is used to evaluate spatial uncertainty associated with the analytical functions and probable range for constraining variables. The hybrid modeling of spatial continuity is demonstrated using spatially-limited hydrologic measurements at different scales in Brazil: (1) physical soil properties (sand, silt, clay, hydraulic conductivity) in the 42 km2 Vargem de Caldas basin; (2) well yield and electrical conductivity of groundwater in the 132 km2 fractured crystalline aquifer; and (3) specific capacity, hydraulic head, and major ions in a 100,000 km2 transboundary fractured-basalt aquifer. These results illustrate the benefits of exploiting nonlinear relations among sparse and disparate data sets for modeling spatial continuity, but the actual application of these spatial data to improve numerical inverse modeling requires testing.
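The first step (SOM training followed by covariate estimation at unsampled locations) can be sketched with the third-party minisom package as below; the masked-distance imputation from the best-matching unit is a common SOM heuristic and an assumption here, not the paper's exact iterative minimization of topographic error vectors:

```python
# Sketch: train a SOM on covariate vectors, then estimate a missing
# covariate at a new site from its best-matching unit (BMU).
import numpy as np
from minisom import MiniSom   # third-party package: pip install minisom

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 4))                    # e.g., sand, silt, clay, log-K
data[:, 3] = 0.5 * data[:, 0] - 0.8 * data[:, 1]    # make covariates interrelated

som = MiniSom(8, 8, 4, sigma=1.5, learning_rate=0.5, random_seed=1)
som.train_random(data, 5000)

obs = np.array([0.2, -1.0, 0.4, np.nan])            # site with log-K unmeasured
mask = ~np.isnan(obs)
w = som.get_weights().reshape(-1, 4)                # all node codebook vectors
bmu = np.argmin(((w[:, mask] - obs[mask]) ** 2).sum(axis=1))  # match on observed dims
print("estimated covariate:", w[bmu, 3])            # read missing dim from BMU
```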
Poulin, L.; Grygiel, P.; Magne, M.; Rodriguez-R, L. M.; Forero Serna, N.; Zhao, S.; El Rafii, M.; Dao, S.; Tekete, C.; Wonni, I.; Koita, O.; Pruvost, O.; Verdier, V.; Vernière, C.
2014-01-01
Multilocus variable-number tandem-repeat analysis (MLVA) is efficient for routine typing and for investigating the genetic structures of natural microbial populations. Two distinct pathovars of Xanthomonas oryzae can cause significant crop losses in tropical and temperate rice-growing countries. Bacterial leaf streak is caused by X. oryzae pv. oryzicola, and bacterial leaf blight is caused by X. oryzae pv. oryzae. For the latter, two genetic lineages have been described in the literature. We developed a universal MLVA typing tool both for the identification of the three X. oryzae genetic lineages and for epidemiological analyses. Sixteen candidate variable-number tandem-repeat (VNTR) loci were selected according to their presence and polymorphism in 10 draft or complete genome sequences of the three X. oryzae lineages and by VNTR sequencing of a subset of loci of interest in 20 strains per lineage. The MLVA-16 scheme was then applied to 338 strains of X. oryzae representing different pathovars and geographical locations. Linkage disequilibrium between MLVA loci was calculated by the index of association at different scales, and the 16 loci showed a linear Mantel correlation with MLSA data on 56 X. oryzae strains, suggesting that they provide a good phylogenetic signal. Furthermore, analyses of sets of strains from different lineages indicated the possibility of using the scheme for deeper epidemiological investigation on small spatial scales. PMID:25398857
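A toy illustration of how MLVA profiles can be compared: each strain becomes a vector of repeat counts at the VNTR loci, and a simple categorical distance (the proportion of mismatched loci) supports clustering and epidemiological comparison. The profile values below are made up:

```python
# Sketch: pairwise categorical distance between MLVA repeat-count profiles.
import numpy as np

profiles = np.array([[7, 3, 12, 5],     # strain 1: repeat counts at 4 VNTR loci
                     [7, 4, 12, 5],     # strain 2: differs at one locus
                     [2, 9, 6, 11]])    # strain 3: distant profile
d = (profiles[:, None, :] != profiles[None, :, :]).mean(axis=2)
print(d)   # proportion of differing loci for each strain pair
```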
Quantum chemical and statistical study of megazol-derived compounds with trypanocidal activity
NASA Astrophysics Data System (ADS)
Rosselli, F. P.; Albuquerque, C. N.; da Silva, A. B. F.
In this work we performed a structure-activity relationship (SAR) study with the aim of correlating molecular properties of the megazol compound and 10 of its analogs with the biological activity against Trypanosoma cruzi (trypanocidal or antichagasic activity) presented by these molecules. The biological activity indication was obtained from in vitro tests, and the molecular properties (variables or descriptors) were obtained from the optimized chemical structures by using the PM3 semiempirical method. Approximately 80 molecular properties were calculated, selected among steric, constitutional, electronic, and lipophilicity properties. In order to reduce dimensionality and investigate which subset of variables (descriptors) would be most effective in classifying the compounds studied according to their degree of trypanocidal activity, we employed statistical methodologies (pattern recognition and classification techniques) such as principal component analysis (PCA), hierarchical cluster analysis (HCA), K-nearest neighbor (KNN), and discriminant function analysis (DFA). These methods showed that the descriptors molecular mass (MM), energy of the second lowest unoccupied molecular orbital (LUMO+1), charge on the first nitrogen at substituent 2 (qN'), dihedral angles (D1 and D2), bond length between atom C4 and its substituent (L4), Moriguchi octanol-water partition coefficient (MLogP), and length-to-breadth ratio (L/Bw) were the variables responsible for the separation between active and inactive compounds against T. cruzi. Afterwards, the PCA, KNN, and DFA models built in this work were used to perform trypanocidal activity predictions for eight new megazol analog compounds.
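A compact scikit-learn sketch of the PCA-plus-KNN portion of such a workflow; the descriptor matrix and activity labels below are synthetic placeholders, and the study's other methods (HCA, DFA) are not shown:

```python
# Sketch: reduce ~80 autoscaled descriptors with PCA, classify with KNN.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(11, 80))          # 11 compounds x ~80 descriptors (placeholder)
y = rng.integers(0, 2, size=11)        # 1 = trypanocidal, 0 = inactive (placeholder)

Xs = StandardScaler().fit_transform(X)            # autoscale descriptors
scores = PCA(n_components=3).fit_transform(Xs)    # dimensionality reduction
knn = KNeighborsClassifier(n_neighbors=3).fit(scores, y)
print(knn.predict(scores[:2]))         # classify compounds (resubstitution demo)
```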
Gutiérrez-López-Franca, Carlos; Hervás, Ramón; Johnson, Esperanza
2018-01-01
This paper aims to improve activity recognition systems based on skeletal tracking through the study of two different strategies (and their combination): (a) specialized body-part analysis and (b) stricter restrictions for the most easily detectable activities. The study was performed using the Extended Body-Angles Algorithm, which is able to analyze activities using only a single key sample. The system allows the relevant joints of each considered activity to be specified, making it possible to monitor the user's body through only a subset of its joints. This feature has both advantages and disadvantages; in the past, it caused difficulties in recognizing activities for which only a small subset of the body's joints is relevant. The goal of this work, therefore, is to analyze the effect of several strategies, applied with the purpose of improving the recognition rates of activities with a small subset of relevant joints, on an activity recognition system based on skeletal-tracking devices. Through the results of this work, we aim to give the scientific community some first indications of which strategy is preferable. PMID:29789478
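Purely as a hypothetical illustration of key-sample matching on relevant joints only (the Extended Body-Angles Algorithm's internal details are not given in this abstract), one might compare observed joint angles against a stored key sample within a tolerance; all names and values here are invented:

```python
# Sketch: match a pose against a key sample using only the relevant joints.
key_sample = {"elbow_r": 90.0, "shoulder_r": 45.0}   # hypothetical relevant joints (deg)
tolerance = 15.0                                      # hypothetical angular tolerance

def matches(observed_angles, key, tol):
    # True if every relevant joint angle is within tolerance of the key sample
    return all(abs(observed_angles[j] - a) <= tol for j, a in key.items())

frame = {"elbow_r": 97.0, "shoulder_r": 40.0, "knee_l": 10.0}  # extra joints ignored
print("activity recognized:", matches(frame, key_sample, tolerance))
```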
Zhang, Daqing; Xiao, Jianfeng; Zhou, Nannan; Luo, Xiaomin; Jiang, Hualiang; Chen, Kaixian
2015-01-01
The blood-brain barrier (BBB) is a highly complex physical barrier determining what substances are allowed to enter the brain. The support vector machine (SVM) is a kernel-based machine learning method that is widely used in QSAR studies. For a successful SVM model, the kernel parameters and the feature subset selection are the most important factors affecting prediction accuracy. In most studies they are treated as two independent problems, but it has been shown that they can affect each other. We designed and implemented a genetic algorithm (GA) to optimize kernel parameters and feature subset selection for SVM regression and applied it to BBB penetration prediction. The results show that our GA/SVM model is more accurate than other currently available log BB models. Optimizing both the SVM parameters and the feature subset simultaneously with a genetic algorithm is therefore a better approach than methods that treat the two problems separately. Analysis of our log BB model suggests that the carboxylic acid group, polar surface area (PSA)/hydrogen-bonding ability, lipophilicity, and molecular charge play important roles in BBB penetration. Among these properties, lipophilicity enhances BBB penetration, while all the others are negatively correlated with it. PMID:26504797
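A minimal GA-style sketch of the joint optimization idea: each individual encodes SVR kernel parameters together with a feature mask, and fitness is cross-validated performance. The encoding, operators, and settings are illustrative choices under stated assumptions, not the authors':

```python
# Sketch: evolve (log C, log gamma, feature mask) jointly for SVR.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 12))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=60)    # toy log BB-like target

def fitness(ind):
    mask = ind[2:] > 0.5                           # genes 2.. act as a soft mask
    if not mask.any():
        return -np.inf
    svr = SVR(C=10 ** ind[0], gamma=10 ** ind[1])  # genes 0,1 are log10 params
    return cross_val_score(svr, X[:, mask], y, cv=3, scoring="r2").mean()

pop = np.column_stack([rng.uniform(-1, 3, 20),     # log10 C
                       rng.uniform(-4, 0, 20),     # log10 gamma
                       rng.random((20, 12))])      # feature-mask genes
for gen in range(15):
    fit = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(fit)[-10:]]           # truncation selection
    children = parents[rng.integers(0, 10, 10)].copy()
    children += rng.normal(scale=0.2, size=children.shape)  # Gaussian mutation
    pop = np.vstack([parents, children])
best = pop[np.argmax([fitness(i) for i in pop])]
print("selected features:", np.where(best[2:] > 0.5)[0])
```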
Moon, James J; Dash, Pradyot; Oguin, Thomas H; McClaren, Jennifer L; Chu, H Hamlet; Thomas, Paul G; Jenkins, Marc K
2011-08-30
It is currently thought that T cells with specificity for self-peptide/MHC (pMHC) ligands are deleted during thymic development, thereby preventing autoimmunity. In the case of CD4(+) T cells, what is unclear is the extent to which self-peptide/MHC class II (pMHCII)-specific T cells are deleted or become Foxp3(+) regulatory T cells. We addressed this issue by characterizing a natural polyclonal pMHCII-specific CD4(+) T-cell population in mice that either lacked or expressed the relevant antigen in a ubiquitous pattern. Mice expressing the antigen contained one-third the number of pMHCII-specific T cells as mice lacking the antigen, and the remaining cells exhibited low TCR avidity. In mice lacking the antigen, the pMHCII-specific T-cell population was dominated by phenotypically naive Foxp3(-) cells, but also contained a subset of Foxp3(+) regulatory cells. Both Foxp3(-) and Foxp3(+) pMHCII-specific T-cell numbers were reduced in mice expressing the antigen, but the Foxp3(+) subset was more resistant to changes in number and TCR repertoire. Therefore, thymic selection of self-pMHCII-specific CD4(+) T cells results in incomplete deletion within the normal polyclonal repertoire, especially among regulatory T cells.
A tool for selecting SNPs for association studies based on observed linkage disequilibrium patterns.
De La Vega, Francisco M; Isaac, Hadar I; Scafe, Charles R
2006-01-01
The design of genetic association studies using single-nucleotide polymorphisms (SNPs) requires the selection of subsets of the variants providing high statistical power at a reasonable cost. SNPs must be selected to maximize the probability that a causative mutation is in linkage disequilibrium (LD) with at least one marker genotyped in the study. The HapMap project performed a genome-wide survey of genetic variation with about a million SNPs typed in four populations, providing a rich resource to inform the design of association studies. A number of strategies have been proposed for the selection of SNPs based on observed LD, including construction of metric LD maps and the selection of haplotype tagging SNPs. Power calculations are important at the study design stage to ensure successful results. Integrating these methods and annotations can be challenging: the algorithms required to implement these methods are complex to deploy, and all the necessary data and annotations are deposited in disparate databases. Here, we present the SNPbrowser Software, a freely available tool to assist in the LD-based selection of markers for association studies. This stand-alone application provides fast query capabilities and swift visualization of SNPs, gene annotations, power, haplotype blocks, and LD map coordinates. Wizards implement several common SNP selection workflows including the selection of optimal subsets of SNPs (e.g. tagging SNPs). Selected SNPs are screened for their conversion potential to either TaqMan SNP Genotyping Assays or the SNPlex Genotyping System, two commercially available genotyping platforms, expediting the set-up of genetic studies with an increased probability of success.
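The tag-SNP selection that such wizards automate can be illustrated with a greedy cover heuristic: keep picking tags until every SNP has r^2 above a threshold with at least one chosen tag. This sketches the general idea behind LD-based tagging, not SNPbrowser's specific algorithm:

```python
# Sketch: greedy tag-SNP selection by pairwise r^2 coverage.
import numpy as np

rng = np.random.default_rng(4)
G = rng.integers(0, 3, size=(120, 30)).astype(float)  # genotypes: 120 people x 30 SNPs
r2 = np.corrcoef(G.T) ** 2                            # pairwise r^2 between SNPs

untagged, tags = set(range(30)), []
while untagged:
    # choose the SNP that covers the most currently untagged SNPs
    best = max(range(30), key=lambda s: sum(r2[s, t] >= 0.8 for t in untagged))
    tags.append(best)
    untagged -= {t for t in untagged if r2[best, t] >= 0.8}
print("tag SNPs:", sorted(tags))
```

Since every SNP is in perfect LD with itself, the loop always terminates; in real panels with block-like LD the tag set is much smaller than the full SNP set.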
Channel and feature selection in multifunction myoelectric control.
Khushaba, Rami N; Al-Jumaily, Adel
2007-01-01
Real-time control of devices based on myoelectric signals (MES) is a challenging research problem. This paper presents a new approach to reducing the computational cost of real-time systems driven by MES (also known as electromyography, EMG). The new approach evaluates the significance of feature/channel selection for MES pattern recognition. Particle swarm optimization (PSO), an evolutionary computational technique, is employed to search the feature/channel space for important subsets. These subsets are evaluated using a multilayer perceptron trained with backpropagation (BPNN). Practical results are presented from tests on six subjects' datasets of MES signals measured noninvasively using surface electrodes. The results show that minimum error rates can be achieved by considering the correct combination of features/channels, thus providing a feasible system for practical implementation in patient rehabilitation.
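A binary-PSO sketch of the channel/feature selection loop, scored with a small scikit-learn MLP standing in for the BPNN; swarm size, coefficients, and the toy data are arbitrary assumptions:

```python
# Sketch: binary PSO over channel/feature masks, scored by CV accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(90, 16))                      # 90 trials x 16 channels/features
y = (X[:, 2] + X[:, 7] > 0).astype(int)            # toy class labels

def score(mask):
    mask = mask.astype(bool)
    if not mask.any():
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

n, d = 10, 16
pos = (rng.random((n, d)) > 0.5).astype(float)     # particle positions = bit masks
vel = rng.normal(size=(n, d))
pbest, pfit = pos.copy(), np.array([score(p) for p in pos])
gbest = pbest[pfit.argmax()].copy()
for _ in range(8):
    r1, r2 = rng.random((n, d)), rng.random((n, d))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = (rng.random((n, d)) < 1 / (1 + np.exp(-vel))).astype(float)  # sigmoid rule
    fit = np.array([score(p) for p in pos])
    improved = fit > pfit
    pbest[improved], pfit[improved] = pos[improved], fit[improved]
    gbest = pbest[pfit.argmax()].copy()
print("selected channels/features:", np.where(gbest > 0.5)[0])
```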
NASA Technical Reports Server (NTRS)
Tumer, Kagan; Oza, Nikunj C.; Clancy, Daniel (Technical Monitor)
2001-01-01
Using an ensemble of classifiers instead of a single classifier has been shown to improve generalization performance in many pattern recognition problems. However, the extent of such improvement depends greatly on the amount of correlation among the errors of the base classifiers. Therefore, reducing those correlations while keeping the classifiers' performance levels high is an important area of research. In this article, we explore input decimation (ID), a method which selects feature subsets for their ability to discriminate among the classes and uses them to decouple the base classifiers. We provide a summary of the theoretical benefits of correlation reduction, along with results of our method on two underwater sonar data sets, three benchmarks from the Proben1/UCI repositories, and two synthetic data sets. The results indicate that input decimated ensembles (IDEs) outperform ensembles whose base classifiers use all the input features; randomly selected subsets of features; and features created using principal components analysis, on a wide range of domains.
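One simple way to realize input decimation, sketched below: for each class, keep the features most correlated with that class's indicator variable, train a base classifier per subset, and average the outputs. The correlation criterion and logistic base learner are plausible stand-ins, not necessarily the authors' exact choices:

```python
# Sketch: per-class feature subsets decouple the base classifiers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
k = 8
ensemble = []
for c in range(3):
    ind = (y == c).astype(float)                   # indicator variable for class c
    corr = np.abs(np.array([np.corrcoef(X[:, j], ind)[0, 1] for j in range(20)]))
    feats = np.argsort(corr)[-k:]                  # top-k features for class c
    clf = LogisticRegression(max_iter=1000).fit(X[:, feats], y)
    ensemble.append((feats, clf))

# combine by averaging predicted probabilities across the decimated members
proba = np.mean([clf.predict_proba(X[:, f]) for f, clf in ensemble], axis=0)
print("train accuracy:", (proba.argmax(1) == y).mean())
```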
Neural networks for vertical microcode compaction
NASA Astrophysics Data System (ADS)
Chu, Pong P.
1992-09-01
Neural networks provide an alternative way to solve complex optimization problems. Instead of performing a program of instructions sequentially as in a traditional computer, a neural network model explores many competing hypotheses simultaneously using its massively parallel net. This paper shows how to use the neural network approach to perform vertical microcode compaction for a microprogrammed control unit. The compaction procedure includes two basic steps. The first step determines the compatibility classes, and the second step selects a minimal subset to cover the control signals. Since the selection process is an NP-complete problem, finding an optimal solution is impractical. In this study, we employ a customized neural network to obtain the minimal subset. We first formalize this problem, then define an 'energy function' and map it to a two-layer fully connected neural network. The modified network has two types of neurons and can always obtain a valid solution.
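The second step is an instance of set cover. The greedy baseline below only illustrates the combinatorial problem that the network's 'energy function' encodes; it is not the neural approach itself:

```python
# Sketch: greedy set cover of control signals by compatibility classes.
def greedy_cover(signals, classes):
    """classes: dict name -> set of control signals that class can issue."""
    uncovered, chosen = set(signals), []
    while uncovered:
        best = max(classes, key=lambda c: len(classes[c] & uncovered))
        if not classes[best] & uncovered:
            raise ValueError("signals not coverable by the given classes")
        chosen.append(best)
        uncovered -= classes[best]
    return chosen

classes = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}, "D": {1, 6}}
print(greedy_cover({1, 2, 3, 4, 5, 6}, classes))   # e.g. ['A', 'C']
```

Greedy is only an approximation (within a ln n factor of optimal), which is one motivation for attacking the exact problem with an energy-minimizing network instead.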
Surveillance system and method having parameter estimation and operating mode partitioning
NASA Technical Reports Server (NTRS)
Bickford, Randall L. (Inventor)
2003-01-01
A system and method for monitoring an apparatus or process asset including partitioning an unpartitioned training data set into a plurality of training data subsets each having an operating mode associated thereto; creating a process model comprised of a plurality of process submodels each trained as a function of at least one of the training data subsets; acquiring a current set of observed signal data values from the asset; determining an operating mode of the asset for the current set of observed signal data values; selecting a process submodel from the process model as a function of the determined operating mode of the asset; calculating a current set of estimated signal data values from the selected process submodel for the determined operating mode; and outputting the calculated current set of estimated signal data values for providing asset surveillance and/or control.
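A rough sketch of the partitioned-model pattern described in the claim: cluster training data into operating modes, fit one submodel per mode, and route each new observation to the submodel for its inferred mode. KMeans and linear submodels are illustrative stand-ins, not the patented method:

```python
# Sketch: operating-mode partitioning with per-mode estimation submodels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (100, 3)),          # mode 0 operating regime
               rng.normal(5, 1, (100, 3))])         # mode 1 operating regime
y = np.concatenate([X[:100, 0] * 2, X[100:, 0] * -1])  # mode-dependent behavior

modes = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # partition training set
submodels = [LinearRegression().fit(X[modes.labels_ == m], y[modes.labels_ == m])
             for m in range(2)]                      # one submodel per mode

x_now = rng.normal(5, 1, (1, 3))                     # current observed signal values
m = modes.predict(x_now)[0]                          # determine operating mode
print("estimated signal value:", submodels[m].predict(x_now)[0])
```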
Scalable amplification of strand subsets from chip-synthesized oligonucleotide libraries
Schmidt, Thorsten L.; Beliveau, Brian J.; Uca, Yavuz O.; Theilmann, Mark; Da Cruz, Felipe; Wu, Chao-Ting; Shih, William M.
2015-01-01
Synthetic oligonucleotides are the main cost factor for studies in DNA nanotechnology, genetics, and synthetic biology, all of which require thousands of them at high quality. Inexpensive chip-synthesized oligonucleotide libraries can contain hundreds of thousands of distinct sequences, albeit only at sub-femtomole quantities per strand. Here we present a selective oligonucleotide amplification method, based on three rounds of rolling-circle amplification, that produces nanomole amounts of single-stranded oligonucleotides per millilitre reaction. In a multistep one-pot procedure, subsets of hundreds or thousands of single-stranded DNAs with different lengths can be selectively amplified and purified together. These oligonucleotides are used to fold several DNA nanostructures and as primary fluorescence in situ hybridization probes. The amplification cost is lower than that of other reported methods (typically around US$20 per nanomole of total oligonucleotides produced) and is dominated by the use of commercial enzymes. PMID:26567534
Optimizing an Actuator Array for the Control of Multi-Frequency Noise in Aircraft Interiors
NASA Technical Reports Server (NTRS)
Palumbo, D. L.; Padula, S. L.
1997-01-01
Techniques developed for selecting an optimized actuator array for interior noise reduction at a single frequency are extended to the multi-frequency case. Transfer functions for 64 actuators were obtained at 5 frequencies from ground testing the rear section of a fully trimmed DC-9 fuselage. A single loudspeaker facing the left side of the aircraft was the primary source. A combinatorial search procedure (tabu search) was employed to find optimal actuator subsets of 2 to 16 actuators. Noise reduction predictions derived from the transfer functions were used as the basis for evaluating actuator subsets during optimization. Results indicate that it is necessary to constrain actuator forces during optimization, as unconstrained optimizations selected actuators that require unrealistically large forces. Two methods of constraint are evaluated; it is shown that a fast but approximate method yields results equivalent to an accurate but computationally expensive one.
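A toy tabu search over actuator subsets, using swap moves and a short-term tabu list; the objective below is a placeholder standing in for the noise-reduction predictions derived from the measured transfer functions:

```python
# Sketch: tabu search for a k-actuator subset under a surrogate objective.
import numpy as np

rng = np.random.default_rng(8)
H = rng.normal(size=(16, 5))          # toy "transfer functions": 16 actuators x 5 freqs

def objective(subset):
    # placeholder: reward subsets whose combined response is strong at all freqs
    return np.abs(H[list(subset)].sum(axis=0)).min()

k, tabu = 4, []
current = set(rng.choice(16, k, replace=False))
best, best_val = set(current), objective(current)
for _ in range(50):
    moves = [(i, j) for i in current for j in set(range(16)) - current
             if (i, j) not in tabu]                       # swap i out, j in
    i, j = max(moves, key=lambda m: objective(current - {m[0]} | {m[1]}))
    current = current - {i} | {j}
    tabu = (tabu + [(j, i)])[-10:]    # forbid the immediate reverse swap
    if objective(current) > best_val:
        best, best_val = set(current), objective(current)
print("best actuator subset:", sorted(best))
```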
Redundant variables and Granger causality
NASA Astrophysics Data System (ADS)
Angelini, L.; de Tommaso, M.; Marinazzo, D.; Nitti, L.; Pellicoro, M.; Stramaglia, S.
2010-03-01
We discuss the use of multivariate Granger causality in the presence of redundant variables: the application of the standard analysis, in this case, leads to underestimation of causalities. Using the un-normalized version of the causality index, we quantitatively develop the notions of redundancy and synergy in the framework of causality and propose two approaches to group redundant variables: (i) for a given target, the remaining variables are grouped so as to maximize the total causality, and (ii) the whole set of variables is partitioned to maximize the sum of the causalities between subsets. We show the application to a real neurological experiment, aiming at a deeper understanding of the physiological basis of abnormal neuronal oscillations in the migraine brain. The outcome of our approach reveals the change in the informational pattern due to repetitive transcranial magnetic stimulation.
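For reference, the standard (here bivariate, un-normalized) Granger causality index compares the residual variances of restricted and full autoregressions; a NumPy sketch on simulated data, not the authors' multivariate estimator:

```python
# Sketch: Granger causality index x -> y as log(var_restricted / var_full).
import numpy as np

rng = np.random.default_rng(9)
n, p = 2000, 2                                   # series length, lag order
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(p, n):
    y[t] = 0.4 * y[t - 1] + 0.5 * x[t - 1] + 0.2 * rng.normal()  # x drives y

def lagmat(series, p):
    # columns are series lagged by 1..p, aligned to predict series[p:]
    return np.column_stack([series[p - k:-k or None] for k in range(1, p + 1)])

Y = y[p:]
own = np.column_stack([np.ones(n - p), lagmat(y, p)])      # restricted: y's past only
full = np.column_stack([own, lagmat(x, p)])                # full: add x's past
res_r = Y - own @ np.linalg.lstsq(own, Y, rcond=None)[0]
res_f = Y - full @ np.linalg.lstsq(full, Y, rcond=None)[0]
print("GC index x->y:", np.log(res_r.var() / res_f.var()))  # > 0 when x helps
```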
QUEST1 Variability Survey. II. Variability Determination Criteria and 200k Light Curve Catalog
NASA Astrophysics Data System (ADS)
Rengstorf, A. W.; Mufson, S. L.; Andrews, P.; Honeycutt, R. K.; Vivas, A. K.; Abad, C.; Adams, B.; Bailyn, C.; Baltay, C.; Bongiovanni, A.; Briceño, C.; Bruzual, G.; Coppi, P.; Della Prugna, F.; Emmet, W.; Ferrín, I.; Fuenmayor, F.; Gebhard, M.; Hernández, J.; Magris, G.; Musser, J.; Naranjo, O.; Oemler, A.; Rosenzweig, P.; Sabbey, C. N.; Sánchez, Ge.; Sánchez, Gu.; Schaefer, B.; Schenner, H.; Sinnott, J.; Snyder, J. A.; Sofia, S.; Stock, J.; van Altena, W.
2004-12-01
The QUEST (QUasar Equatorial Survey Team) Phase 1 camera has collected multibandpass photometry on a large strip of high Galactic latitude sky over a period of 26 months. This robust data set has been reduced and nightly catalogs compared to determine the photometric variability of the ensemble objects. Subsequent spectroscopic observations have confirmed a subset of the photometric variables as quasars, as previously reported. This paper reports on the details of the data reduction and analysis pipeline and presents multiple bandpass light curves for 198,213 QUEST1 objects, along with global variability information and matched Sloan photometry. Based on observations obtained at the Llano del Hato National Astronomical Observatory, operated by the Centro de Investigaciones de Astronomía for the Ministerio de Ciencia y Tecnologia of Venezuela.
A landslide susceptibility map of Africa
NASA Astrophysics Data System (ADS)
Broeckx, Jente; Vanmaercke, Matthias; Duchateau, Rica; Poesen, Jean
2017-04-01
Studies on landslide risks and fatalities indicate that landslides are a global threat to humans, infrastructure, and the environment, not least in Africa. Nonetheless, our understanding of the spatial patterns of landslides and rockfalls on this continent is very limited, and Africa is mostly underrepresented in the inventories used to construct global landslide susceptibility maps. As a result, predicted landslide susceptibilities remain subject to very large uncertainties. This research aims to produce a first continent-wide landslide susceptibility map for Africa, calibrated with a well-distributed landslide dataset. As a first step, we compiled all available landslide inventories for Africa. These data were supplemented by additional landslide mapping with Google Earth in underrepresented regions. In this way, we compiled 60 landslide inventories from the literature (ca. 11000 landslides) and an additional 6500 landslides through mapping in Google Earth (including 1500 rockfalls). Various environmental variables, such as slope, lithology, soil characteristics, land use, precipitation, and seismic activity, were investigated for their significance in explaining the observed spatial patterns of landslides. To account for potential mapping biases in our dataset, we used Monte Carlo simulations that selected different subsets of mapped landslides, tested the significance of the considered environmental variables, and evaluated the performance of the fitted multiple logistic regression model against another subset of mapped landslides. Based on these analyses, we constructed two landslide susceptibility maps for Africa: one for all landslide types and one excluding rockfalls. In both maps, topography, lithology, and seismic activity were the most significant variables. The latter factor may be surprising, given the overall limited degree of seismicity in Africa; however, its significance indicates that frequent seismic events may serve as an important preparatory factor for landslides, a finding that concurs with several other recent studies. Rainfall explains a significant but limited part of the observed landslide pattern and becomes insignificant when rockfalls are also considered, which may be explained by the fact that a significant fraction of the mapped rockfalls occurred in the Sahara desert. Overall, both maps perform well in predicting intra-continental patterns of mass movements in Africa and explain about 80% of the observed variance in landslide occurrence. As a result, these maps may be a valuable tool for planning and risk reduction strategies.
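The Monte Carlo evaluation loop can be sketched as repeated fitting of a logistic susceptibility model on one subset of points and scoring on a held-out subset; the predictors and data below are placeholders, not the paper's variables:

```python
# Sketch: Monte Carlo subset resampling of a logistic susceptibility model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)
X = rng.normal(size=(2000, 3))                 # e.g., slope, lithology score, seismicity
p = 1 / (1 + np.exp(-(1.2 * X[:, 0] + 0.8 * X[:, 2] - 1)))
y = rng.random(2000) < p                       # landslide presence/absence

aucs = []
for _ in range(100):                           # repeated random subset splits
    idx = rng.permutation(2000)
    tr, te = idx[:1500], idx[1500:]
    model = LogisticRegression().fit(X[tr], y[tr])
    aucs.append(roc_auc_score(y[te], model.predict_proba(X[te])[:, 1]))
print(f"AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```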
Kunicki, Matthew A; Amaya Hernandez, Laura C; Davis, Kara L; Bacchetta, Rosa; Roncarolo, Maria-Grazia
2018-01-01
Human CD3 + CD4 + Th cells, FOXP3 + T regulatory (Treg) cells, and T regulatory type 1 (Tr1) cells are essential for ensuring peripheral immune response and tolerance, but the diversity of Th, Treg, and Tr1 cell subsets has not been fully characterized. Independent functional characterization of human Th1, Th2, Th17, T follicular helper (Tfh), Treg, and Tr1 cells has helped to define unique surface molecules, transcription factors, and signaling profiles for each subset. However, the adequacy of these markers to recapitulate the whole CD3 + CD4 + T cell compartment remains questionable. In this study, we examined CD3 + CD4 + T cell populations by single-cell mass cytometry. We characterize the CD3 + CD4 + Th, Treg, and Tr1 cell populations simultaneously across 23 memory T cell-associated surface and intracellular molecules. High-dimensional analysis identified several new subsets, in addition to the already defined CD3 + CD4 + Th, Treg, and Tr1 cell populations, for a total of 11 Th cell, 4 Treg, and 1 Tr1 cell subsets. Some of these subsets share markers previously thought to be selective for Treg, Th1, Th2, Th17, and Tfh cells, including CD194 (CCR4) + FOXP3 + Treg and CD183 (CXCR3) + T-bet + Th17 cell subsets. Unsupervised clustering displayed a phenotypic organization of CD3 + CD4 + T cells that confirmed their diversity but showed interrelation between the different subsets, including similarity between Th1-Th2-Tfh cell populations and Th17 cells, as well as similarity of Th2 cells with Treg cells. In conclusion, the use of single-cell mass cytometry provides a systems-level characterization of CD3 + CD4 + T cells in healthy human blood, which represents an important baseline reference to investigate abnormalities of different subsets in immune-mediated pathologies. Copyright © 2017 by The American Association of Immunologists, Inc.
A multi-service data management platform for scientific oceanographic products
NASA Astrophysics Data System (ADS)
D'Anca, Alessandro; Conte, Laura; Nassisi, Paola; Palazzo, Cosimo; Lecci, Rita; Cretì, Sergio; Mancini, Marco; Nuzzo, Alessandra; Mirto, Maria; Mannarini, Gianandrea; Coppini, Giovanni; Fiore, Sandro; Aloisio, Giovanni
2017-02-01
An efficient, secure, and interoperable data platform solution has been developed in the TESSA project to provide fast navigation of and access to the data stored in the data archive, as well as standards-based metadata management support. The platform mainly targets scientific users and high-level situational sea awareness services such as decision support systems (DSS). These datasets are accessible through three main components: the Data Access Service (DAS), the Metadata Service, and the Complex Data Analysis Module (CDAM). The DAS allows access to data stored in the archive by providing interfaces for different protocols and services for downloading, variable selection, data subsetting, and map generation. The Metadata Service is the heart of the information system of the TESSA products and completes the overall infrastructure for data and metadata management. This component enables data search and discovery and addresses interoperability by exploiting widely adopted standards for geospatial data. Finally, the CDAM represents the back-end of the TESSA DSS, performing on-demand complex data analysis tasks.
Evaluation of workplace exposure to respirable crystalline silica in Italy
Scarselli, Alberto; Corfiati, Marisa; Marzio, Davide Di; Iavicoli, Sergio
2014-01-01
Background: Crystalline silica is a human carcinogen and its use is widespread among construction, mining, foundries, and other manufacturing industries. Purpose: To evaluate occupational exposure to crystalline silica in Italy. Methods: Data were collected from exposure registries and descriptive statistics were calculated for exposure-related variables. The number of potentially exposed workers was estimated in a subset of industrial sectors. Linear mixed model analysis was performed to determine factors affecting the exposure level. Results: We found 1387 cases of crystalline silica exposure between 1996 and 2012. Exposure was most common in construction work (AM = 0.057 mg/m3, N = 505), and among miners and quarry workers (AM = 0.048 mg/m3, N = 238). We estimated that 41 643 workers were at risk of exposure in the selected industrial sectors during the same period. Conclusions: This study identified high-risk sectors for occupational exposure to crystalline silica, which can help guide targeted dust control interventions and health promotion campaigns in the workplace. PMID:25078346
The Greeks in the West: genetic signatures of the Hellenic colonisation in southern Italy and Sicily
Tofanelli, Sergio; Brisighelli, Francesca; Anagnostou, Paolo; Busby, George B J; Ferri, Gianmarco; Thomas, Mark G; Taglioli, Luca; Rudan, Igor; Zemunik, Tatijana; Hayward, Caroline; Bolnick, Deborah; Romano, Valentino; Cali, Francesco; Luiselli, Donata; Shepherd, Gillian B; Tusa, Sebastiano; Facella, Antonino; Capelli, Cristian
2016-01-01
Greek colonisation of South Italy and Sicily (Magna Graecia) was a defining event in European cultural history, although the demographic processes and genetic impacts involved have not been systematically investigated. Here, we combine high-resolution surveys of the variability of the uni-parentally inherited Y chromosome and mitochondrial DNA in selected samples of putative source and recipient populations with forward-in-time simulations of alternative demographic models to detect signatures of that impact. Using a subset of haplotypes chosen to represent historical sources, we recover a clear signature of Greek ancestry in East Sicily compatible with settlement from Euboea during the Archaic Period (eighth to fifth century BCE). We inferred a moderate sex bias in the numbers of individuals involved in the colonisation: the estimated numbers of migrants were a few thousand breeding men and a few hundred breeding women. Last, we demonstrate that studies aimed at quantifying Hellenic genetic flow by the proportion of specific lineages surviving in present-day populations may be misleading. PMID:26173964
Flexible categorization of relative stimulus strength by the optic tectum
Mysore, Shreesh P.; Knudsen, Eric I.
2011-01-01
Categorization is the process by which the brain segregates continuously variable stimuli into discrete groups. We report that patterns of neural population activity in the owl optic tectum (OT) categorize stimuli based on their relative strengths into “strongest” versus “other”. The category boundary shifts adaptively to track changes in the absolute strength of the strongest stimulus. This population-wide categorization is mediated by the responses of a small subset of neurons. Our data constitute the first direct demonstration of an explicit categorization of stimuli by a neural network based on relative stimulus strength or salience. The finding of categorization by the population code relaxes constraints on the properties of downstream decoders that might read out the location of the strongest stimulus. These results indicate that the ensemble neural code in the OT could mediate bottom-up stimulus selection for gaze and attention, a form of stimulus categorization in which the category boundary often shifts within hundreds of milliseconds. PMID:21613487
The Asthma Mobile Health Study, a large-scale clinical observational study using ResearchKit
Chan, Yu-Feng Yvonne; Wang, Pei; Rogers, Linda; Tignor, Nicole; Zweig, Micol; Hershman, Steven G; Genes, Nicholas; Scott, Erick R; Krock, Eric; Badgeley, Marcus; Edgar, Ron; Violante, Samantha; Wright, Rosalind; Powell, Charles A; Dudley, Joel T; Schadt, Eric E
2017-01-01
The feasibility of using mobile health applications to conduct observational clinical studies requires rigorous validation. Here, we report initial findings from the Asthma Mobile Health Study, a research study, including recruitment, consent, and enrollment, conducted entirely remotely by smartphone. We achieved secure bidirectional data flow between investigators and 7,593 participants from across the United States, including many with severe asthma. Our platform enabled prospective collection of longitudinal, multidimensional data (e.g., surveys, devices, geolocation, and air quality) in a subset of users over the 6-month study period. Consistent trending and correlation of interrelated variables support the quality of data obtained via this method. We detected increased reporting of asthma symptoms in regions affected by heat, pollen, and wildfires. Potential challenges with this technology include selection bias, low retention rates, reporting bias, and data security. These issues require attention to realize the full potential of mobile platforms in research and patient care. PMID:28288104
Millikan, Amy M; Weber, Natalya S; Niebuhr, David W; Torrey, E Fuller; Cowan, David N; Li, Yuanzhang; Kaminski, Brenda
2007-10-01
We are studying associations between selected biomarkers and schizophrenia or bipolar disorder among military personnel. To assess potential diagnostic misclassification and to estimate the date of illness onset, we reviewed medical records for a subset of cases. Two psychiatrists independently reviewed 182 service medical records retrieved from the Department of Veterans Affairs. Data were evaluated for diagnostic concordance between database diagnoses and reviewers. Interreviewer variability was measured by using proportion of agreement and the kappa statistic. Data were abstracted to estimate date of onset. High levels of agreement existed between database diagnoses and reviewers (proportion, 94.7%; kappa = 0.88) and between reviewers (proportion, 92.3%; kappa = 0.87). The median time between illness onset and initiation of medical discharge was 1.6 and 1.1 years for schizophrenia and bipolar disorder, respectively. High levels of agreement between investigators and database diagnoses indicate that diagnostic misclassification is unlikely. Discharge procedure initiation date provides a suitable surrogate for disease onset.