A Non-Intrusive Algorithm for Sensitivity Analysis of Chaotic Flow Simulations
NASA Technical Reports Server (NTRS)
Blonigan, Patrick J.; Wang, Qiqi; Nielsen, Eric J.; Diskin, Boris
2017-01-01
We demonstrate a novel algorithm for computing the sensitivity of statistics in chaotic flow simulations to parameter perturbations. The algorithm is non-intrusive but requires exposing an interface. Based on the principle of shadowing in dynamical systems, this algorithm is designed to reduce the effect of the sampling error in computing sensitivity of statistics in chaotic simulations. We compare the effectiveness of this method to that of the conventional finite difference method.
Developing a cosmic ray muon sampling capability for muon tomography and monitoring applications
NASA Astrophysics Data System (ADS)
Chatzidakis, S.; Chrysikopoulou, S.; Tsoukalas, L. H.
2015-12-01
In this study, a cosmic ray muon sampling capability using a phenomenological model that captures the main characteristics of the experimentally measured spectrum coupled with a set of statistical algorithms is developed. The "muon generator" produces muons with zenith angles in the range 0-90° and energies in the range 1-100 GeV and is suitable for Monte Carlo simulations with emphasis on muon tomographic and monitoring applications. The muon energy distribution is described by the Smith and Duller (1959) [35] phenomenological model. Statistical algorithms are then employed for generating random samples. The inverse transform provides a means to generate samples from the muon angular distribution, whereas the Acceptance-Rejection and Metropolis-Hastings algorithms are employed to provide the energy component. The predictions for muon energies 1-60 GeV and zenith angles 0-90° are validated with a series of actual spectrum measurements and with estimates from the software library CRY. The results confirm the validity of the phenomenological model and the applicability of the statistical algorithms to generate polyenergetic-polydirectional muons. The response of the algorithms and the impact of critical parameters on computation time and computed results were investigated. Final output from the proposed "muon generator" is a look-up table that contains the sampled muon angles and energies and can be easily integrated into Monte Carlo particle simulation codes such as Geant4 and MCNP.
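To illustrate the sampling machinery described above, here is a minimal Python sketch: inverse-transform sampling for a cos^2(theta) zenith intensity (a common approximation, not necessarily the exact angular model of the paper) and acceptance-rejection for the energy component, using a placeholder E^-2.7 power law in place of the Smith and Duller spectrum. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_zenith(n):
    """Inverse-transform sampling for a cos^2(theta) zenith intensity.

    p(cos theta) ~ cos^2(theta) on [0, 1]  =>  CDF = u^3  =>  u = U^(1/3).
    (A common approximation; not the exact angular model of the paper.)
    """
    u = rng.random(n) ** (1.0 / 3.0)
    return np.degrees(np.arccos(u))          # zenith angle in degrees, 0-90

def energy_pdf(E):
    """Placeholder spectrum ~ E^-2.7 on [1, 100] GeV (NOT the Smith-Duller model)."""
    return E ** -2.7

def sample_energy(n, e_min=1.0, e_max=100.0):
    """Acceptance-rejection with a uniform proposal and envelope M = max pdf."""
    M = energy_pdf(e_min)                    # pdf is decreasing, so max is at e_min
    out = []
    while len(out) < n:
        e = rng.uniform(e_min, e_max)
        if rng.random() * M <= energy_pdf(e):
            out.append(e)
    return np.array(out)

# Look-up table of sampled (zenith angle, energy) pairs, the kind of output the
# abstract describes feeding into Geant4 or MCNP.
table = np.column_stack([sample_zenith(10000), sample_energy(10000)])
print(table[:5])
```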
Bellenguez, Céline; Strange, Amy; Freeman, Colin; Donnelly, Peter; Spencer, Chris C A
2012-01-01
High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotype frequencies across individuals. The complexities of the sample collection process and the potential for errors in the experimental assay can lead to biases and artefacts in an individual's inferred genotypes. Rather than attempting to model these complications, it has become a standard practice to remove individuals whose genome-wide data differ from the sample at large. Here we describe a simple, but robust, statistical algorithm to identify samples with atypical summaries of genome-wide variation. Its use as a semi-automated quality control tool is demonstrated using several summary statistics, selected to identify different potential problems, and it is applied to two different genotyping platforms and sample collections. The algorithm is written in R and is freely available at www.well.ox.ac.uk/chris-spencer. Contact: chris.spencer@well.ox.ac.uk. Supplementary data are available at Bioinformatics online.
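A minimal sketch of the general idea, not the authors' released R code: flag individuals whose genome-wide summary statistic (for example heterozygosity or call rate) lies far from the sample median on a robust scale. The 6-MAD threshold is an assumed value.

```python
import numpy as np

def flag_atypical_samples(summary, n_mad=6.0):
    """Flag individuals whose genome-wide summary statistic is atypical.

    `summary` holds one value per individual.  Samples farther than `n_mad`
    robust standard deviations from the median are flagged.  This is a generic
    robust-outlier sketch, not the exact published algorithm.
    """
    med = np.median(summary)
    mad = np.median(np.abs(summary - med)) * 1.4826   # MAD rescaled to ~sigma
    robust_z = (summary - med) / mad
    return np.abs(robust_z) > n_mad

# Example: flag samples with unusual heterozygosity (hypothetical data)
het = np.random.default_rng(1).normal(0.32, 0.01, size=2000)
het[:3] += 0.1                                        # three contaminated samples
print(np.where(flag_atypical_samples(het))[0])
```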
NASA Astrophysics Data System (ADS)
Williams, Arnold C.; Pachowicz, Peter W.
2004-09-01
Current mine detection research indicates that no single sensor or single look from a sensor will detect mines/minefields in a real-time manner at a performance level suitable for a forward maneuver unit. Hence, the integrated development of detectors and fusion algorithms is of primary importance. A problem in this development process has been the evaluation of these algorithms with relatively small data sets, leading to anecdotal and frequently overtrained results. These anecdotal results are often unreliable and conflicting among various sensors and algorithms. Consequently, the physical phenomena that ought to be exploited and the performance benefits of this exploitation are often ambiguous. The Army RDECOM CERDEC Night Vision and Electronic Sensors Directorate has collected large amounts of multisensor data such that statistically significant evaluations of detection and fusion algorithms can be obtained. Even with these large data sets, care must be taken in algorithm design and data processing to achieve statistically significant performance results for combined detectors and fusion algorithms. This paper discusses statistically significant detection and combined multilook fusion results for the Ellipse Detector (ED) and the Piecewise Level Fusion Algorithm (PLFA). These statistically significant performance results are characterized by ROC curves that have been obtained through processing this multilook data for the high resolution SAR data of the Veridian X-Band radar. We discuss the implications of these results for mine detection and the importance of statistical significance, sample size, ground truth, and algorithm design in performance evaluation.
Validation of Statistical Sampling Algorithms in Visual Sample Plan (VSP): Summary Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nuffer, Lisa L; Sego, Landon H.; Wilson, John E.
2009-02-18
The U.S. Department of Homeland Security, Office of Technology Development (OTD) contracted with a set of U.S. Department of Energy national laboratories, including the Pacific Northwest National Laboratory (PNNL), to write a Remediation Guidance for Major Airports After a Chemical Attack. The report identifies key activities and issues that should be considered by a typical major airport following an incident involving release of a toxic chemical agent. Four experimental tasks were identified that would require further research in order to supplement the Remediation Guidance. One of the tasks, Task 4, OTD Chemical Remediation Statistical Sampling Design Validation, dealt with statistical sampling algorithm validation. This report documents the results of the sampling design validation conducted for Task 4. In 2005, the Government Accountability Office (GAO) performed a review of the past U.S. responses to Anthrax terrorist cases. Part of the motivation for this PNNL report was a major GAO finding that there was a lack of validated sampling strategies in the U.S. response to Anthrax cases. The report (GAO 2005) recommended that probability-based methods be used for sampling design in order to address confidence in the results, particularly when all sample results showed no remaining contamination. The GAO also expressed a desire that the methods be validated, which is the main purpose of this PNNL report. The objective of this study was to validate probability-based statistical sampling designs and the algorithms pertinent to within-building sampling that allow the user to prescribe or evaluate confidence levels of conclusions based on data collected as guided by the statistical sampling designs. Specifically, the designs found in the Visual Sample Plan (VSP) software were evaluated. VSP was used to calculate the number of samples and the sample location for a variety of sampling plans applied to an actual release site. Most of the sampling designs validated are probability based, meaning samples are located randomly (or on a randomly placed grid) so no bias enters into the placement of samples, and the number of samples is calculated such that IF the amount and spatial extent of contamination exceeds levels of concern, at least one of the samples would be taken from a contaminated area, at least X% of the time. Hence, "validation" of the statistical sampling algorithms is defined herein to mean ensuring that the "X%" (confidence) is actually met.
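The core probability-based calculation can be written down compactly. The sketch below uses the textbook "hotspot" form, 1 - (1 - p)^n >= X, under simple random sampling; VSP's actual designs include grid-based and compliance-sampling variants with additional parameters, so this is only a simplified stand-in.

```python
import math

def n_samples_for_confidence(contaminated_fraction, confidence):
    """Smallest n with P(at least one sample hits contamination) >= confidence,
    assuming simple random sampling and that a fraction `contaminated_fraction`
    of the surface exceeds the level of concern.

        1 - (1 - p)^n >= X   =>   n >= ln(1 - X) / ln(1 - p)
    """
    p, X = contaminated_fraction, confidence
    return math.ceil(math.log(1.0 - X) / math.log(1.0 - p))

# Example: 95% confidence of sampling a contaminated area covering 1% of the site
print(n_samples_for_confidence(0.01, 0.95))   # 299 samples
```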
Roh, Min K; Gillespie, Dan T; Petzold, Linda R
2010-11-07
The weighted stochastic simulation algorithm (wSSA) was developed by Kuwahara and Mura [J. Chem. Phys. 129, 165101 (2008)] to efficiently estimate the probabilities of rare events in discrete stochastic systems. The wSSA uses importance sampling to enhance the statistical accuracy in the estimation of the probability of the rare event. The original algorithm biases the reaction selection step with a fixed importance sampling parameter. In this paper, we introduce a novel method where the biasing parameter is state-dependent. The new method features improved accuracy, efficiency, and robustness.
USDA-ARS?s Scientific Manuscript database
The primary advantage of Dynamically Dimensioned Search algorithm (DDS) is that it outperforms many other optimization techniques in both convergence speed and the ability in searching for parameter sets that satisfy statistical guidelines while requiring only one algorithm parameter (perturbation f...
BCM: toolkit for Bayesian analysis of Computational Models using samplers.
Thijssen, Bram; Dijkstra, Tjeerd M H; Heskes, Tom; Wessels, Lodewyk F A
2016-10-21
Computational models in biology are characterized by a large degree of uncertainty. This uncertainty can be analyzed with Bayesian statistics; however, the sampling algorithms that are frequently used for calculating Bayesian statistical estimates are computationally demanding, and each algorithm has unique advantages and disadvantages. It is typically unclear, before starting an analysis, which algorithm will perform well on a given computational model. We present BCM, a toolkit for the Bayesian analysis of Computational Models using samplers. It provides efficient, multithreaded implementations of eleven algorithms for sampling from posterior probability distributions and for calculating marginal likelihoods. BCM includes tools to simplify the process of model specification and scripts for visualizing the results. The flexible architecture allows it to be used on diverse types of biological computational models. In an example inference task using a model of the cell cycle based on ordinary differential equations, BCM is significantly more efficient than existing software packages, allowing more challenging inference problems to be solved. BCM represents an efficient one-stop-shop for computational modelers wishing to use sampler-based Bayesian statistics.
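As a reminder of what such samplers do, here is a minimal random-walk Metropolis sketch in Python. It only illustrates the kind of algorithm a toolkit like BCM wraps; it is not BCM's interface, and the model, data, and step size are placeholders. Real problems need tuning, multiple chains, and convergence diagnostics.

```python
import numpy as np

def metropolis_hastings(log_posterior, x0, n_steps=5000, step=0.5, seed=0):
    """Minimal random-walk Metropolis sampler over a log-posterior function."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_posterior(x)
    chain = []
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal(x.shape)
        lp_prop = log_posterior(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(1, ratio)
            x, lp = prop, lp_prop
        chain.append(x.copy())
    return np.array(chain)

# Illustrative posterior: Gaussian prior on theta plus a Gaussian likelihood
# around hypothetical data y (all values made up for the example).
y = np.array([1.2, 0.8, 1.1])
log_post = lambda th: -0.5 * th[0] ** 2 - 0.5 * np.sum((y - th[0]) ** 2)
samples = metropolis_hastings(log_post, x0=[0.0])
print(samples.mean(), samples.std())
```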
Statistics for characterizing data on the periphery
DOE Office of Scientific and Technical Information (OSTI.GOV)
Theiler, James P; Hush, Donald R
2010-01-01
We introduce a class of statistics for characterizing the periphery of a distribution, and show that these statistics are particularly valuable for problems in target detection. Because so many detection algorithms are rooted in Gaussian statistics, we concentrate on ellipsoidal models of high-dimensional data distributions (that is to say: covariance matrices), but we recommend several alternatives to the sample covariance matrix that more efficiently model the periphery of a distribution, and can more effectively detect anomalous data samples.
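A small sketch of the setting: score samples by squared Mahalanobis distance under an ellipsoidal model, with a simple shrinkage estimator standing in for the paper's peripheral alternatives to the sample covariance matrix. The data and shrinkage weight below are illustrative only.

```python
import numpy as np

def mahalanobis_sq(X, mean, cov):
    """Squared Mahalanobis distance of each row of X under an ellipsoidal model."""
    diff = X - mean
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

rng = np.random.default_rng(0)
background = rng.standard_normal((500, 10))          # hypothetical clutter data
targets = background[:5] + 4.0                       # a few anomalous samples

mean = background.mean(axis=0)
sample_cov = np.cov(background, rowvar=False)

# One simple alternative to the plain sample covariance: shrinkage toward the
# identity.  (The paper proposes its own periphery-focused estimators; this is
# only a stand-in to show where they would plug in.)
alpha = 0.1
shrunk_cov = (1 - alpha) * sample_cov + alpha * np.eye(sample_cov.shape[0])

scores = mahalanobis_sq(np.vstack([targets, background[5:15]]), mean, shrunk_cov)
print(scores.round(1))        # anomalies should score much higher than clutter
```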
Kaspi, Omer; Yosipof, Abraham; Senderowitz, Hanoch
2017-06-06
An important aspect of chemoinformatics and material-informatics is the usage of machine learning algorithms to build Quantitative Structure Activity Relationship (QSAR) models. The RANdom SAmple Consensus (RANSAC) algorithm is a predictive modeling tool widely used in the image processing field for cleaning datasets from noise. RANSAC could be used as a "one stop shop" algorithm for developing and validating QSAR models, performing outlier removal, descriptor selection, model development and predictions for test set samples using an applicability domain. For "future" predictions (i.e., for samples not included in the original test set) RANSAC provides a statistical estimate for the probability of obtaining reliable predictions, i.e., predictions within a pre-defined number of standard deviations from the true values. In this work we describe the first application of RANSAC in material informatics, focusing on the analysis of solar cells. We demonstrate that for three datasets representing different metal oxide (MO) based solar cell libraries, RANSAC-derived models select descriptors previously shown to correlate with key photovoltaic properties and lead to good predictive statistics for these properties. These models were subsequently used to predict the properties of virtual solar cell libraries, highlighting interesting dependencies of PV properties on MO compositions.
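For readers unfamiliar with RANSAC, the following sketch shows generic RANSAC regression on simulated descriptor data using scikit-learn's RANSACRegressor. It is not the authors' full descriptor-selection and applicability-domain workflow, and the residual threshold and data are assumed values.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Hypothetical QSAR-style data: descriptor matrix X, measured property y,
# with a handful of outlying samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 0.0]) + 0.1 * rng.standard_normal(120)
y[:8] += 10.0                                   # outliers

# Default base estimator is ordinary least squares; RANSAC repeatedly fits it
# on random subsets and keeps the consensus (inlier) model.
ransac = RANSACRegressor(residual_threshold=1.0, random_state=0)
ransac.fit(X, y)

print("outliers removed:", (~ransac.inlier_mask_).sum())
print("coefficients:", ransac.estimator_.coef_.round(2))
```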
Productive Information Foraging
NASA Technical Reports Server (NTRS)
Furlong, P. Michael; Dille, Michael
2016-01-01
This paper presents a new algorithm for autonomous on-line exploration in unknown environments. The objective of the algorithm is to free robot scientists from extensive preliminary site investigation while still being able to collect meaningful data. We simulate a common form of exploration task for an autonomous robot involving sampling the environment at various locations and compare performance with a simpler existing algorithm that is also denied global information. The result of the experiment shows that the new algorithm has a statistically significant improvement in performance with a significant effect size for a range of costs for taking sampling actions.
Statistical Methods in AI: Rare Event Learning Using Associative Rules and Higher-Order Statistics
NASA Astrophysics Data System (ADS)
Iyer, V.; Shetty, S.; Iyengar, S. S.
2015-07-01
Rare event learning has not been actively researched until recently, owing to the unavailability of algorithms that can deal with big samples. The research addresses spatio-temporal streams from multi-resolution sensors to find actionable items from a perspective of real-time algorithms. This computing framework is independent of the number of input samples, application domain, and labelled or label-less streams. A sampling overlap algorithm such as Brooks-Iyengar is used for dealing with noisy sensor streams. We extend the existing noise pre-processing algorithms using Data-Cleaning trees. Pre-processing using an ensemble of trees with bagging and multi-target regression showed robustness to random noise and missing data. As spatio-temporal streams are highly statistically correlated, we prove that a temporal window based sampling from sensor data streams converges after n samples using Hoeffding bounds, which can then be used for fast prediction of new samples in real time. The Data-Cleaning tree model uses a nonparametric node splitting technique, which can be learned in an iterative way that scales linearly in memory consumption for any size input stream. The improved task based ensemble extraction is compared with non-linear computation models using various SVM kernels for speed and accuracy. We show using empirical datasets that the explicit rule learning computation is linear in time and is only dependent on the number of leaves present in the tree ensemble. The use of unpruned trees (t) in our proposed ensemble always yields a minimum number (m) of leaves, keeping pre-processing computation to n × t log m compared to N² for the Gram matrix. We also show that the task based feature induction yields higher Quality of Data (QoD) in the feature space compared to kernel methods using the Gram matrix.
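The Hoeffding-bound argument fixes a window size independently of the stream length. A generic version of that calculation (values bounded in a range R) is sketched below; the paper's convergence proof for temporal windows is more involved, so treat this as the standard bound only.

```python
import math

def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Number of samples n so that the window mean of a bounded stream statistic
    is within `epsilon` of its expectation with probability >= 1 - delta:

        P(|mean_n - mu| >= eps) <= 2 exp(-2 n eps^2 / R^2)
        =>  n >= R^2 * ln(2 / delta) / (2 eps^2)
    """
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

print(hoeffding_sample_size(epsilon=0.05, delta=0.01))   # 1060 samples
```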
Statistical Inference on Memory Structure of Processes and Its Applications to Information Theory
2016-05-12
... proved. Second, a statistical method is developed to estimate the memory depth of discrete-time and continuously-valued time series from a sample. (A practical algorithm to compute the estimator is a work in progress.) Third, finitely-valued spatial processes ... Keywords: mathematical statistics; time series; Markov chains; random ... Sponsoring agency: U.S. Army Research Office, P.O. Box 12211, Research Triangle Park, NC 27709-2211.
Chen, Nan; Majda, Andrew J
2017-12-05
Solving the Fokker-Planck equation for high-dimensional complex dynamical systems is an important issue. Recently, the authors developed efficient statistically accurate algorithms for solving the Fokker-Planck equations associated with high-dimensional nonlinear turbulent dynamical systems with conditional Gaussian structures, which contain many strong non-Gaussian features such as intermittency and fat-tailed probability density functions (PDFs). The algorithms involve a hybrid strategy with a small number of samples [Formula: see text], where a conditional Gaussian mixture in a high-dimensional subspace via an extremely efficient parametric method is combined with a judicious Gaussian kernel density estimation in the remaining low-dimensional subspace. In this article, two effective strategies are developed and incorporated into these algorithms. The first strategy involves a judicious block decomposition of the conditional covariance matrix such that the evolutions of different blocks have no interactions, which allows an extremely efficient parallel computation due to the small size of each individual block. The second strategy exploits statistical symmetry for a further reduction of [Formula: see text] The resulting algorithms can efficiently solve the Fokker-Planck equation with strongly non-Gaussian PDFs in much higher dimensions even with orders in the millions and thus beat the curse of dimension. The algorithms are applied to a [Formula: see text]-dimensional stochastic coupled FitzHugh-Nagumo model for excitable media. An accurate recovery of both the transient and equilibrium non-Gaussian PDFs requires only [Formula: see text] samples! In addition, the block decomposition facilitates the algorithms to efficiently capture the distinct non-Gaussian features at different locations in a [Formula: see text]-dimensional two-layer inhomogeneous Lorenz 96 model, using only [Formula: see text] samples. Copyright © 2017 the Author(s). Published by PNAS.
Tang, Jie; Nett, Brian E; Chen, Guang-Hong
2009-10-07
Of all available reconstruction methods, statistical iterative reconstruction algorithms appear particularly promising since they enable accurate physical noise modeling. The newly developed compressive sampling/compressed sensing (CS) algorithm has shown the potential to accurately reconstruct images from highly undersampled data. The CS algorithm can be implemented in the statistical reconstruction framework as well. In this study, we compared the performance of two standard statistical reconstruction algorithms (penalized weighted least squares and q-GGMRF) to the CS algorithm. In assessing the image quality using these iterative reconstructions, it is critical to utilize realistic background anatomy as the reconstruction results are object dependent. A cadaver head was scanned on a Varian Trilogy system at different dose levels. Several figures of merit, including the relative root mean square error and a quality factor which accounts for the noise performance and the spatial resolution, were introduced to objectively evaluate reconstruction performance. A comparison is presented between the three algorithms at several dose levels for a constant undersampling factor. To facilitate this comparison, the original CS method was formulated in the framework of the statistical image reconstruction algorithms. Important conclusions of the measurements from our studies are that (1) for realistic neuro-anatomy, over 100 projections are required to avoid streak artifacts in the reconstructed images even with CS reconstruction, (2) regardless of the algorithm employed, it is beneficial to distribute the total dose to more views as long as each view remains quantum noise limited and (3) the total variation-based CS method is not appropriate for very low dose levels because while it can mitigate streaking artifacts, the images exhibit patchy behavior, which is potentially harmful for medical diagnosis.
Designing image segmentation studies: Statistical power, sample size and reference standard quality.
Gibson, Eli; Hu, Yipeng; Huisman, Henkjan J; Barratt, Dean C
2017-12-01
Segmentation algorithms are typically evaluated by comparison to an accepted reference standard. The cost of generating accurate reference standards for medical image segmentation can be substantial. Since the study cost and the likelihood of detecting a clinically meaningful difference in accuracy both depend on the size and on the quality of the study reference standard, balancing these trade-offs supports the efficient use of research resources. In this work, we derive a statistical power calculation that enables researchers to estimate the appropriate sample size to detect clinically meaningful differences in segmentation accuracy (i.e. the proportion of voxels matching the reference standard) between two algorithms. Furthermore, we derive a formula to relate reference standard errors to their effect on the sample sizes of studies using lower-quality (but potentially more affordable and practically available) reference standards. The accuracy of the derived sample size formula was estimated through Monte Carlo simulation, demonstrating, with 95% confidence, a predicted statistical power within 4% of simulated values across a range of model parameters. This corresponds to sample size errors of less than 4 subjects and errors in the detectable accuracy difference less than 0.6%. The applicability of the formula to real-world data was assessed using bootstrap resampling simulations for pairs of algorithms from the PROMISE12 prostate MR segmentation challenge data set. The model predicted the simulated power for the majority of algorithm pairs within 4% for simulated experiments using a high-quality reference standard and within 6% for simulated experiments using a low-quality reference standard. A case study, also based on the PROMISE12 data, illustrates using the formulae to evaluate whether to use a lower-quality reference standard in a prostate segmentation study. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.
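To make the idea concrete, the sketch below computes a standard normal-approximation sample size for detecting a difference between two proportions (here, the mean voxel-wise accuracy of two segmentation algorithms). It is the generic unpaired formula only; the paper's derivation additionally models reference-standard quality and the paired, correlated nature of segmentation comparisons, so the numbers here are not the paper's.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Generic normal-approximation sample size for a difference in two
    proportions (not the paper's reference-standard-aware formula)."""
    za = norm.ppf(1 - alpha / 2)
    zb = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((za + zb) ** 2 * var / (p1 - p2) ** 2)

# Example: detect a 95% vs 93% accuracy difference at 80% power, alpha = 0.05
print(n_per_group(0.95, 0.93), "subjects per algorithm arm")
```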
Classical Statistics and Statistical Learning in Imaging Neuroscience
Bzdok, Danilo
2017-01-01
Brain-imaging research has predominantly generated insight by means of classical statistics, including regression-type analyses and null-hypothesis testing using t-test and ANOVA. Throughout recent years, statistical learning methods enjoy increasing popularity especially for applications in rich and complex data, including cross-validated out-of-sample prediction using pattern classification and sparsity-inducing regression. This concept paper discusses the implications of inferential justifications and algorithmic methodologies in common data analysis scenarios in neuroimaging. It is retraced how classical statistics and statistical learning originated from different historical contexts, build on different theoretical foundations, make different assumptions, and evaluate different outcome metrics to permit differently nuanced conclusions. The present considerations should help reduce current confusion between model-driven classical hypothesis testing and data-driven learning algorithms for investigating the brain with imaging techniques. PMID:29056896
Wang, JianLi; Sareen, Jitender; Patten, Scott; Bolton, James; Schmitz, Norbert; Birney, Arden
2014-05-01
Prediction algorithms are useful for making clinical decisions and for population health planning. However, such prediction algorithms for first onset of major depression do not exist. The objective of this study was to develop and validate a prediction algorithm for first onset of major depression in the general population. Longitudinal study design with approximately 3-year follow-up. The study was based on data from a nationally representative sample of the US general population. A total of 28 059 individuals who participated in Waves 1 and 2 of the US National Epidemiologic Survey on Alcohol and Related Conditions and who had not had major depression at Wave 1 were included. The prediction algorithm was developed using logistic regression modelling in 21 813 participants from three census regions. The algorithm was validated in participants from the 4th census region (n=6246). Major depression occurred since Wave 1 of the National Epidemiologic Survey on Alcohol and Related Conditions, assessed by the Alcohol Use Disorder and Associated Disabilities Interview Schedule based on the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition. A prediction algorithm containing 17 unique risk factors was developed. The algorithm had good discriminative power (C statistic=0.7538, 95% CI 0.7378 to 0.7699) and excellent calibration (F-adjusted test=1.00, p=0.448) with the weighted data. In the validation sample, the algorithm had a C statistic of 0.7259 and excellent calibration (Hosmer-Lemeshow χ²=3.41, p=0.906). The developed prediction algorithm has good discrimination and calibration capacity. It can be used by clinicians, mental health policy-makers, service planners and the general public to predict future risk of having major depression. The application of the algorithm may lead to increased personalisation of treatment, better clinical decisions and more optimal mental health service planning.
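The general recipe (develop in one region, validate in another, report the C statistic) can be sketched as follows with simulated data. The published algorithm uses 17 specific risk factors, survey weights and calibration tests that are not reproduced here; everything below is a placeholder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated cohort: 17 risk factors and a first-onset indicator.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 17))
logit = X @ rng.normal(0, 0.4, 17) - 2.5
y = rng.random(5000) < 1 / (1 + np.exp(-logit))

# Develop on one subsample, validate on a held-out subsample, report the
# C statistic (area under the ROC curve) in the validation set.
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
c_statistic = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation C statistic: {c_statistic:.3f}")
```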
Software for Data Analysis with Graphical Models
NASA Technical Reports Server (NTRS)
Buntine, Wray L.; Roy, H. Scott
1994-01-01
Probabilistic graphical models are being used widely in artificial intelligence and statistics, for instance, in diagnosis and expert systems, as a framework for representing and reasoning with probabilities and independencies. They come with corresponding algorithms for performing statistical inference. This offers a unifying framework for prototyping and/or generating data analysis algorithms from graphical specifications. This paper illustrates the framework with an example and then presents some basic techniques for the task: problem decomposition and the calculation of exact Bayes factors. Other tools already developed, such as automatic differentiation, Gibbs sampling, and use of the EM algorithm, make this a broad basis for the generation of data analysis software.
Ensembles of radial basis function networks for spectroscopic detection of cervical precancer
NASA Technical Reports Server (NTRS)
Tumer, K.; Ramanujam, N.; Ghosh, J.; Richards-Kortum, R.
1998-01-01
The mortality related to cervical cancer can be substantially reduced through early detection and treatment. However, current detection techniques, such as Pap smear and colposcopy, fail to achieve a concurrently high sensitivity and specificity. In vivo fluorescence spectroscopy is a technique which quickly, noninvasively and quantitatively probes the biochemical and morphological changes that occur in precancerous tissue. A multivariate statistical algorithm was used to extract clinically useful information from tissue spectra acquired from 361 cervical sites from 95 patients at 337-, 380-, and 460-nm excitation wavelengths. The multivariate statistical analysis was also employed to reduce the number of fluorescence excitation-emission wavelength pairs required to discriminate healthy tissue samples from precancerous tissue samples. The use of connectionist methods such as multilayered perceptrons, radial basis function (RBF) networks, and ensembles of such networks was investigated. RBF ensemble algorithms based on fluorescence spectra potentially provide automated and near real-time implementation of precancer detection in the hands of nonexperts. The results are more reliable, direct, and accurate than those achieved by either human experts or multivariate statistical algorithms.
Nangia, Shikha; Jasper, Ahren W; Miller, Thomas F; Truhlar, Donald G
2004-02-22
The most widely used algorithm for Monte Carlo sampling of electronic transitions in trajectory surface hopping (TSH) calculations is the so-called anteater algorithm, which is inefficient for sampling low-probability nonadiabatic events. We present a new sampling scheme (called the army ants algorithm) for carrying out TSH calculations that is applicable to systems with any strength of coupling. The army ants algorithm is a form of rare event sampling whose efficiency is controlled by an input parameter. By choosing a suitable value of the input parameter the army ants algorithm can be reduced to the anteater algorithm (which is efficient for strongly coupled cases), and by optimizing the parameter the army ants algorithm may be efficiently applied to systems with low-probability events. To demonstrate the efficiency of the army ants algorithm, we performed atom-diatom scattering calculations on a model system involving weakly coupled electronic states. Fully converged quantum mechanical calculations were performed, and the probabilities for nonadiabatic reaction and nonreactive deexcitation (quenching) were found to be on the order of 10^-8. For such low-probability events the anteater sampling scheme requires a large number of trajectories (approximately 10^10) to obtain good statistics and converged semiclassical results. In contrast, by using the new army ants algorithm converged results were obtained by running 10^5 trajectories. Furthermore, the results were found to be in excellent agreement with the quantum mechanical results. Sampling errors were estimated using the bootstrap method, which is validated for use with the army ants algorithm. (c) 2004 American Institute of Physics.
NASA Technical Reports Server (NTRS)
Bell, Thomas L.; Kundu, Prasun K.; Einaudi, Franco (Technical Monitor)
2000-01-01
Estimates from TRMM satellite data of monthly total rainfall over an area are subject to substantial sampling errors due to the limited number of visits to the area by the satellite during the month. Quantitative comparisons of TRMM averages with data collected by other satellites and by ground-based systems require some estimate of the size of this sampling error. A method of estimating this sampling error based on the actual statistics of the TRMM observations and on some modeling work has been developed. "Sampling error" in TRMM monthly averages is defined here relative to the monthly total a hypothetical satellite permanently stationed above the area would have reported. "Sampling error" therefore includes contributions from the random and systematic errors introduced by the satellite remote sensing system. As part of our long-term goal of providing error estimates for each grid point accessible to the TRMM instruments, sampling error estimates for TRMM based on rain retrievals from TRMM microwave (TMI) data are compared for different times of the year and different oceanic areas (to minimize changes in the statistics due to algorithmic differences over land and ocean). Changes in sampling error estimates due to changes in rain statistics arising from 1) evolution of the official algorithms used to process the data and 2) differences from other remote sensing systems, such as the Defense Meteorological Satellite Program (DMSP) Special Sensor Microwave/Imager (SSM/I), are analyzed.
Note: A pure-sampling quantum Monte Carlo algorithm with independent Metropolis.
Vrbik, Jan; Ospadov, Egor; Rothstein, Stuart M
2016-07-14
Recently, Ospadov and Rothstein published a pure-sampling quantum Monte Carlo algorithm (PSQMC) that features an auxiliary Path Z that connects the midpoints of the current and proposed Paths X and Y, respectively. When sufficiently long, Path Z provides statistical independence of Paths X and Y. Under those conditions, the Metropolis decision used in PSQMC is done without any approximation, i.e., not requiring microscopic reversibility and without having to introduce any G(x → x'; τ) factors into its decision function. This is a unique feature that contrasts with all competing reptation algorithms in the literature. An example illustrates that dependence of Paths X and Y has adverse consequences for pure sampling.
Tenan, Matthew S; Tweedell, Andrew J; Haynes, Courtney A
2017-01-01
The timing of muscle activity is a commonly applied analytic method to understand how the nervous system controls movement. This study systematically evaluates six classes of standard and statistical algorithms to determine muscle onset in both experimental surface electromyography (EMG) and simulated EMG with a known onset time. Eighteen participants had EMG collected from the biceps brachii and vastus lateralis while performing a biceps curl or knee extension, respectively. Three established methods and three statistical methods for EMG onset were evaluated. Linear envelope, Teager-Kaiser energy operator + linear envelope and sample entropy were the established methods evaluated while general time series mean/variance, sequential and batch processing of parametric and nonparametric tools, and Bayesian changepoint analysis were the statistical techniques used. Visual EMG onset (experimental data) and objective EMG onset (simulated data) were compared with algorithmic EMG onset via root mean square error and linear regression models for stepwise elimination of inferior algorithms. The top algorithms for both data types were analyzed for their mean agreement with the gold standard onset and evaluation of 95% confidence intervals. The top algorithms were all Bayesian changepoint analysis iterations where the parameter of the prior (p0) was zero. The best performing Bayesian algorithms were p0 = 0 and a posterior probability for onset determination at 60-90%. While existing algorithms performed reasonably, the Bayesian changepoint analysis methodology provides greater reliability and accuracy when determining the singular onset of EMG activity in a time series. Further research is needed to determine if this class of algorithms perform equally well when the time series has multiple bursts of muscle activity.
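A stripped-down illustration of the winning idea: a conjugate Bayesian posterior over a single changepoint in a Gaussian series with known noise level, unknown pre/post means, and a uniform prior on the changepoint location. This is not the specific Bayesian changepoint implementation evaluated in the study; the priors, known-variance assumption, and the simulated envelope are all assumptions made for the sketch.

```python
import numpy as np

def log_marginal_gaussian(x, sigma):
    """log of integral prod_i N(x_i; mu, sigma^2) dmu under a flat prior on mu."""
    n = len(x)
    s = np.sum((x - x.mean()) ** 2)
    return (-0.5 * (n - 1) * np.log(2 * np.pi * sigma ** 2)
            - 0.5 * np.log(n) - s / (2 * sigma ** 2))

def changepoint_posterior(x, sigma):
    """Posterior over a single changepoint location (uniform prior on location)."""
    taus = np.arange(2, len(x) - 2)
    logp = np.array([log_marginal_gaussian(x[:t], sigma)
                     + log_marginal_gaussian(x[t:], sigma) for t in taus])
    p = np.exp(logp - logp.max())
    return taus, p / p.sum()

# Simulated rectified EMG envelope: baseline noise, then activity from index 300.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])
taus, post = changepoint_posterior(signal, sigma=1.0)
print("MAP onset index:", taus[np.argmax(post)])
```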
NASA Technical Reports Server (NTRS)
Racette, Paul; Lang, Roger; Zhang, Zhao-Nan; Zacharias, David; Krebs, Carolyn A. (Technical Monitor)
2002-01-01
Radiometers must be periodically calibrated because the receiver response fluctuates. Many techniques exist to correct for the time-varying response of a radiometer receiver. An analytical technique has been developed that uses generalized least squares regression (LSR) to predict the performance of a wide variety of calibration algorithms. The total measurement uncertainty, including the uncertainty of the calibration, can be computed using LSR. The uncertainties of the calibration samples used in the regression are based upon treating the receiver fluctuations as non-stationary processes. Signals originating from the different sources of emission are treated as simultaneously existing random processes. Thus, the radiometer output is a series of samples obtained from these random processes. The samples are treated as random variables, but because the underlying processes are non-stationary the statistics of the samples are treated as non-stationary. The statistics of the calibration samples depend upon the time for which the samples are to be applied. The statistics of the random variables are equated to the mean statistics of the non-stationary processes over the interval defined by the time of the calibration sample and when it is applied. This analysis opens the opportunity for experimental investigation into the underlying properties of receiver non-stationarity through the use of multiple calibration references. In this presentation we will discuss the application of LSR to the analysis of various calibration algorithms, requirements for experimental verification of the theory, and preliminary results from analyzing experimental measurements.
NASA Technical Reports Server (NTRS)
Melbourne, William G.
1986-01-01
In double differencing a regression system obtained from concurrent Global Positioning System (GPS) observation sequences, one either undersamples the system to avoid introducing colored measurement statistics, or one fully samples the system incurring the resulting non-diagonal covariance matrix for the differenced measurement errors. A suboptimal estimation result will be obtained in the undersampling case and will also be obtained in the fully sampled case unless the color noise statistics are taken into account. The latter approach requires a least squares weighting matrix derived from inversion of a non-diagonal covariance matrix for the differenced measurement errors instead of inversion of the customary diagonal one associated with white noise processes. Presented is the so-called fully redundant double differencing algorithm for generating a weighted double differenced regression system that yields equivalent estimation results, but features for certain cases a diagonal weighting matrix even though the differenced measurement error statistics are highly colored.
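The weighting step itself is ordinary generalized least squares, sketched below; forming the double-differenced design matrix, measurement vector and error covariance from GPS observables is the substantive part of the algorithm and is not shown. The toy numbers are illustrative only.

```python
import numpy as np

def generalized_least_squares(A, y, C):
    """Weighted least squares x = (A^T W A)^-1 A^T W y with W = C^-1, where C is
    the (generally non-diagonal) covariance of the differenced measurement errors."""
    W = np.linalg.inv(C)
    N = A.T @ W @ A
    x_hat = np.linalg.solve(N, A.T @ W @ y)
    cov_x = np.linalg.inv(N)
    return x_hat, cov_x

# Toy example: two parameters, five differenced measurements whose errors are
# correlated because they share a common reference satellite/receiver.
A = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
C = 0.5 * np.eye(5) + 0.5 * np.ones((5, 5))      # equi-correlated errors
y = A @ np.array([2.0, 0.3]) + np.array([0.1, -0.05, 0.02, 0.0, 0.08])
x_hat, cov_x = generalized_least_squares(A, y, C)
print(x_hat.round(3))
```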
Addeh, Abdoljalil; Khormali, Aminollah; Golilarz, Noorbakhsh Amiri
2018-05-04
Control chart patterns are the most commonly used statistical process control (SPC) tools to monitor process changes. When a control chart produces an out-of-control signal, the process has changed. In this study, a new method based on an optimized radial basis function neural network (RBFNN) is proposed for control chart pattern (CCP) recognition. The proposed method consists of four main modules: feature extraction, feature selection, classification and the learning algorithm. In the feature extraction module, shape and statistical features are used; various shape and statistical features have recently been presented for CCP recognition. In the feature selection module, the association rules (AR) method is employed to select the best set of shape and statistical features. In the classifier module RBFNN is used, and its learning algorithm has a high impact on network performance; therefore, a new learning algorithm based on the bees algorithm is used in the learning module. Most studies have considered only six patterns: Normal, Cyclic, Increasing Trend, Decreasing Trend, Upward Shift and Downward Shift. Since the three patterns Normal, Stratification, and Systematic are very similar to each other and distinguishing them is very difficult, most studies have not considered Stratification and Systematic. To support continuous monitoring and control of the production process and exact identification of the type of problem encountered, eight patterns have been investigated in this study. The proposed method is tested on a dataset containing 1600 samples (200 samples from each pattern), and the results show that the proposed method performs very well. Copyright © 2018 ISA. Published by Elsevier Ltd. All rights reserved.
Classical boson sampling algorithms with superior performance to near-term experiments
NASA Astrophysics Data System (ADS)
Neville, Alex; Sparrow, Chris; Clifford, Raphaël; Johnston, Eric; Birchall, Patrick M.; Montanaro, Ashley; Laing, Anthony
2017-12-01
It is predicted that quantum computers will dramatically outperform their conventional counterparts. However, large-scale universal quantum computers are yet to be built. Boson sampling is a rudimentary quantum algorithm tailored to the platform of linear optics, which has sparked interest as a rapid way to demonstrate such quantum supremacy. Photon statistics are governed by intractable matrix functions, which suggests that sampling from the distribution obtained by injecting photons into a linear optical network could be solved more quickly by a photonic experiment than by a classical computer. The apparently low resource requirements for large boson sampling experiments have raised expectations of a near-term demonstration of quantum supremacy by boson sampling. Here we present classical boson sampling algorithms and theoretical analyses of prospects for scaling boson sampling experiments, showing that near-term quantum supremacy via boson sampling is unlikely. Our classical algorithm, based on Metropolised independence sampling, allowed the boson sampling problem to be solved for 30 photons with standard computing hardware. Compared to current experiments, a demonstration of quantum supremacy over a successful implementation of these classical methods on a supercomputer would require the number of photons and experimental components to increase by orders of magnitude, while tackling exponentially scaling photon loss.
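The "intractable matrix functions" here are permanents: for single-photon inputs, the probability of a collision-free output pattern is |Perm(U_sub)|^2 for the corresponding submatrix of the interferometer unitary. The toy sketch below evaluates this directly for a tiny random unitary; practical classical algorithms such as the Metropolised independence sampler in the paper rely on Ryser/Glynn-type permanent formulas rather than this factorial-time sum, and the unitary and mode choices are arbitrary.

```python
import numpy as np
from itertools import permutations

def permanent(M):
    """Naive permanent via the sum over permutations; fine only for tiny matrices."""
    n = M.shape[0]
    return sum(np.prod([M[i, p[i]] for i in range(n)]) for p in permutations(range(n)))

def output_probability(U, inputs, outputs):
    """P(collision-free output pattern) = |Perm of the rows/columns submatrix|^2."""
    sub = U[np.ix_(outputs, inputs)]
    return abs(permanent(sub)) ** 2

# Tiny illustration: a random 4-mode unitary built via QR decomposition,
# two photons injected into modes 0 and 1, detected in modes 2 and 3.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)))
print(output_probability(Q, inputs=[0, 1], outputs=[2, 3]))
```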
Method and algorithm of automatic estimation of road surface type for variable damping control
NASA Astrophysics Data System (ADS)
Dąbrowski, K.; Ślaski, G.
2016-09-01
In this paper the authors present an approach to road surface type estimation (recognition) based on statistical analysis of suspension dynamic response signals. For preliminary analysis the cumulative distribution function (CDF) was used, leading to the observation that different road surfaces produce response values within different limit ranges for the same percentage of samples, or, for the same limits, different percentages of samples fall within the range between the limit values. This observation is the basis for the developed algorithm, which was tested using suspension response signals recorded during road tests over various surfaces. The proposed algorithm can be an essential part of an adaptive damping control algorithm for a vehicle suspension or of an adaptive control strategy for suspension damping control.
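A crude sketch of the CDF idea: smoother surfaces put a larger share of the suspension-response samples inside a fixed limit. The limit and thresholds below are hypothetical; the paper tunes them from recorded road-test signals.

```python
import numpy as np

def fraction_within_limits(response, limit):
    """Fraction of suspension-response samples with |value| below `limit`."""
    return np.mean(np.abs(response) < limit)

def classify_surface(response, limit=0.5, thresholds=(0.95, 0.75)):
    """Assign a coarse surface class from the share of samples inside the limit."""
    frac = fraction_within_limits(response, limit)
    if frac >= thresholds[0]:
        return "smooth"
    if frac >= thresholds[1]:
        return "medium"
    return "rough"

rng = np.random.default_rng(0)
print(classify_surface(rng.normal(0, 0.2, 2000)))   # -> smooth
print(classify_surface(rng.normal(0, 0.8, 2000)))   # -> rough
```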
Chapman, Benjamin P; Weiss, Alexander; Duberstein, Paul R
2016-12-01
Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in "big data" problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how 3 common SLT algorithms (supervised principal components, regularization, and boosting) can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach (or perhaps because of them), SLT methods may hold value as a statistically rigorous approach to exploratory regression. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
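As one concrete instance of minimizing EPE, the sketch below builds a criterion-keyed "scale" with an L1-penalized logistic regression whose penalty strength is chosen by cross-validation. The data, item pool and outcome are simulated placeholders; the chapter's worked example also covers supervised principal components and boosting, which are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Simulated item pool: 300 five-point items, of which only the first 10 carry signal.
rng = np.random.default_rng(0)
items = rng.integers(1, 6, size=(2000, 300)).astype(float)
risk = (items[:, :10] - 3.0) @ rng.normal(0.3, 0.1, 10)
died = rng.random(2000) < 1.0 / (1.0 + np.exp(-risk))      # all-cause mortality flag

# L1 penalty strength chosen by cross-validation, i.e. by estimated EPE,
# rather than by within-sample fit.
X = StandardScaler().fit_transform(items)
model = LogisticRegressionCV(Cs=10, penalty="l1", solver="saga", cv=5,
                             scoring="neg_log_loss", max_iter=5000).fit(X, died)
kept = np.flatnonzero(model.coef_[0])                      # items retained in the scale
print(f"{kept.size} of 300 items selected")
```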
An adaptive multi-level simulation algorithm for stochastic biological systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lester, C., E-mail: lesterc@maths.ox.ac.uk; Giles, M. B.; Baker, R. E.
2015-01-14
Discrete-state, continuous-time Markov models are widely used in the modeling of biochemical reaction networks. Their complexity often precludes analytic solution, and we rely on stochastic simulation algorithms (SSA) to estimate system statistics. The Gillespie algorithm is exact, but computationally costly as it simulates every single reaction. As such, approximate stochastic simulation algorithms such as the tau-leap algorithm are often used. Potentially computationally more efficient, the system statistics generated suffer from significant bias unless tau is relatively small, in which case the computational time can be comparable to that of the Gillespie algorithm. The multi-level method [Anderson and Higham, "Multi-level Monte Carlo for continuous time Markov chains, with applications in biochemical kinetics," SIAM Multiscale Model. Simul. 10(1), 146-179 (2012)] tackles this problem. A base estimator is computed using many (cheap) sample paths at low accuracy. The bias inherent in this estimator is then reduced using a number of corrections. Each correction term is estimated using a collection of paired sample paths where one path of each pair is generated at a higher accuracy compared to the other (and so more expensive). By sharing random variables between these paired paths, the variance of each correction estimator can be reduced. This renders the multi-level method very efficient as only a relatively small number of paired paths are required to calculate each correction term. In the original multi-level method, each sample path is simulated using the tau-leap algorithm with a fixed value of τ. This approach can result in poor performance when the reaction activity of a system changes substantially over the timescale of interest. By introducing a novel adaptive time-stepping approach where τ is chosen according to the stochastic behaviour of each sample path, we extend the applicability of the multi-level method to such cases. We demonstrate the efficiency of our method using a number of examples.
Analyzing Kernel Matrices for the Identification of Differentially Expressed Genes
Xia, Xiao-Lei; Xing, Huanlai; Liu, Xueqin
2013-01-01
One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests have often been applied to identify the differentially expressed genes (DEGs), followed by the employment of state-of-the-art learning machines, including the Support Vector Machines (SVM) in particular. The SVM is a typical sample-based classifier whose performance comes down to how discriminant the samples are. However, DEGs identified by statistical tests are not guaranteed to result in a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method, namely the Kernel Matrix Gene Selection (KMGS), is proposed. The rationale of the method, which is rooted in the fundamental ideas of the SVM algorithm, is described. The notion of "the separability of a sample", which is estimated by performing -like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is a method of Kernel Matrix Sequential Forward Selection (KMSFS), which shares the KMGS method's essential ideas but proceeds in a greedy manner. On three public microarray datasets, our proposed algorithms achieved noticeably competitive performance in terms of the B.632+ error rate. PMID:24349110
Small sample estimation of the reliability function for technical products
NASA Astrophysics Data System (ADS)
Lyamets, L. L.; Yakimenko, I. V.; Kanishchev, O. A.; Bliznyuk, O. A.
2017-12-01
It is demonstrated that, in the absence of large statistical samples obtained as a result of testing complex technical products for failure, statistical estimation of the reliability function of initial elements can be made by the moments method. A formal description of the moments method is given and its advantages in the analysis of small censored samples are discussed. A modified algorithm is proposed for the implementation of the moments method with the use of only the moments at which the failures of initial elements occur.
Rational approximations to rational models: alternative algorithms for category learning.
Sanborn, Adam N; Griffiths, Thomas L; Navarro, Daniel J
2010-10-01
Rational models of cognition typically consider the abstract computational problems posed by the environment, assuming that people are capable of optimally solving those problems. This differs from more traditional formal models of cognition, which focus on the psychological processes responsible for behavior. A basic challenge for rational models is thus explaining how optimal solutions can be approximated by psychological processes. We outline a general strategy for answering this question, namely to explore the psychological plausibility of approximation algorithms developed in computer science and statistics. In particular, we argue that Monte Carlo methods provide a source of rational process models that connect optimal solutions to psychological processes. We support this argument through a detailed example, applying this approach to Anderson's (1990, 1991) rational model of categorization (RMC), which involves a particularly challenging computational problem. Drawing on a connection between the RMC and ideas from nonparametric Bayesian statistics, we propose 2 alternative algorithms for approximate inference in this model. The algorithms we consider include Gibbs sampling, a procedure appropriate when all stimuli are presented simultaneously, and particle filters, which sequentially approximate the posterior distribution with a small number of samples that are updated as new data become available. Applying these algorithms to several existing datasets shows that a particle filter with a single particle provides a good description of human inferences.
Variance estimates and confidence intervals for the Kappa measure of classification accuracy
M. A. Kalkhan; R. M. Reich; R. L. Czaplewski
1997-01-01
The Kappa statistic is frequently used to characterize the results of an accuracy assessment used to evaluate land use and land cover classifications obtained by remotely sensed data. This statistic allows comparisons of alternative sampling designs, classification algorithms, photo-interpreters, and so forth. In order to make these comparisons, it is...
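For reference, the point estimate itself is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. A small sketch computing it from an accuracy-assessment error matrix follows; the matrix shown is hypothetical, and the variance and confidence-interval formulas discussed in the paper are not reproduced here.

```python
import numpy as np

def cohens_kappa(confusion):
    """Kappa = (p_o - p_e) / (1 - p_e) from a square error matrix whose rows are
    reference classes and whose columns are mapped classes."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_o = np.trace(confusion) / total                                # observed agreement
    p_e = (confusion.sum(0) * confusion.sum(1)).sum() / total ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical accuracy-assessment error matrix for three land-cover classes
errors = [[45,  4,  1],
          [ 6, 38,  6],
          [ 2,  5, 43]]
print(round(cohens_kappa(errors), 3))
```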
Selected-node stochastic simulation algorithm
NASA Astrophysics Data System (ADS)
Duso, Lorenzo; Zechner, Christoph
2018-04-01
Stochastic simulations of biochemical networks are of vital importance for understanding complex dynamics in cells and tissues. However, existing methods to perform such simulations are associated with computational difficulties and addressing those remains a daunting challenge to the present. Here we introduce the selected-node stochastic simulation algorithm (snSSA), which allows us to exclusively simulate an arbitrary, selected subset of molecular species of a possibly large and complex reaction network. The algorithm is based on an analytical elimination of chemical species, thereby avoiding explicit simulation of the associated chemical events. These species are instead described continuously in terms of statistical moments derived from a stochastic filtering equation, resulting in a substantial speedup when compared to Gillespie's stochastic simulation algorithm (SSA). Moreover, we show that statistics obtained via snSSA profit from a variance reduction, which can significantly lower the number of Monte Carlo samples needed to achieve a certain performance. We demonstrate the algorithm using several biological case studies for which the simulation time could be reduced by orders of magnitude.
Creating ensembles of oblique decision trees with evolutionary algorithms and sampling
Cantu-Paz, Erick [Oakland, CA; Kamath, Chandrika [Tracy, CA
2006-06-13
A decision tree system that is part of a parallel object-oriented pattern recognition system, which in turn is part of an object oriented data mining system. A decision tree process includes the step of reading the data. If necessary, the data is sorted. A potential split of the data is evaluated according to some criterion. An initial split of the data is determined. The final split of the data is determined using evolutionary algorithms and statistical sampling techniques. The data is split. Multiple decision trees are combined in ensembles.
Schoenberg, Mike R; Lange, Rael T; Saklofske, Donald H; Suarez, Mariann; Brickell, Tracey A
2008-12-01
Determination of neuropsychological impairment involves contrasting obtained performances with a comparison standard, which is often an estimate of premorbid IQ. M. R. Schoenberg, R. T. Lange, T. A. Brickell, and D. H. Saklofske (2007) proposed the Child Premorbid Intelligence Estimate (CPIE) to predict premorbid Full Scale IQ (FSIQ) using the Wechsler Intelligence Scale for Children-4th Edition (WISC-IV; Wechsler, 2003). The CPIE includes 12 algorithms to predict FSIQ, 1 using demographic variables and 11 algorithms combining WISC-IV subtest raw scores with demographic variables. The CPIE was applied to a sample of children with acquired traumatic brain injury (TBI sample; n = 40) and a healthy demographically matched sample (n = 40). Paired-samples t tests found estimated premorbid FSIQ differed from obtained FSIQ when applied to the TBI sample (ps
Esteban, Santiago; Rodríguez Tablado, Manuel; Peper, Francisco; Mahumud, Yamila S; Ricci, Ricardo I; Kopitowski, Karin; Terrasa, Sergio
2017-01-01
Precision medicine requires extremely large samples. Electronic health records (EHR) are thought to be a cost-effective source of data for that purpose. Phenotyping algorithms help reduce classification errors, making EHR a more reliable source of information for research. Four algorithm development strategies for classifying patients according to their diabetes status (diabetic; non-diabetic; inconclusive) were tested: one codes-only algorithm, one Boolean algorithm, four statistical learning algorithms and six stacked generalization meta-learners. The best performing algorithms within each strategy were tested on the validation set. The stacked generalization algorithm yielded the highest Kappa coefficient value in the validation set (0.95, 95% CI 0.91, 0.98). The implementation of these algorithms allows data from thousands of patients to be exploited accurately, greatly reducing the costs of constructing retrospective cohorts for research.
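A generic stacked-generalization sketch using scikit-learn follows. The base learners, meta-learner and simulated data are placeholders rather than the EHR features or the exact learner stack used in the study; agreement is scored with the kappa coefficient, as in the abstract.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Placeholder "EHR-derived" feature matrix and phenotype labels.
X, y = make_classification(n_samples=3000, n_features=40, n_informative=8,
                           random_state=0)

# Base learners are combined by a meta-learner trained on their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)

kappa = cross_val_score(stack, X, y, cv=3, scoring=make_scorer(cohen_kappa_score))
print(f"cross-validated kappa: {kappa.mean():.3f}")
```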
Ramachandran, Parameswaran; Sánchez-Taltavull, Daniel; Perkins, Theodore J
2017-01-01
Co-expression networks have long been used as a tool for investigating the molecular circuitry governing biological systems. However, most algorithms for constructing co-expression networks were developed in the microarray era, before high-throughput sequencing-with its unique statistical properties-became the norm for expression measurement. Here we develop Bayesian Relevance Networks, an algorithm that uses Bayesian reasoning about expression levels to account for the differing levels of uncertainty in expression measurements between highly- and lowly-expressed entities, and between samples with different sequencing depths. It combines data from groups of samples (e.g., replicates) to estimate group expression levels and confidence ranges. It then computes uncertainty-moderated estimates of cross-group correlations between entities, and uses permutation testing to assess their statistical significance. Using large scale miRNA data from The Cancer Genome Atlas, we show that our Bayesian update of the classical Relevance Networks algorithm provides improved reproducibility in co-expression estimates and lower false discovery rates in the resulting co-expression networks. Software is available at www.perkinslab.ca.
Comparison of statistical sampling methods with ScannerBit, the GAMBIT scanning module
NASA Astrophysics Data System (ADS)
Martinez, Gregory D.; McKay, James; Farmer, Ben; Scott, Pat; Roebber, Elinore; Putze, Antje; Conrad, Jan
2017-11-01
We introduce ScannerBit, the statistics and sampling module of the public, open-source global fitting framework GAMBIT. ScannerBit provides a standardised interface to different sampling algorithms, enabling the use and comparison of multiple computational methods for inferring profile likelihoods, Bayesian posteriors, and other statistical quantities. The current version offers random, grid, raster, nested sampling, differential evolution, Markov Chain Monte Carlo (MCMC) and ensemble Monte Carlo samplers. We also announce the release of a new standalone differential evolution sampler, Diver, and describe its design, usage and interface to ScannerBit. We subject Diver and three other samplers (the nested sampler MultiNest, the MCMC GreAT, and the native ScannerBit implementation of the ensemble Monte Carlo algorithm T-Walk) to a battery of statistical tests. For this we use a realistic physical likelihood function, based on the scalar singlet model of dark matter. We examine the performance of each sampler as a function of its adjustable settings, and the dimensionality of the sampling problem. We evaluate performance on four metrics: optimality of the best fit found, completeness in exploring the best-fit region, number of likelihood evaluations, and total runtime. For Bayesian posterior estimation at high resolution, T-Walk provides the most accurate and timely mapping of the full parameter space. For profile likelihood analysis in less than about ten dimensions, we find that Diver and MultiNest score similarly in terms of best fit and speed, outperforming GreAT and T-Walk; in ten or more dimensions, Diver substantially outperforms the other three samplers on all metrics.
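Diver is a differential evolution sampler; the sketch below shows a bare-bones rand/1/bin differential evolution minimiser applied to a toy objective standing in for a negative log-likelihood. It is not ScannerBit's or Diver's interface, and all names and settings are illustrative.

```python
import numpy as np

def differential_evolution(neg_loglike, bounds, pop_size=30, F=0.8, CR=0.9,
                           generations=200, rng=None):
    """Minimal rand/1/bin differential evolution minimiser."""
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = np.array(bounds).T
    dim = len(bounds)
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fit = np.array([neg_loglike(p) for p in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # pick three distinct members other than i
            a, b, c = pop[rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True          # ensure at least one component crosses
            trial = np.where(cross, mutant, pop[i])
            f = neg_loglike(trial)
            if f <= fit[i]:                          # greedy selection
                pop[i], fit[i] = trial, f
    return pop[np.argmin(fit)], fit.min()

# Toy 2D surface (Rosenbrock-like), standing in for a physics likelihood.
best, val = differential_evolution(lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2,
                                   bounds=[(-5, 5), (-5, 5)])
```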
Chapman, Benjamin P.; Weiss, Alexander; Duberstein, Paul
2016-01-01
Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in “big data” problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how three common SLT algorithms–Supervised Principal Components, Regularization, and Boosting—can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach—or perhaps because of them–SLT methods may hold value as a statistically rigorous approach to exploratory regression. PMID:27454257
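A minimal illustration of estimating expected prediction error by K-fold cross-validation and using it to pick a ridge (regularization) penalty; this shows only the EPE-minimisation idea, not the criterion-keyed scale-construction workflow of the paper, and the penalty grid is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cv_prediction_error(X, y, alphas, n_splits=10, seed=0):
    """Estimate expected prediction error by K-fold cross-validation for each
    ridge penalty, and return the penalty minimising it."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for alpha in alphas:
        fold_mse = []
        for train, test in kf.split(X):
            model = Ridge(alpha=alpha).fit(X[train], y[train])
            fold_mse.append(np.mean((model.predict(X[test]) - y[test]) ** 2))
        errors.append(np.mean(fold_mse))            # out-of-sample error, not in-sample fit
    return alphas[int(np.argmin(errors))], errors
```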
DALMATIAN: An Algorithm for Automatic Cell Detection and Counting in 3D.
Shuvaev, Sergey A; Lazutkin, Alexander A; Kedrov, Alexander V; Anokhin, Konstantin V; Enikolopov, Grigori N; Koulakov, Alexei A
2017-01-01
Current 3D imaging methods, including optical projection tomography, light-sheet microscopy, block-face imaging, and serial two photon tomography enable visualization of large samples of biological tissue. Large volumes of data obtained at high resolution require development of automatic image processing techniques, such as algorithms for automatic cell detection or, more generally, point-like object detection. Current approaches to automated cell detection suffer from difficulties originating from detection of particular cell types, cell populations of different brightness, non-uniformly stained, and overlapping cells. In this study, we present a set of algorithms for robust automatic cell detection in 3D. Our algorithms are suitable for, but not limited to, whole brain regions and individual brain sections. We used watershed procedure to split regional maxima representing overlapping cells. We developed a bootstrap Gaussian fit procedure to evaluate the statistical significance of detected cells. We compared cell detection quality of our algorithm and other software using 42 samples, representing 6 staining and imaging techniques. The results provided by our algorithm matched manual expert quantification with signal-to-noise dependent confidence, including samples with cells of different brightness, non-uniformly stained, and overlapping cells for whole brain regions and individual tissue sections. Our algorithm provided the best cell detection quality among tested free and commercial software.
Tenan, Matthew S; Tweedell, Andrew J; Haynes, Courtney A
2017-12-01
The onset of muscle activity, as measured by electromyography (EMG), is a commonly applied metric in biomechanics. Intramuscular EMG is often used to examine deep musculature and there are currently no studies examining the effectiveness of algorithms for intramuscular EMG onset. The present study examines standard surface EMG onset algorithms (linear envelope, Teager-Kaiser Energy Operator, and sample entropy) and novel algorithms (time series mean-variance analysis, sequential/batch processing with parametric and nonparametric methods, and Bayesian changepoint analysis). Thirteen male and five female subjects had intramuscular EMG collected during isolated biceps brachii and vastus lateralis contractions, resulting in 103 trials. EMG onset was visually determined twice by three blinded reviewers. Since the reliability of visual onset was high (ICC(1,1) = 0.92), the mean of the six visual assessments was contrasted with the algorithmic approaches. Poorly performing algorithms were stepwise eliminated via (1) root mean square error analysis, (2) algorithm failure to identify onset/premature onset, (3) linear regression analysis, and (4) Bland-Altman plots. The top performing algorithms were all based on Bayesian changepoint analysis of rectified EMG and were statistically indistinguishable from visual analysis. Bayesian changepoint analysis has the potential to produce more reliable, accurate, and objective intramuscular EMG onset results than standard methodologies.
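As a simplified stand-in for the Bayesian changepoint approach, the sketch below computes a single maximum-likelihood changepoint in the mean of the rectified EMG (equivalent to the MAP location under a flat prior and a two-segment Gaussian model with common variance); the published algorithms are richer than this, and the minimum segment length is an arbitrary choice.

```python
import numpy as np

def onset_changepoint(emg, fs):
    """Single change-point estimate of EMG onset on the rectified signal."""
    x = np.abs(np.asarray(emg, dtype=float))        # rectification
    n = len(x)
    best_t, best_ll = None, -np.inf
    for t in range(5, n - 5):                       # keep a few samples per segment
        s1, s2 = x[:t], x[t:]
        # profile log-likelihood with a pooled variance estimate
        rss = np.sum((s1 - s1.mean()) ** 2) + np.sum((s2 - s2.mean()) ** 2)
        ll = -0.5 * n * np.log(rss / n)
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t / fs                              # onset time in seconds
```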
Aircraft target detection algorithm based on high resolution spaceborne SAR imagery
NASA Astrophysics Data System (ADS)
Zhang, Hui; Hao, Mengxi; Zhang, Cong; Su, Xiaojing
2018-03-01
In this paper, an image classification algorithm for airport areas is proposed, based on the statistical features of synthetic aperture radar (SAR) images and the spatial information of pixels. The algorithm combines a Gamma mixture model with a Markov random field (MRF): the Gamma mixture model provides an initial classification result, which is then refined by the MRF technique using the spatial correlation between pixels. Additionally, morphological methods are employed to extract the airport region of interest (ROI), in which suspected aircraft target samples are identified to reduce false alarms and increase detection performance. Finally, the aircraft target detection scheme is presented and verified by simulation tests.
SeaWiFS technical report series. Volume 4: An analysis of GAC sampling algorithms. A case study
NASA Technical Reports Server (NTRS)
Yeh, Eueng-Nan (Editor); Hooker, Stanford B. (Editor); Mccain, Charles R. (Editor); Fu, Gary (Editor)
1992-01-01
The Sea-viewing Wide Field-of-view Sensor (SeaWiFS) instrument will sample at approximately 1 km resolution at nadir, and these data will be broadcast for reception by real-time ground stations. However, the global data set will be composed of coarser four-kilometer data, which will be recorded and broadcast to the SeaWiFS Project for processing. Several algorithms for degrading the one-kilometer data to four-kilometer data are examined using imagery from the Coastal Zone Color Scanner (CZCS), in an effort to determine which algorithm best preserves the statistical characteristics of the derived products generated from the one-kilometer data. Of the algorithms tested, subsampling based on a fixed pixel within a 4 x 4 pixel array is judged to yield the most consistent results when compared to the one-kilometer data products.
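The winning strategy is easy to state in code: keep one fixed pixel from every 4 x 4 array. The sketch below contrasts it with block averaging, another plausible degradation strategy; the random image is only a stand-in for a CZCS/SeaWiFS scene.

```python
import numpy as np

def subsample_fixed_pixel(img, row=0, col=0, block=4):
    """Degrade 1 km data to 4 km by keeping one fixed pixel per block x block array."""
    return img[row::block, col::block]

def subsample_block_mean(img, block=4):
    """Alternative strategy: average each block x block array."""
    h, w = (img.shape[0] // block) * block, (img.shape[1] // block) * block
    return img[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))

lac = np.random.rand(512, 512)        # stand-in for a 1 km scene
gac_fixed = subsample_fixed_pixel(lac)
gac_mean = subsample_block_mean(lac)
```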
DOE Office of Scientific and Technical Information (OSTI.GOV)
Eisenbach, Markus; Li, Ying Wai
We report a new multicanonical Monte Carlo (MC) algorithm to obtain the density of states (DOS) for physical systems with continuous state variables in statistical mechanics. Our algorithm is able to obtain an analytical form for the DOS expressed in a chosen basis set, instead of a numerical array of finite resolution as in previous variants of this class of MC methods such as the multicanonical (MUCA) sampling and Wang-Landau (WL) sampling. This is enabled by storing the visited states directly in a data set and avoiding the explicit collection of a histogram. This practice also has the advantage of avoiding undesirable artificial errors caused by the discretization and binning of continuous state variables. Our results show that this scheme is capable of obtaining converged results with a much reduced number of Monte Carlo steps, leading to a significant speedup over existing algorithms.
A quantum–quantum Metropolis algorithm
Yung, Man-Hong; Aspuru-Guzik, Alán
2012-01-01
The classical Metropolis sampling method is a cornerstone of many statistical modeling applications that range from physics, chemistry, and biology to economics. This method is particularly suitable for sampling the thermal distributions of classical systems. The challenge of extending this method to the simulation of arbitrary quantum systems is that, in general, eigenstates of quantum Hamiltonians cannot be obtained efficiently with a classical computer. However, this challenge can be overcome by quantum computers. Here, we present a quantum algorithm which fully generalizes the classical Metropolis algorithm to the quantum domain. The meaning of quantum generalization is twofold: The proposed algorithm is not only applicable to both classical and quantum systems, but also offers a quantum speedup relative to the classical counterpart. Furthermore, unlike the classical method of quantum Monte Carlo, this quantum algorithm does not suffer from the negative-sign problem associated with fermionic systems. Applications of this algorithm include the study of low-temperature properties of quantum systems, such as the Hubbard model, and preparing the thermal states of sizable molecules to simulate, for example, chemical reactions at an arbitrary temperature. PMID:22215584
Quantum speedup of Monte Carlo methods.
Montanaro, Ashley
2015-09-08
Monte Carlo methods use random sampling to estimate numerical quantities which are hard to compute deterministically. One important example is the use in statistical physics of rapidly mixing Markov chains to approximately compute partition functions. In this work, we describe a quantum algorithm which can accelerate Monte Carlo methods in a very general setting. The algorithm estimates the expected output value of an arbitrary randomized or quantum subroutine with bounded variance, achieving a near-quadratic speedup over the best possible classical algorithm. Combining the algorithm with the use of quantum walks gives a quantum speedup of the fastest known classical algorithms with rigorous performance bounds for computing partition functions, which use multiple-stage Markov chain Monte Carlo techniques. The quantum algorithm can also be used to estimate the total variation distance between probability distributions efficiently.
Statistical analysis and machine learning algorithms for optical biopsy
NASA Astrophysics Data System (ADS)
Wu, Binlin; Liu, Cheng-hui; Boydston-White, Susie; Beckman, Hugh; Sriramoju, Vidyasagar; Sordillo, Laura; Zhang, Chunyuan; Zhang, Lin; Shi, Lingyan; Smith, Jason; Bailin, Jacob; Alfano, Robert R.
2018-02-01
Analyzing spectral or imaging data collected with various optical biopsy methods is often difficult due to the complexity of the underlying biology. Robust methods that can use the spectral or imaging data to detect characteristic spectral or spatial signatures of different tissue types are challenging to develop but highly desired. In this study, we used several machine learning algorithms to analyze a spectral dataset acquired from normal and cancerous human skin tissue samples using resonance Raman spectroscopy with 532 nm excitation. The algorithms, including principal component analysis, nonnegative matrix factorization, and an autoencoder artificial neural network, are used to reduce the dimension of the dataset and detect features. A support vector machine with a linear kernel is used to classify the normal and cancerous tissue samples. The efficacies of the methods are compared.
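A minimal scikit-learn pipeline in the spirit of the study (dimension reduction followed by a linear-kernel SVM); the number of components and the variable names spectra/labels are illustrative assumptions, not the authors' settings.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# spectra: (n_samples, n_wavenumbers) Raman intensities; labels: 0 = normal, 1 = cancerous.
clf = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="linear"))
# scores = cross_val_score(clf, spectra, labels, cv=5)   # estimated classification accuracy
```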
de Almeida, Valber Elias; de Araújo Gomes, Adriano; de Sousa Fernandes, David Douglas; Goicoechea, Héctor Casimiro; Galvão, Roberto Kawakami Harrop; Araújo, Mario Cesar Ugulino
2018-05-01
This paper proposes a new variable selection method for nonlinear multivariate calibration, combining the Successive Projections Algorithm for interval selection (iSPA) with the Kernel Partial Least Squares (Kernel-PLS) modelling technique. The proposed iSPA-Kernel-PLS algorithm is employed in a case study involving a Vis-NIR spectrometric dataset with complex nonlinear features. The analytical problem consists of determining Brix and sucrose content in samples from a sugar production system, on the basis of transflectance spectra. As compared to full-spectrum Kernel-PLS, the iSPA-Kernel-PLS models involve a smaller number of variables and display statistically significant superiority in terms of accuracy and/or bias in the predictions. Published by Elsevier B.V.
Detection of cracks in shafts with the Approximated Entropy algorithm
NASA Astrophysics Data System (ADS)
Sampaio, Diego Luchesi; Nicoletti, Rodrigo
2016-05-01
Approximate Entropy is a statistical measure used primarily in medicine, biology, and telecommunications for classifying and identifying complex signal data. In this work, an Approximate Entropy algorithm is used to detect cracks in a rotating shaft. The signals of the cracked shaft are obtained from numerical simulations of a de Laval rotor with breathing cracks modelled by fracture mechanics; the vertical displacements of the rotor during run-up transients are analysed. The results show the feasibility of detecting cracks from 5% depth onward, irrespective of the unbalance of the rotating system and the crack orientation in the shaft. The results also show that the algorithm can differentiate the occurrence of crack only, misalignment only, and crack + misalignment in the system. However, the algorithm is sensitive to the intrinsic parameters p (number of data points in a sample vector) and f (fraction of the standard deviation that defines the minimum distance between two sample vectors), and good results are obtained only by choosing their values appropriately according to the sampling rate of the signal.
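For reference, a compact implementation of Approximate Entropy with the two intrinsic parameters named as in the abstract (p, the sample-vector length, and f, the fraction of the standard deviation). This follows Pincus' standard definition and is not the authors' specific code.

```python
import numpy as np

def approximate_entropy(signal, p, f):
    """Approximate Entropy ApEn(p, r) with tolerance r = f * std(signal)."""
    x = np.asarray(signal, dtype=float)
    r = f * np.std(x)
    N = len(x)

    def phi(m):
        vectors = np.array([x[i:i + m] for i in range(N - m + 1)])
        # Chebyshev distance; self-matches are included, as in the standard definition
        counts = [np.mean(np.max(np.abs(vectors - v), axis=1) <= r) for v in vectors]
        return np.mean(np.log(counts))

    return phi(p) - phi(p + 1)
```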
Statistical Inference for Data Adaptive Target Parameters.
Hubbard, Alan E; Kherad-Pajouh, Sara; van der Laan, Mark J
2016-05-01
Consider one observes n i.i.d. copies of a random variable with a probability distribution that is known to be an element of a particular statistical model. In order to define our statistical target we partition the sample in V equal size sub-samples, and use this partitioning to define V splits in an estimation sample (one of the V subsamples) and corresponding complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V-sample specific target parameters. We present an estimator (and corresponding central limit theorem) of this type of data adaptive target parameter. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. This new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data. Given that more research is becoming "data-driven", the theory developed within this paper provides a new impetus for a greater involvement of statistical inference into problems that are being increasingly addressed by clever, yet ad hoc pattern finding methods. To suggest such potential, and to verify the predictions of the theory, extensive simulation studies, along with a data analysis based on adaptively determined intervention rules are shown and give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.
Statistical Inference in Hidden Markov Models Using k-Segment Constraints
Titsias, Michalis K.; Holmes, Christopher C.; Yau, Christopher
2016-01-01
Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward–backward algorithm. In this article, we expand the amount of information we could obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online. PMID:27226674
Statistics based sampling for controller and estimator design
NASA Astrophysics Data System (ADS)
Tenne, Dirk
The purpose of this research is the development of statistical design tools for robust feed-forward/feedback controllers and nonlinear estimators. The dissertation addresses three topics: nonlinear estimation, target tracking, and robust control. To develop statistically robust controllers and nonlinear estimation algorithms, existing techniques that propagate the statistics of the state have been extended to achieve higher-order accuracy. The so-called unscented transformation has been extended to capture higher-order moments. Furthermore, higher-order moment update algorithms based on a truncated power series have been developed. The proposed techniques are tested on various benchmark examples. The unscented transformation has also been utilized to develop a three-dimensional, geometrically constrained target tracker. The proposed planar circular prediction algorithm is developed in a local coordinate framework, which is amenable to extending the tracking algorithm to three-dimensional space. This tracker combines the predictions of a circular prediction algorithm and a constant-velocity filter by means of Covariance Intersection, and the combined prediction can be updated with the subsequent measurement using a linear estimator. The proposed technique is illustrated on a 3D benchmark trajectory, which includes coordinated turns and straight-line maneuvers. The third part of the dissertation addresses the design of controllers that incorporate knowledge of parametric uncertainties and their distributions. The parameter distributions are approximated by a finite set of points calculated by the unscented transformation. This set of points is used to design robust controllers that minimize a statistical performance measure of the plant, consisting of a combination of the mean and variance, over the domain of uncertainty. The proposed technique is illustrated on three benchmark problems: the first relates to the design of prefilters for linear and nonlinear spring-mass-dashpot systems, the second applies a feedback controller to a hovering helicopter, and the third applies the statistical robust controller design to a concurrent feed-forward/feedback controller structure for a high-speed, low-tension tape drive.
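The unscented transformation at the core of this work propagates a mean and covariance through a nonlinearity using 2n+1 sigma points. Below is a standard textbook version, without the higher-order extensions developed in the dissertation, applied to a polar-to-Cartesian example chosen purely for illustration.

```python
import numpy as np

def unscented_transform(mean, cov, func, kappa=1.0):
    """Propagate a Gaussian (mean, cov) through a nonlinear function
    using the standard 2n+1 sigma points."""
    n = len(mean)
    L = np.linalg.cholesky((n + kappa) * cov)       # matrix square root of (n+kappa)*P
    sigma = np.vstack([mean, mean + L.T, mean - L.T])   # 2n+1 sigma points
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    y = np.array([func(s) for s in sigma])
    y_mean = w @ y
    y_cov = (w[:, None] * (y - y_mean)).T @ (y - y_mean)
    return y_mean, y_cov

# Example: polar -> Cartesian conversion of an uncertain (range, bearing) measurement
m, P = np.array([10.0, 0.5]), np.diag([0.1, 0.01])
f = lambda s: np.array([s[0] * np.cos(s[1]), s[0] * np.sin(s[1])])
y_mean, y_cov = unscented_transform(m, P, f, kappa=1.0)
```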
NASA Technical Reports Server (NTRS)
Hoffman, Matthew J.; Eluszkiewicz, Janusz; Weisenstein, Deborah; Uymin, Gennady; Moncet, Jean-Luc
2012-01-01
Motivated by the needs of Mars data assimilation, particularly quantification of measurement errors and generation of averaging kernels, we have evaluated atmospheric temperature retrievals from Mars Global Surveyor (MGS) Thermal Emission Spectrometer (TES) radiances. Multiple sets of retrievals have been considered in this study: (1) retrievals available from the Planetary Data System (PDS), (2) retrievals based on variants of the retrieval algorithm used to generate the PDS retrievals, and (3) retrievals produced using the Mars 1-Dimensional Retrieval (M1R) algorithm based on the Optimal Spectral Sampling (OSS) forward model. The retrieved temperature profiles are compared to the MGS Radio Science (RS) temperature profiles. For the samples tested, the M1R temperature profiles can be made to agree within 2 K with the RS temperature profiles, but only after tuning the prior and error statistics. Use of a global prior that does not take into account the seasonal dependence leads to errors of up to 6 K. In polar samples, errors relative to the RS temperature profiles are even larger. In these samples, the PDS temperature profiles also exhibit a poor fit with RS temperatures. This fit is worse than reported in previous studies, indicating that the lack of fit is due to a bias correction to TES radiances implemented after 2004. To explain the differences between the PDS and M1R temperatures, the algorithms are compared directly, with the OSS forward model inserted into the PDS algorithm. Factors such as the filtering parameter, the use of linear versus nonlinear constrained inversion, and the choice of the forward model are found to contribute heavily to the differences in the temperature profiles retrieved in the polar regions, resulting in uncertainties of up to 6 K. Even outside the poles, changes in the a priori statistics result in different profile shapes that all fit the radiances within the specified error. The importance of the a priori statistics prevents reliable global retrievals based on a single a priori and strongly implies that a robust science analysis must instead rely on retrievals employing localized a priori information, for example from an ensemble-based data assimilation system such as the Local Ensemble Transform Kalman Filter (LETKF).
External Threat Risk Assessment Algorithm (ExTRAA)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Powell, Troy C.
Two risk assessment algorithms and philosophies have been augmented and combined to form a new algorithm, the External Threat Risk Assessment Algorithm (ExTRAA), that allows for effective and statistically sound analysis of external threat sources in relation to individual attack methods. In addition to the attack method use probability and the attack method employment consequence, the concept of defining threat sources is added to the risk assessment process. Sample data is tabulated and depicted in radar plots and bar graphs for algorithm demonstration purposes. The largest success of ExTRAA is its ability to visualize the kind of risk posed in a given situation using the radar plot method.
Chen, Bin; Peng, Xiuming; Xie, Tiansheng; Jin, Changzhong; Liu, Fumin; Wu, Nanping
2017-07-01
Currently, there are three algorithms for the screening of syphilis: the traditional algorithm, the reverse algorithm, and the European Centre for Disease Prevention and Control (ECDC) algorithm. To date, there is no generally recognized diagnostic algorithm. When syphilis meets HIV, the situation is even more complex. To evaluate their screening performance and impact on the seroprevalence of syphilis in HIV-infected individuals, we conducted a cross-sectional study that included 865 serum samples from HIV-infected patients in a tertiary hospital. Every sample (one per patient) was tested with the toluidine red unheated serum test (TRUST), the T. pallidum particle agglutination assay (TPPA), and a Treponema pallidum enzyme immunoassay (TP-EIA) according to the manufacturer's instructions. The results of syphilis serological testing were interpreted following each of the algorithms. We directly compared the traditional syphilis screening algorithm with the reverse syphilis screening algorithm in this unique population. The reverse algorithm yielded a remarkably higher seroprevalence of syphilis than the traditional algorithm (24.9% vs. 14.2%, p < 0.0001). Compared to the reverse algorithm, the traditional algorithm also had a missed serodiagnosis rate of 42.8%. The total percentages of agreement and corresponding kappa values of the traditional and ECDC algorithms, compared with the reverse algorithm, were 89.4% (kappa 0.668) and 99.8% (kappa 0.994), respectively. There was very good agreement between the reverse and the ECDC algorithms. Our results supported the reverse (or ECDC) algorithm for the screening of syphilis in HIV-infected populations. In addition, our study demonstrated that screening HIV-infected populations using different algorithms may result in statistically different estimates of the seroprevalence of syphilis.
Automatic cortical thickness analysis on rodent brain
NASA Astrophysics Data System (ADS)
Lee, Joohwi; Ehlers, Cindy; Crews, Fulton; Niethammer, Marc; Budin, Francois; Paniagua, Beatriz; Sulik, Kathy; Johns, Josephine; Styner, Martin; Oguz, Ipek
2011-03-01
Localized difference in the cortex is one of the most useful morphometric traits in human and animal brain studies. There are many tools and methods already developed to automatically measure and analyze cortical thickness for the human brain. However, these tools cannot be directly applied to rodent brains due to the different scales; even adult rodent brains are 50 to 100 times smaller than humans. This paper describes an algorithm for automatically measuring the cortical thickness of mouse and rat brains. The algorithm consists of three steps: segmentation, thickness measurement, and statistical analysis among experimental groups. The segmentation step provides the neocortex separation from other brain structures and thus is a preprocessing step for the thickness measurement. In the thickness measurement step, the thickness is computed by solving a Laplacian PDE and a transport equation. The Laplacian PDE first creates streamlines as an analogy of cortical columns; the transport equation computes the length of the streamlines. The result is stored as a thickness map over the neocortex surface. For the statistical analysis, it is important to sample thickness at corresponding points. This is achieved by the particle correspondence algorithm which minimizes entropy between dynamically moving sample points called particles. Since the computational cost of the correspondence algorithm may limit the number of corresponding points, we use thin-plate spline based interpolation to increase the number of corresponding sample points. As a driving application, we measured the thickness difference to assess the effects of adolescent intermittent ethanol exposure that persist into adulthood and performed t-test between the control and exposed rat groups. We found significantly differing regions in both hemispheres.
Hassan, Ahnaf Rashik; Bhuiyan, Mohammed Imamul Hassan
2017-03-01
Automatic sleep staging is essential for alleviating the burden of the physicians of analyzing a large volume of data by visual inspection. It is also a precondition for making an automated sleep monitoring system feasible. Further, computerized sleep scoring will expedite large-scale data analysis in sleep research. Nevertheless, most of the existing works on sleep staging are either multichannel or multiple physiological signal based which are uncomfortable for the user and hinder the feasibility of an in-home sleep monitoring device. So, a successful and reliable computer-assisted sleep staging scheme is yet to emerge. In this work, we propose a single channel EEG based algorithm for computerized sleep scoring. In the proposed algorithm, we decompose EEG signal segments using Ensemble Empirical Mode Decomposition (EEMD) and extract various statistical moment based features. The effectiveness of EEMD and statistical features are investigated. Statistical analysis is performed for feature selection. A newly proposed classification technique, namely - Random under sampling boosting (RUSBoost) is introduced for sleep stage classification. This is the first implementation of EEMD in conjunction with RUSBoost to the best of the authors' knowledge. The proposed feature extraction scheme's performance is investigated for various choices of classification models. The algorithmic performance of our scheme is evaluated against contemporary works in the literature. The performance of the proposed method is comparable or better than that of the state-of-the-art ones. The proposed algorithm gives 88.07%, 83.49%, 92.66%, 94.23%, and 98.15% for 6-state to 2-state classification of sleep stages on Sleep-EDF database. Our experimental outcomes reveal that RUSBoost outperforms other classification models for the feature extraction framework presented in this work. Besides, the algorithm proposed in this work demonstrates high detection accuracy for the sleep states S1 and REM. Statistical moment based features in the EEMD domain distinguish the sleep states successfully and efficaciously. The automated sleep scoring scheme propounded herein can eradicate the onus of the clinicians, contribute to the device implementation of a sleep monitoring system, and benefit sleep research. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
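The statistical-moment features are straightforward to compute; the sketch below extracts mean, variance, skewness, and kurtosis per component. In the paper the moments are taken over EEMD intrinsic mode functions; the decomposition step is left out here so the example carries no extra dependency, and the interface is purely illustrative.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def moment_features(epoch, components=None):
    """Statistical-moment feature vector for one EEG epoch.

    `components` may be any decomposition of the epoch (e.g. EEMD intrinsic mode
    functions); by default the moments are computed on the raw epoch itself."""
    comps = [np.asarray(epoch, dtype=float)] if components is None else components
    feats = []
    for c in comps:
        feats += [np.mean(c), np.var(c), skew(c), kurtosis(c)]
    return np.array(feats)
```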
Alici, Ibrahim Onur; Yılmaz Demirci, Nilgün; Yılmaz, Aydın; Karakaya, Jale; Özaydın, Esra
2016-09-01
There are several papers on the sonographic features of mediastinal lymph nodes affected by several diseases, but none gives the importance and clinical utility of the features. In order to find out which lymph node should be sampled in a particular nodal station during endobronchial ultrasound, we investigated the diagnostic performances of certain sonographic features and proposed an algorithmic approach. We retrospectively analyzed 1051 lymph nodes and randomly assigned them into a preliminary experimental and a secondary study group. The diagnostic performances of the sonographic features (gray scale, echogeneity, shape, size, margin, presence of necrosis, presence of calcification and absence of central hilar structure) were calculated, and an algorithm for lymph node sampling was obtained with decision tree analysis in the experimental group. Later, a modified algorithm was applied to the patients in the study group to give the accuracy. The demographic characteristics of the patients were not statistically significant between the primary and the secondary groups. All of the features were discriminative between malignant and benign diseases. The modified algorithm sensitivity, specificity, and positive and negative predictive values and diagnostic accuracy for detecting metastatic lymph nodes were 100%, 51.2%, 50.6%, 100% and 67.5%, respectively. In this retrospective analysis, the standardized sonographic classification system and the proposed algorithm performed well in choosing the node that should be sampled in a particular station during endobronchial ultrasound. © 2015 John Wiley & Sons Ltd.
Hierarchical Dirichlet process model for gene expression clustering
2013-01-01
Clustering is an important data processing tool for interpreting microarray data and genomic network inference. In this article, we propose a clustering algorithm based on hierarchical Dirichlet processes (HDP). The HDP clustering introduces a hierarchical structure in the statistical model which captures the hierarchical features prevalent in biological data such as gene expression data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for the HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces unnecessary clustering fragments. PMID:23587447
Billing code algorithms to identify cases of peripheral artery disease from administrative data
Fan, Jin; Arruda-Olson, Adelaide M; Leibson, Cynthia L; Smith, Carin; Liu, Guanghui; Bailey, Kent R; Kullo, Iftikhar J
2013-01-01
Objective To construct and validate billing code algorithms for identifying patients with peripheral arterial disease (PAD). Methods We extracted all encounters and line item details including PAD-related billing codes at Mayo Clinic Rochester, Minnesota, between July 1, 1997 and June 30, 2008; 22 712 patients evaluated in the vascular laboratory were divided into training and validation sets. Multiple logistic regression analysis was used to create an integer code score from the training dataset, and this was tested in the validation set. We applied a model-based code algorithm to patients evaluated in the vascular laboratory and compared this with a simpler algorithm (presence of at least one of the ICD-9 PAD codes 440.20–440.29). We also applied both algorithms to a community-based sample (n=4420), followed by a manual review. Results The logistic regression model performed well in both training and validation datasets (c statistic=0.91). In patients evaluated in the vascular laboratory, the model-based code algorithm provided better negative predictive value. The simpler algorithm was reasonably accurate for identification of PAD status, with lesser sensitivity and greater specificity. In the community-based sample, the sensitivity (38.7% vs 68.0%) of the simpler algorithm was much lower, whereas the specificity (92.0% vs 87.6%) was higher than the model-based algorithm. Conclusions A model-based billing code algorithm had reasonable accuracy in identifying PAD cases from the community, and in patients referred to the non-invasive vascular laboratory. The simpler algorithm had reasonable accuracy for identification of PAD in patients referred to the vascular laboratory but was significantly less sensitive in a community-based sample. PMID:24166724
An Intrinsic Algorithm for Parallel Poisson Disk Sampling on Arbitrary Surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-03-08
Poisson disk sampling plays an important role in a variety of visual computing applications, due to its useful statistical distribution properties and the absence of aliasing artifacts. While many effective techniques have been proposed to generate Poisson disk distributions in Euclidean space, relatively little work has been reported on the surface counterpart. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. We propose a new technique for parallelizing dart throwing. Rather than conventional approaches that explicitly partition the spatial domain to generate samples in parallel, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. It is worth noting that our algorithm is accurate, as the generated Poisson disks are uniformly and randomly distributed without bias. Our method is intrinsic in that all the computations are based on the intrinsic metric and are independent of the embedding space. This intrinsic feature allows us to generate Poisson disk distributions on arbitrary surfaces. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
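For orientation, here is plain serial dart throwing for planar Poisson disk sampling, the baseline that the paper parallelises; the intrinsic-metric and priority-based parallel machinery of the paper is not reproduced, and the domain size and radius are arbitrary.

```python
import numpy as np

def dart_throwing(radius, width=1.0, height=1.0, max_trials=20000, rng=None):
    """Serial dart throwing: accept a candidate only if it lies at least
    `radius` away from every previously accepted sample."""
    rng = np.random.default_rng() if rng is None else rng
    samples = []
    for _ in range(max_trials):
        p = rng.uniform([0, 0], [width, height])
        if all(np.hypot(*(p - q)) >= radius for q in samples):
            samples.append(p)
    return np.array(samples)

points = dart_throwing(radius=0.05)
```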
NASA Astrophysics Data System (ADS)
Artrith, Nongnuch; Urban, Alexander; Ceder, Gerbrand
2018-06-01
The atomistic modeling of amorphous materials requires structure sizes and sampling statistics that are challenging to achieve with first-principles methods. Here, we propose a methodology to speed up the sampling of amorphous and disordered materials using a combination of a genetic algorithm and a specialized machine-learning potential based on artificial neural networks (ANNs). We show for the example of the amorphous LiSi alloy that around 1000 first-principles calculations are sufficient for the ANN-potential assisted sampling of low-energy atomic configurations in the entire amorphous LixSi phase space. The obtained phase diagram is validated by comparison with the results from an extensive sampling of LixSi configurations using molecular dynamics simulations and a general ANN potential trained to ˜45 000 first-principles calculations. This demonstrates the utility of the approach for the first-principles modeling of amorphous materials.
Practical continuous-variable quantum key distribution without finite sampling bandwidth effects.
Li, Huasheng; Wang, Chao; Huang, Peng; Huang, Duan; Wang, Tao; Zeng, Guihua
2016-09-05
In a practical continuous-variable quantum key distribution system, the finite sampling bandwidth of the analog-to-digital converter employed at the receiver's side may lead to inaccurate pulse peak sampling, which in turn introduces errors into parameter estimation. As a result, system performance decreases and security loopholes are exposed to eavesdroppers. In this paper, we propose a novel data acquisition scheme consisting of two parts: a dynamic delay adjusting module and a statistical power feedback-control algorithm. The proposed scheme can dramatically improve the data acquisition precision of pulse peak sampling and remove the finite sampling bandwidth effects. Moreover, the optimal peak sampling position of a pulse signal can be dynamically calibrated by monitoring changes in the statistical power of the sampled data. This helps to resist some practical attacks, such as the well-known local oscillator calibration attack.
Design of Neural Networks for Fast Convergence and Accuracy
NASA Technical Reports Server (NTRS)
Maghami, Peiman G.; Sparks, Dean W., Jr.
1998-01-01
A novel procedure for the design and training of artificial neural networks, used for rapid and efficient controls and dynamics design and analysis for flexible space systems, has been developed. Artificial neural networks are employed to provide a means of evaluating the impact of design changes rapidly. Specifically, two-layer feedforward neural networks are designed to approximate the functional relationship between the component spacecraft design changes and measures of its performance. A training algorithm, based on statistical sampling theory, is presented, which guarantees that the trained networks provide a designer-specified degree of accuracy in mapping the functional relationship. Within each iteration of this statistical-based algorithm, a sequential design algorithm is used for the design and training of the feedforward network to provide rapid convergence to the network goals. Here, at each sequence a new network is trained to minimize the error of previous network. The design algorithm attempts to avoid the local minima phenomenon that hampers the traditional network training. A numerical example is performed on a spacecraft application in order to demonstrate the feasibility of the proposed approach.
THE SCREENING AND RANKING ALGORITHM FOR CHANGE-POINTS DETECTION IN MULTIPLE SAMPLES
Song, Chi; Min, Xiaoyi; Zhang, Heping
2016-01-01
The chromosome copy number variation (CNV) is the deviation of genomic regions from their normal copy number states, which may associate with many human diseases. Current genetic studies usually collect hundreds to thousands of samples to study the association between CNV and diseases. CNVs can be called by detecting the change-points in mean for sequences of array-based intensity measurements. Although multiple samples are of interest, the majority of the available CNV calling methods are single sample based. Only a few multiple sample methods have been proposed using scan statistics that are computationally intensive and designed toward either common or rare change-points detection. In this paper, we propose a novel multiple sample method by adaptively combining the scan statistic of the screening and ranking algorithm (SaRa), which is computationally efficient and is able to detect both common and rare change-points. We prove that asymptotically this method can find the true change-points with almost certainty and show in theory that multiple sample methods are superior to single sample methods when shared change-points are of interest. Additionally, we report extensive simulation studies to examine the performance of our proposed method. Finally, using our proposed method as well as two competing approaches, we attempt to detect CNVs in the data from the Primary Open-Angle Glaucoma Genes and Environment study, and conclude that our method is faster and requires less information while our ability to detect the CNVs is comparable or better. PMID:28090239
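The SaRa diagnostic for a single sequence is simple: at each position compare the means of the h points to the left and to the right, then screen local maxima against a threshold. The sketch below implements that single-sample step only; the paper's contribution, adaptively combining these statistics across many samples, is not shown, and the threshold choice is left to the user.

```python
import numpy as np

def sara_statistic(y, h):
    """Local diagnostic: |mean of h points to the right - mean of h points to the left|."""
    y = np.asarray(y, dtype=float)
    c = np.concatenate([[0.0], np.cumsum(y)])       # prefix sums for O(1) window means
    n = len(y)
    idx = np.arange(h, n - h)
    left = (c[idx] - c[idx - h]) / h
    right = (c[idx + h] - c[idx]) / h
    return idx, np.abs(right - left)

def screen_and_rank(y, h, threshold):
    """Screening step: keep local maxima of the diagnostic that exceed the threshold."""
    idx, d = sara_statistic(y, h)
    return [int(idx[i]) for i in range(1, len(d) - 1)
            if d[i] >= threshold and d[i] >= d[i - 1] and d[i] >= d[i + 1]]
```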
The computation of pi to 29,360,000 decimal digits using Borweins' quartically convergent algorithm
NASA Technical Reports Server (NTRS)
Bailey, David H.
1988-01-01
The quartically convergent numerical algorithm developed by Borwein and Borwein (1987) for 1/pi is implemented via a prime-modulus-transform multiprecision technique on the NASA Ames Cray-2 supercomputer to compute the first 2.936 x 10 to the 7th digits of the decimal expansion of pi. The history of pi computations is briefly recalled; the most recent algorithms are characterized; the implementation procedures are described; and samples of the output listing are presented. Statistical analyses show that the present decimal expansion is completely random, with only acceptable numbers of long repeating strings and single-digit runs.
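The quartic iteration itself is short; a sketch using mpmath for arbitrary precision is given below (the 1988 computation used a custom prime-modulus-transform multiprecision package on the Cray-2, not this library). Each iteration roughly quadruples the number of correct digits of 1/pi.

```python
from mpmath import mp, mpf, sqrt

def borwein_quartic_pi(digits, iterations=10):
    """Borweins' quartically convergent iteration: a_k converges to 1/pi."""
    mp.dps = digits + 10                     # working precision with guard digits
    y = sqrt(mpf(2)) - 1
    a = 6 - 4 * sqrt(mpf(2))
    for k in range(iterations):
        y4 = (1 - y**4) ** mpf("0.25")
        y = (1 - y4) / (1 + y4)
        a = a * (1 + y)**4 - mpf(2)**(2 * k + 3) * y * (1 + y + y**2)
    return 1 / a

pi_approx = borwein_quartic_pi(1000)         # ~1000 correct digits after a few iterations
```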
Tsukerman, B M; Finkel'shteĭn, I E
1987-07-01
A statistical analysis of prolonged ECG records has been carried out in patients with various heart rhythm and conductivity disorders. The distribution of absolute R-R duration values and relationships between adjacent intervals have been examined. A two-step algorithm has been constructed that excludes anomalous and "suspicious" intervals from a sample of consecutively recorded R-R intervals, until only the intervals between contractions of veritably sinus origin remain in the sample. The algorithm has been developed into a programme for microcomputer Electronica NC-80. It operates reliably even in cases of complex combined rhythm and conductivity disorders.
Random Numbers and Monte Carlo Methods
NASA Astrophysics Data System (ADS)
Scherer, Philipp O. J.
Many-body problems often involve the calculation of integrals of very high dimension which cannot be treated by standard methods. For the calculation of thermodynamic averages Monte Carlo methods are very useful which sample the integration volume at randomly chosen points. After summarizing some basic statistics, we discuss algorithms for the generation of pseudo-random numbers with given probability distribution which are essential for all Monte Carlo methods. We show how the efficiency of Monte Carlo integration can be improved by sampling preferentially the important configurations. Finally the famous Metropolis algorithm is applied to classical many-particle systems. Computer experiments visualize the central limit theorem and apply the Metropolis method to the traveling salesman problem.
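A minimal random-walk Metropolis sampler for a one-dimensional Boltzmann-like density, illustrating the accept/reject rule discussed in the chapter; the target density, step size, and chain length are arbitrary choices for demonstration.

```python
import numpy as np

def metropolis(log_prob, x0, n_samples, step=0.5, rng=None):
    """Random-walk Metropolis sampling from an unnormalised log-density."""
    rng = np.random.default_rng() if rng is None else rng
    x = float(x0)
    lp = log_prob(x)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        prop = x + rng.normal(scale=step)            # symmetric proposal
        lp_prop = log_prob(prop)
        if np.log(rng.random()) < lp_prop - lp:      # accept with prob min(1, p'/p)
            x, lp = prop, lp_prop
        samples[i] = x
    return samples

# Boltzmann-like double-well density exp(-beta * (x^2 - 1)^2)
beta = 2.0
chain = metropolis(lambda x: -beta * (x**2 - 1.0)**2, x0=0.0, n_samples=50_000)
```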
Advancing X-ray scattering metrology using inverse genetic algorithms.
Hannon, Adam F; Sunday, Daniel F; Windover, Donald; Kline, R Joseph
2016-01-01
We compare the speed and effectiveness of two genetic optimization algorithms to the results of statistical sampling via a Markov chain Monte Carlo algorithm to find which is the most robust method for determining real space structure in periodic gratings measured using critical dimension small angle X-ray scattering. Both a covariance matrix adaptation evolutionary strategy and differential evolution algorithm are implemented and compared using various objective functions. The algorithms and objective functions are used to minimize differences between diffraction simulations and measured diffraction data. These simulations are parameterized with an electron density model known to roughly correspond to the real space structure of our nanogratings. The study shows that for X-ray scattering data, the covariance matrix adaptation coupled with a mean-absolute error log objective function is the most efficient combination of algorithm and goodness of fit criterion for finding structures with little foreknowledge about the underlying fine scale structure features of the nanograting.
Jankovic, Marko; Ogawa, Hidemitsu
2004-10-01
Principal Component Analysis (PCA) and Principal Subspace Analysis (PSA) are classic techniques in statistical data analysis, feature extraction and data compression. Given a set of multivariate measurements, PCA and PSA provide a smaller set of "basis vectors" with less redundancy, and a subspace spanned by them, respectively. Artificial neurons and neural networks have been shown to perform PSA and PCA when gradient ascent (descent) learning rules are used, which is related to the constrained maximization (minimization) of statistical objective functions. Due to their low complexity, such algorithms and their implementation in neural networks are potentially useful in cases of tracking slow changes of correlations in the input data or in updating eigenvectors with new samples. In this paper we propose PCA learning algorithm that is fully homogeneous with respect to neurons. The algorithm is obtained by modification of one of the most famous PSA learning algorithms--Subspace Learning Algorithm (SLA). Modification of the algorithm is based on Time-Oriented Hierarchical Method (TOHM). The method uses two distinct time scales. On a faster time scale PSA algorithm is responsible for the "behavior" of all output neurons. On a slower scale, output neurons will compete for fulfillment of their "own interests". On this scale, basis vectors in the principal subspace are rotated toward the principal eigenvectors. At the end of the paper it will be briefly analyzed how (or why) time-oriented hierarchical method can be used for transformation of any of the existing neural network PSA method, into PCA method.
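For context, the baseline Subspace Learning Algorithm (Oja's subspace rule) that the paper modifies can be written in a few lines; it converges to an orthonormal basis of the principal subspace rather than to individual eigenvectors, which is exactly the gap the proposed time-oriented hierarchical modification addresses. The learning rate and epoch count below are illustrative.

```python
import numpy as np

def subspace_learning(X, n_components, lr=0.01, epochs=50, rng=None):
    """Oja's Subspace Learning Algorithm: W converges to an orthonormal basis
    of the principal subspace of the centred data X (n_samples x n_features)."""
    rng = np.random.default_rng() if rng is None else rng
    X = X - X.mean(axis=0)
    d = X.shape[1]
    W = rng.normal(size=(d, n_components)) * 0.1
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            y = W.T @ x                                   # outputs of the linear neurons
            W += lr * (np.outer(x, y) - W @ np.outer(y, y))   # SLA update rule
    return W
```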
Visual Sample Plan Version 7.0 User's Guide
DOE Office of Scientific and Technical Information (OSTI.GOV)
Matzke, Brett D.; Newburn, Lisa LN; Hathaway, John E.
2014-03-01
User's guide for VSP 7.0. This user's guide describes Visual Sample Plan (VSP) Version 7.0 and provides instructions for using the software. VSP selects the appropriate number and location of environmental samples to ensure that the results of statistical tests performed to provide input to risk decisions have the required confidence and performance. VSP Version 7.0 provides sample-size equations or algorithms needed by specific statistical tests appropriate for specific environmental sampling objectives. It also provides data quality assessment and statistical analysis functions to support evaluation of the data and determine whether the data support decisions regarding sites suspected of contamination. The easy-to-use program is highly visual and graphic. VSP runs on personal computers with Microsoft Windows operating systems (XP, Vista, Windows 7, and Windows 8). Designed primarily for project managers and users without expertise in statistics, VSP is applicable to two- and three-dimensional populations to be sampled (e.g., rooms and buildings, surface soil, a defined layer of subsurface soil, water bodies, and other similar applications) for studies of environmental quality. VSP is also applicable for designing sampling plans for assessing chem/rad/bio threat and hazard identification within rooms and buildings, and for designing geophysical surveys for unexploded ordnance (UXO) identification.
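As an example of the kind of sample-size equation VSP automates, the sketch below implements a commonly used formula for a one-sample t-test of a site mean against an action level, including a small-sample correction term; the exact equations VSP applies for each design should be taken from its documentation, so treat this as an assumption-laden illustration.

```python
from math import ceil
from scipy.stats import norm

def one_sample_t_sample_size(sigma, delta, alpha=0.05, beta=0.10):
    """Approximate number of samples for a one-sample t-test with gray-region
    width delta, false-rejection rate alpha, and false-acceptance rate beta."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    n = (sigma ** 2) * (z_a + z_b) ** 2 / delta ** 2 + 0.5 * z_a ** 2
    return ceil(n)

print(one_sample_t_sample_size(sigma=2.0, delta=1.5))
```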
Automated sampling assessment for molecular simulations using the effective sample size
Zhang, Xin; Bhatt, Divesh; Zuckerman, Daniel M.
2010-01-01
To quantify the progress in the development of algorithms and forcefields used in molecular simulations, a general method for the assessment of the sampling quality is needed. Statistical mechanics principles suggest the populations of physical states characterize equilibrium sampling in a fundamental way. We therefore develop an approach for analyzing the variances in state populations, which quantifies the degree of sampling in terms of the effective sample size (ESS). The ESS estimates the number of statistically independent configurations contained in a simulated ensemble. The method is applicable to both traditional dynamics simulations as well as more modern (e.g., multi–canonical) approaches. Our procedure is tested in a variety of systems from toy models to atomistic protein simulations. We also introduce a simple automated procedure to obtain approximate physical states from dynamic trajectories: this allows sample–size estimation in systems for which physical states are not known in advance. PMID:21221418
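A simplified reading of the idea: if several independent runs each report the fraction of time spent in a physical state, the variance of that fraction across runs indicates how many statistically independent configurations each run effectively contains. The sketch below inverts the binomial relation var = p(1-p)/N; the paper's estimator is more general than this single-state version.

```python
import numpy as np

def effective_sample_size(state_fractions):
    """Effective sample size implied by the run-to-run variance of a state's population."""
    f = np.asarray(state_fractions, dtype=float)
    p = f.mean()                 # average population of the state
    var = f.var(ddof=1)          # observed variance across independent runs
    return p * (1.0 - p) / var   # N such that a binomial proportion has this variance

# e.g. fraction of frames found in the folded state in 10 independent runs
ess = effective_sample_size([0.62, 0.55, 0.70, 0.58, 0.66, 0.61, 0.59, 0.64, 0.57, 0.63])
```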
Yu, P.; Sun, J.; Wolz, R.; Stephenson, D.; Brewer, J.; Fox, N.C.; Cole, P.E.; Jack, C.R.; Hill, D.L.G.; Schwarz, A.J.
2014-01-01
Objective To evaluate the effect of computational algorithm, measurement variability and cut-point on hippocampal volume (HCV)-based patient selection for clinical trials in mild cognitive impairment (MCI). Methods We used normal control and amnestic MCI subjects from ADNI-1 as normative reference and screening cohorts. We evaluated the enrichment performance of four widely-used hippocampal segmentation algorithms (FreeSurfer, HMAPS, LEAP and NeuroQuant) in terms of two-year changes in MMSE, ADAS-Cog and CDR-SB. We modeled the effect of algorithm, test-retest variability and cut-point on sample size, screen fail rates and trial cost and duration. Results HCV-based patient selection yielded not only reduced sample sizes (by ~40–60%) but also lower trial costs (by ~30–40%) across a wide range of cut-points. Overall, the dependence on the cut-point value was similar for the three clinical instruments considered. Conclusion These results provide a guide to the choice of HCV cut-point for aMCI clinical trials, allowing an informed trade-off between statistical and practical considerations. PMID:24211008
Drivers’ Visual Behavior-Guided RRT Motion Planner for Autonomous On-Road Driving
Du, Mingbo; Mei, Tao; Liang, Huawei; Chen, Jiajia; Huang, Rulin; Zhao, Pan
2016-01-01
This paper describes a real-time motion planner based on the drivers’ visual behavior-guided rapidly exploring random tree (RRT) approach, which is applicable to on-road driving of autonomous vehicles. The primary novelty is in the use of the guidance of drivers’ visual search behavior in the framework of RRT motion planner. RRT is an incremental sampling-based method that is widely used to solve the robotic motion planning problems. However, RRT is often unreliable in a number of practical applications such as autonomous vehicles used for on-road driving because of the unnatural trajectory, useless sampling, and slow exploration. To address these problems, we present an interesting RRT algorithm that introduces an effective guided sampling strategy based on the drivers’ visual search behavior on road and a continuous-curvature smooth method based on B-spline. The proposed algorithm is implemented on a real autonomous vehicle and verified against several different traffic scenarios. A large number of the experimental results demonstrate that our algorithm is feasible and efficient for on-road autonomous driving. Furthermore, the comparative test and statistical analyses illustrate that its excellent performance is superior to other previous algorithms. PMID:26784203
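For readers unfamiliar with RRT, a minimal planar version with goal-biased sampling is sketched below; the paper replaces this naive sampling with guidance from drivers' visual search behavior and adds B-spline smoothing, neither of which is reproduced here. The collision check and geometry are placeholders.

```python
import numpy as np

def rrt(start, goal, collision_free, bounds, step=0.5, goal_bias=0.1,
        goal_tol=0.5, max_iter=5000, rng=None):
    """Minimal planar RRT with goal-biased uniform sampling."""
    rng = np.random.default_rng() if rng is None else rng
    nodes, parents = [np.asarray(start, dtype=float)], [-1]
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    goal = np.asarray(goal, float)
    for _ in range(max_iter):
        target = goal if rng.random() < goal_bias else rng.uniform(lo, hi)
        i = int(np.argmin([np.linalg.norm(n - target) for n in nodes]))   # nearest node
        direction = target - nodes[i]
        dist = np.linalg.norm(direction)
        if dist == 0:
            continue
        new = nodes[i] + step * direction / dist                          # extend by one step
        if not collision_free(nodes[i], new):
            continue
        nodes.append(new); parents.append(i)
        if np.linalg.norm(new - goal) < goal_tol:                         # goal reached: backtrack
            path = [len(nodes) - 1]
            while parents[path[-1]] != -1:
                path.append(parents[path[-1]])
            return [nodes[j] for j in reversed(path)]
    return None

# Obstacle-free example in a 20 m x 20 m area
path = rrt(start=(0, 0), goal=(18, 18), collision_free=lambda a, b: True,
           bounds=((0, 0), (20, 20)))
```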
The DOHA algorithm: a new recipe for cotrending large-scale transiting exoplanet survey light curves
NASA Astrophysics Data System (ADS)
Mislis, D.; Pyrzas, S.; Alsubai, K. A.; Tsvetanov, Z. I.; Vilchez, N. P. E.
2017-03-01
We present
Kamiura, Moto; Sano, Kohei
2017-10-01
The principle of optimism in the face of uncertainty is known as a heuristic in sequential decision-making problems. The Overtaking method, based on this principle, is an effective algorithm for solving multi-armed bandit problems; in a previous study it was defined through a set of heuristic formulation patterns. The objective of the present paper is to redefine the value functions of the Overtaking method and to unify their formulation. The unified Overtaking method is associated with statistical upper bounds on the confidence intervals of the expected rewards, and the unification enhances the method's generality. Consequently, we obtain an Overtaking method for exponentially distributed rewards, analyze it numerically, and show that it outperforms the UCB algorithm on average. The present study suggests that, in the context of multi-armed bandit problems, the principle of optimism in the face of uncertainty should be regarded not as a heuristic but as a statistical consequence of the law of large numbers for the sample mean of rewards and of the estimation of upper bounds on expected rewards. Copyright © 2017 Elsevier B.V. All rights reserved.
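For contrast with the UCB algorithm mentioned above, here is a minimal sketch of the standard UCB1 baseline on Bernoulli-reward arms; it is not the Overtaking method from the paper, and the arm means and horizon are hypothetical.

```python
import numpy as np

def ucb1(means, horizon=20000, seed=0):
    """Standard UCB1 on Bernoulli arms (baseline, not the Overtaking method)."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:                       # play each arm once first
            a = t - 1
        else:
            # Sample mean plus an upper confidence bound term.
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        reward = rng.random() < means[a]
        counts[a] += 1
        sums[a] += reward
        regret += max(means) - means[a]
    return regret

print("cumulative regret:", ucb1([0.3, 0.5, 0.55]))
```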
Topics in Statistical Calibration
2014-03-27
on a parametric bootstrap where, instead of sampling directly from the residuals, samples are drawn from a normal distribution. This procedure will ... addition to centering them (Davison and Hinkley, 1997). When there are outliers in the residuals, the bootstrap distribution of x̂0 can become skewed or ... based and inversion methods using the linear mixed-effects model. Then, a simple parametric bootstrap algorithm is proposed that can be used to either
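A minimal sketch of the parametric bootstrap described in the snippet, assuming a straight-line calibration model: instead of resampling the residuals themselves, bootstrap errors are drawn from a normal distribution with the fitted residual standard deviation. The data and model are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy calibration data: y = a + b*x + noise (hypothetical example).
x = np.linspace(0, 10, 30)
y = 2.0 + 0.8 * x + rng.normal(0, 0.5, size=x.size)

# Fit the straight-line calibration curve by least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma = resid.std(ddof=2)                 # residual SD (two fitted parameters)

# Parametric bootstrap: draw errors from N(0, sigma^2) rather than
# resampling the residuals.
boot = []
for _ in range(2000):
    y_star = X @ beta + rng.normal(0, sigma, size=y.size)
    b_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)
    boot.append(b_star)
boot = np.array(boot)
print("bootstrap SEs for (intercept, slope):", boot.std(axis=0, ddof=1))
```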
Analysis of delay reducing and fuel saving sequencing and spacing algorithms for arrival traffic
NASA Technical Reports Server (NTRS)
Neuman, Frank; Erzberger, Heinz
1991-01-01
The air traffic control subsystem that performs sequencing and spacing is discussed. The function of the sequencing and spacing algorithms is to automatically plan the most efficient landing order and to assign optimally spaced landing times to all arrivals. Several algorithms are described and their statistical performance is examined. Sequencing brings order to an arrival sequence for aircraft. First-come-first-served sequencing (FCFS) establishes a fair order, based on estimated times of arrival, and determines proper separations. Because of the randomness of the arriving traffic, gaps will remain in the sequence of aircraft. Delays are reduced by time-advancing the leading aircraft of each group while still preserving the FCFS order. Tightly spaced groups of aircraft remain with a mix of heavy and large aircraft. Spacing requirements differ for different types of aircraft trailing each other. Traffic is reordered slightly to take advantage of this spacing criterion, thus shortening the groups and reducing average delays. For heavy traffic, delays for different traffic samples vary widely, even when the same set of statistical parameters is used to produce each sample. This report supersedes NASA TM-102795 on the same subject. It includes a new method of time-advance as well as an efficient method of sequencing and spacing for two dependent runways.
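A minimal sketch of the FCFS core described above (not the NASA scheduler): aircraft are ordered by estimated time of arrival, and each scheduled landing time is pushed back only as far as a leader/trailer separation matrix requires; the time-advance and reordering heuristics are omitted. The separation values and traffic sample are hypothetical.

```python
# Minimal FCFS sequencing sketch (illustrative, not the NASA algorithm).
# Aircraft are ordered by estimated time of arrival (ETA); each scheduled
# landing time is delayed as needed to satisfy a wake-separation matrix
# that depends on the (leader, trailer) weight classes.

SEPARATION = {  # seconds, hypothetical values
    ("heavy", "heavy"): 96, ("heavy", "large"): 157,
    ("large", "heavy"): 60, ("large", "large"): 69,
}

def fcfs_schedule(aircraft):
    """aircraft: list of (id, eta_seconds, weight_class) tuples."""
    ordered = sorted(aircraft, key=lambda a: a[1])      # FCFS by ETA
    schedule = []
    prev_time, prev_class = None, None
    for ident, eta, wclass in ordered:
        t = eta
        if prev_time is not None:
            t = max(t, prev_time + SEPARATION[(prev_class, wclass)])
        schedule.append((ident, t, t - eta))            # (id, STA, delay)
        prev_time, prev_class = t, wclass
    return schedule

demo = [("AC1", 0, "heavy"), ("AC2", 30, "large"),
        ("AC3", 45, "large"), ("AC4", 300, "heavy")]
for ident, sta, delay in fcfs_schedule(demo):
    print(f"{ident}: scheduled {sta:5.0f} s, delay {delay:4.0f} s")
```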
Nonequilibrium umbrella sampling in spaces of many order parameters
NASA Astrophysics Data System (ADS)
Dickson, Alex; Warmflash, Aryeh; Dinner, Aaron R.
2009-02-01
We recently introduced an umbrella sampling method for obtaining nonequilibrium steady-state probability distributions projected onto an arbitrary number of coordinates that characterize a system (order parameters) [A. Warmflash, P. Bhimalapuram, and A. R. Dinner, J. Chem. Phys. 127, 154112 (2007)]. Here, we show how our algorithm can be combined with the image update procedure from the finite-temperature string method for reversible processes [E. Vanden-Eijnden and M. Venturoli, "Revisiting the finite temperature string method for calculation of reaction tubes and free energies," J. Chem. Phys. (in press)] to enable restricted sampling of a nonequilibrium steady state in the vicinity of a path in a many-dimensional space of order parameters. For the study of transitions between stable states, the adapted algorithm results in improved scaling with the number of order parameters and the ability to progressively refine the regions of enforced sampling. We demonstrate the algorithm by applying it to a two-dimensional model of driven Brownian motion and a coarse-grained (Ising) model for nucleation under shear. It is found that the choice of order parameters can significantly affect the convergence of the simulation; local magnetization variables other than those used previously for sampling transition paths in Ising systems are needed to ensure that the reactive flux is primarily contained within a tube in the space of order parameters. The relation of this method to other algorithms that sample the statistics of path ensembles is discussed.
Multiple signal classification algorithm for super-resolution fluorescence microscopy
Agarwal, Krishna; Macháň, Radek
2016-01-01
Single-molecule localization techniques are restricted by long acquisition and computational times, or the need of special fluorophores or biologically toxic photochemical environments. Here we propose a statistical super-resolution technique of wide-field fluorescence microscopy we call the multiple signal classification algorithm which has several advantages. It provides resolution down to at least 50 nm, requires fewer frames and lower excitation power and works even at high fluorophore concentrations. Further, it works with any fluorophore that exhibits blinking on the timescale of the recording. The multiple signal classification algorithm shows comparable or better performance in comparison with single-molecule localization techniques and four contemporary statistical super-resolution methods for experiments of in vitro actin filaments and other independently acquired experimental data sets. We also demonstrate super-resolution at timescales of 245 ms (using 49 frames acquired at 200 frames per second) in samples of live-cell microtubules and live-cell actin filaments imaged without imaging buffers. PMID:27934858
Distribution of the two-sample t-test statistic following blinded sample size re-estimation.
Lu, Kaifeng
2016-05-01
We consider blinded sample size re-estimation based on the simple one-sample variance estimator at an interim analysis. We characterize the exact distribution of the standard two-sample t-test statistic at the final analysis. We describe a simulation algorithm for evaluating the probability of rejecting the null hypothesis at a given treatment effect. We compare the blinded sample size re-estimation method with two unblinded methods with respect to the empirical type I error, the empirical power, and the empirical distribution of the standard deviation estimator and final sample size. We characterize the type I error inflation across the range of standardized non-inferiority margins for non-inferiority trials, and derive the adjusted significance level that ensures type I error control for a given sample size of the internal pilot study. We show that the adjusted significance level increases as the sample size of the internal pilot study increases. Copyright © 2016 John Wiley & Sons, Ltd.
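A minimal simulation sketch in the spirit of the algorithm described above, under assumed design values (it is not the paper's exact algorithm or adjustment, and it uses NumPy/SciPy): the blinded, pooled one-sample standard deviation at the interim drives the re-estimated per-group size, after which the standard two-sample t-test is applied and the empirical one-sided type I error is tallied under the null.

```python
import numpy as np
from scipy import stats

def simulate_type1(n_pilot=40, delta=1.0, sigma=2.0, alpha=0.025,
                   power=0.9, reps=5000, seed=2):
    """Empirical one-sided type I error with blinded re-estimation (sketch)."""
    rng = np.random.default_rng(seed)
    z_a, z_b = stats.norm.ppf(1 - alpha), stats.norm.ppf(power)
    rejections = 0
    for _ in range(reps):
        # Internal pilot, equal allocation, no treatment effect (H0 true).
        x = rng.normal(0, sigma, n_pilot // 2)
        y = rng.normal(0, sigma, n_pilot // 2)
        pooled = np.concatenate([x, y])
        s_blind = pooled.std(ddof=1)               # blinded one-sample SD
        n_group = int(np.ceil(2 * ((z_a + z_b) * s_blind / delta) ** 2))
        n_group = max(n_group, n_pilot // 2)
        # Additional recruitment up to the re-estimated per-group size.
        x = np.concatenate([x, rng.normal(0, sigma, n_group - x.size)])
        y = np.concatenate([y, rng.normal(0, sigma, n_group - y.size)])
        t, p = stats.ttest_ind(x, y)
        rejections += bool(p / 2 < alpha and t > 0)  # one-sided test
    return rejections / reps

print("empirical type I error:", simulate_type1())
```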
NASA Astrophysics Data System (ADS)
Steiner, Matthias; Houze, Robert A., Jr.; Yuter, Sandra E.
1995-09-01
Three algorithms extract information on precipitation type, structure, and amount from operational radar and rain gauge data. Tests on one month of data from one site show that the algorithms perform accurately and provide products that characterize the essential features of the precipitation climatology. Inputs to the algorithms are the operationally executed volume scans of a radar and the data from a surrounding rain gauge network. The algorithms separate the radar echoes into convective and stratiform regions, statistically summarize the vertical structure of the radar echoes, and determine precipitation rates and amounts at high spatial resolution. The convective and stratiform regions are separated on the basis of the intensity and sharpness of the peaks of echo intensity. The peaks indicate the centers of the convective region. Precipitation not identified as convective is stratiform. This method avoids the problem of underestimating the stratiform precipitation. The separation criteria are applied in exactly the same way throughout the observational domain, and the product generated by the algorithm can be compared directly to model output. An independent test of the algorithm on data for which high-resolution dual-Doppler observations are available shows that the convective-stratiform separation algorithm is consistent with the physical definitions of convective and stratiform precipitation. The vertical structure algorithm presents the frequency distribution of radar reflectivity as a function of height and thus summarizes in a single plot the vertical structure of all the radar echoes observed during a month (or any other time period). Separate plots reveal the essential differences in structure between the convective and stratiform echoes. Tests yield similar results (within less than 10%) for monthly rain statistics regardless of the technique used for estimating the precipitation, as long as the radar reflectivity values are adjusted to agree with monthly rain gauge data. It makes little difference whether the adjustment is by monthly mean rates or percentiles. Further tests show that 1-h sampling is sufficient to obtain an accurate estimate of monthly rain statistics.
Convergence and Efficiency of Adaptive Importance Sampling Techniques with Partial Biasing
NASA Astrophysics Data System (ADS)
Fort, G.; Jourdain, B.; Lelièvre, T.; Stoltz, G.
2018-04-01
We propose a new Monte Carlo method to efficiently sample a multimodal distribution (known up to a normalization constant). We consider a generalization of the discrete-time Self Healing Umbrella Sampling method, which can also be seen as a generalization of well-tempered metadynamics. The dynamics is based on an adaptive importance technique. The importance function relies on the weights (namely the relative probabilities) of disjoint sets which form a partition of the space. These weights are unknown but are learnt on the fly yielding an adaptive algorithm. In the context of computational statistical physics, the logarithm of these weights is, up to an additive constant, the free-energy, and the discrete valued function defining the partition is called the collective variable. The algorithm falls into the general class of Wang-Landau type methods, and is a generalization of the original Self Healing Umbrella Sampling method in two ways: (i) the updating strategy leads to a larger penalization strength of already visited sets in order to escape more quickly from metastable states, and (ii) the target distribution is biased using only a fraction of the free-energy, in order to increase the effective sample size and reduce the variance of importance sampling estimators. We prove the convergence of the algorithm and analyze numerically its efficiency on a toy example.
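A toy, one-dimensional sketch of the adaptive partial-biasing idea (not the paper's algorithm or its convergence analysis): a Metropolis walker on a double-well potential accumulates a penalty in every visited bin, only a fraction of that penalty is added to the potential the walker sees, and the per-bin free energy is recovered at the end by a crude reweighting of the biased histogram, which assumes the bias varies slowly late in the run. All potentials, bin edges, and update constants are hypothetical choices.

```python
import numpy as np

def adaptive_partial_bias(n_steps=200_000, n_bins=20, frac=0.5,
                          beta=3.0, step=0.3, gamma=0.02, seed=3):
    """Toy adaptive importance sampling with partial biasing (sketch)."""
    rng = np.random.default_rng(seed)
    V = lambda x: (x ** 2 - 1.0) ** 2            # double-well potential
    edges = np.linspace(-2.0, 2.0, n_bins + 1)
    bin_of = lambda x: int(np.clip(np.searchsorted(edges, x) - 1, 0, n_bins - 1))
    B = np.zeros(n_bins)                         # accumulated bias (penalty)
    hist = np.zeros(n_bins)
    x = -1.0
    for i in range(n_steps):
        x_new = x + rng.normal(0.0, step)
        if abs(x_new) <= 2.0:
            # Biased energy uses only a fraction of the learned penalty.
            dU = (V(x_new) + frac * B[bin_of(x_new)]) - (V(x) + frac * B[bin_of(x)])
            if dU <= 0 or rng.random() < np.exp(-beta * dU):
                x = x_new
        b = bin_of(x)
        B[b] += gamma                            # penalize the visited bin
        if i >= n_steps // 2:                    # histogram from the late phase
            hist[b] += 1
    # Crude reweighting of the biased histogram to a free-energy estimate.
    F = -np.log(np.maximum(hist, 1)) / beta - frac * B
    return edges, F - F.min()

edges, F = adaptive_partial_bias()
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.round(np.column_stack([centers, F]), 2))
```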
Probabilistic Open Set Recognition
NASA Astrophysics Data System (ADS)
Jain, Lalit Prithviraj
Real-world tasks in computer vision, pattern recognition and machine learning often touch upon the open set recognition problem: multi-class recognition with incomplete knowledge of the world and many unknown inputs. An obvious way to approach such problems is to develop a recognition system that thresholds probabilities to reject unknown classes. Traditional rejection techniques are not about the unknown; they are about the uncertain boundary and rejection around that boundary. Thus traditional techniques only represent the "known unknowns". However, a proper open set recognition algorithm is needed to reduce the risk from the "unknown unknowns". This dissertation examines this concept and finds existing probabilistic multi-class recognition approaches are ineffective for true open set recognition. We hypothesize the cause is due to weak adhoc assumptions combined with closed-world assumptions made by existing calibration techniques. Intuitively, if we could accurately model just the positive data for any known class without overfitting, we could reject the large set of unknown classes even under this assumption of incomplete class knowledge. For this, we formulate the problem as one of modeling positive training data by invoking statistical extreme value theory (EVT) near the decision boundary of positive data with respect to negative data. We provide a new algorithm called the PI-SVM for estimating the unnormalized posterior probability of class inclusion. This dissertation also introduces a new open set recognition model called Compact Abating Probability (CAP), where the probability of class membership decreases in value (abates) as points move from known data toward open space. We show that CAP models improve open set recognition for multiple algorithms. Leveraging the CAP formulation, we go on to describe the novel Weibull-calibrated SVM (W-SVM) algorithm, which combines the useful properties of statistical EVT for score calibration with one-class and binary support vector machines. Building from the success of statistical EVT based recognition methods such as PI-SVM and W-SVM on the open set problem, we present a new general supervised learning algorithm for multi-class classification and multi-class open set recognition called the Extreme Value Local Basis (EVLB). The design of this algorithm is motivated by the observation that extrema from known negative class distributions are the closest negative points to any positive sample during training, and thus should be used to define the parameters of a probabilistic decision model. In the EVLB, the kernel distribution for each positive training sample is estimated via an EVT distribution fit over the distances to the separating hyperplane between positive training sample and closest negative samples, with a subset of the overall positive training data retained to form a probabilistic decision boundary. Using this subset as a frame of reference, the probability of a sample at test time decreases as it moves away from the positive class. Possessing this property, the EVLB is well-suited to open set recognition problems where samples from unknown or novel classes are encountered at test. Our experimental evaluation shows that the EVLB provides a substantial improvement in scalability compared to standard radial basis function kernel machines, as well as P I-SVM and W-SVM, with improved accuracy in many cases. 
We evaluate our algorithm on open set variations of the standard visual learning benchmarks, as well as with an open subset of classes from Caltech 256 and ImageNet. Our experiments show that PI-SVM, W-SVM and EVLB provide significant advances over the previous state-of-the-art solutions for the same tasks.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Oliver K.; Kurniawan, Christian
2018-02-03
Properties closures delineate the theoretical objective space for materials design problems, allowing designers to make informed trade-offs between competing constraints and target properties. In this paper, we present a new algorithm called hierarchical simplex sampling (HSS) that approximates properties closures more efficiently and faithfully than traditional optimization based approaches. By construction, HSS generates samples of microstructure statistics that span the corresponding microstructure hull. As a result, we also find that HSS can be coupled with synthetic polycrystal generation software to generate diverse sets of microstructures for subsequent mesoscale simulations. Finally, by more broadly sampling the space of possible microstructures, it is anticipated that such diverse microstructure sets will expand our understanding of the influence of microstructure on macroscale effective properties and inform the construction of higher-fidelity mesoscale structure-property models.
Statistical characterization of handwriting characteristics using automated tools
NASA Astrophysics Data System (ADS)
Ball, Gregory R.; Srihari, Sargur N.
2011-01-01
We provide a statistical basis for reporting the results of handwriting examination by questioned document (QD) examiners. As a facet of QD examination, the analysis and reporting of handwriting examination suffers from a lack of statistical data concerning the frequency of occurrence of combinations of particular handwriting characteristics. QD examiners tend to assign probative values to specific handwriting characteristics, and to their combinations, based entirely on the examiner's experience and power of recall. The research uses databases of handwriting samples that are representative of the US population. Feature lists of characteristics provided by QD examiners are used to determine which frequencies need to be evaluated. Algorithms are used to automatically extract those characteristics; for example, a functional software tool extracts most of the characteristics of the most common letter pair, th. For each letter combination, the marginal and conditional frequencies of its characteristics are evaluated. Based on statistical dependencies among the characteristics, the probability of any given letter formation is computed. The resulting algorithms are incorporated into a system for writer verification known as CEDAR-FOX.
From inverse problems to learning: a Statistical Mechanics approach
NASA Astrophysics Data System (ADS)
Baldassi, Carlo; Gerace, Federica; Saglietti, Luca; Zecchina, Riccardo
2018-01-01
We present a brief introduction to the statistical mechanics approaches for the study of inverse problems in data science. We then provide concrete new results on inferring couplings from sampled configurations in systems characterized by an extensive number of stable attractors in the low temperature regime. We also show how these results are connected to the problem of learning with realistic weak signals in computational neuroscience. Our techniques and algorithms rely on advanced mean-field methods developed in the context of disordered systems.
NASA Astrophysics Data System (ADS)
Wang, Ershen; Jia, Chaoying; Tong, Gang; Qu, Pingping; Lan, Xiaoyu; Pang, Tao
2018-03-01
Receiver autonomous integrity monitoring (RAIM) is one of the most important parts of an avionic navigation system. Two problems with the standard particle filter (PF) need to be addressed to improve such a system: the degeneracy phenomenon and sample impoverishment, in which the available samples cannot adequately represent the true probability density function. This study presents a GPS RAIM method based on a chaos particle swarm optimization particle filter (CPSO-PF) algorithm with a log-likelihood ratio. The chaos sequence generates a set of chaotic variables, which are mapped to the interval of the optimization variables to improve particle quality. This chaos perturbation overcomes the tendency of the particle swarm optimization (PSO) search to become trapped in a local optimum. Test statistics are configured based on a likelihood ratio, and satellite fault detection is then conducted by checking the consistency between the state estimate of the main PF and those of the auxiliary PFs. Based on GPS data, the experimental results demonstrate that the proposed algorithm can effectively detect and isolate satellite faults under non-Gaussian measurement noise. Moreover, its performance is better than that of RAIM based on the PF or PSO-PF algorithm.
Zhang, Cheng; Zhang, Tao; Li, Ming; Peng, Chengtao; Liu, Zhaobang; Zheng, Jian
2016-06-18
In order to reduce the radiation dose of CT (computed tomography), compressed sensing theory has been a hot topic since it provides the possibility of a high quality recovery from the sparse sampling data. Recently, the algorithm based on DL (dictionary learning) was developed to deal with the sparse CT reconstruction problem. However, the existing DL algorithm focuses on the minimization problem with the L2-norm regularization term, which leads to reconstruction quality deteriorating while the sampling rate declines further. Therefore, it is essential to improve the DL method to meet the demand of more dose reduction. In this paper, we replaced the L2-norm regularization term with the L1-norm one. It is expected that the proposed L1-DL method could alleviate the over-smoothing effect of the L2-minimization and reserve more image details. The proposed algorithm solves the L1-minimization problem by a weighting strategy, solving the new weighted L2-minimization problem based on IRLS (iteratively reweighted least squares). Through the numerical simulation, the proposed algorithm is compared with the existing DL method (adaptive dictionary based statistical iterative reconstruction, ADSIR) and other two typical compressed sensing algorithms. It is revealed that the proposed algorithm is more accurate than the other algorithms especially when further reducing the sampling rate or increasing the noise. The proposed L1-DL algorithm can utilize more prior information of image sparsity than ADSIR. By transforming the L2-norm regularization term of ADSIR with the L1-norm one and solving the L1-minimization problem by IRLS strategy, L1-DL could reconstruct the image more exactly.
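As a sketch of the weighting strategy described above (a generic L1/IRLS solver, not the authors' L1-DL dictionary-learning code): the L1 penalty is replaced at each iteration by a weighted L2 penalty with weights 1/(|x_i| + eps), and the resulting linear system is re-solved until convergence. The sensing matrix and sparsity level in the demo are synthetic.

```python
import numpy as np

def irls_l1(A, b, lam=0.05, n_iter=50, eps=1e-6):
    """Solve min ||Ax - b||^2 + lam*||x||_1 via iteratively reweighted L2."""
    AtA, Atb = A.T @ A, A.T @ b
    x = np.linalg.lstsq(A, b, rcond=None)[0]     # minimum-norm start
    for _ in range(n_iter):
        # |x_i| is approximated by x_i^2 / (|x_i| + eps), so the penalty
        # becomes a diagonal (reweighted) L2 term.
        W = np.diag(lam / (np.abs(x) + eps))
        x = np.linalg.solve(AtA + W, Atb)
    return x

# Small synthetic sparse-recovery demo (hypothetical sizes).
rng = np.random.default_rng(4)
A = rng.normal(size=(80, 200))
x_true = np.zeros(200)
x_true[rng.choice(200, 8, replace=False)] = rng.normal(0, 3, 8)
b = A @ x_true + rng.normal(0, 0.01, 80)
x_hat = irls_l1(A, b)
print("true support:   ", sorted(np.flatnonzero(x_true).tolist()))
print("largest |x_hat|:", sorted(np.argsort(-np.abs(x_hat))[:8].tolist()))
```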
NASA Astrophysics Data System (ADS)
Orović, Irena; Stanković, Srdjan; Amin, Moeness
2013-05-01
A modified robust two-dimensional compressive sensing algorithm for reconstruction of sparse time-frequency representation (TFR) is proposed. The ambiguity function domain is assumed to be the domain of observations. The two-dimensional Fourier bases are used to linearly relate the observations to the sparse TFR, in lieu of the Wigner distribution. We assume that a set of available samples in the ambiguity domain is heavily corrupted by an impulsive type of noise. Consequently, the problem of sparse TFR reconstruction cannot be tackled using standard compressive sensing optimization algorithms. We introduce a two-dimensional L-statistics based modification into the transform domain representation. It provides suitable initial conditions that will produce efficient convergence of the reconstruction algorithm. This approach applies sorting and weighting operations to discard an expected amount of samples corrupted by noise. The remaining samples serve as observations used in sparse reconstruction of the time-frequency signal representation. The efficiency of the proposed approach is demonstrated on numerical examples that comprise both cases of monocomponent and multicomponent signals.
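A minimal sketch of the sorting-and-discarding (L-statistics) step on a generic sample vector, assuming a fixed fraction of impulse-corrupted observations: samples are sorted by deviation from the median and the most extreme fraction is dropped, leaving the remainder as observations for the sparse reconstruction stage (not shown here). All signal and noise parameters are hypothetical.

```python
import numpy as np

def l_statistics_trim(samples, discard_frac=0.2):
    """Discard the most extreme samples (assumed impulse-corrupted).

    Sorts samples by absolute deviation from the median and keeps the
    (1 - discard_frac) fraction closest to it; returns the kept values
    and a boolean mask marking which original observations survive.
    """
    samples = np.asarray(samples)
    dev = np.abs(samples - np.median(samples))
    order = np.argsort(dev)                      # L-statistics: sort first
    n_keep = int(np.ceil((1.0 - discard_frac) * samples.size))
    mask = np.zeros(samples.size, dtype=bool)
    mask[order[:n_keep]] = True                  # keep the inliers
    return samples[mask], mask

# Demo: a clean sinusoid sample set hit by a few strong impulses.
rng = np.random.default_rng(5)
clean = np.cos(2 * np.pi * 0.05 * np.arange(100))
noisy = clean.copy()
idx = rng.choice(100, size=10, replace=False)
noisy[idx] += rng.normal(0, 20, size=10)         # impulsive outliers
kept, mask = l_statistics_trim(noisy, discard_frac=0.15)
print("impulses removed:", np.sum(~mask[idx]), "of", idx.size)
```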
NASA Technical Reports Server (NTRS)
Fisher, Brad; Wolff, David B.
2010-01-01
Passive and active microwave rain sensors onboard earth-orbiting satellites estimate monthly rainfall from the instantaneous rain statistics collected during satellite overpasses. It is well known that climate-scale rain estimates from meteorological satellites incur sampling errors resulting from the process of discrete temporal sampling and statistical averaging. Sampling and retrieval errors ultimately become entangled in the estimation of the mean monthly rain rate. The sampling component of the error budget effectively introduces statistical noise into climate-scale rain estimates that obscures the error component associated with the instantaneous rain retrieval. Estimating the accuracy of the retrievals on monthly scales therefore necessitates a decomposition of the total error budget into sampling and retrieval error quantities. This paper presents results from a statistical evaluation of the sampling and retrieval errors for five different space-borne rain sensors on board nine orbiting satellites. Using an error decomposition methodology developed by one of the authors, sampling and retrieval errors were estimated at 0.25° resolution within 150 km of ground-based weather radars located at Kwajalein, Marshall Islands and Melbourne, Florida. Error and bias statistics were calculated according to the land, ocean and coast classifications of the surface terrain mask developed for the Goddard Profiling (GPROF) rain algorithm. Variations in the comparative error statistics are attributed to various factors related to differences in the swath geometry of each rain sensor, the orbital and instrument characteristics of the satellite and the regional climatology. The most significant result from this study found that each of the satellites incurred negative long-term oceanic retrieval biases of 10 to 30%.
Statistical computation of tolerance limits
NASA Technical Reports Server (NTRS)
Wheeler, J. T.
1993-01-01
Based on a new theory, two computer codes were developed specifically to calculate the exact statistical tolerance limits for normal distributions with unknown means and variances, for the one-sided and two-sided cases of the tolerance factor k. The quantity k is defined equivalently in terms of the noncentral t-distribution by the probability equation. Two of the four mathematical methods employ the theory developed for the numerical simulation. Several algorithms for numerically integrating and iteratively root-solving the working equations are written to augment the program simulation. The program codes generate tables of k values for varying proportions and sample sizes at each given probability, to show the accuracy obtained for small sample sizes.
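For the one-sided case, the tolerance factor admits the exact noncentral-t expression k = t'_{gamma}(n-1, z_p*sqrt(n))/sqrt(n), consistent with the probability-equation definition above. The sketch below evaluates it with SciPy's noncentral t quantile rather than the report's own integration and root-solving codes; the coverage and confidence values are arbitrary examples.

```python
import numpy as np
from scipy.stats import norm, nct

def k_one_sided(n, coverage=0.90, confidence=0.95):
    """Exact one-sided normal tolerance factor k (noncentral-t form)."""
    delta = norm.ppf(coverage) * np.sqrt(n)        # noncentrality parameter
    return nct.ppf(confidence, df=n - 1, nc=delta) / np.sqrt(n)

# Small table of k versus sample size, in the spirit of the report's tables.
for n in (5, 10, 20, 50, 100):
    print(f"n={n:4d}  k={k_one_sided(n):.4f}")
```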
Robust matching for voice recognition
NASA Astrophysics Data System (ADS)
Higgins, Alan; Bahler, L.; Porter, J.; Blais, P.
1994-10-01
This paper describes an automated method of comparing a voice sample from an unknown individual with samples from known speakers in order to establish or verify the individual's identity. The method is based on a statistical pattern-matching approach that employs a simple training procedure, requires no human intervention (transcription, word or phonetic marking, etc.), and makes no assumptions regarding the expected form of the statistical distributions of the observations. The content of the speech material (vocabulary, grammar, etc.) is not assumed to be constrained in any way. An algorithm is described which incorporates frame pruning and channel equalization processes designed to achieve robust performance with reasonable computational resources. An experimental implementation demonstrating the feasibility of the concept is described.
Statistical analysis for validating ACO-KNN algorithm as feature selection in sentiment analysis
NASA Astrophysics Data System (ADS)
Ahmad, Siti Rohaidah; Yusop, Nurhafizah Moziyana Mohd; Bakar, Azuraliza Abu; Yaakub, Mohd Ridzwan
2017-10-01
This research paper proposes a hybrid of ant colony optimization (ACO) and k-nearest neighbor (KNN) algorithms as a feature selection method for selecting relevant features from customer review datasets. Information gain (IG), genetic algorithm (GA), and rough set attribute reduction (RSAR) were used as baseline algorithms in a performance comparison with the proposed algorithm. This paper also discusses the significance test used to evaluate the performance differences between the ACO-KNN, IG-GA, and IG-RSAR algorithms. The study evaluated the performance of the ACO-KNN algorithm using precision, recall, and F-score, validated with parametric statistical significance tests. The evaluation shows that the ACO-KNN algorithm is significantly improved compared to the baseline algorithms. In addition, the experimental results show that ACO-KNN can be used as a feature selection technique in sentiment analysis to obtain a quality, optimal feature subset that represents the actual data in customer review datasets.
Emura, Takeshi; Konno, Yoshihiko; Michimae, Hirofumi
2015-07-01
Doubly truncated data consist of samples whose observed values fall between the right- and left- truncation limits. With such samples, the distribution function of interest is estimated using the nonparametric maximum likelihood estimator (NPMLE) that is obtained through a self-consistency algorithm. Owing to the complicated asymptotic distribution of the NPMLE, the bootstrap method has been suggested for statistical inference. This paper proposes a closed-form estimator for the asymptotic covariance function of the NPMLE, which is computationally attractive alternative to bootstrapping. Furthermore, we develop various statistical inference procedures, such as confidence interval, goodness-of-fit tests, and confidence bands to demonstrate the usefulness of the proposed covariance estimator. Simulations are performed to compare the proposed method with both the bootstrap and jackknife methods. The methods are illustrated using the childhood cancer dataset.
Simulating realistic predator signatures in quantitative fatty acid signature analysis
Bromaghin, Jeffrey F.
2015-01-01
Diet estimation is an important field within quantitative ecology, providing critical insights into many aspects of ecology and community dynamics. Quantitative fatty acid signature analysis (QFASA) is a prominent method of diet estimation, particularly for marine mammal and bird species. Investigators using QFASA commonly use computer simulation to evaluate statistical characteristics of diet estimators for the populations they study. Similar computer simulations have been used to explore and compare the performance of different variations of the original QFASA diet estimator. In both cases, computer simulations involve bootstrap sampling prey signature data to construct pseudo-predator signatures with known properties. However, bootstrap sample sizes have been selected arbitrarily and pseudo-predator signatures therefore may not have realistic properties. I develop an algorithm to objectively establish bootstrap sample sizes that generates pseudo-predator signatures with realistic properties, thereby enhancing the utility of computer simulation for assessing QFASA estimator performance. The algorithm also appears to be computationally efficient, resulting in bootstrap sample sizes that are smaller than those commonly used. I illustrate the algorithm with an example using data from Chukchi Sea polar bears (Ursus maritimus) and their marine mammal prey. The concepts underlying the approach may have value in other areas of quantitative ecology in which bootstrap samples are post-processed prior to their use.
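A minimal sketch of the basic pseudo-predator construction (not Bromaghin's sample-size selection algorithm): prey signatures are bootstrapped within each prey type with a chosen sample size, mixed according to a known diet vector, and renormalized. The prey types, diet proportions, and data here are synthetic.

```python
import numpy as np

def pseudo_predator(prey_sigs, diet, boot_sizes, rng):
    """Build one pseudo-predator fatty acid signature (sketch).

    prey_sigs  : dict prey_type -> (n_samples, n_fatty_acids) array,
                 each row a signature summing to 1.
    diet       : dict prey_type -> diet proportion (sums to 1).
    boot_sizes : dict prey_type -> bootstrap sample size per prey type.
    """
    n_fa = next(iter(prey_sigs.values())).shape[1]
    sig = np.zeros(n_fa)
    for prey, prop in diet.items():
        sample = prey_sigs[prey]
        idx = rng.integers(0, sample.shape[0], size=boot_sizes[prey])
        sig += prop * sample[idx].mean(axis=0)   # bootstrap mean signature
    return sig / sig.sum()                       # renormalize to proportions

# Tiny synthetic example with two prey types and five fatty acids.
rng = np.random.default_rng(6)
prey_sigs = {p: rng.dirichlet(np.ones(5) * 10, size=30) for p in ("seal", "fish")}
diet = {"seal": 0.7, "fish": 0.3}
print(pseudo_predator(prey_sigs, diet, {"seal": 15, "fish": 15}, rng))
```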
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stassi, D.; Ma, H.; Schmidt, T. G., E-mail: taly.gilat-schmidt@marquette.edu
Purpose: Reconstructing a low-motion cardiac phase is expected to improve coronary artery visualization in coronary computed tomography angiography (CCTA) exams. This study developed an automated algorithm for selecting the optimal cardiac phase for CCTA reconstruction. The algorithm uses prospectively gated, single-beat, multiphase data made possible by wide cone-beam imaging. The proposed algorithm differs from previous approaches because the optimal phase is identified based on vessel image quality (IQ) directly, compared to previous approaches that included motion estimation and interphase processing. Because there is no processing of interphase information, the algorithm can be applied to any sampling of image phases, making it suited for prospectively gated studies where only a subset of phases are available. Methods: An automated algorithm was developed to select the optimal phase based on quantitative IQ metrics. For each reconstructed slice at each reconstructed phase, an image quality metric was calculated based on measures of circularity and edge strength of through-plane vessels. The image quality metric was aggregated across slices, while a metric of vessel-location consistency was used to ignore slices that did not contain through-plane vessels. The algorithm performance was evaluated using two observer studies. Fourteen single-beat cardiac CT exams (Revolution CT, GE Healthcare, Chalfont St. Giles, UK) reconstructed at 2% intervals were evaluated for best systolic (1), diastolic (6), or systolic and diastolic phases (7) by three readers and the algorithm. Pairwise inter-reader and reader-algorithm agreement was evaluated using the mean absolute difference (MAD) and concordance correlation coefficient (CCC) between the reader and algorithm-selected phases. A reader-consensus best phase was determined and compared to the algorithm selected phase. In cases where the algorithm and consensus best phases differed by more than 2%, IQ was scored by three readers using a five point Likert scale. Results: There was no statistically significant difference between inter-reader and reader-algorithm agreement for either MAD or CCC metrics (p > 0.1). The algorithm phase was within 2% of the consensus phase in 15/21 of cases. The average absolute difference between consensus and algorithm best phases was 2.29% ± 2.47%, with a maximum difference of 8%. Average image quality scores for the algorithm chosen best phase were 4.01 ± 0.65 overall, 3.33 ± 1.27 for right coronary artery (RCA), 4.50 ± 0.35 for left anterior descending (LAD) artery, and 4.50 ± 0.35 for left circumflex artery (LCX). Average image quality scores for the consensus best phase were 4.11 ± 0.54 overall, 3.44 ± 1.03 for RCA, 4.39 ± 0.39 for LAD, and 4.50 ± 0.18 for LCX. There was no statistically significant difference (p > 0.1) between the image quality scores of the algorithm phase and the consensus phase. Conclusions: The proposed algorithm was statistically equivalent to a reader in selecting an optimal cardiac phase for CCTA exams. When reader and algorithm phases differed by >2%, image quality as rated by blinded readers was statistically equivalent. By detecting the optimal phase for CCTA reconstruction, the proposed algorithm is expected to improve coronary artery visualization in CCTA exams.
NASA Astrophysics Data System (ADS)
Zhu, Maohu; Jie, Nanfeng; Jiang, Tianzi
2014-03-01
A reliable and precise classification of schizophrenia is important for its diagnosis and treatment. Functional magnetic resonance imaging (fMRI) is a novel tool increasingly used in schizophrenia research. Recent advances in statistical learning theory have led to applying pattern classification algorithms to assess the diagnostic value of functional brain networks discovered from resting-state fMRI data. The aim of this study was to propose an adaptive learning algorithm to distinguish schizophrenia patients from normal controls using the resting-state functional language network. Furthermore, the classification of schizophrenia was regarded as a sample selection problem in which a sparse subset of samples is chosen from the labeled training set. Using these selected samples, which we call informative vectors, a classifier for the clinical diagnosis of schizophrenia was established. We experimentally demonstrated that the proposed algorithm, incorporating the resting-state functional language network, achieved 83.6% leave-one-out accuracy on resting-state fMRI data from 27 schizophrenia patients and 28 normal controls. Compared with K-Nearest-Neighbor (KNN), Support Vector Machine (SVM), and l1-norm classifiers, our method yielded better classification performance. Moreover, our results suggest that dysfunction of the resting-state functional language network plays an important role in the clinical diagnosis of schizophrenia.
Machine Learning for Big Data: A Study to Understand Limits at Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sukumar, Sreenivas R.; Del-Castillo-Negrete, Carlos Emilio
This report aims to empirically understand the limits of machine learning when applied to Big Data. We observe that recent innovations in collecting, accessing, organizing, integrating, and querying massive amounts of data from a wide variety of sources have brought statistical data mining and machine learning under more scrutiny, evaluation, and application for gleaning insights from the data than ever before. Much is expected from algorithms without understanding their limitations at scale while dealing with massive datasets. In that context, we pose and address the following questions: How does a machine learning algorithm perform on measures such as accuracy and execution time with increasing sample size and feature dimensionality? Does training with more samples guarantee better accuracy? How many features should be computed for a given problem? Do more features guarantee better accuracy? Are the efforts to derive and calculate more features and to train on larger samples worth it? As problems become more complex and traditional binary classification algorithms are replaced with multi-task, multi-class categorization algorithms, do parallel learners perform better? What happens to the accuracy of the learning algorithm when it is trained to categorize multiple classes within the same feature space? Towards finding answers to these questions, we describe the design of an empirical study and present the results. We conclude with the following observations: (i) the accuracy of the learning algorithm increases with increasing sample size but saturates at a point beyond which more samples do not contribute to better accuracy/learning, (ii) the richness of the feature space dictates performance, both accuracy and training time, (iii) increased dimensionality is often reflected in better performance (higher accuracy in spite of longer training times), but the improvements are not commensurate with the effort for feature computation and training, (iv) the accuracy of the learning algorithms drops significantly when multi-class learners are trained on the same feature matrix, and (v) learning algorithms perform well when the categories in the labeled data are independent (i.e., no relationship or hierarchy exists among categories).
Optimizing Integrated Terminal Airspace Operations Under Uncertainty
NASA Technical Reports Server (NTRS)
Bosson, Christabelle; Xue, Min; Zelinski, Shannon
2014-01-01
In the terminal airspace, integrated departures and arrivals have the potential to increase operations efficiency. Recent research has developed genetic-algorithm-based schedulers for integrated arrival and departure operations under uncertainty. This paper presents an alternate method using a machine job-shop scheduling formulation to model the integrated airspace operations. A multistage stochastic programming approach is chosen to formulate the problem and candidate solutions are obtained by solving sample average approximation problems with finite sample size. Because approximate solutions are computed, the proposed algorithm incorporates the computation of statistical bounds to estimate the optimality of the candidate solutions. A proof-of-concept study is conducted on a baseline implementation of a simple problem considering a fleet mix of 14 aircraft evolving in a model of the Los Angeles terminal airspace. A more thorough statistical analysis is also performed to evaluate the impact of the number of scenarios considered in the sampled problem. To handle extensive sampling computations, a multithreading technique is introduced.
Analysis of the Einstein sample of early-type galaxies
NASA Technical Reports Server (NTRS)
Eskridge, Paul B.; Fabbiano, Giuseppina
1993-01-01
The EINSTEIN galaxy catalog contains x-ray data for 148 early-type (E and S0) galaxies. A detailed analysis of the global properties of this sample is presented. By comparing the x-ray properties with other tracers of the ISM, as well as with observables related to the stellar dynamics and populations of the sample, we expect to determine more clearly the physical relationships that govern the evolution of early-type galaxies. Previous studies with smaller samples have explored the relationships between x-ray luminosity (L(sub X)) and luminosities in other bands. Using our larger sample and the statistical techniques of survival analysis, a number of these earlier analyses were repeated. For our full sample, a strong statistical correlation is found between L(sub X) and L(sub B) (the probability that the null hypothesis is upheld is P less than 10(exp -4) from a variety of rank correlation tests). Regressions with several algorithms yield consistent results.
Effectiveness of feature and classifier algorithms in character recognition systems
NASA Astrophysics Data System (ADS)
Wilson, Charles L.
1993-04-01
At the first Census Optical Character Recognition Systems Conference, NIST generated accuracy data for more than character recognition systems. Most systems were tested on the recognition of isolated digits and upper and lower case alphabetic characters. The recognition experiments were performed on sample sizes of 58,000 digits and 12,000 upper and lower case alphabetic characters. The algorithms used by the 26 conference participants included rule-based methods, image-based methods, statistical methods, and neural networks. The neural network methods included Multi-Layer Perceptrons, Learned Vector Quantization, Neocognitrons, and cascaded neural networks. In this paper, 11 different systems are compared using correlations between the answers of different systems, comparing the decrease in error rate as a function of confidence of recognition, and comparing the writer dependence of recognition. This comparison shows that methods that used different algorithms for feature extraction and recognition performed with very high levels of correlation. This is true for neural network systems, hybrid systems, and statistically based systems, and leads to the conclusion that neural networks have not yet demonstrated a clear superiority to more conventional statistical methods. Comparison of these results with the models of Vapnik (for estimation problems), MacKay (for Bayesian statistical models), Moody (for effective parameterization), and Boltzmann models (for information content) demonstrates that as the limits of training data variance are approached, all classifier systems have similar statistical properties. The limiting condition can only be approached for sufficiently rich feature sets because the accuracy limit is controlled by the available information content of the training set, which must pass through the feature extraction process prior to classification.
Retrieval of volcanic ash height from satellite-based infrared measurements
NASA Astrophysics Data System (ADS)
Zhu, Lin; Li, Jun; Zhao, Yingying; Gong, He; Li, Wenjie
2017-05-01
A new algorithm for retrieving volcanic ash cloud height from satellite-based measurements is presented. This algorithm, which was developed in preparation for China's next-generation meteorological satellite (FY-4), is based on volcanic ash microphysical property simulation and statistical optimal estimation theory. The MSG satellite's main payload, a 12-channel Spinning Enhanced Visible and Infrared Imager, was used as proxy data to test this new algorithm. A series of eruptions of Iceland's Eyjafjallajökull volcano during April to May 2010 and the Puyehue-Cordón Caulle volcanic complex eruption in the Chilean Andes on 16 June 2011 were selected as two typical cases for evaluating the algorithm under various meteorological backgrounds. Independent volcanic ash simulation training samples and satellite-based Cloud-Aerosol Lidar with Orthogonal Polarization data were used as validation data. It is demonstrated that the statistically based volcanic ash height algorithm is able to rapidly retrieve volcanic ash heights, globally. The retrieved ash heights show comparable accuracy with both independent training data and the lidar measurements, which is consistent with previous studies. However, under complicated background, with multilayers in vertical scale, underlying stratus clouds tend to have detrimental effects on the final retrieval accuracy. This is an unresolved problem, like many other previously published methods using passive satellite sensors. Compared with previous studies, the FY-4 ash height algorithm is independent of simultaneous atmospheric profiles, providing a flexible way to estimate volcanic ash height using passive satellite infrared measurements.
Lukashin, A V; Fuchs, R
2001-05-01
Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm guarantees to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search of the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places greater than 90% genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.
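A minimal sketch of the simulated-annealing move at the heart of such a clustering scheme (the iterative selection of the number of clusters described above is omitted): single-profile reassignments are accepted with a Metropolis criterion on the within-cluster sum of squared distances while the temperature is cooled. The toy profiles, cooling schedule, and cost function are illustrative choices, not the authors' exact procedure.

```python
import numpy as np

def sa_cluster(profiles, k, n_steps=20000, t0=1.0, cooling=0.9995, seed=7):
    """Cluster expression profiles by simulated annealing (sketch)."""
    rng = np.random.default_rng(seed)
    n = profiles.shape[0]
    labels = rng.integers(0, k, size=n)

    def cost(lbl):
        # Within-cluster sum of squared distances to the cluster means.
        c = 0.0
        for j in range(k):
            members = profiles[lbl == j]
            if members.size:
                c += ((members - members.mean(axis=0)) ** 2).sum()
        return c

    current, temp = cost(labels), t0
    for _ in range(n_steps):
        i, new = rng.integers(n), rng.integers(k)
        if new != labels[i]:
            trial = labels.copy()
            trial[i] = new
            delta = cost(trial) - current
            if delta <= 0 or rng.random() < np.exp(-delta / temp):
                labels, current = trial, current + delta
        temp *= cooling
    return labels, current

# Toy data: three well-separated groups of temporal profiles.
rng = np.random.default_rng(8)
centers = rng.normal(0, 3, size=(3, 12))
profiles = np.vstack([c + rng.normal(0, 0.3, size=(40, 12)) for c in centers])
labels, final_cost = sa_cluster(profiles, k=3)
print("final cost:", round(final_cost, 2))
```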
NASA Astrophysics Data System (ADS)
Li, Li-Na; Ma, Chang-Ming; Chang, Ming; Zhang, Ren-Cheng
2017-12-01
A novel method based on SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA) and Kernel Partial Least Squares (KPLS), named SIMPLISMA-KPLS, is proposed in this paper for the simultaneous selection of outlier samples and informative samples. It is a quick algorithm used for model standardization (also called model transfer) in near infrared (NIR) spectroscopy. NIR data from a corn experiment for the analysis of protein content are used to evaluate the proposed method. Piecewise direct standardization (PDS) is employed for model transfer, and SIMPLISMA-PDS-KPLS is compared with KS-PDS-KPLS in terms of the prediction accuracy for protein content and the calculation speed of each algorithm. The conclusions are that SIMPLISMA-KPLS can be utilized as an alternative sample selection method for model transfer. Although it has accuracy similar to that of Kennard-Stone (KS), it differs from KS in that it employs concentration information in the selection procedure. This ensures that analyte information is involved in the analysis and that the spectra (X) of the selected samples are interrelated with the concentration (y). It can also be used simultaneously for outlier sample elimination through validation of the calibration. The running-time statistics show that the sample selection process is more rapid when using KPLS. The quick SIMPLISMA-KPLS algorithm is beneficial for improving the speed of online measurement using NIR spectroscopy.
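For reference, a minimal sketch of the Kennard-Stone (KS) comparator mentioned above, which uses only the spectra X (unlike SIMPLISMA-KPLS, which also uses the concentrations y): start from the two most distant samples, then repeatedly add the sample whose minimum distance to the already-selected set is largest. The synthetic "spectra" are placeholders.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Kennard-Stone sample selection on spectra X (rows = samples)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    i, j = np.unravel_index(np.argmax(d), d.shape)              # two most distant
    selected = [int(i), int(j)]
    while len(selected) < n_select:
        remaining = [r for r in range(X.shape[0]) if r not in selected]
        # Distance of each remaining sample to its nearest selected sample.
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_d))])
    return selected

# Demo on synthetic "spectra".
rng = np.random.default_rng(9)
X = rng.normal(size=(60, 100))
print("calibration-transfer subset:", kennard_stone(X, 10))
```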
Data Analysis with Graphical Models: Software Tools
NASA Technical Reports Server (NTRS)
Buntine, Wray L.
1994-01-01
Probabilistic graphical models (directed and undirected Markov fields, and combined in chain graphs) are used widely in expert systems, image processing and other areas as a framework for representing and reasoning with probabilities. They come with corresponding algorithms for performing probabilistic inference. This paper discusses an extension to these models by Spiegelhalter and Gilks, plates, used to graphically model the notion of a sample. This offers a graphical specification language for representing data analysis problems. When combined with general methods for statistical inference, this also offers a unifying framework for prototyping and/or generating data analysis algorithms from graphical specifications. This paper outlines the framework and then presents some basic tools for the task: a graphical version of the Pitman-Koopman Theorem for the exponential family, problem decomposition, and the calculation of exact Bayes factors. Other tools already developed, such as automatic differentiation, Gibbs sampling, and use of the EM algorithm, make this a broad basis for the generation of data analysis software.
Mapping from disease-specific measures to health-state utility values in individuals with migraine.
Gillard, Patrick J; Devine, Beth; Varon, Sepideh F; Liu, Lei; Sullivan, Sean D
2012-05-01
The objective of this study was to develop empirical algorithms that estimate health-state utility values from disease-specific quality-of-life scores in individuals with migraine. Data from a cross-sectional, multicountry study were used. Individuals with episodic and chronic migraine were randomly assigned to training or validation samples. Spearman's correlation coefficients between paired EuroQol five-dimensional (EQ-5D) questionnaire utility values and both Headache Impact Test (HIT-6) scores and Migraine-Specific Quality-of-Life Questionnaire version 2.1 (MSQ) domain scores (role restrictive, role preventive, and emotional function) were examined. Regression models were constructed to estimate EQ-5D questionnaire utility values from the HIT-6 score or the MSQ domain scores. Preferred algorithms were confirmed in the validation samples. In episodic migraine, the preferred HIT-6 and MSQ algorithms explained 22% and 25% of the variance (R(2)) in the training samples, respectively, and had similar prediction errors (root mean square errors of 0.30). In chronic migraine, the preferred HIT-6 and MSQ algorithms explained 36% and 45% of the variance in the training samples, respectively, and had similar prediction errors (root mean square errors 0.31 and 0.29). In episodic and chronic migraine, no statistically significant differences were observed between the mean observed and the mean estimated EQ-5D questionnaire utility values for the preferred HIT-6 and MSQ algorithms in the validation samples. The relationship between the EQ-5D questionnaire and the HIT-6 or the MSQ is adequate to use regression equations to estimate EQ-5D questionnaire utility values. The preferred HIT-6 and MSQ algorithms will be useful in estimating health-state utilities in migraine trials in which no preference-based measure is present. Copyright © 2012 International Society for Pharmacoeconomics and Outcomes Research (ISPOR). Published by Elsevier Inc. All rights reserved.
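A minimal sketch of the kind of mapping algorithm described, assuming a simple OLS fit of EQ-5D utilities on the HIT-6 score; the data are synthetic and the coefficients are illustrative, not the paper's published equations.

```python
import numpy as np

# Fit and apply a simple OLS mapping from HIT-6 scores to EQ-5D utilities.
rng = np.random.default_rng(10)
hit6 = rng.uniform(36, 78, size=300)                       # training sample
eq5d = np.clip(1.2 - 0.01 * hit6 + rng.normal(0, 0.15, 300), -0.2, 1.0)

X = np.column_stack([np.ones_like(hit6), hit6])
beta, *_ = np.linalg.lstsq(X, eq5d, rcond=None)

def predict_utility(score):
    """Estimated EQ-5D utility for a given HIT-6 score (illustrative)."""
    return beta[0] + beta[1] * score

pred = predict_utility(hit6)
rmse = np.sqrt(np.mean((pred - eq5d) ** 2))
print(f"fitted mapping: U = {beta[0]:.3f} + {beta[1]:.4f}*HIT-6, RMSE = {rmse:.3f}")
```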
Analysis of sequencing and scheduling methods for arrival traffic
NASA Technical Reports Server (NTRS)
Neuman, Frank; Erzberger, Heinz
1990-01-01
The air traffic control subsystem that performs scheduling is discussed. The function of the scheduling algorithms is to plan automatically the most efficient landing order and to assign optimally spaced landing times to all arrivals. Several important scheduling algorithms are described and the statistical performance of the scheduling algorithms is examined. Scheduling brings order to an arrival sequence for aircraft. First-come-first-served scheduling (FCFS) establishes a fair order, based on estimated times of arrival, and determines proper separations. Because of the randomness of the traffic, gaps will remain in the scheduled sequence of aircraft. These gaps are filled, or partially filled, by time-advancing the leading aircraft after a gap while still preserving the FCFS order. Tightly scheduled groups of aircraft remain with a mix of heavy and large aircraft. Separation requirements differ for different types of aircraft trailing each other. Advantage is taken of this fact through mild reordering of the traffic, thus shortening the groups and reducing average delays. Actual delays for different samples with the same statistical parameters vary widely, especially for heavy traffic.
NASA Astrophysics Data System (ADS)
Beyhaghi, Pooriya
2016-11-01
This work considers the problem of the efficient minimization of the infinite time average of a stationary ergodic process in the space of a handful of independent parameters which affect it. Problems of this class, derived from physical or numerical experiments which are sometimes expensive to perform, are ubiquitous in turbulence research. In such problems, any given function evaluation, determined with finite sampling, is associated with a quantifiable amount of uncertainty, which may be reduced via additional sampling. This work proposes the first algorithm of this type. Our algorithm remarkably reduces the overall cost of the optimization process for problems of this class. Further, under certain well-defined conditions, rigorous proof of convergence is established to the global minimum of the problem considered.
Fusion And Inference From Multiple And Massive Disparate Distributed Dynamic Data Sets
2017-07-01
principled methodology for two-sample graph testing; designed a provably almost-surely perfect vertex clustering algorithm for block model graphs; proved ... Semi-Supervised Clustering Methodology ... Robust Hypothesis Testing ... dimensional Euclidean space allows the full arsenal of statistical and machine learning methodology for multivariate Euclidean data to be deployed for
Multi-level methods and approximating distribution functions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wilson, D., E-mail: daniel.wilson@dtc.ox.ac.uk; Baker, R. E.
2016-07-15
Biochemical reaction networks are often modelled using discrete-state, continuous-time Markov chains. System statistics of these Markov chains usually cannot be calculated analytically and therefore estimates must be generated via simulation techniques. There is a well-documented class of simulation techniques known as exact stochastic simulation algorithms, an example of which is Gillespie's direct method. These algorithms often come with high computational costs; therefore, approximate stochastic simulation algorithms such as the tau-leap method are used. However, in order to minimise the bias in the estimates generated using them, a relatively small value of tau is needed, rendering the computational costs comparable to Gillespie's direct method. The multi-level Monte Carlo method (Anderson and Higham, Multiscale Model. Simul. 10:146–179, 2012) provides a reduction in computational costs whilst minimising or even eliminating the bias in the estimates of system statistics. This is achieved by first crudely approximating required statistics with many sample paths of low accuracy. Then correction terms are added until a required level of accuracy is reached. Recent literature has primarily focussed on implementing the multi-level method efficiently to estimate a single system statistic. However, it is clearly also of interest to be able to approximate entire probability distributions of species counts. We present two novel methods that combine known techniques for distribution reconstruction with the multi-level method. We demonstrate the potential of our methods using a number of examples.
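As a point of reference for the exact simulation algorithms mentioned above, here is a minimal Python sketch of Gillespie's direct method for a single decay reaction; the rate constant, initial copy number, and end time are illustrative assumptions, and none of the multi-level machinery of the paper is reproduced.

    import numpy as np

    def gillespie_decay(x0=100, k=0.1, t_end=50.0, rng=np.random.default_rng(0)):
        """Exact SSA sample path for the decay reaction X -> 0 with propensity k*X."""
        t, x = 0.0, x0
        times, states = [t], [x]
        while t < t_end and x > 0:
            a = k * x                      # total propensity
            t += rng.exponential(1.0 / a)  # waiting time to the next reaction
            x -= 1                         # fire the only reaction channel
            times.append(t)
            states.append(x)
        return np.array(times), np.array(states)

    ts, xs = gillespie_decay()
    print(ts[-1], xs[-1])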
3D widefield light microscope image reconstruction without dyes
NASA Astrophysics Data System (ADS)
Larkin, S.; Larson, J.; Holmes, C.; Vaicik, M.; Turturro, M.; Jurkevich, A.; Sinha, S.; Ezashi, T.; Papavasiliou, G.; Brey, E.; Holmes, T.
2015-03-01
3D image reconstruction using light microscope modalities without exogenous contrast agents is proposed and investigated as an approach to produce 3D images of biological samples for live imaging applications. Multimodality and multispectral imaging, used in concert with this 3D optical sectioning approach, is also proposed as a way to further produce contrast that could be specific to components in the sample. The methods avoid usage of contrast agents. Contrast agents, such as fluorescent or absorbing dyes, can be toxic to cells or alter cell behavior. Current modes of producing 3D image sets from a light microscope, such as 3D deconvolution algorithms and confocal microscopy, generally require contrast agents. Zernike phase contrast (ZPC), transmitted light brightfield (TLB), darkfield microscopy and others can produce contrast without dyes. Some of these modalities have not previously benefitted from 3D image reconstruction algorithms, however. The 3D image reconstruction algorithm is based on an underlying physical model of scattering potential, expressed as the sample's 3D absorption and phase quantities. The algorithm is based upon optimizing an objective function - the I-divergence - while solving for the 3D absorption and phase quantities. Unlike typical deconvolution algorithms, each microscope modality, such as ZPC or TLB, produces two output image sets instead of one. Contrast in the displayed image and 3D renderings is further enabled by treating the multispectral/multimodal data as a feature set in a mathematical formulation that uses the principal component method of statistics.
Efficiency of exchange schemes in replica exchange
NASA Astrophysics Data System (ADS)
Lingenheil, Martin; Denschlag, Robert; Mathias, Gerald; Tavan, Paul
2009-08-01
In replica exchange simulations a fast diffusion of the replicas through the temperature space maximizes the efficiency of the statistical sampling. Here, we compare the diffusion speed as measured by the round trip rates for four exchange algorithms. We find different efficiency profiles with optimal average acceptance probabilities ranging from 8% to 41%. The best performance is determined by benchmark simulations for the most widely used algorithm, which alternately tries to exchange all even and all odd replica pairs. By analytical mathematics we show that the excellent performance of this exchange scheme is due to the high diffusivity of the underlying random walk.
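A minimal Python sketch of the alternating even/odd exchange scheme described above, using the standard Metropolis acceptance rule for neighbouring temperature pairs; the replica energies and temperature ladder are synthetic placeholders rather than data from the study.

    import numpy as np

    def exchange_sweep(energies, betas, sweep_index, rng):
        """Attempt swaps of all even or all odd neighbouring pairs, alternating per sweep."""
        order = np.arange(len(betas))        # which replica currently sits at which temperature
        start = sweep_index % 2              # 0: pairs (0,1),(2,3),...; 1: pairs (1,2),(3,4),...
        for i in range(start, len(betas) - 1, 2):
            delta = (betas[i] - betas[i + 1]) * (energies[order[i]] - energies[order[i + 1]])
            if rng.random() < min(1.0, np.exp(delta)):          # Metropolis acceptance
                order[i], order[i + 1] = order[i + 1], order[i]  # accepted swap
        return order

    rng = np.random.default_rng(1)
    print(exchange_sweep(rng.normal(size=8), np.linspace(1.0, 0.2, 8), sweep_index=0, rng=rng))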
DOE Office of Scientific and Technical Information (OSTI.GOV)
Caillet, V; Colvill, E; Royal North Shore Hospital, St Leonards, Sydney
2016-06-15
Purpose: Multi-leaf collimator (MLC) tracking is being clinically pioneered to continuously compensate for thoracic and abdominal motion during radiotherapy. The purpose of this work is to characterize the performance of two MLC tracking algorithms for cancer radiotherapy, based on a direct optimization and a piecewise leaf-fitting approach, respectively. Methods: To test the algorithms, both physical and in silico experiments were performed. Previously published high and low modulation VMAT plans for lung and prostate cancer cases were used along with eight patient-measured organ-specific trajectories. For both MLC tracking algorithms, the plans were run with their corresponding patient trajectories. The physical experiments were performed on a Trilogy Varian linac and a programmable phantom (HexaMotion platform). For each MLC tracking algorithm, plan and patient trajectory, the tracking accuracy was quantified as the difference in aperture area between ideal and fitted MLC. To compare algorithms, the average cumulative tracking error area for each experiment was calculated. The two-sample Kolmogorov-Smirnov (KS) test was used to evaluate the cumulative tracking errors between algorithms. Results: Comparison of tracking errors for the physical and in silico experiments showed minor differences between the two algorithms. The KS D-statistics for the physical experiments were below 0.05, denoting no significant differences between the two distribution patterns, and the average error areas (direct optimization/piecewise leaf-fitting) were comparable (66.64 cm²/65.65 cm²). For the in silico experiments, the KS D-statistics were below 0.05 and the average error areas were also equivalent (49.38 cm²/48.98 cm²). Conclusion: The comparison between the two leaf-fitting algorithms demonstrated no significant differences in tracking errors, neither in a clinically realistic environment nor in silico. The similarities in the two independent algorithms give confidence in the use of either algorithm for clinical implementation.
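For readers unfamiliar with the statistical comparison used above, the following Python sketch runs a two-sample Kolmogorov-Smirnov test on two sets of per-experiment tracking-error areas; the arrays are synthetic stand-ins, not the study's measurements.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(2)
    errors_direct = rng.normal(66.6, 5.0, size=40)      # synthetic error areas (cm^2)
    errors_piecewise = rng.normal(65.7, 5.0, size=40)

    stat, p_value = ks_2samp(errors_direct, errors_piecewise)
    print(f"KS D-statistic = {stat:.3f}, p = {p_value:.3f}")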
NASA Astrophysics Data System (ADS)
Piretzidis, Dimitrios; Sra, Gurveer; Karantaidis, George; Sideris, Michael G.
2017-04-01
A new method for identifying correlated errors in Gravity Recovery and Climate Experiment (GRACE) monthly harmonic coefficients has been developed and tested. Correlated errors are present in the differences between monthly GRACE solutions, and can be suppressed using a de-correlation filter. In principle, the de-correlation filter should be implemented only on coefficient series with correlated errors to avoid losing useful geophysical information. In previous studies, two main methods of implementing the de-correlation filter have been utilized. In the first one, the de-correlation filter is implemented starting from a specific minimum order until the maximum order of the monthly solution examined. In the second one, the de-correlation filter is implemented only on specific coefficient series, the selection of which is based on statistical testing. The method proposed in the present study exploits the capabilities of supervised machine learning algorithms such as neural networks and support vector machines (SVMs). The pattern of correlated errors can be described by several numerical and geometric features of the harmonic coefficient series. The features of extreme cases of both correlated and uncorrelated coefficients are extracted and used for the training of the machine learning algorithms. The trained machine learning algorithms are later used to identify correlated errors and provide the probability of a coefficient series to be correlated. Regarding SVM algorithms, an extensive study is performed with various kernel functions in order to find the optimal training model for prediction. The selection of the optimal training model is based on the classification accuracy of the trained SVM algorithm on the same samples used for training. Results show excellent performance of all algorithms with a classification accuracy of 97%-100% on a pre-selected set of training samples, both in the validation stage of the training procedure and in the subsequent use of the trained algorithms to classify independent coefficients. This accuracy is also confirmed by the external validation of the trained algorithms using the hydrology model GLDAS NOAH. The proposed method meets the requirement of identifying and de-correlating only coefficients with correlated errors. Also, there is no need to apply statistical testing or other techniques that require prior de-correlation of the harmonic coefficients.
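A hedged Python sketch of the kind of workflow described above: train SVM classifiers with several kernel functions on features of coefficient series, select the kernel by training-set accuracy, and output a probability of a series being correlated. The feature matrix, labels, and kernel list are placeholders, not the study's data or settings.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 6))                  # placeholder features of coefficient series
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # placeholder labels: 1 = correlated errors

    best = None
    for kernel in ("linear", "rbf", "poly"):
        clf = SVC(kernel=kernel, probability=True).fit(X, y)
        acc = clf.score(X, y)                      # accuracy on the training samples
        if best is None or acc > best[1]:
            best = (kernel, acc, clf)

    kernel, acc, clf = best
    print(kernel, acc, clf.predict_proba(X[:3])[:, 1])  # probability of being a correlated series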
Evaluation of ultrasonic array imaging algorithms for inspection of a coarse grained material
NASA Astrophysics Data System (ADS)
Van Pamel, A.; Lowe, M. J. S.; Brett, C. R.
2014-02-01
Improving the ultrasound inspection capability for coarse grain metals remains of longstanding interest to industry and the NDE research community and is expected to become increasingly important for next generation power plants. A test sample of coarse grained Inconel 625 which is representative of future power plant components has been manufactured to test the detectability of different inspection techniques. Conventional ultrasonic A, B, and C-scans showed the sample to be extraordinarily difficult to inspect due to its scattering behaviour. However, in recent years, array probes and Full Matrix Capture (FMC) imaging algorithms, which extract the maximum amount of information possible, have unlocked exciting possibilities for improvements. This article proposes a robust methodology to evaluate the detection performance of imaging algorithms, applying this to three FMC imaging algorithms: Total Focusing Method (TFM), Phase Coherent Imaging (PCI), and Decomposition of the Time Reversal Operator with Multiple Scattering (DORT MSF). The methodology considers the statistics of detection, presenting the detection performance as probability of detection (POD) and probability of false alarm (PFA). The data is captured in pulse-echo mode using 64 element array probes at centre frequencies of 1 MHz and 5 MHz. All three algorithms are shown to perform very similarly when comparing their flaw detection capabilities on this particular case.
Scheid, Anika; Nebel, Markus E
2012-07-09
Over the past years, statistical and Bayesian approaches have become increasingly appreciated to address the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for the prediction of RNA secondary structures from a single sequence has been studied which is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (both with respect to prediction accuracy and quality of generated samples), where neither of these two competing approaches generally outperforms the other. In this work, we will consider the SCFG based approach in order to perform an analysis on how the quality of generated sample sets and the corresponding prediction accuracy changes when different degrees of disturbances are incorporated into the needed sampling probabilities. This is motivated by the fact that if the results prove to be resistant to large errors on the distinct sampling probabilities (compared to the exact ones), then it will be an indication that these probabilities do not need to be computed exactly, but it may be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures can be observed to strongly react to slight disturbances, there is little hope for improving the complexity by heuristic procedures. We hence provide a reliable test for the hypothesis that a heuristic method could be implemented to improve the time scaling of RNA secondary structure prediction in the worst-case - without sacrificing much of the accuracy of the results. Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only small negative impact on both the predictive accuracy and the overall quality of resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method guaranteeing an acceptable predictive accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms.
A cluster pattern algorithm for the analysis of multiparametric cell assays.
Kaufman, Menachem; Bloch, David; Zurgil, Naomi; Shafran, Yana; Deutsch, Mordechai
2005-09-01
The issue of multiparametric analysis of complex single cell assays of both static and flow cytometry (SC and FC, respectively) has become common in recent years. In such assays, the analysis of changes, applying common statistical parameters and tests, often fails to detect significant differences between the investigated samples. The cluster pattern similarity (CPS) measure between two sets of gated clusters is based on computing the difference between their density distribution functions' set points. The CPS was applied for the discrimination between two observations in a four-dimensional parameter space. The similarity coefficient (r) ranges from 0 (perfect similarity) to 1 (dissimilar). Three CPS validation tests were carried out: on the same stock samples of fluorescent beads, yielding very low r's (0, 0.066); and on two cell models: mitogenic stimulation of peripheral blood mononuclear cells (PBMC), and apoptosis induction in the Jurkat T cell line by H2O2. In both of the latter cases, r indicated similarity (r < 0.23) within the same group, and dissimilarity (r > 0.48) otherwise. This classification approach offers a measure of similarity between samples. It relies on the multidimensional pattern of the sample parameters. The algorithm compensates for environmental drifts in this apparatus and assay; it may also be applied to more than four dimensions.
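The published CPS definition is not reproduced here; as a rough illustration of comparing two gated, multiparametric samples through their binned density distributions, the following Python sketch computes a similarity coefficient in [0, 1] using a total variation distance between four-dimensional histograms. The binning and the choice of distance are assumptions for the example.

    import numpy as np

    def pattern_similarity(sample_a, sample_b, bins=8):
        """Crude similarity coefficient: 0 = identical binned densities, 1 = disjoint."""
        edges = [np.linspace(lo, hi, bins + 1)
                 for lo, hi in zip(np.minimum(sample_a.min(0), sample_b.min(0)),
                                   np.maximum(sample_a.max(0), sample_b.max(0)))]
        ha, _ = np.histogramdd(sample_a, bins=edges)
        hb, _ = np.histogramdd(sample_b, bins=edges)
        pa, pb = ha / ha.sum(), hb / hb.sum()
        return 0.5 * np.abs(pa - pb).sum()          # total variation distance

    rng = np.random.default_rng(4)
    a = rng.normal(size=(500, 4))                   # four measured parameters per cell
    b = rng.normal(0.3, 1.0, size=(500, 4))
    print(round(pattern_similarity(a, b), 3))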
The Cross-Correlation and Reshuffling Tests in Discerning Induced Seismicity
NASA Astrophysics Data System (ADS)
Schultz, Ryan; Telesca, Luciano
2018-05-01
In recent years, cases of newly emergent induced clusters have increased seismic hazard and risk in locations with social, environmental, and economic consequence. Thus, the need for a quantitative and robust means to discern induced seismicity has become a critical concern. This paper reviews a Matlab-based algorithm designed to quantify the statistical confidence between two time-series datasets. Similar to prior approaches, our method utilizes the cross-correlation to delineate the strength and lag of correlated signals. In addition, use of surrogate reshuffling tests allows for dynamic testing against statistical confidence intervals of anticipated spurious correlations. We demonstrate the robust nature of our algorithm in a suite of synthetic tests to determine the limits of accurate signal detection in the presence of noise and sub-sampling. Overall, this routine has considerable merit in delineating the strength of correlated signals, with applications that include the discernment of induced seismicity from natural seismicity.
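As a sketch of the general approach (lagged cross-correlation assessed against reshuffled surrogates), the Python fragment below estimates the correlation at each lag and a 95% confidence level from permuted series; the original routine is Matlab-based, and the synthetic injection and seismicity series here are purely illustrative.

    import numpy as np

    def xcorr_with_surrogates(x, y, max_lag=10, n_surrogates=500, rng=None):
        """Return lagged correlations and the 95th percentile of reshuffled surrogates."""
        rng = rng or np.random.default_rng(5)
        x = (x - x.mean()) / x.std()
        y = (y - y.mean()) / y.std()
        lags = range(-max_lag, max_lag + 1)
        corr = [np.corrcoef(np.roll(x, lag), y)[0, 1] for lag in lags]
        surr_max = []
        for _ in range(n_surrogates):
            xs = rng.permutation(x)                 # reshuffle to destroy temporal ordering
            surr_max.append(max(abs(np.corrcoef(np.roll(xs, lag), y)[0, 1]) for lag in lags))
        return np.array(corr), np.percentile(surr_max, 95)

    rng = np.random.default_rng(5)
    inj = rng.poisson(5, 300).astype(float)           # e.g. monthly injection volumes
    seis = np.roll(inj, 2) + rng.normal(0, 1, 300)    # seismicity rate lagging by 2 samples
    c, threshold = xcorr_with_surrogates(inj, seis)
    print(c.argmax() - 10, round(c.max(), 2), round(threshold, 2))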
Stassi, D; Dutta, S; Ma, H; Soderman, A; Pazzani, D; Gros, E; Okerlund, D; Schmidt, T G
2016-01-01
Reconstructing a low-motion cardiac phase is expected to improve coronary artery visualization in coronary computed tomography angiography (CCTA) exams. This study developed an automated algorithm for selecting the optimal cardiac phase for CCTA reconstruction. The algorithm uses prospectively gated, single-beat, multiphase data made possible by wide cone-beam imaging. The proposed algorithm differs from previous approaches because the optimal phase is identified based on vessel image quality (IQ) directly, compared to previous approaches that included motion estimation and interphase processing. Because there is no processing of interphase information, the algorithm can be applied to any sampling of image phases, making it suited for prospectively gated studies where only a subset of phases are available. An automated algorithm was developed to select the optimal phase based on quantitative IQ metrics. For each reconstructed slice at each reconstructed phase, an image quality metric was calculated based on measures of circularity and edge strength of through-plane vessels. The image quality metric was aggregated across slices, while a metric of vessel-location consistency was used to ignore slices that did not contain through-plane vessels. The algorithm performance was evaluated using two observer studies. Fourteen single-beat cardiac CT exams (Revolution CT, GE Healthcare, Chalfont St. Giles, UK) reconstructed at 2% intervals were evaluated for best systolic (1), diastolic (6), or systolic and diastolic phases (7) by three readers and the algorithm. Pairwise inter-reader and reader-algorithm agreement was evaluated using the mean absolute difference (MAD) and concordance correlation coefficient (CCC) between the reader and algorithm-selected phases. A reader-consensus best phase was determined and compared to the algorithm selected phase. In cases where the algorithm and consensus best phases differed by more than 2%, IQ was scored by three readers using a five point Likert scale. There was no statistically significant difference between inter-reader and reader-algorithm agreement for either MAD or CCC metrics (p > 0.1). The algorithm phase was within 2% of the consensus phase in 15/21 of cases. The average absolute difference between consensus and algorithm best phases was 2.29% ± 2.47%, with a maximum difference of 8%. Average image quality scores for the algorithm chosen best phase were 4.01 ± 0.65 overall, 3.33 ± 1.27 for right coronary artery (RCA), 4.50 ± 0.35 for left anterior descending (LAD) artery, and 4.50 ± 0.35 for left circumflex artery (LCX). Average image quality scores for the consensus best phase were 4.11 ± 0.54 overall, 3.44 ± 1.03 for RCA, 4.39 ± 0.39 for LAD, and 4.50 ± 0.18 for LCX. There was no statistically significant difference (p > 0.1) between the image quality scores of the algorithm phase and the consensus phase. The proposed algorithm was statistically equivalent to a reader in selecting an optimal cardiac phase for CCTA exams. When reader and algorithm phases differed by >2%, image quality as rated by blinded readers was statistically equivalent. By detecting the optimal phase for CCTA reconstruction, the proposed algorithm is expected to improve coronary artery visualization in CCTA exams.
Driving mechanism of unsteady separation shock motion in hypersonic interactive flow
NASA Technical Reports Server (NTRS)
Dolling, D. S.; Narlo, J. C., II
1987-01-01
Wall pressure fluctuations were measured under the unsteady separation shock waves in Mach 5 turbulent interactions induced by unswept circular cylinders on a flat plate. The wall temperature was adiabatic. A conditional sampling algorithm was developed to examine the statistics of the shock wave motion. The same algorithm was used to examine data taken in earlier studies in the Princeton University Mach 3 blowdown tunnel. In these earlier studies, hemicylindrically blunted fins of different leading-edge diameters were tested in boundary layers which developed on the tunnel floor and on a flat plate. A description of the algorithm, the reasons why it was developed, and the sensitivity of the results to the threshold settings are discussed. The results from the algorithm, together with cross correlations and power spectral density estimates, suggest that the shock motion is driven by the low-frequency unsteadiness of the downstream separated, vortical flow.
An Uncertainty Quantification Framework for Remote Sensing Retrievals
NASA Astrophysics Data System (ADS)
Braverman, A. J.; Hobbs, J.
2017-12-01
Remote sensing data sets produced by NASA and other space agencies are the result of complex algorithms that infer geophysical state from observed radiances using retrieval algorithms. The processing must keep up with the downlinked data flow, and this necessitates computational compromises that affect the accuracies of retrieved estimates. The algorithms are also limited by imperfect knowledge of physics and of ancillary inputs that are required. All of this contributes to uncertainties that are generally not rigorously quantified by stepping outside the assumptions that underlie the retrieval methodology. In this talk we discuss a practical framework for uncertainty quantification that can be applied to a variety of remote sensing retrieval algorithms. Ours is a statistical approach that uses Monte Carlo simulation to approximate the sampling distribution of the retrieved estimates. We will discuss the strengths and weaknesses of this approach, and provide a case-study example from the Orbiting Carbon Observatory 2 mission.
Small convolution kernels for high-fidelity image restoration
NASA Technical Reports Server (NTRS)
Reichenbach, Stephen E.; Park, Stephen K.
1991-01-01
An algorithm is developed for computing the mean-square-optimal values for small, image-restoration kernels. The algorithm is based on a comprehensive, end-to-end imaging system model that accounts for the important components of the imaging process: the statistics of the scene, the point-spread function of the image-gathering device, sampling effects, noise, and display reconstruction. Subject to constraints on the spatial support of the kernel, the algorithm generates the kernel values that restore the image with maximum fidelity, that is, the kernel minimizes the expected mean-square restoration error. The algorithm is consistent with the derivation of the spatially unconstrained Wiener filter, but leads to a small, spatially constrained kernel that, unlike the unconstrained filter, can be efficiently implemented by convolution. Simulation experiments demonstrate that for a wide range of imaging systems these small kernels can restore images with fidelity comparable to images restored with the unconstrained Wiener filter.
Visell, Yon
2015-04-01
This paper proposes a fast, physically accurate method for synthesizing multimodal, acoustic and haptic, signatures of distributed fracture in quasi-brittle heterogeneous materials, such as wood, granular media, or other fiber composites. Fracture processes in these materials are challenging to simulate with existing methods, due to the prevalence of large numbers of disordered, quasi-random spatial degrees of freedom, representing the complex physical state of a sample over the geometric volume of interest. Here, I develop an algorithm for simulating such processes, building on a class of statistical lattice models of fracture that have been widely investigated in the physics literature. This algorithm is enabled through a recently published mathematical construction based on the inverse transform method of random number sampling. It yields a purely time domain stochastic jump process representing stress fluctuations in the medium. The latter can be readily extended by a mean field approximation that captures the averaged constitutive (stress-strain) behavior of the material. Numerical simulations and interactive examples demonstrate the ability of these algorithms to generate physically plausible acoustic and haptic signatures of fracture in complex, natural materials interactively at audio sampling rates.
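The inverse transform method mentioned above can be illustrated with a short Python sketch that draws samples from a generic power-law distribution via its analytic inverse CDF; the exponent and lower cutoff are assumptions, and the fracture-lattice physics of the paper is not modelled.

    import numpy as np

    def sample_power_law(n, alpha=2.5, xmin=1.0, rng=None):
        """Inverse transform sampling of p(x) ~ x^(-alpha) for x >= xmin."""
        rng = rng or np.random.default_rng(6)
        u = rng.random(n)                                   # uniform samples on [0, 1)
        return xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))   # analytic inverse CDF

    events = sample_power_law(10000)    # e.g. magnitudes of stress-drop events
    print(round(events.mean(), 3), round(events.max(), 1))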
Active learning for clinical text classification: is it better than random sampling?
Figueroa, Rosa L; Zeng-Treitler, Qing; Ngo, Long H; Goryachev, Sergey; Wiechmann, Eduardo P
2012-01-01
This study explores active learning algorithms as a way to reduce the requirements for large training sets in medical text classification tasks. Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results. Classification accuracy and area under receiver operating characteristics (ROC) curves for each algorithm at different sample sizes were generated. The performance of active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why the performance varies on different datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated the results with the performance differences. The DIST and CMB algorithms performed better than passive learning. With a statistical significance level set at 0.05, DIST outperformed passive learning in all five datasets, while CMB was found to be better than passive learning in four datasets. We found strong correlations between the dataset diversity and the DIV performance, as well as the dataset uncertainty and the performance of the DIST algorithm. For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.
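A minimal Python sketch of a distance-based (DIST-style) active learning loop, in which the unlabeled sample closest to the current decision boundary is queried at each step; the classifier, data, and seed-set construction are illustrative assumptions rather than the study's implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def dist_active_learning(X, y, n_initial=10, n_queries=20, rng=None):
        """Query the unlabeled point closest to the decision boundary at each step."""
        rng = rng or np.random.default_rng(7)
        labeled = (list(rng.choice(np.where(y == 0)[0], n_initial // 2, replace=False)) +
                   list(rng.choice(np.where(y == 1)[0], n_initial // 2, replace=False)))
        pool = [i for i in range(len(X)) if i not in labeled]
        clf = LogisticRegression(max_iter=1000)
        for _ in range(n_queries):
            clf.fit(X[labeled], y[labeled])
            margins = np.abs(clf.decision_function(X[pool]))
            pick = pool.pop(int(np.argmin(margins)))   # most uncertain sample
            labeled.append(pick)                       # simulate asking an annotator for its label
        return clf.fit(X[labeled], y[labeled])

    rng = np.random.default_rng(7)
    X = rng.normal(size=(500, 5)); y = (X[:, 0] > 0).astype(int)
    model = dist_active_learning(X, y)
    print(round(model.score(X, y), 3))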
NASA Astrophysics Data System (ADS)
Khan, Asif; Ryoo, Chang-Kyung; Kim, Heung Soo
2017-04-01
This paper presents a comparative study of different classification algorithms for the classification of various types of inter-ply delaminations in smart composite laminates. Improved layerwise theory is used to model delamination at different interfaces along the thickness and longitudinal directions of the smart composite laminate. The input-output data obtained through a surface-bonded piezoelectric sensor and actuator are analyzed by the system identification algorithm to obtain the system parameters. The identified parameters for the healthy and delaminated structure are supplied as input data to the classification algorithms. The classification algorithms considered in this study are ZeroR, Classification via regression, Naïve Bayes, Multilayer Perceptron, Sequential Minimal Optimization, Multiclass-Classifier, and Decision tree (J48). The open source software Waikato Environment for Knowledge Analysis (WEKA) is used to evaluate the classification performance of the classifiers mentioned above via 75-25 holdout and leave-one-sample-out cross-validation, in terms of classification accuracy, precision, recall, kappa statistic, and ROC area.
2D Affine and Projective Shape Analysis.
Bryner, Darshan; Klassen, Eric; Huiling Le; Srivastava, Anuj
2014-05-01
Current techniques for shape analysis tend to seek invariance to similarity transformations (rotation, translation, and scale), but certain imaging situations require invariance to larger groups, such as affine or projective groups. Here we present a general Riemannian framework for shape analysis of planar objects where metrics and related quantities are invariant to affine and projective groups. Highlighting two possibilities for representing object boundaries, namely ordered points (or landmarks) and parameterized curves, we study different combinations of these representations (points and curves) and transformations (affine and projective). Specifically, we provide solutions to three out of four situations and develop algorithms for computing geodesics and intrinsic sample statistics, leading up to Gaussian-type statistical models, and classifying test shapes using such models learned from training data. In the case of parameterized curves, we also achieve the desired goal of invariance to re-parameterizations. The geodesics are constructed by particularizing the path-straightening algorithm to geometries of current manifolds and are used, in turn, to compute shape statistics and Gaussian-type shape models. We demonstrate these ideas using a number of examples from shape and activity recognition.
Czaplewski, Raymond L.
2015-01-01
Wall-to-wall remotely sensed data are increasingly available to monitor landscape dynamics over large geographic areas. However, statistical monitoring programs that use post-stratification cannot fully utilize those sensor data. The Kalman filter (KF) is an alternative statistical estimator. I develop a new KF algorithm that is numerically robust with large numbers of study variables and auxiliary sensor variables. A National Forest Inventory (NFI) illustrates application within an official statistics program. Practical recommendations regarding remote sensing and statistical issues are offered. This algorithm has the potential to increase the value of synoptic sensor data for statistical monitoring of large geographic areas. PMID:26393588
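As a reminder of the estimator involved, the following Python sketch performs a single Kalman filter measurement update in which an auxiliary (remotely sensed) observation refines a study variable; the state, observation matrix, and noise levels are invented for the example and do not reflect the NFI application.

    import numpy as np

    def kalman_update(x_prior, P_prior, z, H, R):
        """One measurement update: combine prior state (x, P) with observation z."""
        S = H @ P_prior @ H.T + R                 # innovation covariance
        K = P_prior @ H.T @ np.linalg.inv(S)      # Kalman gain
        x_post = x_prior + K @ (z - H @ x_prior)  # updated state estimate
        P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior
        return x_post, P_post

    x0 = np.array([10.0])          # e.g. a forest attribute per unit area (field estimate)
    P0 = np.array([[4.0]])         # its variance
    H = np.array([[1.2]])          # assumed linear link to a remotely sensed index
    R = np.array([[1.0]])          # sensor error variance
    print(kalman_update(x0, P0, np.array([13.0]), H, R))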
Transport Coefficients from Large Deviation Functions
NASA Astrophysics Data System (ADS)
Gao, Chloe; Limmer, David
2017-10-01
We describe a method for computing transport coefficients from the direct evaluation of large deviation functions. This method is general, relying only on equilibrium fluctuations, and is statistically efficient, employing trajectory-based importance sampling. Equilibrium fluctuations of molecular currents are characterized by their large deviation functions, which are scaled cumulant generating functions analogous to the free energy. A diffusion Monte Carlo algorithm is used to evaluate the large deviation functions, from which arbitrary transport coefficients are derivable. We find significant statistical improvement over traditional Green-Kubo based calculations. The systematic and statistical errors of this method are analyzed in the context of specific transport coefficient calculations, including the shear viscosity, interfacial friction coefficient, and thermal conductivity.
Frequent statistics of link-layer bit stream data based on AC-IM algorithm
NASA Astrophysics Data System (ADS)
Cao, Chenghong; Lei, Yingke; Xu, Yiming
2017-08-01
At present, there is much research on data processing using classical pattern matching and its improved algorithms, but little on frequent-sequence statistics for link-layer bit stream data. Because classical multi-pattern matching algorithms such as the AC algorithm have high computational complexity and low efficiency, and cannot be applied directly to binary bit stream data, this paper adopts a frequent-statistics method for link-layer bit stream data based on the AC-IM algorithm. The method's maximum jump distance in the pattern tree is the length of the shortest pattern string plus 3, without missing any matches. The paper first presents a theoretical analysis of the principle of the algorithm's construction; the experimental results then show that the algorithm can adapt to the binary bit stream data environment and extract frequent sequences more accurately, with a clear effect. Meanwhile, compared with the classical AC algorithm and other improved algorithms, the AC-IM algorithm achieves a greater maximum jump distance and is less time-consuming.
Bayesian reconstruction of projection reconstruction NMR (PR-NMR).
Yoon, Ji Won
2014-11-01
Projection reconstruction nuclear magnetic resonance (PR-NMR) is a technique for generating multidimensional NMR spectra. A small number of projections from lower-dimensional NMR spectra are used to reconstruct the multidimensional NMR spectra. In our previous work, it was shown that multidimensional NMR spectra are efficiently reconstructed using a peak-by-peak based reversible jump Markov chain Monte Carlo (RJMCMC) algorithm. We propose an extended and generalized RJMCMC algorithm that replaces a simple linear model with a linear mixed model to reconstruct close NMR spectra into true spectra. This statistical method generates samples in a Bayesian scheme. Our proposed algorithm is tested on a set of six projections derived from the three-dimensional 700 MHz HNCO spectrum of the protein HasA. Copyright © 2014 Elsevier Ltd. All rights reserved.
Enabling phenotypic big data with PheNorm.
Yu, Sheng; Ma, Yumeng; Gronsbell, Jessica; Cai, Tianrun; Ananthakrishnan, Ashwin N; Gainer, Vivian S; Churchill, Susanne E; Szolovits, Peter; Murphy, Shawn N; Kohane, Isaac S; Liao, Katherine P; Cai, Tianxi
2018-01-01
Electronic health record (EHR)-based phenotyping infers whether a patient has a disease based on the information in his or her EHR. A human-annotated training set with gold-standard disease status labels is usually required to build an algorithm for phenotyping based on a set of predictive features. The time intensiveness of annotation and feature curation severely limits the ability to achieve high-throughput phenotyping. While previous studies have successfully automated feature curation, annotation remains a major bottleneck. In this paper, we present PheNorm, a phenotyping algorithm that does not require expert-labeled samples for training. The most predictive features, such as the number of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes or mentions of the target phenotype, are normalized to resemble a normal mixture distribution with high area under the receiver operating curve (AUC) for prediction. The transformed features are then denoised and combined into a score for accurate disease classification. We validated the accuracy of PheNorm with 4 phenotypes: coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis. The AUCs of the PheNorm score reached 0.90, 0.94, 0.95, and 0.94 for the 4 phenotypes, respectively, which were comparable to the accuracy of supervised algorithms trained with sample sizes of 100-300, with no statistically significant difference. The accuracy of the PheNorm algorithms is on par with algorithms trained with annotated samples. PheNorm fully automates the generation of accurate phenotyping algorithms and demonstrates the capacity for EHR-driven annotations to scale to the next level - phenotypic big data. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Space Object Maneuver Detection Algorithms Using TLE Data
NASA Astrophysics Data System (ADS)
Pittelkau, M.
2016-09-01
An important aspect of Space Situational Awareness (SSA) is detection of deliberate and accidental orbit changes of space objects. Although space surveillance systems detect orbit maneuvers within their tracking algorithms, maneuver data are not readily disseminated for general use. However, two-line element (TLE) data are available and can be used to detect maneuvers of space objects. This work is an attempt to improve upon existing TLE-based maneuver detection algorithms. Three adaptive maneuver detection algorithms are developed and evaluated: The first is a fading-memory Kalman filter, which is equivalent to the sliding-window least-squares polynomial fit, but computationally more efficient and adaptive to the noise in the TLE data. The second algorithm is based on a sample cumulative distribution function (CDF) computed from a histogram of the magnitude-squared |ΔV|² of change-in-velocity vectors (ΔV), which are computed from the TLE data. A maneuver detection threshold is computed from the median estimated from the CDF, or from the CDF and a specified probability of false alarm. The third algorithm is a median filter. The median filter is the simplest of a class of nonlinear filters called order-statistic filters, which fall within the theory of robust statistics. The output of the median filter is practically insensitive to outliers, or large maneuvers. The median of the |ΔV|² data is proportional to the variance of ΔV, so the variance is estimated from the output of the median filter. A maneuver is detected when the input data exceed a constant times the estimated variance.
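A rough Python sketch of the third (median filter) detector idea: a running median provides an outlier-insensitive estimate of the background level of the |ΔV|² series, and samples exceeding a constant multiple of that level are flagged; the series, kernel size, and threshold factor are assumptions.

    import numpy as np
    from scipy.signal import medfilt

    rng = np.random.default_rng(8)
    dv2 = rng.chisquare(3, size=400) * 1e-6        # synthetic |dV|^2 series from TLE differences
    dv2[250] += 5e-4                                # injected maneuver

    background = medfilt(dv2, kernel_size=21)       # robust running level, insensitive to outliers
    background[:10] = background[10]                # crude handling of the zero-padded edges
    background[-10:] = background[-11]
    threshold = 10.0 * background                   # assumed constant times the estimated level
    detections = np.where(dv2 > threshold)[0]
    print(detections)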
The Detection and Statistics of Giant Arcs behind CLASH Clusters
NASA Astrophysics Data System (ADS)
Xu, Bingxiao; Postman, Marc; Meneghetti, Massimo; Seitz, Stella; Zitrin, Adi; Merten, Julian; Maoz, Dani; Frye, Brenda; Umetsu, Keiichi; Zheng, Wei; Bradley, Larry; Vega, Jesus; Koekemoer, Anton
2016-02-01
We developed an algorithm to find and characterize gravitationally lensed galaxies (arcs) to perform a comparison of the observed and simulated arc abundance. Observations are from the Cluster Lensing And Supernova survey with Hubble (CLASH). Simulated CLASH images are created using the MOKA package and also clusters selected from the high-resolution, hydrodynamical simulations, MUSIC, over the same mass and redshift range as the CLASH sample. The algorithm's arc elongation accuracy, completeness, and false positive rate are determined and used to compute an estimate of the true arc abundance. We derive a lensing efficiency of 4 ± 1 arcs (with length ≥6″ and length-to-width ratio ≥7) per cluster for the X-ray-selected CLASH sample, 4 ± 1 arcs per cluster for the MOKA-simulated sample, and 3 ± 1 arcs per cluster for the MUSIC-simulated sample. The observed and simulated arc statistics are in full agreement. We measure the photometric redshifts of all detected arcs and find a median redshift zs = 1.9 with 33% of the detected arcs having zs > 3. We find that the arc abundance does not depend strongly on the source redshift distribution but is sensitive to the mass distribution of the dark matter halos (e.g., the c-M relation). Our results show that consistency between the observed and simulated distributions of lensed arc sizes and axial ratios can be achieved by using cluster-lensing simulations that are carefully matched to the selection criteria used in the observations.
Theory and generation of conditional, scalable sub-Gaussian random fields
NASA Astrophysics Data System (ADS)
Panzeri, M.; Riva, M.; Guadagnini, A.; Neuman, S. P.
2016-03-01
Many earth and environmental (as well as a host of other) variables, Y, and their spatial (or temporal) increments, ΔY, exhibit non-Gaussian statistical scaling. Previously we were able to capture key aspects of such non-Gaussian scaling by treating Y and/or ΔY as sub-Gaussian random fields (or processes). This however left unaddressed the empirical finding that whereas sample frequency distributions of Y tend to display relatively mild non-Gaussian peaks and tails, those of ΔY often reveal peaks that grow sharper and tails that become heavier with decreasing separation distance or lag. Recently we proposed a generalized sub-Gaussian model (GSG) which resolves this apparent inconsistency between the statistical scaling behaviors of observed variables and their increments. We presented an algorithm to generate unconditional random realizations of statistically isotropic or anisotropic GSG functions and illustrated it in two dimensions. Most importantly, we demonstrated the feasibility of estimating all parameters of a GSG model underlying a single realization of Y by analyzing jointly spatial moments of Y data and corresponding increments, ΔY. Here, we extend our GSG model to account for noisy measurements of Y at a discrete set of points in space (or time), present an algorithm to generate conditional realizations of corresponding isotropic or anisotropic random fields, introduce two approximate versions of this algorithm to reduce CPU time, and explore them on one and two-dimensional synthetic test cases.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bochicchio, Davide; Panizon, Emanuele; Ferrando, Riccardo
2015-10-14
We compare the performance of two well-established computational algorithms for the calculation of free-energy landscapes of biomolecular systems, umbrella sampling and metadynamics. We look at benchmark systems composed of polyethylene and polypropylene oligomers interacting with lipid (phosphatidylcholine) membranes, aiming at the calculation of the oligomer water-membrane free energy of transfer. We model our test systems at two different levels of description, united-atom and coarse-grained. We provide optimized parameters for the two methods at both resolutions. We devote special attention to the analysis of statistical errors in the two different methods and propose a general procedure for the error estimation in metadynamics simulations. Metadynamics and umbrella sampling yield the same estimates for the water-membrane free energy profile, but metadynamics can be more efficient, providing lower statistical uncertainties within the same simulation time.
Calculation of absolute protein-ligand binding free energy using distributed replica sampling.
Rodinger, Tomas; Howell, P Lynne; Pomès, Régis
2008-10-21
Distributed replica sampling [T. Rodinger et al., J. Chem. Theory Comput. 2, 725 (2006)] is a simple and general scheme for Boltzmann sampling of conformational space by computer simulation in which multiple replicas of the system undergo a random walk in reaction coordinate or temperature space. Individual replicas are linked through a generalized Hamiltonian containing an extra potential energy term or bias which depends on the distribution of all replicas, thus enforcing the desired sampling distribution along the coordinate or parameter of interest regardless of free energy barriers. In contrast to replica exchange methods, efficient implementation of the algorithm does not require synchronicity of the individual simulations. The algorithm is inherently suited for large-scale simulations using shared or heterogeneous computing platforms such as a distributed network. In this work, we build on our original algorithm by introducing Boltzmann-weighted jumping, which allows moves of a larger magnitude and thus enhances sampling efficiency along the reaction coordinate. The approach is demonstrated using a realistic and biologically relevant application; we calculate the standard binding free energy of benzene to the L99A mutant of T4 lysozyme. Distributed replica sampling is used in conjunction with thermodynamic integration to compute the potential of mean force for extracting the ligand from protein and solvent along a nonphysical spatial coordinate. Dynamic treatment of the reaction coordinate leads to faster statistical convergence of the potential of mean force than a conventional static coordinate, which suffers from slow transitions on a rugged potential energy surface.
Autoregressive statistical pattern recognition algorithms for damage detection in civil structures
NASA Astrophysics Data System (ADS)
Yao, Ruigen; Pakzad, Shamim N.
2012-08-01
Statistical pattern recognition has recently emerged as a promising set of complementary methods to system identification for automatic structural damage assessment. Its essence is to use well-known concepts in statistics for boundary definition of different pattern classes, such as those for damaged and undamaged structures. In this paper, several statistical pattern recognition algorithms using autoregressive models, including statistical control charts and hypothesis testing, are reviewed as potentially competitive damage detection techniques. To enhance the performance of statistical methods, new feature extraction techniques using model spectra and residual autocorrelation, together with resampling-based threshold construction methods, are proposed. Subsequently, simulated acceleration data from a multi-degree-of-freedom system is generated to test and compare the efficiency of the existing and proposed algorithms. Data from laboratory experiments conducted on a truss and a large-scale bridge slab model are then used to further validate the damage detection methods and demonstrate the superior performance of proposed algorithms.
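A minimal Python sketch of the AR-residual idea underlying such methods: fit an autoregressive model to a baseline acceleration record, apply it to a test record, and flag the record if the residual standard deviation exceeds a simple control limit. The signals, model order, and limit are illustrative assumptions, not the paper's datasets or thresholds.

    import numpy as np

    def ar_design(signal, order):
        """Lagged regressor matrix and targets for an AR(order) model."""
        X = np.column_stack([signal[order - k - 1:len(signal) - k - 1] for k in range(order)])
        return X, signal[order:]

    rng = np.random.default_rng(9)
    t = np.arange(2000)
    baseline = np.sin(0.2 * t) + 0.1 * rng.normal(size=t.size)
    test = np.sin(0.2 * t) + 0.5 * rng.normal(size=t.size)   # altered response, a stand-in for damage

    X, y = ar_design(baseline, order=4)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)           # fit AR model on the baseline record
    res_base = y - X @ coeffs
    Xt, yt = ar_design(test, order=4)
    res_test = yt - Xt @ coeffs                              # apply the baseline model to the test record

    limit = 3.0 * res_base.std()                             # simple control limit
    print(round(res_test.std(), 3), round(limit, 3), bool(res_test.std() > limit))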
Algorithm for computing descriptive statistics for very large data sets and the exa-scale era
NASA Astrophysics Data System (ADS)
Beekman, Izaak
2017-11-01
An algorithm for Single-point, Parallel, Online, Converging Statistics (SPOCS) is presented. It is suited for in situ analysis that traditionally would be relegated to post-processing, and can be used to monitor the statistical convergence and estimate the error/residual in the computed quantity, which is useful for uncertainty quantification as well. Today, data may be generated at an overwhelming rate by numerical simulations and proliferating sensing apparatuses in experiments and engineering applications. Monitoring descriptive statistics in real time lets costly computations and experiments be gracefully aborted if an error has occurred, and monitoring the level of statistical convergence allows them to be run for the shortest amount of time required to obtain good results. This algorithm extends work by Pébay (Sandia Report SAND2008-6212). Pébay's algorithms are recast into a converging delta formulation, with provably favorable properties. The mean, variance, covariances and arbitrary higher order statistical moments are computed in one pass. The algorithm is tested using Sillero, Jiménez, & Moser's (2013, 2014) publicly available UPM high Reynolds number turbulent boundary layer data set, demonstrating numerical robustness, efficiency and other favorable properties.
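A single-pass, converging-delta update of the mean and variance (a Welford-style scheme, which is a special case of the moment formulas in Pébay's report) can be sketched in a few lines of Python; using the size of the last correction as a convergence monitor is an illustrative simplification.

    import random

    class OnlineStats:
        """Single-pass running mean and variance with an update 'delta' per sample."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n          # converging delta update of the mean
            self.m2 += delta * (x - self.mean)   # accumulates the sum of squared deviations
            return abs(delta) / self.n           # size of the last correction (convergence monitor)

        @property
        def variance(self):
            return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    random.seed(0)
    stats = OnlineStats()
    for _ in range(100000):
        last_correction = stats.update(random.gauss(0.0, 1.0))
    print(stats.mean, stats.variance, last_correction)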
Statistical detection of patterns in unidimensional distributions by continuous wavelet transforms
NASA Astrophysics Data System (ADS)
Baluev, R. V.
2018-04-01
Objective detection of specific patterns in statistical distributions, like groupings or gaps or abrupt transitions between different subsets, is a task with a rich range of applications in astronomy: Milky Way stellar population analysis, investigations of exoplanet diversity, Solar System minor bodies statistics, extragalactic studies, etc. We adapt the powerful technique of wavelet transforms to this generalized task, placing a strong emphasis on the assessment of the significance of pattern detections. Among other things, our method also involves optimal minimum-noise wavelets and minimum-noise reconstruction of the distribution density function. Based on this development, we construct a self-closed algorithmic pipeline aimed at processing statistical samples. It is currently applicable to single-dimensional distributions only, but it is flexible enough to undergo further generalizations and development.
NASA Astrophysics Data System (ADS)
Zhu, Hao
Sparsity plays an instrumental role in a plethora of scientific fields, including statistical inference for variable selection, parsimonious signal representations, and solving under-determined systems of linear equations - which has led to the ground-breaking result of compressive sampling (CS). This Thesis leverages exciting ideas of sparse signal reconstruction to develop sparsity-cognizant algorithms and to analyze their performance. The vision is to devise tools exploiting the 'right' form of sparsity for the 'right' application domain of multiuser communication systems, array signal processing systems, and the emerging challenges in the smart power grid. Two important power system monitoring tasks are addressed first by capitalizing on the hidden sparsity. To robustify power system state estimation, a sparse outlier model is leveraged to capture the possible corruption in every datum, while the problem nonconvexity due to nonlinear measurements is handled using the semidefinite relaxation technique. Different from existing iterative methods, the proposed algorithm approximates well the global optimum regardless of the initialization. In addition, for enhanced situational awareness, a novel sparse overcomplete representation is introduced to capture (possibly multiple) line outages, and real-time algorithms are developed for solving the combinatorially complex identification problem. The proposed algorithms exhibit near-optimal performance while incurring only linear complexity in the number of lines, which makes it possible to quickly bring contingencies to attention. This Thesis also accounts for two basic issues in CS, namely fully-perturbed models and the finite alphabet property. The sparse total least-squares (S-TLS) approach is proposed to furnish CS algorithms for fully-perturbed linear models, leading to statistically optimal and computationally efficient solvers. The S-TLS framework is well motivated for grid-based sensing applications and exhibits higher accuracy than existing sparse algorithms. On the other hand, exploiting the finite alphabet of unknown signals emerges naturally in communication systems, along with sparsity coming from the low activity of each user. Compared to approaches only accounting for either one of the two, joint exploitation of both leads to statistically optimal detectors with improved error performance.
NASA Astrophysics Data System (ADS)
Kim, D.; Youn, J.; Kim, C.
2017-08-01
As a malfunctioning PV (photovoltaic) cell has a higher temperature than adjacent normal cells, it can be detected easily with a thermal infrared sensor. However, inspecting large-scale PV power plants with a hand-held thermal infrared sensor is time-consuming. This paper presents an algorithm for automatically detecting defective PV panels using images captured with a thermal imaging camera from a UAV (unmanned aerial vehicle). The proposed algorithm uses statistical analysis of the thermal intensity (surface temperature) characteristics of each PV module, with the mean intensity and standard deviation of each panel serving as parameters for fault diagnosis. One characteristic of thermal infrared imaging is that the larger the distance between sensor and target, the lower the measured temperature of the object. Consequently, a global detection rule using the mean intensity of all panels is not applicable in the fault detection algorithm. Therefore, a local detection rule based on the mean intensity and standard deviation range was developed to detect defective PV modules within individual arrays automatically. The performance of the proposed algorithm was tested on three sample images and verified a detection accuracy for defective panels of 97% or higher. In addition, as the proposed algorithm can adjust the range of threshold values for judging malfunction at the array level, the local detection rule is considered better suited for highly sensitive fault detection than a global detection rule.
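A local (per-array) detection rule of the kind described can be sketched as follows; the robust median/MAD form and the threshold factor k are assumptions for illustration, not values taken from the paper:

```python
import numpy as np

def flag_defective_panels(panel_mean_temps, k=2.0):
    """Local (per-array) rule: flag panels whose mean thermal intensity is
    unusually high relative to the other panels in the same array.

    `panel_mean_temps` is a dict {array_id: sequence of per-panel mean temps}.
    The threshold factor `k` is a hypothetical tuning parameter.
    """
    flagged = {}
    for array_id, temps in panel_mean_temps.items():
        temps = np.asarray(temps, dtype=float)
        center = np.median(temps)                      # robust to a few hot panels
        spread = np.median(np.abs(temps - center)) + 1e-9
        flagged[array_id] = np.where(temps > center + k * spread)[0]
    return flagged

# Toy example: in array "A" the panel at index 3 is hot (defective).
panels = {"A": [31.2, 30.8, 31.0, 38.5, 31.1], "B": [29.9, 30.1, 30.0, 30.2]}
print(flag_defective_panels(panels))   # array "A": panel 3 flagged; array "B": none
```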
Anomaly-specified virtual dimensionality
NASA Astrophysics Data System (ADS)
Chen, Shih-Yu; Paylor, Drew; Chang, Chein-I.
2013-09-01
Virtual dimensionality (VD) has received considerable interest as a means of estimating the number of spectrally distinct signatures, denoted by p. Unfortunately, VD provides no specific definition of what a spectrally distinct signature is. As a result, different types of spectrally distinct signatures determine different values of VD; there is no one-value-fits-all for VD. In order to address this issue, this paper presents a new concept, referred to as anomaly-specified VD (AS-VD), which determines the number of anomalies of interest present in the data. Specifically, two types of anomaly detection algorithms are of particular interest: the sample covariance matrix K-based anomaly detector developed by Reed and Yu, referred to as K-RXD, and the sample correlation matrix R-based RXD, referred to as R-RXD. Since K-RXD is determined only by second-order statistics, whereas R-RXD is specified by statistics of the first two orders, including the sample mean as the first-order statistic, the values determined by K-RXD and R-RXD will be different. Experiments are conducted in comparison with widely used eigen-based approaches.
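The distinction between the two RX variants can be sketched directly: K-RXD whitens with the mean-removed covariance matrix K, while R-RXD uses the raw second-moment (correlation) matrix R and keeps the mean. This is an illustrative sketch of the detectors only, not the AS-VD procedure built on top of them:

```python
import numpy as np

def rxd_scores(X, use_correlation=False):
    """RX-style anomaly scores for pixels X of shape (n_pixels, n_bands).

    use_correlation=False -> K-RXD (covariance, second-order central stats);
    use_correlation=True  -> R-RXD (correlation/second-moment matrix, which
    retains first-order information through the un-removed mean).
    """
    X = np.asarray(X, dtype=float)
    if use_correlation:
        M = X.T @ X / X.shape[0]          # R: includes first-order statistics
        centered = X
    else:
        centered = X - X.mean(axis=0)
        M = np.cov(X, rowvar=False)       # K: second-order central statistics
    Minv = np.linalg.pinv(M)
    return np.einsum("ij,jk,ik->i", centered, Minv, centered)

rng = np.random.default_rng(0)
background = rng.normal(size=(1000, 20))
anomalies = rng.normal(loc=4.0, size=(5, 20))
scores = rxd_scores(np.vstack([background, anomalies]))
print(scores[-5:].mean() > scores[:1000].mean())   # anomalies score higher: True
```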
Validation tools for image segmentation
NASA Astrophysics Data System (ADS)
Padfield, Dirk; Ross, James
2009-02-01
A large variety of image analysis tasks require the segmentation of various regions in an image. For example, segmentation is required to generate accurate models of brain pathology that are important components of modern diagnosis and therapy. While the manual delineation of such structures gives accurate information, the automatic segmentation of regions such as the brain and tumors from such images greatly enhances the speed and repeatability of quantifying such structures. The ubiquitous need for such algorithms has led to a wide range of image segmentation algorithms with various assumptions, parameters, and robustness. The evaluation of such algorithms is an important step in determining their effectiveness. Therefore, rather than developing new segmentation algorithms, we here describe validation methods for segmentation algorithms. Using similarity metrics comparing the automatic to manual segmentations, we demonstrate methods for optimizing the parameter settings for individual cases and across a collection of datasets using the Design of Experiments framework. We then employ statistical analysis methods to compare the effectiveness of various algorithms. We investigate several region-growing algorithms from the Insight Toolkit and compare their accuracy to that of a separate statistical segmentation algorithm. The segmentation algorithms are used with their optimized parameters to automatically segment the brain and tumor regions in MRI images of 10 patients. The validation tools indicate that none of the ITK algorithms studied outperforms the statistical segmentation algorithm with statistical significance, although they perform reasonably well considering their simplicity.
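Similarity metrics of the kind used for such validation (the specific metrics and Design of Experiments optimization of the paper are not reproduced here) typically include the Dice and Jaccard overlap scores; a minimal sketch:

```python
import numpy as np

def dice(seg_a, seg_b):
    """Dice similarity coefficient between two binary segmentation masks."""
    a, b = np.asarray(seg_a, bool), np.asarray(seg_b, bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def jaccard(seg_a, seg_b):
    """Jaccard (intersection-over-union) index between two binary masks."""
    a, b = np.asarray(seg_a, bool), np.asarray(seg_b, bool)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

# Toy masks: an automatic result compared against a manual delineation.
auto = np.zeros((64, 64), bool);   auto[10:40, 10:40] = True
manual = np.zeros((64, 64), bool); manual[12:42, 12:42] = True
print(dice(auto, manual), jaccard(auto, manual))
```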
Template protection and its implementation in 3D face recognition systems
NASA Astrophysics Data System (ADS)
Zhou, Xuebing
2007-04-01
As biometric recognition systems are widely applied in various application areas, security and privacy risks have recently attracted the attention of the biometric community. Template protection techniques prevent stored reference data from revealing private biometric information and enhance the security of biometric systems against attacks such as identity theft and cross matching. This paper concentrates on a template protection algorithm that merges methods from cryptography, error correction coding, and biometrics. The key component of the algorithm is to convert biometric templates into binary vectors. It is shown that the binary vectors should be robust, uniformly distributed, statistically independent, and collision-free so that authentication performance can be optimized and information leakage can be avoided. Depending on the statistical character of the biometric template, different approaches for transforming biometric templates into compact binary vectors are presented. The proposed methods are integrated into a 3D face recognition system and tested on the 3D facial images of the FRGC database. It is shown that the resulting binary vectors provide an authentication performance similar to that of the original 3D face templates. A high security level is achieved with reasonable false acceptance and false rejection rates, based on an efficient statistical analysis. The algorithm estimates the statistical character of biometric templates from a number of biometric samples in the enrollment database. For the FRGC 3D face database, our tests show only a small difference in robustness and discriminative power between the classification results obtained under the assumption of uniformly distributed templates and those obtained under the assumption of Gaussian distributed templates.
Suner, Aslı; Karakülah, Gökhan; Dicle, Oğuz
2014-01-01
Statistical hypothesis testing is an essential component of biological and medical studies for making inferences and estimates from the data collected in a study; however, misuse of statistical tests is widespread. In order to prevent possible errors in selecting an appropriate statistical test, it is currently possible to consult available test selection algorithms developed for various purposes. However, the lack of an algorithm presenting the most common statistical tests used in biomedical research in a single flowchart causes several problems, such as shifting users among algorithms, poor decision support in test selection, and lack of satisfaction among potential users. Herein, we present a unified flowchart that covers the statistical tests most commonly used in the biomedical domain, to provide decision support to non-statistician users in choosing the appropriate statistical test for their hypothesis. We also discuss some of our findings from integrating the flowcharts into a single, more comprehensive decision algorithm.
Anomaly detection in hyperspectral imagery: statistics vs. graph-based algorithms
NASA Astrophysics Data System (ADS)
Berkson, Emily E.; Messinger, David W.
2016-05-01
Anomaly detection (AD) algorithms are frequently applied to hyperspectral imagery, but different algorithms produce different outlier results depending on the image scene content and the assumed background model. This work provides the first comparison of anomaly score distributions between common statistics-based anomaly detection algorithms (RX and subspace-RX) and the graph-based Topological Anomaly Detector (TAD). Anomaly scores in statistical AD algorithms should theoretically approximate a chi-squared distribution; however, this is rarely the case with real hyperspectral imagery. The expected distribution of scores found with graph-based methods remains unclear. We also look for general trends in algorithm performance with varied scene content. Three separate scenes were extracted from the hyperspectral MegaScene image taken over downtown Rochester, NY with the VIS-NIR-SWIR ProSpecTIR instrument. In order of most to least cluttered, we study an urban, suburban, and rural scene. The three AD algorithms were applied to each scene, and the distributions of the most anomalous 5% of pixels were compared. We find that subspace-RX performs better than RX, because the data becomes more normal when the highest variance principal components are removed. We also see that compared to statistical detectors, anomalies detected by TAD are easier to separate from the background. Due to their different underlying assumptions, the statistical and graph-based algorithms highlighted different anomalies within the urban scene. These results will lead to a deeper understanding of these algorithms and their applicability across different types of imagery.
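The comparison of statistical anomaly scores against the theoretical chi-squared reference mentioned above can be checked with a simple tail-fraction test; the toy Gaussian background below is illustrative only:

```python
import numpy as np
from scipy import stats

def chi2_tail_fraction(scores, n_bands, alpha=0.05):
    """Fraction of RX-type anomaly scores above the chi-squared (1 - alpha)
    quantile with n_bands degrees of freedom.

    For a truly Gaussian background this fraction should be close to alpha;
    real hyperspectral scenes typically show a heavier tail.
    """
    threshold = stats.chi2.ppf(1.0 - alpha, df=n_bands)
    return float(np.mean(np.asarray(scores) > threshold))

# Toy Gaussian background: the empirical tail matches the nominal 5% level.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
centered = X - X.mean(axis=0)
scores = np.sum(centered @ np.linalg.inv(np.cov(X, rowvar=False)) * centered, axis=1)
print(chi2_tail_fraction(scores, n_bands=10))   # approximately 0.05
```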
Mutation Clusters from Cancer Exome.
Kakushadze, Zura; Yu, Willie
2017-08-15
We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit a mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally-costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics, such as novel blood-test methods currently in development.
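The clustering step can be illustrated with ordinary k-means from scikit-learn applied to a hypothetical mutation count matrix (rows = exome samples, columns = the 96 trinucleotide mutation channels commonly used for cancer signatures); note this is standard k-means, not the authors' deterministic *K-means variant:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical mutation counts: 200 samples x 96 mutation channels.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=5.0, size=(200, 96)).astype(float)

# Normalize each sample to mutation fractions, then cluster.
fractions = counts / counts.sum(axis=1, keepdims=True)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(fractions)
print(np.bincount(labels))   # cluster sizes
```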
TEAM: efficient two-locus epistasis tests in human genome-wide association study.
Zhang, Xiang; Huang, Shunping; Zou, Fei; Wang, Wei
2010-06-15
As a promising tool for identifying genetic markers underlying phenotypic differences, the genome-wide association study (GWAS) has been extensively investigated in recent years. In GWAS, detecting epistasis (or gene-gene interaction) is preferable to single-locus study, since many diseases are known to be complex traits. A brute-force search is infeasible for epistasis detection at the genome-wide scale because of the intensive computational burden. Existing epistasis detection algorithms are designed for datasets consisting of homozygous markers and small sample sizes. In human studies, however, the genotype may be heterozygous, and the number of individuals can be in the thousands. Thus, existing methods are not readily applicable to human datasets. In this article, we propose an efficient algorithm, TEAM, which significantly speeds up epistasis detection for human GWAS. Our algorithm is exhaustive, i.e. it does not ignore any epistatic interaction. Utilizing a minimum spanning tree structure, the algorithm incrementally updates the contingency tables for epistatic tests without scanning all individuals. Our algorithm has broader applicability and is more efficient than existing methods for large-sample studies. It supports any statistical test that is based on contingency tables, and enables control of both the family-wise error rate and the false discovery rate. Extensive experiments show that our algorithm only needs to examine a small portion of the individuals to update the contingency tables, and it achieves at least an order of magnitude speed-up over the brute-force approach.
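A contingency-table test of the kind TEAM supports (though not TEAM's incremental-update machinery) can be sketched for a pair of loci and a binary phenotype:

```python
import numpy as np
from scipy.stats import chi2_contingency

def two_locus_test(geno_a, geno_b, phenotype):
    """Chi-squared test on the contingency table of joint genotypes at two
    loci versus a binary phenotype; a generic stand-in for contingency-table
    based epistasis tests, not the TEAM algorithm itself.

    geno_a, geno_b take values 0/1/2 (heterozygous markers allowed);
    phenotype is 0/1 (control/case).
    """
    joint = 3 * np.asarray(geno_a) + np.asarray(geno_b)   # 9 joint genotypes
    table = np.zeros((9, 2))
    for g, p in zip(joint, np.asarray(phenotype)):
        table[g, p] += 1
    table = table[table.sum(axis=1) > 0]                  # drop empty rows
    chi2, pval, dof, _ = chi2_contingency(table)
    return chi2, pval

rng = np.random.default_rng(0)
a, b = rng.integers(0, 3, 2000), rng.integers(0, 3, 2000)
y = rng.integers(0, 2, 2000)
print(two_locus_test(a, b, y))   # no true interaction, so the p-value should be large
```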
Mapping hard magnetic recording disks by TOF-SIMS
NASA Astrophysics Data System (ADS)
Spool, A.; Forrest, J.
2008-12-01
Mapping of hard magnetic recording disks by TOF-SIMS was performed both to produce significant analytical results for understanding the disk surface and the head-disk interface in hard disk drives, and as an example of a macroscopic, non-rectangular mapping problem for the technique. In this study, maps were obtained by taking discrete samples of the disk surface at set intervals in R and Θ. Because the processes that may affect the disk surface, both in manufacturing and in the disk drive, are typically circumferential in nature, changes in the surface are likely to be blurred in the Θ direction. An algorithm was developed to determine the optimum relative sampling ratio in R and Θ. The results confirm what the analysts' experience suggested: changes occur more rapidly on disks in the radial direction, and more sampling in the radial direction is desired. The subsequent use of the statistical methods principal component analysis (PCA) and maximum autocorrelation factors (MAF), and of the inverse distance weighting (IDW) algorithm, is explored.
MAFsnp: A Multi-Sample Accurate and Flexible SNP Caller Using Next-Generation Sequencing Data
Hu, Jiyuan; Li, Tengfei; Xiu, Zidi; Zhang, Hong
2015-01-01
Most existing statistical methods developed for calling single nucleotide polymorphisms (SNPs) using next-generation sequencing (NGS) data are based on Bayesian frameworks, and no SNP caller currently produces p-values for calling SNPs in a frequentist framework. To fill this gap, we develop a new method, MAFsnp, a Multiple-sample based Accurate and Flexible algorithm for calling SNPs with NGS data. MAFsnp is based on an estimated likelihood ratio test (eLRT) statistic. In practical situations, the involved parameter is very close to the boundary of the parameter space, so standard large-sample theory is not suitable for evaluating the finite-sample distribution of the eLRT statistic. Observing that the distribution of the test statistic is a mixture of zero and a continuous part, we propose to model the test statistic with a novel two-parameter mixture distribution. Once the parameters of the mixture distribution are estimated, p-values can be easily calculated for detecting SNPs, and the multiple-testing corrected p-values can be used to control the false discovery rate (FDR) at any pre-specified level. With simulated data, MAFsnp is shown to have much better control of FDR than existing SNP callers. Through application to two real datasets, MAFsnp is also shown to outperform existing SNP callers in terms of calling accuracy. An R package “MAFsnp” implementing the new SNP caller is freely available at http://homepage.fudan.edu.cn/zhangh/softwares/. PMID:26309201
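The "point mass at zero plus a continuous part" idea can be sketched generically; here the continuous component is assumed to be a scaled chi-squared purely for illustration, whereas MAFsnp estimates its own two-parameter form from the data:

```python
from scipy import stats

def mixture_pvalue(t, pi0, df=1.0, scale=1.0):
    """P-value of an observed test statistic t under a mixture of a point
    mass at zero (weight pi0) and a continuous part, assumed here to be a
    scaled chi-squared with `df` degrees of freedom.

    A generic sketch of the mixture idea, not the MAFsnp parametric form.
    """
    if t <= 0:
        return 1.0
    return (1.0 - pi0) * stats.chi2.sf(t / scale, df)

print(mixture_pvalue(8.5, pi0=0.6, df=1.0))   # small p-value -> candidate SNP
```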
NASA Astrophysics Data System (ADS)
Levy, R. C.; Munchak, L. A.; Mattoo, S.; Patadia, F.; Remer, L. A.; Holz, R. E.
2015-10-01
To answer fundamental questions about aerosols in our changing climate, we must quantify both the current state of aerosols and how they are changing. Although NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) sensors have provided quantitative information about global aerosol optical depth (AOD) for more than a decade, this period is still too short to create an aerosol climate data record (CDR). The Visible Infrared Imaging Radiometer Suite (VIIRS) was launched on the Suomi-NPP satellite in late 2011, with additional copies planned for future satellites. Can the MODIS aerosol data record be continued with VIIRS to create a consistent CDR? When compared to ground-based AERONET data, the VIIRS Environmental Data Record (V_EDR) has similar validation statistics as the MODIS Collection 6 (M_C6) product. However, the V_EDR and M_C6 are offset in regards to global AOD magnitudes, and tend to provide different maps of 0.55 μm AOD and 0.55/0.86 μm-based Ångström Exponent (AE). One reason is that the retrieval algorithms are different. Using the Intermediate File Format (IFF) for both MODIS and VIIRS data, we have tested whether we can apply a single MODIS-like (ML) dark-target algorithm on both sensors that leads to product convergence. Except for catering the radiative transfer and aerosol lookup tables to each sensor's specific wavelength bands, the ML algorithm is the same for both. We run the ML algorithm on both sensors between March 2012 and May 2014, and compare monthly mean AOD time series with each other and with M_C6 and V_EDR products. Focusing on the March-April-May (MAM) 2013 period, we compared additional statistics that include global and gridded 1° × 1° AOD and AE, histograms, sampling frequencies, and collocations with ground-based AERONET. Over land, use of the ML algorithm clearly reduces the differences between the MODIS and VIIRS-based AOD. However, although global offsets are near zero, some regional biases remain, especially in cloud fields and over brighter surface targets. Over ocean, use of the ML algorithm actually increases the offset between VIIRS and MODIS-based AOD (to ~ 0.025), while reducing the differences between AE. We characterize algorithm retrievability through statistics of retrieval fraction. In spite of differences between retrieved AOD magnitudes, the ML algorithm will lead to similar decisions about "whether to retrieve" on each sensor. Finally, we discuss how issues of calibration, as well as instrument spatial resolution may be contributing to the statistics and the ability to create a consistent MODIS → VIIRS aerosol CDR.
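The 0.55/0.86 μm Ångström exponent compared in these statistics follows directly from the AOD at the two wavelengths; a small helper with illustrative values (not data from the study):

```python
import numpy as np

def angstrom_exponent(aod_1, aod_2, lam1=0.55, lam2=0.86):
    """Angstrom exponent from AOD at two wavelengths (in micrometres):
    AE = -ln(AOD1 / AOD2) / ln(lam1 / lam2)."""
    return -np.log(aod_1 / aod_2) / np.log(lam1 / lam2)

print(angstrom_exponent(0.20, 0.12))   # fine-mode-dominated aerosol: AE > 1
```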
NASA Astrophysics Data System (ADS)
Levy, R. C.; Munchak, L. A.; Mattoo, S.; Patadia, F.; Remer, L. A.; Holz, R. E.
2015-07-01
To answer fundamental questions about aerosols in our changing climate, we must quantify both the current state of aerosols and how they are changing. Although NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) sensors have provided quantitative information about global aerosol optical depth (AOD) for more than a decade, this period is still too short to create an aerosol climate data record (CDR). The Visible Infrared Imaging Radiometer Suite (VIIRS) was launched on the Suomi-NPP satellite in late 2011, with additional copies planned for future satellites. Can the MODIS aerosol data record be continued with VIIRS to create a consistent CDR? When compared to ground-based AERONET data, the VIIRS Environmental Data Record (V_EDR) has similar validation statistics as the MODIS Collection 6 (M_C6) product. However, the V_EDR and M_C6 are offset in regards to global AOD magnitudes, and tend to provide different maps of 0.55 μm AOD and 0.55/0.86 μm-based Ångström Exponent (AE). One reason is that the retrieval algorithms are different. Using the Intermediate File Format (IFF) for both MODIS and VIIRS data, we have tested whether we can apply a single MODIS-like (ML) dark-target algorithm on both sensors that leads to product convergence. Except for catering the radiative transfer and aerosol lookup tables to each sensor's specific wavelength bands, the ML algorithm is the same for both. We run the ML algorithm on both sensors between March 2012 and May 2014, and compare monthly mean AOD time series with each other and with M_C6 and V_EDR products. Focusing on the March-April-May (MAM) 2013 period, we compared additional statistics that include global and gridded 1° × 1° AOD and AE, histograms, sampling frequencies, and collocations with ground-based AERONET. Over land, use of the ML algorithm clearly reduces the differences between the MODIS and VIIRS-based AOD. However, although global offsets are near zero, some regional biases remain, especially in cloud fields and over brighter surface targets. Over ocean, use of the ML algorithm actually increases the offset between VIIRS and MODIS-based AOD (to ∼ 0.025), while reducing the differences between AE. We characterize algorithm retrievability through statistics of retrieval fraction. In spite of differences between retrieved AOD magnitudes, the ML algorithm will lead to similar decisions about "whether to retrieve" on each sensor. Finally, we discuss how issues of calibration, as well as instrument spatial resolution may be contributing to the statistics and the ability to create a consistent MODIS → VIIRS aerosol CDR.
Design of Neural Networks for Fast Convergence and Accuracy: Dynamics and Control
NASA Technical Reports Server (NTRS)
Maghami, Peiman G.; Sparks, Dean W., Jr.
1997-01-01
A procedure for the design and training of artificial neural networks, used for rapid and efficient controls and dynamics design and analysis for flexible space systems, has been developed. Artificial neural networks are employed, such that once properly trained, they provide a means of evaluating the impact of design changes rapidly. Specifically, two-layer feedforward neural networks are designed to approximate the functional relationship between the component/spacecraft design changes and measures of its performance or nonlinear dynamics of the system/components. A training algorithm, based on statistical sampling theory, is presented, which guarantees that the trained networks provide a designer-specified degree of accuracy in mapping the functional relationship. Within each iteration of this statistical-based algorithm, a sequential design algorithm is used for the design and training of the feedforward network to provide rapid convergence to the network goals. Here, at each step in the sequence, a new network is trained to minimize the error of the previous network. The proposed method should work for applications wherein an arbitrarily large source of training data can be generated. Two numerical examples are performed on a spacecraft application in order to demonstrate the feasibility of the proposed approach.
Design of neural networks for fast convergence and accuracy: dynamics and control.
Maghami, P G; Sparks, D R
2000-01-01
A procedure for the design and training of artificial neural networks, used for rapid and efficient controls and dynamics design and analysis for flexible space systems, has been developed. Artificial neural networks are employed, such that once properly trained, they provide a means of evaluating the impact of design changes rapidly. Specifically, two-layer feedforward neural networks are designed to approximate the functional relationship between the component/spacecraft design changes and measures of its performance or nonlinear dynamics of the system/components. A training algorithm, based on statistical sampling theory, is presented, which guarantees that the trained networks provide a designer-specified degree of accuracy in mapping the functional relationship. Within each iteration of this statistical-based algorithm, a sequential design algorithm is used for the design and training of the feedforward network to provide rapid convergence to the network goals. Here, at each step in the sequence, a new network is trained to minimize the error of the previous network. The proposed method should work for applications wherein an arbitrarily large source of training data can be generated. Two numerical examples are performed on a spacecraft application in order to demonstrate the feasibility of the proposed approach.
Cairns, Andrew W; Bond, Raymond R; Finlay, Dewar D; Guldenring, Daniel; Badilini, Fabio; Libretti, Guido; Peace, Aaron J; Leslie, Stephen J
The 12-lead electrocardiogram (ECG) has been used to detect cardiac abnormalities in the same format for more than 70 years. However, due to the complex nature of 12-lead ECG interpretation, a significant cognitive workload is required from the interpreter. This complexity in ECG interpretation often leads to errors in diagnosis and subsequent treatment. We have previously reported on the development of an ECG interpretation support system designed to augment the human interpretation process. This computerised decision support system has been named 'Interactive Progressive based Interpretation' (IPI). In this study, a decision support algorithm was built into the IPI system to suggest potential diagnoses based on the interpreter's annotations of the 12-lead ECG. We hypothesise that semi-automatic interpretation using a digital assistant can be an optimal man-machine model for ECG interpretation, improving interpretation accuracy and reducing missed co-abnormalities. The Differential Diagnoses Algorithm (DDA) was developed using web technologies: diagnostic ECG criteria are defined in an open storage format, JavaScript Object Notation (JSON), which is queried using a rule-based reasoning algorithm to suggest diagnoses. To test our hypothesis, a counterbalanced trial was designed in which subjects interpreted ECGs using the conventional approach and using the IPI+DDA approach. A total of 375 interpretations were collected. The IPI+DDA approach was shown to improve diagnostic accuracy by 8.7% (although not statistically significant, p-value=0.1852), and the IPI+DDA suggested the correct interpretation more often than the human interpreter in 7/10 cases (with varying statistical significance). Human interpretation accuracy increased to 70% when seven suggestions were generated. Although the results were not found to be statistically significant, we found that: (1) our decision support tool increased the number of correct interpretations; (2) the DDA algorithm suggested the correct interpretation more often than humans; and (3) as many as seven computerised diagnostic suggestions augmented human decision making in ECG interpretation. Statistical significance may be achieved by expanding the sample size. Copyright © 2017 Elsevier Inc. All rights reserved.
Pei, Yanbo; Tian, Guo-Liang; Tang, Man-Lai
2014-11-10
Stratified data analysis is an important research topic in many biomedical studies and clinical trials. In this article, we develop five test statistics for testing the homogeneity of proportion ratios for stratified correlated bilateral binary data based on an equal correlation model assumption. Bootstrap procedures based on these test statistics are also considered. To evaluate the performance of these statistics and procedures, we conduct Monte Carlo simulations to study their empirical sizes and powers under various scenarios. Our results suggest that the procedure based on the score statistic performs well in general and is highly recommended. When the sample size is large, procedures based on the commonly used weighted least-squares estimate and on the logarithmic transformation with the Mantel-Haenszel estimate are recommended, as they do not involve any computation of maximum likelihood estimates requiring iterative algorithms. We also derive approximate sample size formulas based on the recommended test procedures. Finally, we apply the proposed methods to analyze a multi-center randomized clinical trial for scleroderma patients. Copyright © 2014 John Wiley & Sons, Ltd.
Algorithm for Identifying Erroneous Rain-Gauge Readings
NASA Technical Reports Server (NTRS)
Rickman, Doug
2005-01-01
An algorithm analyzes rain-gauge data to identify statistical outliers that could be deemed to be erroneous readings. Heretofore, analyses of this type have been performed in burdensome manual procedures that have involved subjective judgements. Sometimes, the analyses have included computational assistance for detecting values falling outside of arbitrary limits. The analyses have been performed without statistically valid knowledge of the spatial and temporal variations of precipitation within rain events. In contrast, the present algorithm makes it possible to automate such an analysis, makes the analysis objective, takes account of the spatial distribution of rain gauges in conjunction with the statistical nature of spatial variations in rainfall readings, and minimizes the use of arbitrary criteria. The algorithm implements an iterative process that involves nonparametric statistics.
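A simplified, illustrative stand-in for such a check is a robust comparison of each gauge against its nearest neighbors; the NASA algorithm itself is iterative and nonparametric and is not reproduced here, and the parameters below are assumptions:

```python
import numpy as np

def flag_gauge_outliers(coords, readings, k=4, z_max=3.5):
    """Flag rain-gauge readings that disagree strongly with their k nearest
    neighbors, using a robust (median / MAD) z-score.

    A simplified illustration in the spirit of the abstract, not the actual
    NASA algorithm.
    """
    coords = np.asarray(coords, dtype=float)
    readings = np.asarray(readings, dtype=float)
    flags = np.zeros(len(readings), dtype=bool)
    for i in range(len(readings)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        neighbors = readings[np.argsort(d)[1:k + 1]]      # skip the gauge itself
        med = np.median(neighbors)
        mad = np.median(np.abs(neighbors - med)) + 1e-6
        flags[i] = abs(readings[i] - med) / (1.4826 * mad) > z_max
    return flags

coords = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
readings = [10.0, 11.0, 9.5, 10.5, 55.0]                  # last gauge looks wrong
print(flag_gauge_outliers(coords, readings, k=3))         # only the last gauge is flagged
```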
Multi-Parent Clustering Algorithms from Stochastic Grammar Data Models
NASA Technical Reports Server (NTRS)
Mjolsness, Eric; Castano, Rebecca; Gray, Alexander
1999-01-01
We introduce a statistical data model and an associated optimization-based clustering algorithm which allows data vectors to belong to zero, one or several "parent" clusters. For each data vector the algorithm makes a discrete decision among these alternatives. Thus, a recursive version of this algorithm would place data clusters in a Directed Acyclic Graph rather than a tree. We test the algorithm with synthetic data generated according to the statistical data model. We also illustrate the algorithm using real data from large-scale gene expression assays.
Baumes, Laurent A
2006-01-01
One of the main problems in high-throughput research for materials is still the design of experiments. At early stages of discovery programs, purely exploratory methodologies coupled with fast screening tools should be employed. This should lead to opportunities to find unexpected catalytic results and to identify the "groups" of catalyst outputs, providing well-defined boundaries for future optimizations. However, very few recent papers deal with strategies that guide exploratory studies. Mostly, traditional designs, homogeneous coverings, or simple random samplings are exploited. Typical catalytic output distributions exhibit unbalanced datasets for which efficient learning is hard to carry out, and interesting but rare classes usually go unrecognized. Here we suggest a new iterative algorithm for characterizing the structure of the search space, working independently of the learning process. It enhances recognition rates by transferring catalysts to be screened from "performance-stable" zones of the space to "unsteady" ones, which require more experiments to be well modeled. Evaluating new algorithms through benchmarks is compulsory, given the lack of prior evidence about their efficiency. The method is detailed and thoroughly tested with mathematical functions exhibiting different levels of complexity. The strategy is not only evaluated empirically; the effect of sampling on subsequent machine learning performance is also quantified. The minimum sample size required for the algorithm to be statistically discriminated from simple random sampling is investigated.
High performance transcription factor-DNA docking with GPU computing
2012-01-01
Background Protein-DNA docking is a very challenging problem in structural bioinformatics and has important implications in a number of applications, such as structure-based prediction of transcription factor binding sites and rational drug design. Protein-DNA docking is very computationally demanding due to the high cost of energy calculation and the statistical nature of conformational sampling algorithms. More importantly, experiments show that the docking quality depends on the coverage of the conformational sampling space. It is therefore desirable to accelerate the computation of the docking algorithm, not only to reduce computing time, but also to improve docking quality. Methods In an attempt to accelerate the sampling process and to improve the docking performance, we developed a graphics processing unit (GPU)-based protein-DNA docking algorithm. The algorithm employs a potential-based energy function to describe the binding affinity of a protein-DNA pair, and integrates Monte Carlo simulation and a simulated annealing method to search through the conformational space. Algorithmic techniques were developed to improve the computation efficiency and scalability on GPU-based high performance computing systems. Results The effectiveness of our approach is tested on a non-redundant set of 75 TF-DNA complexes and a newly developed TF-DNA docking benchmark. We demonstrated that the GPU-based docking algorithm can significantly accelerate the simulation process and thereby improve the chance of finding near-native TF-DNA complex structures. This study also suggests that further improvement in protein-DNA docking research would require efforts from two integral aspects: improvement in computation efficiency and energy function design. Conclusions We present a high performance computing approach for improving the prediction accuracy of protein-DNA docking. The GPU-based docking algorithm accelerates the search of the conformational space and thus increases the chance of finding more near-native structures. To the best of our knowledge, this is the first ad hoc effort of applying GPUs or GPU clusters to the protein-DNA docking problem. PMID:22759575
Multivariate statistical model for 3D image segmentation with application to medical images.
John, Nigel M; Kabuka, Mansur R; Ibrahim, Mohamed O
2003-12-01
In this article we describe a statistical model that was developed to segment brain magnetic resonance images. The statistical segmentation algorithm was applied after a pre-processing stage involving the use of a 3D anisotropic filter along with histogram equalization techniques. The segmentation algorithm makes use of prior knowledge and a probability-based multivariate model designed to semi-automate the process of segmentation. The algorithm was applied to images obtained from the Center for Morphometric Analysis at Massachusetts General Hospital as part of the Internet Brain Segmentation Repository (IBSR). The developed algorithm showed improved accuracy over the k-means, adaptive Maximum Apriori Probability (MAP), biased MAP, and other algorithms. Experimental results showing the segmentation and the results of comparisons with other algorithms are provided. Results are based on an overlap criterion against expertly segmented images from the IBSR. The algorithm produced average results of approximately 80% overlap with the expertly segmented images (compared with 85% for manual segmentation and 55% for other algorithms).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hastings, Matthew B
We show how to combine the light-cone and matrix product algorithms to simulate quantum systems far from equilibrium for long times. For the case of the XXZ spin chain at Δ = 0.5, we simulate to a time of ≈ 22.5. While part of the long simulation time is due to the use of the light-cone method, we also describe a modification of the infinite time-evolving bond decimation algorithm with improved numerical stability, and we describe how to incorporate symmetry into this algorithm. While statistical sampling error means that we are not yet able to make a definite statement, the behavior of the simulation at long times indicates the appearance of either 'revivals' in the order parameter as predicted by Hastings and Levitov (e-print arXiv:0806.4283) or of a distinct shoulder in the decay of the order parameter.
NASA Technical Reports Server (NTRS)
Strahler, A. H.; Woodcock, C. E.; Logan, T. L.
1983-01-01
A timber inventory of the Eldorado National Forest, located in east-central California, provides an example of the use of a Geographic Information System (GIS) to stratify large areas of land for sampling and the collection of statistical data. The raster-based GIS format of the VICAR/IBIS software system allows simple and rapid tabulation of areas, and facilitates the selection of random locations for ground sampling. Algorithms that simplify the complex spatial pattern of raster-based information, and convert raster format data to strings of coordinate vectors, provide a link to conventional vector-based geographic information systems.
info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling.
Defrance, Matthieu; van Helden, Jacques
2009-10-15
Discovering cis-regulatory elements in genome sequence remains a challenging issue. Several methods rely on the optimization of some target scoring function. The information content (IC) or relative entropy of the motif has proven to be a good estimator of transcription factor DNA binding affinity. However, these information-based metrics are usually used as a posteriori statistics rather than during the motif search process itself. We introduce here info-gibbs, a Gibbs sampling algorithm that efficiently optimizes the IC or the log-likelihood ratio (LLR) of the motif while keeping computation time low. The method compares well with existing methods like MEME, BioProspector, Gibbs or GAME on both synthetic and biological datasets. Our study shows that motif discovery techniques can be enhanced by directly focusing the search on the motif IC or the motif LLR. http://rsat.ulb.ac.be/rsat/info-gibbs
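The information content that info-gibbs optimizes is the relative entropy of the motif against the background composition; a minimal computation from a motif count matrix (the Gibbs sampler itself is not reproduced here):

```python
import numpy as np

def motif_information_content(counts, background=None, pseudocount=0.5):
    """Information content (relative entropy, in bits) of a motif count
    matrix of shape (motif_width, 4) with columns A, C, G, T."""
    counts = np.asarray(counts, dtype=float) + pseudocount
    freqs = counts / counts.sum(axis=1, keepdims=True)
    bg = np.full(4, 0.25) if background is None else np.asarray(background, float)
    return float(np.sum(freqs * np.log2(freqs / bg)))

# Toy motif of width 3 built from 4 aligned sites: conserved columns score high.
counts = np.array([[4, 0, 0, 0],    # all A
                   [0, 0, 4, 0],    # all G
                   [1, 1, 1, 1]])   # uninformative column
print(motif_information_content(counts))
```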
Yu, Peng; Sun, Jia; Wolz, Robin; Stephenson, Diane; Brewer, James; Fox, Nick C; Cole, Patricia E; Jack, Clifford R; Hill, Derek L G; Schwarz, Adam J
2014-04-01
The objective of this study was to evaluate the effect of computational algorithm, measurement variability, and cut point on hippocampal volume (HCV)-based patient selection for clinical trials in mild cognitive impairment (MCI). We used normal control and amnestic MCI subjects from the Alzheimer's Disease Neuroimaging Initiative 1 (ADNI-1) as normative reference and screening cohorts. We evaluated the enrichment performance of 4 widely used hippocampal segmentation algorithms (FreeSurfer, Hippocampus Multi-Atlas Propagation and Segmentation (HMAPS), Learning Embeddings Atlas Propagation (LEAP), and NeuroQuant) in terms of 2-year changes in Mini-Mental State Examination (MMSE), Alzheimer's Disease Assessment Scale-Cognitive Subscale (ADAS-Cog), and Clinical Dementia Rating Sum of Boxes (CDR-SB). We modeled the implications for sample size, screen fail rates, and trial cost and duration. HCV based patient selection yielded reduced sample sizes (by ∼40%-60%) and lower trial costs (by ∼30%-40%) across a wide range of cut points. These results provide a guide to the choice of HCV cut point for amnestic MCI clinical trials, allowing an informed tradeoff between statistical and practical considerations. Copyright © 2014 Elsevier Inc. All rights reserved.
Multiagency Urban Search Experiment Detector and Algorithm Test Bed
NASA Astrophysics Data System (ADS)
Nicholson, Andrew D.; Garishvili, Irakli; Peplow, Douglas E.; Archer, Daniel E.; Ray, William R.; Swinney, Mathew W.; Willis, Michael J.; Davidson, Gregory G.; Cleveland, Steven L.; Patton, Bruce W.; Hornback, Donald E.; Peltz, James J.; McLean, M. S. Lance; Plionis, Alexander A.; Quiter, Brian J.; Bandstra, Mark S.
2017-07-01
In order to provide benchmark data sets for radiation detector and algorithm development, a particle transport test bed has been created using experimental data as model input and validation. A detailed radiation measurement campaign at the Combined Arms Collective Training Facility in Fort Indiantown Gap, PA (FTIG), USA, provides sample background radiation levels for a variety of materials present at the site (including cinder block, gravel, asphalt, and soil) using long dwell high-purity germanium (HPGe) measurements. In addition, detailed light detection and ranging data and ground-truth measurements inform model geometry. This paper describes the collected data and the application of these data to create background and injected source synthetic data for an arbitrary gamma-ray detection system using particle transport model detector response calculations and statistical sampling. In the methodology presented here, HPGe measurements inform model source terms while detector response calculations are validated via long dwell measurements using 2"×4"×16" NaI(Tl) detectors at a variety of measurement points. A collection of responses, along with sampling methods and interpolation, can be used to create data sets to gauge radiation detector and algorithm (including detection, identification, and localization) performance under a variety of scenarios. Data collected at the FTIG site are available for query, filtering, visualization, and download at muse.lbl.gov.
Ferrari, Ulisse
2016-08-01
Maximum entropy models provide the least constrained probability distributions that reproduce statistical properties of experimental datasets. In this work we characterize the learning dynamics that maximizes the log-likelihood in the case of large but finite datasets. We first show how the steepest descent dynamics is not optimal as it is slowed down by the inhomogeneous curvature of the model parameters' space. We then provide a way for rectifying this space which relies only on dataset properties and does not require large computational efforts. We conclude by solving the long-time limit of the parameters' dynamics including the randomness generated by the systematic use of Gibbs sampling. In this stochastic framework, rather than converging to a fixed point, the dynamics reaches a stationary distribution, which for the rectified dynamics reproduces the posterior distribution of the parameters. We sum up all these insights in a "rectified" data-driven algorithm that is fast and by sampling from the parameters' posterior avoids both under- and overfitting along all the directions of the parameters' space. Through the learning of pairwise Ising models from the recording of a large population of retina neurons, we show how our algorithm outperforms the steepest descent method.
Strategies for informed sample size reduction in adaptive controlled clinical trials
NASA Astrophysics Data System (ADS)
Arandjelović, Ognjen
2017-12-01
Clinical trial adaptation refers to any adjustment of the trial protocol after the onset of the trial. The main goal is to make the process of introducing new medical interventions to patients more efficient. The principal challenge, which is an outstanding research problem, is to be found in the question of how adaptation should be performed so as to minimize the chance of distorting the outcome of the trial. In this paper, we propose a novel method for achieving this. Unlike most of the previously published work, our approach focuses on trial adaptation by sample size adjustment, i.e. by reducing the number of trial participants in a statistically informed manner. Our key idea is to select the sample subset for removal in a manner which minimizes the associated loss of information. We formalize this notion and describe three algorithms which approach the problem in different ways, respectively, using (i) repeated random draws, (ii) a genetic algorithm, and (iii) what we term pair-wise sample compatibilities. Experiments on simulated data demonstrate the effectiveness of all three approaches, with a consistently superior performance exhibited by the pair-wise sample compatibilities-based method.
Statistical Methods and Tools for Uxo Characterization (SERDP Final Technical Report)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pulsipher, Brent A.; Gilbert, Richard O.; Wilson, John E.
2004-11-15
The Strategic Environmental Research and Development Program (SERDP) issued a statement of need for FY01 titled Statistical Sampling for Unexploded Ordnance (UXO) Site Characterization that solicited proposals to develop statistically valid sampling protocols for cost-effective, practical, and reliable investigation of sites contaminated with UXO; protocols that could be validated through subsequent field demonstrations. The SERDP goal was the development of a sampling strategy for which a fraction of the site is initially surveyed by geophysical detectors to confidently identify clean areas and subsections (target areas, TAs) that had elevated densities of anomalous geophysical detector readings that could indicate the presence of UXO. More detailed surveys could then be conducted to search the identified TAs for UXO. SERDP funded three projects: those proposed by the Pacific Northwest National Laboratory (PNNL) (SERDP Project No. UXO 1199), Sandia National Laboratory (SNL), and Oak Ridge National Laboratory (ORNL). The projects were closely coordinated to minimize duplication of effort and facilitate use of shared algorithms where feasible. This final report for PNNL Project 1199 describes the methods developed by PNNL to address SERDP's statement-of-need for the development of statistically-based geophysical survey methods for sites where 100% surveys are unattainable or cost prohibitive.
Network Data: Statistical Theory and New Models
2016-02-17
During this period of review, Bin Yu worked on many thrusts of high-dimensional statistical theory and methodologies. Her research covered a wide range of topics in statistics, including analysis and methods for spectral clustering for sparse and structured networks [2,7,8,21], sparse modeling (e.g. Lasso) [4,10,11,17,18,19], statistical guarantees for the EM algorithm [3], and statistical analysis of algorithm leveraging.
Chodera, John D; Shirts, Michael R
2011-11-21
The widespread popularity of replica exchange and expanded ensemble algorithms for simulating complex molecular systems in chemistry and biophysics has generated much interest in discovering new ways to enhance the phase space mixing of these protocols in order to improve sampling of uncorrelated configurations. Here, we demonstrate how both of these classes of algorithms can be considered as special cases of Gibbs sampling within a Markov chain Monte Carlo framework. Gibbs sampling is a well-studied scheme in the field of statistical inference in which different random variables are alternately updated from conditional distributions. While the update of the conformational degrees of freedom by Metropolis Monte Carlo or molecular dynamics unavoidably generates correlated samples, we show how judicious updating of the thermodynamic state indices--corresponding to thermodynamic parameters such as temperature or alchemical coupling variables--can substantially increase mixing while still sampling from the desired distributions. We show how state update methods in common use can lead to suboptimal mixing, and present some simple, inexpensive alternatives that can increase mixing of the overall Markov chain, reducing simulation times necessary to obtain estimates of the desired precision. These improved schemes are demonstrated for several common applications, including an alchemical expanded ensemble simulation, parallel tempering, and multidimensional replica exchange umbrella sampling.
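For context, the conventional neighbor-swap update that the Gibbs-sampling view generalizes can be sketched as follows; energies and temperatures below are illustrative, and this is the baseline scheme rather than the improved state-update methods proposed in the paper:

```python
import numpy as np

def attempt_neighbor_swaps(betas, energies, rng):
    """One sweep of conventional parallel-tempering neighbor swaps: replicas
    at adjacent inverse temperatures exchange places with Metropolis
    acceptance probability min(1, exp(dbeta * dU)).
    """
    order = np.arange(len(betas))           # order[i] = replica sitting at temperature slot i
    for i in range(len(betas) - 1):
        a, b = order[i], order[i + 1]
        log_acc = (betas[i + 1] - betas[i]) * (energies[b] - energies[a])
        if np.log(rng.random()) < min(0.0, log_acc):
            order[i], order[i + 1] = b, a   # swap which replica holds which temperature
    return order

rng = np.random.default_rng(0)
betas = np.array([1.0, 0.8, 0.6, 0.4])                # inverse temperatures
energies = rng.normal(loc=-10.0, scale=2.0, size=4)   # current potential energies per replica
print(attempt_neighbor_swaps(betas, energies, rng))
```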
Inventory and mapping of flood inundation using interactive digital image analysis techniques
Rohde, Wayne G.; Nelson, Charles A.; Taranik, J.V.
1979-01-01
LANDSAT digital data and color infrared photographs were used in a multiphase sampling scheme to estimate the area of agricultural land affected by a flood. The LANDSAT data were classified with a maximum likelihood algorithm. Stratification of the LANDSAT data prior to classification greatly reduced misclassification errors. The classification results were used to prepare a map overlay showing the areal extent of flooding. These data also provided the statistics required to estimate sample size in a two-phase sampling scheme, and provided quick, accurate estimates of flooded areas for the first phase. The measurements made in the second phase, based on ground data and photo-interpretation, were used with two-phase sampling statistics to estimate the area of agricultural land affected by flooding. These results show that LANDSAT digital data can be used to prepare map overlays showing the extent of flooding on agricultural land and, with two-phase sampling procedures, can provide acreage estimates with sampling errors of about 5 percent. This procedure provides a technique for rapidly assessing the areal extent of flood conditions on agricultural land and would provide a basis for designing a sampling framework to estimate the impact of flooding on crop production.
An incremental approach to genetic-algorithms-based classification.
Guan, Sheng-Uei; Zhu, Fangming
2005-04-01
Incremental learning has been widely addressed in the machine learning literature to cope with learning tasks where the learning environment is ever changing or training samples become available over time. However, most research work explores incremental learning with statistical algorithms or neural networks, rather than evolutionary algorithms. The work in this paper employs genetic algorithms (GAs) as basic learning algorithms for incremental learning within one or more classifier agents in a multiagent environment. Four new approaches with different initialization schemes are proposed. They keep the old solutions and use an "integration" operation to integrate them with new elements to accommodate new attributes, while biased mutation and crossover operations are adopted to further evolve a reinforced solution. The simulation results on benchmark classification data sets show that the proposed approaches can deal with the arrival of new input attributes and integrate them with the original input space. It is also shown that the proposed approaches can be successfully used for incremental learning and improve classification rates as compared to the retraining GA. Possible applications for continuous incremental training and feature selection are also discussed.
Shannon, Casey P; Chen, Virginia; Takhar, Mandeep; Hollander, Zsuzsanna; Balshaw, Robert; McManus, Bruce M; Tebbutt, Scott J; Sin, Don D; Ng, Raymond T
2016-11-14
Gene network inference (GNI) algorithms can be used to identify sets of coordinately expressed genes, termed network modules from whole transcriptome gene expression data. The identification of such modules has become a popular approach to systems biology, with important applications in translational research. Although diverse computational and statistical approaches have been devised to identify such modules, their performance behavior is still not fully understood, particularly in complex human tissues. Given human heterogeneity, one important question is how the outputs of these computational methods are sensitive to the input sample set, or stability. A related question is how this sensitivity depends on the size of the sample set. We describe here the SABRE (Similarity Across Bootstrap RE-sampling) procedure for assessing the stability of gene network modules using a re-sampling strategy, introduce a novel criterion for identifying stable modules, and demonstrate the utility of this approach in a clinically-relevant cohort, using two different gene network module discovery algorithms. The stability of modules increased as sample size increased and stable modules were more likely to be replicated in larger sets of samples. Random modules derived from permutated gene expression data were consistently unstable, as assessed by SABRE, and provide a useful baseline value for our proposed stability criterion. Gene module sets identified by different algorithms varied with respect to their stability, as assessed by SABRE. Finally, stable modules were more readily annotated in various curated gene set databases. The SABRE procedure and proposed stability criterion may provide guidance when designing systems biology studies in complex human disease and tissues.
A Gaussian Mixture Model for Nulling Pulsars
NASA Astrophysics Data System (ADS)
Kaplan, D. L.; Swiggum, J. K.; Fichtenbauer, T. D. J.; Vallisneri, M.
2018-03-01
The phenomenon of pulsar nulling—where pulsars occasionally turn off for one or more pulses—provides insight into pulsar-emission mechanisms and the processes by which pulsars turn off when they cross the “death line.” However, while ever more pulsars are found that exhibit nulling behavior, the statistical techniques used to measure nulling are biased, with limited utility and precision. In this paper, we introduce an improved algorithm, based on Gaussian mixture models, for measuring pulsar nulling behavior. We demonstrate this algorithm on a number of pulsars observed as part of a larger sample of nulling pulsars, and show that it performs considerably better than existing techniques, yielding better precision and no bias. We further validate our algorithm on simulated data. Our algorithm is widely applicable to a large number of pulsars, even if they do not show obvious nulls. Moreover, it can be used to derive probabilities of nulling for individual pulses, which can be used for in-depth studies.
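The general idea, fitting a two-component Gaussian mixture to per-pulse intensities and reading off a nulling fraction plus per-pulse null probabilities, can be sketched with scikit-learn; this mirrors the concept rather than the paper's exact likelihood treatment, and the intensity values are simulated:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated single-pulse intensities: a "null" component near zero and an
# "on" component at higher intensity (arbitrary units, toy numbers).
rng = np.random.default_rng(0)
nulls = rng.normal(loc=0.0, scale=0.3, size=300)
on = rng.normal(loc=2.0, scale=0.6, size=700)
intensities = np.concatenate([nulls, on]).reshape(-1, 1)

# Fit a two-component Gaussian mixture; the nulling fraction is the weight of
# the component with the lower mean.
gmm = GaussianMixture(n_components=2, random_state=0).fit(intensities)
null_component = int(np.argmin(gmm.means_.ravel()))
print("nulling fraction ~", gmm.weights_[null_component])                      # ~0.3
print("per-pulse null probabilities:",
      gmm.predict_proba(intensities)[:3, null_component])
```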
Bayesian Analysis for Exponential Random Graph Models Using the Adaptive Exchange Sampler.
Jin, Ick Hoon; Yuan, Ying; Liang, Faming
2013-10-01
Exponential random graph models have been widely used in social network analysis. However, these models are extremely difficult to handle from a statistical viewpoint, because of the intractable normalizing constant and model degeneracy. In this paper, we consider a fully Bayesian analysis for exponential random graph models using the adaptive exchange sampler, which solves the intractable normalizing constant and model degeneracy issues encountered in Markov chain Monte Carlo (MCMC) simulations. The adaptive exchange sampler can be viewed as a MCMC extension of the exchange algorithm, and it generates auxiliary networks via an importance sampling procedure from an auxiliary Markov chain running in parallel. The convergence of this algorithm is established under mild conditions. The adaptive exchange sampler is illustrated using a few social networks, including the Florentine business network, molecule synthetic network, and dolphins network. The results indicate that the adaptive exchange algorithm can produce more accurate estimates than approximate exchange algorithms, while maintaining the same computational efficiency.
Mao, Yong; Zhou, Xiao-Bo; Pi, Dao-Ying; Sun, You-Xian; Wong, Stephen T C
2005-10-01
In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables, the small number of samples, and the non-linearity of the problem. It is difficult to obtain satisfying results using conventional linear statistical methods. Recursive feature elimination based on support vector machines (SVM RFE) is an effective algorithm for gene selection and cancer classification, which are integrated into a consistent framework. In this paper, we propose a new method to select the parameters of the aforementioned algorithm implemented with Gaussian kernel SVMs, as a better alternative to the common practice of selecting the apparently best parameters, by using a genetic algorithm to search for a couple of optimal parameters. Fast implementation issues for this method are also discussed for pragmatic reasons. The proposed method was tested on two representative datasets, for hereditary breast cancer and acute leukaemia. The experimental results indicate that the proposed method performs well in selecting genes and achieves high classification accuracies with these genes.
Performance metrics for the assessment of satellite data products: an ocean color case study
Seegers, Bridget N.; Stumpf, Richard P.; Schaeffer, Blake A.; Loftin, Keith A.; Werdell, P. Jeremy
2018-01-01
Performance assessment of ocean color satellite data has generally relied on statistical metrics chosen for their common usage, and the rationale for selecting certain metrics is infrequently explained. Commonly reported statistics based on mean squared errors, such as the coefficient of determination (r2), root mean square error, and regression slopes, are most appropriate for Gaussian distributions without outliers and, therefore, are often not ideal for ocean color algorithm performance assessment, which is often limited by sample availability. In contrast, metrics based on simple deviations, such as bias and mean absolute error, as well as pair-wise comparisons, often provide more robust and straightforward quantities for evaluating ocean color algorithms with non-Gaussian distributions and outliers. This study uses a SeaWiFS chlorophyll-a validation data set to demonstrate a framework for satellite data product assessment and recommends a multi-metric and user-dependent approach that can be applied within science, modeling, and resource management communities. PMID:29609296
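One common way to compute such deviation-based metrics for chlorophyll-a is in log10 space, reporting them as multiplicative factors; a minimal helper with illustrative matchup values (not data from the study):

```python
import numpy as np

def log_bias_and_mae(model, obs):
    """Multiplicative bias and mean absolute error computed in log10 space
    and transformed back to factors: bias = 1 means no systematic offset,
    MAE = 1.5 means the model is off by a factor of ~1.5 on average."""
    d = np.log10(np.asarray(model, dtype=float)) - np.log10(np.asarray(obs, dtype=float))
    return 10 ** d.mean(), 10 ** np.abs(d).mean()

model = [0.12, 0.45, 1.8, 3.2, 0.08]     # satellite chlorophyll-a (mg m^-3), illustrative
obs = [0.10, 0.50, 1.5, 4.0, 0.09]       # in situ matchups, illustrative
print(log_bias_and_mae(model, obs))
```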
THE DETECTION AND STATISTICS OF GIANT ARCS BEHIND CLASH CLUSTERS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xu, Bingxiao; Zheng, Wei; Postman, Marc
We developed an algorithm to find and characterize gravitationally lensed galaxies (arcs) to perform a comparison of the observed and simulated arc abundance. Observations are from the Cluster Lensing And Supernova survey with Hubble (CLASH). Simulated CLASH images are created using the MOKA package and also clusters selected from the high-resolution, hydrodynamical simulations, MUSIC, over the same mass and redshift range as the CLASH sample. The algorithm's arc elongation accuracy, completeness, and false positive rate are determined and used to compute an estimate of the true arc abundance. We derive a lensing efficiency of 4 ± 1 arcs (with length ≥6″ and length-to-width ratio ≥7) per cluster for the X-ray-selected CLASH sample, 4 ± 1 arcs per cluster for the MOKA-simulated sample, and 3 ± 1 arcs per cluster for the MUSIC-simulated sample. The observed and simulated arc statistics are in full agreement. We measure the photometric redshifts of all detected arcs and find a median redshift z_s = 1.9, with 33% of the detected arcs having z_s > 3. We find that the arc abundance does not depend strongly on the source redshift distribution but is sensitive to the mass distribution of the dark matter halos (e.g., the c–M relation). Our results show that consistency between the observed and simulated distributions of lensed arc sizes and axial ratios can be achieved by using cluster-lensing simulations that are carefully matched to the selection criteria used in the observations.
Computational statistics using the Bayesian Inference Engine
NASA Astrophysics Data System (ADS)
Weinberg, Martin D.
2013-09-01
This paper introduces the Bayesian Inference Engine (BIE), a general parallel, optimized software package for parameter inference and model selection. This package is motivated by the analysis needs of modern astronomical surveys and by the need to organize and reuse expensive derived data. The BIE is the first platform for computational statistics designed explicitly to enable Bayesian update and model comparison for astronomical problems. Bayesian update is based on the representation of high-dimensional posterior distributions using metric-ball-tree based kernel density estimation. Among its algorithmic offerings, the BIE emphasizes hybrid tempered Markov chain Monte Carlo schemes that robustly sample multimodal posterior distributions in high-dimensional parameter spaces. Moreover, the BIE implements a full persistence or serialization system that stores the full byte-level image of the running inference and previously characterized posterior distributions for later use. Two new algorithms to compute the marginal likelihood from the posterior distribution, developed for and implemented in the BIE, enable model comparison for complex models and data sets. Finally, the BIE was designed to be a collaborative platform for applying Bayesian methodology to astronomy. It includes an extensible, object-oriented framework that implements every aspect of the Bayesian inference. By providing a variety of statistical algorithms for all phases of the inference problem, a scientist may explore a variety of approaches with a single model and data implementation. Additional technical and download details are available from http://www.astro.umass.edu/bie. The BIE is distributed under the GNU General Public License.
2013-01-01
Background: The high burden and rising incidence of cardiovascular disease (CVD) in resource-constrained countries necessitates implementation of robust and pragmatic primary and secondary prevention strategies. Many current CVD management guidelines recommend absolute cardiovascular (CV) risk assessment as a clinically sound guide to preventive and treatment strategies. Development of non-laboratory based cardiovascular risk assessment algorithms enables absolute risk assessment in resource-constrained countries. The objective of this review is to evaluate the performance of existing non-laboratory based CV risk assessment algorithms using the benchmarks for clinically useful CV risk assessment algorithms outlined by Cooney and colleagues. Methods: A literature search to identify non-laboratory based risk prediction algorithms was performed in MEDLINE, CINAHL, Ovid Premier Nursing Journals Plus, and PubMed databases. The identified algorithms were evaluated using the benchmarks for clinically useful cardiovascular risk assessment algorithms outlined by Cooney and colleagues. Results: Five non-laboratory based CV risk assessment algorithms were identified. The Gaziano and Framingham algorithms met the criteria for appropriateness of the statistical methods used to derive the algorithms and endpoints. The Swedish Consultation, Framingham and Gaziano algorithms demonstrated good discrimination in derivation datasets. Only the Gaziano algorithm was externally validated, where it had optimal discrimination. The Gaziano and WHO algorithms had chart formats which made them simple and user friendly for clinical application. Conclusion: Both the Gaziano and Framingham non-laboratory based algorithms met most of the criteria outlined by Cooney and colleagues. External validation of the algorithms in diverse samples is needed to ascertain their performance and applicability to different populations and to enhance clinicians’ confidence in them. PMID:24373202
NASA Astrophysics Data System (ADS)
Kerr, Laura T.; Adams, Aine; O'Dea, Shirley; Domijan, Katarina; Cullen, Ivor; Hennelly, Bryan M.
2014-05-01
Raman microspectroscopy can be applied to the urinary bladder for highly accurate classification and diagnosis of bladder cancer. This technique can be applied in vitro to bladder epithelial cells obtained from urine cytology, or in vivo as an "optical biopsy" to provide results in real time with higher sensitivity and specificity than current clinical methods. However, there exists a high degree of variability across experimental parameters which need to be standardised before this technique can be utilized in an everyday clinical environment. In this study, we investigate different laser wavelengths (473 nm and 532 nm), sample substrates (glass, fused silica and calcium fluoride) and multivariate statistical methods in order to gain insight into how these various experimental parameters impact on the sensitivity and specificity of Raman cytology.
A solution quality assessment method for swarm intelligence optimization algorithms.
Zhang, Zhaojun; Wang, Gai-Ge; Zou, Kuansheng; Zhang, Jianhua
2014-01-01
Nowadays, swarm intelligence optimization has become an important optimization tool that is widely used in many fields of application. In contrast to its many successful applications, the theoretical foundation is rather weak. Therefore, there are still many problems to be solved. One problem is how to quantify the performance of an algorithm in finite time, that is, how to evaluate the quality of the solutions obtained by the algorithm on practical problems; this gap greatly limits application in practice. A solution quality assessment method for intelligent optimization is proposed in this paper. It is an experimental analysis method based on analysis of the search space and the characteristics of the algorithm itself. Instead of "value performance," "ordinal performance" is used as the evaluation criterion in this method. The feasible solutions are clustered according to distance in order to divide the solution samples into several parts. Then, the solution space and the "good enough" set can be decomposed based on the clustering results. Finally, using relevant statistical knowledge, the evaluation result can be obtained. To validate the proposed method, several intelligent algorithms, such as ant colony optimization (ACO), particle swarm optimization (PSO), and the artificial fish swarm algorithm (AFS), were applied to the traveling salesman problem. Computational results indicate the feasibility of the proposed method.
Criteria for Choosing the Best Neural Network: Part 1
1991-07-24
In both classification and statistical settings, algorithms are presented for selecting the number of hidden-layer nodes in a three-layer, feedforward neural network, with the aim of determining a parsimonious neural network for use in prediction/generalization based on a given fixed learning sample. (Reference fragment: Harp, S.A., Samad, T., and Guha, A. (1990). Designing application-specific neural networks using genetic ..., in Touretzky, pp. 177-185. San Mateo: Morgan Kaufmann.)
Fall 2014 SEI Research Review Probabilistic Analysis of Time Sensitive Systems
2014-10-28
Osmosis SMC Tool: Osmosis is a tool for Statistical Model Checking (SMC) with Semantic Importance Sampling. The input model is written in a subset of C; ASSERT() statements in the model indicate conditions that must hold, and the input probability distributions are defined by the user. Osmosis returns the ... based on either a target relative error or a set number of simulations. (http://dreal.cs.cmu.edu/)
The Wang-Landau Sampling Algorithm
NASA Astrophysics Data System (ADS)
Landau, David P.
2003-03-01
Over the past several decades Monte Carlo simulations[1] have evolved into a powerful tool for the study of wide-ranging problems in statistical/condensed matter physics. Standard methods sample the probability distribution for the states of the system, usually in the canonical ensemble, and enormous improvements have been made in performance through the implementation of novel algorithms. Nonetheless, difficulties arise near phase transitions, either due to critical slowing down near 2nd order transitions or to metastability near 1st order transitions, thus limiting the applicability of the method. We shall describe a new and different Monte Carlo approach [2] that uses a random walk in energy space to determine the density of states directly. Once the density of states is estimated, all thermodynamic properties can be calculated at all temperatures. This approach can be extended to multi-dimensional parameter spaces and has already found use in classical models of interacting particles including systems with complex energy landscapes, e.g., spin glasses, protein folding models, etc., as well as for quantum models. 1. A Guide to Monte Carlo Simulations in Statistical Physics, D. P. Landau and K. Binder (Cambridge U. Press, Cambridge, 2000). 2. Fugao Wang and D. P. Landau, Phys. Rev. Lett. 86, 2050 (2001); Phys. Rev. E64, 056101-1 (2001).
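A compact sketch of the Wang-Landau random walk described above, written for the two-dimensional Ising model with periodic boundaries; the lattice size, modification-factor schedule, and flatness criterion are illustrative choices rather than the values used in the cited papers.

```python
import numpy as np

def wang_landau_ising(L=8, f_final=1e-6, flat=0.8, seed=0):
    """Estimate ln g(E) for the L x L 2D Ising model by a flat-histogram
    random walk in energy space (E runs over -2N, -2N+4, ..., 2N)."""
    rng = np.random.default_rng(seed)
    N = L * L
    spins = rng.choice([-1, 1], size=(L, L))

    def energy(s):  # each bond counted once via shifted copies
        return -int(np.sum(s * (np.roll(s, 1, 0) + np.roll(s, 1, 1))))

    e_index = {e: i for i, e in enumerate(range(-2 * N, 2 * N + 1, 4))}
    ln_g = np.zeros(len(e_index))
    hist = np.zeros(len(e_index))
    ln_f = 1.0
    E = energy(spins)
    while ln_f > f_final:
        for _ in range(10000):
            i, j = rng.integers(0, L, size=2)
            nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                  + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            dE = 2 * spins[i, j] * nb
            # accept spin flip with probability min(1, g(E_old)/g(E_new))
            if np.log(rng.random()) < ln_g[e_index[E]] - ln_g[e_index[E + dE]]:
                spins[i, j] *= -1
                E += dE
            idx = e_index[E]
            ln_g[idx] += ln_f      # refine the density-of-states estimate
            hist[idx] += 1
        visited = hist[hist > 0]
        if visited.size and visited.min() > flat * visited.mean():
            hist[:] = 0            # histogram "flat enough": shrink ln f
            ln_f /= 2.0
    return ln_g - ln_g[e_index[-2 * N]]   # normalize to the ground state
```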
Computation of fluid flow and pore-space properties estimation on micro-CT images of rock samples
NASA Astrophysics Data System (ADS)
Starnoni, M.; Pokrajac, D.; Neilson, J. E.
2017-09-01
Accurate determination of the petrophysical properties of rocks, namely the representative elementary volume (REV), mean pore and grain size, and absolute permeability, is essential for a broad range of engineering applications. Here, the petrophysical properties of rocks are calculated using an integrated approach comprising image processing, statistical correlation and numerical simulations. The Stokes equations of creeping flow for incompressible fluids are solved using the Finite-Volume SIMPLE algorithm. Simulations are then carried out on three-dimensional digital images obtained from micro-CT scanning of two rock formations: one sandstone and one carbonate. Permeability is predicted from the computed flow field using Darcy's law. It is shown that the REV, the representative elementary area (REA) and the mean pore and grain size are effectively estimated using the two-point spatial correlation function. Homogeneity and anisotropy are also evaluated using the same statistical tools. A comparison of different absolute permeability estimates is also presented, revealing a good agreement between the numerical value and the experimentally determined one for the carbonate sample, but a large discrepancy for the sandstone. Finally, a new convergence criterion for the SIMPLE algorithm, and more generally for the family of pressure-correction methods, is presented. This criterion is based on satisfaction of bulk momentum balance, which makes it particularly useful for pore-scale modelling of reservoir rocks.
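A small sketch of an FFT-based two-point (auto)correlation estimate for a binary pore/grain image, the kind of statistic used above to estimate REV, REA and mean pore size; the array is a random toy image and periodic wrapping is assumed, not the authors' pipeline.

```python
import numpy as np

def two_point_correlation(phase, max_lag=20):
    """Isotropic two-point probability S2(r) of a binary (pore=1 / grain=0)
    image, via FFT autocorrelation with periodic wrapping; S2(0) = porosity."""
    f = np.fft.rfftn(phase)
    corr = np.fft.irfftn(f * np.conj(f), s=phase.shape) / phase.size
    # signed lag distance for every voxel of the (wrapped) correlation map
    grid = np.indices(phase.shape)
    r = np.sqrt(sum((((g + n // 2) % n - n // 2) ** 2)
                    for g, n in zip(grid, phase.shape)))
    mask = r.ravel() <= max_lag
    idx = r.ravel()[mask].astype(int)          # radial bin of width 1 voxel
    s2 = (np.bincount(idx, weights=corr.ravel()[mask], minlength=max_lag + 1)
          / np.bincount(idx, minlength=max_lag + 1))
    return np.arange(max_lag + 1), s2

# toy 3-D binary image with porosity ~0.3; S2(0) should return ~0.3
img = (np.random.default_rng(1).random((64, 64, 64)) < 0.3).astype(float)
lags, s2 = two_point_correlation(img)
print(s2[0], s2[5])
```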
e-DMDAV: A new privacy preserving algorithm for wearable enterprise information systems
NASA Astrophysics Data System (ADS)
Zhang, Zhenjiang; Wang, Xiaoni; Uden, Lorna; Zhang, Peng; Zhao, Yingsi
2018-04-01
Wearable devices have been widely used in many fields to improve the quality of people's lives. More and more data on individuals and businesses are collected by statistical organizations through those devices. Almost all of this data holds confidential information. Statistical Disclosure Control (SDC) seeks to protect statistical data in such a way that it can be released without giving away confidential information that can be linked to specific individuals or entities. The MDAV (Maximum Distance to Average Vector) algorithm is an efficient micro-aggregation algorithm belonging to SDC. However, the MDAV algorithm cannot survive homogeneity and background knowledge attacks because it was designed for static numerical data. This paper proposes a systematic dynamic-updating anonymity algorithm based on MDAV, called the e-DMDAV algorithm. This algorithm introduces a new parameter and a table to ensure that each cluster of k records has a range of distinct values no less than e, for both numerical and non-numerical datasets. This new algorithm has been evaluated and compared with the MDAV algorithm. The simulation results show that the new algorithm outperforms MDAV in terms of minimizing distortion and disclosure risk with a similar computational cost.
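For reference, the sketch below is a compact version of the classical MDAV micro-aggregation step on static numerical data, the baseline that e-DMDAV extends; the dynamic-updating and e-range extensions described above are not shown, the tail handling is simplified, and the data are random.

```python
import numpy as np

def mdav(X, k=3):
    """Classical MDAV micro-aggregation (simplified tail handling): repeatedly
    take the record r farthest from the centroid, group r with its k-1 nearest
    neighbours, then do the same around the record s farthest from r; each
    group is replaced by its centroid so every record is k-anonymous."""
    X = np.asarray(X, dtype=float)
    remaining = list(range(len(X)))
    X_anon = X.copy()
    while len(remaining) >= 3 * k:
        sub = X[remaining]
        centroid = sub.mean(axis=0)
        r = remaining[int(np.argmax(np.linalg.norm(sub - centroid, axis=1)))]
        s = remaining[int(np.argmax(np.linalg.norm(sub - X[r], axis=1)))]
        for seed in (r, s):
            d = np.linalg.norm(X[remaining] - X[seed], axis=1)
            group = [remaining[i] for i in np.argsort(d)[:k]]
            X_anon[group] = X[group].mean(axis=0)
            remaining = [i for i in remaining if i not in group]
    if remaining:                         # final group of size k .. 3k-1
        X_anon[remaining] = X[remaining].mean(axis=0)
    return X_anon

records = np.random.default_rng(0).normal(size=(20, 4))
print(mdav(records, k=3)[:5])
```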
An Automated Energy Detection Algorithm Based on Consecutive Mean Excision
2018-01-01
... present in the RF spectrum. Subject terms: RF spectrum, detection threshold algorithm, consecutive mean excision, rank order filter, statistical ... The report's contents include sections on the median, the rank order filter (ROF), the crest factor (CF), and a statistical summary, followed by the algorithm, conclusion, and references. (Reference fragment: "... energy detection algorithm based on morphological filter processing with a semi-disk structure." Adelphi (MD): Army Research Laboratory (US); 2018 Jan.)
Profiling Arthritis Pain with a Decision Tree.
Hung, Man; Bounsanga, Jerry; Liu, Fangzhou; Voss, Maren W
2018-06-01
Arthritis is the leading cause of work disability and contributes to lost productivity. Previous studies showed that various factors predict pain, but they were limited in sample size and scope from a data analytics perspective. The current study applied machine learning algorithms to identify predictors of pain associated with arthritis in a large national sample. Using data from the 2011 to 2012 Medical Expenditure Panel Survey, data mining was performed to develop algorithms to identify factors and patterns that contribute to risk of pain. The model incorporated over 200 variables within the algorithm development, including demographic data, medical claims, laboratory tests, patient-reported outcomes, and sociobehavioral characteristics. The developed algorithms to predict pain utilize variables readily available in patient medical records. Using the machine learning classification algorithm J48 with 50-fold cross-validation, we found that the model can significantly distinguish those with and without pain (c-statistic = 0.9108). The F-measure was 0.856, accuracy rate was 85.68%, sensitivity was 0.862, specificity was 0.852, and precision was 0.849. Physical and mental function scores, the ability to climb stairs, and overall assessment of feeling were the most discriminative predictors among the 12 identified variables, predicting pain with 86% accuracy for individuals with arthritis. In this era of rapid expansion of big data application, the nature of healthcare research is moving from hypothesis-driven to data-driven solutions. The algorithms generated in this study offer new insights on individualized pain prediction, allowing the development of cost-effective care management programs for those experiencing arthritis pain. © 2017 World Institute of Pain.
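A minimal sketch of this kind of cross-validated decision-tree evaluation, using scikit-learn's CART-style DecisionTreeClassifier in place of Weka's J48 and a 10-fold split for speed; the data, variable count, and thresholds are synthetic placeholders, not the MEPS variables.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_auc_score, confusion_matrix, f1_score

rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, 12))              # stand-ins for function scores, etc.
p = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2])))
y = (rng.random(n) < p).astype(int)           # 1 = reports pain

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(min_samples_leaf=20, random_state=0)
prob = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
pred = (prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("c-statistic:", roc_auc_score(y, prob))
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
print("F-measure:", f1_score(y, pred), "accuracy:", (tp + tn) / n)
```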
Yan, Jianjun; Shen, Xiaojing; Wang, Yiqin; Li, Fufeng; Xia, Chunming; Guo, Rui; Chen, Chunfeng; Shen, Qingwei
2010-01-01
This study aims at utilising the Wavelet Packet Transform (WPT) and the Support Vector Machine (SVM) algorithm to provide objective analysis and quantitative research for auscultation in Traditional Chinese Medicine (TCM) diagnosis. First, Wavelet Packet Decomposition (WPD) at level 6 was employed to split the auscultation signals into finer frequency bands. Then statistical analysis was performed on the Wavelet Packet Energy (WPE) features extracted from the WPD coefficients. Furthermore, pattern recognition with SVM was used to distinguish the statistical feature values of the mixed subjects across the sample groups. Finally, the experimental results showed that the classification accuracies were at a high level.
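A brief sketch of level-6 wavelet packet energy features (via PyWavelets) fed to an SVM classifier, in the spirit of the pipeline above; the signals here are synthetic stand-ins for auscultation recordings, and the 'db4' wavelet and SVM settings are assumptions.

```python
import numpy as np
import pywt
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def wpe_features(signal, wavelet="db4", level=6):
    """Relative Wavelet Packet Energy in each terminal node at the given level."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    energies = np.array([np.sum(np.square(node.data))
                         for node in wp.get_level(level, order="freq")])
    return energies / energies.sum()

rng = np.random.default_rng(0)

def fake_voice(freq):
    """Synthetic stand-in for an auscultation signal: tone plus noise."""
    t = np.arange(4096) / 4096.0
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.standard_normal(t.size)

X = np.array([wpe_features(fake_voice(f)) for f in [30] * 40 + [80] * 40])
y = np.array([0] * 40 + [1] * 40)
print(cross_val_score(SVC(kernel="rbf", C=10, gamma="scale"), X, y, cv=5).mean())
```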
Evaluation of an Algorithm to Predict Menstrual-Cycle Phase at the Time of Injury.
Tourville, Timothy W; Shultz, Sandra J; Vacek, Pamela M; Knudsen, Emily J; Bernstein, Ira M; Tourville, Kelly J; Hardy, Daniel M; Johnson, Robert J; Slauterbeck, James R; Beynnon, Bruce D
2016-01-01
Women are 2 to 8 times more likely to sustain an anterior cruciate ligament (ACL) injury than men, and previous studies indicated an increased risk for injury during the preovulatory phase of the menstrual cycle (MC). However, investigations of risk rely on retrospective classification of MC phase, and no tools for this have been validated. To evaluate the accuracy of an algorithm for retrospectively classifying MC phase at the time of a mock injury based on MC history and salivary progesterone (P4) concentration. Descriptive laboratory study. Research laboratory. Thirty-one healthy female collegiate athletes (age range, 18-24 years) provided serum or saliva (or both) samples at 8 visits over 1 complete MC. Self-reported MC information was obtained on a randomized date (1-45 days) after mock injury, which is the typical timeframe in which researchers have access to ACL-injured study participants. The MC phase was classified using the algorithm as applied in a stand-alone computational fashion and also by 4 clinical experts using the algorithm and additional subjective hormonal history information to help inform their decision. To assess algorithm accuracy, phase classifications were compared with the actual MC phase at the time of mock injury (ascertained using urinary luteinizing hormone tests and serial serum P4 samples). Clinical expert and computed classifications were compared using κ statistics. Fourteen participants (45%) experienced anovulatory cycles. The algorithm correctly classified MC phase for 23 participants (74%): 22 (76%) of 29 who were preovulatory/anovulatory and 1 (50%) of 2 who were postovulatory. Agreement between expert and algorithm classifications ranged from 80.6% (κ = 0.50) to 93% (κ = 0.83). Classifications based on same-day saliva sample and optimal P4 threshold were the same as those based on MC history alone (87.1% correct). Algorithm accuracy varied during the MC but at no time were both sensitivity and specificity levels acceptable. These findings raise concerns about the accuracy of previous retrospective MC-phase classification systems, particularly in a population with a high occurrence of anovulatory cycles.
Iselin, Greg; Le Brocque, Robyne; Kenardy, Justin; Anderson, Vicki; McKinlay, Lynne
2010-10-01
Controversy surrounds the classification of posttraumatic stress disorder (PTSD), particularly in children and adolescents with traumatic brain injury (TBI). In these populations, it is difficult to differentiate TBI-related organic memory loss from dissociative amnesia. Several alternative PTSD classification algorithms have been proposed for use with children. This paper investigates DSM-IV-TR and alternative PTSD classification algorithms, including and excluding the dissociative amnesia item, in terms of their ability to predict psychosocial function following pediatric TBI. A sample of 184 children aged 6-14 years were recruited following emergency department presentation and/or hospital admission for TBI. PTSD was assessed via semi-structured clinical interview (CAPS-CA) with the child at 3 months post-injury. Psychosocial function was assessed using the parent report CHQ-PF50. Two alternative classification algorithms, the PTSD-AA and 2 of 3 algorithms, reached statistical significance. While the inclusion of the dissociative amnesia item increased prevalence rates across algorithms, it generally resulted in weaker associations with psychosocial function. The PTSD-AA algorithm appears to have the strongest association with psychosocial function following TBI in children and adolescents. Removing the dissociative amnesia item from the diagnostic algorithm generally results in improved validity. Copyright 2010 Elsevier Ltd. All rights reserved.
Statistical reconstruction for cosmic ray muon tomography.
Schultz, Larry J; Blanpied, Gary S; Borozdin, Konstantin N; Fraser, Andrew M; Hengartner, Nicolas W; Klimenko, Alexei V; Morris, Christopher L; Orum, Chris; Sossong, Michael J
2007-08-01
Highly penetrating cosmic ray muons constantly shower the earth at a rate of about 1 muon per cm2 per minute. We have developed a technique which exploits the multiple Coulomb scattering of these particles to perform nondestructive inspection without the use of artificial radiation. In prior work [1]-[3], we have described heuristic methods for processing muon data to create reconstructed images. In this paper, we present a maximum likelihood/expectation maximization tomographic reconstruction algorithm designed for the technique. This algorithm borrows much from techniques used in medical imaging, particularly emission tomography, but the statistics of muon scattering dictates differences. We describe the statistical model for multiple scattering, derive the reconstruction algorithm, and present simulated examples. We also propose methods to improve the robustness of the algorithm to experimental errors and events departing from the statistical model.
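To make the emission-tomography connection concrete, here is a generic maximum-likelihood/expectation-maximization (ML-EM) update for Poisson data of the kind borrowed from medical imaging; the system matrix below is a random toy, not the muon multiple-scattering statistical model derived in the paper.

```python
import numpy as np

def mlem(A, counts, n_iter=50):
    """Classic ML-EM for counts ~ Poisson(A @ lam).  The multiplicative
    update keeps the reconstructed image lam non-negative."""
    lam = np.ones(A.shape[1])
    sens = A.sum(axis=0)                      # sensitivity image: sum_i a_ij
    for _ in range(n_iter):
        expected = A @ lam
        ratio = counts / np.maximum(expected, 1e-12)
        lam *= (A.T @ ratio) / np.maximum(sens, 1e-12)
    return lam

rng = np.random.default_rng(0)
A = rng.random((200, 50))                     # toy system (projection) matrix
truth = rng.gamma(2.0, 1.0, size=50)
counts = rng.poisson(A @ truth)
print(np.corrcoef(truth, mlem(A, counts))[0, 1])   # reconstruction vs. truth
```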
Local image statistics: maximum-entropy constructions and perceptual salience
Victor, Jonathan D.; Conte, Mary M.
2012-01-01
The space of visual signals is high-dimensional and natural visual images have a highly complex statistical structure. While many studies suggest that only a limited number of image statistics are used for perceptual judgments, a full understanding of visual function requires analysis not only of the impact of individual image statistics, but also, how they interact. In natural images, these statistical elements (luminance distributions, correlations of low and high order, edges, occlusions, etc.) are intermixed, and their effects are difficult to disentangle. Thus, there is a need for construction of stimuli in which one or more statistical elements are introduced in a controlled fashion, so that their individual and joint contributions can be analyzed. With this as motivation, we present algorithms to construct synthetic images in which local image statistics—including luminance distributions, pair-wise correlations, and higher-order correlations—are explicitly specified and all other statistics are determined implicitly by maximum-entropy. We then apply this approach to measure the sensitivity of the human visual system to local image statistics and to sample their interactions. PMID:22751397
Automated medication reconciliation and complexity of care transitions.
Silva, Pamela A Bozzo; Bernstam, Elmer V; Markowitz, Eliz; Johnson, Todd R; Zhang, Jiajie; Herskovic, Jorge R
2011-01-01
Medication reconciliation is a National Patient Safety Goal (NPSG) from The Joint Commission (TJC) that entails reviewing all medications a patient takes after a health care transition. Medication reconciliation is a resource-intensive, error-prone task, and the resources to accomplish it may not be routinely available. Computer-based methods have the potential to overcome these barriers. We designed and explored a rule-based medication reconciliation algorithm to accomplish this task across different healthcare transitions. We tested our algorithm on a random sample of 94 transitions from the Clinical Data Warehouse at the University of Texas Health Science Center at Houston. We found that the algorithm reconciled, on average, 23.4% of the potentially reconcilable medications. Our study did not have sufficient statistical power to establish whether the kind of transition affects reconcilability. We conclude that automated reconciliation is possible and will help accomplish the NPSG.
Log-Linear Models for Gene Association
Hu, Jianhua; Joshi, Adarsh; Johnson, Valen E.
2009-01-01
We describe a class of log-linear models for the detection of interactions in high-dimensional genomic data. This class of models leads to a Bayesian model selection algorithm that can be applied to data that have been reduced to contingency tables using ranks of observations within subjects, and discretization of these ranks within gene/network components. Many normalization issues associated with the analysis of genomic data are thereby avoided. A prior density based on Ewens’ sampling distribution is used to restrict the number of interacting components assigned high posterior probability, and the calculation of posterior model probabilities is expedited by approximations based on the likelihood ratio statistic. Simulation studies are used to evaluate the efficiency of the resulting algorithm for known interaction structures. Finally, the algorithm is validated in a microarray study for which it was possible to obtain biological confirmation of detected interactions. PMID:19655032
Okariz, Ana; Guraya, Teresa; Iturrondobeitia, Maider; Ibarretxe, Julen
2017-02-01
The SIRT (Simultaneous Iterative Reconstruction Technique) algorithm is commonly used in Electron Tomography to calculate the original volume of the sample from noisy images, but the results provided by this iterative procedure are strongly dependent on the specific implementation of the algorithm, as well as on the number of iterations employed for the reconstruction. In this work, a methodology for selecting the iteration number of the SIRT reconstruction that provides the most accurate segmentation is proposed. The methodology is based on the statistical analysis of the intensity profiles at the edge of the objects in the reconstructed volume. A phantom which resembles a carbon black aggregate has been created to validate the methodology, and the SIRT implementations of two free software packages (TOMOJ and TOMO3D) have been used. Copyright © 2016 Elsevier B.V. All rights reserved.
Exact and Approximate Probabilistic Symbolic Execution
NASA Technical Reports Server (NTRS)
Luckow, Kasper; Pasareanu, Corina S.; Dwyer, Matthew B.; Filieri, Antonio; Visser, Willem
2014-01-01
Probabilistic software analysis seeks to quantify the likelihood of reaching a target event under uncertain environments. Recent approaches compute probabilities of execution paths using symbolic execution, but do not support nondeterminism. Nondeterminism arises naturally when no suitable probabilistic model can capture a program behavior, e.g., for multithreading or distributed systems. In this work, we propose a technique, based on symbolic execution, to synthesize schedulers that resolve nondeterminism to maximize the probability of reaching a target event. To scale to large systems, we also introduce approximate algorithms to search for good schedulers, speeding up established random sampling and reinforcement learning results through the quantification of path probabilities based on symbolic execution. We implemented the techniques in Symbolic PathFinder and evaluated them on nondeterministic Java programs. We show that our algorithms significantly improve upon a state-of-the-art statistical model checking algorithm, originally developed for Markov Decision Processes.
Driven-dissipative quantum Monte Carlo method for open quantum systems
NASA Astrophysics Data System (ADS)
Nagy, Alexandra; Savona, Vincenzo
2018-05-01
We develop a real-time full configuration-interaction quantum Monte Carlo approach to model driven-dissipative open quantum systems with Markovian system-bath coupling. The method enables stochastic sampling of the Liouville-von Neumann time evolution of the density matrix thanks to a massively parallel algorithm, thus providing estimates of observables on the nonequilibrium steady state. We present the underlying theory and introduce an initiator technique and importance sampling to reduce the statistical error. Finally, we demonstrate the efficiency of our approach by applying it to the driven-dissipative two-dimensional XYZ spin-1/2 model on a lattice.
The Stroke Riskometer™ App: Validation of a data collection tool and stroke risk predictor
Parmar, Priya; Krishnamurthi, Rita; Ikram, M Arfan; Hofman, Albert; Mirza, Saira S; Varakin, Yury; Kravchenko, Michael; Piradov, Michael; Thrift, Amanda G; Norrving, Bo; Wang, Wenzhi; Mandal, Dipes Kumar; Barker-Collo, Suzanne; Sahathevan, Ramesh; Davis, Stephen; Saposnik, Gustavo; Kivipelto, Miia; Sindi, Shireen; Bornstein, Natan M; Giroud, Maurice; Béjot, Yannick; Brainin, Michael; Poulton, Richie; Narayan, K M Venkat; Correia, Manuel; Freire, António; Kokubo, Yoshihiro; Wiebers, David; Mensah, George; BinDhim, Nasser F; Barber, P Alan; Pandian, Jeyaraj Durai; Hankey, Graeme J; Mehndiratta, Man Mohan; Azhagammal, Shobhana; Ibrahim, Norlinah Mohd; Abbott, Max; Rush, Elaine; Hume, Patria; Hussein, Tasleem; Bhattacharjee, Rohit; Purohit, Mitali; Feigin, Valery L
2015-01-01
Background: The greatest potential to reduce the burden of stroke is by primary prevention of first-ever stroke, which constitutes three quarters of all stroke. In addition to population-wide prevention strategies (the ‘mass’ approach), the ‘high risk’ approach aims to identify individuals at risk of stroke and to modify their risk factors, and risk, accordingly. Current methods of assessing and modifying stroke risk are difficult to access and implement by the general population, amongst whom most future strokes will arise. To help reduce the burden of stroke on individuals and the population a new app, the Stroke Riskometer™, has been developed. We aim to explore the validity of the app for predicting the risk of stroke compared with current best methods. Methods: 752 stroke outcomes from a sample of 9501 individuals across three countries (New Zealand, Russia and the Netherlands) were utilized to investigate the performance of a novel stroke risk prediction tool algorithm (Stroke Riskometer™) compared with two established stroke risk score prediction algorithms (Framingham Stroke Risk Score [FSRS] and QStroke). We calculated the receiver operating characteristic (ROC) curves and area under the ROC curve (AUROC) with 95% confidence intervals, Harrell's C-statistic and D-statistics as measures of discrimination, R2 statistics to indicate the level of variability accounted for by each prediction algorithm, the Hosmer-Lemeshow statistic for calibration, and the sensitivity and specificity of each algorithm. Results: The Stroke Riskometer™ performed well against the FSRS five-year AUROC for both males (FSRS = 75.0% [95% CI 72.3%-77.6%], Stroke Riskometer™ = 74.0% [95% CI 71.3%-76.7%]) and females (FSRS = 70.3% [95% CI 67.9%-72.8%], Stroke Riskometer™ = 71.5% [95% CI 69.0%-73.9%]), and performed better than QStroke in males (59.7% [95% CI 57.3%-62.0%]) and comparably in females (71.1% [95% CI 69.0%-73.1%]). Discriminative ability of all algorithms was low (C-statistic ranging from 0.51-0.56, D-statistic ranging from 0.01-0.12). The Hosmer-Lemeshow statistic indicated that none of the predicted risk scores were well calibrated with the observed event data (P < 0.006). Conclusions: The Stroke Riskometer™ is comparable in performance for stroke prediction with FSRS and QStroke. All three algorithms performed equally poorly in predicting stroke events. The Stroke Riskometer™ will be continually developed and validated to address the need to improve the current stroke risk scoring systems to more accurately predict stroke, particularly by identifying robust ethnic/racial group and country-specific risk factors. PMID:25491651
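A short sketch of the discrimination and classification metrics reported above (AUROC with a bootstrap confidence interval, sensitivity, specificity), computed on synthetic predicted risks; the risk distribution, event model, and 10% classification threshold are all arbitrary illustrations, not the study's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)
n = 9501
risk = rng.beta(2, 30, size=n)                        # toy predicted 5-year stroke risk
event = (rng.random(n) < np.clip(1.8 * risk, 0, 1)).astype(int)

auc = roc_auc_score(event, risk)
boot = [roc_auc_score(event[idx], risk[idx])          # nonparametric bootstrap CI
        for idx in (rng.integers(0, n, n) for _ in range(500))]
lo, hi = np.percentile(boot, [2.5, 97.5])

pred = (risk >= 0.10).astype(int)                     # arbitrary 10% risk threshold
tn, fp, fn, tp = confusion_matrix(event, pred).ravel()
print(f"AUROC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f}); "
      f"sensitivity {tp/(tp+fn):.2f}, specificity {tn/(tn+fp):.2f}")
```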
Simulation and analysis of scalable non-Gaussian statistically anisotropic random functions
NASA Astrophysics Data System (ADS)
Riva, Monica; Panzeri, Marco; Guadagnini, Alberto; Neuman, Shlomo P.
2015-12-01
Many earth and environmental (as well as other) variables, Y, and their spatial or temporal increments, ΔY, exhibit non-Gaussian statistical scaling. Previously we were able to capture some key aspects of such scaling by treating Y or ΔY as standard sub-Gaussian random functions. We were however unable to reconcile two seemingly contradictory observations, namely that whereas sample frequency distributions of Y (or its logarithm) exhibit relatively mild non-Gaussian peaks and tails, those of ΔY display peaks that grow sharper and tails that become heavier with decreasing separation distance or lag. Recently we overcame this difficulty by developing a new generalized sub-Gaussian model which captures both behaviors in a unified and consistent manner, exploring it on synthetically generated random functions in one dimension (Riva et al., 2015). Here we extend our generalized sub-Gaussian model to multiple dimensions, present an algorithm to generate corresponding random realizations of statistically isotropic or anisotropic sub-Gaussian functions and illustrate it in two dimensions. We demonstrate the accuracy of our algorithm by comparing ensemble statistics of Y and ΔY (such as, mean, variance, variogram and probability density function) with those of Monte Carlo generated realizations. We end by exploring the feasibility of estimating all relevant parameters of our model by analyzing jointly spatial moments of Y and ΔY obtained from a single realization of Y.
Statistical Algorithms for Designing Geophysical Surveys to Detect UXO Target Areas
DOE Office of Scientific and Technical Information (OSTI.GOV)
O'Brien, Robert F.; Carlson, Deborah K.; Gilbert, Richard O.
2005-07-29
The U.S. Department of Defense is in the process of assessing and remediating closed, transferred, and transferring military training ranges across the United States. Many of these sites have areas that are known to contain unexploded ordnance (UXO). Other sites or portions of sites are not expected to contain UXO, but some verification of this expectation using geophysical surveys is needed. Many sites are so large that it is often impractical and/or cost prohibitive to perform surveys over 100% of the site. In that case, it is particularly important to be explicit about the performance required of the survey. This article presents the statistical algorithms developed to support the design of geophysical surveys along transects (swaths) to find target areas (TAs) of anomalous geophysical readings that may indicate the presence of UXO. The algorithms described here determine 1) the spacing between transects that should be used for the surveys to achieve a specified probability of traversing the TA, 2) the probability of both traversing and detecting a TA of anomalous geophysical readings when the spatial density of anomalies within the TA is either uniform (unchanging over space) or has a bivariate normal distribution, and 3) the probability that a TA exists when it was not found by surveying along transects. These algorithms have been implemented in the Visual Sample Plan (VSP) software to develop cost-effective transect survey designs that meet performance objectives.
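A Monte Carlo sketch of the first quantity above (the probability that a set of parallel transects traverses an elliptical target area, as a function of transect spacing); the VSP closed-form calculations are not reproduced, and the target-area dimensions and spacings are illustrative assumptions.

```python
import numpy as np

def prob_traverse(spacing, semi_major=50.0, semi_minor=25.0,
                  n_sim=100_000, seed=0):
    """Probability that at least one of a set of parallel transects
    (spacing apart) crosses an elliptical TA with random centre and
    random orientation, estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, np.pi, n_sim)          # random TA orientation
    # half-width of the ellipse's projection perpendicular to the transects
    half_width = np.sqrt((semi_major * np.cos(theta)) ** 2
                         + (semi_minor * np.sin(theta)) ** 2)
    x_rel = rng.uniform(0.0, spacing, n_sim)        # TA centre between two transects
    dist = np.minimum(x_rel, spacing - x_rel)       # distance to nearest transect
    return np.mean(dist <= half_width)              # traversed iff transect hits TA

for s in (60, 100, 150, 250):
    print(f"transect spacing {s:4d} m -> P(traverse) ~ {prob_traverse(s):.3f}")
```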
Soultan, Alaaeldin; Safi, Kamran
2017-01-01
Digitized species occurrence data provide an unprecedented source of information for ecologists and conservationists. Species distribution model (SDM) has become a popular method to utilise these data for understanding the spatial and temporal distribution of species, and for modelling biodiversity patterns. Our objective is to study the impact of noise in species occurrence data (namely sample size and positional accuracy) on the performance and reliability of SDM, considering the multiplicative impact of SDM algorithms, species specialisation, and grid resolution. We created a set of four 'virtual' species characterized by different specialisation levels. For each of these species, we built the suitable habitat models using five algorithms at two grid resolutions, with varying sample sizes and different levels of positional accuracy. We assessed the performance and reliability of the SDM according to classic model evaluation metrics (Area Under the Curve and True Skill Statistic) and model agreement metrics (Overall Concordance Correlation Coefficient and geographic niche overlap) respectively. Our study revealed that species specialisation had by far the most dominant impact on the SDM. In contrast to previous studies, we found that for widespread species, low sample size and low positional accuracy were acceptable, and useful distribution ranges could be predicted with as few as 10 species occurrences. Range predictions for narrow-ranged species, however, were sensitive to sample size and positional accuracy, such that useful distribution ranges required at least 20 species occurrences. Against expectations, the MAXENT algorithm poorly predicted the distribution of specialist species at low sample size.
Robustness of methods for blinded sample size re-estimation with overdispersed count data.
Schneider, Simon; Schmidli, Heinz; Friede, Tim
2013-09-20
Counts of events are increasingly common as primary endpoints in randomized clinical trials. With between-patient heterogeneity leading to variances in excess of the mean (referred to as overdispersion), statistical models reflecting this heterogeneity by mixtures of Poisson distributions are frequently employed. Sample size calculation in the planning of such trials requires knowledge on the nuisance parameters, that is, the control (or overall) event rate and the overdispersion parameter. Usually, there is only little prior knowledge regarding these parameters in the design phase resulting in considerable uncertainty regarding the sample size. In this situation internal pilot studies have been found very useful and very recently several blinded procedures for sample size re-estimation have been proposed for overdispersed count data, one of which is based on an EM-algorithm. In this paper we investigate the EM-algorithm based procedure with respect to aspects of its implementation by studying the algorithm's dependence on the choice of convergence criterion and find that the procedure is sensitive to the choice of the stopping criterion in scenarios relevant to clinical practice. We also compare the EM-based procedure to other competing procedures regarding their operating characteristics such as sample size distribution and power. Furthermore, the robustness of these procedures to deviations from the model assumptions is explored. We find that some of the procedures are robust to at least moderate deviations. The results are illustrated using data from the US National Heart, Lung and Blood Institute sponsored Asymptomatic Cardiac Ischemia Pilot study. Copyright © 2013 John Wiley & Sons, Ltd.
High density DNA microarrays: algorithms and biomedical applications.
Liu, Wei-Min
2004-08-01
DNA microarrays are devices capable of detecting the identity and abundance of numerous DNA or RNA segments in samples. They are used for analyzing gene expressions, identifying genetic markers and detecting mutations on a genomic scale. The fundamental chemical mechanism of DNA microarrays is the hybridization between probes and targets due to the hydrogen bonds of nucleotide base pairing. Since the cross hybridization is inevitable, and probes or targets may form undesirable secondary or tertiary structures, the microarray data contain noise and depend on experimental conditions. It is crucial to apply proper statistical algorithms to obtain useful signals from noisy data. After we obtained the signals of a large amount of probes, we need to derive the biomedical information such as the existence of a transcript in a cell, the difference of expression levels of a gene in multiple samples, and the type of a genetic marker. Furthermore, after the expression levels of thousands of genes or the genotypes of thousands of single nucleotide polymorphisms are determined, it is usually important to find a small number of genes or markers that are related to a disease, individual reactions to drugs, or other phenotypes. All these applications need careful data analyses and reliable algorithms.
Stan : A Probabilistic Programming Language
Carpenter, Bob; Gelman, Andrew; Hoffman, Matthew D.; ...
2017-01-01
Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function over parameters conditioned on specified data and constants. As of version 2.14.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can also be called from the command line using the cmdstan package, through R using the rstan package, and through Python using the pystan package. All three interfaces support sampling and optimization-based inference with diagnostics and posterior analysis. rstan and pystan also provide access to log probabilities, gradients, Hessians, parameter transforms, and specialized plotting.
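A minimal example of calling Stan from Python on the classic eight-schools hierarchical model; the legacy pystan 2.x interface (StanModel / sampling) is assumed here, and the newer pystan 3 or cmdstanpy front ends use a different API.

```python
import numpy as np
import pystan  # assumes the legacy pystan 2.x interface

# Eight-schools hierarchical model written in the Stan language
model_code = """
data {
  int<lower=0> N;
  vector[N] y;
  vector<lower=0>[N] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[N] eta;
}
transformed parameters {
  vector[N] theta = mu + tau * eta;   // non-centred parameterization
}
model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
"""

data = {"N": 8,
        "y": [28, 8, -3, 7, -1, 1, 18, 12],
        "sigma": [15, 10, 16, 11, 9, 11, 10, 18]}

sm = pystan.StanModel(model_code=model_code)        # compiles the C++ sampler
fit = sm.sampling(data=data, iter=2000, chains=4)   # NUTS (adaptive HMC) by default
print(fit)                                          # posterior summaries, R-hat, ESS
print(np.mean(fit.extract()["tau"]))
```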
China Report, Science and Technology, No. 188
1983-02-18
... the model-oriented algorithm and, especially, this system does not need the time-consuming work of mathematical statistics during summarization of the ... the fact that the users can see the samples at the exhibit, several times the number of expected orders have been taken to demonstrate that ...
Jeyasingh, Suganthi; Veluchamy, Malathi
2017-05-01
Early diagnosis of breast cancer is essential to save the lives of patients. Usually, medical datasets include a large variety of data that can lead to confusion during diagnosis. The Knowledge Discovery in Databases (KDD) process helps to improve efficiency. It requires elimination of inappropriate and repeated data from the dataset before final diagnosis. This can be done using any of the feature selection algorithms available in data mining. Feature selection is considered a vital step to increase classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection to eliminate irrelevant features from an original dataset. The Bat algorithm was modified using simple random sampling to select random instances from the dataset. Ranking against the global best features was used to recognize the predominant features available in the dataset. The selected features are used to train a Random Forest (RF) classification algorithm. The MBA feature selection algorithm enhanced the classification accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnosis Breast Cancer (WDBC) dataset was used for the performance analysis of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of the Kappa statistic, Matthews Correlation Coefficient, Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE).
Hippisley-Cox, Julia; Coupland, Carol; Brindle, Peter
2014-01-01
Objectives: To validate the performance of a set of risk prediction algorithms developed using the QResearch database, in an independent sample from general practices contributing to the Clinical Practice Research Datalink (CPRD). Setting: Prospective open cohort study using practices contributing to the CPRD database and practices contributing to the QResearch database. Participants: The CPRD validation cohort consisted of 3.3 million patients, aged 25–99 years, registered at 357 general practices between 1 Jan 1998 and 31 July 2012. The validation statistics for QResearch were obtained from the original published papers, which used a one-third sample of practices separate to those used to derive the score. A cohort from QResearch was used to compare incidence rates and baseline characteristics and consisted of 6.8 million patients from 753 practices registered between 1 Jan 1998 and 31 July 2013. Outcome measures: Incident events relating to seven different risk prediction scores: QRISK2 (cardiovascular disease); QStroke (ischaemic stroke); QDiabetes (type 2 diabetes); QFracture (osteoporotic fracture and hip fracture); QKidney (moderate and severe kidney failure); QThrombosis (venous thromboembolism); QBleed (intracranial bleed and upper gastrointestinal haemorrhage). Measures of discrimination and calibration were calculated. Results: Overall, the baseline characteristics of the CPRD and QResearch cohorts were similar, though QResearch had higher recording levels for ethnicity and family history. The validation statistics for each of the risk prediction scores were very similar in the CPRD cohort compared with the published results from QResearch validation cohorts. For example, in women, the QDiabetes algorithm explained 50% of the variation within CPRD compared with 51% on QResearch, and the receiver operating characteristic curve value was 0.85 on both databases. The scores were well calibrated in CPRD. Conclusions: Each of the algorithms performed practically as well in the external independent CPRD validation cohorts as they had in the original published QResearch validation cohorts. PMID:25168040
Parsons, Brendon A; Marney, Luke C; Siegler, W Christopher; Hoggard, Jamin C; Wright, Bob W; Synovec, Robert E
2015-04-07
Comprehensive two-dimensional (2D) gas chromatography coupled with time-of-flight mass spectrometry (GC × GC-TOFMS) is a versatile instrumental platform capable of collecting highly informative, yet highly complex, chemical data for a variety of samples. Fisher-ratio (F-ratio) analysis applied to the supervised comparison of sample classes algorithmically reduces complex GC × GC-TOFMS data sets to find class distinguishing chemical features. F-ratio analysis, using a tile-based algorithm, significantly reduces the adverse effects of chromatographic misalignment and spurious covariance of the detected signal, enhancing the discovery of true positives while simultaneously reducing the likelihood of detecting false positives. Herein, we report a study using tile-based F-ratio analysis whereby four non-native analytes were spiked into diesel fuel at several concentrations ranging from 0 to 100 ppm. Spike level comparisons were performed in two regimes: comparing the spiked samples to the nonspiked fuel matrix and to each other at relative concentration factors of two. Redundant hits were algorithmically removed by refocusing the tiled results onto the original high resolution pixel level data. To objectively limit the tile-based F-ratio results to only features which are statistically likely to be true positives, we developed a combinatorial technique using null class comparisons, called null distribution analysis, by which we determined a statistically defensible F-ratio cutoff for the analysis of the hit list. After applying null distribution analysis, spiked analytes were reliably discovered at ∼1 to ∼10 ppm (∼5 to ∼50 pg using a 200:1 split), depending upon the degree of mass spectral selectivity and 2D chromatographic resolution, with minimal occurrence of false positives. To place the relevance of this work among other methods in this field, results are compared to those for pixel and peak table-based approaches.
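A minimal per-feature Fisher-ratio (F-ratio) computation for a supervised class comparison, together with the null-distribution idea of comparing two halves of the same class to set a cutoff; the feature table below is synthetic rather than tile-summed GC × GC-TOFMS signals, and the spiked feature index is invented.

```python
import numpy as np

def fisher_ratios(X, labels):
    """Per-feature F-ratio: between-class variance over within-class variance."""
    classes = np.unique(labels)
    grand = X.mean(axis=0)
    n_k = np.array([(labels == c).sum() for c in classes])
    means = np.array([X[labels == c].mean(axis=0) for c in classes])
    between = (n_k[:, None] * (means - grand) ** 2).sum(axis=0) / (len(classes) - 1)
    within = sum(((X[labels == c] - means[i]) ** 2).sum(axis=0)
                 for i, c in enumerate(classes)) / (len(X) - len(classes))
    return between / np.maximum(within, 1e-12)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 500))          # 12 runs x 500 features (tiles)
X[:6, 10] += 3.0                        # feature 10 mimics a spiked analyte
labels = np.array(["spiked"] * 6 + ["matrix"] * 6)

fr = fisher_ratios(X, labels)
# null-distribution idea: F-ratios from splitting ONE class against itself
null = fisher_ratios(X[6:], np.array(["a", "a", "a", "b", "b", "b"]))
cutoff = np.percentile(null, 99)
print("hits above null cutoff:", np.where(fr > cutoff)[0])
```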
Crans, Gerald G; Shuster, Jonathan J
2008-08-15
The debate as to which statistical methodology is most appropriate for the analysis of the two-sample comparative binomial trial has persisted for decades. Practitioners who favor the conditional methods of Fisher, Fisher's exact test (FET), claim that only experimental outcomes containing the same amount of information should be considered when performing analyses. Hence, the total number of successes should be fixed at its observed level in hypothetical repetitions of the experiment. Using conditional methods in clinical settings can pose interpretation difficulties, since results are derived using conditional sample spaces rather than the set of all possible outcomes. Perhaps more importantly from a clinical trial design perspective, this test can be too conservative, resulting in greater resource requirements and more subjects exposed to an experimental treatment. The actual significance level attained by FET (the size of the test) has not been reported in the statistical literature. Berger (J. R. Statist. Soc. D (The Statistician) 2001; 50:79-85) proposed assessing the conservativeness of conditional methods using p-value confidence intervals. In this paper we develop a numerical algorithm that calculates the size of FET for sample sizes, n, up to 125 per group at the two-sided significance level, alpha = 0.05. Additionally, this numerical method is used to define new significance levels alpha(*) = alpha+epsilon, where epsilon is a small positive number, for each n, such that the size of the test is as close as possible to the pre-specified alpha (0.05 for the current work) without exceeding it. Lastly, a sample size and power calculation example is presented, which demonstrates the statistical advantages of implementing the adjustment to FET (using alpha(*) instead of alpha) in the two-sample comparative binomial trial. 2008 John Wiley & Sons, Ltd
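A small brute-force sketch of the quantity being computed above, the actual size of the two-sided FET: for each common success probability p, sum the binomial probabilities of all outcome tables whose FET p-value falls at or below alpha, then maximize over a grid of p. The grid resolution is an assumption and n is kept small so the double sum stays quick; the paper's algorithm for n up to 125 is more efficient than this enumeration.

```python
import numpy as np
from scipy.stats import binom, fisher_exact

def fet_size(n, alpha=0.05, p_grid=None):
    """Approximate size of the two-sided FET with n subjects per arm:
    the maximum over a grid of common success probabilities p of
    P(reject H0 | p) under independent Binomial(n, p) arms."""
    p_grid = np.linspace(0.01, 0.99, 99) if p_grid is None else p_grid
    reject = np.zeros((n + 1, n + 1))
    for x in range(n + 1):
        for y in range(n + 1):
            _, pval = fisher_exact([[x, n - x], [y, n - y]])
            reject[x, y] = float(pval <= alpha)
    sizes = []
    for p in p_grid:
        pmf = binom.pmf(np.arange(n + 1), n, p)
        sizes.append(float(pmf @ reject @ pmf))   # P(reject) at this p
    return max(sizes)

print(fet_size(20))   # typically noticeably below the nominal 0.05
```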
NASA Astrophysics Data System (ADS)
Hsu, Hsiao-Ping; Nadler, Walder; Grassberger, Peter
2005-07-01
The scaling behavior of randomly branched polymers in a good solvent is studied in two to nine dimensions, modeled by lattice animals on simple hypercubic lattices. For the simulations, we use a biased sequential sampling algorithm with re-sampling, similar to the pruned-enriched Rosenbluth method (PERM) used extensively for linear polymers. We obtain high statistics of animals with up to several thousand sites in all dimensions 2⩽d⩽9. The partition sum (number of different animals) and gyration radii are estimated. In all dimensions we verify the Parisi-Sourlas prediction, and we verify all exactly known critical exponents in dimensions 2, 3, 4, and ⩾8. In addition, we present the hitherto most precise estimates for growth constants in d⩾3. For clusters with one site attached to an attractive surface, we verify the superuniversality of the cross-over exponent at the adsorption transition predicted by Janssen and Lyssy.
Pezzotti, Giuseppe; Zhu, Wenliang; Boffelli, Marco; Adachi, Tetsuya; Ichioka, Hiroaki; Yamamoto, Toshiro; Marunaka, Yoshinori; Kanamura, Narisato
2015-05-01
The Raman spectroscopic method has quantitatively been applied to the analysis of local crystallographic orientation in both single-crystal hydroxyapatite and human teeth. Raman selection rules for all the vibrational modes of the hexagonal structure were expanded into explicit functions of Euler angles in space and six Raman tensor elements (RTE). A theoretical treatment has also been put forward according to the orientation distribution function (ODF) formalism, which allows one to resolve the statistical orientation patterns of the nm-sized hydroxyapatite crystallite comprised in the Raman microprobe. Close-form solutions could be obtained for the Euler angles and their statistical distributions resolved with respect to the direction of the average texture axis. Polarized Raman spectra from single-crystalline hydroxyapatite and textured polycrystalline (teeth enamel) samples were compared, and a validation of the proposed Raman method could be obtained through confirming the agreement between RTE values obtained from different samples.
NASA Astrophysics Data System (ADS)
González Gómez, Dulce I.; Moreno Barbosa, E.; Martínez Hernández, Mario Iván; Ramos Méndez, José; Hidalgo Tobón, Silvia; Dies Suarez, Pilar; Barragán Pérez, Eduardo; De Celis Alonso, Benito
2014-11-01
The main goal of this project was to create a computer algorithm based on wavelet analysis of region of homogeneity images obtained during resting state studies. Ideally it would automatically diagnose ADHD. Because the cerebellum is an area known to be affected by ADHD, this study specifically analysed this region. Male right-handed volunteers (children aged between 7 and 11 years) were studied and compared with age-matched controls. Statistical differences between the values of the absolute integrated wavelet spectrum were found and showed significant differences (p<0.0015) between groups. This difference might help in the future to distinguish healthy from ADHD patients and therefore diagnose ADHD. Even if the results were statistically significant, the small size of the sample limits the applicability of this method as presented here, and further work with larger samples and using freely available datasets must be done.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Matzke, Brett D.; Wilson, John E.; Hathaway, J.
2008-02-12
Statistically defensible methods are presented for developing geophysical detector sampling plans and analyzing data for munitions response sites where unexploded ordnance (UXO) may exist. Detection methods for identifying areas of elevated anomaly density from background density are shown. Additionally, methods are described which aid in the choice of transect pattern and spacing to assure, with a specified degree of confidence, that a target area (TA) of specific size, shape, and anomaly density will be identified using the detection methods. Methods for evaluating the sensitivity of designs to variation in certain parameters are also discussed. The methods presented have been incorporated into the Visual Sample Plan (VSP) software (free at http://dqo.pnl.gov/vsp) and demonstrated at multiple sites in the United States. Application examples from actual transect designs and surveys from the previous two years are demonstrated.
Quantitative Imaging Biomarkers: A Review of Statistical Methods for Computer Algorithm Comparisons
2014-01-01
Quantitative biomarkers from medical images are becoming important tools for clinical diagnosis, staging, monitoring, treatment planning, and development of new therapies. While there is a rich history of the development of quantitative imaging biomarker (QIB) techniques, little attention has been paid to the validation and comparison of the computer algorithms that implement the QIB measurements. In this paper we provide a framework for QIB algorithm comparisons. We first review and compare various study designs, including designs with the true value (e.g. phantoms, digital reference images, and zero-change studies), designs with a reference standard (e.g. studies testing equivalence with a reference standard), and designs without a reference standard (e.g. agreement studies and studies of algorithm precision). The statistical methods for comparing QIB algorithms are then presented for various study types using both aggregate and disaggregate approaches. We propose a series of steps for establishing the performance of a QIB algorithm, identify limitations in the current statistical literature, and suggest future directions for research. PMID:24919829
Machine Learning Methods for Attack Detection in the Smart Grid.
Ozay, Mete; Esnaola, Inaki; Yarman Vural, Fatos Tunay; Kulkarni, Sanjeev R; Poor, H Vincent
2016-08-01
Attack detection problems in the smart grid are posed as statistical learning problems for different attack scenarios in which the measurements are observed in batch or online settings. In this approach, machine learning algorithms are used to classify measurements as being either secure or attacked. An attack detection framework is provided to exploit any available prior knowledge about the system and surmount constraints arising from the sparse structure of the problem in the proposed approach. Well-known batch and online learning algorithms (supervised and semisupervised) are employed with decision- and feature-level fusion to model the attack detection problem. The relationships between statistical and geometric properties of attack vectors employed in the attack scenarios and learning algorithms are analyzed to detect unobservable attacks using statistical learning methods. The proposed algorithms are examined on various IEEE test systems. Experimental analyses show that machine learning algorithms can detect attacks with performances higher than attack detection algorithms that employ state vector estimation methods in the proposed attack detection framework.
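The following sketch illustrates the batch supervised setting in the spirit of the framework described: measurement vectors are labelled secure or attacked and classified with a support vector machine. The synthetic measurement model and the sparse attack construction are illustrative assumptions, not the IEEE test-system data or the specific fusion schemes used in the paper.

```python
# Sketch: classify smart-grid measurement vectors as secure (0) or attacked (1).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n_meas, n_samples = 30, 400
H = rng.normal(size=(n_meas, 10))                 # toy measurement matrix (assumed)
states = rng.normal(size=(n_samples, 10))
z = states @ H.T + 0.1 * rng.normal(size=(n_samples, n_meas))

labels = rng.integers(0, 2, n_samples)            # 1 = attacked
attacks = np.zeros_like(z)
idx = rng.integers(0, n_meas, (n_samples, 3))     # sparse attacks on 3 meters
for i in range(n_samples):
    attacks[i, idx[i]] = rng.normal(scale=2.0, size=3)
z_obs = z + labels[:, None] * attacks             # inject attacks where label = 1

X_tr, X_te, y_tr, y_te = train_test_split(z_obs, labels, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```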
A Hybrid Semi-supervised Classification Scheme for Mining Multisource Geospatial Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vatsavai, Raju; Bhaduri, Budhendra L
2011-01-01
Supervised learning methods such as Maximum Likelihood (ML) are often used in land cover (thematic) classification of remote sensing imagery. The ML classifier relies exclusively on spectral characteristics of thematic classes whose statistical distributions (class conditional probability densities) are often overlapping. The spectral response distributions of thematic classes are dependent on many factors including elevation, soil types, and ecological zones. A second problem with statistical classifiers is the requirement for a large number of accurate training samples (10 to 30 times the number of dimensions), which are often costly and time consuming to acquire over large geographic regions. With the increasing availability of geospatial databases, it is possible to exploit the knowledge derived from these ancillary datasets to improve classification accuracies even when the class distributions are highly overlapping. Likewise, newer semi-supervised techniques can be adopted to improve the parameter estimates of the statistical model by utilizing a large number of easily available unlabeled training samples. Unfortunately, there is no convenient multivariate statistical model that can be employed for multisource geospatial databases. In this paper we present a hybrid semi-supervised learning algorithm that effectively exploits freely available unlabeled training samples from multispectral remote sensing images and also incorporates ancillary geospatial databases. We have conducted several experiments on real datasets, and our new hybrid approach shows a 25 to 35% improvement in overall classification accuracy over conventional classification schemes.
Neale, Chris; Madill, Chris; Rauscher, Sarah; Pomès, Régis
2013-08-13
All molecular dynamics simulations are susceptible to sampling errors, which degrade the accuracy and precision of observed values. The statistical convergence of simulations containing atomistic lipid bilayers is limited by the slow relaxation of the lipid phase, which can exceed hundreds of nanoseconds. These long conformational autocorrelation times are exacerbated in the presence of charged solutes, which can induce significant distortions of the bilayer structure. Such long relaxation times represent hidden barriers that induce systematic sampling errors in simulations of solute insertion. To identify optimal methods for enhancing sampling efficiency, we quantitatively evaluate convergence rates using generalized ensemble sampling algorithms in calculations of the potential of mean force for the insertion of the ionic side chain analog of arginine in a lipid bilayer. Umbrella sampling (US) is used to restrain solute insertion depth along the bilayer normal, the order parameter commonly used in simulations of molecular solutes in lipid bilayers. When US simulations are modified to conduct random walks along the bilayer normal using a Hamiltonian exchange algorithm, systematic sampling errors are eliminated more rapidly and the rate of statistical convergence of the standard free energy of binding of the solute to the lipid bilayer is increased 3-fold. We compute the ratio of the replica flux transmitted across a defined region of the order parameter to the replica flux that entered that region in Hamiltonian exchange simulations. We show that this quantity, the transmission factor, identifies sampling barriers in degrees of freedom orthogonal to the order parameter. The transmission factor is used to estimate the depth-dependent conformational autocorrelation times of the simulation system, some of which exceed the simulation time, and thereby identify solute insertion depths that are prone to systematic sampling errors and estimate the lower bound of the amount of sampling that is required to resolve these sampling errors. Finally, we extend our simulations and verify that the conformational autocorrelation times estimated by the transmission factor accurately predict correlation times that exceed the simulation time scale, something that, to our knowledge, has never before been achieved.
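A hedged sketch of the transmission-factor idea: given a replica's trajectory along the order parameter, count how many entries into a chosen region are transmitted to the opposite side. The region bounds and the toy random-walk trajectory are assumptions for illustration; the paper computes this quantity from Hamiltonian exchange replica fluxes in the actual simulations.

```python
# Sketch: transmission factor of a region [lo, hi] from a 1-D trajectory.
import numpy as np

def transmission_factor(z, lo, hi):
    """Fraction of entries into [lo, hi] that exit on the opposite side."""
    entered = transmitted = 0
    side = None                        # side from which the region was last entered
    for prev, cur in zip(z[:-1], z[1:]):
        if prev < lo <= cur:           # entered from below
            entered += 1
            side = "below"
        elif prev > hi >= cur:         # entered from above
            entered += 1
            side = "above"
        if side == "below" and cur > hi:
            transmitted += 1
            side = None
        elif side == "above" and cur < lo:
            transmitted += 1
            side = None
    return transmitted / entered if entered else float("nan")

rng = np.random.default_rng(2)
z = np.cumsum(rng.normal(scale=0.1, size=20000))   # toy order-parameter walk
print(transmission_factor(z, -1.0, 1.0))
```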
Symmetric log-domain diffeomorphic Registration: a demons-based approach.
Vercauteren, Tom; Pennec, Xavier; Perchant, Aymeric; Ayache, Nicholas
2008-01-01
Modern morphometric studies use non-linear image registration to compare anatomies and perform group analysis. Recently, log-Euclidean approaches have contributed to promote the use of such computational anatomy tools by permitting simple computations of statistics on a rather large class of invertible spatial transformations. In this work, we propose a non-linear registration algorithm perfectly fit for log-Euclidean statistics on diffeomorphisms. Our algorithm works completely in the log-domain, i.e. it uses a stationary velocity field. This implies that we guarantee the invertibility of the deformation and have access to the true inverse transformation. This also means that our output can be directly used for log-Euclidean statistics without relying on the heavy computation of the log of the spatial transformation. As it is often desirable, our algorithm is symmetric with respect to the order of the input images. Furthermore, we use an alternate optimization approach related to Thirion's demons algorithm to provide a fast non-linear registration algorithm. First results show that our algorithm outperforms both the demons algorithm and the recently proposed diffeomorphic demons algorithm in terms of accuracy of the transformation while remaining computationally efficient.
Statistical Signal Models and Algorithms for Image Analysis
1984-10-25
In this report, two-dimensional stochastic linear models are used in developing algorithms for image analysis such as classification, segmentation, and object detection in images characterized by textured backgrounds. These models generate two-dimensional random processes as outputs to which statistical inference procedures can naturally be applied. A common thread throughout our algorithms is the interpretation of the inference procedures in terms of linear prediction
An Automated Energy Detection Algorithm Based on Morphological and Statistical Processing Techniques
2018-01-09
US Army Research Laboratory technical report ARL-TR-8272, January 2018: An Automated Energy Detection Algorithm Based on Morphological and Statistical Processing Techniques (only cover-page and documentation-form text was captured for this record; no abstract text is recoverable).
Dai, Mingwei; Ming, Jingsi; Cai, Mingxuan; Liu, Jin; Yang, Can; Wan, Xiang; Xu, Zongben
2017-09-15
Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, known as 'polygenicity'. Tens of thousands of samples are often required to ensure statistical power of identifying these variants with small effects. However, it is often the case that a research group can only get approval for the access to individual-level genotype data with a limited sample size (e.g. a few hundreds or thousands). Meanwhile, summary statistics generated using single-variant-based analysis are becoming publicly available. The sample sizes associated with the summary statistics datasets are usually quite large. How to make the most efficient use of existing abundant data resources largely remains an open question. In this study, we propose a statistical approach, IGESS, to increasing statistical power of identifying risk variants and improving accuracy of risk prediction by integrating individual-level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle the genome-wide analysis. Through comprehensive simulation studies, we demonstrated the advantages of IGESS over the methods which take either individual-level data or summary statistics data as input. We applied IGESS to perform integrative analysis of Crohn's Disease from WTCCC and summary statistics from other studies. IGESS was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240 000 variants. The IGESS software is available at https://github.com/daviddaigithub/IGESS . zbxu@xjtu.edu.cn or xwan@comp.hkbu.edu.hk or eeyang@hkbu.edu.hk. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Multiclass classification of microarray data samples with a reduced number of genes
2011-01-01
Background Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples. PMID:21342522
Nonparametric Methods in Astronomy: Think, Regress, Observe—Pick Any Three
NASA Astrophysics Data System (ADS)
Steinhardt, Charles L.; Jermyn, Adam S.
2018-02-01
Telescopes are much more expensive than astronomers, so it is essential to minimize required sample sizes by using the most data-efficient statistical methods possible. However, the most commonly used model-independent techniques for finding the relationship between two variables in astronomy are flawed. In the worst case they can lead without warning to subtly yet catastrophically wrong results, and even in the best case they require more data than necessary. Unfortunately, there is no single best technique for nonparametric regression. Instead, we provide a guide for how astronomers can choose the best method for their specific problem and provide a python library with both wrappers for the most useful existing algorithms and implementations of two new algorithms developed here.
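As a concrete example of one standard nonparametric regression technique, the sketch below implements Gaussian-kernel Nadaraya-Watson smoothing. It is a generic illustration of the class of methods discussed, not necessarily one of the two new algorithms introduced by the authors or a wrapper from their python library, and the bandwidth is an arbitrary choice.

```python
# Sketch: Nadaraya-Watson kernel regression of y on x.
import numpy as np

def nw_regress(x_train, y_train, x_query, bandwidth=0.3):
    """Gaussian-kernel Nadaraya-Watson estimate of E[y | x] at x_query."""
    d = (x_query[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * d**2)
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(3)
x = rng.uniform(0, 2 * np.pi, 200)
y = np.sin(x) + 0.2 * rng.normal(size=200)
xq = np.linspace(0, 2 * np.pi, 5)
print(nw_regress(x, y, xq))          # should roughly follow sin(xq)
```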
Monte Carlo Analysis of Reservoir Models Using Seismic Data and Geostatistical Models
NASA Astrophysics Data System (ADS)
Zunino, A.; Mosegaard, K.; Lange, K.; Melnikova, Y.; Hansen, T. M.
2013-12-01
We present a study on the analysis of petroleum reservoir models consistent with seismic data and geostatistical constraints, performed on a synthetic reservoir model. Our aim is to invert directly for the structure and rock bulk properties of the target reservoir zone. To infer the rock facies, porosity, and oil saturation, seismology alone is not sufficient; a rock physics model, which links the unknown properties to the elastic parameters, must also be taken into account. We then combine a rock physics model with a simple convolutional approach for seismic waves to invert the "measured" seismograms. To solve this inverse problem, we employ a Markov chain Monte Carlo (MCMC) method, because it can handle non-linear, complex, and multi-step forward models and provides realistic estimates of uncertainties. However, for large data sets the MCMC method may be impractical because of a very high computational demand. To face this challenge, one strategy is to feed the algorithm with realistic models, relying on proper prior information. To this end, we utilize an algorithm drawn from geostatistics to generate geologically plausible models which represent samples of the prior distribution. The geostatistical algorithm learns the multiple-point statistics from prototype models (in the form of training images), then generates thousands of different models which are accepted or rejected by a Metropolis sampler. To further reduce the computation time we parallelize the software and run it on multi-core machines. The solution of the inverse problem is then represented by a collection of reservoir models in terms of facies, porosity and oil saturation, which constitute samples of the posterior distribution. We are finally able to produce probability maps of the properties of interest by performing statistical analysis on the collection of solutions.
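A minimal sketch of the accept/reject step described above: candidate reservoir models are drawn from a prior generator and accepted by a Metropolis rule based on the seismic misfit, so the retained models sample the posterior. The functions draw_prior_model and seismic_forward are hypothetical placeholders standing in for the training-image-based geostatistical simulation and the convolutional forward model; the noise level and toy data are assumptions.

```python
# Sketch: independence Metropolis sampler with prior proposals and a data misfit.
import numpy as np

rng = np.random.default_rng(4)
d_obs = rng.normal(size=100)                 # "measured" seismogram (toy)
sigma = 0.5                                  # assumed noise level

def draw_prior_model():
    return rng.normal(size=100)              # placeholder for a geostatistical sample

def seismic_forward(m):
    # Toy convolutional forward model (moving-average "wavelet").
    return np.convolve(m, np.ones(5) / 5, mode="full")[:100]

def log_like(m):
    r = d_obs - seismic_forward(m)
    return -0.5 * np.sum(r**2) / sigma**2

current = draw_prior_model()
posterior_samples = []
for _ in range(5000):
    proposal = draw_prior_model()            # proposal drawn from the prior
    if np.log(rng.uniform()) < log_like(proposal) - log_like(current):
        current = proposal                   # accept with likelihood ratio
    posterior_samples.append(current)
print(len(posterior_samples))
```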
NASA Astrophysics Data System (ADS)
Walker, Joel W.
2014-08-01
The MT2, or "s-transverse mass", statistic was developed to associate a parent mass scale to a missing transverse energy signature, given that escaping particles are generally expected in pairs, while collider experiments are sensitive to just a single transverse momentum vector sum. This document focuses on the generalized extension of that statistic to asymmetric one- and two-step decay chains, with arbitrary child particle masses and upstream missing transverse momentum. It provides a unified theoretical formulation, complete solution classification, taxonomy of critical points, and technical algorithmic prescription for treatment of the event scale. An implementation of the described algorithm is available for download, and is also a deployable component of the author's selection cut software package AEACuS (Algorithmic Event Arbiter and Cut Selector). Appendices address combinatoric event assembly, algorithm validation, and a complete pseudocode.
Volume reconstruction optimization for tomo-PIV algorithms applied to experimental data
NASA Astrophysics Data System (ADS)
Martins, Fabio J. W. A.; Foucaut, Jean-Marc; Thomas, Lionel; Azevedo, Luis F. A.; Stanislas, Michel
2015-08-01
Tomographic PIV is a three-component volumetric velocity measurement technique based on the tomographic reconstruction of a particle distribution imaged by multiple camera views. In essence, the performance and accuracy of this technique are highly dependent on the parametric adjustment and the reconstruction algorithm used. Although synthetic data have been widely employed to optimize experiments, the resulting reconstructed volumes might not have optimal quality. The purpose of the present study is to offer quality indicators that can be applied to data samples in order to improve the quality of velocity results obtained by the tomo-PIV technique. The proposed methodology can potentially lead to a significant reduction in the time required to optimize a tomo-PIV reconstruction, and to better-quality velocity results. Tomo-PIV data provided by a six-camera turbulent boundary-layer experiment were used to optimize the reconstruction algorithms according to this methodology. Velocity statistics measurements obtained by optimized BIMART, SMART and MART algorithms were compared with hot-wire anemometer data and velocity measurement uncertainties were computed. Results indicated that the BIMART and SMART algorithms produced reconstructed volumes with quality equivalent to the standard MART, with the benefit of reduced computational time.
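For reference, the sketch below shows the classical multiplicative MART update on a tiny synthetic system; real tomo-PIV reconstructions build the weighting matrix from camera calibrations and operate on far larger volumes, and the relaxation parameter and problem sizes here are assumptions for illustration only.

```python
# Sketch: MART (multiplicative algebraic reconstruction technique) on a toy system.
import numpy as np

def mart(W, I, n_iter=20, mu=1.0, eps=1e-12):
    """W: (n_pixels, n_voxels) weighting matrix, I: recorded pixel intensities."""
    E = np.ones(W.shape[1])                      # initial voxel intensities
    for _ in range(n_iter):
        for i in range(W.shape[0]):              # loop over camera pixels
            proj = W[i] @ E + eps                # current projection of the voxel field
            E *= (I[i] / proj) ** (mu * W[i])    # multiplicative correction
    return E

rng = np.random.default_rng(5)
E_true = rng.uniform(size=50)
W = rng.uniform(size=(200, 50))
I = W @ E_true
print(np.linalg.norm(mart(W, I) - E_true) / np.linalg.norm(E_true))
```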
Unsupervised learning of structure in spectroscopic cubes
NASA Astrophysics Data System (ADS)
Araya, M.; Mendoza, M.; Solar, M.; Mardones, D.; Bayo, A.
2018-07-01
We consider the problem of analyzing the structure of spectroscopic cubes using unsupervised machine learning techniques. We propose representing the target's signal as a homogeneous set of volumes through an iterative algorithm that separates the structured emission from the background while not overestimating the flux. Besides verifying some basic theoretical properties, the algorithm is designed to be tuned by domain experts, because its parameters have meaningful values in the astronomical context. Nevertheless, we propose a heuristic to automatically estimate the signal-to-noise ratio parameter of the algorithm directly from data. The resulting lightweight set of samples (≤ 1% of the original data) offers several advantages. For instance, it is statistically correct and computationally inexpensive to apply well-established techniques from the pattern recognition and machine learning domains, such as clustering and dimensionality reduction algorithms. We use ALMA science verification data to validate our method, and present examples of the operations that can be performed by using the proposed representation. Even though this approach is focused on providing faster and better analysis tools for the end-user astronomer, it also opens the possibility of content-aware data discovery by applying our algorithm to big data.
NASA Astrophysics Data System (ADS)
Shen, Yan; Ge, Jin-ming; Zhang, Guo-qing; Yu, Wen-bin; Liu, Rui-tong; Fan, Wei; Yang, Ying-xuan
2018-01-01
This paper explores the problem of signal processing in optical current transformers (OCTs). Based on the noise characteristics of OCTs, such as overlapping signals, noise frequency bands, low signal-to-noise ratios, and difficulties in acquiring statistical features of noise power, an improved standard Kalman filtering algorithm was proposed for direct current (DC) signal processing. The state-space model of the OCT DC measurement system is first established, and then mixed noise can be processed by adding mixed noise into measurement and state parameters. According to the minimum mean squared error criterion, state predictions and update equations of the improved Kalman algorithm could be deduced based on the established model. An improved central difference Kalman filter was proposed for alternating current (AC) signal processing, which improved the sampling strategy and noise processing of colored noise. Real-time estimation and correction of noise were achieved by designing AC and DC noise recursive filters. Experimental results show that the improved signal processing algorithms had a good filtering effect on the AC and DC signals with mixed noise of OCT. Furthermore, the proposed algorithm was able to achieve real-time correction of noise during the OCT filtering process.
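A generic textbook scalar Kalman filter sketch for tracking a slowly varying DC level in noise, in the spirit of the DC processing described; the random-walk state model and the process and measurement noise values are illustrative assumptions, not the paper's improved algorithm or its OCT state-space model.

```python
# Sketch: scalar Kalman filter for a slowly drifting DC level in noise.
import numpy as np

def kalman_dc(z, q=1e-5, r=0.04):
    """z: noisy samples; returns filtered estimates of the underlying DC level."""
    x, p = z[0], 1.0                 # initial state estimate and variance
    out = []
    for meas in z:
        p += q                       # predict (random-walk state model)
        k = p / (p + r)              # Kalman gain
        x += k * (meas - x)          # update with the new measurement
        p *= (1.0 - k)
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(6)
true_dc = 1.0 + 0.001 * np.arange(2000)          # slowly drifting DC signal
z = true_dc + 0.2 * rng.normal(size=2000)
print(kalman_dc(z)[-5:])                         # tracks the drifting level
```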
Raymond L. Czaplewski
2015-01-01
Wall-to-wall remotely sensed data are increasingly available to monitor landscape dynamics over large geographic areas. However, statistical monitoring programs that use post-stratification cannot fully utilize those sensor data. The Kalman filter (KF) is an alternative statistical estimator. I develop a new KF algorithm that is numerically robust with large numbers of...
A scoring system for ascertainment of incident stroke; the Risk Index Score (RISc).
Kass-Hout, T A; Moyé, L A; Smith, M A; Morgenstern, L B
2006-01-01
The main objective of this study was to develop and validate a computer-based statistical algorithm that could be translated into a simple scoring system in order to ascertain incident stroke cases using hospital admission medical records data. The Risk Index Score (RISc) algorithm was developed using data collected prospectively by the Brain Attack Surveillance in Corpus Christi (BASIC) project, 2000. The validity of RISc was evaluated by estimating the concordance of scoring system stroke ascertainment to stroke ascertainment by physician and/or abstractor review of hospital admission records. RISc was developed on 1718 randomly selected patients (training set) and then statistically validated on an independent sample of 858 patients (validation set). A multivariable logistic model was used to develop RISc and subsequently evaluated by goodness-of-fit and receiver operating characteristic (ROC) analyses. The higher the value of RISc, the higher the patient's risk of potential stroke. The study showed RISc was well calibrated and discriminated those who had potential stroke from those that did not on initial screening. In this study we developed and validated a rapid, easy, efficient, and accurate method to ascertain incident stroke cases from routine hospital admission records for epidemiologic investigations. Validation of this scoring system was achieved statistically; however, clinical validation in a community hospital setting is warranted.
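The sketch below illustrates the general recipe described: fit a multivariable logistic model on a training set, convert it into an additive risk score, and assess discrimination on an independent validation set with ROC analysis. The synthetic predictors, sample split, and coefficients are placeholders; the actual RISc variables come from BASIC hospital admission records.

```python
# Sketch: logistic-model-based risk score with a train/validation split and ROC check.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2576                                       # ~1718 training + 858 validation
X = rng.normal(size=(n, 6))                    # stand-ins for admission-record features
logit = 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2]
y = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))   # synthetic stroke outcome

X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=1718, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
score = X_va @ model.coef_.ravel() + model.intercept_   # simple additive risk index
print("validation AUC:", roc_auc_score(y_va, score))
```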
Optimal structure and parameter learning of Ising models
Lokhov, Andrey; Vuffray, Marc Denis; Misra, Sidhant; ...
2018-03-16
Reconstruction of the structure and parameters of an Ising model from binary samples is a problem of practical importance in a variety of disciplines, ranging from statistical physics and computational biology to image processing and machine learning. The focus of the research community has shifted toward developing universal reconstruction algorithms that are both computationally efficient and require the minimal amount of expensive data. Here, we introduce a new method, interaction screening, which accurately estimates model parameters using local optimization problems. The algorithm provably achieves perfect graph structure recovery with an information-theoretically optimal number of samples, notably in the low-temperature regime, which is known to be the hardest for learning. The efficacy of interaction screening is assessed through extensive numerical tests on synthetic Ising models of various topologies with different types of interactions, as well as on real data produced by a D-Wave quantum computer. Finally, this study shows that the interaction screening method is an exact, tractable, and optimal technique that universally solves the inverse Ising problem.
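A hedged sketch of the interaction-screening idea for a single spin: draw samples from a small Ising chain by Gibbs sampling and minimize the interaction screening objective over that spin's couplings. Regularization and magnetic fields are omitted, and the tiny chain, sample counts, and coupling strength are illustrative assumptions rather than the authors' full estimator.

```python
# Sketch: interaction screening objective for one spin of a ferromagnetic Ising chain.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
N = 6
J_true = np.zeros((N, N))
for i in range(N - 1):                           # chain couplings J = 0.7
    J_true[i, i + 1] = J_true[i + 1, i] = 0.7

# Gibbs sampling: one full sweep per stored configuration, short burn-in.
s = rng.choice([-1, 1], size=N)
samples = []
for sweep in range(5200):
    for _ in range(N):
        i = rng.integers(N)
        h = J_true[i] @ s
        s[i] = 1 if rng.uniform() < 1.0 / (1.0 + np.exp(-2.0 * h)) else -1
    if sweep >= 200:
        samples.append(s.copy())
S = np.array(samples, dtype=float)

def iso(J_u, u):
    """Interaction screening objective for spin u (self-coupling J_u[u] is ignored)."""
    return np.mean(np.exp(-S[:, u] * (S @ J_u - S[:, u] * J_u[u])))

u = 2
res = minimize(iso, np.zeros(N), args=(u,))
print(np.round(res.x, 2))    # roughly 0.7 at the chain neighbours 1 and 3, ~0 elsewhere
```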
Removal of impulse noise clusters from color images with local order statistics
NASA Astrophysics Data System (ADS)
Ruchay, Alexey; Kober, Vitaly
2017-09-01
This paper proposes a novel algorithm for restoring images corrupted with clusters of impulse noise. The noise clusters often occur when the probability of impulse noise is very high. The proposed noise removal algorithm consists of detection of bulky impulse noise in three color channels with local order statistics followed by removal of the detected clusters by means of vector median filtering. With the help of computer simulation we show that the proposed algorithm is able to effectively remove clustered impulse noise. The performance of the proposed algorithm is compared in terms of image restoration metrics with that of common successful algorithms.
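A brief sketch of the vector median filtering step named above: each pixel flagged as noisy is replaced by the window pixel that minimizes the summed distance to all other pixels in its neighbourhood. The detection stage based on local order statistics is abstracted here into a given boolean mask, and the window radius and synthetic image are assumptions.

```python
# Sketch: vector median filtering of detected impulse-noise pixels in an RGB image.
import numpy as np

def vector_median_filter(img, mask, radius=1):
    """img: (H, W, 3) float array; mask: boolean array of pixels to restore."""
    out = img.copy()
    H, W, _ = img.shape
    for y, x in zip(*np.nonzero(mask)):
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        window = img[y0:y1, x0:x1].reshape(-1, 3)
        # Summed Euclidean distance from each candidate to all window vectors.
        dists = np.linalg.norm(window[:, None, :] - window[None, :, :], axis=2).sum(axis=1)
        out[y, x] = window[np.argmin(dists)]
    return out

rng = np.random.default_rng(9)
img = rng.uniform(size=(32, 32, 3))
mask = rng.uniform(size=(32, 32)) < 0.05      # pretend these pixels were detected as noisy
print(vector_median_filter(img, mask).shape)
```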
NASA Astrophysics Data System (ADS)
Fisher, B. L.; Wolff, D. B.; Silberstein, D. S.; Marks, D. M.; Pippitt, J. L.
2007-12-01
The Tropical Rainfall Measuring Mission's (TRMM) Ground Validation (GV) Program was originally established with the principal long-term goal of determining the random errors and systematic biases stemming from the application of the TRMM rainfall algorithms. The GV Program has been structured around two validation strategies: 1) determining the quantitative accuracy of the integrated monthly rainfall products at GV regional sites over large areas of about 500 km2 using integrated ground measurements and 2) evaluating the instantaneous satellite and GV rain rate statistics at spatio-temporal scales compatible with the satellite sensor resolution (Simpson et al. 1988, Thiele 1988). The GV Program has continued to evolve since the launch of the TRMM satellite on November 27, 1997. This presentation will discuss current GV methods of validating TRMM operational rain products in conjunction with ongoing research. The challenge facing TRMM GV has been how to best utilize rain information from the GV system to infer the random and systematic error characteristics of the satellite rain estimates. A fundamental problem of validating space-borne rain estimates is that the true mean areal rainfall is an ideal, scale-dependent parameter that cannot be directly measured. Empirical validation uses ground-based rain estimates to determine the error characteristics of the satellite-inferred rain estimates, but ground estimates also incur measurement errors and contribute to the error covariance. Furthermore, sampling errors, associated with the discrete, discontinuous temporal sampling by the rain sensors aboard the TRMM satellite, become statistically entangled in the monthly estimates. Sampling errors complicate the task of linking biases in the rain retrievals to the physics of the satellite algorithms. The TRMM Satellite Validation Office (TSVO) has made key progress towards effective satellite validation. For disentangling the sampling and retrieval errors, TSVO has developed and applied a methodology that statistically separates the two error sources. Using TRMM monthly estimates and high-resolution radar and gauge data, this method has been used to estimate sampling and retrieval error budgets over GV sites. More recently, a multi-year data set of instantaneous rain rates from the TRMM microwave imager (TMI), the precipitation radar (PR), and the combined algorithm was spatio-temporally matched and inter-compared to GV radar rain rates collected during satellite overpasses of select GV sites at the scale of the TMI footprint. The analysis provided a more direct probe of the satellite rain algorithms using ground data as an empirical reference. TSVO has also made significant advances in radar quality control through the development of the Relative Calibration Adjustment (RCA) technique. The RCA is currently being used to provide a long-term record of radar calibration for the radar at Kwajalein, a strategically important GV site in the tropical Pacific. The RCA technique has revealed previously undetected alterations in the radar sensitivity due to engineering changes (e.g., system modifications, antenna offsets, alterations of the receiver, or the data processor), making possible the correction of the radar rainfall measurements and ensuring the integrity of nearly a decade of TRMM GV observations and resources.
Quasi-Supervised Scoring of Human Sleep in Polysomnograms Using Augmented Input Variables
Yaghouby, Farid; Sunderam, Sridhar
2015-01-01
The limitations of manual sleep scoring make computerized methods highly desirable. Scoring errors can arise from human rater uncertainty or inter-rater variability. Sleep scoring algorithms either come as supervised classifiers that need scored samples of each state to be trained, or as unsupervised classifiers that use heuristics or structural clues in unscored data to define states. We propose a quasi-supervised classifier that models observations in an unsupervised manner but mimics a human rater wherever training scores are available. EEG, EMG, and EOG features were extracted in 30s epochs from human-scored polysomnograms recorded from 42 healthy human subjects (18 to 79 years) and archived in an anonymized, publicly accessible database. Hypnograms were modified so that: 1. Some states are scored but not others; 2. Samples of all states are scored but not for transitional epochs; and 3. Two raters with 67% agreement are simulated. A framework for quasi-supervised classification was devised in which unsupervised statistical models—specifically Gaussian mixtures and hidden Markov models—are estimated from unlabeled training data, but the training samples are augmented with variables whose values depend on available scores. Classifiers were fitted to signal features incorporating partial scores, and used to predict scores for complete recordings. Performance was assessed using Cohen's K statistic. The quasi-supervised classifier performed significantly better than an unsupervised model and sometimes as well as a completely supervised model despite receiving only partial scores. The quasi-supervised algorithm addresses the need for classifiers that mimic scoring patterns of human raters while compensating for their limitations. PMID:25679475
Quasi-supervised scoring of human sleep in polysomnograms using augmented input variables.
Yaghouby, Farid; Sunderam, Sridhar
2015-04-01
The limitations of manual sleep scoring make computerized methods highly desirable. Scoring errors can arise from human rater uncertainty or inter-rater variability. Sleep scoring algorithms either come as supervised classifiers that need scored samples of each state to be trained, or as unsupervised classifiers that use heuristics or structural clues in unscored data to define states. We propose a quasi-supervised classifier that models observations in an unsupervised manner but mimics a human rater wherever training scores are available. EEG, EMG, and EOG features were extracted in 30s epochs from human-scored polysomnograms recorded from 42 healthy human subjects (18-79 years) and archived in an anonymized, publicly accessible database. Hypnograms were modified so that: 1. Some states are scored but not others; 2. Samples of all states are scored but not for transitional epochs; and 3. Two raters with 67% agreement are simulated. A framework for quasi-supervised classification was devised in which unsupervised statistical models, specifically Gaussian mixtures and hidden Markov models, are estimated from unlabeled training data, but the training samples are augmented with variables whose values depend on available scores. Classifiers were fitted to signal features incorporating partial scores, and used to predict scores for complete recordings. Performance was assessed using Cohen's Κ statistic. The quasi-supervised classifier performed significantly better than an unsupervised model and sometimes as well as a completely supervised model despite receiving only partial scores. The quasi-supervised algorithm addresses the need for classifiers that mimic scoring patterns of human raters while compensating for their limitations. Copyright © 2015 Elsevier Ltd. All rights reserved.
Big Data Analytics for Scanning Transmission Electron Microscopy Ptychography
NASA Astrophysics Data System (ADS)
Jesse, S.; Chi, M.; Belianinov, A.; Beekman, C.; Kalinin, S. V.; Borisevich, A. Y.; Lupini, A. R.
2016-05-01
Electron microscopy is undergoing a transition; from the model of producing only a few micrographs, through the current state where many images and spectra can be digitally recorded, to a new mode where very large volumes of data (movies, ptychographic and multi-dimensional series) can be rapidly obtained. Here, we discuss the application of so-called “big-data” methods to high dimensional microscopy data, using unsupervised multivariate statistical techniques, in order to explore salient image features in a specific example of BiFeO3 domains. Remarkably, k-means clustering reveals domain differentiation despite the fact that the algorithm is purely statistical in nature and does not require any prior information regarding the material, any coexisting phases, or any differentiating structures. While this is a somewhat trivial case, this example signifies the extraction of useful physical and structural information without any prior bias regarding the sample or the instrumental modality. Further interpretation of these types of results may still require human intervention. However, the open nature of this algorithm and its wide availability, enable broad collaborations and exploratory work necessary to enable efficient data analysis in electron microscopy.
Statistical efficiency of adaptive algorithms.
Widrow, Bernard; Kamenetsky, Max
2003-01-01
The statistical efficiency of a learning algorithm applied to the adaptation of a given set of variable weights is defined as the ratio of the quality of the converged solution to the amount of data used in training the weights. Statistical efficiency is computed by averaging over an ensemble of learning experiences. A high quality solution is very close to optimal, while a low quality solution corresponds to noisy weights and less than optimal performance. In this work, two gradient descent adaptive algorithms are compared, the LMS algorithm and the LMS/Newton algorithm. LMS is simple and practical, and is used in many applications worldwide. LMS/Newton is based on Newton's method and the LMS algorithm. LMS/Newton is optimal in the least squares sense. It maximizes the quality of its adaptive solution while minimizing the use of training data. Many least squares adaptive algorithms have been devised over the years, but no other least squares algorithm can give better performance, on average, than LMS/Newton. LMS is easily implemented, but LMS/Newton, although of great mathematical interest, cannot be implemented in most practical applications. Because of its optimality, LMS/Newton serves as a benchmark for all least squares adaptive algorithms. The performances of LMS and LMS/Newton are compared, and it is found that under many circumstances, both algorithms provide equal performance. For example, when both algorithms are tested with statistically nonstationary input signals, their average performances are equal. When adapting with stationary input signals and with random initial conditions, their respective learning times are on average equal. However, under worst-case initial conditions, the learning time of LMS can be much greater than that of LMS/Newton, and this is the principal disadvantage of the LMS algorithm. But the strong points of LMS are ease of implementation and optimal performance under important practical conditions. For these reasons, the LMS algorithm has enjoyed very widespread application. It is used in almost every modem for channel equalization and echo cancelling. Furthermore, it is related to the famous backpropagation algorithm used for training neural networks.
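For concreteness, a minimal LMS sketch in a common system-identification setting: the weights move along the negative instantaneous gradient of the squared error. The unknown filter, step size, and noise level are illustrative assumptions, not the experiments reported in the paper, and the LMS/Newton benchmark is not shown.

```python
# Sketch: LMS adaptive filter identifying an unknown 4-tap FIR system.
import numpy as np

rng = np.random.default_rng(10)
w_true = np.array([0.8, -0.4, 0.2, 0.1])       # unknown system to identify
n_taps, n_samples, mu = 4, 5000, 0.01
x = rng.normal(size=n_samples)
d = np.convolve(x, w_true, mode="full")[:n_samples] + 0.05 * rng.normal(size=n_samples)

w = np.zeros(n_taps)
for k in range(n_taps, n_samples):
    u = x[k - n_taps + 1:k + 1][::-1]          # most recent inputs, newest first
    e = d[k] - w @ u                           # instantaneous error
    w += mu * e * u                            # LMS weight update
print(np.round(w, 3))                          # approaches w_true
```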
The Evolution of Random Number Generation in MUVES
2017-01-01
Only fragments of this record's abstract were captured: ...the mathematical basis and statistical justification for algorithms used in the code. The working code provided produces results identical to the current... ...questionable numerical and statistical properties. The development of the modern system is traced through software change requests, resulting in a random number...
Estimating the size of the solution space of metabolic networks
Braunstein, Alfredo; Mulet, Roberto; Pagnani, Andrea
2008-01-01
Background Cellular metabolism is one of the most investigated systems of biological interactions. While the topological nature of individual reactions and pathways in the network is quite well understood, there is still a lack of comprehension regarding the global functional behavior of the system. In the last few years flux-balance analysis (FBA) has been the most successful and widely used technique for studying metabolism at the system level. This method strongly relies on the hypothesis that the organism maximizes an objective function. However, only under very specific biological conditions (e.g. maximization of biomass for E. coli in a rich nutrient medium) does the cell seem to obey such an optimization law. A more refined analysis not assuming extremization remains an elusive task for large metabolic systems due to algorithmic limitations. Results In this work we propose a novel algorithmic strategy that provides an efficient characterization of the whole set of stable fluxes compatible with the metabolic constraints. Using a technique derived from the fields of statistical physics and information theory, we designed a message-passing algorithm to estimate the size of the affine space containing all possible steady-state flux distributions of metabolic networks. The algorithm, based on the well known Bethe approximation, can be used to approximately compute the volume of a non full-dimensional convex polytope in high dimensions. We first compare the accuracy of the predictions with an exact algorithm on small random metabolic networks. We also verify that the predictions of the algorithm match closely those of Monte Carlo based methods in the case of the Red Blood Cell metabolic network. Then we test the effect of gene knock-outs on the size of the solution space in the case of E. coli central metabolism. Finally we analyze the statistical properties of the average fluxes of the reactions in the E. coli metabolic network. Conclusion We propose a novel efficient distributed algorithmic strategy to estimate the size and shape of the affine space of a non full-dimensional convex polytope in high dimensions. The method is shown to obtain results quantitatively and qualitatively compatible with those of standard algorithms (where this comparison is possible) while remaining efficient for the analysis of large biological systems, where exact deterministic methods experience an explosion in algorithmic time. The algorithm we propose can be considered as an alternative to Monte Carlo sampling methods. PMID:18489757
Maximum entropy models of ecosystem functioning
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bertram, Jason, E-mail: jason.bertram@anu.edu.au
2014-12-05
Using organism-level traits to deduce community-level relationships is a fundamental problem in theoretical ecology. This problem parallels the physical one of using particle properties to deduce macroscopic thermodynamic laws, which was successfully achieved with the development of statistical physics. Drawing on this parallel, theoretical ecologists from Lotka onwards have attempted to construct statistical mechanistic theories of ecosystem functioning. Jaynes’ broader interpretation of statistical mechanics, which hinges on the entropy maximisation algorithm (MaxEnt), is of central importance here because the classical foundations of statistical physics do not have clear ecological analogues (e.g. phase space, dynamical invariants). However, models based on the information theoretic interpretation of MaxEnt are difficult to interpret ecologically. Here I give a broad discussion of statistical mechanical models of ecosystem functioning and the application of MaxEnt in these models. Emphasising the sample frequency interpretation of MaxEnt, I show that MaxEnt can be used to construct models of ecosystem functioning which are statistical mechanical in the traditional sense using a savanna plant ecology model as an example.
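A small sketch of MaxEnt in its sample-frequency reading: find the distribution over a handful of discrete states that maximizes entropy subject to a mean constraint, using the standard exponential form with a Lagrange multiplier chosen to match the constraint. The state space and constraint value are arbitrary illustrations, not the savanna plant ecology model discussed above.

```python
# Sketch: maximum-entropy distribution under a single mean constraint.
import numpy as np
from scipy.optimize import brentq

f = np.arange(1, 7)            # a "trait" value for each of six states
target_mean = 4.5              # constrained community-level average

def mean_at(lam):
    p = np.exp(-lam * f)
    p /= p.sum()
    return p @ f

# Solve for the Lagrange multiplier that reproduces the constrained mean.
lam = brentq(lambda l: mean_at(l) - target_mean, -10, 10)
p = np.exp(-lam * f); p /= p.sum()
print("lambda:", round(lam, 3), "MaxEnt distribution:", np.round(p, 3))
```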
2017-06-01
Only front-matter residue (a list of tables) was captured for this record: Table 2.1, training time statistics from Jones' thesis; Table 2.2, evaluation runtime statistics from Camp's thesis for a single image; Table 2.3, training and evaluation runtime statistics from Sharpe's thesis; Table 2.4, Sharpe's screenshot detector results for combinations of...; plus a fragment on the training resources available and time required for each algorithm Jones [15] tested.
Mukhopadhyay, Nitai D; Sampson, Andrew J; Deniz, Daniel; Alm Carlsson, Gudrun; Williamson, Jeffrey; Malusek, Alexandr
2012-01-01
Correlated sampling Monte Carlo methods can shorten computing times in brachytherapy treatment planning. Monte Carlo efficiency is typically estimated via efficiency gain, defined as the reduction in computing time by correlated sampling relative to conventional Monte Carlo methods when equal statistical uncertainties have been achieved. The determination of the efficiency gain uncertainty arising from random effects, however, is not a straightforward task, especially when the error distribution is non-normal. The purpose of this study is to evaluate the applicability of the F distribution and standardized uncertainty propagation methods (widely used in metrology to estimate uncertainty of physical measurements) for predicting confidence intervals about efficiency gain estimates derived from single Monte Carlo runs using fixed-collision correlated sampling in a simplified brachytherapy geometry. A bootstrap based algorithm was used to simulate the probability distribution of the efficiency gain estimates and the shortest 95% confidence interval was estimated from this distribution. It was found that the corresponding relative uncertainty was as large as 37% for this particular problem. The uncertainty propagation framework predicted confidence intervals reasonably well; however its main disadvantage was that uncertainties of input quantities had to be calculated in a separate run via a Monte Carlo method. The F distribution noticeably underestimated the confidence interval. These discrepancies were influenced by several photons with large statistical weights which made extremely large contributions to the scored absorbed dose difference. The mechanism of acquiring high statistical weights in the fixed-collision correlated sampling method was explained and a mitigation strategy was proposed. Copyright © 2011 Elsevier Ltd. All rights reserved.
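A hedged sketch of the bootstrap idea: resample per-history tallies of the correlated and conventional estimators, recompute the efficiency gain for each resample, and report a percentile interval. The synthetic tallies, the simplified gain definition (variance ratio times an assumed CPU-time ratio), and the resample count are placeholders for the actual brachytherapy Monte Carlo output and the shortest-interval construction used in the paper.

```python
# Sketch: percentile bootstrap confidence interval for a Monte Carlo efficiency gain.
import numpy as np

rng = np.random.default_rng(11)
corr = rng.normal(1.0, 0.2, 10000)             # correlated-sampling tallies (toy)
conv = rng.normal(1.0, 0.8, 10000)             # conventional tallies (toy)
time_ratio = 1.3                               # conventional / correlated CPU time (assumed)

def efficiency_gain(a, b):
    # Gain ~ (variance ratio) * (time ratio) for equal numbers of histories.
    return time_ratio * np.var(b) / np.var(a)

boot = []
for _ in range(2000):
    ia = rng.integers(0, len(corr), len(corr))
    ib = rng.integers(0, len(conv), len(conv))
    boot.append(efficiency_gain(corr[ia], conv[ib]))

print("gain:", efficiency_gain(corr, conv),
      "95% CI:", np.percentile(boot, [2.5, 97.5]))
```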
NASA Astrophysics Data System (ADS)
Sreejith, Sreevarsha; Pereverzyev, Sergiy, Jr.; Kelvin, Lee S.; Marleau, Francine R.; Haltmeier, Markus; Ebner, Judith; Bland-Hawthorn, Joss; Driver, Simon P.; Graham, Alister W.; Holwerda, Benne W.; Hopkins, Andrew M.; Liske, Jochen; Loveday, Jon; Moffett, Amanda J.; Pimbblet, Kevin A.; Taylor, Edward N.; Wang, Lingyu; Wright, Angus H.
2018-03-01
We apply four statistical learning methods to a sample of 7941 galaxies (z < 0.06) from the Galaxy And Mass Assembly survey to test the feasibility of using automated algorithms to classify galaxies. Using 10 features measured for each galaxy (sizes, colours, shape parameters, and stellar mass), we apply the techniques of Support Vector Machines, Classification Trees, Classification Trees with Random Forest (CTRF) and Neural Networks, returning True Prediction Ratios (TPRs) of 75.8 per cent, 69.0 per cent, 76.2 per cent, and 76.0 per cent, respectively. Those occasions whereby all four algorithms agree with each other yet disagree with the visual classification (`unanimous disagreement') serve as a potential indicator of human error in classification, occurring in ˜ 9 per cent of ellipticals, ˜ 9 per cent of little blue spheroids, ˜ 14 per cent of early-type spirals, ˜ 21 per cent of intermediate-type spirals, and ˜ 4 per cent of late-type spirals and irregulars. We observe that the choice of parameters rather than that of algorithms is more crucial in determining classification accuracy. Due to its simplicity in formulation and implementation, we recommend the CTRF algorithm for classifying future galaxy data sets. Adopting the CTRF algorithm, the TPRs of the five galaxy types are: E, 70.1 per cent; LBS, 75.6 per cent; S0-Sa, 63.6 per cent; Sab-Scd, 56.4 per cent; and Sd-Irr, 88.9 per cent. Further, we train a binary classifier using this CTRF algorithm that divides galaxies into spheroid-dominated (E, LBS, and S0-Sa) and disc-dominated (Sab-Scd and Sd-Irr), achieving an overall accuracy of 89.8 per cent. This translates into an accuracy of 84.9 per cent for spheroid-dominated systems and 92.5 per cent for disc-dominated systems.
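A brief sketch of this style of automated classification using a random-forest classifier as a stand-in for the CTRF method; the synthetic features and five toy classes are placeholders for the ten measured GAMA features and the visual morphology labels, so the reported accuracy has no bearing on the TPRs above.

```python
# Sketch: random-forest classification of galaxies into five classes (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(14)
n = 1000
X = rng.normal(size=(n, 10))                     # stand-ins for the ten features
# Toy labels loosely driven by two of the features (e.g. a colour and a shape index).
score = 1.5 * X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n)
y = np.digitize(score, [-1.5, -0.5, 0.5, 1.5])   # five classes, 0..4

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```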
Optical Algorithms at Satellite Wavelengths for Total Suspended Matter in Tropical Coastal Waters.
Ouillon, Sylvain; Douillet, Pascal; Petrenko, Anne; Neveux, Jacques; Dupouy, Cécile; Froidefond, Jean-Marie; Andréfouët, Serge; Muñoz-Caravaca, Alain
2008-07-10
Is it possible to accurately derive Total Suspended Matter concentration, or its proxy, turbidity, from remote sensing data in tropical coastal lagoon waters? To investigate this question, hyperspectral remote sensing reflectance, turbidity and chlorophyll pigment concentration were measured in three coral reef lagoons. The three sites enabled us to collect data over very diverse environments: oligotrophic and sediment-poor waters in the southwest lagoon of New Caledonia, eutrophic waters in the Cienfuegos Bay (Cuba), and sediment-rich waters in the Laucala Bay (Fiji). In this paper, optical algorithms for turbidity are presented per site, based on 113 stations in New Caledonia, 24 stations in Cuba and 56 stations in Fiji. Empirical algorithms are tested at satellite wavebands useful to coastal applications. Global algorithms are also derived for the merged data set (193 stations). The performances of global and local regression algorithms are compared. The best one-band algorithms over all the measurements are obtained at 681 nm using either a polynomial or a power model. The best two-band algorithms are obtained with R412/R620, R443/R670 and R510/R681. Two three-band algorithms based on Rrs620.Rrs681/Rrs412 and Rrs620.Rrs681/Rrs510 also give fair regression statistics. Finally, we propose a global algorithm based on one or three bands: turbidity is first calculated from Rrs681 and then, if < 1 FTU, it is recalculated using an algorithm based on Rrs620.Rrs681/Rrs412. On our data set, this algorithm is suitable for the 0.2-25 FTU turbidity range and for the three sites sampled (mean bias: 3.6%, rms: 35%, mean quadratic error: 1.4 FTU). This shows that defining global empirical turbidity algorithms in tropical coastal waters is within reach.
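As an illustration of fitting an empirical one-band algorithm of the kind described, the sketch below fits a power-law relation between turbidity and Rrs(681) to synthetic points; the coefficients, noise model, and reflectance range are assumptions, not the paper's per-site or global fits.

```python
# Sketch: power-law fit of turbidity against a single reflectance band.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(12)
rrs681 = rng.uniform(0.001, 0.02, 150)                  # sr^-1 (toy values)
turb_true = 900.0 * rrs681**1.1                         # assumed underlying relation
turb_obs = turb_true * rng.lognormal(0.0, 0.15, 150)    # multiplicative noise

power = lambda r, a, b: a * r**b
(a, b), _ = curve_fit(power, rrs681, turb_obs, p0=[500.0, 1.0])
print(f"turbidity ≈ {a:.0f} * Rrs(681)^{b:.2f}")
```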
Calibrated Noise Measurements with Induced Receiver Gain Fluctuations
NASA Technical Reports Server (NTRS)
Racette, Paul; Walker, David; Gu, Dazhen; Rajola, Marco; Spevacek, Ashly
2011-01-01
The lack of well-developed techniques for modeling changing statistical moments in our observations has stymied the application of stochastic process theory in science and engineering. These limitations were encountered when modeling the performance of radiometer calibration architectures and algorithms in the presence of non-stationary receiver fluctuations. Analyses of measured signals have traditionally been limited to a single measurement series. In a radiometer that samples a set of noise references, however, the data collection can be treated as an ensemble set of measurements of the receiver state. Noise Assisted Data Analysis (NADA) is a growing field of study with significant potential for aiding the understanding and modeling of non-stationary processes. Typically, NADA entails adding noise to a signal to produce an ensemble set on which statistical analysis is performed. Alternatively, as in radiometric measurements, mixing a signal with calibrated noise provides, through the calibration process, the means to detect deviations from the stationary assumption and thereby a measurement tool to characterize the signal's non-stationary properties. Data sets comprised of calibrated noise measurements have been limited to those collected with naturally occurring fluctuations in the radiometer receiver. To examine the application of NADA using calibrated noise, a Receiver Gain Modulation Circuit (RGMC) was designed and built to modulate the gain of a radiometer receiver using an external signal. In 2010, an RGMC was installed and operated at the National Institute of Standards and Technology (NIST) using their Noise Figure Radiometer (NFRad) and national standard noise references. The data collected are the first known set of calibrated noise measurements from a receiver with an externally modulated gain. As an initial step, sinusoidal and step-function signals were used to modulate the receiver gain, to evaluate the circuit characteristics and to study the performance of a variety of calibration algorithms. The receiver noise temperature and time-bandwidth product of the NFRad are calculated from the data. Statistical analysis using temporal-dependent calibration algorithms reveals that the naturally occurring fluctuations in the receiver are stationary over long intervals (hundreds of seconds); however, the receiver exhibits local non-stationarity over the interval during which one set of reference measurements is collected. A variety of calibration algorithms have been applied to the data to assess the algorithms' performance with the gain fluctuation signals. This presentation will describe the RGMC, the experiment design, and a comparative analysis of calibration algorithms.
Ramón, M; Martínez-Pastor, F
2018-04-23
Computer-aided sperm analysis (CASA) produces a wealth of data that is frequently ignored. The use of multiparametric statistical methods can help explore these datasets, unveiling the subpopulation structure of sperm samples. In this review we analyse the significance and relevance of the internal heterogeneity of sperm samples. We also provide a brief description of the statistical tools used for extracting sperm subpopulations from the datasets, namely unsupervised clustering (with non-hierarchical, hierarchical and two-step methods) and the most advanced supervised methods, based on machine learning. The former have allowed exploration of subpopulation patterns in many species, whereas the latter offer further possibilities, especially for functional studies and the practical use of subpopulation analysis. We also consider novel approaches, such as the use of geometric morphometrics or imaging flow cytometry. Finally, although the data provided by CASA systems yield valuable information on sperm samples when clustering analyses are applied, there are several caveats. Protocols for capturing and analysing motility or morphometry should be standardised and adapted to each experiment, and the algorithms should be open in order to allow comparison of results between laboratories. Moreover, we must be aware of new technology that could change the paradigm for studying sperm motility and morphology.
A new algorithm for attitude-independent magnetometer calibration
NASA Technical Reports Server (NTRS)
Alonso, Roberto; Shuster, Malcolm D.
1994-01-01
A new algorithm is developed for inflight magnetometer bias determination without knowledge of the attitude. This algorithm combines the fast convergence of a heuristic algorithm currently in use with a correct treatment of the statistics, and it does so without discarding data. The algorithm's performance is examined using simulated data and compared with previous algorithms.
Keshavarz, M; Mojra, A
2015-05-01
Geometrical features of a cancerous tumor embedded in biological soft tissue, including tumor size and depth, are necessary for follow-up procedures and for making suitable therapeutic decisions. In this paper, a new socio-politically motivated global search strategy, the imperialist competitive algorithm (ICA), is implemented to train a feedforward neural network (FFNN) to estimate the tumor's geometrical characteristics (FFNNICA). First, a viscoelastic model of liver tissue is constructed by using a series of in vitro uniaxial and relaxation test data. Then, 163 samples of the tissue, each including a tumor with a different depth and diameter, are generated by using Python scripts to link ABAQUS and MATLAB. Next, the samples are divided into 123 training samples and 40 testing samples. The training inputs of the network are mechanical parameters extracted from palpation of the tissue through a developing noninvasive technology called artificial tactile sensing (ATS). Last, to evaluate the FFNNICA performance, the outputs of the network, tumor depth and diameter, are compared with the desired values for both the training and testing datasets. Deviations of the outputs from the desired values are calculated by regression analysis. Statistical analysis is also performed by measuring Root Mean Square Error (RMSE) and Efficiency (E). The RMSE in diameter and depth estimation is 0.50 mm and 1.49, respectively, for the testing dataset. The results affirm that the proposed optimization algorithm for training the neural network can characterize soft-tissue tumors accurately through an artificial palpation approach. Copyright © 2015 John Wiley & Sons, Ltd.
Quantitative imaging biomarkers: a review of statistical methods for computer algorithm comparisons.
Obuchowski, Nancy A; Reeves, Anthony P; Huang, Erich P; Wang, Xiao-Feng; Buckler, Andrew J; Kim, Hyun J Grace; Barnhart, Huiman X; Jackson, Edward F; Giger, Maryellen L; Pennello, Gene; Toledano, Alicia Y; Kalpathy-Cramer, Jayashree; Apanasovich, Tatiyana V; Kinahan, Paul E; Myers, Kyle J; Goldgof, Dmitry B; Barboriak, Daniel P; Gillies, Robert J; Schwartz, Lawrence H; Sullivan, Daniel C
2015-02-01
Quantitative biomarkers from medical images are becoming important tools for clinical diagnosis, staging, monitoring, treatment planning, and development of new therapies. While there is a rich history of the development of quantitative imaging biomarker (QIB) techniques, little attention has been paid to the validation and comparison of the computer algorithms that implement the QIB measurements. In this paper we provide a framework for QIB algorithm comparisons. We first review and compare various study designs, including designs with the true value (e.g. phantoms, digital reference images, and zero-change studies), designs with a reference standard (e.g. studies testing equivalence with a reference standard), and designs without a reference standard (e.g. agreement studies and studies of algorithm precision). The statistical methods for comparing QIB algorithms are then presented for various study types using both aggregate and disaggregate approaches. We propose a series of steps for establishing the performance of a QIB algorithm, identify limitations in the current statistical literature, and suggest future directions for research. © The Author(s) 2014 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav.
Risović, Dubravko; Pavlović, Zivko
2013-01-01
Processing of gray-scale images in order to determine the corresponding fractal dimension is very important owing to the widespread use of imaging technologies and the application of fractal analysis in many areas of science, technology, and medicine. To this end, many methods for estimation of fractal dimension from gray-scale images have been developed and are routinely used. Unfortunately, different methods (dimension estimators) often yield significantly different results in a manner that makes interpretation difficult. Here, we report the results of a comparative assessment of the performance of several of the most frequently used algorithms/methods for estimation of fractal dimension. For that purpose, we used scanning electron microscope images of aluminum oxide surfaces with different fractal dimensions. The performance of the algorithms/methods was evaluated using the statistical Z-score approach. The differences between the performances of six methods are discussed and further compared with results obtained by electrochemical impedance spectroscopy (EIS) on the same samples. The analysis of the results shows that the performance of the investigated algorithms varies considerably and that systematically erroneous fractal dimensions can be estimated using certain methods. The differential cube counting, triangulation, and box counting algorithms showed satisfactory performance over the whole investigated range of fractal dimensions. The Difference statistic proved less reliable, generating 4% unsatisfactory results. The performances of the Power spectrum, Partitioning, and EIS methods were unsatisfactory in 29%, 38%, and 75% of estimations, respectively. The results of this study should be useful and provide guidelines to researchers using or attempting fractal analysis of images obtained by scanning microscopy or atomic force microscopy. © Wiley Periodicals, Inc.
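For readers unfamiliar with the estimators being compared, the sketch below shows the basic box-counting idea on a synthetic binary image; it is a simplified stand-in (binary rather than gray-scale, invented data), not the implementation evaluated in the study.

import numpy as np

def box_count_dimension(mask, sizes=(2, 4, 8, 16, 32)):
    # Estimate fractal dimension from the slope of log N(s) versus log(1/s).
    counts = []
    for s in sizes:
        h, w = (mask.shape[0] // s) * s, (mask.shape[1] // s) * s   # trim to tile exactly
        tiles = mask[:h, :w].reshape(h // s, s, w // s, s)
        counts.append(tiles.any(axis=(1, 3)).sum())                 # boxes containing foreground
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope

rng = np.random.default_rng(1)
demo = rng.random((256, 256)) > 0.5             # synthetic binary image for illustration
print("estimated dimension:", round(box_count_dimension(demo), 2))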
Understand your Algorithm: Drill Down to Sample Visualizations in Jupyter Notebooks
NASA Astrophysics Data System (ADS)
Mapes, B. E.; Ho, Y.; Cheedela, S. K.; McWhirter, J.
2017-12-01
Statistics are the currency of climate dynamics, but the space of all possible algorithms is fathomless - especially for 4-dimensional weather-resolving data that many "impact" variables depend on. Algorithms are designed on data samples, but how do you know if they measure what you expect when turned loose on Big Data? We will introduce the year-1 prototype of a 3-year scientist-led, NSF-supported, Unidata-quality software stack called DRILSDOWN (https://brianmapes.github.io/EarthCube-DRILSDOWN/) for automatically extracting, integrating, and visualizing multivariate 4D data samples. Based on a customizable "IDV bundle" of data sources, fields and displays supplied by the user, the system will teleport its space-time coordinates to fetch Cases of Interest (edge cases, typical cases, etc.) from large aggregated repositories. These standard displays can serve as backdrops to overlay with your value-added fields (such as derived quantities stored on a user's local disk). Fields can be readily pulled out of the visualization object for further processing in Python. The hope is that algorithms successfully tested in this visualization space will then be lifted out and added to automatic processing toolchains, lending confidence in the next round of processing, to seek the next Cases of Interest, in light of a user's statistical measures of "Interest". To log the scientific work done in this vein, the visualizations are wrapped in iPython-based Jupyter notebooks for rich, human-readable documentation (indeed, quasi-publication with formatted text, LaTex math, etc.). Such notebooks are readable and executable, with digital replicability and provenance built in. The entire digital object of a case study can be stored in a repository, where libraries of these Case Study Notebooks can be examined in a browser. Model data (the session topic) are of course especially convenient for this system, but observations of all sorts can also be brought in, overlain, and differenced or otherwise co-processed. The system is available in various tiers, from minimal-install GUI visualizations only, to GUI+Notebook system, to the full system with the repository software. We seek interested users, initially in a "beta tester" mode with the goodwill to offer reports and requests to help drive improvements in project years 2 and 3.
NASA Technical Reports Server (NTRS)
Xiang, Xuwu; Smith, Eric A.; Tripoli, Gregory J.
1992-01-01
A hybrid statistical-physical retrieval scheme is explored which combines a statistical approach with an approach based on the development of cloud-radiation models designed to simulate precipitating atmospheres. The algorithm employs the detailed microphysical information from a cloud model as input to a radiative transfer model which generates a cloud-radiation model database. Statistical procedures are then invoked to objectively generate an initial guess composite profile data set from the database. The retrieval algorithm has been tested for a tropical typhoon case using Special Sensor Microwave/Imager (SSM/I) data and has shown satisfactory results.
Majumdar, Satya N
2003-08-01
We use the traveling front approach to derive exact asymptotic results for the statistics of the number of particles in a class of directed diffusion-limited aggregation models on a Cayley tree. We point out that some aspects of these models are closely connected to two different problems in computer science, namely, the digital search tree problem in data structures and the Lempel-Ziv algorithm for data compression. The statistics of the number of particles studied here is related to the statistics of height in digital search trees which, in turn, is related to the statistics of the length of the longest word formed by the Lempel-Ziv algorithm. Implications of our results to these computer science problems are pointed out.
NASA Astrophysics Data System (ADS)
Majumdar, Satya N.
2003-08-01
We use the traveling front approach to derive exact asymptotic results for the statistics of the number of particles in a class of directed diffusion-limited aggregation models on a Cayley tree. We point out that some aspects of these models are closely connected to two different problems in computer science, namely, the digital search tree problem in data structures and the Lempel-Ziv algorithm for data compression. The statistics of the number of particles studied here is related to the statistics of height in digital search trees which, in turn, is related to the statistics of the length of the longest word formed by the Lempel-Ziv algorithm. Implications of our results to these computer science problems are pointed out.
USDA-ARS?s Scientific Manuscript database
Tillage management practices have direct impact on water holding capacity, evaporation, carbon sequestration, and water quality. This study examines the feasibility of two statistical learning algorithms, such as Least Square Support Vector Machine (LSSVM) and Relevance Vector Machine (RVM), for cla...
Carvalho, Gustavo A; Minnett, Peter J; Fleming, Lora E; Banzon, Viva F; Baringer, Warner
2010-06-01
In a continuing effort to develop suitable methods for the surveillance of Harmful Algal Blooms (HABs) of Karenia brevis using satellite radiometers, a new multi-algorithm method was developed to explore whether improvements in the remote sensing detection of the Florida Red Tide were possible. A Hybrid Scheme was introduced that sequentially applies optimized versions of two pre-existing satellite-based algorithms: an Empirical Approach (using water-leaving radiance as a function of chlorophyll concentration) and a Bio-optical Technique (using particulate backscatter along with chlorophyll concentration). The long-term evaluation of the new multi-algorithm method was performed using a multi-year MODIS dataset (2002 to 2006, during the boreal Summer-Fall periods, July to December) along the Central West Florida Shelf between 25.75°N and 28.25°N. Algorithm validation was done with in situ measurements of the abundance of K. brevis; cell counts ≥1.5×10⁴ cells l⁻¹ defined a detectable HAB. Encouraging statistical results were obtained when either or both algorithms correctly flagged known samples. The majority of the valid match-ups were correctly identified (~80% of both HAB and non-blooming conditions) and few false negatives or false positives were produced (~20% of each). Additionally, most of the HAB-positive identifications in the satellite data were indeed HAB samples (positive predictive value: ~70%) and those classified as HAB-negative were almost all non-bloom cases (negative predictive value: ~86%). These results demonstrate an excellent detection capability, on average ~10% more accurate than the individual algorithms used separately. Thus, the new Hybrid Scheme could become a powerful tool for environmental monitoring of K. brevis blooms, with valuable consequences including the more rapid and efficient use of ships to make in situ measurements of HABs.
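The validation statistics quoted above follow directly from a 2x2 contingency table of satellite flags against in situ truth. The sketch below computes them for made-up match-ups; the counts and flags are hypothetical, not the MODIS data of the study.

import numpy as np

def detection_metrics(truth, flagged):
    truth, flagged = np.asarray(truth, bool), np.asarray(flagged, bool)
    tp = np.sum(truth & flagged)
    tn = np.sum(~truth & ~flagged)
    fp = np.sum(~truth & flagged)
    fn = np.sum(truth & ~flagged)
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn)}

# Hypothetical match-ups: counts >= 1.5e4 cells per litre define a detectable HAB
counts = np.array([2e3, 5e4, 1e5, 8e3, 3e4, 1e3, 2e5, 4e3])
flags = np.array([False, True, True, False, False, False, True, True])
print(detection_metrics(counts >= 1.5e4, flags))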
Carvalho, Gustavo A.; Minnett, Peter J.; Fleming, Lora E.; Banzon, Viva F.; Baringer, Warner
2010-01-01
In a continuing effort to develop suitable methods for the surveillance of Harmful Algal Blooms (HABs) of Karenia brevis using satellite radiometers, a new multi-algorithm method was developed to explore whether improvements in the remote sensing detection of the Florida Red Tide were possible. A Hybrid Scheme was introduced that sequentially applies optimized versions of two pre-existing satellite-based algorithms: an Empirical Approach (using water-leaving radiance as a function of chlorophyll concentration) and a Bio-optical Technique (using particulate backscatter along with chlorophyll concentration). The long-term evaluation of the new multi-algorithm method was performed using a multi-year MODIS dataset (2002 to 2006, during the boreal Summer-Fall periods, July to December) along the Central West Florida Shelf between 25.75°N and 28.25°N. Algorithm validation was done with in situ measurements of the abundance of K. brevis; cell counts ≥1.5×10⁴ cells l⁻¹ defined a detectable HAB. Encouraging statistical results were obtained when either or both algorithms correctly flagged known samples. The majority of the valid match-ups were correctly identified (~80% of both HAB and non-blooming conditions) and few false negatives or false positives were produced (~20% of each). Additionally, most of the HAB-positive identifications in the satellite data were indeed HAB samples (positive predictive value: ~70%) and those classified as HAB-negative were almost all non-bloom cases (negative predictive value: ~86%). These results demonstrate an excellent detection capability, on average ~10% more accurate than the individual algorithms used separately. Thus, the new Hybrid Scheme could become a powerful tool for environmental monitoring of K. brevis blooms, with valuable consequences including the more rapid and efficient use of ships to make in situ measurements of HABs. PMID:21037979
An Adaptive Buddy Check for Observational Quality Control
NASA Technical Reports Server (NTRS)
Dee, Dick P.; Rukhovets, Leonid; Todling, Ricardo; DaSilva, Arlindo M.; Larson, Jay W.; Einaudi, Franco (Technical Monitor)
2000-01-01
An adaptive buddy check algorithm is presented that adjusts tolerances for outlier observations based on the variability of surrounding data. The algorithm derives from a statistical hypothesis test combined with maximum-likelihood covariance estimation. Its stability is shown to depend on the initial identification of outliers by a simple background check. The adaptive feature ensures that the final quality control decisions are not very sensitive to prescribed statistics of first-guess and observation errors, nor to other approximations introduced into the algorithm. The implementation of the algorithm in a global atmospheric data assimilation system is described. Its performance is contrasted with that of a non-adaptive buddy check for the surface analysis of an extreme storm that took place in Europe on 27 December 1999. The adaptive algorithm allowed the inclusion of many important observations that differed greatly from the first guess and that would have been excluded on the basis of prescribed statistics. The analysis of the storm development was much improved as a result of these additional observations.
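A schematic of the buddy-check idea (not the paper's maximum-likelihood formulation) is sketched below: an observation-minus-background residual is flagged when it departs from the mean of nearby "buddies" by more than a tolerance scaled by the local spread. The radius and factor k are arbitrary choices for illustration.

import numpy as np

def adaptive_buddy_check(residuals, positions, radius=1.0, k=3.0):
    residuals = np.asarray(residuals, float)
    positions = np.asarray(positions, float)
    suspect = np.zeros(residuals.size, dtype=bool)
    for i in range(residuals.size):
        d = np.linalg.norm(positions - positions[i], axis=1)
        buddies = (d > 0) & (d <= radius)
        if buddies.sum() < 2:
            continue                                    # too few buddies to judge
        local_mean = residuals[buddies].mean()
        local_std = residuals[buddies].std(ddof=1)
        tol = k * max(local_std, 1e-6)                  # tolerance adapts to local variability
        suspect[i] = abs(residuals[i] - local_mean) > tol
    return suspect

rng = np.random.default_rng(2)
pos = rng.random((50, 2))
res = rng.normal(0.0, 1.0, 50)
res[7] += 8.0                                           # one grossly erroneous observation
print("suspect indices:", np.where(adaptive_buddy_check(res, pos, radius=0.3))[0])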
Implementation of Dynamic Extensible Adaptive Locally Exchangeable Measures (IDEALEM) v 0.1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sim, Alex; Lee, Dongeun; Wu, K. John
2016-03-04
Handling large streaming data is essential for various applications such as network traffic analysis, social networks, energy cost trends, and environmental modeling. However, it is in general intractable to store, compute, search, and retrieve large streaming data. This software addresses a fundamental issue: reducing the size of large streaming data while still obtaining accurate statistical analysis. As an example, when a high-speed network such as a 100 Gbps link is monitored, the collected measurement data grow so rapidly that polynomial-time algorithms (e.g., Gaussian processes) become intractable. One possible solution to reduce the storage of vast amounts of measured data is to store a random sample, such as one out of every 1000 network packets. However, such static sampling methods (linear sampling) have drawbacks: (1) they are not scalable for high-rate streaming data, and (2) there is no guarantee of reflecting the underlying distribution. In this software, we implemented a dynamic sampling algorithm, based on the recent technology of relational dynamic Bayesian online locally exchangeable measures, that reduces the storage of data records at large scale and still provides accurate analysis of large streaming data. The software can be used for both online and offline data records.
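For contrast with the dynamic approach described above, the sketch below shows the kind of bounded-memory random sampling the abstract calls static or linear sampling, implemented here as standard reservoir sampling; it is not the IDEALEM algorithm.

import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)          # each record is kept with probability k/(i+1)
            if j < k:
                reservoir[j] = record
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))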
Identifying walking trips from GPS and accelerometer data in adolescent females
Rodriguez, Daniel; Cho, GH; Elder, John; Conway, Terry; Evenson, Kelly R; Ghosh-Dastidar, Bonnie; Shay, Elizabeth; Cohen, Deborah A; Veblen-Mortenson, Sarah; Pickrell, Julie; Lytle, Leslie
2013-01-01
Background Studies that have combined accelerometers and global positioning systems (GPS) to identify walking have done so in carefully controlled conditions. This study tested algorithms for identifying walking trips from accelerometer and GPS data in free-living conditions. The study also assessed the accuracy of the locations where walking occurred compared to what participants reported in a diary. Methods A convenience sample of high school females was recruited (N=42) in 2007. Participants wore a GPS unit and an accelerometer, and recorded their out-of-school travel for six days. Split-sample validation was used to examine agreement in the daily and total number of walking trips with Kappa statistics and count regression models, while agreement in locations visited by walking was examined with geographic information systems. Results Agreement varied based on the parameters of the algorithm, with algorithms exhibiting moderate to substantial agreement with self-reported daily (Kappa = 0.33–0.48) and weekly (Kappa = 0.41–0.64) walking trips. Comparison of reported locations reached by walking and GPS data suggests that reported locations are accurate. Conclusions The use of GPS and accelerometers is promising for assessing the number of walking trips and the walking locations of adolescent females. PMID:21934163
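The agreement figures above are Cohen's kappa values; a minimal computation is sketched below on toy daily trip categories (the labels are invented, not the study data).

import numpy as np

def cohens_kappa(a, b):
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                                   # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c)             # chance agreement
             for c in np.union1d(a, b))
    return (po - pe) / (1.0 - pe)

detected = [0, 1, 2, 1, 0, 3, 2, 1, 0, 2]                  # algorithm-detected daily trips
reported = [0, 1, 2, 0, 0, 3, 2, 1, 1, 2]                  # diary-reported daily trips
print(round(cohens_kappa(detected, reported), 2))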
Scholl, Zackary N.; Marszalek, Piotr E.
2013-01-01
The benefits of single molecule force spectroscopy (SMFS) clearly outweigh the challenges which include small sample sizes, tedious data collection and introduction of human bias during the subjective data selection. These difficulties can be partially eliminated through automation of the experimental data collection process for atomic force microscopy (AFM). Automation can be accomplished using an algorithm that triages usable force-extension recordings quickly with positive and negative selection. We implemented an algorithm based on the windowed fast Fourier transform of force-extension traces that identifies peaks using force-extension regimes to correctly identify usable recordings from proteins composed of repeated domains. This algorithm excels as a real-time diagnostic because it involves <30 ms computational time, has high sensitivity and specificity, and efficiently detects weak unfolding events. We used the statistics provided by the automated procedure to clearly demonstrate the properties of molecular adhesion and how these properties change with differences in the cantilever tip and protein functional groups and protein age. PMID:24001740
A statistical-based scheduling algorithm in automated data path synthesis
NASA Technical Reports Server (NTRS)
Jeon, Byung Wook; Lursinsap, Chidchanok
1992-01-01
In this paper, we propose a new heuristic scheduling algorithm based on a statistical analysis of the cumulative frequency distribution of operations among control steps. It tends to escape from local minima and therefore to reach a globally optimal solution. The presented algorithm considers real-world constraints such as chained operations, multicycle operations, and pipelined data paths. The experimental results show that it gives optimal solutions, even though it is greedy in nature.
Holloway, Andrew J; Oshlack, Alicia; Diyagama, Dileepa S; Bowtell, David DL; Smyth, Gordon K
2006-01-01
Background Concerns are often raised about the accuracy of microarray technologies and the degree of cross-platform agreement, but there are yet no methods which can unambiguously evaluate precision and sensitivity for these technologies on a whole-array basis. Results A methodology is described for evaluating the precision and sensitivity of whole-genome gene expression technologies such as microarrays. The method consists of an easy-to-construct titration series of RNA samples and an associated statistical analysis using non-linear regression. The method evaluates the precision and responsiveness of each microarray platform on a whole-array basis, i.e., using all the probes, without the need to match probes across platforms. An experiment is conducted to assess and compare four widely used microarray platforms. All four platforms are shown to have satisfactory precision but the commercial platforms are superior for resolving differential expression for genes at lower expression levels. The effective precision of the two-color platforms is improved by allowing for probe-specific dye-effects in the statistical model. The methodology is used to compare three data extraction algorithms for the Affymetrix platforms, demonstrating poor performance for the commonly used proprietary algorithm relative to the other algorithms. For probes which can be matched across platforms, the cross-platform variability is decomposed into within-platform and between-platform components, showing that platform disagreement is almost entirely systematic rather than due to measurement variability. Conclusion The results demonstrate good precision and sensitivity for all the platforms, but highlight the need for improved probe annotation. They quantify the extent to which cross-platform measures can be expected to be less accurate than within-platform comparisons for predicting disease progression or outcome. PMID:17118209
Spectral gene set enrichment (SGSE).
Frost, H Robert; Li, Zhigang; Moore, Jason H
2015-03-03
Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables, with performance strongly dependent on the clustering algorithm and number of clusters. We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
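A minimal sketch of the weighted Z-method step is given below; the exact weighting of PC variance by Tracy-Widom p-values is only one plausible reading of the abstract, and all numbers are invented.

import numpy as np
from scipy.stats import norm

def weighted_z(pvalues, weights):
    p = np.asarray(pvalues, float)
    w = np.asarray(weights, float)
    z = norm.isf(p)                              # per-PC z-scores (one-sided)
    z_comb = np.sum(w * z) / np.sqrt(np.sum(w ** 2))
    return norm.sf(z_comb)                       # combined one-sided p-value

pc_pvals = [0.01, 0.20, 0.55]                    # gene set vs PC association p-values
pc_var = np.array([4.0, 2.0, 1.0])               # PC variances
tw_pvals = np.array([1e-4, 0.03, 0.40])          # Tracy-Widom p-values per PC
weights = pc_var * (1.0 - tw_pvals)              # assumed scaling, for illustration only
print("combined p-value:", weighted_z(pc_pvals, weights))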
Malloch, L; Kadivar, K; Putz, J; Levett, P N; Tang, J; Hatchette, T F; Kadkhoda, K; Ng, D; Ho, J; Kim, J
2013-12-01
The CLSI-M53-A, Criteria for Laboratory Testing and Diagnosis of Human Immunodeficiency Virus (HIV) Infection; Approved Guideline, includes an algorithm in which samples that are reactive on a 4th generation EIA screen proceed to a supplemental assay that is able to confirm and differentiate between antibodies to HIV-1 and HIV-2. The recently CE-marked Bio-Rad Geenius HIV-1/2 Confirmatory Assay was evaluated as an alternative to the FDA-approved Bio-Rad Multispot HIV-1/HIV-2 Rapid Test, which has previously been validated for use in this new algorithm. This study used reference samples submitted to the Canadian NLHRS and samples from commercial sources. Data were tabulated in 2×2 tables for statistical analysis of sensitivity, specificity, predictive values, kappa, and likelihood ratios. The overall performance of the Geenius and Multispot was very high: sensitivity (100%, 100%), specificity (96.3%, 99.1%), and positive (45.3, 181) and negative (0, 0) likelihood ratios, respectively, with high kappa (0.96) and a low bias index (0.0068). The ability to differentiate HIV-1 (99.2%, 100%) and HIV-2 (98.1%, 98.1%) antibodies was also very high. The Bio-Rad Geenius HIV-1/2 Confirmatory Assay is a suitable alternative to the validated Multispot for use in the second stage of CLSI M53 algorithm-I. The Geenius has additional features, including traceability and sample and cassette barcoding, that improve the quality management/assurance of HIV testing. It is anticipated that the CLSI M53 guideline and assays such as the Geenius will reduce the number of indeterminate test results previously associated with the HIV-1 WB and improve the ability to differentiate HIV-2 infections. Crown Copyright © 2013. Published by Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Lavely, Adam; Vijayakumar, Ganesh; Brasseur, James; Paterson, Eric; Kinzel, Michael
2011-11-01
Using large-eddy simulation (LES) of the neutral and moderately convective atmospheric boundary layers (NBL, MCBL), we analyze the impact of coherent turbulence structure of the atmospheric surface layer on the short-time statistics that are commonly collected from wind turbines. The incoming winds are conditionally sampled with a filtering and thresholding algorithm into high/low horizontal and vertical velocity fluctuation coherent events. The time scales of these events are ~5 - 20 blade rotations and are roughly twice as long in the MCBL as the NBL. Horizontal velocity events are associated with greater variability in rotor power, lift and blade-bending moment than vertical velocity events. The variability in the industry standard 10 minute average for rotor power, sectional lift and wind velocity had a standard deviation of ~ 5% relative to the ``infinite time'' statistics for the NBL and ~10% for the MCBL. We conclude that turbulence structure associated with atmospheric stability state contributes considerable, quantifiable, variability to wind turbine statistics. Supported by NSF and DOE.
NASA Astrophysics Data System (ADS)
Shams Esfand Abadi, Mohammad; AbbasZadeh Arani, Seyed Ali Asghar
2011-12-01
This paper extends the recently introduced variable step-size (VSS) approach to a family of adaptive filter algorithms. The method uses prior knowledge of the channel impulse response statistics. Accordingly, the optimal step-size vector is obtained by minimizing the mean-square deviation (MSD). The presented algorithms are the VSS affine projection algorithm (VSS-APA), the VSS selective-partial-update NLMS (VSS-SPU-NLMS), the VSS-SPU-APA, and the VSS selective-regressor APA (VSS-SR-APA). In the VSS-SPU adaptive algorithms, the filter coefficients are partially updated, which reduces the computational complexity. In VSS-SR-APA, an optimal selection of input regressors is performed during adaptation. The presented algorithms feature good convergence speed, low steady-state mean square error (MSE), and low computational complexity. We demonstrate the good performance of the proposed algorithms through several simulations in a system identification scenario.
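As background for the VSS variants, a fixed-step NLMS filter in a system identification scenario is sketched below; the paper's optimal step-size vector derived from the MSD is not reproduced, and the channel and noise level are arbitrary.

import numpy as np

def nlms(x, d, num_taps=8, mu=0.5, eps=1e-6):
    w = np.zeros(num_taps)
    err = np.zeros(d.size)
    for n in range(num_taps - 1, d.size):
        u = x[n - num_taps + 1:n + 1][::-1]     # most recent input samples first
        e = d[n] - w @ u
        w += mu * e * u / (u @ u + eps)         # normalized step
        err[n] = e
    return w, err

rng = np.random.default_rng(3)
h_true = rng.normal(size=8)                     # unknown impulse response to identify
x = rng.normal(size=5000)
d = np.convolve(x, h_true)[:x.size] + 0.01 * rng.normal(size=x.size)
w_hat, _ = nlms(x, d)
print("relative misalignment:", round(np.linalg.norm(w_hat - h_true) / np.linalg.norm(h_true), 3))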
Nonlinear Curve-Fitting Program
NASA Technical Reports Server (NTRS)
Everhart, Joel L.; Badavi, Forooz F.
1989-01-01
Nonlinear optimization algorithm helps in finding best-fit curve. Nonlinear Curve Fitting Program, NLINEAR, is an interactive curve-fitting routine based on a description of the quadratic expansion of the chi-square (χ²) statistic. It utilizes a nonlinear optimization algorithm to calculate the best statistically weighted values of the parameters of the fitting function such that χ² is minimized. Provides user with such statistical information as goodness of fit and estimated values of parameters producing highest degree of correlation between experimental data and mathematical model. Written in FORTRAN 77.
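The original program is FORTRAN 77; a rough Python analogue of the weighted chi-square minimization it performs is sketched below with an invented exponential model and synthetic data.

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):                           # example fitting function
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(4)
x = np.linspace(0.0, 5.0, 60)
sigma = 0.05 * np.ones_like(x)                   # measurement uncertainties (weights)
y = model(x, 2.0, 1.3, 0.5) + rng.normal(0.0, sigma)

popt, pcov = curve_fit(model, x, y, p0=[1.0, 1.0, 0.0], sigma=sigma, absolute_sigma=True)
chi2 = np.sum(((y - model(x, *popt)) / sigma) ** 2)
print("best-fit parameters:", np.round(popt, 3))
print("reduced chi-square:", round(chi2 / (x.size - popt.size), 2))   # goodness of fit
print("parameter standard errors:", np.round(np.sqrt(np.diag(pcov)), 3))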
Forest inventory using multistage sampling with probability proportional to size. [Brazil
NASA Technical Reports Server (NTRS)
Parada, N. D. J. (Principal Investigator); Lee, D. C. L.; Hernandezfilho, P.; Shimabukuro, Y. E.; Deassis, O. R.; Demedeiros, J. S.
1984-01-01
A multistage sampling technique, with probability proportional to size, for forest volume inventory using remote sensing data is developed and evaluated. The study area is located in southeastern Brazil. The LANDSAT 4 digital data of the study area are used in the first stage for automatic classification of reforested areas. Four classes of pine and eucalypt with different tree volumes are classified utilizing a maximum likelihood classification algorithm. Color infrared aerial photographs are utilized in the second stage of sampling. In the third stage (ground level) the tree volume of each class is determined. The total tree volume of each class is expanded through a statistical procedure taking into account all three stages of sampling. This procedure results in an accurate tree volume estimate with a smaller number of aerial photographs and reduced time in field work.
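The expansion step can be illustrated with a simple probability-proportional-to-size draw and a Hansen-Hurwitz-type estimator; the unit sizes and measured volumes below are invented, and selection is with replacement for simplicity.

import numpy as np

rng = np.random.default_rng(5)
unit_sizes = np.array([120.0, 45.0, 300.0, 80.0, 10.0, 220.0])    # e.g. hectares per primary unit
p = unit_sizes / unit_sizes.sum()                                 # selection probabilities
sample = rng.choice(unit_sizes.size, size=3, replace=True, p=p)   # draw 3 primary units
print("selected units:", sample)

measured = np.array([50.0, 130.0, 95.0])                          # volumes measured in the sampled units
estimate = np.mean(measured / p[sample])                          # Hansen-Hurwitz expansion
print("expanded total volume estimate:", round(estimate, 1))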
A novel measure and significance testing in data analysis of cell image segmentation.
Wu, Jin Chu; Halter, Michael; Kacker, Raghu N; Elliott, John T; Plant, Anne L
2017-03-14
Cell image segmentation (CIS) is an essential part of quantitative imaging of biological cells. Designing a performance measure and conducting significance testing are critical for evaluating and comparing the CIS algorithms for image-based cell assays in cytometry. Many measures and methods have been proposed and implemented to evaluate segmentation methods. However, computing the standard errors (SE) of the measures and their correlation coefficient is not described, and thus the statistical significance of performance differences between CIS algorithms cannot be assessed. We propose the total error rate (TER), a novel performance measure for segmenting all cells in the supervised evaluation. The TER statistically aggregates all misclassification error rates (MER) by taking cell sizes as weights. The MERs are for segmenting each single cell in the population. The TER is fully supported by the pairwise comparisons of MERs using 106 manually segmented ground-truth cells with different sizes and seven CIS algorithms taken from ImageJ. Further, the SE and 95% confidence interval (CI) of TER are computed based on the SE of MER that is calculated using the bootstrap method. An algorithm for computing the correlation coefficient of TERs between two CIS algorithms is also provided. Hence, the 95% CI error bars can be used to classify CIS algorithms. The SEs of TERs and their correlation coefficient can be employed to conduct the hypothesis testing, while the CIs overlap, to determine the statistical significance of the performance differences between CIS algorithms. A novel measure TER of CIS is proposed. The TER's SEs and correlation coefficient are computed. Thereafter, CIS algorithms can be evaluated and compared statistically by conducting the significance testing.
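A simplified sketch of the TER aggregation is given below: per-cell misclassification error rates are combined with cell sizes as weights, and a bootstrap over cells gives a rough standard error. The paper derives the SE from per-cell bootstraps, so this is only schematic, and all values are fabricated.

import numpy as np

def total_error_rate(mer, sizes):
    mer, sizes = np.asarray(mer, float), np.asarray(sizes, float)
    return np.sum(sizes * mer) / np.sum(sizes)        # size-weighted aggregation of MERs

def bootstrap_se(mer, sizes, n_boot=2000, seed=6):
    rng = np.random.default_rng(seed)
    mer, sizes = np.asarray(mer, float), np.asarray(sizes, float)
    stats = [total_error_rate(mer[idx], sizes[idx])
             for idx in rng.integers(0, mer.size, size=(n_boot, mer.size))]
    return np.std(stats, ddof=1)

mer = np.array([0.05, 0.12, 0.03, 0.20, 0.08])        # per-cell error rates (made up)
sizes = np.array([900, 450, 1200, 300, 800])          # cell sizes in pixels (made up)
ter, se = total_error_rate(mer, sizes), bootstrap_se(mer, sizes)
print(f"TER = {ter:.3f} +/- {se:.3f}; 95% CI ~ ({ter - 1.96 * se:.3f}, {ter + 1.96 * se:.3f})")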
Using entropy to cut complex time series
NASA Astrophysics Data System (ADS)
Mertens, David; Poncela Casasnovas, Julia; Spring, Bonnie; Amaral, L. A. N.
2013-03-01
Using techniques from statistical physics, physicists have modeled and analyzed human phenomena ranging from academic citation rates to disease spreading to vehicular traffic jams. The last decade's explosion of digital information and the growing ubiquity of smartphones have led to a wealth of self-reported human data. This wealth of data comes at a cost, including non-uniform sampling and statistically significant but physically insignificant correlations. In this talk I present our work using entropy to identify stationary sub-sequences of self-reported human weight from a weight-management web site. Our entropic approach, inspired by the infomap network community detection algorithm, is far less biased by rare fluctuations than more traditional time series segmentation techniques. Supported by the Howard Hughes Medical Institute.
New approach to statistical description of fluctuating particle fluxes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saenko, V. V.
2009-01-15
The probability density functions (PDFs) of the increments of fluctuating particle fluxes are investigated. It is found that the PDFs have heavy power-law tails decreasing as x^(−α−1) as x → ∞. This makes it possible to describe these PDFs in terms of fractionally stable distributions (FSDs) q(x; α, β, θ, λ). The parameters α, β, θ, and λ were estimated statistically using as an example the time samples of fluctuating particle fluxes measured in the edge plasma of the L-2M stellarator. Two series of fluctuating fluxes measured before and after boronization of the vacuum chamber were processed. It is shown that the increments of the fluctuating fluxes are well described by FSDs. The effect of boronization on the parameters of the FSDs is analyzed. An algorithm for statistically estimating the FSD parameters and a procedure for processing experimental data are described.
NASA Astrophysics Data System (ADS)
Zhang, Yu; Li, Fei; Zhang, Shengkai; Zhu, Tingting
2017-04-01
Synthetic Aperture Radar (SAR) is significantly important for polar remote sensing since it can provide continuous observations in all days and all weather. SAR can be used to extract surface roughness information characterized by the variance of dielectric properties and different polarization channels, which makes it possible to observe different ice types and surface structure for deformation analysis. In November 2016, the 33rd Chinese National Antarctic Research Expedition (CHINARE) cruise set sail into the Antarctic sea ice zone. An accurate spatial distribution of leads in the sea ice zone is essential for routine planning of ship navigation. In this study, the semantic relationship between leads and sea ice categories is described by a Conditional Random Field (CRF) model, and lead characteristics are modeled by statistical distributions in SAR imagery. In the proposed algorithm, a mixture-statistical-distribution-based CRF is developed that considers both contextual information and the statistical characteristics of sea ice to improve lead detection in Sentinel-1A dual-polarization SAR imagery. The unary and pairwise potentials in the CRF model are constructed by integrating the posterior probabilities estimated from the statistical distributions. For mixture-distribution parameter estimation, the Method of Logarithmic Cumulants (MoLC) is exploited for single-distribution parameter estimation, and an iterative Expectation-Maximization (EM) algorithm is used to calculate the parameters of the mixture-distribution-based CRF model. In the posterior probability inference, a graph-cut energy minimization method is adopted for the initial lead detection. Post-processing procedures, including an aspect-ratio constraint and spatial smoothing, are utilized to improve the visual result. The proposed method is validated on Sentinel-1A C-band Extra Wide Swath (EW) Ground Range Detected (GRD) imagery with a pixel spacing of 40 meters near the Prydz Bay area, East Antarctica. The main work is as follows: 1) A mixture-statistical-distribution-based CRF algorithm has been developed for lead detection from Sentinel-1A dual-polarization images. 2) An assessment of the proposed mixture-distribution-based CRF method against a single-distribution-based CRF algorithm is presented. 3) Preferable parameter sets, including the statistical distributions, the aspect-ratio threshold, and the spatial smoothing window size, are provided. In the future, the proposed algorithm will be developed for operational processing of the Sentinel series data sets owing to its low computational cost and high accuracy in lead detection.
Weiss, Jeremy C; Page, David; Peissig, Peggy L; Natarajan, Sriraam; McCarty, Catherine
2013-01-01
Electronic health records (EHRs) are an emerging relational domain with large potential to improve clinical outcomes. We apply two statistical relational learning (SRL) algorithms to the task of predicting primary myocardial infarction. We show that one SRL algorithm, relational functional gradient boosting, outperforms propositional learners particularly in the medically-relevant high recall region. We observe that both SRL algorithms predict outcomes better than their propositional analogs and suggest how our methods can augment current epidemiological practices. PMID:25360347
A system for learning statistical motion patterns.
Hu, Weiming; Xiao, Xuejuan; Fu, Zhouyu; Xie, Dan; Tan, Tieniu; Maybank, Steve
2006-09-01
Analysis of motion patterns is an effective approach for anomaly detection and behavior prediction. Current approaches for the analysis of motion patterns depend on known scenes, where objects move in predefined ways. It is highly desirable to automatically construct object motion patterns which reflect the knowledge of the scene. In this paper, we present a system for automatically learning motion patterns for anomaly detection and behavior prediction based on a proposed algorithm for robustly tracking multiple objects. In the tracking algorithm, foreground pixels are clustered using a fast accurate fuzzy K-means algorithm. Growing and prediction of the cluster centroids of foreground pixels ensure that each cluster centroid is associated with a moving object in the scene. In the algorithm for learning motion patterns, trajectories are clustered hierarchically using spatial and temporal information and then each motion pattern is represented with a chain of Gaussian distributions. Based on the learned statistical motion patterns, statistical methods are used to detect anomalies and predict behaviors. Our system is tested using image sequences acquired, respectively, from a crowded real traffic scene and a model traffic scene. Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of algorithms for anomaly detection and behavior prediction.
NASA Astrophysics Data System (ADS)
Buaria, Dhawal; Yeung, P. K.; Sawford, B. L.
2016-11-01
An efficient massively parallel algorithm has allowed us to obtain the trajectories of 300 million fluid particles in an 8192³ simulation of isotropic turbulence at Taylor-scale Reynolds number 1300. Conditional single-particle statistics are used to investigate the effect of extreme events in dissipation and enstrophy on turbulent dispersion. The statistics of pairs and tetrads, both forward and backward in time, are obtained via post-processing of single-particle trajectories. For tetrads, since memory of shape is known to be short, we focus, for convenience, on samples which are initially regular, with all sides of comparable length. The statistics of tetrad size show similar behavior as the two-particle relative dispersion, i.e., stronger backward dispersion at intermediate times with larger backward Richardson constant. In contrast, the statistics of tetrad shape show more robust inertial range scaling, in both forward and backward frames. However, the distortion of shape is stronger for backward dispersion. Our results suggest that the Reynolds number reached in this work is sufficient to settle some long-standing questions concerning Lagrangian scale similarity. Supported by NSF Grants CBET-1235906 and ACI-1036170.
A statistical framework for evaluating neural networks to predict recurrent events in breast cancer
NASA Astrophysics Data System (ADS)
Gorunescu, Florin; Gorunescu, Marina; El-Darzi, Elia; Gorunescu, Smaranda
2010-07-01
Breast cancer is the second leading cause of cancer deaths in women today. Sometimes, breast cancer can return after primary treatment. A medical diagnosis of recurrent cancer is often a more challenging task than the initial one. In this paper, we investigate the potential contribution of neural networks (NNs) to support health professionals in diagnosing such events. The NN algorithms are tested and applied to two different datasets. An extensive statistical analysis has been performed to verify our experiments. The results show that a simple network structure for both the multi-layer perceptron and radial basis function can produce equally good results, not all attributes are needed to train these algorithms and, finally, the classification performances of all algorithms are statistically robust. Moreover, we have shown that the best performing algorithm will strongly depend on the features of the datasets, and hence, there is not necessarily a single best classifier.
Optimal two-phase sampling design for comparing accuracies of two binary classification rules.
Xu, Huiping; Hui, Siu L; Grannis, Shaun
2014-02-10
In this paper, we consider the design for comparing the performance of two binary classification rules, for example, two record linkage algorithms or two screening tests. Statistical methods are well developed for comparing these accuracy measures when the gold standard is available for every unit in the sample, or in a two-phase study when the gold standard is ascertained only in the second phase in a subsample using a fixed sampling scheme. However, these methods do not attempt to optimize the sampling scheme to minimize the variance of the estimators of interest. In comparing the performance of two classification rules, the parameters of primary interest are the difference in sensitivities, specificities, and positive predictive values. We derived the analytic variance formulas for these parameter estimates and used them to obtain the optimal sampling design. The efficiency of the optimal sampling design is evaluated through an empirical investigation that compares the optimal sampling with simple random sampling and with proportional allocation. Results of the empirical study show that the optimal sampling design is similar for estimating the difference in sensitivities and in specificities, and both achieve a substantial amount of variance reduction with an over-sample of subjects with discordant results and under-sample of subjects with concordant results. A heuristic rule is recommended when there is no prior knowledge of individual sensitivities and specificities, or the prevalence of the true positive findings in the study population. The optimal sampling is applied to a real-world example in record linkage to evaluate the difference in classification accuracy of two matching algorithms. Copyright © 2013 John Wiley & Sons, Ltd.
Subband Image Coding with Jointly Optimized Quantizers
NASA Technical Reports Server (NTRS)
Kossentini, Faouzi; Chung, Wilson C.; Smith, Mark J. T.
1995-01-01
An iterative design algorithm for the joint design of complexity- and entropy-constrained subband quantizers and associated entropy coders is proposed. Unlike conventional subband design algorithms, the proposed algorithm does not require the use of various bit allocation algorithms. Multistage residual quantizers are employed here because they provide greater control of the complexity-performance tradeoffs, and also because they allow efficient and effective high-order statistical modeling. The resulting subband coder exploits statistical dependencies within subbands, across subbands, and across stages, mainly through complexity-constrained high-order entropy coding. Experimental results demonstrate that the complexity-rate-distortion performance of the new subband coder is exceptional.
Wu, Jianning; Wu, Bin
2015-01-01
The accurate identification of gait asymmetry is very beneficial to the assessment of at-risk gait in clinical applications. This paper investigated the application of a classification method based on a statistical learning algorithm to quantify gait symmetry, based on the assumption that the degree of intrinsic change in the dynamical system of gait is associated with different statistical distributions of the gait variables from the left and right lower limbs; that is, a small difference in similarity between the lower limbs is discriminated by recognizing their different probability distributions. The kinetic gait data of 60 participants were recorded using a strain gauge force platform during normal walking. The classification method is designed on an advanced statistical learning algorithm, the support vector machine for binary classification, and is adopted to quantitatively evaluate gait symmetry. The experimental results showed that the proposed method could capture more of the intrinsic dynamic information hidden in the gait variables and recognize right-left gait patterns with superior generalization performance. Moreover, the proposed technique could identify small but significant differences between the lower limbs when compared with the traditional symmetry index method for gait. The proposed algorithm could become an effective tool for the early identification of gait asymmetry in the elderly in clinical diagnosis. PMID:25705672
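A minimal sketch of the classification idea follows: train an SVM to separate left-limb from right-limb feature vectors; cross-validated accuracy near chance suggests symmetry, while accuracy well above chance indicates a detectable asymmetry. The synthetic features below only stand in for the force-platform variables.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(7)
left = rng.normal(loc=0.00, scale=1.0, size=(60, 4))    # per-step kinetic features, left limb
right = rng.normal(loc=0.15, scale=1.0, size=(60, 4))   # small shift mimics mild asymmetry
X = np.vstack([left, right])
y = np.array([0] * 60 + [1] * 60)                       # 0 = left, 1 = right

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
acc = cross_val_score(clf, X, y, cv=5).mean()
print("cross-validated left/right separability:", round(acc, 2))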
Wu, Jianning; Wu, Bin
2015-01-01
The accurate identification of gait asymmetry is very beneficial to the assessment of at-risk gait in clinical applications. This paper investigated the application of a classification method based on a statistical learning algorithm to quantify gait symmetry, based on the assumption that the degree of intrinsic change in the dynamical system of gait is associated with different statistical distributions of the gait variables from the left and right lower limbs; that is, a small difference in similarity between the lower limbs is discriminated by recognizing their different probability distributions. The kinetic gait data of 60 participants were recorded using a strain gauge force platform during normal walking. The classification method is designed on an advanced statistical learning algorithm, the support vector machine for binary classification, and is adopted to quantitatively evaluate gait symmetry. The experimental results showed that the proposed method could capture more of the intrinsic dynamic information hidden in the gait variables and recognize right-left gait patterns with superior generalization performance. Moreover, the proposed technique could identify small but significant differences between the lower limbs when compared with the traditional symmetry index method for gait. The proposed algorithm could become an effective tool for the early identification of gait asymmetry in the elderly in clinical diagnosis.
Optimization of the Hartmann-Shack microlens array
NASA Astrophysics Data System (ADS)
de Oliveira, Otávio Gomes; de Lima Monteiro, Davies William
2011-04-01
In this work we propose to optimize the microlens-array geometry for a Hartmann-Shack wavefront sensor. The optimization makes it possible to replace regular microlens arrays having a larger number of microlenses with arrays of fewer microlenses located at optimal sampling positions, with no increase in the reconstruction error. The goal is to propose a straightforward and widely accessible numerical method to calculate an optimized microlens array for known aberration statistics. The optimization comprises the minimization of the wavefront reconstruction error and/or of the number of microlenses needed in the array. We numerically generate, sample and reconstruct the wavefront, and use a genetic algorithm to discover the optimal array geometry. Within an ophthalmological context, as a case study, we demonstrate that an array with only 10 suitably located microlenses can be used to produce reconstruction errors as small as those of a 36-microlens regular array. The same optimization procedure can be employed for any application where the wavefront statistics is known.
SPICE: exploration and analysis of post-cytometric complex multivariate datasets.
Roederer, Mario; Nozzi, Joshua L; Nason, Martha C
2011-02-01
Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.
Optimizing human activity patterns using global sensitivity analysis.
Fairchild, Geoffrey; Hickmann, Kyle S; Mniszewski, Susan M; Del Valle, Sara Y; Hyman, James M
2014-12-01
Implementing realistic activity patterns for a population is crucial for modeling, for example, disease spread, supply and demand, and disaster response. Using the dynamic activity simulation engine, DASim, we generate schedules for a population that capture regular (e.g., working, eating, and sleeping) and irregular activities (e.g., shopping or going to the doctor). We use the sample entropy (SampEn) statistic to quantify a schedule's regularity for a population. We show how to tune an activity's regularity by adjusting SampEn, thereby making it possible to realistically design activities when creating a schedule. The tuning process sets up a computationally intractable high-dimensional optimization problem. To reduce the computational demand, we use Bayesian Gaussian process regression to compute global sensitivity indices and identify the parameters that have the greatest effect on the variance of SampEn. We use the harmony search (HS) global optimization algorithm to locate global optima. Our results show that HS combined with global sensitivity analysis can efficiently tune the SampEn statistic with few search iterations. We demonstrate how global sensitivity analysis can guide statistical emulation and global optimization algorithms to efficiently tune activities and generate realistic activity patterns. Though our tuning methods are applied to dynamic activity schedule generation, they are general and represent a significant step in the direction of automated tuning and optimization of high-dimensional computer simulations.
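For reference, a brute-force sample entropy implementation is sketched below on toy series; it follows the standard SampEn definition (template length m, tolerance r times the standard deviation) and is not the DASim code.

import numpy as np

def sample_entropy(x, m=2, r=0.2):
    x = np.asarray(x, float)
    tol = r * np.std(x)
    n = x.size

    def match_count(length):
        templates = np.array([x[i:i + length] for i in range(n - m)])
        total = 0
        for i in range(len(templates)):
            dist = np.max(np.abs(templates - templates[i]), axis=1)   # Chebyshev distance
            total += np.sum(dist <= tol) - 1                          # exclude self-match
        return total

    b = match_count(m)
    a = match_count(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

rng = np.random.default_rng(8)
regular = np.sin(np.linspace(0.0, 20.0 * np.pi, 400))    # proxy for a highly regular schedule
irregular = rng.normal(size=400)                         # proxy for an irregular schedule
print("SampEn regular:  ", round(sample_entropy(regular), 2))
print("SampEn irregular:", round(sample_entropy(irregular), 2))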
Optimizing human activity patterns using global sensitivity analysis
Hickmann, Kyle S.; Mniszewski, Susan M.; Del Valle, Sara Y.; Hyman, James M.
2014-01-01
Implementing realistic activity patterns for a population is crucial for modeling, for example, disease spread, supply and demand, and disaster response. Using the dynamic activity simulation engine, DASim, we generate schedules for a population that capture regular (e.g., working, eating, and sleeping) and irregular activities (e.g., shopping or going to the doctor). We use the sample entropy (SampEn) statistic to quantify a schedule’s regularity for a population. We show how to tune an activity’s regularity by adjusting SampEn, thereby making it possible to realistically design activities when creating a schedule. The tuning process sets up a computationally intractable high-dimensional optimization problem. To reduce the computational demand, we use Bayesian Gaussian process regression to compute global sensitivity indices and identify the parameters that have the greatest effect on the variance of SampEn. We use the harmony search (HS) global optimization algorithm to locate global optima. Our results show that HS combined with global sensitivity analysis can efficiently tune the SampEn statistic with few search iterations. We demonstrate how global sensitivity analysis can guide statistical emulation and global optimization algorithms to efficiently tune activities and generate realistic activity patterns. Though our tuning methods are applied to dynamic activity schedule generation, they are general and represent a significant step in the direction of automated tuning and optimization of high-dimensional computer simulations. PMID:25580080
A Statistical Method to Distinguish Functional Brain Networks
Fujita, André; Vidal, Maciel C.; Takahashi, Daniel Y.
2017-01-01
One major problem in neuroscience is the comparison of functional brain networks of different populations, e.g., distinguishing the networks of controls and patients. Traditional algorithms are based on search for isomorphism between networks, assuming that they are deterministic. However, biological networks present randomness that cannot be well modeled by those algorithms. For instance, functional brain networks of distinct subjects of the same population can be different due to individual characteristics. Moreover, networks of subjects from different populations can be generated through the same stochastic process. Thus, a better hypothesis is that networks are generated by random processes. In this case, subjects from the same group are samples from the same random process, whereas subjects from different groups are generated by distinct processes. Using this idea, we developed a statistical test called ANOGVA to test whether two or more populations of graphs are generated by the same random graph model. Our simulations' results demonstrate that we can precisely control the rate of false positives and that the test is powerful to discriminate random graphs generated by different models and parameters. The method also showed to be robust for unbalanced data. As an example, we applied ANOGVA to an fMRI dataset composed of controls and patients diagnosed with autism or Asperger. ANOGVA identified the cerebellar functional sub-network as statistically different between controls and autism (p < 0.001). PMID:28261045
A Statistical Method to Distinguish Functional Brain Networks.
Fujita, André; Vidal, Maciel C; Takahashi, Daniel Y
2017-01-01
One major problem in neuroscience is the comparison of functional brain networks of different populations, e.g., distinguishing the networks of controls and patients. Traditional algorithms are based on search for isomorphism between networks, assuming that they are deterministic. However, biological networks present randomness that cannot be well modeled by those algorithms. For instance, functional brain networks of distinct subjects of the same population can be different due to individual characteristics. Moreover, networks of subjects from different populations can be generated through the same stochastic process. Thus, a better hypothesis is that networks are generated by random processes. In this case, subjects from the same group are samples from the same random process, whereas subjects from different groups are generated by distinct processes. Using this idea, we developed a statistical test called ANOGVA to test whether two or more populations of graphs are generated by the same random graph model. Our simulations' results demonstrate that we can precisely control the rate of false positives and that the test is powerful to discriminate random graphs generated by different models and parameters. The method also showed to be robust for unbalanced data. As an example, we applied ANOGVA to an fMRI dataset composed of controls and patients diagnosed with autism or Asperger. ANOGVA identified the cerebellar functional sub-network as statistically different between controls and autism ( p < 0.001).
Optimizing human activity patterns using global sensitivity analysis
Fairchild, Geoffrey; Hickmann, Kyle S.; Mniszewski, Susan M.; ...
2013-12-10
Implementing realistic activity patterns for a population is crucial for modeling, for example, disease spread, supply and demand, and disaster response. Using the dynamic activity simulation engine, DASim, we generate schedules for a population that capture regular (e.g., working, eating, and sleeping) and irregular activities (e.g., shopping or going to the doctor). We use the sample entropy (SampEn) statistic to quantify a schedule’s regularity for a population. We show how to tune an activity’s regularity by adjusting SampEn, thereby making it possible to realistically design activities when creating a schedule. The tuning process sets up a computationally intractable high-dimensional optimization problem. To reduce the computational demand, we use Bayesian Gaussian process regression to compute global sensitivity indices and identify the parameters that have the greatest effect on the variance of SampEn. Here we use the harmony search (HS) global optimization algorithm to locate global optima. Our results show that HS combined with global sensitivity analysis can efficiently tune the SampEn statistic with few search iterations. We demonstrate how global sensitivity analysis can guide statistical emulation and global optimization algorithms to efficiently tune activities and generate realistic activity patterns. Finally, though our tuning methods are applied to dynamic activity schedule generation, they are general and represent a significant step in the direction of automated tuning and optimization of high-dimensional computer simulations.
Intensity correlation-based calibration of FRET.
Bene, László; Ungvári, Tamás; Fedor, Roland; Sasi Szabó, László; Damjanovich, László
2013-11-05
Dual-laser flow cytometric resonance energy transfer (FCET) is a statistically efficient and accurate way of determining proximity relationships for molecules of cells even under living conditions. In the framework of this algorithm, absolute fluorescence resonance energy transfer (FRET) efficiency is determined by the simultaneous measurement of donor-quenching and sensitized emission. A crucial point is the determination of the scaling factor α responsible for balancing the different sensitivities of the donor and acceptor signal channels. The determination of α is not simple, requiring preparation of special samples that are generally different from a double-labeled FRET sample, or the use of sophisticated statistical estimation (least-squares) procedures. We present an alternative, free-from-spectral-constants approach for the determination of α and the absolute FRET efficiency, by extending the FCET algorithm with an analysis of the second moments (variances and covariances) of the detected intensity distributions. A quadratic equation for α is formulated from the intensity fluctuations, which proves sufficiently robust to give accurate α-values on a cell-by-cell basis under a wide range of conditions, using the same double-labeled sample from which the FRET efficiency itself is determined. This new approach is illustrated by FRET measurements between epitopes of the MHCI receptor on the cell surface of two cell lines, FT and LS174T. The results show that whereas the common way of determining α fails at large dye-per-protein labeling ratios of mAbs, the new approach still gives accurate results. Although introduced in a flow cytometer, the new approach can also be straightforwardly used with fluorescence microscopes. Copyright © 2013 Biophysical Society. Published by Elsevier Inc. All rights reserved.
Design of partially supervised classifiers for multispectral image data
NASA Technical Reports Server (NTRS)
Jeon, Byeungwoo; Landgrebe, David
1993-01-01
A partially supervised classification problem is addressed, especially when the class definition and corresponding training samples are provided a priori only for just one particular class. In practical applications of pattern classification techniques, a frequently observed characteristic is the heavy, often nearly impossible requirements on representative prior statistical class characteristics of all classes in a given data set. Considering the effort in both time and man-power required to have a well-defined, exhaustive list of classes with a corresponding representative set of training samples, this 'partially' supervised capability would be very desirable, assuming adequate classifier performance can be obtained. Two different classification algorithms are developed to achieve simplicity in classifier design by reducing the requirement of prior statistical information without sacrificing significant classifying capability. The first one is based on optimal significance testing, where the optimal acceptance probability is estimated directly from the data set. In the second approach, the partially supervised classification is considered as a problem of unsupervised clustering with initially one known cluster or class. A weighted unsupervised clustering procedure is developed to automatically define other classes and estimate their class statistics. The operational simplicity thus realized should make these partially supervised classification schemes very viable tools in pattern classification.
K-Nearest Neighbor Algorithm Optimization in Text Categorization
NASA Astrophysics Data System (ADS)
Chen, Shufeng
2018-01-01
The K-Nearest Neighbor (KNN) classification algorithm is one of the simplest methods in data mining. It has been widely used in classification, regression and pattern recognition. The traditional KNN method has some shortcomings, such as a large amount of sample computation and a strong dependence on the sample library capacity. In this paper, a method of representative sample optimization based on the CURE algorithm is proposed. On this basis, a quick algorithm, QKNN (quick k-nearest neighbor), is presented to find the k nearest neighbor samples, which greatly reduces the similarity computation. The experimental results show that this algorithm can effectively reduce the number of samples and speed up the search for the k nearest neighbor samples, thus improving the performance of the algorithm.
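A minimal scikit-learn sketch of the library-reduction idea follows. Note the hedges: k-means cluster centers stand in for the paper's CURE-based representative-sample selection, and the synthetic feature matrix merely stands in for text-categorization features, so the printed accuracies are illustrative only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for a text-categorization feature matrix.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Full-library KNN: every training sample takes part in the similarity search.
full_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Reduced-library KNN: per class, keep only cluster centers as representative samples
# (KMeans is a simple stand-in for the CURE-based selection described in the paper).
reps, rep_labels = [], []
for label in np.unique(y_tr):
    members = X_tr[y_tr == label]
    centers = KMeans(n_clusters=30, n_init=10, random_state=0).fit(members).cluster_centers_
    reps.append(centers)
    rep_labels.append(np.full(len(centers), label))
reduced_knn = KNeighborsClassifier(n_neighbors=5).fit(np.vstack(reps), np.concatenate(rep_labels))

print("full library accuracy   :", full_knn.score(X_te, y_te))
print("reduced library accuracy:", reduced_knn.score(X_te, y_te))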
Objective forensic analysis of striated, quasi-striated and impressed toolmarks
NASA Astrophysics Data System (ADS)
Spotts, Ryan E.
Following the 1993 Daubert v. Merrell Dow Pharmaceuticals, Inc. court case and continuing to the 2010 National Academy of Sciences report, comparative forensic toolmark examination has received many challenges to its admissibility in court cases and its scientific foundations. Many of these challenges deal with the subjective nature of determining whether toolmarks are identifiable. This questioning of current identification methods has created a demand for objective methods of identification - "objective" implying known error rates and statistical reliability. The demand for objective methods has resulted in research that created a statistical algorithm capable of comparing toolmarks to determine their statistical similarity, and thus the ability to separate matching and nonmatching toolmarks. This was expanded to the creation of virtual toolmarking (characterization of a tool to predict the toolmark it will create). The statistical algorithm, originally designed for two-dimensional striated toolmarks, had been successfully applied to striated screwdriver and quasi-striated plier toolmarks. Following this success, a blind study was conducted to validate the virtual toolmarking capability using striated screwdriver marks created at various angles of incidence. Work was also performed to optimize the statistical algorithm by implementing means to ensure the algorithm operations were constrained to logical comparison regions (e.g. the opposite ends of two toolmarks do not need to be compared because they do not coincide with each other). This work was performed on quasi-striated shear cut marks made with pliers - a previously tested, more difficult application of the statistical algorithm that could demonstrate the difference in results due to optimization. The final research conducted was performed with pseudostriated impression toolmarks made with chisels. Impression marks, which are more complex than striated marks, were analyzed using the algorithm to separate matching and nonmatching toolmarks. Results of the conducted research are presented as well as evidence of the primary assumption of forensic toolmark examination: that all tools can create identifiably unique toolmarks.
NASA Astrophysics Data System (ADS)
Ramírez-López, A.; Romero-Romo, M. A.; Muñoz-Negron, D.; López-Ramírez, S.; Escarela-Pérez, R.; Duran-Valencia, C.
2012-10-01
Computational models are developed to create grain structures using mathematical algorithms based on chaos theory, such as cellular automata, geometrical models, fractals, and stochastic methods. Because of the chaotic nature of grain structures, some of the most popular routines are based on the Monte Carlo method, statistical distributions, and random walk methods, which can be easily programmed and included in nested loops. Nevertheless, grain structures are not well defined as a result of computational errors and numerical inconsistencies in the mathematical methods. Due to the finite representation of numbers or the numerical restrictions during the simulation of solidification, damaged images appear on the screen. These images must be repaired to obtain a good measurement of grain geometrical properties. In the present work, mathematical algorithms were developed to repair, measure, and characterize grain structures obtained from cellular automata. An appropriate measurement of grain size and the correct identification of interfaces and their lengths are very important topics in materials science because they are the representation and validation of mathematical models with real samples. The developed algorithms are tested and proved to be appropriate and efficient for eliminating the errors and characterizing the grain structures.
Inverse Problems in Geodynamics Using Machine Learning Algorithms
NASA Astrophysics Data System (ADS)
Shahnas, M. H.; Yuen, D. A.; Pysklywec, R. N.
2018-01-01
During the past few decades, numerical studies have been widely employed to explore the style of circulation and mixing in the mantle of Earth and other planets. However, these geodynamical models involve many properties from mineral physics, geochemistry, and petrology. Machine learning, as a computational statistics-related technique and a subfield of artificial intelligence, has rapidly emerged recently in many fields of science and engineering. We focus here on the application of supervised machine learning (SML) algorithms in predictions of mantle flow processes. Specifically, we emphasize estimating mantle properties by employing machine learning techniques in solving an inverse problem. Using snapshots of numerical convection models as training samples, we enable machine learning models to determine the magnitude of the spin transition-induced density anomalies that can cause flow stagnation at midmantle depths. Employing support vector machine algorithms, we show that SML techniques can successfully predict the magnitude of mantle density anomalies and can also be used in characterizing mantle flow patterns. The technique can be extended to more complex geodynamic problems in mantle dynamics by employing deep learning algorithms for putting constraints on properties such as viscosity, elastic parameters, and the nature of thermal and chemical anomalies.
Chen, Yunjie; Roux, Benoît
2015-08-11
Molecular dynamics (MD) trajectories based on a classical equation of motion provide a straightforward, albeit somewhat inefficient approach, to explore and sample the configurational space of a complex molecular system. While a broad range of techniques can be used to accelerate and enhance the sampling efficiency of classical simulations, only algorithms that are consistent with the Boltzmann equilibrium distribution yield a proper statistical mechanical computational framework. Here, a multiscale hybrid algorithm relying simultaneously on all-atom fine-grained (FG) and coarse-grained (CG) representations of a system is designed to improve sampling efficiency by combining the strength of nonequilibrium molecular dynamics (neMD) and Metropolis Monte Carlo (MC). This CG-guided hybrid neMD-MC algorithm comprises six steps: (1) a FG configuration of an atomic system is dynamically propagated for some period of time using equilibrium MD; (2) the resulting FG configuration is mapped onto a simplified CG model; (3) the CG model is propagated for a brief time interval to yield a new CG configuration; (4) the resulting CG configuration is used as a target to guide the evolution of the FG system; (5) the FG configuration (from step 1) is driven via a nonequilibrium MD (neMD) simulation toward the CG target; (6) the resulting FG configuration at the end of the neMD trajectory is then accepted or rejected according to a Metropolis criterion before returning to step 1. A symmetric two-ends momentum reversal prescription is used for the neMD trajectories of the FG system to guarantee that the CG-guided hybrid neMD-MC algorithm obeys microscopic detailed balance and rigorously yields the equilibrium Boltzmann distribution. The enhanced sampling achieved with the method is illustrated with a model system with hindered diffusion and explicit-solvent peptide simulations. Illustrative tests indicate that the method can yield a speedup of about 80 times for the model system and up to 21 times for polyalanine and (AAQAA)3 in water.
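The six-step control flow described above can be sketched as follows. Everything in this toy (the quadratic "energy", the block-average CG mapping, the random-walk propagators, and the plain energy-difference Metropolis test) is a hypothetical placeholder standing in for real FG/CG dynamics and for the work-based acceptance rule with momentum reversal; the sketch only illustrates how steps (1)-(6) fit together.

import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins (all hypothetical): a 1-D "fine-grained" state of 50 coordinates, a
# block-average CG mapping, random-walk propagators, and a quadratic potential that
# plays the role of the FG energy.
def fg_energy(x):        return 0.5 * np.sum(x ** 2)
def fg_md(x):            return x + 0.05 * rng.normal(size=x.shape)   # step (1)
def map_to_cg(x):        return x.reshape(-1, 5).mean(axis=1)         # step (2)
def cg_propagate(z):     return z + 0.2 * rng.normal(size=z.shape)    # step (3)
def drive_toward_cg(x, z_target):                                     # steps (4)-(5)
    shift = z_target - map_to_cg(x)          # move each FG block toward its CG target
    return x + np.repeat(shift, 5)

x = rng.normal(size=50)
beta, accepted, n_iter = 1.0, 0, 1000
for _ in range(n_iter):
    x_eq = fg_md(x)                          # (1) equilibrium FG dynamics
    z_new = cg_propagate(map_to_cg(x_eq))    # (2)-(3) map to CG and propagate briefly
    x_trial = drive_toward_cg(x_eq, z_new)   # (4)-(5) neMD drive toward the CG target
    # (6) Metropolis accept/reject; the real algorithm uses the nonequilibrium work with
    # a symmetric two-ends momentum reversal, not the plain energy difference used here.
    delta = fg_energy(x_trial) - fg_energy(x_eq)
    if rng.random() < np.exp(-beta * delta):
        x, accepted = x_trial, accepted + 1
    else:
        x = x_eq
print("toy acceptance rate:", accepted / n_iter)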
Godsey, Brian; Heiser, Diane; Civin, Curt
2012-01-01
MicroRNAs (miRs) are known to play an important role in mRNA regulation, often by binding to complementary sequences in "target" mRNAs. Recently, several methods have been developed by which existing sequence-based target predictions can be combined with miR and mRNA expression data to infer true miR-mRNA targeting relationships. It has been shown that the combination of these two approaches gives more reliable results than either by itself. While a few such algorithms give excellent results, none fully addresses expression data sets with a natural ordering of the samples. If the samples in an experiment can be ordered or partially ordered by their expected similarity to one another, such as for time-series or studies of development processes, stages, or types, (e.g. cell type, disease, growth, aging), there are unique opportunities to infer miR-mRNA interactions that may be specific to the underlying processes, and existing methods do not exploit this. We propose an algorithm which specifically addresses [partially] ordered expression data and takes advantage of sample similarities based on the ordering structure. This is done within a Bayesian framework which specifies posterior distributions and therefore statistical significance for each model parameter and latent variable. We apply our model to a previously published expression data set of paired miR and mRNA arrays in five partially ordered conditions, with biological replicates, related to multiple myeloma, and we show how considering potential orderings can improve the inference of miR-mRNA interactions, as measured by existing knowledge about the involved transcripts.
A statistical evaluation of non-ergodic variogram estimators
Curriero, F.C.; Hohn, M.E.; Liebhold, A.M.; Lele, S.R.
2002-01-01
Geostatistics is a set of statistical techniques that is increasingly used to characterize spatial dependence in spatially referenced ecological data. A common feature of geostatistics is predicting values at unsampled locations from nearby samples using the kriging algorithm. Modeling spatial dependence in sampled data is necessary before kriging and is usually accomplished with the variogram and its traditional estimator. Other types of estimators, known as non-ergodic estimators, have been used in ecological applications. Non-ergodic estimators were originally suggested as a method of choice when sampled data are preferentially located and exhibit a skewed frequency distribution. Preferentially located samples can occur, for example, when areas with high values are sampled more intensely than other areas. In earlier studies the visual appearance of variograms from traditional and non-ergodic estimators were compared. Here we evaluate the estimators' relative performance in prediction. We also show algebraically that a non-ergodic version of the variogram is equivalent to the traditional variogram estimator. Simulations, designed to investigate the effects of data skewness and preferential sampling on variogram estimation and kriging, showed the traditional variogram estimator outperforms the non-ergodic estimators under these conditions. We also analyzed data on carabid beetle abundance, which exhibited large-scale spatial variability (trend) and a skewed frequency distribution. Detrending data followed by robust estimation of the residual variogram is demonstrated to be a successful alternative to the non-ergodic approach.
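The traditional (Matheron) variogram estimator discussed above is straightforward to compute. The NumPy sketch below is a generic illustration on synthetic coordinates rather than the beetle-abundance data, and the lag bins are arbitrary choices.

import numpy as np

def empirical_variogram(coords, values, lag_edges):
    # Traditional (Matheron) estimator on distance bins defined by lag_edges.
    coords = np.asarray(coords, float)
    values = np.asarray(values, float)
    # Pairwise distances and squared value differences, upper triangle only.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    sqdif = (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)
    dist, sqdif = dist[iu], sqdif[iu]
    gamma, counts = [], []
    for lo, hi in zip(lag_edges[:-1], lag_edges[1:]):
        mask = (dist >= lo) & (dist < hi)
        n = mask.sum()
        gamma.append(sqdif[mask].sum() / (2.0 * n) if n else np.nan)
        counts.append(n)
    return np.array(gamma), np.array(counts)

# Example on synthetic data with a large-scale trend plus noise.
rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(300, 2))
z = np.sin(xy[:, 0] / 15.0) + 0.3 * rng.normal(size=300)
g, n = empirical_variogram(xy, z, np.linspace(0, 50, 11))
print(np.round(g, 3))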
DOE Office of Scientific and Technical Information (OSTI.GOV)
Akcakaya, Murat; Nehorai, Arye; Sen, Satyabrata
Most existing radar algorithms are developed under the assumption that the environment (clutter) is stationary. However, in practice, the characteristics of the clutter can vary enormously depending on the radar-operational scenarios. If unaccounted for, these nonstationary variabilities may drastically hinder the radar performance. Therefore, to overcome such shortcomings, we develop a data-driven method for target detection in nonstationary environments. In this method, the radar dynamically detects changes in the environment and adapts to these changes by learning the new statistical characteristics of the environment and by intelligently updating its statistical detection algorithm. Specifically, we employ drift detection algorithms to detect changes in the environment; and incremental learning, particularly learning under concept drift algorithms, to learn the new statistical characteristics of the environment from the new radar data that become available in batches over a period of time. The newly learned environment characteristics are then integrated in the detection algorithm. Furthermore, we use Monte Carlo simulations to demonstrate that the developed method provides a significant improvement in the detection performance compared with detection techniques that are not aware of the environmental changes.
Obuchowski, Nancy A; Barnhart, Huiman X; Buckler, Andrew J; Pennello, Gene; Wang, Xiao-Feng; Kalpathy-Cramer, Jayashree; Kim, Hyun J Grace; Reeves, Anthony P
2015-02-01
Quantitative imaging biomarkers are being used increasingly in medicine to diagnose and monitor patients' disease. The computer algorithms that measure quantitative imaging biomarkers have different technical performance characteristics. In this paper we illustrate the appropriate statistical methods for assessing and comparing the bias, precision, and agreement of computer algorithms. We use data from three studies of pulmonary nodules. The first study is a small phantom study used to illustrate metrics for assessing repeatability. The second study is a large phantom study allowing assessment of four algorithms' bias and reproducibility for measuring tumor volume and the change in tumor volume. The third study is a small clinical study of patients whose tumors were measured on two occasions. This study allows a direct assessment of six algorithms' performance for measuring tumor change. With these three examples we compare and contrast study designs and performance metrics, and we illustrate the advantages and limitations of various common statistical methods for quantitative imaging biomarker studies. © The Author(s) 2014 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav.
Statistical algorithms improve accuracy of gene fusion detection
Hsieh, Gillian; Bierman, Rob; Szabo, Linda; Lee, Alex Gia; Freeman, Donald E.; Watson, Nathaniel; Sweet-Cordero, E. Alejandro
2017-01-01
Gene fusions are known to play critical roles in tumor pathogenesis. Yet, sensitive and specific algorithms to detect gene fusions in cancer do not currently exist. In this paper, we present a new statistical algorithm, MACHETE (Mismatched Alignment CHimEra Tracking Engine), which achieves highly sensitive and specific detection of gene fusions from RNA-Seq data, including the highest Positive Predictive Value (PPV) compared to the current state-of-the-art, as assessed in simulated data. We show that the best performing published algorithms either find large numbers of fusions in negative control data or suffer from low sensitivity detecting known driving fusions in gold standard settings, such as EWSR1-FLI1. As proof of principle that MACHETE discovers novel gene fusions with high accuracy in vivo, we mined public data to discover and subsequently PCR validate novel gene fusions missed by other algorithms in the ovarian cancer cell line OVCAR3. These results highlight the gains in accuracy achieved by introducing statistical models into fusion detection, and pave the way for unbiased discovery of potentially driving and druggable gene fusions in primary tumors. PMID:28541529
Feature extraction and classification algorithms for high dimensional data
NASA Technical Reports Server (NTRS)
Lee, Chulhee; Landgrebe, David
1993-01-01
Feature extraction and classification algorithms for high dimensional data are investigated. Developments with regard to sensors for Earth observation are moving in the direction of providing much higher dimensional multispectral imagery than is now possible. In analyzing such high dimensional data, processing time becomes an important factor. With large increases in dimensionality and the number of classes, processing time will increase significantly. To address this problem, a multistage classification scheme is proposed which reduces the processing time substantially by eliminating unlikely classes from further consideration at each stage. Several truncation criteria are developed and the relationship between thresholds and the error caused by the truncation is investigated. Next an approach to feature extraction for classification is proposed based directly on the decision boundaries. It is shown that all the features needed for classification can be extracted from decision boundaries. A characteristic of the proposed method arises by noting that only a portion of the decision boundary is effective in discriminating between classes, and the concept of the effective decision boundary is introduced. The proposed feature extraction algorithm has several desirable properties: it predicts the minimum number of features necessary to achieve the same classification accuracy as in the original space for a given pattern recognition problem; and it finds the necessary feature vectors. The proposed algorithm does not deteriorate under the circumstances of equal means or equal covariances as some previous algorithms do. In addition, the decision boundary feature extraction algorithm can be used both for parametric and non-parametric classifiers. Finally, some problems encountered in analyzing high dimensional data are studied and possible solutions are proposed. First, the increased importance of the second order statistics in analyzing high dimensional data is recognized. By investigating the characteristics of high dimensional data, the reason why the second order statistics must be taken into account in high dimensional data is suggested. Recognizing the importance of the second order statistics, there is a need to represent the second order statistics. A method to visualize statistics using a color code is proposed. By representing statistics using color coding, one can easily extract and compare the first and the second statistics.
A New Challenge for Compression Algorithms: Genetic Sequences.
ERIC Educational Resources Information Center
Grumbach, Stephane; Tahi, Fariza
1994-01-01
Analyzes the properties of genetic sequences that cause the failure of classical algorithms used for data compression. A lossless algorithm, which compresses the information contained in DNA and RNA sequences by detecting regularities such as palindromes, is presented. This algorithm combines substitutional and statistical methods and appears to…
Texture Classification by Texton: Statistical versus Binary
Guo, Zhenhua; Zhang, Zhongcheng; Li, Xiu; Li, Qin; You, Jane
2014-01-01
Using statistical textons for texture classification has shown great success recently. The maximal response 8 (Statistical_MR8), image patch (Statistical_Joint) and locally invariant fractal (Statistical_Fractal) are typical statistical texton algorithms and state-of-the-art texture classification methods. However, there are two limitations when using these methods. First, a training stage is needed to build a texton library, so the recognition accuracy is highly dependent on the training samples; second, during feature extraction, each local feature is assigned to a texton by searching for the nearest texton in the whole library, which is time consuming when the library size is big and the feature dimension is high. To address these two issues, in this paper, three binary texton counterpart methods are proposed: Binary_MR8, Binary_Joint, and Binary_Fractal. These methods do not require any training step but encode local features into binary representations directly. The experimental results on the CUReT, UIUC and KTH-TIPS databases show that binary textons can achieve sound results with fast feature extraction, especially when the images are not large and the image quality is not poor. PMID:24520346
An Independent Filter for Gene Set Testing Based on Spectral Enrichment.
Frost, H Robert; Li, Zhigang; Asselbergs, Folkert W; Moore, Jason H
2015-01-01
Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
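A loose sketch of the filtering idea follows: each gene set is scored by the smallest correlation p-value between the set's mean expression profile and the leading sample principal components. This simplification ignores the eigenvalue-significance weighting of the published SGSF statistic, and the expression matrix, gene sets, and latent factor are synthetic assumptions for illustration.

import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 500
expr = rng.normal(size=(n_samples, n_genes))          # samples x genes expression matrix
# Make the first 25 genes co-vary with a latent factor so one set tracks a leading PC.
latent = rng.normal(size=n_samples)
expr[:, :25] += 2.0 * latent[:, None]

gene_sets = {"set_A": np.arange(0, 25), "set_B": np.arange(400, 425)}

# Sample principal components of the expression matrix.
pcs = PCA(n_components=3).fit_transform(expr)

# Filter statistic (simplified): smallest correlation p-value between the gene-set
# average expression profile and the retained PCs.
for name, idx in gene_sets.items():
    profile = expr[:, idx].mean(axis=1)
    pvals = [stats.pearsonr(profile, pcs[:, k])[1] for k in range(pcs.shape[1])]
    print(name, "filter p-value:", min(pvals))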
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grant, C W; Lenderman, J S; Gansemer, J D
This document is an update to the 'ADIS Algorithm Evaluation Project Plan' specified in the Statement of Work for the US-VISIT Identity Matching Algorithm Evaluation Program, as deliverable II.D.1. The original plan was delivered in August 2010. This document modifies the plan to reflect revised deliverables resulting from delays in obtaining a database refresh, and describes the revised schedule of the program deliverables. The detailed description of the processes used, the statistical analysis processes and the results of the statistical analysis will be described fully in the program deliverables. The US-VISIT Identity Matching Algorithm Evaluation Program is work performed by Lawrence Livermore National Laboratory (LLNL) under IAA HSHQVT-07-X-00002 P00004 from the Department of Homeland Security (DHS).
Detecting Anomalies in Process Control Networks
NASA Astrophysics Data System (ADS)
Rrushi, Julian; Kang, Kyoung-Don
This paper presents the estimation-inspection algorithm, a statistical algorithm for anomaly detection in process control networks. The algorithm determines if the payload of a network packet that is about to be processed by a control system is normal or abnormal based on the effect that the packet will have on a variable stored in control system memory. The estimation part of the algorithm uses logistic regression integrated with maximum likelihood estimation in an inductive machine learning process to estimate a series of statistical parameters; these parameters are used in conjunction with logistic regression formulas to form a probability mass function for each variable stored in control system memory. The inspection part of the algorithm uses the probability mass functions to estimate the normalcy probability of a specific value that a network packet writes to a variable. Experimental results demonstrate that the algorithm is very effective at detecting anomalies in process control networks.
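One hedged reading of the two stages is sketched below: a logistic-regression model (fit by maximum likelihood) plays the role of the estimation stage, producing a probability mass function over the values a control variable can take given a simple context feature, and the inspection stage looks up the probability of the value an incoming packet would write. The cyclic setpoint history and the previous-value feature are illustrative assumptions, not details taken from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy write history for one control-system variable: it normally cycles through a small
# set of setpoints, so the previous value is a useful context feature (an assumption).
normal_cycle = np.array([0, 0, 1, 1, 2, 2, 1, 1] * 300)
prev_vals = normal_cycle[:-1].reshape(-1, 1)   # context feature
curr_vals = normal_cycle[1:]                   # value written by each packet

# "Estimation" stage: maximum-likelihood logistic regression gives, for any context,
# a probability mass function over the values the variable can take.
model = LogisticRegression(max_iter=1000).fit(prev_vals, curr_vals)

# "Inspection" stage: score an incoming packet by the estimated probability of the value
# it is about to write; a low normalcy probability flags the packet as anomalous.
def normalcy_probability(prev_value, new_value):
    pmf = model.predict_proba(np.array([[prev_value]]))[0]
    return pmf[int(np.where(model.classes_ == new_value)[0][0])]

print("typical write    P(value=1 | prev=0):", round(normalcy_probability(0, 1), 3))
print("suspicious write P(value=2 | prev=0):", round(normalcy_probability(0, 2), 3))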
RESOLVE: A new algorithm for aperture synthesis imaging of extended emission in radio astronomy
NASA Astrophysics Data System (ADS)
Junklewitz, H.; Bell, M. R.; Selig, M.; Enßlin, T. A.
2016-02-01
We present resolve, a new algorithm for radio aperture synthesis imaging of extended and diffuse emission in total intensity. The algorithm is derived using Bayesian statistical inference techniques, estimating the surface brightness in the sky assuming a priori log-normal statistics. resolve estimates the measured sky brightness in total intensity, and the spatial correlation structure in the sky, which is used to guide the algorithm to an optimal reconstruction of extended and diffuse sources. During this process, the algorithm succeeds in deconvolving the effects of the radio interferometric point spread function. Additionally, resolve provides a map with an uncertainty estimate of the reconstructed surface brightness. Furthermore, with resolve we introduce a new, optimal visibility weighting scheme that can be viewed as an extension to robust weighting. In tests using simulated observations, the algorithm shows improved performance against two standard imaging approaches for extended sources, Multiscale-CLEAN and the Maximum Entropy Method.
NASA Astrophysics Data System (ADS)
Xie, Yanan; Zhou, Mingliang; Pan, Dengke
2017-10-01
The forward-scattering model is introduced to describe the response of the normalized radar cross section (NRCS) of precipitation with synthetic aperture radar (SAR). Since the distribution of near-surface rainfall is related to the near-surface rainfall rate and a horizontal distribution factor, a retrieval algorithm called modified regression empirical and model-oriented statistical (M-M), based on Volterra integration theory, is proposed. Compared with the model-oriented statistical and Volterra integration (MOSVI) algorithm, the biggest difference is that the M-M algorithm uses the modified regression empirical algorithm rather than a linear regression formula to retrieve the near-surface rainfall rate. The number of empirical parameters in the weighted integral work is halved, and a smaller average relative error is obtained while the rainfall rate is less than 100 mm/h. Therefore, the algorithm proposed in this paper can obtain high-precision rainfall information.
Vargas-Rodriguez, Everardo; Guzman-Chavez, Ana Dinora; Baeza-Serrato, Roberto
2018-06-04
In this work, a novel tailored algorithm to enhance the overall sensitivity of gas concentration sensors based on the Direct Absorption Tunable Laser Absorption Spectroscopy (DA-ATLAS) method is presented. By using this algorithm, the sensor sensitivity can be custom-designed to be quasi-constant over a much larger dynamic range compared with that obtained by typical methods based on a single statistics feature of the sensor signal output (peak amplitude, area under the curve, mean or RMS). Additionally, it is shown that with our algorithm, an optimal function can be tailored to get a quasi-linear relationship between the concentration and some specific statistics features over a wider dynamic range. In order to test the viability of our algorithm, a basic C2H2 sensor based on DA-ATLAS was implemented, and its experimental measurements support the simulated results provided by our algorithm.
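The single-statistics features named above (peak amplitude, area under the curve, mean, RMS) are simple to extract from a sensor sweep. The sketch below computes them for Gaussian-shaped toy absorption traces, which stand in for real DA-ATLAS signals; it does not implement the paper's tailored feature-combination algorithm.

import numpy as np

def single_statistics_features(signal, dt):
    # Peak amplitude, area under the curve, mean, and RMS of one sensor sweep.
    signal = np.asarray(signal, dtype=float)
    area = np.sum((signal[:-1] + signal[1:]) * 0.5) * dt   # trapezoidal area
    return {"peak": signal.max(),
            "area": area,
            "mean": signal.mean(),
            "rms": np.sqrt(np.mean(signal ** 2))}

# Toy absorption-like traces of increasing strength, standing in for rising gas concentration.
t = np.linspace(-1.0, 1.0, 400)
for strength in (0.2, 1.0, 5.0):
    trace = strength * np.exp(-(t / 0.2) ** 2)
    feats = single_statistics_features(trace, dt=t[1] - t[0])
    print(strength, {k: round(float(v), 3) for k, v in feats.items()})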
An Algorithm to Improve Test Answer Copying Detection Using the Omega Statistic
ERIC Educational Resources Information Center
Maeda, Hotaka; Zhang, Bo
2017-01-01
The omega (ω) statistic is reputed to be one of the best indices for detecting answer copying on multiple choice tests, but its performance relies on the accurate estimation of copier ability, which is challenging because responses from the copiers may have been contaminated. We propose an algorithm that aims to identify and delete the suspected…
NASA Astrophysics Data System (ADS)
Hartmann, Alexander K.; Weigt, Martin
2005-10-01
A concise, comprehensive introduction to the topic of statistical physics of combinatorial optimization, bringing together theoretical concepts and algorithms from computer science with analytical methods from physics. The result bridges the gap between statistical physics and combinatorial optimization, investigating problems taken from theoretical computing, such as the vertex-cover problem, with the concepts and methods of theoretical physics. The authors cover rapid developments and analytical methods that are both extremely complex and spread by word-of-mouth, providing all the necessary basics in required detail. Throughout, the algorithms are shown with examples and calculations, while the proofs are given in a way suitable for graduate students, post-docs, and researchers. Ideal for newcomers to this young, multidisciplinary field.
Chen, Zhiru; Hong, Wenxue
2016-02-01
Considering the low prediction accuracy for positive samples and the poor overall classification caused by unbalanced sample data of MicroRNA (miRNA) targets, we propose a support vector machine (SVM)-integration of under-sampling and weight (IUSM) algorithm in this paper, an under-sampling method based on ensemble learning. The algorithm adopts SVM as the learning algorithm and AdaBoost as the integration framework, and embeds clustering-based under-sampling into the iterative process, aiming at reducing the degree of unbalanced distribution of positive and negative samples. Meanwhile, in the process of adaptive weight adjustment of the samples, the SVM-IUSM algorithm eliminates abnormal negative samples with a robust sample-weight smoothing mechanism so as to avoid over-learning. Finally, the prediction of the miRNA target integrated classifier is achieved with the combination of multiple weak classifiers through a voting mechanism. The experiments revealed that the SVM-IUSW algorithm, compared with other algorithms on unbalanced dataset collections, could not only improve the accuracy on positive targets and the overall classification effect, but also enhance the generalization ability of the miRNA target classifier.
Modelling maximum river flow by using Bayesian Markov Chain Monte Carlo
NASA Astrophysics Data System (ADS)
Cheong, R. Y.; Gabda, D.
2017-09-01
Analysis of flood trends is vital since flooding threatens human living in terms of finance, environment and security. The annual maximum river flows in Sabah were fitted to the generalized extreme value (GEV) distribution. The maximum likelihood estimator (MLE) arises naturally when working with the GEV distribution. However, previous research showed that MLE provides unstable results, especially for small sample sizes. In this study, we used different Bayesian Markov Chain Monte Carlo (MCMC) methods based on the Metropolis-Hastings algorithm to estimate the GEV parameters. Bayesian MCMC is a statistical inference method which studies parameter estimation by using the posterior distribution based on Bayes’ theorem. The Metropolis-Hastings algorithm is used to overcome the high-dimensional state space faced in the Monte Carlo method. This approach also accounts for more uncertainty in parameter estimation, which then gives a better prediction of maximum river flow in Sabah.
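A minimal random-walk Metropolis-Hastings sketch for GEV parameters is given below, using scipy's genextreme density (note that scipy's shape parameter c equals minus the usual GEV shape ξ). The synthetic flow series, flat priors, and fixed proposal step sizes are assumptions for illustration and are not the priors or tuning used in the study.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic "annual maximum flow" series standing in for the Sabah data.
true_c, true_loc, true_scale = -0.1, 100.0, 20.0
data = stats.genextreme.rvs(true_c, loc=true_loc, scale=true_scale, size=40, random_state=0)

def log_posterior(theta):
    c, loc, log_scale = theta
    scale = np.exp(log_scale)
    # Flat priors on (c, loc, log_scale); the log-likelihood is the GEV log-density.
    ll = stats.genextreme.logpdf(data, c, loc=loc, scale=scale).sum()
    return ll if np.isfinite(ll) else -np.inf

# Random-walk Metropolis-Hastings over (c, loc, log_scale).
theta = np.array([0.0, np.mean(data), np.log(np.std(data))])
step = np.array([0.05, 2.0, 0.05])
samples = []
for _ in range(20000):
    proposal = theta + step * rng.normal(size=3)
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta.copy())
samples = np.array(samples[5000:])      # discard burn-in
print("posterior means (c, loc, scale):",
      samples[:, 0].mean(), samples[:, 1].mean(), np.exp(samples[:, 2]).mean())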
Aghaeepour, Nima; Chattopadhyay, Pratip; Chikina, Maria; Dhaene, Tom; Van Gassen, Sofie; Kursa, Miron; Lambrecht, Bart N; Malek, Mehrnoush; McLachlan, G J; Qian, Yu; Qiu, Peng; Saeys, Yvan; Stanton, Rick; Tong, Dong; Vens, Celine; Walkowiak, Sławomir; Wang, Kui; Finak, Greg; Gottardo, Raphael; Mosmann, Tim; Nolan, Garry P; Scheuermann, Richard H; Brinkman, Ryan R
2016-01-01
The Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) challenges were established to compare the performance of computational methods for identifying cell populations in multidimensional flow cytometry data. Here we report the results of FlowCAP-IV where algorithms from seven different research groups predicted the time to progression to AIDS among a cohort of 384 HIV+ subjects, using antigen-stimulated peripheral blood mononuclear cell (PBMC) samples analyzed with a 14-color staining panel. Two approaches (FlowReMi.1 and flowDensity-flowType-RchyOptimyx) provided statistically significant predictive value in the blinded test set. Manual validation of submitted results indicated that unbiased analysis of single cell phenotypes could reveal unexpected cell types that correlated with outcomes of interest in high dimensional flow cytometry datasets. © 2015 International Society for Advancement of Cytometry.
Bearing Fault Diagnosis Based on Statistical Locally Linear Embedding
Wang, Xiang; Zheng, Yuan; Zhao, Zhenzhou; Wang, Jinping
2015-01-01
Fault diagnosis is essentially a kind of pattern recognition. The measured signal samples usually lie on nonlinear low-dimensional manifolds embedded in the high-dimensional signal space, so how to implement feature extraction and dimensionality reduction and improve recognition performance is a crucial task. In this paper a novel machinery fault diagnosis approach based on a statistical locally linear embedding (S-LLE) algorithm, which is an extension of LLE that exploits the fault class label information, is proposed. The fault diagnosis approach first extracts the intrinsic manifold features from the high-dimensional feature vectors, which are obtained from vibration signals by feature extraction in the time domain and frequency domain and by empirical mode decomposition (EMD), and then translates the complex mode space into a salient low-dimensional feature space by the manifold learning algorithm S-LLE, which outperforms other feature reduction methods such as PCA, LDA and LLE. Finally, in the reduced feature space, pattern classification and fault diagnosis are carried out easily and rapidly by a classifier. Rolling bearing fault signals are used to validate the proposed fault diagnosis approach. The results indicate that the proposed approach obviously improves the classification performance of fault pattern recognition and outperforms the other traditional approaches. PMID:26153771
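A compact scikit-learn sketch of the overall pipeline (statistics extracted from vibration segments, manifold embedding, then classification) is shown below. Plain unsupervised LLE is used as a stand-in for the supervised S-LLE variant, and the two synthetic "bearing conditions" are illustrative assumptions rather than real fault data.

import numpy as np
from scipy.stats import kurtosis
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def segment_features(sig):
    # A few time- and frequency-domain statistics for one vibration segment.
    spec = np.abs(np.fft.rfft(sig))
    return [sig.std(), np.abs(sig).max(), np.sqrt(np.mean(sig ** 2)),
            kurtosis(sig), spec.argmax(), spec.max() / spec.sum()]

# Synthetic segments for two "bearing conditions" with different dominant frequencies.
t = np.arange(1024) / 1024.0
X, y = [], []
for label, freq in enumerate((50, 120)):
    for _ in range(200):
        sig = np.sin(2 * np.pi * freq * t) + 0.5 * rng.normal(size=t.size)
        X.append(segment_features(sig))
        y.append(label)
X, y = np.array(X), np.array(y)

# Manifold learning step: plain LLE here, standing in for the supervised S-LLE variant.
embedded = LocallyLinearEmbedding(n_neighbors=12, n_components=3).fit_transform(X)

X_tr, X_te, y_tr, y_te = train_test_split(embedded, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("classification accuracy in the reduced space:", clf.score(X_te, y_te))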
Big Data Analytics for Scanning Transmission Electron Microscopy Ptychography
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jesse, S.; Chi, M.; Belianinov, A.
Electron microscopy is undergoing a transition; from the model of producing only a few micrographs, through the current state where many images and spectra can be digitally recorded, to a new mode where very large volumes of data (movies, ptychographic and multi-dimensional series) can be rapidly obtained. In this paper, we discuss the application of so-called “big-data” methods to high dimensional microscopy data, using unsupervised multivariate statistical techniques, in order to explore salient image features in a specific example of BiFeO3 domains. Remarkably, k-means clustering reveals domain differentiation despite the fact that the algorithm is purely statistical in nature and does not require any prior information regarding the material, any coexisting phases, or any differentiating structures. While this is a somewhat trivial case, this example signifies the extraction of useful physical and structural information without any prior bias regarding the sample or the instrumental modality. Further interpretation of these types of results may still require human intervention. Finally, however, the open nature of this algorithm and its wide availability enable broad collaborations and exploratory work necessary to enable efficient data analysis in electron microscopy.
Big Data Analytics for Scanning Transmission Electron Microscopy Ptychography
Jesse, S.; Chi, M.; Belianinov, A.; Beekman, C.; Kalinin, S. V.; Borisevich, A. Y.; Lupini, A. R.
2016-01-01
Electron microscopy is undergoing a transition; from the model of producing only a few micrographs, through the current state where many images and spectra can be digitally recorded, to a new mode where very large volumes of data (movies, ptychographic and multi-dimensional series) can be rapidly obtained. Here, we discuss the application of so-called “big-data” methods to high dimensional microscopy data, using unsupervised multivariate statistical techniques, in order to explore salient image features in a specific example of BiFeO3 domains. Remarkably, k-means clustering reveals domain differentiation despite the fact that the algorithm is purely statistical in nature and does not require any prior information regarding the material, any coexisting phases, or any differentiating structures. While this is a somewhat trivial case, this example signifies the extraction of useful physical and structural information without any prior bias regarding the sample or the instrumental modality. Further interpretation of these types of results may still require human intervention. However, the open nature of this algorithm and its wide availability, enable broad collaborations and exploratory work necessary to enable efficient data analysis in electron microscopy. PMID:27211523
Static versus dynamic sampling for data mining
DOE Office of Scientific and Technical Information (OSTI.GOV)
John, G.H.; Langley, P.
1996-12-31
As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the data-mining process, but this is only acceptable if it does not reduce the quality of the mined knowledge. To this end, we introduce the "Probably Close Enough" criterion to describe the desired properties of a sample. Sampling usually refers to the use of static statistical tests to decide whether a sample is sufficiently similar to the large database, in the absence of any knowledge of the tools the data miner intends to use. We discuss dynamic sampling methods, which take into account the mining tool being used and can thus give better samples. We describe dynamic schemes that observe a mining tool's performance on training samples of increasing size and use these results to determine when a sample is sufficiently large. We evaluate these sampling methods on data from the UCI repository and conclude that dynamic sampling is preferable.
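A minimal sketch of the dynamic-sampling loop follows: the mining tool's cross-validated accuracy is tracked on samples of increasing size, and the loop stops when the accuracy plateaus. The decision tree, the sample-size schedule, and the tolerance are illustrative assumptions, and this is only one simple reading of the "Probably Close Enough" idea rather than the paper's exact schemes.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in "large database".
X, y = make_classification(n_samples=50000, n_features=20, n_informative=8, random_state=0)
rng = np.random.default_rng(0)

# Dynamic sampling: grow the sample and stop once the mining tool's accuracy stops
# improving by more than a small tolerance.
tol, prev_score = 0.005, -np.inf
for size in (500, 1000, 2000, 4000, 8000, 16000, 32000):
    idx = rng.choice(len(y), size=size, replace=False)
    score = cross_val_score(DecisionTreeClassifier(random_state=0), X[idx], y[idx], cv=5).mean()
    print(f"sample size {size:6d}  accuracy {score:.3f}")
    if score - prev_score < tol:
        print("accuracy plateaued; this sample is treated as sufficient")
        break
    prev_score = score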
Nigatu, Yeshambel T; Liu, Yan; Wang, JianLi
2016-07-22
Multivariable risk prediction algorithms are useful for making clinical decisions and for health planning. While prediction algorithms for new onset of major depression in primary care attendees in Europe and elsewhere have been developed, the performance of these algorithms in different populations is not known. The objective of this study was to validate the PredictD algorithm for new onset of major depressive episode (MDE) in the US general population. A longitudinal study design was used, with approximately 3-year follow-up data from a nationally representative sample of the US general population. A total of 29,621 individuals who participated in Waves 1 and 2 of the US National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) and who did not have an MDE in the past year at Wave 1 were included. The PredictD algorithm was directly applied to the selected participants. MDE was assessed by the Alcohol Use Disorder and Associated Disabilities Interview Schedule, based on the DSM-IV criteria. Among the participants, 8% developed an MDE over three years. The PredictD algorithm had acceptable discriminative power (C-statistic = 0.708, 95% CI: 0.696, 0.720), but poor calibration (p < 0.001) with the NESARC data. In the European primary care attendees, the algorithm had a C-statistic of 0.790 (95% CI: 0.767, 0.813) with perfect calibration. The PredictD algorithm has acceptable discrimination, but its calibration was poor in the US general population despite re-calibration. Therefore, based on these results, the use of PredictD in the US general population for predicting individual risk of MDE is not currently encouraged. More independent validation research is needed.
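The two performance notions reported above, discrimination (the C-statistic) and calibration, can be computed as in the hedged sketch below. The simulated risk scores and outcomes are synthetic and unrelated to the NESARC data; the outcomes are deliberately generated below the predicted risks so the calibration table shows the kind of gap the study describes.

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical predicted 3-year MDE risks and observed outcomes for a validation cohort.
n = 5000
risk = rng.beta(1.5, 15, size=n)                # skewed risk scores, mean around 9%
outcome = rng.binomial(1, risk * 0.6)           # event rates systematically below the risks

# Discrimination: the C-statistic is the area under the ROC curve.
print("C-statistic:", round(roc_auc_score(outcome, risk), 3))

# Calibration: compare mean predicted risk with the observed event rate in risk deciles;
# large gaps indicate miscalibration.
obs, pred = calibration_curve(outcome, risk, n_bins=10, strategy="quantile")
for p, o in zip(pred, obs):
    print(f"mean predicted {p:.3f}  observed {o:.3f}")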
Adaptation of a Fast Optimal Interpolation Algorithm to the Mapping of Oceangraphic Data
NASA Technical Reports Server (NTRS)
Menemenlis, Dimitris; Fieguth, Paul; Wunsch, Carl; Willsky, Alan
1997-01-01
A fast, recently developed, multiscale optimal interpolation algorithm has been adapted to the mapping of hydrographic and other oceanographic data. This algorithm produces solution and error estimates which are consistent with those obtained from exact least squares methods, but at a small fraction of the computational cost. Problems whose solution would be completely impractical using exact least squares, that is, problems with tens or hundreds of thousands of measurements and estimation grid points, can easily be solved on a small workstation using the multiscale algorithm. In contrast to methods previously proposed for solving large least squares problems, our approach provides estimation error statistics while permitting long-range correlations, using all measurements, and permitting arbitrary measurement locations. The multiscale algorithm itself, published elsewhere, is not the focus of this paper. However, the algorithm requires statistical models having a very particular multiscale structure; it is the development of a class of multiscale statistical models, appropriate for oceanographic mapping problems, with which we concern ourselves in this paper. The approach is illustrated by mapping temperature in the northeastern Pacific. The number of hydrographic stations is kept deliberately small to show that multiscale and exact least squares results are comparable. A portion of the data were not used in the analysis; these data serve to test the multiscale estimates. A major advantage of the present approach is the ability to repeat the estimation procedure a large number of times for sensitivity studies, parameter estimation, and model testing. We have made available by anonymous Ftp a set of MATLAB-callable routines which implement the multiscale algorithm and the statistical models developed in this paper.
NASA Astrophysics Data System (ADS)
Guo, Zhan; Yan, Xuefeng
2018-04-01
Different operating conditions of p-xylene oxidation have different influences on the product, purified terephthalic acid. It is necessary to obtain the optimal combination of reaction conditions to ensure the quality of the products, cut down on consumption and increase revenues. A multi-objective differential evolution (MODE) algorithm co-evolved with the population-based incremental learning (PBIL) algorithm, called PBMODE, is proposed. The PBMODE algorithm was designed as a co-evolutionary system. Each individual has its own parameter individual, which is co-evolved by PBIL. PBIL uses statistical analysis to build a model based on the corresponding symbiotic individuals of the superior original individuals during the main evolutionary process. The results of simulations and statistical analysis indicate that the overall performance of the PBMODE algorithm is better than that of the compared algorithms and it can be used to optimize the operating conditions of the p-xylene oxidation process effectively and efficiently.
A review on the multivariate statistical methods for dimensional reduction studies
NASA Astrophysics Data System (ADS)
Aik, Lim Eng; Kiang, Lam Chee; Mohamed, Zulkifley Bin; Hong, Tan Wei
2017-05-01
In this study we review the multivariate statistical methods for dimensionality reduction that have been developed by various researchers. Dimensionality reduction is valuable for speeding up algorithm execution and may also genuinely help the final classification/clustering accuracy. Noisy or even flawed input data often lead to less than desirable algorithm performance. Removing uninformative or misleading data components may indeed help the algorithm find more general classification regions and rules and, overall, achieve better performance on new data sets.
Sitek, Arkadiusz
2016-12-21
The origin ensemble (OE) algorithm is a new method used for image reconstruction from nuclear tomographic data. The main advantage of this algorithm is the ease of implementation for complex tomographic models and the sound statistical theory. In this comment, the author provides the basics of the statistical interpretation of OE and gives suggestions for the improvement of the algorithm in the application to prompt gamma imaging as described in Polf et al (2015 Phys. Med. Biol. 60 7085).
Optimal sample sizes for the design of reliability studies: power consideration.
Shieh, Gwowen
2014-09-01
Intraclass correlation coefficients are used extensively to measure the reliability or degree of resemblance among group members in multilevel research. This study concerns the problem of the necessary sample size to ensure adequate statistical power for hypothesis tests concerning the intraclass correlation coefficient in the one-way random-effects model. In view of the incomplete and problematic numerical results in the literature, the approximate sample size formula constructed from Fisher's transformation is reevaluated and compared with an exact approach across a wide range of model configurations. These comprehensive examinations showed that the Fisher transformation method is appropriate only under limited circumstances, and therefore it is not recommended as a general method in practice. For advance design planning of reliability studies, the exact sample size procedures are fully described and illustrated for various allocation and cost schemes. Corresponding computer programs are also developed to implement the suggested algorithms.
A note on the kappa statistic for clustered dichotomous data.
Zhou, Ming; Yang, Zhao
2014-06-30
The kappa statistic is widely used to assess the agreement between two raters. Motivated by a simulation-based cluster bootstrap method to calculate the variance of the kappa statistic for clustered physician-patients dichotomous data, we investigate its special correlation structure and develop a new simple and efficient data generation algorithm. For the clustered physician-patients dichotomous data, based on the delta method and its special covariance structure, we propose a semi-parametric variance estimator for the kappa statistic. An extensive Monte Carlo simulation study is performed to evaluate the performance of the new proposal and five existing methods with respect to the empirical coverage probability, root-mean-square error, and average width of the 95% confidence interval for the kappa statistic. The variance estimator ignoring the dependence within a cluster is generally inappropriate, and the variance estimators from the new proposal, bootstrap-based methods, and the sampling-based delta method perform reasonably well for at least a moderately large number of clusters (e.g., the number of clusters K ⩾50). The new proposal and sampling-based delta method provide convenient tools for efficient computations and non-simulation-based alternatives to the existing bootstrap-based methods. Moreover, the new proposal has acceptable performance even when the number of clusters is as small as K = 25. To illustrate the practical application of all the methods, one psychiatric research data and two simulated clustered physician-patients dichotomous data are analyzed. Copyright © 2014 John Wiley & Sons, Ltd.
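A hedged simulation sketch of the cluster-bootstrap variance idea follows: Cohen's kappa is computed for two raters on patients nested within physicians, and the physicians (clusters) are resampled with replacement to obtain a standard error. The data-generating scheme and agreement rates are invented for illustration and are not the paper's semi-parametric estimator or its simulation design.

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Simulated clustered dichotomous ratings: patients nested within physicians (clusters),
# two raters per patient, with a cluster-specific prevalence inducing within-cluster correlation.
n_clusters, patients_per_cluster = 50, 10
cluster_ids, rater1, rater2 = [], [], []
for c in range(n_clusters):
    p = 0.3 + 0.4 * rng.random()
    truth = rng.binomial(1, p, size=patients_per_cluster)
    rater1.append(np.where(rng.random(patients_per_cluster) < 0.9, truth, 1 - truth))
    rater2.append(np.where(rng.random(patients_per_cluster) < 0.9, truth, 1 - truth))
    cluster_ids.append(np.full(patients_per_cluster, c))
rater1, rater2, cluster_ids = map(np.concatenate, (rater1, rater2, cluster_ids))

kappa = cohen_kappa_score(rater1, rater2)

# Cluster bootstrap: resample physicians with replacement, recompute kappa, and use the
# spread of the replicates as the variance estimate.
reps = []
for _ in range(2000):
    chosen = rng.choice(n_clusters, size=n_clusters, replace=True)
    mask = np.concatenate([np.where(cluster_ids == c)[0] for c in chosen])
    reps.append(cohen_kappa_score(rater1[mask], rater2[mask]))
se = np.std(reps, ddof=1)
print(f"kappa = {kappa:.3f}, cluster-bootstrap SE = {se:.3f}, "
      f"95% CI = ({kappa - 1.96 * se:.3f}, {kappa + 1.96 * se:.3f})")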
NASA Astrophysics Data System (ADS)
Zhong, Ke; Lei, Xia; Li, Shaoqian
2013-12-01
A statistics-based intercarrier interference (ICI) mitigation algorithm is proposed for orthogonal frequency division multiplexing systems in the presence of both nonstationary and stationary phase noise. By utilizing the statistics of the phase noise, which can be obtained from measurements or data sheets, a Wiener filter preprocessing algorithm for ICI mitigation is proposed. The proposed algorithm can be regarded as a performance-improving complement to previously reported phase noise cancellation methods. Simulation results show that the proposed algorithm effectively mitigates ICI and lowers the error floor, and therefore significantly improves the performance of previous phase noise cancellation schemes, especially in the presence of severe phase noise.
FabricS: A user-friendly, complete and robust software for particle shape-fabric analysis
NASA Astrophysics Data System (ADS)
Moreno Chávez, G.; Castillo Rivera, F.; Sarocchi, D.; Borselli, L.; Rodríguez-Sedano, L. A.
2018-06-01
Shape-fabric is a textural parameter related to the spatial arrangement of elongated particles in geological samples. Its usefulness spans a range from sedimentary petrology to igneous and metamorphic petrology. Independently of the process being studied, when a material flows, the elongated particles are oriented with the major axis in the direction of flow. In sedimentary petrology this information has been used for studies of paleo-flow direction of turbidites, the origin of quartz sediments, and locating ignimbrite vents, among others. In addition to flow direction and its polarity, the method enables flow rheology to be inferred. The use of shape-fabric has been limited due to the difficulties of automatically measuring particles and analyzing them with reliable circular statistics programs. This has dampened interest in the method for a long time. Shape-fabric measurement has increased in popularity since the 1980s thanks to the development of new image analysis techniques and circular statistics software. However, the programs currently available are unreliable, old and are incompatible with newer operating systems, or require programming skills. The goal of our work is to develop a user-friendly program, in the MATLAB environment, with a graphical user interface, that can process images and includes editing functions, and thresholds (elongation and size) for selecting a particle population and analyzing it with reliable circular statistics algorithms. Moreover, the method also has to produce rose diagrams, orientation vectors, and a complete series of statistical parameters. All these requirements are met by our new software. In this paper, we briefly explain the methodology from collection of oriented samples in the field to the minimum number of particles needed to obtain reliable fabric data. We obtained the data using specific statistical tests and taking into account the degree of iso-orientation of the samples and the required degree of reliability. The program has been verified by means of several simulations performed using appropriately designed features and by analyzing real samples.
Sapsis, Themistoklis P; Majda, Andrew J
2013-08-20
A framework for low-order predictive statistical modeling and uncertainty quantification in turbulent dynamical systems is developed here. These reduced-order, modified quasilinear Gaussian (ROMQG) algorithms apply to turbulent dynamical systems in which there is significant linear instability or linear nonnormal dynamics in the unperturbed system and energy-conserving nonlinear interactions that transfer energy from the unstable modes to the stable modes where dissipation occurs, resulting in a statistical steady state; such turbulent dynamical systems are ubiquitous in geophysical and engineering turbulence. The ROMQG method involves constructing a low-order, nonlinear, dynamical system for the mean and covariance statistics in the reduced subspace that has the unperturbed statistics as a stable fixed point and optimally incorporates the indirect effect of non-Gaussian third-order statistics for the unperturbed system in a systematic calibration stage. This calibration procedure is achieved through information involving only the mean and covariance statistics for the unperturbed equilibrium. The performance of the ROMQG algorithm is assessed on two stringent test cases: the 40-mode Lorenz 96 model mimicking midlatitude atmospheric turbulence and two-layer baroclinic models for high-latitude ocean turbulence with over 125,000 degrees of freedom. In the Lorenz 96 model, the ROMQG algorithm with just a single mode captures the transient response to random or deterministic forcing. For the baroclinic ocean turbulence models, the inexpensive ROMQG algorithm with 252 modes, less than 0.2% of the total, captures the nonlinear response of the energy, the heat flux, and even the one-dimensional energy and heat flux spectra.
NASA Technical Reports Server (NTRS)
Tan, Bin; Brown de Colstoun, Eric; Wolfe, Robert E.; Tilton, James C.; Huang, Chengquan; Smith, Sarah E.
2012-01-01
An algorithm is developed to automatically screen outliers from the massive training samples used for the Global Land Survey - Imperviousness Mapping Project (GLS-IMP). GLS-IMP will produce a global 30 m spatial resolution impervious cover data set for the years 2000 and 2010 based on the Landsat Global Land Survey (GLS) data set. This unprecedented high resolution impervious cover data set is significant not only for urbanization studies but also for global carbon, hydrology, and energy balance research. A supervised classification method, the regression tree, is applied in this project, and a set of accurate training samples is key to any supervised classification. Here we develop global-scale training samples from fine resolution (about 1 m) satellite data (Quickbird and Worldview2) and then aggregate the fine resolution impervious cover maps to 30 m resolution. To improve the classification accuracy, the training samples should be screened before they are used to train the regression tree, and it is impossible to manually screen 30 m resolution training samples collected globally. In Europe alone, for example, there are 174 training sites whose size ranges from 4.5 km by 4.5 km to 8.1 km by 3.6 km, amounting to over six million training samples. We therefore developed this automated, statistics-based algorithm to screen the training samples at two levels: the site level and the scene level. At the site level, all training samples are divided into 10 groups according to the percentage of impervious surface within a sample pixel; the samples falling within each 10% interval form one group. For each group, both univariate and multivariate outliers are detected and removed. The screening then escalates to the scene level, where a similar process with a looser threshold is applied to allow for variance caused by site differences. We do not screen across scenes because scenes can differ due to phenology, solar-view geometry, atmospheric conditions, and other factors rather than actual land cover differences. Finally, we will compare the classification results from screened and unscreened training samples to assess the improvement achieved by cleaning up the training samples.
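A minimal sketch of the per-group screening step described above, combining a univariate robust z-score test and a multivariate Mahalanobis-distance test. Thresholds and the feature layout are illustrative assumptions, not the GLS-IMP production code.

```python
import numpy as np
from scipy.stats import chi2

def screen_group(X, z_thresh=3.0, chi2_q=0.999):
    """X: (n_samples, n_bands) spectral features for one 10% impervious-cover group.
    Returns a boolean mask of samples retained after univariate and multivariate screening."""
    # Univariate screen: drop samples more than z_thresh robust z-scores from the median.
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) * 1.4826 + 1e-12
    keep = np.all(np.abs(X - med) / mad <= z_thresh, axis=1)
    # Multivariate screen: Mahalanobis distance against a chi-square cutoff.
    Xk = X[keep]
    mu = Xk.mean(axis=0)
    cov = np.cov(Xk, rowvar=False)
    inv = np.linalg.pinv(cov)
    d2 = np.einsum('ij,jk,ik->i', X - mu, inv, X - mu)
    keep &= d2 <= chi2.ppf(chi2_q, df=X.shape[1])
    return keep
```

The scene-level pass would reuse the same routine with looser thresholds, applied within each scene rather than across scenes.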
Optical Algorithms at Satellite Wavelengths for Total Suspended Matter in Tropical Coastal Waters
Ouillon, Sylvain; Douillet, Pascal; Petrenko, Anne; Neveux, Jacques; Dupouy, Cécile; Froidefond, Jean-Marie; Andréfouët, Serge; Muñoz-Caravaca, Alain
2008-01-01
Is it possible to accurately derive Total Suspended Matter concentration, or its proxy, turbidity, from remote sensing data in tropical coastal lagoon waters? To investigate this question, hyperspectral remote sensing reflectance, turbidity and chlorophyll pigment concentration were measured in three coral reef lagoons. The three sites enabled us to get data over very diverse environments: oligotrophic and sediment-poor waters in the southwest lagoon of New Caledonia, eutrophic waters in the Cienfuegos Bay (Cuba), and sediment-rich waters in the Laucala Bay (Fiji). In this paper, optical algorithms for turbidity are presented per site based on 113 stations in New Caledonia, 24 stations in Cuba and 56 stations in Fiji. Empirical algorithms are tested at satellite wavebands useful to coastal applications. Global algorithms are also derived for the merged data set (193 stations). The performance of global and local regression algorithms is compared. The best one-band algorithms on all the measurements are obtained at 681 nm using either a polynomial or a power model. The best two-band algorithms are obtained with R412/R620, R443/R670 and R510/R681. Two three-band algorithms based on Rrs620.Rrs681/Rrs412 and Rrs620.Rrs681/Rrs510 also give fair regression statistics. Finally, we propose a global algorithm based on one or three bands: turbidity is first calculated from Rrs681 and then, if < 1 FTU, it is recalculated using an algorithm based on Rrs620.Rrs681/Rrs412. On our data set, this algorithm is suitable for the 0.2-25 FTU turbidity range and for the three sites sampled (mean bias: 3.6%, rms: 35%, mean quadratic error: 1.4 FTU). This shows that defining global empirical turbidity algorithms in tropical coastal waters is within reach. PMID:27879929
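An illustrative sketch of the proposed switching scheme: turbidity is estimated from Rrs(681) first and, if below 1 FTU, re-estimated from the three-band ratio Rrs(620)·Rrs(681)/Rrs(412). The coefficients a, b, c, d below are placeholders, not the regression coefficients fitted in the paper.

```python
def turbidity_global(rrs412, rrs620, rrs681, a=1.0, b=1.0, c=1.0, d=1.0):
    """Switching turbidity estimate (FTU) from remote sensing reflectances."""
    t = a * rrs681 ** b                    # one-band power-law model at 681 nm
    if t < 1.0:                            # low-turbidity branch
        ratio = rrs620 * rrs681 / rrs412   # three-band ratio
        t = c * ratio ** d
    return t
```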
*K-means and cluster models for cancer signatures.
Kakushadze, Zura; Yu, Willie
2017-09-01
We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means' computational cost is a fraction of NMF's. Using 1389 published samples for 14 cancer types, we find that 3 cancers (liver cancer, lung cancer and renal cell carcinoma) stand out and do not have cluster-like structures. Two clusters have especially high within-cluster correlations with 11 other cancers indicating common underlying structures. Our approach opens a novel avenue for studying such structures. *K-means is universal and can be applied in other fields. We discuss some potential applications in quantitative finance.
Evaluation of a Text Compression Algorithm Against Computer-Aided Instruction (CAI) Material.
ERIC Educational Resources Information Center
Knight, Joseph M., Jr.
This report describes the initial evaluation of a text compression algorithm against computer assisted instruction (CAI) material. A review of some concepts related to statistical text compression is followed by a detailed description of a practical text compression algorithm. A simulation of the algorithm was programmed and used to obtain…
Optimum Value of Original Events on the Pept Technique
NASA Astrophysics Data System (ADS)
Sadremomtaz, Alireza; Taherparvar, Payvand
2011-12-01
Positron emission particle tracking (PEPT) has been used to track the motion of a single radioactively labeled tracer particle within a bed of similar particles. In this paper, the effect of the original event fraction on the precision of the results in two experiments is reviewed. The results show that the algorithm can no longer distinguish some corrupt trajectories; in addition, further iteration reduces the statistical significance of the sample without improving its quality. The results also show that the optimum value of trajectories depends on the type of experiment.
Multilevel modelling: Beyond the basic applications.
Wright, Daniel B; London, Kamala
2009-05-01
Over the last 30 years statistical algorithms have been developed to analyse datasets that have a hierarchical/multilevel structure. Particularly within developmental and educational psychology these techniques have become common where the sample has an obvious hierarchical structure, like pupils nested within a classroom. We describe two areas beyond the basic applications of multilevel modelling that are important to psychology: modelling the covariance structure in longitudinal designs and using generalized linear multilevel modelling as an alternative to methods from signal detection theory (SDT). Detailed code for all analyses is described using packages for the freeware R.
NASA Astrophysics Data System (ADS)
Małolepsza, Edyta; Kim, Jaegil; Keyes, Tom
2015-05-01
Metastable β ice holds small guest molecules in stable gas hydrates, so its solid-liquid equilibrium is of interest. However, aqueous crystal-liquid transitions are very difficult to simulate. A new molecular dynamics algorithm generates trajectories in a generalized NPT ensemble and equilibrates states of coexisting phases with a selectable enthalpy. With replicas spanning the range between β ice and liquid water, we find the statistical temperature from the enthalpy histograms and characterize the transition by the entropy, introducing a general computational procedure for first-order transitions.
Multiclass Bayes error estimation by a feature space sampling technique
NASA Technical Reports Server (NTRS)
Mobasseri, B. G.; Mcgillem, C. D.
1979-01-01
A general Gaussian M-class N-feature classification problem is defined. An algorithm is developed that requires the class statistics as its only input and computes the minimum probability of error through use of a combined analytical and numerical integration over a sequence of simplifying transformations of the feature space. The results are compared with those obtained by conventional techniques applied to a 2-class 4-feature discrimination problem with previously reported results, and to 4-class 4-feature multispectral scanner Landsat data classified by training and testing on the available data.
Malolepsza, Edyta; Kim, Jaegil; Keyes, Tom
2015-04-28
Metastable β ice holds small guest molecules in stable gas hydrates, so its solid/liquid equilibrium is of interest. However, aqueous crystal/liquid transitions are very difficult to simulate. A new MD algorithm generates trajectories in a generalized NPT ensemble and equilibrates states of coexisting phases with a selectable enthalpy. Furthermore, with replicas spanning the range between β ice and liquid water we find the statistical temperature from the enthalpy histograms and characterize the transition by the entropy, introducing a general computational procedure for first-order transitions.
Zhang, He-Hua; Yang, Liuyang; Liu, Yuchuan; Wang, Pin; Yin, Jun; Li, Yongming; Qiu, Mingguo; Zhu, Xueru; Yan, Fang
2016-11-16
The use of speech-based data in the classification of Parkinson disease (PD) has been shown to provide an effective, non-invasive mode of classification in recent years. Thus, there has been an increased interest in speech pattern analysis methods applicable to Parkinsonism for building predictive tele-diagnosis and tele-monitoring models. One of the obstacles in optimizing classification is to reduce noise within the collected speech samples, thus ensuring better classification accuracy and stability. While the currently used methods are effective, the ability to invoke instance selection has seldom been examined. In this study, a PD classification algorithm was proposed and examined that combines a multi-edit-nearest-neighbor (MENN) algorithm and an ensemble learning algorithm. First, the MENN algorithm is applied iteratively to select optimal training speech samples, thereby obtaining samples with high separability. Next, an ensemble learning algorithm, random forest (RF) or decorrelated neural network ensembles (DNNE), is trained on the selected training samples. Lastly, the trained ensemble learning algorithms are applied to the test samples for PD classification. This proposed method was examined using recently deposited public datasets and compared against other currently used algorithms for validation. Experimental results showed that the proposed algorithm obtained the largest improvement in classification accuracy (29.44%) compared with the other algorithms examined. Furthermore, the MENN algorithm alone was found to improve classification accuracy by as much as 45.72%. Moreover, the proposed algorithm exhibited higher stability, particularly when combining the MENN and RF algorithms. This study showed that the proposed method can improve PD classification when using speech data and can be applied to future studies seeking to improve PD classification methods.
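A hedged sketch of the two-stage scheme: iterative multi-edit nearest-neighbour (MENN) instance selection on the speech features, followed by a random forest. Parameter values and data layout are illustrative, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

def menn_select(X, y, n_splits=3, k=3, max_iter=10, seed=0):
    """Repeatedly drop training samples misclassified by a k-NN trained on the other folds."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    for _ in range(max_iter):
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, n_splits)
        keep = np.ones(len(y), dtype=bool)
        for i, test in enumerate(folds):
            train = np.concatenate([folds[j] for j in range(n_splits) if j != i])
            knn = KNeighborsClassifier(n_neighbors=k).fit(X[train], y[train])
            keep[test] &= knn.predict(X[test]) == y[test]
        if keep.all():          # no sample edited out: stop iterating
            break
        X, y = X[keep], y[keep]
    return X, y

# Usage sketch: edit the training set, then fit the ensemble learner on the retained samples.
# X_sel, y_sel = menn_select(X_train, y_train)
# clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_sel, y_sel)
```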
A sampling algorithm for segregation analysis
Tier, Bruce; Henshall, John
2001-01-01
Methods for detecting Quantitative Trait Loci (QTL) without markers have generally used iterative peeling algorithms for determining genotype probabilities. These algorithms have considerable shortcomings in complex pedigrees. A Monte Carlo Markov chain (MCMC) method which samples the pedigree of the whole population jointly is described. Simultaneous sampling of the pedigree was achieved by sampling descent graphs using the Metropolis-Hastings algorithm. A descent graph describes the inheritance state of each allele and provides pedigrees guaranteed to be consistent with Mendelian sampling. Sampling descent graphs overcomes most, if not all, of the limitations incurred by iterative peeling algorithms. The algorithm was able to find the QTL in most of the simulated populations. However, when the QTL was not modeled or found then its effect was ascribed to the polygenic component. No QTL were detected when they were not simulated. PMID:11742631
Singal, Amit G.; Mukherjee, Ashin; Elmunzer, B. Joseph; Higgins, Peter DR; Lok, Anna S.; Zhu, Ji; Marrero, Jorge A; Waljee, Akbar K
2015-01-01
Background Predictive models for hepatocellular carcinoma (HCC) have been limited by modest accuracy and lack of validation. Machine learning algorithms offer a novel methodology, which may improve HCC risk prognostication among patients with cirrhosis. Our study's aim was to develop and compare predictive models for HCC development among cirrhotic patients, using conventional regression analysis and machine learning algorithms. Methods We enrolled 442 patients with Child A or B cirrhosis at the University of Michigan between January 2004 and September 2006 (UM cohort) and prospectively followed them until HCC development, liver transplantation, death, or study termination. Regression analysis and machine learning algorithms were used to construct predictive models for HCC development, which were tested on an independent validation cohort from the Hepatitis C Antiviral Long-term Treatment against Cirrhosis (HALT-C) Trial. Both models were also compared to the previously published HALT-C model. Discrimination was assessed using receiver operating characteristic curve analysis and diagnostic accuracy was assessed with net reclassification improvement and integrated discrimination improvement statistics. Results After a median follow-up of 3.5 years, 41 patients developed HCC. The UM regression model had a c-statistic of 0.61 (95%CI 0.56-0.67), whereas the machine learning algorithm had a c-statistic of 0.64 (95%CI 0.60–0.69) in the validation cohort. The machine learning algorithm had significantly better diagnostic accuracy as assessed by net reclassification improvement (p<0.001) and integrated discrimination improvement (p=0.04). The HALT-C model had a c-statistic of 0.60 (95%CI 0.50-0.70) in the validation cohort and was outperformed by the machine learning algorithm (p=0.047). Conclusion Machine learning algorithms improve the accuracy of risk stratifying patients with cirrhosis and can be used to accurately identify patients at high-risk for developing HCC. PMID:24169273
Blinded Validation of Breath Biomarkers of Lung Cancer, a Potential Ancillary to Chest CT Screening
Phillips, Michael; Bauer, Thomas L.; Cataneo, Renee N.; Lebauer, Cassie; Mundada, Mayur; Pass, Harvey I.; Ramakrishna, Naren; Rom, William N.; Vallières, Eric
2015-01-01
Background Breath volatile organic compounds (VOCs) have been reported as biomarkers of lung cancer, but it is not known if biomarkers identified in one group can identify disease in a separate independent cohort. Also, it is not known if combining breath biomarkers with chest CT has the potential to improve the sensitivity and specificity of lung cancer screening. Methods Model-building phase (unblinded): Breath VOCs were analyzed with gas chromatography mass spectrometry in 82 asymptomatic smokers having screening chest CT, 84 symptomatic high-risk subjects with a tissue diagnosis, 100 without a tissue diagnosis, and 35 healthy subjects. Multiple Monte Carlo simulations identified breath VOC mass ions with greater than random diagnostic accuracy for lung cancer, and these were combined in a multivariate predictive algorithm. Model-testing phase (blinded validation): We analyzed breath VOCs in an independent cohort of similar subjects (n = 70, 51, 75 and 19 respectively). The algorithm predicted discriminant function (DF) values in blinded replicate breath VOC samples analyzed independently at two laboratories (A and B). Outcome modeling: We modeled the expected effects of combining breath biomarkers with chest CT on the sensitivity and specificity of lung cancer screening. Results Unblinded model-building phase. The algorithm identified lung cancer with sensitivity 74.0%, specificity 70.7% and C-statistic 0.78. Blinded model-testing phase: The algorithm identified lung cancer at Laboratory A with sensitivity 68.0%, specificity 68.4%, C-statistic 0.71; and at Laboratory B with sensitivity 70.1%, specificity 68.0%, C-statistic 0.70, with linear correlation between replicates (r = 0.88). In a projected outcome model, breath biomarkers increased the sensitivity, specificity, and positive and negative predictive values of chest CT for lung cancer when the tests were combined in series or parallel. Conclusions Breath VOC mass ion biomarkers identified lung cancer in a separate independent cohort, in a blinded replicated study. Combining breath biomarkers with chest CT could potentially improve the sensitivity and specificity of lung cancer screening. Trial Registration ClinicalTrials.gov NCT00639067 PMID:26698306
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mărăscu, V.; Dinescu, G.
In this paper we propose a statistical approach for describing the self-assembling of sub-micronic polystyrene beads on silicon surfaces, as well as the evolution of surface topography due to plasma treatments. Algorithms for image recognition are used in conjunction with Scanning Electron Microscopy (SEM) imaging of surfaces. In a first step, greyscale images of the surface covered by the polystyrene beads are obtained. An adaptive thresholding method is then applied to obtain binary images. The next step consists in automatic identification of the polystyrene bead dimensions, using the Hough transform algorithm, according to bead radius. In order to analyze the uniformity of the self-assembled polystyrene beads, the squared modulus of the 2-dimensional Fast Fourier Transform (2-D FFT) is applied. By combining these algorithms we obtain a powerful and fast statistical tool for the analysis of micro- and nanomaterials whose surface features are regularly distributed, upon SEM examination.
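A hedged sketch of the image-analysis chain described above: adaptive thresholding, Hough-transform detection of bead radii, and the squared modulus of the 2-D FFT as a uniformity measure. All parameter values are illustrative assumptions.

```python
import cv2
import numpy as np

def analyze_beads(path, min_r=5, max_r=50):
    """Return detected bead radii (pixels) and the 2-D power spectrum of the binary image."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, blockSize=51, C=2)
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=2 * min_r,
                               param1=100, param2=20,
                               minRadius=min_r, maxRadius=max_r)
    radii = circles[0, :, 2] if circles is not None else np.array([])
    power = np.abs(np.fft.fftshift(np.fft.fft2(binary.astype(float)))) ** 2
    return radii, power
```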
NASA Astrophysics Data System (ADS)
Pankratov, Oleg; Kuvshinov, Alexey
2016-01-01
Despite impressive progress in the development and application of electromagnetic (EM) deterministic inverse schemes to map the 3-D distribution of electrical conductivity within the Earth, there is one question which remains poorly addressed—uncertainty quantification of the recovered conductivity models. Apparently, only an inversion based on a statistical approach provides a systematic framework to quantify such uncertainties. The Metropolis-Hastings (M-H) algorithm is the most popular technique for sampling the posterior probability distribution that describes the solution of the statistical inverse problem. However, all statistical inverse schemes require an enormous amount of forward simulations and thus appear to be extremely demanding computationally, if not prohibitive, if a 3-D set up is invoked. This urges development of fast and scalable 3-D modelling codes which can run large-scale 3-D models of practical interest for fractions of a second on high-performance multi-core platforms. But, even with these codes, the challenge for M-H methods is to construct proposal functions that simultaneously provide a good approximation of the target density function while being inexpensive to be sampled. In this paper we address both of these issues. First we introduce a variant of the M-H method which uses information about the local gradient and Hessian of the penalty function. This, in particular, allows us to exploit adjoint-based machinery that has been instrumental for the fast solution of deterministic inverse problems. We explain why this modification of M-H significantly accelerates sampling of the posterior probability distribution. In addition we show how Hessian handling (inverse, square root) can be made practicable by a low-rank approximation using the Lanczos algorithm. Ultimately we discuss uncertainty analysis based on stochastic inversion results. In addition, we demonstrate how this analysis can be performed within a deterministic approach. In the second part, we summarize modern trends in the development of efficient 3-D EM forward modelling schemes with special emphasis on recent advances in the integral equation approach.
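As a minimal illustration of folding local gradient information into the Metropolis-Hastings proposal, the following is a Langevin-type (MALA) step sketch. The model vector, penalty function and step size are placeholders; the Hessian handling via a low-rank Lanczos approximation discussed above is not reproduced here.

```python
import numpy as np

def mala_step(m, logpost, grad_logpost, eps, rng):
    """One Metropolis-adjusted Langevin step for the model vector m."""
    g = grad_logpost(m)
    prop = m + 0.5 * eps ** 2 * g + eps * rng.standard_normal(m.size)
    gp = grad_logpost(prop)
    # log proposal densities q(prop | m) and q(m | prop)
    lq_fwd = -np.sum((prop - m - 0.5 * eps ** 2 * g) ** 2) / (2 * eps ** 2)
    lq_rev = -np.sum((m - prop - 0.5 * eps ** 2 * gp) ** 2) / (2 * eps ** 2)
    log_alpha = logpost(prop) - logpost(m) + lq_rev - lq_fwd
    return prop if np.log(rng.uniform()) < log_alpha else m
```

Proposals guided by the gradient of the penalty function concentrate samples near high-probability regions, which is why such variants accelerate sampling of the posterior relative to a random-walk proposal.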
Abraham, N S; Cohen, D C; Rivers, B; Richardson, P
2006-07-15
To validate Veterans Affairs (VA) administrative data for the diagnosis of nonsteroidal anti-inflammatory drug (NSAID)-related upper gastrointestinal events (UGIE) and to develop a diagnostic algorithm. A retrospective study of veterans prescribed an NSAID, as identified from the national pharmacy database merged with in-patient and out-patient data, followed by primary chart abstraction. Contingency tables were constructed to allow comparison with a random sample of patients prescribed an NSAID but without UGIE. Multivariable logistic regression analysis was used to derive a predictive algorithm. Once derived, the algorithm was validated in a separate cohort of veterans. Of 906 patients, 606 had a diagnostic code for UGIE; 300 were a random subsample of 11 744 patients (control). Only 161 had a confirmed UGIE. The positive predictive value (PPV) of diagnostic codes was poor, but improved from 27% to 51% with the addition of endoscopic procedural codes. The strongest predictors of UGIE were an in-patient ICD-9 code for gastric ulcer, duodenal ulcer and haemorrhage combined with upper endoscopy. This algorithm had a PPV of 73% when limited to patients ≥65 years (c-statistic 0.79). Validation of the algorithm revealed a PPV of 80% among patients with an overlapping NSAID prescription. NSAID-related UGIE can be assessed using VA administrative data. The optimal algorithm includes an in-patient ICD-9 code for gastric or duodenal ulcer and gastrointestinal bleeding combined with a procedural code for upper endoscopy.
Cost-Benefit Analysis of Computer Resources for Machine Learning
Champion, Richard A.
2007-01-01
Machine learning describes pattern-recognition algorithms - in this case, probabilistic neural networks (PNNs). These can be computationally intensive, in part because of the nonlinear optimizer, a numerical process that calibrates the PNN by minimizing a sum of squared errors. This report suggests efficiencies that are expressed as cost and benefit. The cost is computer time needed to calibrate the PNN, and the benefit is goodness-of-fit, how well the PNN learns the pattern in the data. There may be a point of diminishing returns where a further expenditure of computer resources does not produce additional benefits. Sampling is suggested as a cost-reduction strategy. One consideration is how many points to select for calibration and another is the geometric distribution of the points. The data points may be nonuniformly distributed across space, so that sampling at some locations provides additional benefit while sampling at other locations does not. A stratified sampling strategy can be designed to select more points in regions where they reduce the calibration error and fewer points in regions where they do not. Goodness-of-fit tests ensure that the sampling does not introduce bias. This approach is illustrated by statistical experiments for computing correlations between measures of roadless area and population density for the San Francisco Bay Area. The alternative to training efficiencies is to rely on high-performance computer systems. These may require specialized programming and algorithms that are optimized for parallel performance.
Improved pulse laser ranging algorithm based on high speed sampling
NASA Astrophysics Data System (ADS)
Gao, Xuan-yi; Qian, Rui-hai; Zhang, Yan-mei; Li, Huan; Guo, Hai-chao; He, Shi-jie; Guo, Xiao-kang
2016-10-01
Narrow pulse laser ranging achieves long-range target detection using laser pulses with low-divergence beams. Pulse laser ranging is widely used in the military, industrial, civil, engineering and transportation fields. In this paper, an improved narrow pulse laser ranging algorithm based on high-speed sampling is studied. Firstly, theoretical simulation models are built and analyzed, covering the laser emission process and the pulse laser ranging algorithm. An improved pulse ranging algorithm is developed that combines the matched filter algorithm and the constant fraction discrimination (CFD) algorithm. After the algorithm simulation, a laser ranging hardware system is set up to implement the improved algorithm. The laser ranging hardware system includes a laser diode, a laser detector and a high-sample-rate data logging circuit. Subsequently, using the Verilog HDL language, the improved algorithm, based on the fusion of the matched filter algorithm and the CFD algorithm, is implemented in an FPGA chip. Finally, a laser ranging experiment is carried out with the hardware system to compare the ranging performance of the improved algorithm against the matched filter algorithm and the CFD algorithm alone. The test results demonstrate that the laser ranging hardware system realizes high-speed processing and high-speed sampled-data transmission, and that the improved algorithm achieves 0.3 m ranging precision, consistent with the theoretical simulation.
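A hedged sketch of the fused estimator: a matched filter against the emitted pulse template, followed by a simplified constant-fraction timing on the filter output to obtain a sub-sample arrival time. The sample rate, fraction, and the assumption that emission occurs at sample zero of the record are illustrative simplifications.

```python
import numpy as np

def range_estimate(rx, template, fs, fraction=0.5, c=3e8):
    """rx: received samples; template: emitted pulse shape; fs: sample rate (Hz)."""
    mf = np.correlate(rx, template, mode='valid')   # matched filter output
    peak = int(np.argmax(mf))
    thresh = fraction * mf[peak]                    # constant-fraction level
    i = peak
    while i > 0 and mf[i] > thresh:                 # walk back to the threshold crossing
        i -= 1
    # linear interpolation for a sub-sample crossing time
    frac = (thresh - mf[i]) / (mf[i + 1] - mf[i] + 1e-12)
    t_flight = (i + frac) / fs                      # time of flight, emission at sample 0
    return 0.5 * c * t_flight                       # one-way range in metres
```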
NASA Astrophysics Data System (ADS)
Stan Development Team
2018-01-01
Stan facilitates statistical inference at the frontiers of applied statistics and provides both a modeling language for specifying complex statistical models and a library of statistical algorithms for computing inferences with those models. These components are exposed through interfaces in environments such as R, Python, and the command line.
Computational algebraic geometry for statistical modeling FY09Q2 progress.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thompson, David C.; Rojas, Joseph Maurice; Pebay, Philippe Pierre
2009-03-01
This is a progress report on polynomial system solving for statistical modeling. This quarter we have developed our first model of shock response data and an algorithm for identifying the chamber cone containing a polynomial system in n variables with n+k terms within polynomial time - a significant improvement over previous algorithms, all having exponential worst-case complexity. We have implemented and verified the chamber cone algorithm for n+3 and are working to extend the implementation to handle arbitrary k. Later sections of this report explain chamber cones in more detail; the next section provides an overview of the project and how the current progress fits into it.
Gene genealogies for genetic association mapping, with application to Crohn's disease
Burkett, Kelly M.; Greenwood, Celia M. T.; McNeney, Brad; Graham, Jinko
2013-01-01
A gene genealogy describes relationships among haplotypes sampled from a population. Knowledge of the gene genealogy for a set of haplotypes is useful for estimation of population genetic parameters and it also has potential application in finding disease-predisposing genetic variants. As the true gene genealogy is unknown, Markov chain Monte Carlo (MCMC) approaches have been used to sample genealogies conditional on data at multiple genetic markers. We previously implemented an MCMC algorithm to sample from an approximation to the distribution of the gene genealogy conditional on haplotype data. Our approach samples ancestral trees, recombination and mutation rates at a genomic focal point. In this work, we describe how our sampler can be used to find disease-predisposing genetic variants in samples of cases and controls. We use a tree-based association statistic that quantifies the degree to which case haplotypes are more closely related to each other around the focal point than control haplotypes, without relying on a disease model. As the ancestral tree is a latent variable, so is the tree-based association statistic. We show how the sampler can be used to estimate the posterior distribution of the latent test statistic and corresponding latent p-values, which together comprise a fuzzy p-value. We illustrate the approach on a publicly-available dataset from a study of Crohn's disease that consists of genotypes at multiple SNP markers in a small genomic region. We estimate the posterior distribution of the tree-based association statistic and the recombination rate at multiple focal points in the region. Reassuringly, the posterior mean recombination rates estimated at the different focal points are consistent with previously published estimates. The tree-based association approach finds multiple sub-regions where the case haplotypes are more genetically related than the control haplotypes, and that there may be one or multiple disease-predisposing loci. PMID:24348515
A ground truth based comparative study on clustering of gene expression data.
Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue
2008-05-01
Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.
Real-time multi-channel stimulus artifact suppression by local curve fitting.
Wagenaar, Daniel A; Potter, Steve M
2002-10-30
We describe an algorithm for suppression of stimulation artifacts in extracellular micro-electrode array (MEA) recordings. A model of the artifact based on locally fitted cubic polynomials is subtracted from the recording, yielding a flat baseline amenable to spike detection by voltage thresholding. The algorithm, SALPA, reduces the period after stimulation during which action potentials cannot be detected by an order of magnitude, to less than 2 ms. Our implementation is fast enough to process 60-channel data sampled at 25 kHz in real-time on an inexpensive desktop PC. It performs well on a wide range of artifact shapes without re-tuning any parameters, because it accounts for amplifier saturation explicitly and uses a statistic to verify successful artifact suppression immediately after the amplifiers become operational. We demonstrate the algorithm's effectiveness on recordings from dense monolayer cultures of cortical neurons obtained from rat embryos. SALPA opens up a previously inaccessible window for studying transient neural oscillations and precisely timed dynamics in short-latency responses to electric stimulation. Copyright 2002 Elsevier Science B.V.
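A hedged sketch of the core idea behind SALPA: fit a cubic polynomial to a sliding local window of the post-stimulus trace and subtract the fitted value, leaving spikes on a flat baseline. The window length is an assumption, and the published algorithm's handling of amplifier saturation and its validity statistic are omitted.

```python
import numpy as np

def subtract_local_cubic(trace, half_width=75):
    """trace: 1-D array of samples; half_width: half window length in samples."""
    trace = np.asarray(trace, dtype=float)
    out = np.empty_like(trace)
    n = len(trace)
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        x = np.arange(lo, hi) - i                       # window centred on sample i
        coeffs = np.polyfit(x, trace[lo:hi], deg=3)     # local cubic fit
        out[i] = trace[i] - np.polyval(coeffs, 0.0)     # subtract fitted value at the centre
    return out
```

This brute-force version refits the polynomial at every sample; the real-time implementation achieves its speed by updating the fit recursively as the window slides.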
A new hyperchaotic map and its application for image encryption
NASA Astrophysics Data System (ADS)
Natiq, Hayder; Al-Saidi, N. M. G.; Said, M. R. M.; Kilicman, Adem
2018-01-01
Based on the one-dimensional Sine map and the two-dimensional Hénon map, a new two-dimensional Sine-Hénon alteration model (2D-SHAM) is proposed. Basic dynamic characteristics of 2D-SHAM are studied through the following aspects: equilibria, Jacobian eigenvalues, trajectory, bifurcation diagram, Lyapunov exponents and a sensitivity dependence test. The complexity of 2D-SHAM is investigated using the Sample Entropy algorithm. Simulation results show that 2D-SHAM is overall hyperchaotic, with high complexity and high sensitivity to its initial values and control parameters. To investigate its performance in terms of security, a new 2D-SHAM-based image encryption algorithm (SHAM-IEA) is also proposed. In this algorithm, the essential requirements of confusion and diffusion are accomplished, and the stochastic 2D-SHAM is used to enhance the security of the encrypted image. The stochastic 2D-SHAM generates random values, hence SHAM-IEA can produce different encrypted images even with the same secret key. Experimental results and security analysis show that SHAM-IEA has strong capability to withstand statistical analysis, differential attack, chosen-plaintext and chosen-ciphertext attacks.
High-Performance Mixed Models Based Genome-Wide Association Analysis with omicABEL software
Fabregat-Traver, Diego; Sharapov, Sodbo Zh.; Hayward, Caroline; Rudan, Igor; Campbell, Harry; Aulchenko, Yurii; Bientinesi, Paolo
2014-01-01
To raise the power of genome-wide association studies (GWAS) and avoid false-positive results in structured populations, one can rely on mixed model based tests. When large samples are used, and when multiple traits are to be studied in the 'omics' context, this approach becomes computationally challenging. Here we consider the problem of mixed-model based GWAS for an arbitrary number of traits, and demonstrate that for the analysis of single-trait and multiple-trait scenarios different computational algorithms are optimal. We implement these optimal algorithms in a high-performance computing framework that uses state-of-the-art linear algebra kernels, incorporates optimizations, and avoids redundant computations, increasing throughput while reducing memory usage and energy consumption. We show that, compared to existing libraries, our algorithms and software achieve considerable speed-ups. The OmicABEL software described in this manuscript is available under the GNU GPL v. 3 license as part of the GenABEL project for statistical genomics at http://www.genabel.org/packages/OmicABEL. PMID:25717363
High-Performance Mixed Models Based Genome-Wide Association Analysis with omicABEL software.
Fabregat-Traver, Diego; Sharapov, Sodbo Zh; Hayward, Caroline; Rudan, Igor; Campbell, Harry; Aulchenko, Yurii; Bientinesi, Paolo
2014-01-01
To raise the power of genome-wide association studies (GWAS) and avoid false-positive results in structured populations, one can rely on mixed model based tests. When large samples are used, and when multiple traits are to be studied in the 'omics' context, this approach becomes computationally challenging. Here we consider the problem of mixed-model based GWAS for an arbitrary number of traits, and demonstrate that for the analysis of single-trait and multiple-trait scenarios different computational algorithms are optimal. We implement these optimal algorithms in a high-performance computing framework that uses state-of-the-art linear algebra kernels, incorporates optimizations, and avoids redundant computations, increasing throughput while reducing memory usage and energy consumption. We show that, compared to existing libraries, our algorithms and software achieve considerable speed-ups. The OmicABEL software described in this manuscript is available under the GNU GPL v. 3 license as part of the GenABEL project for statistical genomics at http://www.genabel.org/packages/OmicABEL.
A framework for evaluating mixture analysis algorithms
NASA Astrophysics Data System (ADS)
Dasaratha, Sridhar; Vignesh, T. S.; Shanmukh, Sarat; Yarra, Malathi; Botonjic-Sehic, Edita; Grassi, James; Boudries, Hacene; Freeman, Ivan; Lee, Young K.; Sutherland, Scott
2010-04-01
In recent years, several sensing devices capable of identifying unknown chemical and biological substances have been commercialized. The success of these devices in analyzing real world samples is dependent on the ability of the on-board identification algorithm to de-convolve spectra of substances that are mixtures. To develop effective de-convolution algorithms, it is critical to characterize the relationship between the spectral features of a substance and its probability of detection within a mixture, as these features may be similar to or overlap with other substances in the mixture and in the library. While it has been recognized that these aspects pose challenges to mixture analysis, a systematic effort to quantify spectral characteristics and their impact, is generally lacking. In this paper, we propose metrics that can be used to quantify these spectral features. Some of these metrics, such as a modification of variance inflation factor, are derived from classical statistical measures used in regression diagnostics. We demonstrate that these metrics can be correlated to the accuracy of the substance's identification in a mixture. We also develop a framework for characterizing mixture analysis algorithms, using these metrics. Experimental results are then provided to show the application of this framework to the evaluation of various algorithms, including one that has been developed for a commercial device. The illustration is based on synthetic mixtures that are created from pure component Raman spectra measured on a portable device.
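A hedged sketch of one such metric: a variance-inflation-factor-style measure of how well a library spectrum is explained by the other spectra in the library. A large value flags a component whose spectral features overlap heavily with the rest of the library and which is therefore harder to identify inside mixtures. The data layout is an assumption of this sketch.

```python
import numpy as np

def spectral_vif(library):
    """library: (n_components, n_channels) matrix of pure-component spectra."""
    vifs = []
    for i in range(library.shape[0]):
        y = library[i]
        X = np.delete(library, i, axis=0).T                 # other spectra as regressors
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        vifs.append(1.0 / max(1.0 - r2, 1e-12))             # VIF = 1 / (1 - R^2)
    return np.array(vifs)
```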
Liu, L L; Liu, M J; Ma, M
2015-09-28
The central task of this study was to mine the gene-to-medium relationship. Adequate knowledge of this relationship could potentially improve the accuracy of differentially expressed gene mining. One of the approaches to differentially expressed gene mining uses conventional clustering algorithms to identify the gene-to-medium relationship. Compared to conventional clustering algorithms, self-organization maps (SOMs) identify the nonlinear aspects of the gene-to-medium relationships by mapping the input space into another higher dimensional feature space. However, SOMs are not suitable for huge datasets consisting of millions of samples. Therefore, a new computational model, the Function Clustering Self-Organization Maps (FCSOMs), was developed. FCSOMs take advantage of the theory of granular computing as well as advanced statistical learning methodologies, and are built specifically for each information granule (a function cluster of genes), which are intelligently partitioned by the clustering algorithm provided by the DAVID_6.7 software platform. However, only the gene functions, and not their expression values, are considered in the fuzzy clustering algorithm of DAVID. Compared to the clustering algorithm of DAVID, these experimental results show a marked improvement in the accuracy of classification with the application of FCSOMs. FCSOMs can handle huge datasets and their complex classification problems, as each FCSOM (modeled for each function cluster) can be easily parallelized.
Xiao, Chuan-Le; Chen, Xiao-Zhou; Du, Yang-Li; Sun, Xuesong; Zhang, Gong; He, Qing-Yu
2013-01-04
Mass spectrometry has become one of the most important technologies in proteomic analysis. Tandem mass spectrometry (LC-MS/MS) is a major tool for the analysis of peptide mixtures from protein samples. The key step of MS data processing is the identification of peptides from experimental spectra by searching public sequence databases. Although a number of algorithms to identify peptides from MS/MS data have been already proposed, e.g. Sequest, OMSSA, X!Tandem, Mascot, etc., they are mainly based on statistical models considering only peak-matches between experimental and theoretical spectra, but not peak intensity information. Moreover, different algorithms gave different results from the same MS data, implying their probable incompleteness and questionable reproducibility. We developed a novel peptide identification algorithm, ProVerB, based on a binomial probability distribution model of protein tandem mass spectrometry combined with a new scoring function, making full use of peak intensity information and, thus, enhancing the ability of identification. Compared with Mascot, Sequest, and SQID, ProVerB identified significantly more peptides from LC-MS/MS data sets than the current algorithms at 1% False Discovery Rate (FDR) and provided more confident peptide identifications. ProVerB is also compatible with various platforms and experimental data sets, showing its robustness and versatility. The open-source program ProVerB is available at http://bioinformatics.jnu.edu.cn/software/proverb/ .
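A hedged sketch of a binomial peak-match score of the kind ProVerB builds on: the survival probability of the number of matched fragment peaks under a binomial null model, reported as -log10(p). The per-peak chance-match probability is a simple tolerance-window estimate; ProVerB's intensity weighting and scoring refinements are not reproduced here.

```python
import numpy as np
from scipy.stats import binom

def binomial_match_score(n_theoretical, n_matched, tol_da, n_spectrum_peaks, mz_range):
    """Return -log10 of the probability of observing at least n_matched matches by chance."""
    p_chance = min(1.0, n_spectrum_peaks * 2.0 * tol_da / mz_range)  # per-peak chance match
    p_value = binom.sf(n_matched - 1, n_theoretical, p_chance)       # P(X >= n_matched)
    return -np.log10(max(p_value, 1e-300))

# Example: 10 of 24 theoretical fragments matched within 0.5 Da in a spectrum of
# 120 peaks spanning 1400 Da.
# score = binomial_match_score(24, 10, 0.5, 120, 1400.0)
```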
A Simplified Algorithm for Statistical Investigation of Damage Spreading
NASA Astrophysics Data System (ADS)
Gecow, Andrzej
2009-04-01
On the way to simulating the adaptive evolution of a complex system describing a living object or a human-developed project, a fitness should be defined on node states or network external outputs. Feedbacks lead to circular attractors of these states or outputs, which makes it difficult to define a fitness. The main statistical effects of the adaptive condition result from the small-change tendency, and to appear they only require a statistically correct size of the damage initiated by an evolutionary change of the system. This observation allows us to cut loops of feedbacks and, in effect, to obtain a particular statistically correct state instead of a long circular attractor, which in the quenched model is expected for a chaotic network with feedback. Defining fitness on such states is simple. We calculate only damaged nodes, and only once. Such an algorithm is optimal for the investigation of damage spreading, i.e. the statistical connection of the structural parameters of the initial change with the size of the resulting damage. It is a reversed-annealed method: functions and states (signals) may be randomly substituted, but connections are important and are preserved. The small damages important for adaptive evolution are depicted correctly, in contrast to the Derrida annealed approximation, which expects equilibrium levels for large networks; the algorithm indicates these levels correctly. The relevant program in Pascal, which executes the algorithm for a wide range of parameters, can be obtained from the author.
Cost-effective analysis of different algorithms for the diagnosis of hepatitis C virus infection.
Barreto, A M E C; Takei, K; E C, Sabino; Bellesa, M A O; Salles, N A; Barreto, C C; Nishiya, A S; Chamone, D F
2008-02-01
We compared the cost-benefit of two algorithms, recently proposed by the Centers for Disease Control and Prevention, USA, with that of the conventional algorithm, in order to identify the most appropriate one for the diagnosis of hepatitis C virus (HCV) infection in the Brazilian population. Serum samples were obtained from 517 ELISA-positive or -inconclusive blood donors who had returned to Fundação Pró-Sangue/Hemocentro de São Paulo to confirm previous results. Algorithm A was based on the signal-to-cut-off (s/co) ratio of ELISA anti-HCV samples, using an s/co threshold that shows ≥95% concordance with immunoblot (IB) positivity. For algorithm B, reflex nucleic acid amplification testing by PCR was required for ELISA-positive or -inconclusive samples and IB for PCR-negative samples. For algorithm C, all positive or inconclusive ELISA samples were submitted to IB. We observed a similar rate of positive results with the three algorithms: 287, 287, and 285 for A, B, and C, respectively, of which 283 were concordant with one another. Indeterminate results from algorithms A and C were resolved by PCR (expanded algorithm), which detected two more positive samples. The estimated cost of algorithms A and B was US$21,299.39 and US$32,397.40, respectively, which were 43.5% and 14.0% more economical than C (US$37,673.79). The cost can vary according to the technique used. We conclude that both algorithms A and B are suitable for diagnosing HCV infection in the Brazilian population. Furthermore, algorithm A is the more practical and economical one since it requires supplemental tests for only 54% of the samples. Algorithm B provides early information about the presence of viremia.
A Classification of Remote Sensing Image Based on Improved Compound Kernels of Svm
NASA Astrophysics Data System (ADS)
Zhao, Jianing; Gao, Wanlin; Liu, Zili; Mou, Guifen; Lu, Lin; Yu, Lina
The accuracy of remote sensing (RS) classification based on SVM, which is developed from statistical learning theory, is high even with a small number of training samples, which makes SVM methods well suited to RS classification. The traditional RS classification method combines visual interpretation with computer classification. The accuracy of RS classification, however, is improved considerably with the SVM-based method, because it saves much of the labor and time otherwise needed to interpret images and collect training samples. Kernel functions play an important part in the SVM algorithm. The proposed method uses an improved compound kernel function and therefore achieves a higher classification accuracy on RS images. Moreover, the compound kernel improves the generalization and learning ability of the kernel.
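A hedged sketch of one form a compound kernel can take: a convex mixture of an RBF kernel (local) and a polynomial kernel (global), passed to an SVM as a custom kernel. The mixing weight and kernel parameters are illustrative, not the paper's tuned values.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def compound_kernel(X, Y, w=0.7, gamma=0.5, degree=2):
    """Weighted combination of RBF and polynomial Gram matrices."""
    return w * rbf_kernel(X, Y, gamma=gamma) + (1 - w) * polynomial_kernel(X, Y, degree=degree)

# Usage sketch: scikit-learn evaluates the callable on the training/test data.
# clf = SVC(kernel=compound_kernel, C=10.0).fit(X_train, y_train)
# labels = clf.predict(X_test)
```

Mixing a local and a global kernel is one common way to balance interpolation ability against generalization.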
Importance Sampling of Word Patterns in DNA and Protein Sequences
Chan, Hock Peng; Chen, Louis H.Y.
2010-01-01
Monte Carlo methods can provide accurate p-value estimates of word counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position-specific weight matrices (PSWMs), and co-occurrences of pairs of motifs. PMID:21128856
Van Pamel, Anton; Brett, Colin R; Lowe, Michael J S
2014-12-01
Improving the ultrasound inspection capability for coarse-grained metals remains of longstanding interest and is expected to become increasingly important for next-generation electricity power plants. Conventional ultrasonic A-, B-, and C-scans have been found to suffer from strong background noise caused by grain scattering, which can severely limit the detection of defects. However, in recent years, array probes and full matrix capture (FMC) imaging algorithms have unlocked exciting possibilities for improvements. To improve and compare these algorithms, we must rely on robust methodologies to quantify their performance. This article proposes such a methodology to evaluate the detection performance of imaging algorithms. For illustration, the methodology is applied to some example data using three FMC imaging algorithms; total focusing method (TFM), phase-coherent imaging (PCI), and decomposition of the time-reversal operator with multiple scattering filter (DORT MSF). However, it is important to note that this is solely to illustrate the methodology; this article does not attempt the broader investigation of different cases that would be needed to compare the performance of these algorithms in general. The methodology considers the statistics of detection, presenting the detection performance as probability of detection (POD) and probability of false alarm (PFA). A test sample of coarse-grained nickel super alloy, manufactured to represent materials used for future power plant components and containing some simple artificial defects, is used to illustrate the method on the candidate algorithms. The data are captured in pulse-echo mode using 64-element array probes at center frequencies of 1 and 5 MHz. In this particular case, it turns out that all three algorithms are shown to perform very similarly when comparing their flaw detection capabilities.
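A minimal sketch of the detection statistics used in such a comparison: sweep an amplitude threshold over the image values at known defect locations and at defect-free (noise-only) locations to obtain POD and PFA curves. The inputs are assumptions of this sketch; extracting the per-location amplitudes depends on the imaging setup.

```python
import numpy as np

def pod_pfa_curves(defect_amps, noise_amps, n_thresh=200):
    """defect_amps, noise_amps: image amplitudes at flawed and flaw-free locations."""
    defect_amps = np.asarray(defect_amps)
    noise_amps = np.asarray(noise_amps)
    thresholds = np.linspace(min(noise_amps.min(), defect_amps.min()),
                             defect_amps.max(), n_thresh)
    pod = np.array([(defect_amps > t).mean() for t in thresholds])   # probability of detection
    pfa = np.array([(noise_amps > t).mean() for t in thresholds])    # probability of false alarm
    return thresholds, pod, pfa
```

Plotting POD against PFA across thresholds gives an ROC-style curve on which different imaging algorithms can be compared for the same flaws.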
NASA Astrophysics Data System (ADS)
Schwartz, Craig R.; Thelen, Brian J.; Kenton, Arthur C.
1995-06-01
A statistical parametric multispectral sensor performance model was developed by ERIM to support mine field detection studies, multispectral sensor design/performance trade-off studies, and target detection algorithm development. The model assumes target detection algorithms and their performance models which are based on data assumed to obey multivariate Gaussian probability distribution functions (PDFs). The applicability of these algorithms and performance models can be generalized to data having non-Gaussian PDFs through the use of transforms which convert non-Gaussian data to Gaussian (or near-Gaussian) data. An example of one such transform is the Box-Cox power law transform. In practice, such a transform can be applied to non-Gaussian data prior to the introduction of a detection algorithm that is formally based on the assumption of multivariate Gaussian data. This paper presents an extension of these techniques to the case where the joint multivariate probability density function of the non-Gaussian input data is known, and where the joint estimate of the multivariate Gaussian statistics, under the Box-Cox transform, is desired. The jointly estimated multivariate Gaussian statistics can then be used to predict the performance of a target detection algorithm which has an associated Gaussian performance model.
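A hedged sketch of the preprocessing step described above: apply a per-band Box-Cox power transform to non-Gaussian multispectral data, then estimate the multivariate Gaussian statistics (mean vector and covariance) in the transformed space for use in a Gaussian-based detector. The joint estimation under a known input PDF developed in the paper is not reproduced here.

```python
import numpy as np
from scipy.stats import boxcox

def boxcox_gaussian_stats(X):
    """X: (n_pixels, n_bands) strictly positive data. Returns per-band lambdas, mean, covariance."""
    Xt = np.empty_like(X, dtype=float)
    lambdas = []
    for b in range(X.shape[1]):
        Xt[:, b], lam = boxcox(X[:, b])        # per-band maximum-likelihood Box-Cox transform
        lambdas.append(lam)
    return np.array(lambdas), Xt.mean(axis=0), np.cov(Xt, rowvar=False)
```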
A novel data-driven learning method for radar target detection in nonstationary environments
Akcakaya, Murat; Nehorai, Arye; Sen, Satyabrata
2016-04-12
Most existing radar algorithms are developed under the assumption that the environment (clutter) is stationary. However, in practice, the characteristics of the clutter can vary enormously depending on the radar-operational scenarios. If unaccounted for, these nonstationary variabilities may drastically hinder the radar performance. Therefore, to overcome such shortcomings, we develop a data-driven method for target detection in nonstationary environments. In this method, the radar dynamically detects changes in the environment and adapts to these changes by learning the new statistical characteristics of the environment and by intelligently updating its statistical detection algorithm. Specifically, we employ drift detection algorithms to detect changes in the environment, and incremental learning, particularly learning under concept drift algorithms, to learn the new statistical characteristics of the environment from the new radar data that become available in batches over a period of time. The newly learned environment characteristics are then integrated into the detection algorithm. Furthermore, we use Monte Carlo simulations to demonstrate that the developed method provides a significant improvement in the detection performance compared with detection techniques that are not aware of the environmental changes.
Time-of-flight PET image reconstruction using origin ensembles.
Wülker, Christian; Sitek, Arkadiusz; Prevrhal, Sven
2015-03-07
The origin ensemble (OE) algorithm is a novel statistical method for minimum-mean-square-error (MMSE) reconstruction of emission tomography data. This method allows one to perform reconstruction entirely in the image domain, i.e. without the use of forward and backprojection operations. We have investigated the OE algorithm in the context of list-mode (LM) time-of-flight (TOF) PET reconstruction. In this paper, we provide a general introduction to MMSE reconstruction, and a statistically rigorous derivation of the OE algorithm. We show how to efficiently incorporate TOF information into the reconstruction process, and how to correct for random coincidences and scattered events. To examine the feasibility of LM-TOF MMSE reconstruction with the OE algorithm, we applied MMSE-OE and standard maximum-likelihood expectation-maximization (ML-EM) reconstruction to LM-TOF phantom data with a count number typically registered in clinical PET examinations. We analyzed the convergence behavior of the OE algorithm, and compared reconstruction time and image quality to that of the EM algorithm. In summary, during the reconstruction process, MMSE-OE contrast recovery (CRV) remained approximately the same, while background variability (BV) gradually decreased with an increasing number of OE iterations. The final MMSE-OE images exhibited lower BV and a slightly lower CRV than the corresponding ML-EM images. The reconstruction time of the OE algorithm was approximately 1.3 times longer. At the same time, the OE algorithm can inherently provide a comprehensive statistical characterization of the acquired data. This characterization can be utilized for further data processing, e.g. in kinetic analysis and image registration, making the OE algorithm a promising approach in a variety of applications.
Time-of-flight PET image reconstruction using origin ensembles
NASA Astrophysics Data System (ADS)
Wülker, Christian; Sitek, Arkadiusz; Prevrhal, Sven
2015-03-01
The origin ensemble (OE) algorithm is a novel statistical method for minimum-mean-square-error (MMSE) reconstruction of emission tomography data. This method allows one to perform reconstruction entirely in the image domain, i.e. without the use of forward and backprojection operations. We have investigated the OE algorithm in the context of list-mode (LM) time-of-flight (TOF) PET reconstruction. In this paper, we provide a general introduction to MMSE reconstruction, and a statistically rigorous derivation of the OE algorithm. We show how to efficiently incorporate TOF information into the reconstruction process, and how to correct for random coincidences and scattered events. To examine the feasibility of LM-TOF MMSE reconstruction with the OE algorithm, we applied MMSE-OE and standard maximum-likelihood expectation-maximization (ML-EM) reconstruction to LM-TOF phantom data with a count number typically registered in clinical PET examinations. We analyzed the convergence behavior of the OE algorithm, and compared reconstruction time and image quality to that of the EM algorithm. In summary, during the reconstruction process, MMSE-OE contrast recovery (CRV) remained approximately the same, while background variability (BV) gradually decreased with an increasing number of OE iterations. The final MMSE-OE images exhibited lower BV and a slightly lower CRV than the corresponding ML-EM images. The reconstruction time of the OE algorithm was approximately 1.3 times longer. At the same time, the OE algorithm can inherently provide a comprehensive statistical characterization of the acquired data. This characterization can be utilized for further data processing, e.g. in kinetic analysis and image registration, making the OE algorithm a promising approach in a variety of applications.
NASA Astrophysics Data System (ADS)
Mazidi, Hesam; Nehorai, Arye; Lew, Matthew D.
2018-02-01
In single-molecule (SM) super-resolution microscopy, the complexity of a biological structure, high molecular density, and a low signal-to-background ratio (SBR) may lead to imaging artifacts without a robust localization algorithm. Moreover, engineered point spread functions (PSFs) for 3D imaging pose difficulties due to their intricate features. We develop a Robust Statistical Estimation algorithm, called RoSE, that enables joint estimation of the 3D location and photon counts of SMs accurately and precisely using various PSFs under conditions of high molecular density and low SBR.
A comparison of fitness-case sampling methods for genetic programming
NASA Astrophysics Data System (ADS)
Martínez, Yuliana; Naredo, Enrique; Trujillo, Leonardo; Legrand, Pierrick; López, Uriel
2017-11-01
Genetic programming (GP) is an evolutionary computation paradigm for automatic program induction. GP has produced impressive results but it still needs to overcome some practical limitations, particularly its high computational cost, overfitting and excessive code growth. Recently, many researchers have proposed fitness-case sampling methods to overcome some of these problems, with mixed results in several limited tests. This paper presents an extensive comparative study of four fitness-case sampling methods, namely: Interleaved Sampling, Random Interleaved Sampling, Lexicase Selection and Keep-Worst Interleaved Sampling. The algorithms are compared on 11 symbolic regression problems and 11 supervised classification problems, using 10 synthetic benchmarks and 12 real-world data-sets. They are evaluated based on test performance, overfitting and average program size, comparing them with a standard GP search. Comparisons are carried out using non-parametric multigroup tests and post hoc pairwise statistical tests. The experimental results suggest that fitness-case sampling methods are particularly useful for difficult real-world symbolic regression problems, improving performance, reducing overfitting and limiting code growth. On the other hand, it seems that fitness-case sampling cannot improve upon GP performance when considering supervised binary classification.
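For concreteness, here is a minimal sketch of lexicase selection, one of the four fitness-case sampling methods compared above; the population and error matrix are synthetic placeholders.

```python
# Hedged sketch of lexicase selection on a synthetic error matrix.
import numpy as np

def lexicase_select(errors, rng):
    """Select one parent index. errors[i, j] = error of individual i on fitness case j."""
    candidates = np.arange(errors.shape[0])
    cases = rng.permutation(errors.shape[1])        # consider fitness cases in random order
    for c in cases:
        best = errors[candidates, c].min()
        candidates = candidates[errors[candidates, c] == best]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)

rng = np.random.default_rng(0)
errors = rng.random((20, 50))                       # 20 individuals, 50 fitness cases
parents = [lexicase_select(errors, rng) for _ in range(10)]
print(parents)
```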
Zhu, Peijuan; Ding, Wei; Tong, Wei; Ghosal, Anima; Alton, Kevin; Chowdhury, Swapan
2009-06-01
A retention-time-shift-tolerant background subtraction and noise reduction algorithm (BgS-NoRA) is implemented using the statistical programming language R to remove non-drug-related ion signals from accurate mass liquid chromatography/mass spectrometry (LC/MS) data. The background-subtraction part of the algorithm is similar to a previously published procedure (Zhang H and Yang Y. J. Mass Spectrom. 2008, 43: 1181-1190). The noise reduction algorithm (NoRA) is an add-on feature to help further clean up the residual matrix ion noises after background subtraction. It functions by removing ion signals that are not consistent across many adjacent scans. The effectiveness of BgS-NoRA was examined in biological matrices by spiking blank plasma extract, bile and urine with diclofenac and ibuprofen that have been pre-metabolized by microsomal incubation. Efficient removal of background ions permitted the detection of drug-related ions in in vivo samples (plasma, bile, urine and feces) obtained from rats orally dosed with (14)C-loratadine with minimal interference. Results from these experiments demonstrate that BgS-NoRA is more effective in removing analyte-unrelated ions than background subtraction alone. NoRA is shown to be particularly effective in the early retention region for urine samples and middle retention region for bile samples, where the matrix ion signals still dominate the total ion chromatograms (TICs) after background subtraction. In most cases, the TICs after BgS-NoRA are in excellent qualitative correlation to the radiochromatograms. BgS-NoRA will be a very useful tool in metabolite detection and identification work, especially in first-in-human (FIH) studies and multiple dose toxicology studies where non-radio-labeled drugs are administered. Data from these types of studies are critical to meet the latest FDA guidance on Metabolite in Safety Testing (MIST). Copyright (c) 2009 John Wiley & Sons, Ltd.
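A conceptual sketch of the noise-reduction idea described above, removing ion signals that do not persist across adjacent scans; this is not the published R implementation, and the scan matrix and parameters are illustrative assumptions.

```python
# Conceptual sketch of the NoRA idea (not the published R code): zero out signals that
# are not consistent across a minimum number of adjacent scans.
import numpy as np

def noise_reduce(intensity, min_run=3):
    """intensity[s, m]: intensity of m/z bin m in scan s (rows are adjacent scans).
    Remove signals that do not persist for at least `min_run` consecutive scans."""
    present = (intensity > 0).astype(int)
    keep = np.zeros_like(present, dtype=bool)
    kernel = np.ones(min_run, dtype=int)
    for m in range(intensity.shape[1]):
        runs = np.convolve(present[:, m], kernel, mode="valid")   # occupancy of each window
        for start in np.where(runs == min_run)[0]:                # fully occupied windows
            keep[start:start + min_run, m] = True
    return np.where(keep, intensity, 0.0)

# Toy example: a persistent analyte trace vs. a single-scan noise spike
x = np.zeros((10, 2))
x[2:8, 0] = 5.0      # analyte-like: present in 6 adjacent scans, kept
x[4, 1] = 7.0        # noise-like: present in a single scan, removed
print(noise_reduce(x, min_run=3))
```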
Automated detection of hospital outbreaks: A systematic review of methods.
Leclère, Brice; Buckeridge, David L; Boëlle, Pierre-Yves; Astagneau, Pascal; Lepelletier, Didier
2017-01-01
Several automated algorithms for epidemiological surveillance in hospitals have been proposed. However, the usefulness of these methods to detect nosocomial outbreaks remains unclear. The goal of this review was to describe outbreak detection algorithms that have been tested within hospitals, consider how they were evaluated, and synthesize their results. We developed a search query using keywords associated with hospital outbreak detection and searched the MEDLINE database. To ensure the highest sensitivity, no limitations were initially imposed on publication languages and dates, although we subsequently excluded studies published before 2000. Every study that described a method to detect outbreaks within hospitals was included, without any exclusion based on study design. Additional studies were identified through citations in retrieved studies. Twenty-nine studies were included. The detection algorithms were grouped into 5 categories: simple thresholds (n = 6), statistical process control (n = 12), scan statistics (n = 6), traditional statistical models (n = 6), and data mining methods (n = 4). The evaluation of the algorithms was often solely descriptive (n = 15), but more complex epidemiological criteria were also investigated (n = 10). The performance measures varied widely between studies: e.g., the sensitivity of an algorithm in a real world setting could vary between 17 and 100%. Even if outbreak detection algorithms are useful complementary tools for traditional surveillance, the heterogeneity in results among published studies does not support quantitative synthesis of their performance. A standardized framework should be followed when evaluating outbreak detection methods to allow comparison of algorithms across studies and synthesis of results.
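To make the statistical process control category concrete, here is a minimal sketch of a 3-sigma control-limit detector applied to simulated weekly counts; real surveillance systems additionally handle seasonality, reporting delays, and overdispersion.

```python
# Simple control-chart style outbreak detector on simulated weekly infection counts.
import numpy as np

rng = np.random.default_rng(2)
baseline = rng.poisson(lam=4, size=104)           # two years of weekly counts
mean, sd = baseline.mean(), baseline.std(ddof=1)
upper_limit = mean + 3 * sd                       # classic 3-sigma control limit

current = np.array([3, 5, 6, 4, 12, 15, 9])       # recent weeks, with a suspicious rise
alarms = np.where(current > upper_limit)[0]
print(f"threshold = {upper_limit:.1f}, alarm weeks = {alarms.tolist()}")
```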
Making adjustments to event annotations for improved biological event extraction.
Baek, Seung-Cheol; Park, Jong C
2016-09-16
Current state-of-the-art approaches to biological event extraction train statistical models in a supervised manner on corpora annotated with event triggers and event-argument relations. Inspecting such corpora, we observe ambiguity in the span of event triggers (e.g., "transcriptional activity" vs. "transcriptional"), leading to inconsistencies across event trigger annotations. Such inconsistencies make it likely that similar phrases are annotated with different trigger spans, so a statistical learning algorithm may miss an opportunity to generalize from such event triggers. We anticipate that adjusting the spans of event triggers to reduce these inconsistencies would meaningfully improve the present performance of event extraction systems. In this study, we examine this possibility using the corpora provided by the 2009 BioNLP shared task as a proof of concept. We propose an Informed Expectation-Maximization (EM) algorithm, which trains models with the EM algorithm and a posterior regularization technique that consults the gold-standard event trigger annotations in the form of constraints. We further propose four constraints on the possible event trigger annotations to be explored by the EM algorithm. The algorithm outperforms the state-of-the-art algorithm on the development corpus in a statistically significant manner and on the test corpus by a narrow margin. Analysis of the annotations generated by the algorithm shows several types of ambiguity in event annotations, even if they are small in number.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bai, T; UT Southwestern Medical Center, Dallas, TX; Yan, H
2014-06-15
Purpose: To develop a 3D dictionary learning based statistical reconstruction algorithm on graphics processing units (GPU) to improve the quality of low-dose cone beam CT (CBCT) imaging with high efficiency. Methods: A 3D dictionary containing 256 small volumes (atoms) of 3x3x3 voxels was trained from a high-quality volume image. During reconstruction, we utilized a Cholesky-decomposition-based orthogonal matching pursuit algorithm to find a sparse representation of each patch in the reconstructed image on this dictionary basis, in order to regularize the image quality. To accelerate the time-consuming sparse coding in the 3D case, we implemented our algorithm in a parallel fashion, taking advantage of the computational power of the GPU. Evaluations were performed on a head-and-neck patient case. FDK reconstruction with the full dataset of 364 projections was used as the reference. We compared the proposed 3D dictionary learning based method with a tight frame (TF) based one using a subset of 121 projections. Image quality under different resolutions in the z direction, with and without statistical weighting, was also studied. Results: Compared to TF-based CBCT reconstruction, our experiments indicated that 3D dictionary learning based CBCT reconstruction is able to recover finer structures, remove more streaking artifacts, and is less susceptible to blocky artifacts. It was also observed that the statistical reconstruction approach is sensitive to inconsistency between the forward and backward projection operations in parallel computing, and that using a high spatial resolution along the z direction helps improve the algorithm's robustness. Conclusion: The 3D dictionary learning based CBCT reconstruction algorithm is able to capture structural information while suppressing noise, and hence achieves high-quality reconstruction. The GPU realization of the whole algorithm offers a significant efficiency enhancement, making the algorithm more feasible for potential clinical application. A high z-resolution is preferred to stabilize statistical iterative reconstruction. This work was supported in part by NIH (1R01CA154747-01), NSFC (No. 61172163), the Research Fund for the Doctoral Program of Higher Education of China (No. 20110201110011), and the China Scholarship Council.
Joint channel estimation and multi-user detection for multipath fading channels in DS-CDMA systems
NASA Astrophysics Data System (ADS)
Wu, Sau-Hsuan; Kuo, C.-C. Jay
2002-11-01
The technique of joint blind channel estimation and multiple-access interference (MAI) suppression for an asynchronous code-division multiple-access (CDMA) system is investigated in this research. To identify and track dispersive time-varying fading channels and to avoid the phase ambiguity that comes with second-order statistics approaches, a sliding-window scheme using the expectation-maximization (EM) algorithm is proposed. The complexity of joint channel equalization and symbol detection for all users increases exponentially with the system loading and the channel memory, and the situation is exacerbated if strong inter-symbol interference (ISI) exists. To reduce the complexity and the number of samples required for channel estimation, a blind multiuser detector is developed. Together with multi-stage interference cancellation using soft outputs provided by this detector, our algorithm can track fading channels with no phase ambiguity even when channel gains attenuate close to zero.
Time-lapse microscopy and image processing for stem cell research: modeling cell migration
NASA Astrophysics Data System (ADS)
Gustavsson, Tomas; Althoff, Karin; Degerman, Johan; Olsson, Torsten; Thoreson, Ann-Catrin; Thorlin, Thorleif; Eriksson, Peter
2003-05-01
This paper presents hardware and software procedures for automated cell tracking and migration modeling. A time-lapse microscopy system equipped with a computer controllable motorized stage was developed. The performance of this stage was improved by incorporating software algorithms for stage motion displacement compensation and auto focus. The microscope is suitable for in-vitro stem cell studies and allows for multiple cell culture image sequence acquisition. This enables comparative studies concerning rate of cell splits, average cell motion velocity, cell motion as a function of cell sample density and many more. Several cell segmentation procedures are described as well as a cell tracking algorithm. Statistical methods for describing cell migration patterns are presented. In particular, the Hidden Markov Model (HMM) was investigated. Results indicate that if the cell motion can be described as a non-stationary stochastic process, then the HMM can adequately model aspects of its dynamic behavior.
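A hedged sketch of fitting a hidden Markov model to simulated cell-displacement data, assuming the third-party hmmlearn package; the paper's observation model, state definitions, and tracking pipeline may differ.

```python
# Sketch of HMM-based migration modeling on simulated displacements (assumes hmmlearn).
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Simulated per-frame displacements (dx, dy) alternating between "resting" and "motile" phases
slow = rng.normal(0.0, 0.2, size=(300, 2))
fast = rng.normal(0.0, 1.5, size=(300, 2))
track = np.vstack([slow, fast, slow])

model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=200, random_state=0)
model.fit(track)
states = model.predict(track)                    # most likely state per frame (Viterbi)
print(model.transmat_.round(2))                  # estimated state transition probabilities
print(np.bincount(states))                       # time spent in each inferred state
```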
Multiscale solvers and systematic upscaling in computational physics
NASA Astrophysics Data System (ADS)
Brandt, A.
2005-07-01
Multiscale algorithms can overcome the scale-born bottlenecks that plague most computations in physics. These algorithms employ separate processing at each scale of the physical space, combined with interscale iterative interactions, in ways which use finer scales very sparingly. Having been developed first and well known as multigrid solvers for partial differential equations, highly efficient multiscale techniques have more recently been developed for many other types of computational tasks, including: inverse PDE problems; highly indefinite (e.g., standing wave) equations; Dirac equations in disordered gauge fields; fast computation and updating of large determinants (as needed in QCD); fast integral transforms; integral equations; astrophysics; molecular dynamics of macromolecules and fluids; many-atom electronic structures; global and discrete-state optimization; practical graph problems; image segmentation and recognition; tomography (medical imaging); fast Monte-Carlo sampling in statistical physics; and general, systematic methods of upscaling (accurate numerical derivation of large-scale equations from microscopic laws).
Textile Pressure Mapping Sensor for Emotional Touch Detection in Human-Robot Interaction
Cruz Zurian, Heber; Atefi, Seyed Reza; Seoane Martinez, Fernando; Lukowicz, Paul
2017-01-01
In this paper, we developed a fully textile sensing fabric for tactile touch sensing as a robot skin to detect human-robot interactions. The sensor covers a 20-by-20 cm2 area with 400 sensitive points and samples at 50 Hz per point. We defined seven gestures inspired by the social and emotional interactions of typical people-to-people or people-to-pet scenarios. We conducted two groups of mutually blinded experiments involving 29 participants in total. The data processing algorithm first reduces the spatial complexity to frame descriptors, and temporal features are then calculated through basic statistical representations and wavelet analysis. Various classifiers are evaluated, and the feature calculation algorithms are analyzed in detail to determine the contribution of each stage and segment. The best performing feature-classifier combination can recognize the gestures with 93.3% accuracy from a known group of participants, and 89.1% from strangers. PMID:29120389
Textile Pressure Mapping Sensor for Emotional Touch Detection in Human-Robot Interaction.
Zhou, Bo; Altamirano, Carlos Andres Velez; Zurian, Heber Cruz; Atefi, Seyed Reza; Billing, Erik; Martinez, Fernando Seoane; Lukowicz, Paul
2017-11-09
In this paper, we developed a fully textile sensing fabric for tactile touch sensing as a robot skin to detect human-robot interactions. The sensor covers a 20-by-20 cm2 area with 400 sensitive points and samples at 50 Hz per point. We defined seven gestures inspired by the social and emotional interactions of typical people-to-people or people-to-pet scenarios. We conducted two groups of mutually blinded experiments involving 29 participants in total. The data processing algorithm first reduces the spatial complexity to frame descriptors, and temporal features are then calculated through basic statistical representations and wavelet analysis. Various classifiers are evaluated, and the feature calculation algorithms are analyzed in detail to determine the contribution of each stage and segment. The best performing feature-classifier combination can recognize the gestures with 93.3% accuracy from a known group of participants, and 89.1% from strangers.
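As a rough sketch of the two-stage feature pipeline described above (per-frame descriptors followed by temporal statistics), the code below uses random placeholder frames; the actual descriptors and wavelet features used in the paper differ.

```python
# Rough sketch of the two-stage feature pipeline; frame data are random placeholders.
import numpy as np

def frame_descriptors(frames):
    """frames[t]: a 20x20 pressure map at time t, reduced to a per-frame summary vector."""
    total = frames.sum(axis=(1, 2))
    active = (frames > 0.1).sum(axis=(1, 2))
    peak = frames.max(axis=(1, 2))
    return np.column_stack([total, active, peak])

def temporal_features(descr):
    """Basic statistical representations over the time axis of each descriptor."""
    return np.concatenate([descr.mean(axis=0), descr.std(axis=0),
                           descr.min(axis=0), descr.max(axis=0)])

frames = np.random.default_rng(0).random((100, 20, 20))   # 2 s of data at 50 Hz
features = temporal_features(frame_descriptors(frames))
print(features.shape)        # one fixed-length vector per gesture, ready for a classifier
```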
Du, Bo; Zhang, Yuxiang; Zhang, Liangpei; Tao, Dacheng
2016-08-18
Hyperspectral images provide great potential for target detection; however, they also introduce new challenges, so hyperspectral target detection should be treated as a new problem and modeled differently. Many classical detectors have been proposed based on the linear mixing model and the sparsity model. However, the former type of model cannot deal well with spectral variability given limited endmembers, and the latter usually treats target detection as a simple classification problem and pays little attention to the low target probability. In this case, can we find an efficient way to utilize both the high-dimensional features behind hyperspectral images and the limited target information to extract small targets? This paper proposes a novel sparsity-based detector named the hybrid sparsity and statistics detector (HSSD) for target detection in hyperspectral imagery, which can effectively deal with the above two problems. The proposed algorithm designs a hypothesis-specific dictionary based on the prior hypotheses for the test pixel, which avoids an imbalanced number of training samples for a class-specific dictionary. Then, a purification process is employed for the background training samples in order to construct an effective competition between the two hypotheses. Next, a sparse-representation-based binary hypothesis model merged with additive Gaussian noise is proposed to represent the image. Finally, a generalized likelihood ratio test is performed to obtain a more robust detection decision than reconstruction-residual-based detection methods. Extensive experimental results on three hyperspectral datasets confirm that the proposed HSSD algorithm clearly outperforms the state-of-the-art target detectors.
Information processing of motion in facial expression and the geometry of dynamical systems
NASA Astrophysics Data System (ADS)
Assadi, Amir H.; Eghbalnia, Hamid; McMenamin, Brenton W.
2005-01-01
An interesting problem in the analysis of video data concerns the design of algorithms that detect perceptually significant features in an unsupervised manner, for instance machine learning methods for automatic classification of human expression. A geometric formulation of this genre of problems can be modeled with the help of perceptual psychology. In this article, we outline one approach for a special case where video segments are to be classified according to expression of emotion or other similar facial motions. The encoding of realistic facial motions that convey expression of emotions for a particular person P forms a parameter space XP whose study reveals the "objective geometry" of the problem of unsupervised feature detection from video. The geometric features and discrete representation of the space XP are independent of subjective evaluations by observers. While the "subjective geometry" of XP varies from observer to observer, levels of sensitivity and variation in perception of facial expressions appear to share a certain level of universality among members of similar cultures. Therefore, the statistical geometry of invariants of XP for a population sample could provide effective algorithms for extraction of such features. In cases where the frequency of events is sufficiently large in the sample data, a suitable framework could be provided to facilitate the information-theoretic organization and study of statistical invariants of such features. This article provides a general approach to encoding motion in terms of a particular genre of dynamical systems and the geometry of their flow. An example is provided to illustrate the general theory.
Ding, Liya; Martinez, Aleix M
2010-11-01
The appearance-based approach to face detection has seen great advances in the last several years. In this approach, we learn the image statistics describing the texture pattern (appearance) of the object class we want to detect, e.g., the face. However, this approach has had limited success in providing an accurate and detailed description of the internal facial features, i.e., eyes, brows, nose, and mouth. In general, this is due to the limited information carried by the learned statistical model. While the face template is relatively rich in texture, facial features (e.g., eyes, nose, and mouth) do not carry enough discriminative information to tell them apart from all possible background images. We resolve this problem by adding the context information of each facial feature in the design of the statistical model. In the proposed approach, the context information defines the image statistics most correlated with the surroundings of each facial component. This means that when we search for a face or facial feature, we look for those locations which most resemble the feature yet are most dissimilar to its context. This dissimilarity with the context features forces the detector to gravitate toward an accurate estimate of the position of the facial feature. Learning to discriminate between feature and context templates is difficult, however, because the context and the texture of the facial features vary widely under changing expression, pose, and illumination, and may even resemble one another. We address this problem with the use of subclass divisions. We derive two algorithms to automatically divide the training samples of each facial feature into a set of subclasses, each representing a distinct construction of the same facial component (e.g., closed versus open eyes) or its context (e.g., different hairstyles). The first algorithm is based on a discriminant analysis formulation. The second algorithm is an extension of the AdaBoost approach. We provide extensive experimental results using still images and video sequences for a total of 3,930 images. We show that the results are almost as good as those obtained with manual detection.
Latest Results From the QuakeFinder Statistical Analysis Framework
NASA Astrophysics Data System (ADS)
Kappler, K. N.; MacLean, L. S.; Schneider, D.; Bleier, T.
2017-12-01
Since 2005 QuakeFinder (QF) has acquired a unique dataset with outstanding spatial and temporal sampling of the earth's magnetic field along several active fault systems. The QF network consists of 124 stations in California and 45 stations along fault zones in Greece, Taiwan, Peru, Chile and Indonesia. Each station is equipped with three feedback induction magnetometers, two ion sensors, a 4 Hz geophone, a temperature sensor, and a humidity sensor. Data are continuously recorded at 50 Hz with GPS timing and transmitted daily to the QF data center in California for analysis. QF is attempting to detect and characterize anomalous EM activity occurring ahead of earthquakes. There have been many reports of anomalous variations in the earth's magnetic field preceding earthquakes; in particular, several authors have drawn attention to apparent anomalous pulsations seen preceding earthquakes. Studies in long-term monitoring of seismic activity are often limited by the availability of event data, and it is particularly difficult to acquire a large dataset for rigorous statistical analyses of the magnetic field near earthquake epicenters because large events are relatively rare. Since QF has recorded hundreds of earthquakes in more than 70 TB of data, we developed an automated approach for assessing the statistical significance of precursory behavior and developed an algorithm framework. Previously QF reported on the development of an algorithmic framework for data processing and hypothesis testing. The particular algorithm instance discussed here identifies and counts magnetic variations from time series data and ranks each station-day according to the aggregate number of pulses in a time window preceding the day in question. If the hypothesis is true that magnetic field activity increases over some time interval preceding earthquakes, this should reveal itself by the station-days on which earthquakes occur receiving higher ranks than they would if the ranking scheme were random. This can be analyzed using the Receiver Operating Characteristic test. In this presentation we give a status report on our latest results, largely focused on reproducibility of results, robust statistics in the presence of missing data, and exploration of optimization landscapes in our parameter space.
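A hedged sketch of the ranking-and-ROC idea described above, using simulated pulse counts and earthquake days rather than QuakeFinder data; the pulse-identification step itself is not reproduced.

```python
# Sketch: rank station-days by aggregate pulse count in a preceding window and score the
# ranking with an ROC curve. Counts and earthquake days are simulated, not QF data.
import numpy as np

rng = np.random.default_rng(3)
n_days = 1000
pulses = rng.poisson(lam=5, size=n_days).astype(float)
eq_days = rng.choice(n_days, size=20, replace=False)      # days on which earthquakes occur
window = 14                                               # days preceding the day in question

# Inject the hypothesized effect: pulse activity increases before earthquakes
for d in eq_days:
    pulses[max(0, d - window):d] += rng.poisson(lam=2, size=min(window, d))

score = np.array([pulses[max(0, d - window):d].sum() for d in range(n_days)])
labels = np.zeros(n_days, dtype=bool)
labels[eq_days] = True

# ROC by sweeping a threshold over the station-day scores
thresholds = np.unique(score)[::-1]
tpr = [(score[labels] >= t).mean() for t in thresholds]
fpr = [(score[~labels] >= t).mean() for t in thresholds]
auc = np.trapz(tpr, fpr)
print(f"AUC = {auc:.2f}  (0.5 would indicate no precursory signal)")
```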
A fast elitism Gaussian estimation of distribution algorithm and application for PID optimization.
Xu, Qingyang; Zhang, Chengjin; Zhang, Li
2014-01-01
Estimation of distribution algorithm (EDA) is an intelligent optimization algorithm based on probability statistics theory. A fast elitism Gaussian estimation of distribution algorithm (FEGEDA) is proposed in this paper. A Gaussian probability model is used to model the solution distribution, and the parameters of the Gaussian are obtained from the statistical information of the best individuals via a fast learning rule. The fast learning rule enhances the efficiency of the algorithm, and an elitism strategy is used to maintain convergence performance. The performance of the algorithm is examined on several benchmarks. In the simulations, a one-dimensional benchmark is used to visualize the optimization process and the probability-model learning process during the evolution, and several two-dimensional and higher-dimensional benchmarks are used to test the performance of FEGEDA. The experimental results indicate the capability of FEGEDA, especially on higher-dimensional problems, where FEGEDA exhibits better performance than several other algorithms and EDAs. Finally, FEGEDA is applied to PID controller optimization for a PMSM and compared with classical PID tuning and a genetic algorithm (GA).
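The following is a generic sketch of a Gaussian estimation of distribution algorithm with elitism, applied to the sphere function for illustration; FEGEDA's fast learning rule and the PID application are not reproduced.

```python
# Generic Gaussian EDA loop with elitism (a sketch, not FEGEDA itself).
import numpy as np

def gaussian_eda(objective, dim=10, pop=100, top_frac=0.3, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim) * 5.0
    elite = None
    for _ in range(iters):
        x = rng.normal(mean, std, size=(pop, dim))       # sample from the Gaussian model
        if elite is not None:
            x[0] = elite                                 # elitism: keep the best-so-far
        f = np.apply_along_axis(objective, 1, x)
        best = x[np.argsort(f)[:int(top_frac * pop)]]    # select the best individuals
        elite = x[np.argmin(f)]
        mean = best.mean(axis=0)                         # re-estimate the Gaussian model
        std = best.std(axis=0) + 1e-12
    return elite, objective(elite)

sphere = lambda v: float(np.sum(v ** 2))
x_best, f_best = gaussian_eda(sphere)
print(f_best)
```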
A Fast Elitism Gaussian Estimation of Distribution Algorithm and Application for PID Optimization
Xu, Qingyang; Zhang, Chengjin; Zhang, Li
2014-01-01
Estimation of distribution algorithm (EDA) is an intelligent optimization algorithm based on probability statistics theory. A fast elitism Gaussian estimation of distribution algorithm (FEGEDA) is proposed in this paper. A Gaussian probability model is used to model the solution distribution, and the parameters of the Gaussian are obtained from the statistical information of the best individuals via a fast learning rule. The fast learning rule enhances the efficiency of the algorithm, and an elitism strategy is used to maintain convergence performance. The performance of the algorithm is examined on several benchmarks. In the simulations, a one-dimensional benchmark is used to visualize the optimization process and the probability-model learning process during the evolution, and several two-dimensional and higher-dimensional benchmarks are used to test the performance of FEGEDA. The experimental results indicate the capability of FEGEDA, especially on higher-dimensional problems, where FEGEDA exhibits better performance than several other algorithms and EDAs. Finally, FEGEDA is applied to PID controller optimization for a PMSM and compared with classical PID tuning and a genetic algorithm (GA). PMID:24892059
Lin, Feng-Chang; Zhu, Jun
2012-01-01
We develop continuous-time models for the analysis of environmental or ecological monitoring data such that subjects are observed at multiple monitoring time points across space. Of particular interest are additive hazards regression models where the baseline hazard function can take on flexible forms. We consider time-varying covariates and take into account spatial dependence via autoregression in space and time. We develop statistical inference for the regression coefficients via partial likelihood. Asymptotic properties, including consistency and asymptotic normality, are established for parameter estimates under suitable regularity conditions. Feasible algorithms utilizing existing statistical software packages are developed for computation. We also consider a simpler additive hazards model with homogeneous baseline hazard and develop hypothesis testing for homogeneity. A simulation study demonstrates that the statistical inference using partial likelihood has sound finite-sample properties and offers a viable alternative to maximum likelihood estimation. For illustration, we analyze data from an ecological study that monitors bark beetle colonization of red pines in a plantation of Wisconsin.
NASA Astrophysics Data System (ADS)
Kim, Kyungmin; Harry, Ian W.; Hodge, Kari A.; Kim, Young-Min; Lee, Chang-Hwan; Lee, Hyun Kyu; Oh, John J.; Oh, Sang Hoon; Son, Edwin J.
2015-12-01
We apply a machine learning algorithm, the artificial neural network, to the search for gravitational-wave signals associated with short gamma-ray bursts (GRBs). The multi-dimensional samples consisting of data corresponding to the statistical and physical quantities from the coherent search pipeline are fed into the artificial neural network to distinguish simulated gravitational-wave signals from background noise artifacts. Our result shows that the data classification efficiency at a fixed false alarm probability (FAP) is improved by the artificial neural network in comparison to the conventional detection statistic. Specifically, the distance at 50% detection probability at a fixed false positive rate is increased about 8%-14% for the considered waveform models. We also evaluate a few seconds of the gravitational-wave data segment using the trained networks and obtain the FAP. We suggest that the artificial neural network can be a complementary method to the conventional detection statistic for identifying gravitational-wave signals related to the short GRBs.
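A simple sketch of the classification idea described above, feeding multi-dimensional candidate statistics to a neural network and thresholding its output at a fixed false alarm probability; the features here are synthetic and scikit-learn's MLPClassifier stands in for the paper's network.

```python
# Sketch: train an ANN on synthetic candidate features and set a threshold at a fixed FAP.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(0.0, 1.0, size=(n, 6))                 # background noise artifacts
signal = rng.normal(0.8, 1.0, size=(n, 6))                # simulated signal candidates
X = np.vstack([noise, signal])
y = np.concatenate([np.zeros(n), np.ones(n)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)

# Rank candidates by network output; threshold chosen from the background distribution
scores = clf.predict_proba(X_te)[:, 1]
fap = 0.01
threshold = np.quantile(scores[y_te == 0], 1 - fap)
detection_prob = (scores[y_te == 1] >= threshold).mean()
print(f"detection probability at FAP={fap}: {detection_prob:.2f}")
```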
Cui, Zaixu; Gong, Gaolang
2018-06-02
Individualized behavioral/cognitive prediction using machine learning (ML) regression approaches is becoming increasingly applied. The specific ML regression algorithm and sample size are two key factors that non-trivially influence prediction accuracies. However, the effects of the ML regression algorithm and sample size on individualized behavioral/cognitive prediction performance have not been comprehensively assessed. To address this issue, the present study included six commonly used ML regression algorithms: ordinary least squares (OLS) regression, least absolute shrinkage and selection operator (LASSO) regression, ridge regression, elastic-net regression, linear support vector regression (LSVR), and relevance vector regression (RVR), to perform specific behavioral/cognitive predictions based on different sample sizes. Specifically, the publicly available resting-state functional MRI (rs-fMRI) dataset from the Human Connectome Project (HCP) was used, and whole-brain resting-state functional connectivity (rsFC) or rsFC strength (rsFCS) were extracted as prediction features. Twenty-five sample sizes (ranging from 20 to 700) were studied by sub-sampling from the entire HCP cohort. The analyses showed that rsFC-based LASSO regression performed remarkably worse than the other algorithms, and rsFCS-based OLS regression performed markedly worse than the other algorithms. Regardless of the algorithm and feature type, both the prediction accuracy and its stability exponentially increased with increasing sample size. The specific patterns of the observed algorithm and sample size effects were well replicated in the prediction using retest fMRI data, data processed by different imaging preprocessing schemes, and different behavioral/cognitive scores, thus indicating excellent robustness/generalization of the effects. The current findings provide critical insight into how the selected ML regression algorithm and sample size influence individualized predictions of behavior/cognition and offer important guidance for choosing the ML regression algorithm or sample size in relevant investigations. Copyright © 2018 Elsevier Inc. All rights reserved.
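A hedged sketch of the sub-sampling experiment described above, using a synthetic regression dataset in place of the HCP connectivity features and showing only ridge regression, with reduced sample sizes and repetitions for brevity.

```python
# Sketch: measure how cross-validated prediction accuracy changes with sample size.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=700, n_features=300, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

for n in (20, 50, 100, 200, 400, 700):
    accs = []
    for _ in range(10):                                     # repeated sub-sampling
        idx = rng.choice(len(y), size=n, replace=False)
        r = cross_val_score(Ridge(alpha=1.0), X[idx], y[idx], cv=5, scoring="r2")
        accs.append(r.mean())
    print(f"n = {n:3d}:  mean R^2 = {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
```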
van Solm, Alexandra I T; Hirdes, John P; Eckel, Leslie A; Heckman, George A; Bigelow, Philip L
Several studies have shown the increased vulnerability of and disproportionate mortality rate among frail community-dwelling older adults as a result of emergencies and disasters. This article discusses the applicability of the Vulnerable Persons at Risk (VPR) and VPR Plus decision support algorithms, designed based on the Resident Assessment Instrument-Home Care (RAI-HC), to identify the most vulnerable community-dwelling (older) adults. A sample was taken from the Ontario RAI-HC database by selecting unique home care clients with assessments closest to December 31, 2014 (N = 275,797). Statistical methods used include cross tabulation and bivariate logistic regression, as well as Kaplan-Meier survival plotting and Cox proportional hazards ratio calculations. The VPR and VPR Plus algorithms were highly predictive of mortality, long-term care admission and hospitalization in ordinary circumstances. This provides a good indication of the strength of the algorithms in identifying vulnerable persons at times of emergencies. Access to real-time person-level information on persons with functional care needs is a vital enabler for emergency responders in prioritizing and allocating resources during a disaster, and has great utility for emergency planning and recovery efforts. The development of valid and reliable algorithms supports the rapid identification of and response to vulnerable community-dwelling persons for all phases of emergency management.
Cell Membrane Tracking in Living Brain Tissue Using Differential Interference Contrast Microscopy.
Lee, John; Kolb, Ilya; Forest, Craig R; Rozell, Christopher J
2018-04-01
Differential interference contrast (DIC) microscopy is widely used for observing unstained biological samples that are otherwise optically transparent. Combining this optical technique with machine vision could enable the automation of many life science experiments; however, identifying relevant features under DIC is challenging. In particular, precise tracking of cell boundaries in a thick slice of tissue has not previously been accomplished. We present a novel deconvolution algorithm that achieves state-of-the-art performance at identifying and tracking these membrane locations. Our proposed algorithm is formulated as a regularized least squares optimization that incorporates a filtering mechanism to handle organic tissue interference and a robust edge-sparsity regularizer that integrates dynamic edge tracking capabilities. As a secondary contribution, this paper also describes new community infrastructure in the form of a MATLAB toolbox for accurately simulating DIC microscopy images of in vitro brain slices. Building on existing DIC optics modeling, our simulation framework additionally contributes an accurate representation of interference from organic tissue, neuronal cell shapes, and tissue motion due to the action of the pipette. This simulator allows us to better understand the image statistics (to improve algorithms), as well as to quantitatively test cell segmentation and tracking algorithms in scenarios where ground truth data are fully known.
Empirical Testing of an Algorithm for Defining Somatization in Children
Eisman, Howard D.; Fogel, Joshua; Lazarovich, Regina; Pustilnik, Inna
2007-01-01
Introduction: A previous article proposed an algorithm for defining somatization in children by classifying them into three categories: well, medically ill, and somatizer; the authors suggested further empirical validation of the algorithm (Postilnik et al., 2006). We use the Child Behavior Checklist (CBCL) to provide this empirical validation. Method: Parents of children seen in pediatric clinics completed the CBCL (n=126). The physicians of these children completed specially designed questionnaires. The sample comprised 62 boys and 64 girls (age range 2 to 15 years). Classification categories included: well (n=53), medically ill (n=55), and somatizer (n=18). Analysis of variance (ANOVA) was used for statistical comparisons. Discriminant function analysis was conducted with the CBCL subscales. Results: There were significant differences between the classification categories on the somatic complaints (p<0.001), social problems (p=0.004), thought problems (p=0.01), attention problems (p=0.006), and internalizing (p=0.003) subscales, and also on the total (p=0.001) and total-t (p=0.001) scales of the CBCL. Discriminant function analysis showed that 78% of somatizers and 66% of well children were accurately classified, while only 35% of medically ill children were accurately classified. Conclusion: The somatization classification algorithm proposed by Postilnik et al. (2006) shows promise for the classification of children and adolescents with somatic symptoms. PMID:18421368
Bayesian Analysis of High Dimensional Classification
NASA Astrophysics Data System (ADS)
Mukhopadhyay, Subhadeep; Liang, Faming
2009-12-01
Modern data mining and bioinformatics have presented an important playground for statistical learning techniques, where the number of input variables may be much larger than the sample size of the training data. In supervised learning, logistic regression or probit regression can be used to model a binary output and form perceptron classification rules based on Bayesian inference. In these cases there is considerable interest in searching for sparse models in the high-dimensional regression/classification setup. We first discuss two common challenges for analyzing high-dimensional data. The first is the curse of dimensionality: the complexity of many existing algorithms scales exponentially with the dimensionality of the space, so the algorithms soon become computationally intractable and therefore inapplicable in many real applications. The second is multicollinearity among the predictors, which severely slows down the algorithms. To make Bayesian analysis operational in high dimensions, we propose a novel Hierarchical Stochastic Approximation Monte Carlo (HSAMC) algorithm, which overcomes the curse of dimensionality and the multicollinearity of predictors in high dimensions, and also possesses a self-adjusting mechanism to avoid local minima separated by high energy barriers. Models and methods are illustrated by simulations inspired by the field of genomics. Numerical results indicate that HSAMC can work as a general model selection sampler in high-dimensional complex model spaces.
Graph embedding and extensions: a general framework for dimensionality reduction.
Yan, Shuicheng; Xu, Dong; Zhang, Benyu; Zhang, Hong-Jiang; Yang, Qiang; Lin, Stephen
2007-01-01
Over the past few decades, a large family of algorithms, supervised or unsupervised and stemming from statistics or geometry theory, has been designed to provide different solutions to the problem of dimensionality reduction. Despite the different motivations of these algorithms, we present in this paper a general formulation known as graph embedding to unify them within a common framework. In graph embedding, each algorithm can be considered as the direct graph embedding, or its linear/kernel/tensor extension, of a specific intrinsic graph that describes certain desired statistical or geometric properties of a data set, with constraints from scale normalization or a penalty graph that characterizes a statistical or geometric property to be avoided. Furthermore, the graph embedding framework can be used as a general platform for developing new dimensionality reduction algorithms. Utilizing this framework as a tool, we propose a new supervised dimensionality reduction algorithm called Marginal Fisher Analysis (MFA), in which the intrinsic graph characterizes the intraclass compactness and connects each data point with its neighboring points of the same class, while the penalty graph connects the marginal points and characterizes the interclass separability. We show that MFA effectively overcomes the limitations of the traditional Linear Discriminant Analysis (LDA) algorithm due to its data distribution assumptions and available projection directions. Real face recognition experiments show the superiority of our proposed MFA in comparison to LDA, also for the corresponding kernel and tensor extensions.
An Efficient Augmented Lagrangian Method for Statistical X-Ray CT Image Reconstruction.
Li, Jiaojiao; Niu, Shanzhou; Huang, Jing; Bian, Zhaoying; Feng, Qianjin; Yu, Gaohang; Liang, Zhengrong; Chen, Wufan; Ma, Jianhua
2015-01-01
Statistical iterative reconstruction (SIR) for X-ray computed tomography (CT) under penalized weighted least-squares criteria can yield significant gains over conventional analytical reconstruction from noisy measurements. However, due to the nonlinear expression of the objective function, most existing algorithms related to SIR unavoidably suffer from a heavy computational load and a slow convergence rate, especially when an edge-preserving or sparsity-based penalty or regularization is incorporated. In this work, to address the above-mentioned issues of general SIR algorithms, we propose an adaptive nonmonotone alternating direction algorithm in the framework of the augmented Lagrangian multiplier method, termed "ALM-ANAD". The algorithm effectively combines an alternating direction technique with an adaptive nonmonotone line search to minimize the augmented Lagrangian function at each iteration. To evaluate the ALM-ANAD algorithm, both qualitative and quantitative studies were conducted using digital and physical phantoms. Experimental results show that the ALM-ANAD algorithm achieves noticeable gains over the classical nonlinear conjugate gradient algorithm and the state-of-the-art split Bregman algorithm in terms of noise reduction, contrast-to-noise ratio, convergence rate, and universal quality index metrics.
The Mucciardi-Gose Clustering Algorithm and Its Applications in Automatic Pattern Recognition.
A procedure known as the Mucciardi-Gose clustering algorithm, CLUSTR, for determining the geometrical or statistical relationships among groups of N... discussion of clustering algorithms is given; the particular advantages of the Mucciardi-Gose procedure are described. The mathematical basis for, and the
Vectorized Rebinning Algorithm for Fast Data Down-Sampling
NASA Technical Reports Server (NTRS)
Dean, Bruce; Aronstein, David; Smith, Jeffrey
2013-01-01
A vectorized rebinning (down-sampling) algorithm, applicable to N-dimensional data sets, has been developed that offers a significant reduction in computer run time when compared to conventional rebinning algorithms. For clarity, a two-dimensional version of the algorithm is discussed to illustrate some specific details of the algorithm content, and using the language of image processing, 2D data will be referred to as "images," and each value in an image as a "pixel." The new approach is fully vectorized, i.e., the down-sampling procedure is done as a single step over all image rows, and then as a single step over all image columns. Data rebinning (or down-sampling) is a procedure that uses a discretely sampled N-dimensional data set to create a representation of the same data, but with fewer discrete samples. Such data down-sampling is fundamental to digital signal processing, e.g., for data compression applications.
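A common NumPy idiom for vectorized 2D rebinning in the spirit of the approach described above (reshape so that each output pixel's contributing block becomes its own axes, then reduce once over those axes); this is a sketch, not the reported implementation.

```python
# Vectorized 2D rebinning sketch: block-reshape then a single reduction.
import numpy as np

def rebin2d(image, factor_rows, factor_cols, reducer=np.mean):
    """Down-sample `image` by integer factors along each axis (dimensions must divide evenly)."""
    r, c = image.shape
    blocks = image.reshape(r // factor_rows, factor_rows,
                           c // factor_cols, factor_cols)
    return reducer(blocks, axis=(1, 3))             # one reduction over all rows and columns

img = np.arange(64, dtype=float).reshape(8, 8)
print(rebin2d(img, 2, 4))                           # 4x2 output of block means
```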
Azad, Ariful; Rajwa, Bartek; Pothen, Alex
2016-08-31
We describe algorithms for discovering immunophenotypes from large collections of flow cytometry samples and using them to organize the samples into a hierarchy based on phenotypic similarity. The hierarchical organization is helpful for effective and robust cytometry data mining, including the creation of collections of cell populations characteristic of different classes of samples, robust classification, and anomaly detection. We summarize a set of samples belonging to a biological class or category with a statistically derived template for the class. Whereas individual samples are represented in terms of their cell populations (clusters), a template consists of generic meta-populations (a group of homogeneous cell populations obtained from the samples in a class) that describe key phenotypes shared among all those samples. We organize an FC data collection in a hierarchical data structure that supports the identification of immunophenotypes relevant to clinical diagnosis. A robust template-based classification scheme is also developed, but our primary focus is the discovery of phenotypic signatures and inter-sample relationships in an FC data collection. This collective analysis approach is more efficient and robust since templates describe phenotypic signatures common to cell populations in several samples while ignoring noise and small sample-specific variations. We have applied the template-based scheme to analyze several datasets, including one representing a healthy immune system and one of acute myeloid leukemia (AML) samples. The last task is challenging due to the phenotypic heterogeneity of the several subtypes of AML. However, we identified thirteen immunophenotypes corresponding to subtypes of AML and were able to distinguish acute promyelocytic leukemia (APL) samples with the markers provided. Clinically, this is helpful since APL has a different treatment regimen from other subtypes of AML. Core algorithms used in our data analysis are available in the flowMatch package at www.bioconductor.org. It has been downloaded nearly 6,000 times since 2014.
An Efficient MCMC Algorithm to Sample Binary Matrices with Fixed Marginals
ERIC Educational Resources Information Center
Verhelst, Norman D.
2008-01-01
Uniform sampling of binary matrices with fixed margins is known to be a difficult problem. Two classes of algorithms for sampling from a distribution not too different from the uniform are studied in the literature: importance sampling and Markov chain Monte Carlo (MCMC). Existing MCMC algorithms converge slowly, require a long burn-in period and yield…
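For illustration, here is a minimal sketch of the classic checkerboard-swap MCMC move for binary matrices with fixed row and column sums; burn-in, convergence tuning, and the importance-sampling alternative discussed above are not addressed.

```python
# Checkerboard-swap MCMC sketch: each accepted move flips a 2x2 checkerboard submatrix,
# which preserves all row and column margins.
import numpy as np

def swap_step(m, rng):
    """Pick two rows and two columns; if the 2x2 submatrix is a checkerboard, flip it."""
    (r1, r2), (c1, c2) = (rng.choice(m.shape[0], 2, replace=False),
                          rng.choice(m.shape[1], 2, replace=False))
    sub = m[np.ix_([r1, r2], [c1, c2])]
    if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
        m[np.ix_([r1, r2], [c1, c2])] = 1 - sub
    return m

rng = np.random.default_rng(0)
m = (rng.random((10, 8)) < 0.4).astype(int)
row_sums, col_sums = m.sum(1).copy(), m.sum(0).copy()
for _ in range(10_000):
    m = swap_step(m, rng)
assert (m.sum(1) == row_sums).all() and (m.sum(0) == col_sums).all()   # margins preserved
print(m)
```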
Bayesian analysis of the flutter margin method in aeroelasticity
Khalil, Mohammad; Poirel, Dominique; Sarkar, Abhijit
2016-08-27
A Bayesian statistical framework is presented for the Zimmerman and Weissenburger flutter margin method that considers the uncertainties in aeroelastic modal parameters. The proposed methodology overcomes the limitations of the previously developed least-squares based estimation technique, which relies on a Gaussian approximation of the flutter margin probability density function (pdf). Using the measured free-decay responses at subcritical (preflutter) airspeeds, the joint non-Gaussian posterior pdf of the modal parameters is sampled using the Metropolis-Hastings (MH) Markov chain Monte Carlo (MCMC) algorithm. The posterior MCMC samples of the modal parameters are then used to obtain the flutter margin pdfs and finally the flutter speed pdf. The usefulness of the Bayesian flutter margin method is demonstrated using synthetic data generated from a two-degree-of-freedom pitch-plunge aeroelastic model. The robustness of the statistical framework is demonstrated using different sets of measurement data. In conclusion, it is shown that the probabilistic (Bayesian) approach reduces the number of test points required to provide a flutter speed estimate for a given accuracy and precision.
Comparison of sampling techniques for Bayesian parameter estimation
NASA Astrophysics Data System (ADS)
Allison, Rupert; Dunkley, Joanna
2014-02-01
The posterior probability distribution for a set of model parameters encodes all that the data have to tell us in the context of a given model; it is the fundamental quantity for Bayesian parameter estimation. In order to infer the posterior probability distribution we have to decide how to explore parameter space. Here we compare three prescriptions for how parameter space is navigated, discussing their relative merits. We consider Metropolis-Hastings sampling, nested sampling and affine-invariant ensemble Markov chain Monte Carlo (MCMC) sampling. We focus on their performance on toy-model Gaussian likelihoods and on a real-world cosmological data set. We outline the sampling algorithms themselves and elaborate on performance diagnostics such as convergence time, scope for parallelization, dimensional scaling, requisite tunings and suitability for non-Gaussian distributions. We find that nested sampling delivers high-fidelity estimates for posterior statistics at low computational cost, and should be adopted in favour of Metropolis-Hastings in many cases. Affine-invariant MCMC is competitive when computing clusters can be utilized for massive parallelization. Affine-invariant MCMC and existing extensions to nested sampling naturally probe multimodal and curving distributions.
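A bare-bones Metropolis-Hastings sampler of the kind compared above, targeting a two-dimensional Gaussian log-posterior; step-size tuning, convergence diagnostics, and the nested and ensemble alternatives discussed in the paper are omitted.

```python
# Minimal Metropolis-Hastings sampler with a symmetric Gaussian proposal.
import numpy as np

def log_post(theta):
    return -0.5 * np.sum(theta ** 2)                 # standard bivariate Gaussian target

def metropolis_hastings(log_post, n_steps=20_000, step=0.8, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    lp = log_post(theta)
    chain = np.empty((n_steps, 2))
    accepted = 0
    for i in range(n_steps):
        prop = theta + step * rng.normal(size=2)     # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:      # accept with probability min(1, ratio)
            theta, lp = prop, lp_prop
            accepted += 1
        chain[i] = theta
    return chain, accepted / n_steps

chain, acc_rate = metropolis_hastings(log_post)
print(acc_rate, chain.mean(axis=0), chain.std(axis=0))   # posterior mean ~0, std ~1
```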
Non-convex Statistical Optimization for Sparse Tensor Graphical Model
Sun, Wei; Wang, Zhaoran; Liu, Han; Cheng, Guang
2016-01-01
We consider the estimation of sparse graphical models that characterize the dependency structure of high-dimensional tensor-valued data. To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. The penalized maximum likelihood estimation of this model involves minimizing a non-convex objective function. In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with the optimal statistical rate of convergence as well as consistent graph recovery. Notably, such an estimator achieves estimation consistency with only one tensor sample, which is unobserved in previous work. Our theoretical results are backed by thorough numerical studies. PMID:28316459
Large-Scale Optimization for Bayesian Inference in Complex Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Willcox, Karen; Marzouk, Youssef
2013-11-12
The SAGUARO (Scalable Algorithms for Groundwater Uncertainty Analysis and Robust Optimization) Project focused on the development of scalable numerical algorithms for large-scale Bayesian inversion in complex systems that capitalize on advances in large-scale simulation-based optimization and inversion methods. The project was a collaborative effort among MIT, the University of Texas at Austin, Georgia Institute of Technology, and Sandia National Laboratories. The research was directed in three complementary areas: efficient approximations of the Hessian operator, reductions in complexity of forward simulations via stochastic spectral approximations and model reduction, and employing large-scale optimization concepts to accelerate sampling. The MIT-Sandia component of the SAGUARO Project addressed the intractability of conventional sampling methods for large-scale statistical inverse problems by devising reduced-order models that are faithful to the full-order model over a wide range of parameter values; sampling then employs the reduced model rather than the full model, resulting in very large computational savings. Results indicate little effect on the computed posterior distribution. On the other hand, in the Texas-Georgia Tech component of the project, we retain the full-order model, but exploit inverse problem structure (adjoint-based gradients and partial Hessian information of the parameter-to-observation map) to implicitly extract lower dimensional information on the posterior distribution; this greatly speeds up sampling methods, so that fewer sampling points are needed. We can think of these two approaches as "reduce then sample" and "sample then reduce." In fact, these two approaches are complementary, and can be used in conjunction with each other. Moreover, they both exploit deterministic inverse problem structure, in the form of adjoint-based gradient and Hessian information of the underlying parameter-to-observation map, to achieve their speedups.
Final Report: Large-Scale Optimization for Bayesian Inference in Complex Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ghattas, Omar
2013-10-15
The SAGUARO (Scalable Algorithms for Groundwater Uncertainty Analysis and Robust Optimization) Project focuses on the development of scalable numerical algorithms for large-scale Bayesian inversion in complex systems that capitalize on advances in large-scale simulation-based optimization and inversion methods. Our research is directed in three complementary areas: efficient approximations of the Hessian operator, reductions in complexity of forward simulations via stochastic spectral approximations and model reduction, and employing large-scale optimization concepts to accelerate sampling. Our efforts are integrated in the context of a challenging testbed problem that considers subsurface reacting flow and transport. The MIT component of the SAGUARO Project addresses the intractability of conventional sampling methods for large-scale statistical inverse problems by devising reduced-order models that are faithful to the full-order model over a wide range of parameter values; sampling then employs the reduced model rather than the full model, resulting in very large computational savings. Results indicate little effect on the computed posterior distribution. On the other hand, in the Texas-Georgia Tech component of the project, we retain the full-order model, but exploit inverse problem structure (adjoint-based gradients and partial Hessian information of the parameter-to-observation map) to implicitly extract lower dimensional information on the posterior distribution; this greatly speeds up sampling methods, so that fewer sampling points are needed. We can think of these two approaches as "reduce then sample" and "sample then reduce." In fact, these two approaches are complementary, and can be used in conjunction with each other. Moreover, they both exploit deterministic inverse problem structure, in the form of adjoint-based gradient and Hessian information of the underlying parameter-to-observation map, to achieve their speedups.
Detecting chaos in irregularly sampled time series.
Kulp, C W
2013-09-01
Recently, Wiebe and Virgin [Chaos 22, 013136 (2012)] developed an algorithm which detects chaos by analyzing a time series' power spectrum which is computed using the Discrete Fourier Transform (DFT). Their algorithm, like other time series characterization algorithms, requires that the time series be regularly sampled. Real-world data, however, are often irregularly sampled, thus making the detection of chaotic behavior difficult or impossible with those methods. In this paper, a characterization algorithm is presented which effectively detects chaos in irregularly sampled time series. The work presented here is a modification of Wiebe and Virgin's algorithm and uses the Lomb-Scargle Periodogram (LSP) to compute a series' power spectrum instead of the DFT. The DFT is not appropriate for irregularly sampled time series. However, the LSP is capable of computing the frequency content of irregularly sampled data. Furthermore, a new method of analyzing the power spectrum is developed, which can be useful for differentiating between chaotic and non-chaotic behavior. The new characterization algorithm is successfully applied to irregularly sampled data generated by a model as well as data consisting of observations of variable stars.
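As a small illustration of the spectral tool involved, the snippet below computes a Lomb-Scargle periodogram for an irregularly sampled, noisy two-tone signal with SciPy. The signal and frequency grid are assumptions; the chaos-detection statistic built on top of the spectrum in the paper is not reproduced here.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)

# Irregular sampling times and a noisy two-tone signal (illustrative only).
t = np.sort(rng.uniform(0.0, 50.0, size=400))
y = np.sin(2 * np.pi * 0.7 * t) + 0.5 * np.sin(2 * np.pi * 1.9 * t) + 0.2 * rng.standard_normal(t.size)
y -= y.mean()  # lombscargle assumes a zero-mean series

# Angular frequency grid and normalized periodogram.
freqs = np.linspace(0.05, 3.0, 2000) * 2 * np.pi
power = lombscargle(t, y, freqs, normalize=True)
print("dominant frequency (Hz):", freqs[np.argmax(power)] / (2 * np.pi))
```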
ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.
Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W; Francis, Suzanna C; Fraser, Louise J; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka
2015-01-01
Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach in which we use a standard K-means clustering algorithm to partition a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of only a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
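A minimal sketch of the pre-processing step is shown below: reads are mapped to k-mer frequency vectors, partitioned with K-means, and summarized by per-cluster mean vectors and weights. The toy reads, k = 4, and the number of clusters are assumptions, and the downstream composition estimation that consumes these summaries is not shown.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

def kmer_frequencies(read, k=4):
    """Normalized k-mer frequency vector for a single read (A/C/G/T only)."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

# Toy reads; in practice these would be the 16S amplicon reads of one sample.
rng = np.random.default_rng(0)
reads = ["".join(rng.choice(list("ACGT"), size=150)) for _ in range(200)]
X = np.array([kmer_frequencies(r) for r in reads])

# Partition the reads and keep one first-order summary (mean k-mer vector) per cluster,
# instead of a single summary for the whole sample.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
cluster_summaries = np.array([X[km.labels_ == c].mean(axis=0) for c in range(8)])
cluster_weights = np.bincount(km.labels_, minlength=8) / len(reads)
print(cluster_summaries.shape, cluster_weights)
```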
A Method for the Evaluation of Thousands of Automated 3D Stem Cell Segmentations
Bajcsy, Peter; Simon, Mylene; Florczyk, Stephen; Simon, Carl G.; Juba, Derek; Brady, Mary
2016-01-01
There is no segmentation method that performs perfectly with any data set in comparison to human segmentation. Evaluation procedures for segmentation algorithms become critical for their selection. The problems associated with segmentation performance evaluations and visual verification of segmentation results are exaggerated when dealing with thousands of 3D image volumes because of the amount of computation and manual inputs needed. We address the problem of evaluating 3D segmentation performance when segmentation is applied to thousands of confocal microscopy images (z-stacks). Our approach is to incorporate experimental imaging and geometrical criteria, and map them into computationally efficient segmentation algorithms that can be applied to a very large number of z-stacks. This is an alternative approach to considering existing segmentation methods and evaluating most state-of-the-art algorithms. We designed a methodology for 3D segmentation performance characterization that consists of design, evaluation and verification steps. The characterization integrates manual inputs from projected surrogate “ground truth” of statistically representative samples and from visual inspection into the evaluation. The novelty of the methodology lies in (1) designing candidate segmentation algorithms by mapping imaging and geometrical criteria into algorithmic steps, and constructing plausible segmentation algorithms with respect to the order of algorithmic steps and their parameters, (2) evaluating segmentation accuracy using samples drawn from probability distribution estimates of candidate segmentations, and (3) minimizing human labor needed to create surrogate “truth” by approximating z-stack segmentations with 2D contours from three orthogonal z-stack projections and by developing visual verification tools. We demonstrate the methodology by applying it to a dataset of 1253 mesenchymal stem cells. The cells reside on 10 different types of biomaterial scaffolds, and are stained for actin and nucleus yielding 128 460 image frames (on average 125 cells/scaffold × 10 scaffold types × 2 stains × 51 frames/cell). After constructing and evaluating six candidate 3D segmentation algorithms, the most accurate 3D segmentation algorithm achieved an average precision of 0.82 and an accuracy of 0.84 as measured by the Dice similarity index, where values greater than 0.7 indicate a good spatial overlap. The probability of segmentation success was 0.85 based on visual verification, and the computation time to process all z-stacks was 42.3 h. While the most accurate segmentation technique was 4.2 times slower than the second most accurate algorithm, it consumed on average 9.65 times less memory per z-stack segmentation. PMID:26268699
NASA Astrophysics Data System (ADS)
Obozov, A. A.; Serpik, I. N.; Mihalchenko, G. S.; Fedyaeva, G. A.
2017-01-01
In this article, the application of pattern recognition (a relatively young area of engineering cybernetics) to the analysis of complicated technical systems is examined. It is shown that a statistical approach could be the most effective for hard-to-distinguish situations. The different recognition algorithms are based on the Bayes approach, which estimates the posterior probability of a certain event and an assumed error. The statistical approach to pattern recognition can be applied to the technical diagnosis of complicated systems, in particular high-powered marine diesel engines.
Truth, Damn Truth, and Statistics
ERIC Educational Resources Information Center
Velleman, Paul F.
2008-01-01
Statisticians and Statistics teachers often have to push back against the popular impression that Statistics teaches how to lie with data. Those who believe incorrectly that Statistics is solely a branch of Mathematics (and thus algorithmic) often see the use of judgment in Statistics as evidence that we do indeed manipulate our results. In the…
Missing value imputation for microarray data: a comprehensive comparison study and a web tool.
Chiu, Chia-Chun; Chan, Shih-Yao; Wang, Chung-Ching; Wu, Wei-Sheng
2013-01-01
Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. Existing studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of dataset but not on the species the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices to handle missing values for most of the microarray datasets. In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.
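For readers unfamiliar with the task, the snippet below runs a generic K-nearest-neighbour imputation on a toy expression matrix and scores it with a normalized RMSE on the artificially removed entries; it is an illustration of the imputation-and-evaluation workflow, not one of the nine algorithms ranked in the study, and the data are synthetic.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Toy "microarray" matrix: 100 genes x 20 samples with 5% of values missing at random.
expr = rng.normal(size=(100, 20))
mask = rng.uniform(size=expr.shape) < 0.05
observed = expr.copy()
observed[mask] = np.nan

# Impute each missing value from the k most similar genes (rows).
imputed = KNNImputer(n_neighbors=10).fit_transform(observed)

# Normalized RMSE on the artificially removed entries, a common statistical measure.
nrmse = np.sqrt(np.mean((imputed[mask] - expr[mask]) ** 2)) / expr[mask].std()
print("NRMSE:", round(float(nrmse), 3))
```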
Ladar range image denoising by a nonlocal probability statistics algorithm
NASA Astrophysics Data System (ADS)
Xia, Zhi-Wei; Li, Qi; Xiong, Zhi-Peng; Wang, Qi
2013-01-01
According to the characteristics of range images from coherent ladar and on the basis of nonlocal means (NLM), a nonlocal probability statistics (NLPS) algorithm is proposed in this paper. The difference is that NLM performs denoising using the mean of the conditional probability distribution function (PDF) while NLPS uses the maximum of the marginal PDF. In the algorithm, similar blocks are found by block matching and form a group. Pixels in the group are analyzed by probability statistics and the gray value with maximum probability is used as the estimated value of the current pixel. Simulated range images of coherent ladar with different carrier-to-noise ratios and a real range image of coherent ladar with 8 gray-scales are denoised by this algorithm, and the results are compared with those of the median filter, the multitemplate order mean filter, NLM, the median nonlocal mean filter and its incorporation of anatomical side information, and the unsupervised information-theoretic adaptive filter. The range abnormality noise and Gaussian noise in range images of coherent ladar are effectively suppressed by NLPS.
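The sketch below conveys the core idea in a much-simplified form: for each pixel, gather the most similar blocks in a local search window, pool their center gray values, and take the value with maximum estimated marginal probability. Window sizes, the number of similar blocks, and the toy image are assumptions, and the authors' full block-matching implementation is not reproduced.

```python
import numpy as np

def nlps_denoise(img, patch=3, search=7, n_similar=8, levels=256):
    """Simplified nonlocal probability-statistics denoising for an integer-valued
    range image: estimate each pixel as the mode (maximum of the marginal PDF) of
    the gray values gathered from the most similar patches in a search window."""
    pad = search // 2
    padded = np.pad(img, pad, mode="reflect")
    half = patch // 2
    out = np.empty_like(img)
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            yc, xc = y + pad, x + pad
            ref = padded[yc - half:yc + half + 1, xc - half:xc + half + 1]
            candidates = []
            for dy in range(-pad + half, pad - half + 1):
                for dx in range(-pad + half, pad - half + 1):
                    blk = padded[yc + dy - half:yc + dy + half + 1,
                                 xc + dx - half:xc + dx + half + 1]
                    dist = np.sum((blk.astype(float) - ref.astype(float)) ** 2)
                    candidates.append((dist, padded[yc + dy, xc + dx]))
            candidates.sort(key=lambda c: c[0])
            values = [v for _, v in candidates[:n_similar]]
            out[y, x] = np.bincount(values, minlength=levels).argmax()  # marginal-PDF maximum
    return out

# Illustrative 8-gray-level image with impulsive "range anomaly" noise.
rng = np.random.default_rng(0)
clean = np.tile(np.repeat(np.arange(8), 8), (64, 1)).astype(np.int64)
noisy = clean.copy()
spikes = rng.uniform(size=clean.shape) < 0.1
noisy[spikes] = rng.integers(0, 8, size=spikes.sum())
print("fraction of pixels recovered:", np.mean(nlps_denoise(noisy, levels=8) == clean))
```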
Anandakrishnan, Ramu; Onufriev, Alexey
2008-03-01
In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calculations are computationally intractable since they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to numerically approximate these sums. The most basic clustering algorithms first sub-divide the system into a set of smaller subsets (clusters). Then, interactions between particles within each cluster are treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, making the summation over these microstates tractable. These algorithms have been previously used for biomolecular computations, but remain relatively unexplored in this context. Presented here is a theoretical analysis of the error and computational complexity for the two most basic clustering algorithms that were previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive, error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, it is the root mean square error, which can be significantly lower than the error bound, that may be more important. We show that there is a strong empirical relationship between the error bound and the root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms for practical applications. An example of error analysis for such an application, the computation of the average charge of ionizable amino acids in proteins, is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.
NASA Astrophysics Data System (ADS)
Abedini, M. J.; Nasseri, M.; Burn, D. H.
2012-04-01
In any geostatistical study, an important consideration is the choice of an appropriate, repeatable, and objective search strategy that controls the nearby samples to be included in the location-specific estimation procedure. Almost all geostatistical software available in the market puts the onus on the user to supply search strategy parameters in a heuristic manner. These parameters are solely controlled by geographical coordinates that are defined for the entire area under study, and the user has no guidance as to how to choose these parameters. The main thesis of the current study is that the selection of search strategy parameters has to be driven by data—both the spatial coordinates and the sample values—and cannot be chosen beforehand. For this purpose, a genetic-algorithm-based ordinary kriging with moving neighborhood technique is proposed. The search capability of a genetic algorithm is exploited to search the feature space for appropriate, either local or global, search strategy parameters. Radius of circle/sphere and/or radii of standard or rotated ellipse/ellipsoid are considered as the decision variables to be optimized by GA. The superiority of GA-based ordinary kriging is demonstrated through application to the Wolfcamp Aquifer piezometric head data. Assessment of numerical results showed that definition of search strategy parameters based on both geographical coordinates and sample values improves cross-validation statistics when compared with that based on geographical coordinates alone. In the case of a variable search neighborhood for each estimation point, optimization of local search strategy parameters for an elliptical support domain—the orientation of which is dictated by anisotropic axes—via GA was able to capture the dynamics of piezometric head in west Texas/New Mexico in an efficient way.
Nalbantoglu, Sinem; Abu-Asab, Mones; Tan, Ming; Zhang, Xuemin; Cai, Ling; Amri, Hakima
2016-07-01
Pancreatic ductal adenocarcinoma (PDAC) is one of the rapidly growing forms of pancreatic cancer, with a poor prognosis and less than 5% 5-year survival rate. In this study, we characterized the genetic signatures and signaling pathways related to survival from PDAC, using a parsimony phylogenetic algorithm. We applied the parsimony phylogenetic algorithm to analyze the publicly available whole-genome in silico array analysis of a gene expression data set in 25 early-stage human PDAC specimens. We explain here that parsimony phylogenetics is an evolutionary analytical method that offers important promise to uncover clonal (driver) and nonclonal (passenger) aberrations in complex diseases. In our analysis, parsimony and statistical analyses did not identify significant correlations between survival times and gene expression values. Thus, the survival rankings did not appear to be significantly different between patients for any specific gene (p > 0.05). Also, we did not find a correlation between gene expression data and tumor stage in the present data set. While the present analysis was unable to identify in this relatively small sample of patients a molecular signature associated with pancreatic cancer prognosis, we suggest that future research and analyses with the parsimony phylogenetic algorithm in larger patient samples are worthwhile, given the devastating nature of pancreatic cancer, the need for early diagnosis, and the need for novel data analytic approaches. Future research practices might want to place greater emphasis on phylogenetics as one of the analytical paradigms, as our findings presented here are on the cusp of this shift, especially in the current era of Big Data and innovation policies advocating for greater data sharing and reanalysis.
Improving UWB-Based Localization in IoT Scenarios with Statistical Models of Distance Error.
Monica, Stefania; Ferrari, Gianluigi
2018-05-17
Interest in the Internet of Things (IoT) is rapidly increasing, as the number of connected devices is exponentially growing. One of the application scenarios envisaged for IoT technologies involves indoor localization and context awareness. In this paper, we focus on a localization approach that relies on a particular type of communication technology, namely Ultra Wide Band (UWB). UWB technology is an attractive choice for indoor localization, owing to its high accuracy. Since localization algorithms typically rely on estimated inter-node distances, the goal of this paper is to evaluate the improvement brought by a simple (linear) statistical model of the distance error. On the basis of an extensive experimental measurement campaign, we propose a general analytical framework, based on a Least Square (LS) method, to derive a novel statistical model for the range estimation error between a pair of UWB nodes. The proposed statistical model is then applied to improve the performance of a few illustrative localization algorithms in various realistic scenarios. The obtained experimental results show that the use of the proposed statistical model improves the accuracy of the considered localization algorithms with a reduction of the localization error up to 66%.
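A minimal sketch of the statistical correction is given below: fit a linear model of range error versus estimated distance by least squares and subtract the predicted error from new range estimates before localization. The synthetic calibration data and fitted coefficients are stand-ins for the paper's measurement campaign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic calibration data: true distances and UWB range estimates whose error
# grows roughly linearly with distance (illustrative stand-in for a measurement campaign).
true_d = rng.uniform(1.0, 20.0, size=300)
measured_d = true_d + (0.05 * true_d + 0.10) + 0.05 * rng.standard_normal(true_d.size)

# Least-squares fit of error = a * measured_distance + b.
error = measured_d - true_d
A = np.column_stack([measured_d, np.ones_like(measured_d)])
(a, b), *_ = np.linalg.lstsq(A, error, rcond=None)

# Correct new range estimates before feeding them to a localization algorithm.
new_measurement = 12.3
corrected = new_measurement - (a * new_measurement + b)
print(f"model: error ~ {a:.3f}*d + {b:.3f}; corrected range: {corrected:.2f} m")
```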
Glass-Kaastra, Shiona K; Pearl, David L; Reid-Smith, Richard J; McEwen, Beverly; Slavic, Durda; Fairles, Jim; McEwen, Scott A
2014-10-01
Susceptibility results for Pasteurella multocida and Streptococcus suis isolated from swine clinical samples were obtained from January 1998 to October 2010 from the Animal Health Laboratory at the University of Guelph, Guelph, Ontario, and used to describe variation in antimicrobial resistance (AMR) to 4 drugs of importance in the Ontario swine industry: ampicillin, tetracycline, tiamulin, and trimethoprim-sulfamethoxazole. Four temporal data-analysis options were used: visualization of trends in 12-month rolling averages, logistic-regression modeling, temporal-scan statistics, and a scan with the "What's strange about recent events?" (WSARE) algorithm. The AMR trends varied among the antimicrobial drugs for a single pathogen and between pathogens for a single antimicrobial, suggesting that pathogen-specific AMR surveillance may be preferable to indicator data. The 4 methods provided complementary and, at times, redundant results. The most appropriate combination of analysis methods for surveillance using these data included temporal-scan statistics with a visualization method (rolling-average or predicted-probability plots following logistic-regression models). The WSARE algorithm provided interesting results for quality control and has the potential to detect new resistance patterns; however, missing data created problems for displaying the results in a way that would be meaningful to all surveillance stakeholders.
Analysis of High-Throughput ELISA Microarray Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
White, Amanda M.; Daly, Don S.; Zangar, Richard C.
Our research group develops analytical methods and software for the high-throughput analysis of quantitative enzyme-linked immunosorbent assay (ELISA) microarrays. ELISA microarrays differ from DNA microarrays in several fundamental aspects and most algorithms for analysis of DNA microarray data are not applicable to ELISA microarrays. In this review, we provide an overview of the steps involved in ELISA microarray data analysis and how the statistically sound algorithms we have developed provide an integrated software suite to address the needs of each data-processing step. The algorithms discussed are available in a set of open-source software tools (http://www.pnl.gov/statistics/ProMAT).
MSEBAG: a dynamic classifier ensemble generation based on 'minimum-sufficient ensemble' and bagging
NASA Astrophysics Data System (ADS)
Chen, Lei; Kamel, Mohamed S.
2016-01-01
In this paper, we propose a dynamic classifier system, MSEBAG, which is characterised by searching for the 'minimum-sufficient ensemble' and bagging at the ensemble level. It adopts an 'over-generation and selection' strategy and aims to achieve a good bias-variance trade-off. In the training phase, MSEBAG first searches for the 'minimum-sufficient ensemble', which maximises the in-sample fitness with the minimal number of base classifiers. Then, starting from the 'minimum-sufficient ensemble', a backward stepwise algorithm is employed to generate a collection of ensembles. The objective is to create a collection of ensembles with a descending fitness on the data, as well as a descending complexity in the structure. MSEBAG dynamically selects the ensembles from the collection for the decision aggregation. The extended adaptive aggregation (EAA) approach, a bagging-style algorithm performed at the ensemble level, is employed for this task. EAA searches for the competent ensembles using a score function, which takes into consideration both the in-sample fitness and the confidence of the statistical inference, and averages the decisions of the selected ensembles to label the test pattern. The experimental results show that the proposed MSEBAG outperforms the benchmarks on average.
Rapidly locating and characterizing pollutant releases in buildings.
Sohn, Michael D; Reynolds, Pamela; Singh, Navtej; Gadgil, Ashok J
2002-12-01
Releases of airborne contaminants in or near a building can lead to significant human exposures unless prompt response measures are taken. However, possible responses can include conflicting strategies, such as shutting the ventilation system off versus running it in a purge mode or having occupants evacuate versus sheltering in place. The proper choice depends in part on knowing the source locations, the amounts released, and the likely future dispersion routes of the pollutants. We present an approach that estimates this information in real time. It applies Bayesian statistics to interpret measurements of airborne pollutant concentrations from multiple sensors placed in the building and computes best estimates and uncertainties of the release conditions. The algorithm is fast, capable of continuously updating the estimates as measurements stream in from sensors. We demonstrate the approach using a hypothetical pollutant release in a five-room building. Unknowns to the interpretation algorithm include location, duration, and strength of the source, and some building and weather conditions. Two sensor sampling plans and three levels of data quality are examined. Data interpretation in all examples is rapid; however, locating and characterizing the source with high probability depends on the amount and quality of data and the sampling plan.
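The following toy sketch illustrates the Bayesian interpretation step: a posterior over a small grid of candidate source hypotheses (room, release strength) is updated as sensor readings stream in, assuming a known forward model and Gaussian sensor noise. The two-room forward model and all parameter values are invented for illustration; the paper's five-room building model and algorithm details are not reproduced.

```python
import numpy as np

# Candidate source hypotheses: (room index, release strength). Illustrative grid.
rooms = [0, 1]
strengths = np.linspace(0.5, 5.0, 10)
hypotheses = [(r, s) for r in rooms for s in strengths]
log_post = np.zeros(len(hypotheses))  # uniform prior

def forward_model(room, strength, sensor, t):
    """Toy predicted concentration at a sensor at time t (stand-in for a real
    multizone airflow model)."""
    attenuation = 1.0 if sensor == room else 0.3
    return strength * attenuation * (1.0 - np.exp(-0.2 * t))

def update(log_post, sensor, t, measurement, sigma=0.2):
    """One Bayesian update from a single sensor reading (Gaussian noise model)."""
    for i, (room, strength) in enumerate(hypotheses):
        pred = forward_model(room, strength, sensor, t)
        log_post[i] += -0.5 * ((measurement - pred) / sigma) ** 2
    return log_post - np.max(log_post)  # rescale for numerical stability

# Simulate streaming measurements from a "true" release in room 1 with strength 2.0.
rng = np.random.default_rng(0)
for t in range(1, 11):
    for sensor in rooms:
        meas = forward_model(1, 2.0, sensor, t) + 0.2 * rng.standard_normal()
        log_post = update(log_post, sensor, t, meas)

post = np.exp(log_post)
post /= post.sum()
print("most probable (room, strength):", hypotheses[int(np.argmax(post))])
```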
Ellipsoids for anomaly detection in remote sensing imagery
NASA Astrophysics Data System (ADS)
Grosklos, Guenchik; Theiler, James
2015-05-01
For many target and anomaly detection algorithms, a key step is the estimation of a centroid (relatively easy) and a covariance matrix (somewhat harder) that characterize the background clutter. For a background that can be modeled as a multivariate Gaussian, the centroid and covariance lead to an explicit probability density function that can be used in likelihood ratio tests for optimal detection statistics. But ellipsoidal contours can characterize a much larger class of multivariate density functions, and the ellipsoids that characterize the outer periphery of the distribution are most appropriate for detection in the low false alarm rate regime. Traditionally the sample mean and sample covariance are used to estimate ellipsoid location and shape, but these quantities are confounded both by large lever-arm outliers and non-Gaussian distributions within the ellipsoid of interest. This paper compares a variety of centroid and covariance estimation schemes with the aim of characterizing the periphery of the background distribution. In particular, we consider a robust variant of the Khachiyan algorithm for the minimum-volume enclosing ellipsoid. The performance of these different approaches is evaluated on multispectral and hyperspectral remote sensing imagery using coverage plots of ellipsoid volume versus false alarm rate.
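As a small illustration of why the estimator choice matters, the snippet below scores test pixels by squared Mahalanobis distance under both the sample covariance and a robust minimum covariance determinant (MCD) fit; MCD is used here as a readily available robust estimator and is not the robust Khachiyan minimum-volume-ellipsoid variant considered in the paper. The synthetic five-band data are an assumption.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)

# Synthetic "background" pixels (5 bands) plus a few large outliers that would
# pull a sample-covariance ellipsoid away from the bulk of the distribution.
background = rng.multivariate_normal(np.zeros(5), np.eye(5), size=2000)
outliers = rng.multivariate_normal(np.full(5, 8.0), 4 * np.eye(5), size=20)
pixels = np.vstack([background, outliers])

sample_fit = EmpiricalCovariance().fit(pixels)
robust_fit = MinCovDet(random_state=0).fit(pixels)

# Anomaly score: squared Mahalanobis distance to each ellipsoid's centroid.
test = rng.multivariate_normal(np.full(5, 4.0), np.eye(5), size=5)  # mild anomalies
print("sample-covariance scores:", np.round(sample_fit.mahalanobis(test), 1))
print("robust (MCD) scores:     ", np.round(robust_fit.mahalanobis(test), 1))
```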
Study on the medical meteorological forecast of the number of hypertension inpatient based on SVR
NASA Astrophysics Data System (ADS)
Zhai, Guangyu; Chai, Guorong; Zhang, Haifeng
2017-06-01
The purpose of this study is to build a hypertension prediction model by examining the meteorological factors associated with hypertension incidence. The method selects standardized data on relative humidity, air temperature, visibility, wind speed and air pressure in Lanzhou from 2010 to 2012 (computing the maximum, minimum and average values over 5-day windows) as the input variables of Support Vector Regression (SVR), and the standardized hypertension incidence data for the same period as the output variable. The optimal prediction parameters are obtained by a cross-validation algorithm, and an SVR forecast model for hypertension incidence is then built through SVR learning and training. The results show that the hypertension prediction model is composed of 15 input variables, with a training accuracy of 0.005 and a final error of 0.0026389. The forecast accuracy of the SVR model is 97.1429%, higher than that of a statistical forecast equation and a neural network prediction method. It is concluded that the SVR model provides a new method for hypertension prediction, offering simple computation, small error, and good historical-sample fitting and independent-sample forecasting capability.
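A minimal sketch of this modeling setup, with random placeholder data in place of the Lanzhou records, might look as follows: 15 meteorological inputs feed an RBF-kernel SVR whose hyperparameters are chosen by cross-validation. All values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Placeholder data: 15 standardized meteorological inputs (max/min/mean of humidity,
# temperature, visibility, wind speed, pressure over 5-day windows) and case counts.
X = rng.normal(size=(200, 15))
y = 50 + 5 * X[:, 0] - 3 * X[:, 4] + rng.normal(scale=2, size=200)

# Cross-validation over the SVR hyperparameters, in the spirit of the paper's setup.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = GridSearchCV(model, {"svr__C": [1, 10, 100], "svr__epsilon": [0.01, 0.1, 1.0]}, cv=5)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("CV R^2:", round(grid.best_score_, 3))
```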
Taylor, Donna B
2017-04-01
The objective of this study was to investigate the incidence of plagiarism in a sample of manuscripts submitted to the AJR using CrossCheck, to develop an algorithm to identify significant plagiarism, and to formulate management pathways. A sample of 110 of 1610 (6.8%) manuscripts submitted to AJR in 2014 in the categories of Original Research or Review were analyzed using CrossCheck and manual assessment. The overall similarity index (OSI), the highest similarity score from a single source, whether duplication was from a single or multiple origins, the journal section, and the presence or absence of referencing the source were recorded. The criteria outlined by the International Committee of Medical Journal Editors were the reference standard for identifying manuscripts containing plagiarism. Statistical analysis was used to develop a screening algorithm to maximize sensitivity and specificity for the detection of plagiarism. Criteria for defining the severity of plagiarism and management pathways based on the severity of the plagiarism were determined. Twelve manuscripts (10.9%) contained plagiarism. Nine had an OSI excluding quotations and references of less than 20%. In seven, the highest similarity score from a single source was less than 10%. In nine, the highest single-source match was to prior work by the same author or authors. Common sections for duplication were the Materials and Methods, Discussion, and abstract. Referencing of the original source was lacking in 11. Plagiarism was undetected at submission in five of these 12 articles; two had been accepted for publication. The most effective screening algorithm was to average the OSI including quotations and references and the highest similarity score from a single source, and to submit manuscripts with an average value of more than 12% for further review. The current methods for detecting plagiarism are suboptimal. A new screening algorithm is proposed.
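The proposed screen translates directly into a small function: average the OSI including quotations and references with the highest single-source similarity score, and refer the manuscript for review when the average exceeds 12%. The example inputs below are invented.

```python
def flag_for_review(osi_incl_quotes_refs: float, highest_single_source: float,
                    threshold: float = 12.0) -> bool:
    """Screening rule: average the overall similarity index (including quotations
    and references) with the highest similarity score from a single source, and
    refer the manuscript for further review if the average exceeds the threshold."""
    return (osi_incl_quotes_refs + highest_single_source) / 2.0 > threshold

# Illustrative manuscripts: (OSI including quotes/refs %, highest single-source %).
for osi, single in [(18.0, 9.0), (30.0, 15.0), (11.0, 4.0)]:
    print(osi, single, "->", "review" if flag_for_review(osi, single) else "pass")
```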
Comparison analysis for classification algorithm in data mining and the study of model use
NASA Astrophysics Data System (ADS)
Chen, Junde; Zhang, Defu
2018-04-01
As a key technique in data mining, classification algorithms have received extensive attention. Through experiments with classification algorithms on UCI data sets, we give a comparative analysis method for the different algorithms in which statistical tests are used. Beyond that, an adaptive diagnosis model for preventing electricity stealing and leakage is given as a specific case in the paper.
The generalization ability of online SVM classification based on Markov sampling.
Xu, Jie; Yan Tang, Yuan; Zou, Bin; Xu, Zongben; Li, Luoqing; Lu, Yang
2015-03-01
In this paper, we consider online support vector machine (SVM) classification learning algorithms with uniformly ergodic Markov chain (u.e.M.c.) samples. We establish the bound on the misclassification error of an online SVM classification algorithm with u.e.M.c. samples based on reproducing kernel Hilbert spaces and obtain a satisfactory convergence rate. We also introduce a novel online SVM classification algorithm based on Markov sampling, and present the numerical studies on the learning ability of online SVM classification based on Markov sampling for benchmark repository. The numerical studies show that the learning performance of the online SVM classification algorithm based on Markov sampling is better than that of classical online SVM classification based on random sampling as the size of training samples is larger.
MC3: Multi-core Markov-chain Monte Carlo code
NASA Astrophysics Data System (ADS)
Cubillos, Patricio; Harrington, Joseph; Lust, Nate; Foster, AJ; Stemm, Madison; Loredo, Tom; Stevenson, Kevin; Campo, Chris; Hardin, Matt; Hardy, Ryan
2016-10-01
MC3 (Multi-core Markov-chain Monte Carlo) is a Bayesian statistics tool that can be executed from the shell prompt or interactively through the Python interpreter with single- or multiple-CPU parallel computing. It offers Markov-chain Monte Carlo (MCMC) posterior-distribution sampling for several algorithms, Levenberg-Marquardt least-squares optimization, and uniform non-informative, Jeffreys non-informative, or Gaussian-informative priors. MC3 can share the same value among multiple parameters and fix the value of parameters to constant values, and offers Gelman-Rubin convergence testing and correlated-noise estimation with time-averaging or wavelet-based likelihood estimation methods.
Gradient Learning Algorithms for Ontology Computing
Gao, Wei; Zhu, Linli
2014-01-01
The gradient learning model has been attracting great attention in view of its promising perspectives for applications in statistics, data dimensionality reduction, and other specific fields. In this paper, we propose a new gradient learning model for ontology similarity measuring and ontology mapping in the multidividing setting. The sample error in this setting is given by virtue of the hypothesis space and the trick of the ontology dividing operator. Finally, two experiments presented on the plant and humanoid robotics fields verify the efficiency of the new computation model for ontology similarity measure and ontology mapping applications in the multidividing setting. PMID:25530752
Comparing CNV detection methods for SNP arrays.
Winchester, Laura; Yau, Christopher; Ragoussis, Jiannis
2009-09-01
Data from whole genome association studies can now be used for dual purposes, genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.
The statistical theory of the fracture of fragile bodies. Part 2: The integral equation method
NASA Technical Reports Server (NTRS)
Kittl, P.
1984-01-01
It is demonstrated how, with the aid of a bending test, the Weibull fracture risk function can be determined, without postulating its analytical form, by solving an integral equation. The respective solutions for rectangular and circular section beams are given. In the first case the function is expressed as an algorithm and in the second, in the form of a series. Taking into account that the cumulative fracture probability appearing in the solution to the integral equation must be continuous and monotonically increasing, any case of fabrication or selection of samples can be treated.
Bourgkard, Eve; Wild, Pascal; Gonzalez, Maria; Févotte, Joëlle; Penven, Emmanuelle; Paris, Christophe
2013-12-01
To describe the performance of a lifelong task-based questionnaire (TBQ) in estimating exposures compared with other approaches in the context of a case-control study. A sample of 93 subjects was randomly selected from a lung cancer case-control study corresponding to 497 jobs. For each job, exposure assessments for asbestos and polycyclic aromatic hydrocarbons (PAHs) were obtained by expertise (TBQ expertise) and by algorithm using the TBQ (TBQ algorithm) as well as by expert appraisals based on all available occupational data (REFERENCE expertise) considered to be the gold standard. Additionally, a Job Exposure Matrix (JEM)-based evaluation for asbestos was also obtained. On the 497 jobs, the various evaluations were contrasted using Cohen's κ coefficient of agreement. Additionally, on the total case-control population, the asbestos dose-response relationship based on the TBQ algorithm was compared with the JEM-based assessment. Regarding asbestos, the TBQ-exposure estimates agreed well with the REFERENCE estimate (TBQ expertise: level-weighted κ (lwk)=0.68; TBQ algorithm: lwk=0.61) but less so with the JEM estimate (TBQ expertise: lwk=0.31; TBQ algorithm: lwk=0.26). Regarding PAHs, the agreements between REFERENCE expertise and TBQ were less good (TBQ expertise: lwk=0.43; TBQ algorithm: lwk=0.36). In the case-control study analysis, the dose-response relationship between lung cancer and cumulative asbestos based on the JEM is less steep than with the TBQ-algorithm exposure assessment and statistically non-significant. Asbestos-exposure estimates based on the TBQ were consistent with the REFERENCE expertise and yielded a steeper dose-response relationship than the JEM. For PAHs, results were less clear.
New multirate sampled-data control law structure and synthesis algorithm
NASA Technical Reports Server (NTRS)
Berg, Martin C.; Mason, Gregory S.; Yang, Gen-Sheng
1992-01-01
A new multirate sampled-data control law structure is defined and a new parameter-optimization-based synthesis algorithm for that structure is introduced. The synthesis algorithm can be applied to multirate, multiple-input/multiple-output, sampled-data control laws having a prescribed dynamic order and structure, and a priori specified sampling/update rates for all sensors, processor states, and control inputs. The synthesis algorithm is applied to design two-input, two-output tip position controllers of various dynamic orders for a sixth-order, two-link robot arm model.
Peng, Bo; Wang, Yuqi; Hall, Timothy J; Jiang, Jingfeng
2017-01-01
The primary objective of this work was to extend a previously published 2D coupled sub-sample tracking algorithm to 3D speckle tracking in the framework of ultrasound breast strain elastography. To overcome the heavy computational cost, we investigated the use of a graphics processing unit (GPU) to accelerate the 3D coupled sub-sample speckle tracking method. The performance of the proposed GPU implementation was tested using a tissue-mimicking (TM) phantom and in vivo breast ultrasound data. The performance of this 3D sub-sample tracking algorithm was compared with the conventional 3D quadratic sub-sample estimation algorithm. On the basis of these evaluations, we concluded that the GPU implementation of this 3D sub-sample estimation algorithm can provide high-quality strain data (i.e. high correlation between the pre- and the motion-compensated post-deformation RF echo data and high contrast-to-noise ratio strain images), as compared to the conventional 3D quadratic sub-sample algorithm. Using the GPU implementation of the 3D speckle tracking algorithm, volumetric strain data can be obtained relatively fast (approximately 20 seconds per volume [2.5 cm × 2.5 cm × 2.5 cm]). PMID:28166493
Automated detection of hospital outbreaks: A systematic review of methods
Buckeridge, David L.; Lepelletier, Didier
2017-01-01
Objectives Several automated algorithms for epidemiological surveillance in hospitals have been proposed. However, the usefulness of these methods to detect nosocomial outbreaks remains unclear. The goal of this review was to describe outbreak detection algorithms that have been tested within hospitals, consider how they were evaluated, and synthesize their results. Methods We developed a search query using keywords associated with hospital outbreak detection and searched the MEDLINE database. To ensure the highest sensitivity, no limitations were initially imposed on publication languages and dates, although we subsequently excluded studies published before 2000. Every study that described a method to detect outbreaks within hospitals was included, without any exclusion based on study design. Additional studies were identified through citations in retrieved studies. Results Twenty-nine studies were included. The detection algorithms were grouped into 5 categories: simple thresholds (n = 6), statistical process control (n = 12), scan statistics (n = 6), traditional statistical models (n = 6), and data mining methods (n = 4). The evaluation of the algorithms was often solely descriptive (n = 15), but more complex epidemiological criteria were also investigated (n = 10). The performance measures varied widely between studies: e.g., the sensitivity of an algorithm in a real world setting could vary between 17 and 100%. Conclusion Even if outbreak detection algorithms are useful complementary tools for traditional surveillance, the heterogeneity in results among published studies does not support quantitative synthesis of their performance. A standardized framework should be followed when evaluating outbreak detection methods to allow comparison of algorithms across studies and synthesis of results. PMID:28441422
Contextual classification of multispectral image data: Approximate algorithm
NASA Technical Reports Server (NTRS)
Tilton, J. C. (Principal Investigator)
1980-01-01
An approximation to a classification algorithm incorporating spatial context information in a general, statistical manner is presented; it is computationally less intensive and produces classifications that are nearly as accurate.
Building on crossvalidation for increasing the quality of geostatistical modeling
Olea, R.A.
2012-01-01
The random function is a mathematical model commonly used in the assessment of uncertainty associated with a spatially correlated attribute that has been partially sampled. There are multiple algorithms for modeling such random functions, all sharing the requirement of specifying various parameters that have critical influence on the results. The importance of finding ways to compare the methods and set parameters to obtain results that better model uncertainty has increased as these algorithms have grown in number and complexity. Crossvalidation has been used in spatial statistics, mostly in kriging, for the analysis of mean square errors. An appeal of this approach is its ability to work with the same empirical sample available for running the algorithms. This paper goes beyond checking estimates by formulating a function sensitive to conditional bias. Under ideal conditions, such a function turns into a straight line, which can be used as a reference for preparing measures of performance. Applied to kriging, deviations from the ideal line provide a sensitivity to the semivariogram lacking in crossvalidation of kriging errors and are more sensitive to conditional bias than analyses of errors. In terms of stochastic simulation, in addition to finding better parameters, the deviations allow comparison of the realizations resulting from the applications of different methods. Examples show improvements of about 30% in the deviations and approximately 10% in the square root of mean square errors between a reasonable starting model and the solutions obtained according to the new criteria. © 2011 US Government.
An Automated Energy Detection Algorithm Based on Morphological Filter...
US Army Research Laboratory (ARL-TR-8270)
2018-01-01
...collected data. These statistical techniques are under the area of descriptive statistics, which is a methodology to condense the data in quantitative ...
Liu, Hongcheng; Yao, Tao; Li, Runze; Ye, Yinyu
2017-11-01
This paper concerns the folded concave penalized sparse linear regression (FCPSLR), a class of popular sparse recovery methods. Although FCPSLR yields desirable recovery performance when solved globally, computing a global solution is NP-complete. Despite some existing statistical performance analyses on local minimizers or on specific FCPSLR-based learning algorithms, it still remains an open question whether local solutions that are known to admit fully polynomial-time approximation schemes (FPTAS) may already be sufficient to ensure the statistical performance, and whether that statistical performance can be non-contingent on the specific designs of computing procedures. To address the questions, this paper presents the following threefold results: (i) Any local solution (stationary point) is a sparse estimator, under some conditions on the parameters of the folded concave penalties. (ii) Perhaps more importantly, any local solution satisfying a significant subspace second-order necessary condition (S3ONC), which is weaker than the second-order KKT condition, yields a bounded error in approximating the true parameter with high probability. In addition, if the minimal signal strength is sufficient, the S3ONC solution likely recovers the oracle solution. This result also explicates that the goal of improving the statistical performance is consistent with the optimization criteria of minimizing the suboptimality gap in solving the non-convex programming formulation of FCPSLR. (iii) We apply (ii) to the special case of FCPSLR with minimax concave penalty (MCP) and show that under the restricted eigenvalue condition, any S3ONC solution with a better objective value than the Lasso solution entails the strong oracle property. In addition, such a solution generates a model error (ME) comparable to the optimal but exponential-time sparse estimator given a sufficient sample size, while the worst-case ME is comparable to the Lasso in general. Furthermore, computing a solution that is guaranteed to satisfy the S3ONC admits an FPTAS.
Hensman, James; Lawrence, Neil D; Rattray, Magnus
2013-08-20
Time course data from microarrays and high-throughput sequencing experiments require simple, computationally efficient and powerful statistical models to extract meaningful biological signal, and for tasks such as data fusion and clustering. Existing methodologies fail to capture either the temporal or replicated nature of the experiments, and often impose constraints on the data collection process, such as regularly spaced samples, or similar sampling schema across replications. We propose hierarchical Gaussian processes as a general model of gene expression time-series, with application to a variety of problems. In particular, we illustrate the method's capacity for missing data imputation, data fusion and clustering.The method can impute data which is missing both systematically and at random: in a hold-out test on real data, performance is significantly better than commonly used imputation methods. The method's ability to model inter- and intra-cluster variance leads to more biologically meaningful clusters. The approach removes the necessity for evenly spaced samples, an advantage illustrated on a developmental Drosophila dataset with irregular replications. The hierarchical Gaussian process model provides an excellent statistical basis for several gene-expression time-series tasks. It has only a few additional parameters over a regular GP, has negligible additional complexity, is easily implemented and can be integrated into several existing algorithms. Our experiments were implemented in python, and are available from the authors' website: http://staffwww.dcs.shef.ac.uk/people/J.Hensman/.
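A bare-bones sketch of the hierarchical covariance idea is given below: a shared squared-exponential kernel models the underlying gene-level function, and a replicate-specific kernel is added only for pairs of observations from the same replicate; the posterior mean of the shared function is then evaluated on a regular grid. Lengthscales, noise level, and data are assumptions, and this is far simpler than the paper's full model and software.

```python
import numpy as np

def sq_exp(t1, t2, variance, lengthscale):
    """Squared-exponential covariance between two vectors of time points."""
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def hierarchical_kernel(t1, r1, t2, r2, var_g=1.0, len_g=2.0, var_r=0.3, len_r=1.0):
    """Hierarchical GP covariance: a shared gene-level term plus a replicate-specific
    term that is only active when two observations come from the same replicate."""
    same_rep = (r1[:, None] == r2[None, :]).astype(float)
    return sq_exp(t1, t2, var_g, len_g) + same_rep * sq_exp(t1, t2, var_r, len_r)

rng = np.random.default_rng(0)

# Two replicates sampled at irregular, non-matching times (illustrative data).
t = np.concatenate([np.sort(rng.uniform(0, 10, 8)), np.sort(rng.uniform(0, 10, 6))])
rep = np.concatenate([np.zeros(8, int), np.ones(6, int)])
y = np.sin(t) + 0.3 * rng.standard_normal(t.size)

# GP posterior mean of the shared (gene-level) function on a regular grid.
noise = 0.1
K = hierarchical_kernel(t, rep, t, rep) + noise * np.eye(t.size)
t_star = np.linspace(0, 10, 5)
K_star = sq_exp(t_star, t, 1.0, 2.0)  # cross-covariance with the shared term only
mean_shared = K_star @ np.linalg.solve(K, y)
print(np.round(mean_shared, 2))
```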
A Laboratory Experiment for the Statistical Evaluation of Aerosol Retrieval (STEAR) Algorithms
NASA Astrophysics Data System (ADS)
Schuster, G. L.; Espinosa, R.; Ziemba, L. D.; Beyersdorf, A. J.; Rocha Lima, A.; Anderson, B. E.; Martins, J. V.; Dubovik, O.; Ducos, F.; Fuertes, D.; Lapyonok, T.; Shook, M.; Derimian, Y.; Moore, R.
2016-12-01
We have developed a method for validating Aerosol Robotic Network (AERONET) retrieval algorithms by mimicking atmospheric extinction and radiance measurements in a laboratory experiment. This enables radiometric retrievals that utilize the same sampling volumes, relative humidities, and particle size ranges as observed by other in situ instrumentation in the experiment. We utilize three Cavity Attenuated Phase Shift (CAPS) monitors for extinction and UMBC's three-wavelength Polarized Imaging Nephelometer (PI-Neph) for angular scattering measurements. We subsample the PI-Neph radiance measurements to angles that correspond to AERONET almucantar scans, with solar zenith angles ranging from 50 to 77 degrees. These measurements are then used as input to the Generalized Retrieval of Aerosol and Surface Properties (GRASP) algorithm, which retrieves size distributions, complex refractive indices, single-scatter albedos (SSA), and lidar ratios for the in situ samples. We obtained retrievals with residuals R < 10% for 100 samples. The samples that we tested include Arizona Test Dust, Arginotec NX, Senegal clay, Israel clay, montmorillonite, hematite, goethite, volcanic ash, ammonium nitrate, ammonium sulfate, and fullerene soot. Samples were alternately dried or humidified, and size distributions were limited to diameters of 1.0 or 2.5 um by using a cyclone. The SSA at 532 nm for these samples ranged from 0.59 to 1.00 when computed with CAPS extinction and PSAP absorption measurements. The GRASP retrieval provided SSAs that are highly correlated with the in situ SSAs, and the correlation coefficients ranged from 0.955 to 0.976, depending upon the simulated solar zenith angle. The GRASP SSAs exhibited an average absolute bias of +0.023 +/-0.01 with respect to the extinction and absorption measurements for the entire dataset. Although our apparatus was not capable of measuring backscatter lidar ratio, we did measure bistatic lidar ratios at a scattering angle of 173 deg. The GRASP bistatic lidar ratios had correlations of 0.488 to 0.735 (depending upon simulated SZA) with respect to in situ measurements, positive relative biases of 6-10%, and average absolute biases of 4.0-6.6 sr. We also compared the GRASP size distributions to aerodynamic particle size measurements.
Maintaining and Enhancing Diversity of Sampled Protein Conformations in Robotics-Inspired Methods.
Abella, Jayvee R; Moll, Mark; Kavraki, Lydia E
2018-01-01
The ability to efficiently sample structurally diverse protein conformations allows one to gain a high-level view of a protein's energy landscape. Algorithms from robot motion planning have been used for conformational sampling, and several of these algorithms promote diversity by keeping track of "coverage" in conformational space based on the local sampling density. However, large proteins present special challenges. In particular, larger systems require running many concurrent instances of these algorithms, but these algorithms can quickly become memory intensive because they typically keep previously sampled conformations in memory to maintain coverage estimates. In addition, robotics-inspired algorithms depend on defining useful perturbation strategies for exploring the conformational space, which is a difficult task for large proteins because such systems are typically more constrained and exhibit complex motions. In this article, we introduce two methodologies for maintaining and enhancing diversity in robotics-inspired conformational sampling. The first method addresses algorithms based on coverage estimates and leverages the use of a low-dimensional projection to define a global coverage grid that maintains coverage across concurrent runs of sampling. The second method is an automatic definition of a perturbation strategy through readily available flexibility information derived from B-factors, secondary structure, and rigidity analysis. Our results show a significant increase in the diversity of the conformations sampled for proteins consisting of up to 500 residues when applied to a specific robotics-inspired algorithm for conformational sampling. The methodologies presented in this article may be vital components for the scalability of robotics-inspired approaches.
Adaptive detection of a noise signal according to the Neyman-Pearson criterion
NASA Astrophysics Data System (ADS)
Padiryakov, Y. A.
1985-03-01
Optimum detection according to the Neyman-Pearson criterion is considered for a random Gaussian noise signal, stationary during measurement, in a stationary random Gaussian background interference. Detection is based on two samples whose statistics are characterized by estimates of their spectral densities. It is known a priori that sample A, from the signal channel, is either the sum of signal and interference or interference alone, and that sample B, from the reference interference channel, is interference with the same spectral density as the interference in sample A under both hypotheses. The probability of correct detection is maximized on average, first in the 2N-dimensional space of signal spectral density and interference spectral density readings, by fixing the probability of false alarm at each point so as to stabilize it at a constant level against variation of the interference spectral density. Deterministic decision rules are established. The algorithm is then reduced to an equivalent detection in the N-dimensional space of the ratio of sample A readings to sample B readings.
Optimal Budget Allocation for Sample Average Approximation
2011-06-01
an optimization algorithm applied to the sample average problem. We examine the convergence rate of the estimator as the computing budget tends to ... regime for the optimization algorithm. Sample average approximation (SAA) is a frequently used approach to solving stochastic programs ... appealing due to its simplicity and the fact that a large number of standard optimization algorithms are often available to optimize the resulting sample
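A minimal sample average approximation example is sketched below: the expectation in a newsvendor-style stochastic program is replaced by an average over N demand draws, and the resulting deterministic problem is handed to a standard scalar optimizer. The objective, distribution, and budget values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

def saa_solution(n_samples, price=5.0, cost=3.0):
    """Solve a newsvendor-style stochastic program by sample average approximation:
    maximize E[price * min(x, D) - cost * x] with demand D replaced by N draws."""
    demand = rng.lognormal(mean=3.0, sigma=0.5, size=n_samples)

    def neg_sample_average_profit(x):
        return -np.mean(price * np.minimum(x, demand) - cost * x)

    # A standard deterministic optimizer applied to the sample average problem.
    res = minimize_scalar(neg_sample_average_profit, bounds=(0.0, 200.0), method="bounded")
    return res.x, -res.fun

# The SAA estimator typically stabilizes as the sampling budget N grows.
for n in (10, 100, 10000):
    x_hat, profit_hat = saa_solution(n)
    print(f"N={n:>6}: order quantity ~ {x_hat:.1f}, estimated profit ~ {profit_hat:.1f}")
```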
Bayesian microsaccade detection
Mihali, Andra; van Opheusden, Bas; Ma, Wei Ji
2017-01-01
Microsaccades are high-velocity fixational eye movements, with special roles in perception and cognition. The default microsaccade detection method is to determine when the smoothed eye velocity exceeds a threshold. We have developed a new method, Bayesian microsaccade detection (BMD), which performs inference based on a simple statistical model of eye positions. In this model, a hidden state variable changes between drift and microsaccade states at random times. The eye position is a biased random walk with different velocity distributions for each state. BMD generates samples from the posterior probability distribution over the eye state time series given the eye position time series. Applied to simulated data, BMD recovers the “true” microsaccades with fewer errors than alternative algorithms, especially at high noise. Applied to EyeLink eye tracker data, BMD detects almost all the microsaccades detected by the default method, but also apparent microsaccades embedded in high noise—although these can also be interpreted as false positives. Next we apply the algorithms to data collected with a Dual Purkinje Image eye tracker, whose higher precision justifies defining the inferred microsaccades as ground truth. When we add artificial measurement noise, the inferences of all algorithms degrade; however, at noise levels comparable to EyeLink data, BMD recovers the “true” microsaccades with 54% fewer errors than the default algorithm. Though unsuitable for online detection, BMD has other advantages: It returns probabilities rather than binary judgments, and it can be straightforwardly adapted as the generative model is refined. We make our algorithm available as a software package. PMID:28114483
Standard and Robust Methods in Regression Imputation
ERIC Educational Resources Information Center
Moraveji, Behjat; Jafarian, Koorosh
2014-01-01
The aim of this paper is to provide an introduction to new imputation algorithms for estimating missing values in larger data sets from official statistics, as part of data pre-processing and in the presence of outliers. The goal is to propose a new algorithm called IRMI (iterative robust model-based imputation). This algorithm is able to deal with all challenges like…
NASA Astrophysics Data System (ADS)
Nishiura, Takanobu; Nakamura, Satoshi
2002-11-01
It is very important to capture distant-talking speech with high quality for a hands-free speech interface. A microphone array is an ideal candidate for this purpose. However, this approach requires localizing the target talker. Conventional talker localization algorithms in multiple-sound-source environments not only have difficulty localizing the multiple sound sources accurately, but also have difficulty localizing the target talker among known multiple sound source positions. To cope with these problems, we propose a new talker localization algorithm consisting of two algorithms. One is a DOA (direction of arrival) estimation algorithm for multiple sound source localization based on the CSP (cross-power spectrum phase) coefficient addition method. The other is a statistical sound source identification algorithm based on a GMM (Gaussian mixture model) for localizing the target talker position among the localized multiple sound sources. In this paper, we particularly focus on the talker localization performance obtained by combining these two algorithms with a microphone array. We conducted evaluation experiments in real noisy reverberant environments. As a result, we confirmed that multiple sound signals can be accurately identified as "speech" or "non-speech" by the proposed algorithm. [Work supported by ATR and MEXT of Japan.]
Asquith, William H.
2014-01-01
The implementation characteristics of two method of L-moments (MLM) algorithms for parameter estimation of the 4-parameter Asymmetric Exponential Power (AEP4) distribution are studied using the R environment for statistical computing. The objective is to validate the algorithms for general application of the AEP4 using R. An algorithm was introduced in the original study of the L-moments for the AEP4. A second or alternative algorithm is shown to have a larger L-moment-parameter domain than the original. The alternative algorithm is shown to provide reliable parameter production and recovery of L-moments from fitted parameters. A proposal is made for AEP4 implementation in conjunction with the 4-parameter Kappa distribution to create a mixed-distribution framework encompassing the joint L-skew and L-kurtosis domains. The example application provides a demonstration of pertinent algorithms with L-moment statistics and two 4-parameter distributions (AEP4 and the Generalized Lambda) for MLM fitting to a modestly asymmetric and heavy-tailed dataset using R.
Comparison of probability statistics for automated ship detection in SAR imagery
NASA Astrophysics Data System (ADS)
Henschel, Michael D.; Rey, Maria T.; Campbell, J. W. M.; Petrovic, D.
1998-12-01
This paper discusses the initial results of a recent operational trial of the Ocean Monitoring Workstation's (OMW) ship detection algorithm, which is essentially a Constant False Alarm Rate filter applied to Synthetic Aperture Radar data. The choice of probability distribution and the methodologies for calculating scene-specific statistics are discussed in some detail. An empirical basis for the choice of probability distribution used is discussed. We compare the results using a 1-look K-distribution function with various parameter choices and methods of estimation. As a special case of sea clutter statistics, the application of a χ²-distribution is also discussed. Comparisons are made with reference to RADARSAT data collected during the Maritime Command Operational Training exercise conducted in Atlantic Canadian waters in June 1998. Reference is also made to previously collected statistics. The OMW is a commercial software suite that provides modules for automated vessel detection, oil spill monitoring, and environmental monitoring. This work has been undertaken to fine-tune the OMW algorithms, with special emphasis on the false alarm rate of each algorithm.
A bootstrap based Neyman-Pearson test for identifying variable importance.
Ditzler, Gregory; Polikar, Robi; Rosen, Gail
2015-04-01
Selection of the most informative features that leads to a small loss on future data is arguably one of the most important steps in classification, data analysis and model selection. Several feature selection (FS) algorithms are available; however, due to noise present in any data set, FS algorithms are typically accompanied by an appropriate cross-validation scheme. In this brief, we propose a statistical hypothesis test derived from the Neyman-Pearson lemma for determining if a feature is statistically relevant. The proposed approach can be applied as a wrapper to any FS algorithm, regardless of the FS criteria used by that algorithm, to determine whether a feature belongs in the relevant set. Perhaps more importantly, this procedure efficiently determines the number of relevant features given an initial starting point. We provide freely available software implementations of the proposed methodology.
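A rough, hedged sketch of the general idea, not the authors' exact test: wrap a classifier in bootstrap resampling and compare its cross-validated accuracy with the candidate feature intact versus permuted, declaring the feature relevant when the gain is consistently positive. The classifier, resample counts and synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def feature_relevance_pvalue(X, y, j, n_boot=100, seed=0):
    """Sketch (not the authors' exact procedure): bootstrap the gain in
    cross-validated accuracy from keeping feature j versus permuting it,
    and report the fraction of resamples where the gain is non-positive."""
    rng = np.random.default_rng(seed)
    gains = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))        # bootstrap resample
        Xb, yb = X[idx], y[idx]
        Xnull = Xb.copy()
        Xnull[:, j] = rng.permutation(Xnull[:, j])   # break feature-label link
        clf = LogisticRegression(max_iter=1000)
        real = cross_val_score(clf, Xb, yb, cv=3).mean()
        null = cross_val_score(clf, Xnull, yb, cv=3).mean()
        gains.append(real - null)
    return np.mean(np.array(gains) <= 0.0)           # small value => relevant

# Illustrative use on synthetic data where only feature 0 is informative
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + 0.3 * rng.standard_normal(200) > 0).astype(int)
print(feature_relevance_pvalue(X, y, j=0), feature_relevance_pvalue(X, y, j=3))
```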
Gladstone, Emilie; Smolina, Kate; Morgan, Steven G.; Fernandes, Kimberly A.; Martins, Diana; Gomes, Tara
2016-01-01
Background: Comprehensive systems for surveilling prescription opioid–related harms provide clear evidence that deaths from prescription opioids have increased dramatically in the United States. However, these harms are not systematically monitored in Canada. In light of a growing public health crisis, accessible, nationwide data sources to examine prescription opioid–related harms in Canada are needed. We sought to examine the performance of 5 algorithms to identify prescription opioid–related deaths from vital statistics data against data abstracted from the Office of the Chief Coroner of Ontario as a gold standard. Methods: We identified all prescription opioid–related deaths from Ontario coroners’ data that occurred between Jan. 31, 2003, and Dec. 31, 2010. We then used 5 different algorithms to identify prescription opioid–related deaths from vital statistics death data in 2010. We selected the algorithm with the highest sensitivity and a positive predictive value of more than 80% as the optimal algorithm for identifying prescription opioid–related deaths. Results: Four of the 5 algorithms had positive predictive values of more than 80%. The algorithm with the highest sensitivity (75%) in 2010 improved slightly in its predictive performance from 2003 to 2010. Interpretation: In the absence of specific systems for monitoring prescription opioid–related deaths in Canada, readily available national vital statistics data can be used to study prescription opioid–related mortality with considerable accuracy. Despite some limitations, these data may facilitate the implementation of national surveillance and monitoring strategies. PMID:26622006
Research on Abnormal Detection Based on Improved Combination of K-means and SVDD
NASA Astrophysics Data System (ADS)
Hao, Xiaohong; Zhang, Xiaofeng
2018-01-01
In order to improve the efficiency of network intrusion detection and reduce the false alarm rate, this paper proposes an anomaly detection algorithm based on improved K-means and SVDD. The algorithm first uses the improved K-means algorithm to cluster the training samples of each class, so that each class is independent and compact within itself. Then, for each cluster of training samples, the SVDD algorithm is used to construct a minimum hypersphere. The class membership of a test sample is determined by calculating its distance to the minimum hyperspheres constructed by SVDD: if the distance from the test sample to the center of a hypersphere is less than its radius, the test sample belongs to that class; otherwise it does not. After several such comparisons, the final classification of the test sample is obtained. In this paper, we use the KDD CUP99 data set to evaluate the proposed anomaly detection algorithm through simulation. The results show that the algorithm has a high detection rate and a low false alarm rate, and is an effective network security protection method.
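Below is a hedged sketch of the general K-means-plus-SVDD idea, not the paper's improved algorithm and without the KDD CUP99 data: normal training samples are clustered, one boundary model is fitted per cluster, and a test sample is flagged anomalous if no cluster model accepts it. scikit-learn's OneClassSVM with an RBF kernel stands in for SVDD, and all parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

# Sketch: cluster the normal training data, fit one boundary model per cluster
# (OneClassSVM as a stand-in for SVDD), and flag a test sample as anomalous if
# no cluster model accepts it. Synthetic 2-D data, illustrative parameters.
rng = np.random.default_rng(3)
normal = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(8, 1, (300, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal)
models = [OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
          .fit(normal[km.labels_ == c]) for c in range(km.n_clusters)]

def is_anomaly(x):
    x = np.atleast_2d(x)
    # accepted (prediction +1) by any per-cluster model => treated as normal
    return not any(m.predict(x)[0] == 1 for m in models)

print(is_anomaly([0.2, -0.5]))   # inside a normal cluster -> False
print(is_anomaly([4.0, 20.0]))   # far from both clusters  -> True
```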
Integrative missing value estimation for microarray data.
Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine
2006-10-12
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain less than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets into consideration. To determine whether the given reference data sets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Squares (LLS) imputation algorithm by up to 15% in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
LC-MSsim – a simulation software for liquid chromatography mass spectrometry data
Schulz-Trieglaff, Ole; Pfeifer, Nico; Gröpl, Clemens; Kohlbacher, Oliver; Reinert, Knut
2008-01-01
Background Mass Spectrometry coupled to Liquid Chromatography (LC-MS) is commonly used to analyze the protein content of biological samples in large scale studies. The data resulting from an LC-MS experiment is huge, highly complex and noisy. Accordingly, it has sparked new developments in Bioinformatics, especially in the fields of algorithm development, statistics and software engineering. In a quantitative label-free mass spectrometry experiment, crucial steps are the detection of peptide features in the mass spectra and the alignment of samples by correcting for shifts in retention time. At the moment, it is difficult to compare the plethora of algorithms for these tasks. So far, curated benchmark data exists only for peptide identification algorithms but no data that represents a ground truth for the evaluation of feature detection, alignment and filtering algorithms. Results We present LC-MSsim, a simulation software for LC-ESI-MS experiments. It simulates ESI spectra on the MS level. It reads a list of proteins from a FASTA file and digests the protein mixture using a user-defined enzyme. The software creates an LC-MS data set using a predictor for the retention time of the peptides and a model for peak shapes and elution profiles of the mass spectral peaks. Our software also offers the possibility to add contaminants, to change the background noise level and includes a model for the detectability of peptides in mass spectra. After the simulation, LC-MSsim writes the simulated data to mzData, a public XML format. The software also stores the positions (monoisotopic m/z and retention time) and ion counts of the simulated ions in separate files. Conclusion LC-MSsim generates simulated LC-MS data sets and incorporates models for peak shapes and contaminations. Algorithm developers can match the results of feature detection and alignment algorithms against the simulated ion lists and meaningful error rates can be computed. We anticipate that LC-MSsim will be useful to the wider community to perform benchmark studies and comparisons between computational tools. PMID:18842122
Onder, Devrim; Sarioglu, Sulen; Karacali, Bilge
2013-04-01
Quasi-supervised learning is a statistical learning algorithm that contrasts two datasets by computing an estimate of the posterior probability of each sample in either dataset. This method has not been applied to histopathological images before. The purpose of this study is to evaluate the performance of the method in identifying colorectal tissues with or without adenocarcinoma. Light microscopic digital images from histopathological sections were obtained from 30 colorectal radical surgery materials including adenocarcinoma and non-neoplastic regions. The texture features were extracted by using local histograms and co-occurrence matrices. The quasi-supervised learning algorithm operates on two datasets, one containing samples of normal tissues labelled only indirectly, and the other containing an unlabeled collection of samples of both normal and cancer tissues. As such, the algorithm eliminates the need for manually labelled samples of normal and cancer tissues for conventional supervised learning and significantly reduces the expert intervention. Several texture feature vector datasets corresponding to different extraction parameters were tested within the proposed framework. The Independent Component Analysis dimensionality reduction approach was also identified as the one that improved the labelling performance evaluated in this series. In this series, the proposed method was applied to a dataset of 22,080 vectors with dimensionality reduced from 132 to 119. Regions containing cancer tissue could be identified accurately, with false and true positive rates up to 19% and 88% respectively, without using manually labelled ground-truth datasets in a quasi-supervised strategy. The resulting labelling performances were compared to those of a conventional powerful supervised classifier using manually labelled ground-truth data. The supervised classifier results were calculated as 3.5% and 95% for the same case. The results in this series, in comparison with the benchmark classifier, suggest that quasi-supervised image texture labelling may be a useful method in the analysis and classification of pathological slides, but further study is required to improve the results. Copyright © 2013 Elsevier Ltd. All rights reserved.
AVNM: A Voting based Novel Mathematical Rule for Image Classification.
Vidyarthi, Ankit; Mittal, Namita
2016-12-01
In machine learning, the accuracy of a system depends upon its classification results. Classification accuracy plays an imperative role in various domains. A non-parametric classifier like K-Nearest Neighbor (KNN) is the most widely used classifier for pattern analysis. Besides its ease of use, simplicity and effectiveness, the main problem associated with the KNN classifier is the selection of the number of nearest neighbors, i.e. "k", used in the computation. At present, it is hard to find, using any statistical algorithm, the optimal value of "k" that gives perfect accuracy in terms of a low misclassification error rate. Motivated by this problem, a new sample-space-reduction weighted voting mathematical rule (AVNM) is proposed for classification in machine learning. The proposed AVNM rule is also non-parametric in nature, like KNN. AVNM uses a weighted voting mechanism with sample space reduction to learn and examine the predicted class label for an unidentified sample. AVNM is free from any initial selection of a predefined variable and from the neighbor selection required in the KNN algorithm. The proposed classifier also reduces the effect of outliers. To verify the performance of the proposed AVNM classifier, experiments are performed on 10 standard datasets taken from the UCI database and one manually created dataset. The experimental results show that the proposed AVNM rule outperforms the KNN classifier and its variants. Experimental results based on the confusion-matrix accuracy parameter show a higher accuracy value with the AVNM rule. The proposed AVNM rule is based on a sample-space-reduction mechanism for identifying an optimal number of nearest neighbors. AVNM results in better classification accuracy and a lower error rate compared with the state-of-the-art algorithm, KNN, and its variants. The proposed rule automates nearest-neighbor selection and improves the classification rate for the UCI datasets and the manually created dataset. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
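The AVNM rule itself is not reproduced here; the sketch below only illustrates the generic distance-weighted voting mechanism that KNN variants of this kind build on, with all names and parameters chosen for the example.

```python
import numpy as np
from collections import defaultdict

def weighted_vote_predict(X_train, y_train, x, k=5, eps=1e-9):
    """Sketch of generic distance-weighted voting (a KNN variant, not the
    AVNM rule itself): nearer training samples get larger voting weight."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] + eps)   # inverse-distance weight
    return max(votes, key=votes.get)

# Illustrative use on two synthetic clusters
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(weighted_vote_predict(X, y, np.array([2.8, 3.1])))   # expected: 1
```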
NASA Astrophysics Data System (ADS)
Oriani, Fabio
2017-04-01
The unpredictable nature of rainfall makes its estimation as difficult as it is essential to hydrological applications. Stochastic simulation is often considered a convenient approach to assess the uncertainty of rainfall processes, but preserving their irregular behavior and variability at multiple scales is a challenge even for the most advanced techniques. In this presentation, an overview of the Direct Sampling technique [1] and its recent application to rainfall and hydrological data simulation [2, 3] is given. The algorithm, having its roots in multiple-point statistics, makes use of a training data set to simulate the outcome of a process without inferring any explicit probability measure: the data are simulated in time or space by sampling the training data set where a sufficiently similar group of neighbor data exists. This approach allows preserving complex statistical dependencies at different scales with a good approximation, while reducing the parameterization to the minimum. The strengths and weaknesses of the Direct Sampling approach are shown through a series of applications to rainfall and hydrological data: from time-series simulation to spatial rainfall fields conditioned by elevation or a climate scenario. In the era of vast databases, is this data-driven approach a valid alternative to parametric simulation techniques? [1] Mariethoz G., Renard P., and Straubhaar J. (2010), The Direct Sampling method to perform multiple-point geostatistical simulations, Water Resour. Res., 46(11), http://dx.doi.org/10.1029/2008WR007621 [2] Oriani F., Straubhaar J., Renard P., and Mariethoz G. (2014), Simulation of rainfall time series from different climatic regions using the direct sampling technique, Hydrol. Earth Syst. Sci., 18, 3015-3031, http://dx.doi.org/10.5194/hess-18-3015-2014 [3] Oriani F., Borghi A., Straubhaar J., Mariethoz G., Renard P. (2016), Missing data simulation inside flow rate time-series using multiple-point statistics, Environ. Model. Softw., vol. 86, pp. 264-276, http://dx.doi.org/10.1016/j.envsoft.2016.10.002
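A toy, hedged illustration of the Direct Sampling idea for a one-dimensional series, not the reference implementation cited in [1]: each new value is obtained by scanning the training series for a location whose preceding pattern is sufficiently similar to the most recently simulated values and copying the datum that follows it. Window length, distance threshold and scan fraction are illustrative parameters.

```python
import numpy as np

def direct_sampling_series(train, length, n=4, threshold=0.1, frac=0.5, seed=0):
    """Toy sketch of the Direct Sampling idea for a 1-D series: grow the
    simulation by copying, from the training data, the value that follows
    the first sufficiently similar pattern of the n most recent values."""
    rng = np.random.default_rng(seed)
    sim = list(train[:n])                       # seed with a training snippet
    positions = np.arange(n, len(train))
    for _ in range(length - n):
        pattern = np.array(sim[-n:])
        best_pos, best_dist = None, np.inf
        # scan a random fraction of candidate positions in the training series
        for p in rng.permutation(positions)[: int(frac * len(positions))]:
            d = np.mean(np.abs(train[p - n:p] - pattern))
            if d < best_dist:
                best_pos, best_dist = p, d
            if d <= threshold:                  # early acceptance
                break
        sim.append(train[best_pos])
    return np.array(sim)

# Illustrative use on a noisy seasonal training series
t = np.arange(2000)
train = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.default_rng(5).standard_normal(2000)
print(direct_sampling_series(train, length=200)[:10])
```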
Recursive algorithms for phylogenetic tree counting.
Gavryushkina, Alexandra; Welch, David; Drummond, Alexei J
2013-10-28
In Bayesian phylogenetic inference we are interested in distributions over a space of trees. The number of trees in a tree space is an important characteristic of the space and is useful for specifying prior distributions. When all samples come from the same time point and no prior information is available on divergence times, the tree counting problem is easy. However, when fossil evidence is used in the inference to constrain the tree or data are sampled serially, new tree spaces arise and counting the number of trees is more difficult. We describe an algorithm that is polynomial in the number of sampled individuals for counting resolutions of a constraint tree, assuming that the number of constraints is fixed. We generalise this algorithm to counting resolutions of a fully ranked constraint tree. We describe a quadratic algorithm for counting the number of possible fully ranked trees on n sampled individuals. We introduce a new type of tree, called a fully ranked tree with sampled ancestors, and describe a cubic time algorithm for counting the number of such trees on n sampled individuals. These algorithms should be employed for Bayesian Markov chain Monte Carlo inference when fossil data are included or data are serially sampled.
A study on the application of topic models to motif finding algorithms.
Basha Gutierrez, Josep; Nakai, Kenta
2016-12-22
Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. We present two methods that make use of topic models for motif finding. First, we developed an algorithm in which a set of biological sequences is treated as a collection of text documents, and the k-mers contained in them as words, in order to build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm, which is based on a genetic algorithm and several statistical coefficients. The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients, both at the nucleotide and at the site level. The results of our first approach showed a performance comparable to the other methods studied, especially at the site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at the nucleotide and at the site level, and in overall performance at the site level. The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences.
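A small sketch of the "sequences as documents, k-mers as words" setup described above. scikit-learn provides latent Dirichlet allocation rather than the correlated topic model used in the paper, so LDA stands in here, and the sequences, planted motif and parameters are synthetic assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def kmerize(seq, k=6):
    # turn one sequence into a "document" whose words are overlapping k-mers
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# Synthetic sequences with one planted motif (illustrative data only)
rng = np.random.default_rng(6)
motif = "TTGACA"
seqs = ["".join(rng.choice(list("ACGT"), 80)) + motif +
        "".join(rng.choice(list("ACGT"), 80)) for _ in range(50)]

docs = [kmerize(s) for s in seqs]
vec = CountVectorizer(lowercase=False)
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Inspect the top k-mers per topic; the planted motif tends to rank highly
vocab = np.array(vec.get_feature_names_out())
for t, comp in enumerate(lda.components_):
    print("topic", t, vocab[np.argsort(comp)[::-1][:3]])
```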
NASA Astrophysics Data System (ADS)
Mayvan, Ali D.; Aghaeinia, Hassan; Kazemi, Mohammad
2017-12-01
This paper focuses on robust transceiver design for throughput enhancement on the interference channel (IC) under imperfect channel state information (CSI). Two algorithms are proposed to improve the throughput of the multi-input multi-output (MIMO) IC. Each transmitter and receiver has, respectively, M and N antennas, and the IC operates in a time division duplex mode. In the first proposed algorithm, each transceiver adjusts its filter to maximize the expected value of the signal-to-interference-plus-noise ratio (SINR). The second algorithm, on the other hand, tries to minimize the variances of the SINRs to hedge against the variability due to CSI error. A Taylor expansion is exploited to approximate the effect of CSI imperfection on the mean and variance. The proposed robust algorithms utilize the reciprocity of wireless networks to optimize the estimated statistical properties in two different working modes. Monte Carlo simulations are employed to investigate the sum rate performance of the proposed algorithms and the advantage of incorporating variance minimization into the transceiver design.
Ramadas, Gisela C V; Rocha, Ana Maria A C; Fernandes, Edite M G P
2015-01-01
This paper addresses the challenging task of computing multiple roots of a system of nonlinear equations. A repulsion algorithm that invokes the Nelder-Mead (N-M) local search method and uses a penalty-type merit function based on the error function, known as 'erf', is presented. In the N-M algorithm context, different strategies are proposed to enhance the quality of the solutions and improve the overall efficiency. The main goal of this paper is to use a two-level factorial design of experiments to analyze the statistical significance of the observed differences in selected performance criteria produced when testing different strategies in the N-M based repulsion algorithm.
Algorithms for constructing optimal paths and statistical analysis of passenger traffic
NASA Astrophysics Data System (ADS)
Trofimov, S. P.; Druzhinina, N. G.; Trofimova, O. G.
2018-01-01
Several existing information systems for urban passenger transport (UPT) are considered. The authors' UPT network model is presented. A new service is offered to passengers: the best path from one stop to another at a specified time. The algorithm and its software implementation for finding the optimal path are presented; the algorithm uses the current UPT schedule. The article also describes an algorithm for the statistical analysis of trip payments made with electronic E-cards. The algorithm yields the density of passenger traffic during the day. This density is independent of the network topology and the UPT schedules. The resulting traffic-flow density can be used to solve a number of practical problems, in particular forecasting the overcrowding of passenger transport during "rush" hours, quantitatively comparing different transport network topologies, and constructing the best UPT timetable. The efficiency of the proposed integrated approach is demonstrated on the example of a model town of arbitrary dimensions.
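As a minimal illustration of optimal path construction on a stop network (the paper's algorithm additionally exploits the current UPT timetable, which is omitted here), the sketch below runs Dijkstra's algorithm on a toy stop graph with illustrative travel times.

```python
import heapq

def shortest_path(graph, source, target):
    """Minimal Dijkstra sketch over a stop graph with travel times in minutes.
    graph maps each stop to a list of (neighbor, minutes) edges."""
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], target
    while node != source:
        path.append(node)
        node = prev[node]
    return [source] + path[::-1], dist[target]

# Illustrative stop graph
graph = {"A": [("B", 4), ("C", 2)], "C": [("B", 1), ("D", 7)],
         "B": [("D", 3)], "D": []}
print(shortest_path(graph, "A", "D"))   # (['A', 'C', 'B', 'D'], 6)
```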
Equilibrium Sampling in Biomolecular Simulation
2015-01-01
Equilibrium sampling of biomolecules remains an unmet challenge after more than 30 years of atomistic simulation. Efforts to enhance sampling capability, which are reviewed here, range from the development of new algorithms to parallelization to novel uses of hardware. Special focus is placed on classifying algorithms — most of which are underpinned by a few key ideas — in order to understand their fundamental strengths and limitations. Although algorithms have proliferated, progress resulting from novel hardware use appears to be more clear-cut than from algorithms alone, partly due to the lack of widely used sampling measures. PMID:21370970
2013-06-01
collection are the facts that devices lack encryption or compression methods and that the log file must be saved on the host system prior to transfer...time. Statistical correlation utilizes numerical algorithms to detect deviations from normal event levels and other routine activities (Chuvakin...can also assist in detecting low volume threats. Although easy and logical to implement, the implementation of statistical correlation algorithms
Statistical hadronization and microcanonical ensemble
Becattini, F.; Ferroni, L.
2004-01-01
We present a Monte Carlo calculation of the microcanonical ensemble of the ideal hadron-resonance gas including all known states up to a mass of 1.8 GeV, taking into account quantum statistics. The computing method is a development of a previous one based on a Metropolis Monte Carlo algorithm, with the grand-canonical limit of the multi-species multiplicity distribution as the proposal matrix. The microcanonical average multiplicities of the various hadron species are found to converge to the canonical ones for moderately low values of the total energy. This algorithm opens the way for event generators based on the statistical hadronization model.
Vomweg, T W; Buscema, M; Kauczor, H U; Teifke, A; Intraligi, M; Terzi, S; Heussel, C P; Achenbach, T; Rieker, O; Mayer, D; Thelen, M
2003-09-01
The aim of this study was to evaluate the capability of improved artificial neural networks (ANN) and additional novel training methods in distinguishing between benign and malignant breast lesions in contrast-enhanced magnetic resonance-mammography (MRM). A total of 604 histologically proven cases of contrast-enhanced lesions of the female breast at MRI were analyzed. Morphological, dynamic and clinical parameters were collected and stored in a database. The data set was divided into several groups using random or experimental methods [Training & Testing (T&T) algorithm] to train and test different ANNs. An additional novel computer program for input variable selection was applied. Sensitivity and specificity were calculated and compared with a statistical method and an expert radiologist. After optimization of the distribution of cases among the training and testing sets by the T & T algorithm and the reduction of input variables by the Input Selection procedure a highly sophisticated ANN achieved a sensitivity of 93.6% and a specificity of 91.9% in predicting malignancy of lesions within an independent prediction sample set. The best statistical method reached a sensitivity of 90.5% and a specificity of 68.9%. An expert radiologist performed better than the statistical method but worse than the ANN (sensitivity 92.1%, specificity 85.6%). Features extracted out of dynamic contrast-enhanced MRM and additional clinical data can be successfully analyzed by advanced ANNs. The quality of the resulting network strongly depends on the training methods, which are improved by the use of novel training tools. The best results of an improved ANN outperform expert radiologists.
Kuntzelman, Karl; Jack Rhodes, L; Harrington, Lillian N; Miskovic, Vladimir
2018-06-01
There is a broad family of statistical methods for capturing time series regularity, with increasingly widespread adoption by the neuroscientific community. A common feature of these methods is that they permit investigators to quantify the entropy of brain signals - an index of unpredictability/complexity. Despite the proliferation of algorithms for computing entropy from neural time series data there is scant evidence concerning their relative stability and efficiency. Here we evaluated several different algorithmic implementations (sample, fuzzy, dispersion and permutation) of multiscale entropy in terms of their stability across sessions, internal consistency and computational speed, accuracy and precision using a combination of electroencephalogram (EEG) and synthetic 1/ƒ noise signals. Overall, we report fair to excellent internal consistency and longitudinal stability over a one-week period for the majority of entropy estimates, with several caveats. Computational timing estimates suggest distinct advantages for dispersion and permutation entropy over other entropy estimates. Considered alongside the psychometric evidence, we suggest several ways in which researchers can maximize computational resources (without sacrificing reliability), especially when working with high-density M/EEG data or multivoxel BOLD time series signals. Copyright © 2018 Elsevier Inc. All rights reserved.
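For concreteness, here is a plain, unoptimized sketch of one of the estimators compared above, sample entropy; the embedding dimension m and tolerance r are common illustrative defaults, and the implementation is not the one evaluated in the paper.

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """Plain sample entropy sketch: SampEn = -ln(A/B), where B counts template
    matches of length m and A of length m+1, self-matches excluded; r is a
    tolerance expressed as a fraction of the signal standard deviation."""
    x = np.asarray(x, dtype=float)
    tol = r * np.std(x)

    def count_matches(mm):
        templates = np.array([x[i:i + mm] for i in range(len(x) - mm)])
        count = 0
        for i in range(len(templates)):
            # Chebyshev distance from template i to all later templates
            d = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += np.sum(d <= tol)
        return count

    B, A = count_matches(m), count_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

# Illustrative check: white noise is less predictable (higher SampEn)
# than a regular sine wave of the same length.
rng = np.random.default_rng(7)
t = np.arange(1000)
print(sample_entropy(np.sin(2 * np.pi * t / 40)),
      sample_entropy(rng.standard_normal(1000)))
```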
Falat, Lukas; Marcek, Dusan; Durisova, Maria
2016-01-01
This paper deals with application of quantitative soft computing prediction models into financial area as reliable and accurate prediction models can be very helpful in management decision-making process. The authors suggest a new hybrid neural network which is a combination of the standard RBF neural network, a genetic algorithm, and a moving average. The moving average is supposed to enhance the outputs of the network using the error part of the original neural network. Authors test the suggested model on high-frequency time series data of USD/CAD and examine the ability to forecast exchange rate values for the horizon of one day. To determine the forecasting efficiency, they perform a comparative statistical out-of-sample analysis of the tested model with autoregressive models and the standard neural network. They also incorporate genetic algorithm as an optimizing technique for adapting parameters of ANN which is then compared with standard backpropagation and backpropagation combined with K-means clustering algorithm. Finally, the authors find out that their suggested hybrid neural network is able to produce more accurate forecasts than the standard models and can be helpful in eliminating the risk of making the bad decision in decision-making process. PMID:26977450
ICPD-a new peak detection algorithm for LC/MS.
Zhang, Jianqiu; Haskins, William
2010-12-01
The identification and quantification of proteins using label-free Liquid Chromatography/Mass Spectrometry (LC/MS) play crucial roles in biological and biomedical research. Increasing evidence has shown that biomarkers are often low abundance proteins. However, LC/MS systems are subject to considerable noise and sample variability, whose statistical characteristics are still elusive, making computational identification of low abundance proteins extremely challenging. As a result, the inability to identify low abundance proteins in a proteomic study is the main bottleneck in protein biomarker discovery. In this paper, we propose a new peak detection method called Information Combining Peak Detection (ICPD) for high resolution LC/MS. In LC/MS, peptides elute during a certain time period and, as a result, peptide isotope patterns are registered in multiple MS scans. The key feature of the new algorithm is that the observed isotope patterns registered in multiple scans are combined together for estimating the likelihood of the peptide existence. An isotope pattern matching score based on the likelihood probability is provided and utilized for peak detection. The performance of the new algorithm is evaluated based on protein standards with 48 known proteins. The evaluation shows better peak detection accuracy for low abundance proteins than other LC/MS peak detection methods.
An evaluation of voice stress analysis techniques in a simulated AWACS environment
NASA Astrophysics Data System (ADS)
Jones, William A., Jr.
1990-08-01
The purpose was to determine if voice analysis algorithms are an effective measure of stress resulting from high workload. Fundamental frequency, frequency jitter, and amplitude shimmer algorithms were employed to measure the effects of stress in crewmember communications data in simulated AWACS mission scenarios. Two independent workload measures were used to identify levels of stress: a predictor model developed by the simulation author based upon scenario generated stimulus events; and the duration of communication for each weapons director, representative of the individual's response to the induced stress. Between eight and eleven speech samples were analyzed for each of the sixteen Air Force officers who participated in the study. Results identified fundamental frequency and frequency jitter as statistically significant vocal indicators of stress, while amplitude shimmer showed no signs of any significant relationship with workload or stress. Consistent with previous research, the frequency algorithm was identified as the most reliable measure. However, the results did not reveal a sensitive discrimination measure between levels of stress, but rather, did distinguish between the presence or absence of stress. The results illustrate a significant relationship between fundamental frequency and the effects of stress and also a significant inverse relationship with jitter, though less dramatic.
Demidov, German; Simakova, Tamara; Vnuchkova, Julia; Bragin, Anton
2016-10-22
Multiplex polymerase chain reaction (PCR) is a common enrichment technique for targeted massive parallel sequencing (MPS) protocols. MPS is widely used in biomedical research and clinical diagnostics as a fast and accurate tool for the detection of short genetic variations. However, identification of larger variations such as structural variants and copy number variations (CNVs) is still a challenge for targeted MPS. Some approaches and tools for structural variant detection have been proposed, but they have limitations and often require datasets of a certain type, size and expected number of amplicons affected by CNVs. In this paper, we describe a novel algorithm for high-resolution germline CNV detection in PCR-enriched targeted sequencing data and present an accompanying tool. We have developed a machine learning algorithm for the detection of large duplications and deletions in targeted sequencing data generated with a PCR-based enrichment step. We have performed verification studies and established the algorithm's sensitivity and specificity. We have compared the developed tool with other available methods applicable to the described data and found that it has higher performance. We showed that our method has high specificity and sensitivity for high-resolution copy number detection in targeted sequencing data, using a large cohort of samples.
NASA Astrophysics Data System (ADS)
Lorenzetti, G.; Foresta, A.; Palleschi, V.; Legnaioli, S.
2009-09-01
The recent development of mobile instrumentation, specifically devoted to in situ analysis and study of museum objects, allows the acquisition of many LIBS spectra in a very short time. However, such a large amount of data calls for new analytical approaches which would guarantee a prompt analysis of the results obtained. In this communication, we will present and discuss the advantages of statistical analytical methods, such as Partial Least Squares (PLS) multiple regression algorithms, versus the classical calibration curve approach. PLS algorithms allow the information on the composition of the objects under study to be obtained in real time; this feature of the method, compared to the traditional off-line analysis of the data, is extremely useful for optimizing the measurement times and the number of points associated with the analysis. In fact, the real-time availability of the compositional information gives the possibility of concentrating the attention on the most 'interesting' parts of the object, without over-sampling the zones which would not provide useful information for the scholars or the conservators. Some examples of the applications of this method will be presented, including the studies recently performed by the researchers of the Applied Laser Spectroscopy Laboratory on museum bronze objects.
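A minimal sketch of a PLS calibration of the kind described above, using scikit-learn's PLSRegression on synthetic spectra; the element names, noise level and data are assumptions for illustration, not the laboratory's actual pipeline.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

# Synthetic calibration data: spectra are rows of X, element concentrations
# are columns of Y (illustrative only).
rng = np.random.default_rng(8)
n_samples, n_channels = 120, 400
concentrations = rng.uniform(0, 1, size=(n_samples, 3))      # e.g. Cu, Sn, Pb
lines = rng.standard_normal((3, n_channels))                 # emission "lines"
spectra = concentrations @ lines + 0.05 * rng.standard_normal((n_samples, n_channels))

X_tr, X_te, y_tr, y_te = train_test_split(spectra, concentrations, random_state=0)
pls = PLSRegression(n_components=3).fit(X_tr, y_tr)
print("R^2 on held-out spectra:", round(pls.score(X_te, y_te), 3))
print("predicted composition of first test spectrum:", pls.predict(X_te[:1]).round(2))
```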
Broadband spectral fitting of blazars using XSPEC
NASA Astrophysics Data System (ADS)
Sahayanathan, Sunder; Sinha, Atreyee; Misra, Ranjeev
2018-03-01
The broadband spectral energy distribution (SED) of blazars is generally interpreted as radiation arising from synchrotron and inverse Compton mechanisms. Traditionally, the underlying source parameters responsible for these emission processes, like particle energy density, magnetic field, etc., are obtained through simple visual reproduction of the observed fluxes. However, this procedure is incapable of providing confidence ranges for the estimated parameters. In this work, we propose an efficient algorithm to perform a statistical fit of the observed broadband spectrum of blazars using different emission models. Moreover, we use the observable quantities as the fit parameters, rather than the direct source parameters which govern the resultant SED. This significantly improves the convergence time and eliminates the uncertainty regarding initial guess parameters. This approach also has an added advantage of identifying the degenerate parameters, which can be removed by including more observable information and/or additional constraints. A computer code developed based on this algorithm is implemented as a user-defined routine in the standard X-ray spectral fitting package, XSPEC. Further, we demonstrate the efficacy of the algorithm by fitting the well sampled SED of blazar 3C 279 during its gamma ray flare in 2014.
Adaptive convergence nonuniformity correction algorithm.
Qian, Weixian; Chen, Qian; Bai, Junqi; Gu, Guohua
2011-01-01
Nowadays, convergence and ghosting artifacts are common problems in scene-based nonuniformity correction (NUC) algorithms. In this study, we introduce the idea of space frequency into scene-based NUC. We then present a convergence speed factor, which can adaptively change the convergence speed with changes in the scene dynamic range. In effect, the role of the convergence speed factor is to decrease the standard deviation of the statistical data. The spatial correlation characteristic of the nonuniformity was summarized from a large amount of experimental statistical data and was used to correct the convergence speed factor, making it more stable. Finally, real and simulated infrared image sequences were used to demonstrate the positive effect of our algorithm.
Wind profiling based on the optical beam intensity statistics in a turbulent atmosphere.
Banakh, Victor A; Marakasov, Dimitrii A
2007-10-01
Reconstruction of the wind profile from the statistics of intensity fluctuations of an optical beam propagating in a turbulent atmosphere is considered. The equations for the spatiotemporal correlation function and the spectrum of weak intensity fluctuations of a Gaussian beam are obtained. The algorithms of wind profile retrieval from the spatiotemporal intensity spectrum are described and the results of end-to-end computer experiments on wind profiling based on the developed algorithms are presented. It is shown that the developed algorithms allow retrieval of the wind profile from the turbulent optical beam intensity fluctuations with acceptable accuracy in many practically feasible laser measurements set up in the atmosphere.
Statistical simplex approach to primary and secondary color correction in thick lens assemblies
NASA Astrophysics Data System (ADS)
Ament, Shelby D. V.; Pfisterer, Richard
2017-11-01
A glass selection optimization algorithm is developed for primary and secondary color correction in thick lens systems. The approach is based on the downhill simplex method and requires manipulation of the surface color equations to obtain a single glass-dependent parameter for each lens element. Linear correlation is used to relate this parameter to all other glass-dependent variables. The algorithm provides a statistical distribution of Abbe numbers for each element in the system. Examples with several lenses, from 2-element to 6-element systems, are presented to verify this approach. The proposed optimization algorithm is capable of finding glass solutions with high color correction without requiring an exhaustive search of the glass catalog.
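The sketch below shows only the downhill simplex (Nelder-Mead) optimizer itself, applied to a stand-in two-element merit function; the paper's actual surface color equations and glass-parameter mapping are not reproduced.

```python
import numpy as np
from scipy.optimize import minimize

def merit(v):
    # Stand-in merit: penalize "primary" and "secondary" color residuals of a
    # toy two-element system with one glass-dependent parameter per element.
    primary = v[0] + v[1] - 1.0
    secondary = 0.5 * v[0] - 0.2 * v[1] + 0.1
    return primary**2 + secondary**2

result = minimize(merit, x0=np.array([0.3, 0.3]), method="Nelder-Mead",
                  options={"xatol": 1e-8, "fatol": 1e-8})
print(result.x, result.fun)
```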
A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.
Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang
2018-01-01
The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns to design efficient algorithms in computer science, which has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers to develop their own efficient GPU implementations with fewer efforts.
Mande, Sharmila S.
2016-01-01
The nature of inter-microbial metabolic interactions defines the stability of microbial communities residing in any ecological niche. Deciphering these interaction patterns is crucial for understanding the mode/mechanism(s) through which an individual microbial community transitions from one state to another (e.g. from a healthy to a diseased state). Statistical correlation techniques have been traditionally employed for mining microbial interaction patterns from taxonomic abundance data corresponding to a given microbial community. In spite of their efficiency, these correlation techniques can capture only 'pair-wise interactions'. Moreover, their emphasis on statistical significance can potentially result in missing out on several interactions that are relevant from a biological standpoint. This study explores the applicability of one of the earliest association rule mining algorithms, i.e. the 'Apriori algorithm', for deriving 'microbial association rules' from the taxonomic profile of a given microbial community. The classical Apriori approach derives association rules by analysing patterns of co-occurrence/co-exclusion between various '(subsets of) features/items' across various samples. Using real-world microbiome data, the efficiency/utility of this rule mining approach in deciphering multiple (biologically meaningful) association patterns between 'subsets/subgroups' of microbes (constituting microbiome samples) is demonstrated. As an example, association rules derived from publicly available gut microbiome datasets indicate an association between a group of microbes (Faecalibacterium, Dorea, and Blautia) that are known to have mutualistic metabolic associations among themselves. Application of the rule mining approach to gut microbiomes (sourced from the Human Microbiome Project) further indicated similar microbial association patterns in gut microbiomes irrespective of the gender of the subjects. A Linux implementation of the Association Rule Mining (ARM) software (customised for deriving 'microbial association rules' from microbiome data) is freely available for download from the following link: http://metagenomics.atc.tcs.com/arm. PMID:27124399
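A compact, hedged sketch of Apriori-style frequent itemset mining on "samples as transactions" input of the kind described above; the taxa, support threshold and implementation are illustrative and do not reproduce the ARM tool.

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Compact Apriori-style frequent itemset miner: each transaction is the
    set of taxa detected in one microbiome sample. Returns frequent itemsets
    with their support (illustrative sketch, not the ARM tool itself)."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n
    # level 1: frequent single taxa
    singles = {x for t in transactions for x in t}
    frequent = {frozenset([x]): support(frozenset([x])) for x in singles
                if support(frozenset([x])) >= min_support}
    result, current = dict(frequent), set(frequent)
    while current:
        # candidate (k+1)-itemsets are unions of frequent k-itemsets
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        current = {c for c in candidates if support(c) >= min_support}
        result.update({c: support(c) for c in current})
    return result

# Illustrative "samples as transactions" input
samples = [{"Faecalibacterium", "Dorea", "Blautia"},
           {"Faecalibacterium", "Blautia"},
           {"Faecalibacterium", "Dorea", "Blautia", "Prevotella"},
           {"Prevotella"}]
for itemset, s in sorted(apriori(samples, 0.5).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 2))
```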